An integrated approach for content extraction, word segmentation and information presentation from Thai websites
Thanadechteemapat, Wigrai (2012) An integrated approach for content extraction, word segmentation and information presentation from Thai websites. PhD thesis, Murdoch University.
This thesis presents an integrated approach for presenting an overview of key content from Thai websites. The approach is intended to address information overload by giving users an overview from which they can assess whether the information meets their needs. The study proposes rule-based techniques for Web content extraction that are capable of extracting key content from single and multiple webpages. As there are currently no criteria for assessing the performance of content extraction from Thai websites, the study also proposes evaluation criteria based on the length of the extracted content. Experimental results demonstrate high accuracy with efficient performance. The study further proposes a Thai word segmentation approach, based on the longest matching technique and the use of a corpus, to segment Thai words in the extracted key content. The results of the proposed technique were compared with those of techniques submitted to the Benchmark for Enhancing the Standard for Thai Language Processing (BEST) contest in Thailand. With an accuracy of between 95 and 97 percent, its performance is consistently better than most of the results from the contest participants. To select the segmented words for a tag cloud that presents the overview, statistical techniques for keyword identification from the key content of single and multiple webpages have been developed; these techniques are based on normalisation of the Term Frequency of the keywords. The identified keywords were compared with the key content and tags provided by the websites, and the accuracy of the results was higher than that obtained from the Term Frequency-Inverse Document Frequency (TF-IDF) and Term Length Term Frequency (TLTF) techniques. The proposed techniques were evaluated using Precision, Recall and F-measure.
A Variable Tag Cloud approach has also been developed to give users a flexible overview with a user-determined number of keywords in the tag cloud. The approach is novel, and it is believed that the findings of this research will benefit the Thai community and encourage more efficient access to information from Thai websites.
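The longest matching technique mentioned in the abstract can be illustrated with a minimal sketch. This is a generic greedy dictionary-based segmenter, not the thesis's own implementation: the function name `segment_longest_match`, the tiny lexicon, and the single-character fallback for unknown text are all assumptions made for illustration, and the corpus-based component described in the abstract is not modelled here.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right longest matching against a word list.

    Hypothetical helper for illustration only; `lexicon` is any set of
    known words. Text that matches no entry is emitted one character
    at a time (an assumed fallback, not taken from the thesis).
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break
        else:
            # No lexicon entry starts at this position.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


if __name__ == "__main__":
    # Toy lexicon (assumed): "I", "eat", "rice" in Thai.
    words = ["ฉัน", "กิน", "ข้าว"]
    print(segment_longest_match("".join(words), set(words)))
```

Because Thai is written without spaces between words, a segmenter of this kind depends heavily on the coverage of its word list; a greedy match can also choose a long word where two shorter ones were intended, which is one reason additional corpus information is useful.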
Publication Type: Thesis (PhD)
Murdoch Affiliation: School of Information Technology
Supervisor: Fung, Lance and Wong, Kevin