Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites

Thanadechteemapat, W. and Fung, C.C. (2012) Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites. In: International Conference on Machine Learning and Cybernetics, ICMLC 2012, 15 - 17 July, Xian, Shaanxi pp. 1263-1267.

Link to Published Version: http://dx.doi.org/10.1109/ICMLC.2012.6359546
*Subscription may be required

Abstract

This paper proposes a Web Content Extraction technique that is based on heuristic rules and works with both single and multiple pages. For multiple-page extraction, an Extracted Content Matching (ECM) technique is proposed to identify noise among the extracted results. Several features are also introduced to reduce processing time, such as the use of XPath, file compression, and parallel processing. Performance is assessed by precision, recall and F-measure, computed from the length of the extracted content. Initial results, obtained by comparing the output of the proposed approach with manually extracted content, are good.
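The abstract does not give the exact formulas behind the length-based evaluation, so the following is only a minimal sketch of one plausible reading: precision and recall are taken as the length of the text shared by the extracted content and the manually extracted reference, divided by the length of each respectively. The function name `length_based_scores` and the use of `difflib.SequenceMatcher` to measure the shared length are assumptions made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher
from typing import Tuple


def length_based_scores(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) based on character lengths.

    Assumption: the "overlap" is the total length of the matching blocks
    between the two strings; the paper may define it differently.
    """
    matcher = SequenceMatcher(None, extracted, reference, autojunk=False)
    # Sum the sizes of all matching blocks (the final sentinel block has size 0).
    overlap = sum(block.size for block in matcher.get_matching_blocks())

    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0

    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical example: the extractor kept some navigation text as noise.
    manual = "Thai markets rise"
    automatic = "Home | News: Thai markets rise"
    print(length_based_scores(automatic, manual))
```

In this hypothetical example the whole reference text is recovered, so recall is 1.0, while the leftover navigation text lowers precision. That is the kind of noise the ECM step is described as targeting.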

Publication Type: Conference Paper
Murdoch Affiliation: School of Information Technology
Publisher: IEEE
Copyright: © 2012 IEEE.
URI: http://researchrepository.murdoch.edu.au/id/eprint/12583