
A SOM-based document clustering using frequent max substrings for non-segmented texts

Chumwatana, T., Wong, K.W. and Xie, H. (2010) A SOM-based document clustering using frequent max substrings for non-segmented texts. Journal of Intelligent Learning Systems and Applications, 02 (03). pp. 117-125.

Free to read: http://dx.doi.org/10.4236/jilsa.2010.23015
*No subscription required

Abstract

This paper proposes a clustering method for non-segmented documents that combines the self-organizing map (SOM) with the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to the non-segmented texts to discover the patterns of interest, called Frequent Max substrings, which are long, frequently occurring substrings rather than individual words. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM is used to generate the document cluster map from the feature vectors of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
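The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a "frequent max substring" is a substring occurring at least `min_freq` times in a document that is not contained in any longer frequent substring; the paper's exact definition and its efficient enumeration structure may differ. All function names, parameters, and the toy documents (unspaced ASCII strings standing in for non-segmented text) are hypothetical.

```python
import math
import random
from collections import Counter


def count_overlapping(text, term):
    """Count occurrences of term in text, allowing overlaps."""
    return sum(text.startswith(term, i) for i in range(len(text) - len(term) + 1))


def frequent_max_substrings(text, min_freq=2, max_len=20):
    """Naive enumeration (assumed definition): substrings that occur at least
    min_freq times and are not contained in a longer frequent substring."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    frequent = {s for s, c in counts.items() if c >= min_freq}
    return {s for s in frequent if not any(s != t and s in t for t in frequent)}


def build_vectors(docs, min_freq=2):
    """Preprocessing phase: index terms plus one occurrence-count vector per document."""
    terms = sorted(set().union(*(frequent_max_substrings(d, min_freq) for d in docs)))
    vectors = [[count_overlapping(d, t) for t in terms] for d in docs]
    return terms, vectors


def best_matching_unit(weights, v):
    """Grid node whose weight vector is closest to v (squared Euclidean distance)."""
    return min(weights, key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], v)))


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, seed=0):
    """Clustering phase: a small SOM with a Gaussian neighbourhood and
    linearly decaying learning rate and radius."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        sigma = max(sigma0 * (1 - frac), 0.5)
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights


# Toy, made-up documents without word boundaries.
docs = [
    "marketpricerisesmarketpricefalls",
    "stockmarketpricemarketpricedrop",
    "footballmatchscorefootballmatch",
    "footballmatchtodayfootballmatchwin",
]
terms, vectors = build_vectors(docs)
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(best_matching_unit(som, vec), doc)
```

Each printed grid coordinate is the map node a document falls on; documents that share many frequent substrings would be expected to land on the same or neighbouring nodes. The naive enumeration is quadratic in document length and is only meant to make the idea concrete.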

Publication Type: Journal Article
Murdoch Affiliation: School of Information Technology
Publisher: Scientific Research
URI: http://researchrepository.murdoch.edu.au/id/eprint/22859