A SOM-based document clustering using frequent max substrings for non-segmented texts
Chumwatana, T., Wong, K.W. and Xie, H. (2010) A SOM-based document clustering using frequent max substrings for non-segmented texts. Journal of Intelligent Learning Systems and Applications, 02 (03). pp. 117-125.
This paper proposes a document clustering method for non-segmented texts that uses a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest, called frequent max substrings, which are long and frequent substrings of the non-segmented texts rather than individual words. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM is used to generate the document cluster map from the feature vectors of frequent max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
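The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `frequent_max_substrings` is a simplified brute-force stand-in that keeps substrings occurring at least `min_freq` times and not contained in a longer frequent substring, and the SOM is a basic online version with linearly decaying learning rate and neighbourhood radius. All function names, parameters, thresholds, and the toy strings (Latin text without spaces, standing in for non-segmented text) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def frequent_max_substrings(text, min_freq=2, min_len=2, max_len=8):
    """Simplified stand-in (assumption): frequent substrings that are not
    contained in any longer substring that is also frequent."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    frequent = {s for s, c in counts.items() if c >= min_freq}
    return {s for s in frequent if not any(s != t and s in t for t in frequent)}


def document_vectors(docs, min_freq=2):
    """Preprocessing phase: indexing terms plus occurrence counts per document."""
    terms = sorted(set().union(*(frequent_max_substrings(d, min_freq) for d in docs)))
    # str.count counts non-overlapping occurrences of each indexing term.
    vectors = [[float(d.count(t)) for t in terms] for d in docs]
    return terms, vectors


def best_matching_unit(weights, x):
    """Grid position whose weight vector is closest to x (squared Euclidean)."""
    return min(weights, key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Clustering phase: train a small rectangular SOM on the document vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr, sigma = lr0 * decay, sigma0 * decay + 1e-3
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over documents
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Gaussian neighbourhood around the best matching unit.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Toy usage: hypothetical unsegmented strings, not data from the paper.
docs = ["catdogcatdogcat", "dogcatdogcat", "sunmoonsunmoonsun", "moonsunmoonsun"]
terms, vectors = document_vectors(docs)
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(doc, "->", best_matching_unit(som, vec))
```

Documents that share many frequent substrings produce similar vectors and therefore tend to land on the same or neighbouring map units, which is what makes the resulting cluster map usable for finding related documents.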
Publication Type: Journal Article
Murdoch Affiliation: School of Information Technology