Feature Filtering Methods for Web Documents Clustering


The KIPS Transactions:PartB, Vol. 13, No. 4, pp. 489-498, Aug. 2006
10.3745/KIPSTB.2006.13.4.489

Abstract

Clustering results vary across datasets, and performance degrades even on web documents that have been manually processed by an indexer. Statistical feature selection methods can obtain representative clusters for a feature, but they do not eliminate irrelevant features (i.e., non-obvious features and features that appear in documents in general). These irrelevant features should be eliminated to improve clustering performance. This paper therefore proposes three feature-filtering algorithms that consider feature values per document set, together with the distribution, frequency, and weights of features per document set: (1) a feature filtering algorithm in a document (FFID), (2) a feature filtering algorithm in a document matrix (FFIM), and (3) a hybrid method combining FFID and FFIM (HFF). We tested clustering performance with feature selection using term frequency and expanded co-link information, and with feature filtering using FFID, FFIM, and HFF. In our experiments, HFF performed best, and FFIM performed better than FFID.
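The abstract does not specify how FFID, FFIM, or HFF compute their filters, so the following is only a minimal sketch of the general idea of filtering at the document level, at the document-matrix level, and in combination. It is not the authors' algorithm. The function names (filter_in_document, filter_in_matrix, hybrid_filter), the thresholds (min_tf, max_df_ratio), and the sample documents are all hypothetical, and plain term frequency and document frequency stand in for whatever feature values and weights the paper actually uses.

```python
from collections import Counter


def filter_in_document(doc_tokens, min_tf=2):
    """Document-level filter (assumed rule): keep a feature only if it
    occurs at least min_tf times within this single document."""
    counts = Counter(doc_tokens)
    return {term: tf for term, tf in counts.items() if tf >= min_tf}


def filter_in_matrix(doc_features, max_df_ratio=0.5):
    """Matrix-level filter (assumed rule): drop a feature that appears in
    more than max_df_ratio of all documents, treating it as too general."""
    n_docs = len(doc_features)
    # Document frequency: number of documents containing each feature.
    df = Counter(term for feats in doc_features for term in feats)
    return [
        {term: tf for term, tf in feats.items() if df[term] / n_docs <= max_df_ratio}
        for feats in doc_features
    ]


def hybrid_filter(docs, min_tf=2, max_df_ratio=0.5):
    """Hybrid (assumed composition): apply the document-level filter first,
    then the matrix-level filter on the surviving features."""
    per_doc = [filter_in_document(tokens, min_tf) for tokens in docs]
    return filter_in_matrix(per_doc, max_df_ratio)


# Hypothetical toy corpus: each document is a list of already-extracted terms.
docs = [
    ["web", "web", "page", "page", "cluster", "cluster", "link"],
    ["web", "web", "page", "page", "feature", "feature"],
    ["web", "web", "index", "index", "term"],
    ["page", "page", "filter", "filter"],
]

result = hybrid_filter(docs)
print(result)
```

In this toy corpus, "link" and "term" are dropped by the document-level step because they occur only once in their documents, and "web" and "page" are dropped by the matrix-level step because each appears in three of the four documents. Whether the paper composes the two filters in this order is not stated in the abstract.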




Cite this article
[IEEE Style]
H. Park and H. C. Kwon, "Feature Filtering Methods for Web Documents Clustering," The KIPS Transactions:PartB, vol. 13, no. 4, pp. 489-498, 2006. DOI: 10.3745/KIPSTB.2006.13.4.489.

[ACM Style]
Heum Park and Hyuk Chul Kwon. 2006. Feature Filtering Methods for Web Documents Clustering. The KIPS Transactions:PartB, 13, 4, (2006), 489-498. DOI: 10.3745/KIPSTB.2006.13.4.489.