A Study on Spam Document Classification Method using Characteristics of Keyword Repetition


The KIPS Transactions:PartB, Vol. 18, No. 5, pp. 315-324, Oct. 2011
DOI: 10.3745/KIPSTB.2011.18.5.315

Abstract

In the Web environment, the flood of spam causes serious social problems such as personal information leaks, monetary loss from phishing, and the distribution of harmful content. Moreover, the types of spam and the distribution techniques that must be controlled change day by day. Learning-based spam classification using the Bag-of-Words model has been the most widely used approach to date. However, this approach is vulnerable to the anti-spam evasion techniques that recent spam commonly employs, because it classifies spam documents using only the keyword occurrence information obtained while training the classification model. In this paper, we propose a spam document detection method that counters such evasion techniques by exploiting the word repetition characteristic of spam documents. Most recent spam documents tend to repeat the key phrases that they are designed to spread, and this tendency can be used as a measure for classifying spam documents. We define six variables that represent characteristics of word repetition and use them as the feature set for constructing a classification model. The effectiveness of the proposed method is evaluated in an experiment with blog posts and e-mail data. The experimental results show that the proposed method outperforms other approaches.
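
The abstract does not list the six repetition variables, so the following is only a minimal sketch of what repetition-based features might look like. The function name `repetition_features` and the three statistics it returns are hypothetical illustrations, not the variables defined in the paper.

```python
import re
from collections import Counter


def repetition_features(text: str, n: int = 2) -> dict:
    """Compute illustrative repetition statistics for one document.

    Hypothetical features for illustration only; they are NOT the six
    variables defined in the paper.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {
            "max_word_ratio": 0.0,
            "repeated_word_ratio": 0.0,
            "repeated_ngram_ratio": 0.0,
        }

    word_counts = Counter(tokens)
    # Share of all tokens taken up by the single most frequent word.
    max_word_ratio = max(word_counts.values()) / len(tokens)
    # Share of distinct words that occur more than once.
    repeated_word_ratio = (
        sum(1 for c in word_counts.values() if c > 1) / len(word_counts)
    )

    # Word n-grams (bigrams by default) approximate repeated key phrases.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ngram_counts = Counter(ngrams)
    # Share of n-gram occurrences belonging to an n-gram seen more than once.
    repeated_ngram_ratio = (
        sum(c for c in ngram_counts.values() if c > 1) / len(ngrams)
        if ngrams
        else 0.0
    )

    return {
        "max_word_ratio": max_word_ratio,
        "repeated_word_ratio": repeated_word_ratio,
        "repeated_ngram_ratio": repeated_ngram_ratio,
    }


if __name__ == "__main__":
    spam = "cheap pills cheap pills cheap pills buy now"
    ham = "the meeting is moved to friday afternoon"
    print(repetition_features(spam))
    print(repetition_features(ham))
```

A feature vector of this kind could be passed to any standard classifier in place of, or alongside, Bag-of-Words counts. Because it measures how often terms repeat rather than which terms appear, it does not depend on a fixed keyword vocabulary.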



Cite this article
[IEEE Style]
S. J. Lee, J. B. Baik, C. S. Han, S. W. Lee, "A Study on Spam Document Classification Method using Characteristics of Keyword Repetition," The KIPS Transactions:PartB, vol. 18, no. 5, pp. 315-324, 2011. DOI: 10.3745/KIPSTB.2011.18.5.315.

[ACM Style]
Seong Jin Lee, Jong Bum Baik, Chung Seok Han, and Soo Won Lee. 2011. A Study on Spam Document Classification Method using Characteristics of Keyword Repetition. The KIPS Transactions:PartB, 18, 5, (2011), 315-324. DOI: 10.3745/KIPSTB.2011.18.5.315.