Parallel k-Modes Algorithm for Spark Framework


KIPS Transactions on Software and Data Engineering, Vol. 6, No. 10, pp. 487-492, Oct. 2017
DOI: 10.3745/KTSDE.2017.6.10.487
Keywords: k-Modes, Categorical Data, Clustering, Spark
Abstract

Clustering is a technique that groups data according to measured similarity and is widely used in big data analysis and data mining. Among the various clustering methods, the k-Modes algorithm is the representative choice for categorical data. To improve the performance of iteration-centric tasks such as k-Modes, Spark, a distributed and concurrent framework, has recently received great attention because it overcomes the limitations of Hadoop. Spark provides an environment that can process large amounts of data in main memory using abstract objects called RDDs. Spark also provides MLlib, a dedicated library for machine learning, but MLlib includes only k-means, which can process only continuous data, so it cannot process categorical data. In this paper, we design RDDs for the k-Modes algorithm for categorical data clustering in the Spark environment and implement an algorithm that operates effectively on them. Experiments show that the performance of the proposed algorithm increases linearly in the Spark environment.
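
To make the two phases of k-Modes concrete, the following is a minimal single-machine sketch in plain Python. It is not the RDD design proposed in the paper. The function names (`hamming`, `assign`, `update_modes`, `k_modes`), the use of simple matching (Hamming) dissimilarity, and the rule of keeping the old mode for an empty cluster are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(record, modes):
    """Assignment step: index of the mode closest to one record."""
    return min(range(len(modes)), key=lambda i: hamming(record, modes[i]))


def update_modes(records, labels, modes):
    """Update step: per cluster, take the most frequent value of each attribute."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    new_modes = []
    for i, old_mode in enumerate(modes):
        members = groups.get(i)
        if not members:
            # Assumption: an empty cluster keeps its previous mode.
            new_modes.append(old_mode)
            continue
        # zip(*members) iterates over attribute columns of the cluster.
        new_modes.append(
            tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))
        )
    return new_modes


def k_modes(records, initial_modes, max_iter=20):
    """Alternate assignment and update until the modes stop changing."""
    modes = list(initial_modes)
    for _ in range(max_iter):
        labels = [assign(r, modes) for r in records]
        new_modes = update_modes(records, labels, modes)
        if new_modes == modes:
            break
        modes = new_modes
    # Recompute labels so they always match the returned modes.
    labels = [assign(r, modes) for r in records]
    return modes, labels
```

Each record is a tuple of categorical values, and `initial_modes` is a list of k such tuples. In a Spark setting, the assignment step is independent per record and so resembles an RDD `map`, while the update step is a per-cluster aggregation that resembles a keyed reduction such as `reduceByKey`. How the paper actually structures its RDDs is not stated in the abstract.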



Cite this article
[IEEE Style]
J. Chung, "Parallel k-Modes Algorithm for Spark Framework," KIPS Transactions on Software and Data Engineering, vol. 6, no. 10, pp. 487-492, 2017. DOI: 10.3745/KTSDE.2017.6.10.487.

[ACM Style]
Jaehwa Chung. 2017. Parallel k-Modes Algorithm for Spark Framework. KIPS Transactions on Software and Data Engineering, 6, 10, (2017), 487-492. DOI: 10.3745/KTSDE.2017.6.10.487.