Efficient Similarity Joins by Adaptive Prefix Filtering


KIPS Transactions on Software and Data Engineering, Vol. 2, No. 4, pp. 267-272, Apr. 2013
10.3745/KTSDE.2013.2.4.267,   PDF Download:

Abstract

As an important operation with many applications such as data cleaning and duplicate detection, the similarity join is a challenging issue, which finds all pairs of records whose similarities are above a given threshold in a dataset. We propose a new algorithm that uses the prefix filtering principle as strong constraints on generation of candidate pairs for fast similarity joins. The candidate pair is generated only when the current prefix token of a probing record shares one prefix token of an indexing record within the constrained prefix tokens by the principle. This generation method needs not to compote an upper bound of the overlap between two records, which results in reduction of execution time. Experimental results show that our algorithm significantly outperforms the previous prefix filtering-based algorithms on real datasets.


Statistics
Show / Hide Statistics

Statistics (Cumulative Counts from September 1st, 2017)
Multiple requests among the same browser session are counted as one view.
If you mouse over a chart, the values of data points will be shown.


Cite this article
[IEEE Style]
J. S. Park, "Efficient Similarity Joins by Adaptive Prefix Filtering," KIPS Transactions on Software and Data Engineering, vol. 2, no. 4, pp. 267-272, 2013. DOI: 10.3745/KTSDE.2013.2.4.267.

[ACM Style]
Jong Soo Park. 2013. Efficient Similarity Joins by Adaptive Prefix Filtering. KIPS Transactions on Software and Data Engineering, 2, 4, (2013), 267-272. DOI: 10.3745/KTSDE.2013.2.4.267.