Document Type

Conference Proceeding

Publication Date

10-2023

Publication Title

The 16th International Conference on Similarity Search and Applications (SISAP), A Coruña, Spain

Pages

238-252

Publisher Name

ACM

Abstract

The Similarity Join (SJ) has become one of the most popular and valuable data processing operators for analyzing large amounts of data. Various types of similarity join operators have been used effectively in multiple scenarios. However, these operators usually generate large outputs that contain many similar pairs representing almost the same information. In previous work, a new operator called Diversity Similarity Join (DSJ) was proposed to address these issues. DSJ generates a smaller output with more meaningful and diverse result pairs. This operator, however, was proposed as a single-node operator, which severely limits its scalability. In this paper, we propose the Distributed Diversity Similarity Join (D2SJ) operator, an approach that enables SJ diversification on big datasets. We present the design guidelines and implementation details on Apache Spark, a popular big data processing framework. Our experimental results with real-world high-dimensional data show that the proposed operator has excellent performance and scalability properties.

Comments

Author Posting © The Authors, 2023. This is the author's version of the work. It is posted here by permission of the ACM for personal use, not for redistribution. The definitive version was published as part of The 16th International Conference on Similarity Search and Applications (SISAP) Proceedings in A Coruña, Spain, Pages 238-252. https://doi.org/10.1007/978-3-031-46994-7_20

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
