Document Type
Conference Proceeding
Publication Date
10-2023
Publication Title
The 16th International Conference on Similarity Search and Applications (SISAP), A Coruña, Spain
Pages
238-252
Publisher Name
ACM
Abstract
The Similarity Join (SJ) has become one of the most popular and valuable data processing operators for analyzing large amounts of data. Various types of similarity join operators have been effectively used in multiple scenarios. However, these operators usually generate a large output with many similar pairs that represent almost the same information. In previous work, a new operator called Diversity Similarity Join (DSJ) was proposed to address these issues. DSJ generates a smaller output with more meaningful and diverse result pairs. This operator, however, was proposed as a single-node operator, which crucially limits its scalability. In this paper, we propose the Distributed Diversity Similarity Join (D2SJ) operator, an approach that enables SJ diversification on big datasets. We present the design guidelines and implementation details on Apache Spark, a popular big data processing framework. Our experimental results with real-world high-dimensional data show that the proposed operator has excellent performance and scalability properties.
Recommended Citation
Silva, Yasin N.; Martinez, Juan; Castro Cea, Pedro; Razente, Humberto; and Barioni, Maria Camila N. Diversity Similarity Join for Big Data. The 16th International Conference on Similarity Search and Applications (SISAP), A Coruña, Spain: 238-252, 2023. Retrieved from Loyola eCommons, Computer Science: Faculty Publications and Other Works, http://dx.doi.org/10.1007/978-3-031-46994-7_20
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
Copyright Statement
© The Authors, 2023.
Comments
Author Posting © The Authors, 2023. This is the author's version of the work. It is posted here by permission of the ACM for personal use, not for redistribution. The definitive version was published as part of The 16th International Conference on Similarity Search and Applications (SISAP) Proceedings in A Coruña, Spain, Pages 238-252. https://doi.org/10.1007/978-3-031-46994-7_20