Computer Science: Faculty Publications and Other Works

Diversity Similarity Join for Big Data

Document Type

Conference Proceeding

Publication Date

10-2023

Publication Title

The 16th International Conference on Similarity Search and Applications (SISAP), A Coruña, Spain

Pages

238-252

Publisher Name

ACM

Abstract

The Similarity Join (SJ) has become one of the most popular and valuable data processing operators for analyzing large amounts of data. Various types of similarity join operators have been used effectively in multiple scenarios. However, these operators usually generate large outputs containing many similar pairs that represent almost the same information. In previous work, a new operator called Diversity Similarity Join (DSJ) was proposed to address these issues. DSJ generates a smaller output with more meaningful and diverse result pairs. This operator, however, was proposed as a single-node operator, which crucially limits its scalability. In this paper, we propose the Distributed Diversity Similarity Join (D2SJ) operator, an approach that enables SJ diversification on big datasets. We present the design guidelines and implementation details on Apache Spark, a popular big data processing framework. Our experimental results with real-world high-dimensional data show that the proposed operator has excellent performance and scalability properties.

Comments

Author Posting © The Authors, 2023. This is the author's version of the work. It is posted here by permission of the ACM for personal use, not for redistribution. The definitive version was published as part of The 16th International Conference on Similarity Search and Applications (SISAP) Proceedings in A Coruña, Spain, Pages 238-252. https://doi.org/10.1007/978-3-031-46994-7_20

Recommended Citation

Silva, Yasin N.; Martinez, Juan; Castro Cea, Pedro; Razente, Humberto; and Barioni, Maria Camila N. Diversity Similarity Join for Big Data. The 16th International Conference on Similarity Search and Applications (SISAP), A Coruña, Spain: 238-252, 2023. Retrieved from Loyola eCommons, Computer Science: Faculty Publications and Other Works, http://dx.doi.org/10.1007/978-3-031-46994-7_20

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Included in

Computer Sciences Commons

Author Manuscript

This is a pre-publication author manuscript of the final, published article.
