A Study on Efficient Indexing for Table Search in Data Lakes

Ibraheem Taha, Matteo Lissandrini, Alkis Simitsis, Yannis E. Ioannidis
Abstract. Data lakes store large volumes of diverse datasets. One of the core challenges in data lakes is dataset discovery, which involves tasks such as finding related tables, domain discovery, and column clustering. In this paper, we focus on a popular approach for finding related tables in public or private data lakes, namely table search. Given the heterogeneity of the tables in a data lake, recent methods adopt table-representation learning and produce dense vector representations for every row, column, or even cell value. This enables advanced indexing techniques, such as HNSW, LSH, and DiskANN, which implement efficient data structures to speed up the core operation of approximate k-NN (k-ANN) search in such vector spaces. However, while many indexing techniques have been employed so far, their practical value and effectiveness, as governed by the trade-off between accuracy and performance, have not yet been explored. In this paper, we aim to fill this gap. We start with an overview of state-of-the-art techniques for table search in data lakes that are based on vector-search operations. Then, we present an in-depth analysis of the performance of the k-ANN indexes and techniques they adopt. This allows us to map, for the first time, the space of alternative implementations for these techniques when applied to data lakes, revealing the strengths and weaknesses of each option and delineating exciting novel research directions.
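To make the accuracy side of this trade-off concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes that columns have already been embedded as dense vectors and that cosine similarity is the ranking function; the names `exact_knn` and `recall_at_k` are hypothetical. The brute-force search gives the exact top-k, against which the output of any approximate index (HNSW, LSH, DiskANN) could be scored.

```python
import numpy as np

def exact_knn(vectors, query, k):
    """Brute-force baseline: indices of the k vectors most cosine-similar to query."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q  # cosine similarity of every vector to the query
    return np.argsort(-scores, kind="stable")[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate result recovered."""
    exact = {int(i) for i in exact_ids}
    found = {int(i) for i in approx_ids}
    return len(exact & found) / len(exact)

# Toy column embeddings (assumed data, for illustration only).
columns = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])

truth = exact_knn(columns, query, k=2)
# A hypothetical approximate index that returned columns 0 and 1.
approx = [0, 1]
print(truth.tolist(), recall_at_k(approx, truth))
```

The exact search scans every vector, so its cost grows linearly with the size of the lake; an approximate index trades some of this recall for lower query latency.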