Randomized embeddings with slack, and high-dimensional approximate nearest neighbor Full text

E. Anagnostopoulos, I.Z. Emiris, and I. Psarros
ACM Trans. on Algorithms, 14:18:1–18:21, 2018
Abstract. Approximate nearest neighbor search (ε-ANN) in high dimensions has been mainly addressed by Locality Sensitive Hashing (LSH), which has polynomial dependence in dimension, sublinear query time, but subquadratic space requirement. We introduce a new “low-quality” embedding for metric spaces requiring that, for some query, there exists an approximate nearest neighbor among the pre-images of its k > 1 approximate nearest neighbors in the target space. In Euclidean spaces, we employ random projections to a dimension inversely proportional to k. Our approach extends to the decision problem with witness of checking whether there exists an approximate near neighbor; this also implies a solution for ε-ANN. After dimension reduction, we store points in a uniform grid of side length ε/ d # , where d # is the reduced dimension. Given a query, we explore cells intersecting the unit ball around the query. This data structure requires linear space and query time in O (dn^ρ), ρ ≈ 1 − ε^2 / log(1/ε), where n denotes input cardinality and d space dimension. Bounds are improved for doubling subsets via r -nets. We present our implementation for ε-ANN in C++ and experiments for d ≤ 960, n ≤ 10^6, using synthetic and real datasets, which confirm the theoretical analysis and, typically, yield better practical performance. We compare to FALCONN, the state-of-the-art implementation of multi-probe Locality Sensitive Hashing (LSH): our prototype software is essentially comparable in terms of preprocessing, query time, and storage usage.