Scalable Hybrid Similarity Join over Geolocated Time Series

Giorgos Chatzigeorgakidis, Kostas Patroumpas, Dimitrios Skoutas, Spiros Athanasiou, and Spiros Skiadopoulos
26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2018)
Abstract. A geolocated time series is a sequence of values associated with a geolocation, such as measurements provided by a sensor installed at a certain location. In this paper, we address the problem of hybrid similarity joins over such geolocated time series. This operation returns all pairs of geolocated time series that exhibit similar behavior in the time series domain while also being closely located in space. First, we propose algorithms for performing such join operations using different types of indices, including spatial-only, time series-only, and hybrid indices. Such centralized indexing schemes can cope well with moderate data volumes, but they face scalability issues when the dataset size increases significantly. To overcome this problem, we present a MapReduce-based processing scheme with space-driven partitioning. Our parallel and distributed algorithm leverages our hybrid index for geolocated time series to efficiently execute similarity joins locally within each partition and minimize the amount of data that needs to be shuffled between processing nodes. An extensive experimental evaluation confirms that our approach can efficiently compute all matching pairs even for datasets containing millions of geolocated time series.
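
To make the join condition concrete, the following is a minimal sketch of a naive, index-free hybrid similarity join. It is not the paper's algorithm: it compares every pair, which is exactly the cost the indices and the partitioning scheme are meant to avoid. The function name `hybrid_similarity_join`, the thresholds `eps_space` and `eps_ts`, and the use of Euclidean distance in both domains are assumptions made for illustration; the abstract does not state which distance measures are used.

```python
from itertools import combinations
from math import dist


def hybrid_similarity_join(items, eps_space, eps_ts):
    """Return index pairs (i, j), i < j, that satisfy both thresholds.

    items: list of (location, series) tuples, where location is an (x, y)
    tuple and all series have the same length.
    Euclidean distance in both domains is an assumption of this sketch.
    """
    result = []
    for (i, (loc_a, ts_a)), (j, (loc_b, ts_b)) in combinations(enumerate(items), 2):
        # Check the spatial threshold first: it is the cheaper test.
        if dist(loc_a, loc_b) > eps_space:
            continue
        # Then check similarity in the time series domain.
        if dist(ts_a, ts_b) <= eps_ts:
            result.append((i, j))
    return result


# Hypothetical sensors: (location, measurements)
sensors = [
    ((0.0, 0.0), [1.0, 2.0, 3.0]),
    ((0.5, 0.0), [1.1, 2.0, 2.9]),    # near sensor 0, similar values
    ((10.0, 10.0), [1.0, 2.0, 3.0]),  # similar values, but far away
    ((0.0, 0.5), [9.0, 9.0, 9.0]),    # near sensor 0, dissimilar values
]

pairs = hybrid_similarity_join(sensors, eps_space=1.0, eps_ts=0.5)
print(pairs)
```

Only the pair that passes both checks is reported; a pair that is close in space but dissimilar in values, or similar in values but far apart, is discarded. The quadratic number of comparisons in this baseline illustrates why pruning with indices becomes necessary as the dataset grows.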