Scalable Indexing and Exploration of Big Time Series Data

Georgios Chatzigeorgakidis
PhD Thesis

Time series are generated and stored at a rapidly increasing rate in many industrial and research applications, including the Web and the Internet of Things, public utilities, finance, astronomy, and biology. A significant portion concerns geolocated time series, i.e., those generated at, or otherwise associated with, specific locations. Although several works have focused on efficient time series similarity search, limited attention has been given to the inherent challenge that geolocated time series introduce for hybrid queries, i.e., queries that involve both spatial proximity and time series similarity. Apart from traditional similarity search, we also consider the problem of detecting locally similar pairs and groups, called bundles, over co-evolving time series. These are pairs or groups of subsequences whose values do not differ by more than a predefined threshold for a number of consecutive timestamps. They may represent valuable, concurrent local patterns and trends that the time series have in common. Time series visualization, and visual analytics in general, is another field that has drawn the attention of the scientific community. However, there is a lack of efficient techniques for the visual exploration and analysis of geolocated time series. Finally, large-scale time series forecasting has attracted a significant amount of interest, due to the highly complex nature of such data.
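To make the definition of a locally similar pair concrete, the following is a minimal brute-force sketch, not the filter-verification technique developed in the thesis. The function name and the parameter names `eps` (value threshold) and `delta` (minimum number of consecutive timestamps) are illustrative assumptions, as is the choice to report maximal intervals over two time-aligned series of equal length.

```python
from typing import List, Sequence, Tuple


def locally_similar_intervals(
    a: Sequence[float], b: Sequence[float], eps: float, delta: int
) -> List[Tuple[int, int]]:
    """Return maximal inclusive index ranges (start, end) in which
    |a[t] - b[t]| <= eps holds for at least `delta` consecutive timestamps.

    Hypothetical helper: a linear scan over two co-evolving series.
    """
    if len(a) != len(b):
        raise ValueError("series must be aligned and of equal length")

    intervals: List[Tuple[int, int]] = []
    run_start = None  # start index of the current run of close values
    for t, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) <= eps:
            if run_start is None:
                run_start = t
        else:
            # The run ended at t - 1; keep it only if it is long enough.
            if run_start is not None and t - run_start >= delta:
                intervals.append((run_start, t - 1))
            run_start = None
    # Close a run that extends to the last timestamp.
    if run_start is not None and len(a) - run_start >= delta:
        intervals.append((run_start, len(a) - 1))
    return intervals


if __name__ == "__main__":
    s1 = [1.0, 1.1, 1.2, 5.0, 2.0, 2.1, 2.0, 2.2]
    s2 = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
    print(locally_similar_intervals(s1, s2, eps=0.25, delta=3))
```

Such a scan costs time linear in the series length per pair, so applying it to every pair of series grows quadratically with the number of series, which is the cost that checking candidates only at selected checkpoints aims to avoid.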

In this thesis, we propose a hybrid index, called BTSR-tree, to process hybrid queries efficiently. Furthermore, we address the problem of hybrid similarity joins over geolocated time series. We introduce both centralized and MapReduce-based algorithms for performing such join operations using spatial-only, time series-only, and hybrid indices. Then, we tackle the problem of pair and bundle discovery over co-evolving time series, via a filter-verification technique that only examines candidate matches at judiciously chosen checkpoints across time. In the same line of work, we consider hybrid queries for retrieving geolocated time series based on filters that combine spatial distance and time series local similarity. To efficiently support such queries, we introduce the SBTSR-tree index, an extension of the BTSR-tree that further optimizes local similarity search. Additionally, we present two approaches that rely on hybrid indices, allowing efficient map-based visual exploration and summarization of geolocated time series data. In particular, we use the BTSR-tree index and we introduce a new variant of the standard iSAX index, called geo-iSAX. We define the structure of the new index and show how both hybrid indices can be directly exploited to produce map-based visualizations of geolocated time series at different levels of granularity. Finally, towards large-scale time series forecasting, we introduce FML-kNN, a novel distributed processing framework for big data that performs probabilistic classification and regression. The framework's core consists of a k-nearest neighbor join algorithm which, contrary to similar approaches, is executed in a single distributed session and scales to very large volumes of data of variable granularity and dimensionality.
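As a point of reference for what a hybrid index has to accelerate, a hybrid range query can be stated as a plain linear scan. This is a sketch under stated assumptions and does not reflect the BTSR-tree itself: the function name, the record layout, the parameters `radius` and `theta`, and the use of Euclidean distance for both the spatial and the time series component are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical record layout: (identifier, (x, y) location, series values).
Record = Tuple[str, Tuple[float, float], Sequence[float]]


def hybrid_range_query(
    dataset: Sequence[Record],
    q_loc: Tuple[float, float],
    q_series: Sequence[float],
    radius: float,
    theta: float,
) -> List[str]:
    """Return identifiers of records that lie within `radius` of q_loc
    and whose series lies within Euclidean distance `theta` of q_series."""
    results: List[str] = []
    for key, loc, series in dataset:
        # Spatial filter first: it is cheaper than the series comparison.
        if math.dist(loc, q_loc) > radius:
            continue
        # Time series similarity filter (series assumed to have equal length).
        if math.dist(series, q_series) <= theta:
            results.append(key)
    return results


if __name__ == "__main__":
    data: List[Record] = [
        ("a", (0.0, 0.0), [1.0, 2.0, 3.0]),
        ("b", (0.5, 0.5), [1.0, 2.0, 4.0]),
        ("c", (10.0, 10.0), [1.0, 2.0, 3.0]),
    ]
    print(hybrid_range_query(data, (0.0, 0.0), [1.0, 2.0, 3.0], 1.0, 0.5))
```

The scan touches every record for every query. An index that organizes the data by location and by time series values at once can discard whole groups of records that fail either filter.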

Throughout this thesis, we experimentally evaluate our work on synthetic and real-world datasets from diverse domains, against baseline and state-of-the-art methods, demonstrating the efficiency and superiority of our approaches.