Personalized, Semantic and Exploratory Data Analysis / Εξατομικευμένη, Σημασιολογική και Διερευνητική Ανάλυση Δεδομένων

Nikos Bikakis
National Technical University of Athens, Greece
PhD Thesis

In the Big Data era, systems in many application areas face significant efficiency and effectiveness challenges due to the ever-increasing Volume, Variety and Velocity of data. Such systems must handle vast amounts of data in real time and operate in environments where different users, working on different scenarios, generate, explore and analyse different forms of data. To this end, this thesis studies the development of personalization, exploration and semantic techniques that facilitate Big Data management and analysis. Specifically, we propose methods for: (a) scalable preference-aware data management and analysis; (b) efficient exploration and visualization over large datasets; and (c) semantic data integration, exploration and retrieval.

In the context of personalized data analysis, we study two problems. The first is the problem of finding and ranking the objects that a group of users prefers, based on the users' individual preferences. We propose an objective and fair interpretation of this problem. Based on this interpretation, we develop efficient index-based algorithms and introduce an objective ranking scheme that satisfies several theoretical properties. In the second problem, we thoroughly study the performance of some of the best-known external memory skyline algorithms. In particular, we redesign the considered algorithms following a formal external memory model. We then propose numerous design choices and study the resulting algorithm variants.
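For readers unfamiliar with skyline queries, the following is a minimal in-memory sketch of the underlying dominance test and a simple window-based skyline computation. It is an illustration only, not one of the external memory algorithms studied in the thesis; the names `dominates` and `skyline`, and the convention that smaller values are better in every dimension, are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a dominates b.

    Assumption for this sketch: smaller is better in every dimension.
    a dominates b if it is no worse everywhere and strictly better somewhere.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Compute the skyline with a single pass over an in-memory window."""
    window: List[Point] = []
    for p in points:
        # Skip p if some current candidate already dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise p is a candidate; evict candidates that p dominates.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


if __name__ == "__main__":
    # Hypothetical data: (price, distance) pairs, lower is better for both.
    hotels = [(1, 9), (3, 3), (2, 8), (5, 1), (4, 4), (6, 6)]
    print(skyline(hotels))
```

An external memory algorithm would additionally have to bound the size of the window and manage overflow to disk, which is where the design choices mentioned above come into play.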

Regarding exploratory data analysis, we consider two problems. The first concerns efficient on-the-fly visual exploration over large datasets. For this problem, we propose a multilevel framework that exploits a tree-based structure to hierarchically aggregate objects. Considering different exploration scenarios, we enable efficient exploration via incremental hierarchy construction and prefetching driven by user interaction. Further, we support efficient on-the-fly adaptation of the hierarchies based on user preferences. The second problem concerns the exploration and visualization of very large graphs. We propose a new paradigm that allows efficient visual exploration of large graphs, similar to the exploration paradigm used in maps. We also present a disk-based scheme for indexing and storing the visualized graph. In this setting, user interactions are translated into efficient spatial operations. Finally, in order to visualize very large graphs, we introduce a partition-based visualization approach.
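The idea of hierarchical aggregation with incremental construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework proposed in the thesis: the class `Node`, the helper `build_root`, the equal-count splitting rule, and the `fanout` and `leaf_size` parameters are all hypothetical, and the input is assumed to be a non-empty collection of numeric values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A hierarchy node that summarizes a sorted run of numeric values."""
    values: List[float]                      # sorted values covered by this node
    children: Optional[List["Node"]] = None  # None until the node is expanded

    @property
    def count(self) -> int:
        return len(self.values)

    @property
    def lo(self) -> float:
        return self.values[0]

    @property
    def hi(self) -> float:
        return self.values[-1]

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)

    def expand(self, fanout: int = 2, leaf_size: int = 2) -> List["Node"]:
        """Build this node's children only when the user drills into it."""
        if self.children is None:
            if self.count <= leaf_size:
                self.children = []  # small enough to show the raw objects
            else:
                step = -(-self.count // fanout)  # ceiling division
                self.children = [
                    Node(self.values[i:i + step])
                    for i in range(0, self.count, step)
                ]
        return self.children


def build_root(values: List[float]) -> Node:
    """Create only the root; lower levels are constructed on demand."""
    return Node(sorted(values))


if __name__ == "__main__":
    root = build_root([5, 1, 9, 3, 7, 2, 8, 4])
    print(root.count, root.lo, root.hi, root.mean)
    for child in root.expand():
        print(child.lo, child.hi, child.count, child.mean)
```

Because children are materialized lazily, only the parts of the hierarchy the user actually visits are ever built, which is the intuition behind incremental construction; prefetching would amount to calling `expand` on nodes the user is likely to visit next.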

With respect to semantic data analysis, we focus on three problems. The first problem concerns the integration of XML and the Semantic Web. We present an interoperability framework that bridges the heterogeneity gap by exploiting a model for expressing OWL-RDF/S to XML Schema mappings, a method for SPARQL to XQuery translation, and a model that transforms XML Schemas into OWL ontologies. The second problem concerns the use of semantics in document annotation and retrieval. For this problem, we propose a semantic-based annotation model, as well as a learning method for recommending annotations. We also introduce an effective retrieval method that enriches information retrieval techniques with semantics. In the last problem, we study the modelling and exploration of evolving data, adopting the Linked Data paradigm. As a result, we propose an RDF-based change model and develop a Linked Data infrastructure that allows exploration and retrieval over evolving data.