Managing, Querying and Analyzing Big Data on the Web

Marios Meimaris
University of Thessaly (supervisor: Prof. I. Anagnostopoulos)
PhD Thesis
Abstract. In this thesis, we study information management problems that arise in the Semantic Web, focusing on the Resource Description Framework (RDF) model and its associated SPARQL query language. We pursue three directions, namely (i) RDF data evolution, (ii) storage, indexing and query optimization in RDF/SPARQL engines, and (iii) efficient and scalable information retrieval from multidimensional RDF datasets. For specific problems in each direction, we present efficient and scalable methods, with the ultimate aim of advancing the relevant state of the art. In the first direction (chapters 2 and 3), we study the problem of representing, storing and querying evolving RDF data. To this end, we propose a novel data model and query language that address the representation of versioning in heterogeneous domains. Furthermore, to assist the evaluation of RDF versioning and evolution management engines and frameworks, we introduce a novel synthetic dataset generator. In the second direction (chapters 4, 5 and 6), we tackle the problem of indexing and query optimization, focusing specifically on heavy query workloads over loosely structured RDF datasets. To this end, we propose a novel indexing and storage scheme for RDF data that relies on the underlying graph schema of the data, as well as query optimization algorithms that exploit this schema to accelerate the processing of complex SPARQL queries that traditional systems fail to address. We also provide a method for logical query optimization by triple pattern reordering, in order to further optimize the query processing tasks commonly adopted by database systems. In addition, we introduce a series of algorithms that aim to efficiently transform and compact the underlying RDF schema in order to optimize both storage and query processing.
Finally, in the third direction (chapter 7), we define several types of relationships for multidimensional RDF data cubes, and we propose a series of computational algorithms that target efficient retrieval of these relationships. Extensive experimental evaluations of our methods indicate significant performance improvements with respect to the state of the art.