Merging RDF Characteristic Sets to Optimize SPARQL Queries

Marios Meimaris, George Papastefanatos
17th Hellenic Data Management Symposium (HDMS’19)
Abstract. RDF is nowadays a well-established standard for publishing data on the web. More than a decade of publishing and interlinking has made available RDF datasets that are very large (>1 billion triples) and semantically bridge information from very different domains. To address the volume and heterogeneity of online datasets, recent works have shown that extracting and exploiting the implicit schema of the data can benefit both storage and SPARQL query performance. Schema-based storage and query optimization in RDF rely heavily on two structural components of an RDF dataset, namely (i) characteristic sets (CSs), i.e., the different property sets that characterize subject nodes, and (ii) the join links between CSs, i.e., Extended Characteristic Sets (ECSs) [2], which capture object-subject joins between triples. Storing and indexing RDF datasets based on CSs and ECSs has been shown to yield significant performance benefits in heavy SPARQL workloads [1–4]. However, a drawback of this approach is that it fails to address schema heterogeneity in loosely structured datasets, which exhibit a large number of CSs and, consequently, ECSs (e.g., Geonames contains 851 CSs and 12136 CS links), and thus skewed data distributions that impose large overheads in extraction, storage and disk-based retrieval. In relational settings, this can lead to large numbers of tables and joins between them. To reduce the number of tables derived from an implicit schema, in this paper we propose a method that merges tables corresponding to related CSs, that is, CSs that describe subjects with similar or overlapping properties. An example merge of two tables can be seen in 1. We are interested in meaningful merges that yield densely populated tables (i.e., few tables with many rows instead of many tables with few rows) and at the same time reduce NULL values.
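To make the two structural components concrete, the following is a minimal Python sketch of how CSs and the links between them could be derived from a set of triples. The toy triples and the function names (`characteristic_sets`, `cs_links`) are assumptions made for illustration and are not taken from raxonDB; the link extraction is a simplified stand-in for ECSs, recording only which pairs of CSs are connected by an object-subject join.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; every name here is hypothetical.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "worksFor", "acme"),
    ("bob", "name", "Bob"),
    ("bob", "worksFor", "acme"),
    ("bob", "email", "bob@example.org"),
    ("acme", "label", "ACME"),
    ("acme", "locatedIn", "athens"),
]


def characteristic_sets(triples):
    """Map each subject to the set of properties it appears with (its CS)."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {s: frozenset(ps) for s, ps in props.items()}


def cs_links(triples, cs_of):
    """Collect (subject CS, object CS) pairs for object-subject joins.

    A pair is recorded only when the object of a triple is itself a
    subject with a CS; literals and leaf nodes produce no link.
    """
    links = set()
    for s, _, o in triples:
        if o in cs_of:
            links.add((cs_of[s], cs_of[o]))
    return links


cs_of = characteristic_sets(triples)
links = cs_links(triples, cs_of)

print(len(set(cs_of.values())), "distinct CSs,", len(links), "CS links")
```

Under a CS-based relational layout, each distinct CS would map to one table, so even this small input yields three tables; the two person CSs differ only in the `email` property, which is the kind of overlap that merging targets.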
Specifically, we exploit the hierarchical relationships between CSs, as captured by subsumption of their respective property sets, in order to merge related CSs. To this end, we present raxonDB, a novel relational system that uses these hierarchies to merge hierarchically related CSs and decrease the number of relational tables and joins between them, resulting in a more compact schema with better data distribution. We follow a relational implementation approach, storing all triples that correspond to a set of merged CSs in a separate relational table and executing queries through a SPARQL-to-SQL transformation. Although alternative storage technologies could be considered (key-value, graph stores, etc.), we have selected well-established technologies and database systems for the implementation of our approach, in order to take advantage of existing indexing and query processing methods that have been proven to scale efficiently on complex datasets.
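The idea of merging by subsumption can be illustrated with a short Python sketch. It is not the raxonDB algorithm: it assumes a greedy rule in which each CS table is folded unconditionally into its smallest strict superset, whereas a real system would presumably decide which merges are worthwhile based on table density. The function names (`merge_subsumed`, `null_count`) and the toy tables are hypothetical.

```python
def merge_subsumed(tables):
    """Fold each CS table into its smallest strict superset CS, if any.

    `tables` maps a CS (frozenset of property names) to a list of rows,
    where each row is a dict from property name to value.
    """
    merged = {cs: list(rows) for cs, rows in tables.items()}
    # Visit the least specific property sets first, so that rows moved
    # into a superset can be carried further up a chain of supersets.
    for cs in sorted(tables, key=len):
        supersets = [other for other in merged if cs < other]
        if not supersets:
            continue  # cs is maximal and keeps its own table
        target = min(supersets, key=len)
        merged[target].extend(merged.pop(cs))
    return merged


def null_count(merged):
    """Count the NULL cells: columns of a table that a row does not fill."""
    return sum(
        1
        for cs, rows in merged.items()
        for row in rows
        for prop in cs
        if prop not in row
    )


# Hypothetical CS tables: three nested person CSs and one unrelated CS.
tables = {
    frozenset({"name"}): [{"name": "Carol"}],
    frozenset({"name", "worksFor"}): [{"name": "Alice", "worksFor": "acme"}],
    frozenset({"name", "worksFor", "email"}): [
        {"name": "Bob", "worksFor": "acme", "email": "bob@example.org"}
    ],
    frozenset({"label", "locatedIn"}): [{"label": "ACME", "locatedIn": "athens"}],
}

merged = merge_subsumed(tables)
print(len(tables), "tables before,", len(merged), "after,",
      null_count(merged), "NULL cells introduced")
```

In this toy input the three nested person CSs collapse into one table with three rows, at the cost of three NULL cells, while the unrelated CS keeps its own table. This makes the trade-off visible: fewer tables and joins on one side, additional NULL values on the other.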