Merging RDF Characteristic Sets to Optimize SPARQL Queries
17th Hellenic Data Management Symposium (HDMS’19)
2019
Conference/Workshop
- Contact persons: Marios Meimaris, George Papastefanatos
- Relevant research project: VisualFacts
Abstract.
RDF is nowadays a well-established standard for publishing data
on the web. More than a decade of publishing and interlinking
has made available RDF datasets that exhibit very large sizes
(>1 billion triples) and semantically bridge together information
from very different domains. To address the volume and heterogeneity
of the online datasets, recent works have shown that
extraction and exploitation of the implicit schema of the data can
be beneficial in both storage and SPARQL query performance.
Schema-based storage and query optimization in RDF rely heavily
on two structural components of an RDF dataset, namely (i) the
notion of characteristic sets (CS), i.e., different property sets that
characterize subject nodes, and (ii) the join links between CSs,
i.e., Extended Characteristic Sets (ECS) [2], which capture object-subject
joins between triples. Storing and indexing RDF datasets
based on the CSs and ECSs has proven to yield significant performance
benefits in heavy SPARQL workloads [1–4]. However,
a drawback of this approach is that it fails to address schema heterogeneity
in loosely-structured datasets, which exhibit a large
number of CSs and, consequently, ECSs (e.g., Geonames contains
851 CSs and 12136 CS links), and thus, skewed data distributions
that impose large overheads in the extraction, storage and
disk-based retrieval. In relational settings, this can lead to large
numbers of tables and joins between them.
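
As an illustration of the two structural components described above, the following minimal sketch derives CSs and CS links from a small in-memory list of triples. The function names (`characteristic_sets`, `cs_links`) and the toy data are assumptions made for this example and are not taken from the paper's implementation.

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Map each distinct property set (a CS) to the subjects that exhibit it."""
    props_by_subject = defaultdict(set)
    for s, p, _ in triples:
        props_by_subject[s].add(p)

    subjects_by_cs = defaultdict(set)
    for s, props in props_by_subject.items():
        subjects_by_cs[frozenset(props)].add(s)
    return dict(subjects_by_cs)


def cs_links(triples, subjects_by_cs):
    """Return (subject CS, object CS) pairs connected by an object-subject join."""
    cs_of = {s: cs for cs, subjects in subjects_by_cs.items() for s in subjects}
    links = set()
    for s, _, o in triples:
        # Only objects that also appear as subjects have a CS of their own.
        if o in cs_of:
            links.add((cs_of[s], cs_of[o]))
    return links


# Hypothetical toy dataset, for illustration only.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "worksFor", "acme"),
    ("bob", "name", "Bob"),
    ("bob", "worksFor", "acme"),
    ("acme", "label", "ACME"),
]

cs = characteristic_sets(triples)
links = cs_links(triples, cs)
```

Here `alice` and `bob` share one CS, `acme` has another, and a single link connects the two. In a loosely-structured dataset, small variations in the properties of each subject would multiply both counts.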
To reduce the number of tables from an implicit schema, in
this paper we propose a method to merge together tables that
correspond to related CSs, that is, CSs that describe subjects
with similar or overlapping properties. An example merge of
two tables can be seen in Figure 1. In this context, we are interested in
meaningful merges that yield densely populated tables (i.e., few
tables with many rows instead of many tables with few rows)
and at the same time reduce NULL values.
Specifically, we exploit the hierarchical relationships between
CSs, as captured by subsumption of their respective property
sets, in order to merge related CSs. To this end, we present a
novel relational system, named raxonDB, that exploits these hierarchies
in order to merge together hierarchically related CSs and
decrease the number of relational tables and joins between them,
resulting in a more compact schema with better data distribution.
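
The idea of merging by subsumption can be sketched as follows. This is a simplified greedy illustration and not the raxonDB algorithm: it assumes that a CS is folded into the largest already-kept CS whose property set is a proper superset of its own, and that properties missing from a row become NULLs (`None`). The names `merge_into_supersets` and `null_count` and the sample rows are hypothetical.

```python
def merge_into_supersets(rows_by_cs):
    """Fold each CS table into a kept table whose property set strictly contains it.

    rows_by_cs maps frozenset(properties) -> list of row dicts.
    """
    # Visit larger property sets first so that they become merge targets.
    ordered = sorted(rows_by_cs, key=lambda cs: (-len(cs), sorted(cs)))
    merged = {}
    for cs in ordered:
        # `cs < t` tests for a proper subset, i.e. t subsumes cs.
        target = next((t for t in merged if cs < t), None)
        if target is None:
            merged[cs] = list(rows_by_cs[cs])
        else:
            merged[target].extend(rows_by_cs[cs])
    return merged


def null_count(merged):
    """Count the NULL cells that the merged tables would contain."""
    return sum(
        1
        for cs, rows in merged.items()
        for row in rows
        for prop in cs
        if row.get(prop) is None
    )


# Hypothetical CS tables, for illustration only.
rows_by_cs = {
    frozenset({"name"}): [{"name": "Carol"}],
    frozenset({"name", "email"}): [{"name": "Alice", "email": "a@x.org"}],
    frozenset({"label"}): [{"label": "ACME"}],
}

merged = merge_into_supersets(rows_by_cs)
```

In this example three tables collapse into two, at the cost of one NULL cell (the missing `email` for Carol). This is the trade-off between fewer tables and more NULL values that a merging strategy has to balance.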
We follow a relational implementation approach by storing all
triples corresponding to a set of merged CSs into separate relational
tables and by executing queries through a SPARQL-to-SQL
transformation. Although alternative storage technologies can
be considered (key-value, graph stores, etc.), we have selected
well-established technologies and database systems for the implementation
of our approach, in order to take advantage of existing
indexing and query processing methods that have been proven
to scale efficiently in complex datasets.