Integrating incomplete and possibly inconsistent data from various sources is a challenge that arises in several application areas, especially in the management of scientific data. A growing trend in data integration is to model the data as axioms in the Web Ontology Language (OWL) and use inference rules to derive new facts. While several approaches employ OWL for data integration, there is little work on scalable algorithms that can handle large datasets that do not fit in main memory.
The main contribution of this paper is an algorithm that enables the use of OWL rules for data integration in environments with limited memory. We propose a technique that exhaustively applies a set of inference rules to large disk-resident datasets. To the best of our knowledge, this is the first work to propose an I/O-aware method for evaluating such an expressive subset of OWL; previous approaches considered either simpler models (e.g., RDF) or main-memory algorithms. In this paper we detail the proposed algorithm, prove its correctness, and experimentally evaluate it on real and synthetic data.
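To give a flavor of what "exhaustively applying a set of inference rules" means, the following is a minimal in-memory sketch of forward chaining to a fixed point. It is not the paper's I/O-aware algorithm (which operates on disk-resident data); the single rule shown, transitivity of a hypothetical `subClassOf` predicate, is only an illustration.

```python
# Minimal sketch (not the paper's I/O-aware algorithm): forward chaining
# that applies inference rules until no new facts can be derived (fixpoint).
# Facts are (subject, predicate, object) triples; the rule is hypothetical.

def transitivity(facts):
    """Derive (a, subClassOf, c) from (a, subClassOf, b) and (b, subClassOf, c)."""
    derived = set()
    for (s1, p1, o1) in facts:
        if p1 != "subClassOf":
            continue
        for (s2, p2, o2) in facts:
            if p2 == "subClassOf" and o1 == s2:
                derived.add((s1, "subClassOf", o2))
    return derived

def saturate(facts, rules):
    """Apply every rule repeatedly until no new facts are produced."""
    facts = set(facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(facts) - facts
        if not new:
            return facts
        facts |= new

facts = {("A", "subClassOf", "B"),
         ("B", "subClassOf", "C"),
         ("C", "subClassOf", "D")}
closed = saturate(facts, [transitivity])
# The closure adds ("A","subClassOf","C"), ("B","subClassOf","D"),
# and ("A","subClassOf","D"), for 6 facts in total.
```

The challenge the paper addresses is that this naive loop assumes all facts fit in memory; a disk-resident setting requires organizing the rule applications so that data is read and written in I/O-efficient passes.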