Data provenance and preservation

The provenance and preservation of information have recently attracted considerable attention. Provenance deals with tracing the origins and transformations of information so that its quality can be assessed. Preservation deals with ensuring that data is maintained and remains available, despite modifications to the data itself or the evolution of storage technology. From an information system point of view, provenance and preservation are closely related, as they both deal with change: provenance can be used to interpret data, which is essential to the preservation of knowledge.

The objective is to support data management in evolving, interrelated databanks in such a way that it is always possible to step back (support for schema and data evolution) and to examine how and why changes took place (provenance support). This is especially useful for databanks published on the Web, which are copied or referenced by other databanks, thus forming a web of interrelated and evolving systems. Such databanks may contain, for example, scientific information (such as the biological databanks UniProt, Rfam, and IUPHAR), and they are important because they constitute a record of the evolution of a scientific field.

To incorporate provenance and preservation in such systems, we need data models that can tolerate and record change, query languages that can reveal the data trails, and tools that can be adapted to, and enhance, existing operational systems.
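
As a rough illustration of what a data model that records change might look like, the Python sketch below keeps an append-only history for every entry, so that earlier states can be retrieved and the trail of changes can be listed. It is only a minimal sketch under simplifying assumptions: the names VersionedStore, Version, put, get, and trail, as well as the entry key and source label in the usage lines, are hypothetical and do not come from any existing system.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Version:
    """One recorded state of an entry, with minimal provenance."""
    number: int            # position in the entry's history, starting at 1
    value: Any             # the value after this change
    source: Optional[str]  # where the value came from, if known
    reason: str            # why the change was made


@dataclass
class VersionedStore:
    """Hypothetical store that never overwrites: every change is appended."""
    _history: dict = field(default_factory=dict)

    def put(self, key, value, reason, source=None):
        # Append a new version instead of replacing the old one.
        versions = self._history.setdefault(key, [])
        version = Version(len(versions) + 1, value, source, reason)
        versions.append(version)
        return version

    def get(self, key, at=None):
        # Return the latest value, or the value as of version number `at`.
        versions = self._history[key]
        if at is None:
            return versions[-1].value
        if not 1 <= at <= len(versions):
            raise IndexError(f"{key!r} has no version {at}")
        return versions[at - 1].value

    def trail(self, key):
        # List how and why the entry changed, oldest change first.
        return [(v.number, v.source, v.reason) for v in self._history[key]]


# Usage with made-up data (the key and source label are illustrative only).
store = VersionedStore()
store.put("entry-1", {"name": "kinase A"}, "initial import", source="upstream-databank")
store.put("entry-1", {"name": "kinase A1"}, "renamed after curation")
print(store.get("entry-1"))        # latest state
print(store.get("entry-1", at=1))  # stepping back to the first state
print(store.trail("entry-1"))      # the data trail
```

A real system would also have to handle schema evolution, links between databanks, and a declarative query language over the trail, none of which this sketch attempts.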

Relevant research activity: C2D