Privacy preservation in the publication of sparse multidimensional data.

Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis
In Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques, Francesco Bonchi, Elena Ferrari, Eds, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC press

The aim of this chapter is to investigate the preservation of privacy in the publication of sparse multidimensional data. Miltidimensional data stem from several application areas in the form of set-values, sequences, trajectories, time series, web logs and even multirelation database schemas. Multiple dimensions pose significant challenges in the sanitation of the data. The most profound challenge is the inherent difficulty of creating equivalence classes with common values in a large number of dimensions. The high dimensionality undermines the efforts to group the data in a way that it prohibits the attacker from identifying sensitive information and it requires data transformations that compromise substantially the quality of the data. In the general case this is a hard problem for highly dimensional data. Nevertheless, when data are sparse high dimensional vectors, it is possible to devise solutions that permit publishing the data and preserving the privacy of the associated individuals, without an unacceptable information loss.

In this chapter we explain the difficulties in sanitizing multidimensional data and we explain how sparsity can be exploited to find efficient anonymizing techniques. We present methods that anonymize sparse multidimensional data for a general purpose publication, i.e., without any specific assumptions on how the data will be used. We offer a survey of the best-known techniques in this direction, focusing on three basic methods that provide different privacy guarantees: k-anonymity, l-diversity and km/i>-anonymity. We conclude with an outlook of the most important challenges in this area, including the anonymization of web log data.