Privacy Preservation by Disassociation

Manolis Terrovitis, John Liagouris, Nikos Mamoulis, Spiros Skiadopoulos
Technical Report
Abstract. In this work, we focus on preserving user privacy in the publication of sparse multidimensional data. Existing works protect users' sensitive information by generalizing or suppressing quasi-identifiers in the original data. In many real-world cases, neither generalization nor the distinction between sensitive and non-sensitive items is appropriate. For example, web search query logs contain millions of terms that are very hard to categorize as sensitive or non-sensitive. At the same time, a generalization-based anonymization would remove the most valuable information in the dataset: the original terms. Motivated by this problem, we propose an anonymization technique termed disassociation that preserves the original terms but hides the fact that two or more different terms appear in the same record. Up to now, such techniques have been used to sever the link between quasi-identifiers and sensitive values in settings with a clear distinction between these two types of values. Our proposal generalizes these techniques to sparse multidimensional data, where no such distinction holds. We protect users' privacy either by disassociating combinations of terms that can act as quasi-identifiers from the rest of the record, or by disassociating their constituent terms, so that the identifying combination cannot be accurately recognized. To this end, we present an algorithm that anonymizes the data by first clustering the records and then locally disassociating identifying combinations of terms. We analyze the attack model and extend the k^m-anonymity guarantee to the aforementioned setting. We empirically evaluate our method on real and synthetic datasets.
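The abstract refers to the k^m-anonymity guarantee without restating it. As a rough illustration only, the sketch below assumes the usual definition from earlier work on set-valued data: every combination of at most m terms that occurs in some record must occur in at least k records. It checks that property on original (not disassociated) records, so it shows which term combinations are identifying, not how disassociation removes them. The function name `is_km_anonymous` and the toy query log are hypothetical and do not come from the report.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(records, k, m):
    """Return True if every combination of up to m terms that occurs in
    some record occurs in at least k records (assumed definition)."""
    support = Counter()
    for record in records:
        terms = sorted(set(record))  # canonical order so equal combos match
        for size in range(1, min(m, len(terms)) + 1):
            for combo in combinations(terms, size):
                support[combo] += 1
    return all(count >= k for count in support.values())


# Toy query log (made-up data): each record is one user's set of search terms.
log = [
    {"flu", "madonna"},
    {"flu", "madonna"},
    {"flu", "viagra"},
    {"flu", "viagra"},
]

# One extra record introduces a pair of terms that no other record shares.
log_with_rare_pair = log + [{"madonna", "viagra"}]

print(is_km_anonymous(log, k=2, m=2))
print(is_km_anonymous(log_with_rare_pair, k=2, m=2))
```

In the second log, each individual term is still frequent, but the pair ("madonna", "viagra") appears in a single record. An attacker who knows both terms could single out that record, which is the kind of identifying combination the abstract proposes to disassociate. The brute-force enumeration is exponential in m and is meant only for small examples.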