Re-identification is the process by which anonymized personal data is matched with its true owner. To protect the privacy interests of consumers, personal identifiers, such as name and social security number, are often removed from databases containing sensitive information. This anonymized, or de-identified, data is meant to safeguard the privacy of consumers while still making useful information available to marketers or data mining companies. Recently, however, computer scientists have shown that this "anonymized" data can easily be re-identified, so that the sensitive information may be linked back to an individual. Re-identification implicates privacy rights: organizations often maintain that privacy obligations do not apply to anonymized information, but if the data is in fact personally identifiable, then those obligations should apply.
One re-identification strategy relies on data trails. An adversary can independently reconstruct the trails of locations visited by identified entities and by their un-identified data, and these trails can then be employed for re-identification via trail matching. The attack is based on the premise that data-collecting institutions partition and release a dataset as multiple subsets, such that one release contains identifying attributes (e.g. name, social security number, phone number) and a second is devoid of these attributes (e.g. DNA sequences). The trail attack depends on whether the identified data is always collected together with the un-identified data, termed complete, or whether one of the attributes is under-collected, termed incomplete. Both the complete and incomplete trail problems are formalized, and several novel algorithms for re-identification are introduced. Examples are drawn from the areas of clickstream, DNA sequence, health, and video data.
(T. Hinke. Inference aggregation detection in database management systems. In Proc. of IEEE Symp. on Research in Security and Privacy, Oakland, California, 96-107, 1988.)
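To make the complete trail case more concrete, here is a minimal sketch in Python. It is not one of the published algorithms; it only illustrates the idea that, when identified and un-identified data are always collected together, a name and a de-identified record with the same unique pattern of visited locations must belong together. The location names (H1, H2, H3), the people, and the record labels are all invented for illustration.

```python
from collections import defaultdict

def build_trails(releases, locations):
    """Map each entity to a tuple of 0/1 flags, one flag per location."""
    entities = set().union(*releases.values())
    return {
        e: tuple(int(e in releases[loc]) for loc in locations)
        for e in entities
    }

def group_by_trail(trails):
    """Invert a trail mapping: trail -> list of entities with that trail."""
    groups = defaultdict(list)
    for entity, trail in trails.items():
        groups[trail].append(entity)
    return groups

def match_complete_trails(identified, deidentified, locations):
    """Link a name to a record when both have the same trail and that
    trail is unique on each side (assumes the complete case)."""
    names_by_trail = group_by_trail(build_trails(identified, locations))
    data_by_trail = group_by_trail(build_trails(deidentified, locations))
    matches = {}
    for trail, names in names_by_trail.items():
        records = data_by_trail.get(trail, [])
        if len(names) == 1 and len(records) == 1:
            matches[names[0]] = records[0]
    return matches

# Hypothetical releases: each location publishes a list of names and,
# separately, a list of de-identified records (e.g. DNA sequences).
locations = ["H1", "H2", "H3"]
identified = {
    "H1": {"Alice", "Bob"},
    "H2": {"Alice", "Carol"},
    "H3": {"Bob", "Carol", "Dave", "Eve"},
}
deidentified = {
    "H1": {"dna1", "dna2"},
    "H2": {"dna1", "dna3"},
    "H3": {"dna2", "dna3", "dna4", "dna5"},
}

matches = match_complete_trails(identified, deidentified, locations)
print(matches)
```

In this toy data Alice, Bob, and Carol each have a unique trail and are linked to one record, while Dave and Eve visited only H3 and so cannot be told apart. The incomplete case, where one side is under-collected, needs a looser comparison than exact equality and is not covered by this sketch.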
Unique Identification Through Zip Code, Sex, Birthdate
Latanya Sweeney, a computer science professor, conducted a study in 2000 using 1990 census data and found that zip code, birth date, and sex could be combined to uniquely identify 87% of the United States population. To illustrate this threat, Sweeney obtained data from a Massachusetts government agency, the Group Insurance Commission (GIC), and used it to identify the records of the state's governor. GIC, a purchaser of health insurance for state employees, released records of those employees to researchers. With the support of Governor Weld of Massachusetts, GIC removed names, addresses, social security numbers, and other identifying information in order to protect the privacy of these employees. Governor Weld assured Massachusetts residents that the released information would remain private.
Sweeney purchased the voter rolls for Cambridge, where Governor Weld resided, which included the name, zip code, address, sex, and birth date of each voter. She combined this information with GIC’s data and easily found the governor. From GIC’s databases, only six people in Cambridge were born on the same day as the governor; half of them were men; and of those, the governor was the only one who lived in the zip code provided by the voter rolls. The information in the GIC database on the Massachusetts governor included prescriptions and diagnoses.
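The uniqueness finding can be illustrated with a small sketch that measures what fraction of a population is singled out by a given combination of attributes. The function name `unique_fraction`, the field names, and the five-person population are invented for illustration; this is not Sweeney's method or data.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose combination of the given attributes
    is shared with no other record."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical toy population; the values are made up.
people = [
    {"zip": "02138", "birth_date": "1950-03-02", "sex": "M"},
    {"zip": "02138", "birth_date": "1950-03-02", "sex": "F"},
    {"zip": "02139", "birth_date": "1950-03-02", "sex": "M"},
    {"zip": "02139", "birth_date": "1962-11-20", "sex": "F"},
    {"zip": "02139", "birth_date": "1971-06-09", "sex": "F"},
]

for attrs in (["zip"], ["zip", "sex"], ["zip", "birth_date", "sex"]):
    print(attrs, unique_fraction(people, attrs))
```

No single attribute is identifying in this toy population, but each attribute added to the combination leaves fewer people sharing the same values, which is the effect the 87% figure describes at national scale.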
Predicting SSNs by Birth date and State
In their 2009 study, Carnegie Mellon professor Alessandro Acquisti and researcher Ralph Gross demonstrated, through a two-step process, how an SSN can be predicted from an individual’s birth date and geographic location. First, the researchers analyzed public records in the Social Security Administration’s Death Master File (DMF) to examine statistical trends in the assignment of SSNs to those whose deaths were reported to the Social Security Administration. Second, they combined the patterns derived from the DMF analysis with a living individual’s state and birth date, which can be found in various offline sources, such as voter registration lists, or online sources, such as social networking sites. In this way, Acquisti and Gross identified the first 5 digits for 44% of DMF records from 1989 to 2003 and complete SSNs in fewer than 1,000 attempts for 8.5% of the records. They found a strong correlation between birth date and all nine digits of an SSN, a correlation that increases for individuals in less populous states. These results have important consequences for the living population of the United States, as they imply that millions of SSNs can be identified for individuals whose birth dates are known.
How Data is Re-identified
In each of the above cases, data was re-identified by combining two datasets with different types of information about an individual. One of the datasets contained anonymized information; the other contained outside information, generally available to the public and collected on a daily or routine basis (such as voter registration information), which includes identifying information (e.g., name). The two datasets will usually have at least one type of information in common (e.g., birth date), which links the anonymized information to an individual. By combining information from each of these datasets, researchers can uniquely identify individuals in the population. While companies tend to focus on the removal of personally identifiable information (PII), the studies above show that re-identification can occur even by combining non-PII, such as movie ratings in the Netflix study or search engine queries in the AOL example.
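The linkage step described above can be sketched as a join on the attributes the two datasets share. The following Python example is only an illustration under simple assumptions: the function name `link_records`, the field names, and all records are hypothetical, and the values in both datasets are assumed to be formatted identically.

```python
def link_records(anonymized, public, shared_keys):
    """Attach identities to anonymized records whose shared attributes
    match exactly one person in the public dataset."""
    # Index the public dataset by its shared-attribute values.
    index = {}
    for person in public:
        key = tuple(person[k] for k in shared_keys)
        index.setdefault(key, []).append(person)

    linked = []
    for record in anonymized:
        key = tuple(record[k] for k in shared_keys)
        candidates = index.get(key, [])
        # Only a single candidate gives an unambiguous re-identification.
        if len(candidates) == 1:
            linked.append({**candidates[0], **record})
    return linked

# Hypothetical public dataset with names (e.g. a voter roll).
voter_rolls = [
    {"name": "Voter A", "zip": "02138", "birth_date": "1950-03-02", "sex": "M"},
    {"name": "Voter B", "zip": "02138", "birth_date": "1950-03-02", "sex": "F"},
    {"name": "Voter C", "zip": "02139", "birth_date": "1962-11-20", "sex": "F"},
    {"name": "Voter D", "zip": "02139", "birth_date": "1962-11-20", "sex": "F"},
]

# Hypothetical anonymized dataset: names removed, sensitive field kept.
health_records = [
    {"zip": "02138", "birth_date": "1950-03-02", "sex": "M",
     "diagnosis": "hypothetical diagnosis X"},
    {"zip": "02139", "birth_date": "1962-11-20", "sex": "F",
     "diagnosis": "hypothetical diagnosis Y"},
]

linked = link_records(health_records, voter_rolls, ["zip", "birth_date", "sex"])
for row in linked:
    print(row["name"], "->", row["diagnosis"])
```

Here the first health record matches exactly one voter and is re-identified, while the second matches two voters and stays ambiguous. Real datasets would also need normalization of formats, which this sketch leaves out.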