
Graphing the sensitive boundary between personally identifiable information and publicly inferable insights


Sleuthing is the art of making intelligent inferences from sparse, disconnected and seemingly random data. Detectives, like any skilled analyst, are adept at using these inferences to solve real-world problems. Analysis is a profession that often demands ingenuity in how you play with the data that comes into your possession.

When analysts encounter sensitive data, they must be careful not to play fast and loose with it. As the European Union’s General Data Protection Regulation compliance deadline approaches, many enterprises are revisiting their procedures for analyzing, storing and protecting customers’ personal data. Privacy is serious business. Rather than assume that all customer data is fair game for them to explore, analysts must observe the bounds of what uses the customer has explicitly opted into.

GDPR strongly protects personally identifiable information, or PII, that’s maintained in digital data stores of all sorts. To that end, enterprises everywhere, not just in EU member states, are investing in platforms and tools to beef up privacy protections as a standard operating procedure that involves:

  • Inventorying all customer PII;
  • Establishing processes for gaining customer consent to acquire, store, profile, process, disclose and manage PII, as well as to erase, correct, withhold and restrict processing of it;
  • Flagging PII records for erasure, correction, nonprocessing and nontransfer;
  • Providing customers with comprehensive access to their stored PII; and
  • Logging all customer requests related to protecting their PII (a minimal sketch of such a consent-aware record follows this list).
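To make those procedures concrete, here is a minimal sketch in Python, with entirely hypothetical field and method names, of what a consent-aware PII record might look like. It illustrates the bookkeeping involved (consent scopes, erasure flags, request logging), not an actual compliance implementation.

```python
# Minimal sketch of a consent-aware PII record; all names are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PIIRecord:
    customer_id: str
    attributes: dict                                   # e.g. {"email": "...", "address": "..."}
    consents: set = field(default_factory=set)         # uses the customer explicitly opted into
    flagged_for_erasure: bool = False
    processing_restricted: bool = False
    request_log: list = field(default_factory=list)    # every data-subject request, timestamped

    def record_request(self, kind: str) -> None:
        """Log a customer request (access, correction, erasure, restriction)."""
        self.request_log.append((datetime.utcnow().isoformat(), kind))
        if kind == "erasure":
            self.flagged_for_erasure = True
        elif kind == "restrict_processing":
            self.processing_restricted = True

    def may_process(self, purpose: str) -> bool:
        """Allow processing only for purposes the customer opted into and hasn't restricted."""
        return purpose in self.consents and not self.processing_restricted
```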

But these sorts of procedures may be of little use in stopping a determined analyst from using sophisticated tools to infer insights about a person that are not recorded as specific facts about them in any database, or that are available for the taking if you know how to correlate two or more public data sources.

That’s the privacy issue behind one of this week’s big stories: the apprehension of a suspect in the decades-old “Golden State Killer” case. In a nutshell, the suspect in many murders was fingered after a team of investigators tracked him down through correlation of crime-scene DNA with a publicly accessible family-genetics database. From a privacy standpoint, what was most noteworthy about this case is how it proceeded through careful correlation of publicly available information, with only a smidgen of personally identifying DNA evidence that fell short of the traditional scope of PII.

Over the decades since the “Golden State Killer” crimes were committed, investigators had come up short in useful leads, having exhausted the potential of criminal DNA databases, crime-scene fingerprints and informant tips. So they took an alternate route to an answer: using the suspect’s DNA from a duplicate crime-scene evidence kit as a key to searching for the killer through family connections in a publicly accessible family-genetics database service. They took the following steps to finger the suspect:

  • Had a lab convert the crime-scene DNA sample into a format that could be read by the service: GEDmatch, a no-frills website for user-uploaded genetic profiles that analyzes hundreds of thousands of DNA data points to determine family connections;
  • Learned from that service that the killer had 10 to 20 distant relatives represented in the database, essentially third cousins who could trace back their lineages to common great-great-great grandparents from the early 1800s;
  • Used those relatives’ genetic profiles to construct 25 distinct family trees in an Ancestry.com graphing tool;
  • Relied on other publicly available data sources — including census data, old newspaper clippings, gravesite locators, police databases and LexisNexis — to fill in blanks in those family trees;
  • Vetted the family trees for individuals who would have been around the killer’s estimated age when the crimes were committed, focusing on those with connections to the California locations of the incidents (a toy sketch of this filtering logic follows the list);
  • Narrowed down the suspects to two;
  • Eliminated one through the DNA test of a relative; and
  • Confirmed the suspect through analysis of DNA on an item he had discarded.
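As a toy illustration of the vetting step above, the following Python sketch, with invented names, dates and counties, filters candidate family-tree members by plausible age at the time of the crimes and by ties to the relevant locations. The real investigation drew on far richer records, so treat this purely as a sketch of the filtering logic.

```python
# Toy candidate-narrowing sketch; all people and data below are invented.
CRIME_YEARS = range(1974, 1987)                     # approximate span of the crimes
CRIME_COUNTIES = {"Sacramento", "Santa Barbara", "Orange"}

candidates = [
    {"name": "Person A", "birth_year": 1945, "counties": {"Sacramento", "Placer"}},
    {"name": "Person B", "birth_year": 1902, "counties": {"Sacramento"}},
    {"name": "Person C", "birth_year": 1950, "counties": {"Fresno"}},
]

def plausible(person, min_age=18, max_age=45):
    """Keep people who were roughly the right age during any crime year
    and who have a documented tie to one of the crime locations."""
    right_age = any(min_age <= year - person["birth_year"] <= max_age
                    for year in CRIME_YEARS)
    local_tie = bool(person["counties"] & CRIME_COUNTIES)
    return right_age and local_tie

shortlist = [p["name"] for p in candidates if plausible(p)]
print(shortlist)    # -> ['Person A']
```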

What relevance does this have to GDPR or any other privacy protection regulation? Privacy can be difficult to safeguard when anybody can conceivably put two and two together without access to PII.

There is a fuzzy boundary between information that’s personally identifiable and insights about persons that are publicly inferable. GDPR and similar mandates only cover protection of discrete pieces of PII that are maintained in digital databases and other recordkeeping systems. But some observers seem to be arguing that the regulation also encompasses insights that might be gained in the future about somebody through analytics on unprotected data. That’s how I’m construing David Loshin’s statement that “sexual orientation [is] covered under GDPR, too.”

My pushback to Loshin’s position is to point out that it’s not terribly common for businesses or nonprofits to record people’s sexual orientation, unless an organization specifically serves one or more segments of the LGBTQ community — and even then, it’s pointless and perhaps gauche and intrusive to ask people to declare their orientation formally as a condition of membership. So it’s unlikely you’ll find businesses maintaining PII profile records stating that someone is gay, lesbian, bisexual or whatever.

However, you may very plausibly infer someone’s sexuality if they belong to such an LGBTQ-focused group, though the inference isn’t airtight (some members may simply be showing solidarity with friends or family who belong). By the same token, you may use machine learning to automate such inferences through correlation of disparate sources of publicly available big data. You can also use social graph analysis of their Facebook friends to make such inferences. And you can perform sophisticated multivariate behavioral and demographic analyses of publicly available data that might not fit the canonical definition of PII to achieve highly accurate inferences about individuals.
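A minimal sketch of that kind of social-graph inference follows, using made-up people and a simple majority vote over a public friend list. Real systems use far richer features and models; the point here is only how little machinery a plausible inference requires.

```python
# Illustrative social-graph inference; the graph and labels below are made up.
from collections import Counter

# Hypothetical public friendship graph and a few publicly declared affiliations.
friends = {
    "alice": ["bob", "carol", "dave"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave":  ["alice", "carol"],
}
known_labels = {"bob": "member", "carol": "member", "dave": "non-member"}

def infer(person):
    """Guess an unlabeled person's affiliation from their connections' labels."""
    votes = Counter(known_labels[f] for f in friends[person] if f in known_labels)
    if not votes:
        return None, 0.0
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())       # predicted label, rough confidence

print(infer("alice"))    # -> ('member', 0.666...)
```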

With that as prelude, let’s return to the issue of inferring private matters about people from public databases, including searchable genomic databases that are springing up on blockchains and pretty much everywhere else. Though many of these allow people to sell their DNA profiles in anonymized format, the techniques for analytically re-identifying individuals from their genetic profiles are well-known. Combined with graph analysis tools, this opens up the very real possibility that someone will also be able to identify your blood relatives in these databases, even if they, like you, never authorized your identities to be revealed.
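To give a flavor of how relatedness can be estimated from genetic profiles, here is a deliberately simplified Python sketch that scores allele overlap between two made-up genotype vectors. Real matching services compare the lengths of shared DNA segments (measured in centimorgans) across hundreds of thousands of markers, so the thresholds below are purely illustrative.

```python
# Deliberately simplified relatedness sketch; real services use shared-segment
# (centimorgan) analysis, not raw allele overlap. All data below is made up.
def shared_fraction(profile_a, profile_b):
    """Fraction of marker positions where two profiles share at least one allele.
    Each profile is a list of (allele1, allele2) tuples at the same positions."""
    shared = sum(1 for a, b in zip(profile_a, profile_b) if set(a) & set(b))
    return shared / len(profile_a)

def rough_relationship(fraction):
    # Hypothetical thresholds for illustration only.
    if fraction > 0.95:
        return "self or identical twin"
    if fraction > 0.85:
        return "close relative"
    if fraction > 0.70:
        return "distant relative (e.g., a third cousin)"
    return "likely unrelated"

a = [("A", "G"), ("C", "C"), ("T", "A"), ("G", "G")]
b = [("A", "A"), ("C", "T"), ("T", "T"), ("C", "C")]
print(rough_relationship(shared_fraction(a, b)))    # -> distant relative (e.g., a third cousin)
```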

That’s simply smart inferencing, though some, like this headline writer, regard it as a form of “hacking” into your DNA data. All of this suggests that, where privacy safeguards are concerned, the practical distinctions will continue to blur between protecting stored PII — which is the focus of GDPR and other mandates — and preventing sophisticated inferencing about sensitive personal matters, regardless of where the source data for those insights came from. People will be able to make all sorts of privacy-sensitive inferences about you and your relatives from data that’s pretty much in the clear, though no one ever secured your consent because no PII was used in the process.

As more people acquire and share their own genomic PII — whether through employer-sponsored programs, physician-supervised testing, do-it-yourself kits and so on — these unintended “privacy hacks” will become more frequent. The growing availability of powerful graph analysis tools will put more of your extended family at risk. Even if none of them ever comes under criminal suspicion, there are many secrets — related to medical, parentage and other sensitive matters — that society has every interest in protecting as best we can.

As “privacy hacks” of this sort become more common, they make a mockery of GDPR’s core premise that PII data management platforms and tools can ensure “privacy by design” and “privacy by default.” They even call into doubt the notion that you can establish a secure perimeter around your otherwise GDPR-compliant data governance environment within which customers’ personal matters can always be locked down and kept opaque from prying eyes.

For insights into privacy controls in the GDPR era, check out recent comments on theCUBE by Eve Maler, vice president of innovation and emerging technology at ForgeRock, who spoke with theCUBE host Jeff Frick at the recent Data Privacy Day event.
