The Privacy Problems (?) of Outsourcing the Dragnet
Both Ed Felten …
I am reminded of the scene in Austin Powers where Dr. Evil, in exchange for not destroying the world, demands the staggering sum of “… one MILLION dollars.” In the year 2014, billions of records is not a particularly large database, and searching through billions of records is not an onerous requirement. The metadata for a billion calls would fit on one of those souvenir thumb drives they give away at conferences; or if you want more secure, backed up storage, Amazon will rent you what you need for $3 a month. Searching through a billion records looking for a particular phone number seems to take a few minutes on my everyday laptop, but that is only because I didn’t bother to build a simple index, which would have made the search much faster. This is not rocket science.
and Tim Edgar have started thinking about how to solve the dragnet problem.
One helpful technique, private information retrieval, allows a client to query a server without the server learning what the query is. This would allow the NSA to query large databases without revealing their subjects of interest to the database holder, and without collecting the entire database. Recent advances should allow such private searches across multiple, very large databases, a key requirement for the program. The use of these cryptographic techniques would make the need for a separate consortium that holds the data unnecessary. I discussed this in more detail in my testimony before the Senate Select Committee on Intelligence last fall. Seny Kamara of Microsoft Research points out these techniques were first outlined over fifteen years ago, while the state of the art is outlined in “Useable, Secure, Private Search” from IEEE Security and Privacy.
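For readers who want a feel for what private information retrieval means mechanically, here is a minimal sketch of the classic two-server XOR construction from the PIR literature. It is not the scheme described in the testimony or the IEEE article, and it rests on assumptions of mine: two servers hold identical copies of the database, they do not collude, and records are small integers. The function names are hypothetical.

```python
import secrets

def make_queries(n, index):
    """Build two bit masks that differ only at the wanted position.

    Each mask on its own is uniformly random, so neither server can
    tell which record the client is after.
    """
    mask_a = [secrets.randbelow(2) for _ in range(n)]
    mask_b = list(mask_a)
    mask_b[index] ^= 1
    return mask_a, mask_b

def server_answer(database, mask):
    """A server XORs together every record its mask selects."""
    answer = 0
    for record, bit in zip(database, mask):
        if bit:
            answer ^= record
    return answer

def retrieve(database_a, database_b, index):
    """Client side: XOR the two answers; everything cancels except the target."""
    mask_a, mask_b = make_queries(len(database_a), index)
    return server_answer(database_a, mask_a) ^ server_answer(database_b, mask_b)
```

Note what the sketch does and does not buy: the servers learn nothing about the query, but each one still has to touch, on average, half the database for every lookup, and someone still has to hold the records in a usable form.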
But I want to consider something President Obama said in his speech, which both Felten and Edgar point to.
Relying solely on the records of multiple providers, for example, could require companies to alter their procedures in ways that raise new privacy concerns.
I’m admittedly obsessed by this, but one processing step the NSA currently uses on dragnet data seems to pose particularly significant privacy concerns: the data integrity role, in which high volume numbers — pizza joints, voice mail access numbers, and telemarketers, for example — are “defeated” before anyone starts querying the database.
This training module from 2011 (and therefore before some apparent additions to the data integrity role, as I’ll lay out in a future post) describes three general technical roles, the first of which would be partly eliminated if the telecoms kept the data.
- Ensuring production meets the terms of the order and destroying that which exceeds it (5)
- Ensuring the contact-chaining process works as promised to FISC (much of this description is redacted) (7)
- Ensuring that all BR and PR/TT queries are tagged as such, as well as several other redacted tasks (this tagging feature was added after the 2009 problems) (9)
The first and third are described as “rarely coming into contact with human intelligible” metadata (the first function would likely see more intelligible data on intake of completed queries from the telecoms). But — assuming a parallel structure across these three descriptions — the redacted description on page 8 suggests that the middle function — what elsewhere is called the data integrity function — has “direct and continual access and interaction” with human intelligible metadata.
And indeed, the 2009 End-to-End Review and later primary orders describe the data integrity analysts querying the database with non-RAS approved identifiers to determine whether they’re high volume identifiers that should be taken out of the dragnet.
Those analysts are not just accessing data in raw form. They’re making analytic judgments about it, as this description from the E-2-E report explains.
As part of their Court-authorized function of ensuring BR FISA metadata is properly formatted for analysis, Data Integrity Analysts seek to identify numbers in the BR FISA metadata that are not associated with specific users, e.g., “high volume identifiers.” [Entire sentence redacted] NSA determined during the end-to-end review that the Data Integrity Analysts’ practice of populating non-user specific numbers in NSA databases had not been described to the Court.
(TS//SI//NT) For example, NSA maintains a database, [redacted] which is widely used by analysts and designed to hold identifiers, to include the types of non-user specific numbers referenced above, that, based on an analytic judgment, should not be tasked to the SIGINT system. In an effort to help minimize the risk of making incorrect associations between telephony identifiers and targets, the Data Integrity Analysts provided [redacted] included in the BR metadata to [redacted] A small number of [redacted] BR metadata numbers were stored in a file that was accessible by the BR FISA-enabled [redacted], a federated query tool that allowed approximately 200 analysts to obtain as much information as possible about a particular identifier of interest. Both [redacted] and the BR FISA-enabled [redacted] allowed analysts outside of those authorized by the Court to access the non-user specific number lists.
In January 2004, [redacted] engineers developed a “defeat list” process to identify and remove non-user specific numbers that are deemed to be of little analytic value and that strain the system’s capacity and decrease its performance. In building defeat lists, NSA identified non-user specific numbers in data acquired pursuant to the BR FISA Order as well as in data acquired pursuant to EO 12333. Since August 2008, [redacted] had also been sending all identifiers on the defeat list to the [several lines redacted]. [my emphasis]
That analytical judgment part is key: this does appear to be a judgment call about the distortion effect of the number balanced against its possible value. And as I’ve suggested, it is possible such judgment calls could strip the most important data from the database.
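To make the judgment call concrete, here is a rough sketch of what a defeat list could look like in code. Everything in it is my own assumption for illustration: the record layout (caller, callee pairs), the function names, and above all the `max_contacts` threshold, which stands in for the analytic judgment the E-2-E report describes. Nothing public says NSA uses a simple contact count.

```python
from collections import defaultdict

def build_defeat_list(call_records, max_contacts):
    """Flag any number in contact with more than max_contacts distinct numbers."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {number for number, peers in contacts.items()
            if len(peers) > max_contacts}

def apply_defeat_list(call_records, defeat_list):
    """Drop every record that touches a defeated number."""
    return [(a, b) for a, b in call_records
            if a not in defeat_list and b not in defeat_list]

# Hypothetical data: one pizza joint that everyone calls, plus one direct call.
records = [("A", "PIZZA"), ("B", "PIZZA"), ("C", "PIZZA"),
           ("D", "PIZZA"), ("A", "B")]
defeated = build_defeat_list(records, max_contacts=3)
cleaned = apply_defeat_list(records, defeated)
```

Set the threshold too high and the pizza joint links everyone to everyone; set it too low and a genuinely important hub disappears from the database along with every record that touches it. Someone has to look at real numbers to pick the value.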
In addition, whether these tech people or others do the work, some analysts use raw data to test new chaining approaches and automatic queries, which has resulted in raw dragnet data ending up in places it didn’t belong.
It wasn’t until one of the three primary orders after September 3, 2009 (two of those have been withheld) that FISC required these techs to destroy the raw data when they were done with it. That didn’t prevent the retention of over 3,000 files apparently used for this purpose on a server up until 2012.
As of 16 February 2012, NSA determined that approximately 3,032 files containing call detail records potentially collected pursuant to prior BR Orders were retained on a server and been collected more than five years ago in violation of the 5-year retention period established for BR collection. Specifically, these files were retained on a server used by technical personnel working with the Business Records metadata to maintain documentation of provider feed data formats and performed background analysis to document why certain contact chaining rules were created. In addition to the BR work, this server also contains information related to the STELLARWIND program and files which do not appear to be related to either of these programs. NSA bases its determination that these files may be in violation of BR 11-191 because of the type of information contained in the files (i.e., call detail records), the access to the server by technical personnel who worked with the BR metadata, and the listed “creation date” for the files. It is possible that these files contain STELLARWIND data, despite the creation date. The STELLARWIND data could have been copied to this server, and that process could have changed the creation date to a timeframe that appears to indicate that they may contain BR metadata.
To sum up: as of right now, it appears this role still requires both analytic judgment and access to human identifiable data in raw form. Verizon and AT&T presumably have their own automated function to do similar things for their own communities of interest, but that judgment call might be easier to automate than the one a tech analyst hoping to maximize the chances of finding a terrorist might make.
I’ll let the tech folks debate ways to accomplish this without creating the dragnet in the first place. But it does seem to be one likely explanation for the additional privacy challenges the President referenced in his speech.
The obvious solution is to get Congress to mandate that all telemarketers and pizza joint staffers must pass a security background check, just like the IT contractors at Booz Allen do. Can’t be too careful, you know.
Oh, wait a minute . . .
You’re absolutely right to obsess about this, but I think the conclusion you draw is too narrow.
All raw data sets are dirty. There’s crap in there that you have to clean up before they can be useful. And it’s even worse than that. A data set cleaned up for a specific purpose is still going to be dirty if it’s used for an entirely different purpose.
That’s why there’s a quality control function in the dragnet process. You need it to make the data useable. No amount of cryptographic mumbo jumbo is going to make that go away. And, to clean data, you have to see it. You can’t clean blindly. It just doesn’t work that way.
So, if all the data are out there, either in one big database or in some federated database that encompasses all the individual telecom data stores, somebody’s going to have to look at it to make sure that queries aren’t returning garbage.
I guess you can imagine some set of specially cleared, ethically trained people who are accountable outside the normal chain of command to do this. Think of them as the intelligence community version of a Robert Heinlein Fair Witness. Then remember that’s fiction.
@Saul Tannenbaum:
From my experience in dealing with databases, I can say that the larger it is, the less likely that it can be cleaned up.
What Edgar is talking about isn’t anything new. In the 2011 or 2012 ODNI Data Mining Report (mandated by law) they report on doing exactly what he is talking about.
Read what I have below and what Edgar is talking about – probably the same thing – and what you will find is that what they are not talking about is my privacy or your privacy or anyone’s privacy other than the NSA/IC’s privacy. That is, the whole process makes it impossible to know what they are querying on, what data is being returned, nor how their systems are creating new derived data based on what they pull from elsewhere. This is “Trust Us” on steroids.
And that is if it isn’t all smoke and mirrors because it would significantly increase processing time for any one query or operation on already massive data piles. If this was truly ready for prime time then why isn’t it already out there to secure on-line banking, credit and debit transactions? If it were, the Target (and other) breaches would never have happened because there would be nothing useful to take.
From the report …
3. Security and Privacy Assurance (SPAR) Program. The SPAR program is a follow-on to the Automatic Privacy Protection (APP) program discussed in the 2009 and 2010 ODNI Data Mining reports. Neither the SPAR nor APP programs involve data mining, but the research results from both programs may enhance security and protect privacy in data mining activities.
The APP program ended in 2010 after achieving two goals. First, it developed secure distributed private information retrieval (PIR) protocols that permit an entity (Client) to query a cooperating data provider (Server) and retrieve only the records that match the query without the Server learning what query was posed or what results were returned. These protocols are able to add only minimal overheads in computation and communication for simple queries and databases by using a cooperating third party who has access only to encrypted data. Second, APP demonstrated algorithms to determine automatically if complex queries are in compliance with privacy policies. This allows a Client’s auditor with access to the policy and the query history to rapidly verify that only authorized queries have been submitted to the Server.
The SPAR program was launched in 2011 to build on the successes of APP and explore additional applications of PIR to realistic IC scenarios. SPAR includes research projects in three technical areas. The first technical area protects security and privacy for database access. Unlike the simple queries and static databases of APP, SPAR will investigate protocols that handle multiple types of complex queries and databases whose records are frequently created, deleted, or updated. In addition, the protocols must integrate policy compliance checking with the security and privacy assurances so that the Server can verify that a query is compliant with a policy even though the query is never learned. The second technical area will build on advances in fully homomorphic encryption (FHE) schemes to implement PIR without relying on any third parties. FHE is the result of thirty years of cryptographic research, but current schemes are impractical due to high costs in time and memory. SPAR will attempt to explore gains in performance by modified FHE schemes that support only the computations necessary for information retrieval. The third technical area will investigate applications of PIR to the specialized information sharing architectures of publish/subscribe, email/message queues, and outsourced data storage systems.
If successful, the SPAR program will benefit the IC community by securing and protecting the privacy interests of both the custodians and the consumers of data. The technology may enhance cooperative information sharing within the IC, and among government and the private sector, by expanding policy options for satisfying security and privacy concerns when information is shared.
@Mindrayge: This is an open research problem, which, as you point out, makes it far from ready to deploy as a solution to anything. The research makes for interesting reading if you’re so inclined. There have been fantastic advances in these techniques, but that means that they’ve gone from “so computationally expensive that the thought of using it at scale is laughable” to “now we can add two numbers efficiently enough that we can argue that someday this may be practical.”
@Saul Tannenbaum: Thanks. I agree with all you say, but am not sure how what I said is more narrow and want to make sure I’m not missing something? Just because I didn’t generalize this more?
I’m actually fairly shocked more data people aren’t pointing this out. I’ve been harping on it since July and when I’ve mentioned it in the last few days people are just now thinking about this, even in spite of describing Alexander’s problems chaining through pizza joints in an earlier incarnation of this.
Further, I think it poses a particular problem. I think some of our frenemies already figured out you can blow our surveillance system with spam. It’s not going to take much for people to figure out (if they haven’t already) they can get perfect cover to plot whatever by ALSO operating a call center. So it seems impossible to find the appropriate sweet spot (which is why I harp and harp and harp on Gerry’s Italian Kitchen and NSA’s adept removal of itself for review in its role in the Marathon attack).
@emptywheel: My reaction that your conclusion was too narrow comes from two things. First, your headline about outsourcing and second, framing this as a response to Obama’s remark about “relying on records of multiple providers”.
I think that it’s clear that you can’t have a dragnet and privacy, period. No amount of cryptographic special sauce is going to relieve the need to have someone look at the data for quality control purposes. This isn’t a function of any particular way of operating the dragnet (like keeping the data in the hands of providers); it’s a function of having a dragnet. And I think that that needs to be a response to the Feltens and the Edgars who mistake this for a problem amenable to crypto.
I also think this is a broader problem than the call center/Gerry’s Italian Kitchen problem. That’s a traffic analysis problem, something that cryptanalysts have been dealing with for decades. Even if you could strip away the call center noise without losing the terrorist signal, you’d still have data clean up problems.
All this is colored, I’m sure, by my personal point of view which is that any attempt to “fix” the dragnet is misguided. It just has to stop.
@Saul Tannenbaum: You speak the truth. It is not possible to have a dragnet and privacy. Everything I am about to say after this I am sure you already know and even though I doubt many will read it due to the length I am going to just point out some of the things that show that this isn’t a narrow issue.
As you say, the problems are not fixable using cryptographic approaches, because all that does is obfuscate the data itself; it does nothing to change the fact that a dragnet is occurring.
We know that there has to be a homogenization process at the ingest point. It is very likely that each of the providers uses various status codes for the completion status of a phone call, for example, that differ from provider to provider but share the same meaning. Actually, some of the status codes, sticking with the example, would be rolled up because they are irrelevant for the NSA’s purposes (for example, one status might be used to track a different billing rate).
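A toy version of that homogenization step might look like the following. The provider names, raw codes, and canonical labels are all invented for illustration; the real feed formats are not public. The point is only that someone has to build and maintain the mapping table, and that requires looking at raw records.

```python
# Hypothetical per-provider status codes mapped to one canonical vocabulary.
# provider_a uses "00" and "01" for completed calls at different billing
# rates; that distinction is rolled up because it is irrelevant downstream.
CANONICAL_STATUS = {
    "provider_a": {"00": "completed", "01": "completed",
                   "17": "no_answer", "21": "busy"},
    "provider_b": {"OK": "completed", "NA": "no_answer", "BZ": "busy"},
}

def normalize_status(provider, raw_code):
    """Translate a provider-specific code into the canonical label.

    Unmapped codes come back as "unknown" so that a person can review
    them and decide whether the table needs a new entry.
    """
    try:
        return CANONICAL_STATUS[provider][raw_code]
    except KeyError:
        return "unknown"
```

Every "unknown" that comes out of a function like this is a record a human has to inspect before the feed can be trusted.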
We know there is such a system as TAPERLAY which homogenizes telephone numbers globally. Which makes the whole explanation of gathering all the 202 area code calls because of a “typo” comical. And this is just transformation.
There is derivation, such as building the contact chains. Which incidentally, is the only thing the FISA Court allows them to query on – already chained numbers. I have seen writing here and elsewhere that would leave one with the impression that contact chaining doesn’t occur until a query happens or that there is something like a contact chaining query when there is no such thing. All possible contact chains are created from all the data each and every time they get data. There are, no doubt, other derived data that arises from combinations of values in records, which is basically asserting a fact, in knowledge engineering terms.
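Whether the chains are built at ingest, as I believe, or at query time, the underlying operation is a hop-limited breadth-first walk over the call graph. Here is a minimal sketch under my own assumptions: records are (caller, callee) pairs, direction is ignored, and the function names and data are hypothetical.

```python
from collections import defaultdict, deque

def build_graph(call_records):
    """Turn (caller, callee) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in call_records:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def contact_chain(graph, seed, max_hops):
    """Return every number within max_hops of seed, with its hop count."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        number = queue.popleft()
        if hops[number] == max_hops:
            continue  # do not expand beyond the hop limit
        for peer in graph.get(number, ()):
            if peer not in hops:
                hops[peer] = hops[number] + 1
                queue.append(peer)
    return hops

# Hypothetical chain: SEED called A, A called B, B called C.
graph = build_graph([("SEED", "A"), ("A", "B"), ("B", "C")])
within_two = contact_chain(graph, "SEED", max_hops=2)
```

Doing this for every number each time new data arrives is the same walk repeated with every number as the seed, which is why one high volume number in the graph inflates everyone's chain.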
Then there are processes that add fields based on derived information that may indicate types of further processing or handling or even access restrictions for any given record. Process flow and workflow related adornments and their statuses, etc. There is also likely to be grouping, summarizing, and filtering of some information to not only reduce noise but also to create higher order records that are less numerous than the raw detail records that would still provide for the searching requirements they have and reserving drilling down to or retrieving detail records only as necessary.
There are also operational security issues within the NSA and its contractors as well. It is possible that phone numbers, for example, are not only homogenized but transformed via algorithm to some obfuscated form so that even the analysts don’t actually know the real phone number being queried upon. This prevents a mole from passing on what phone numbers are being hunted down in the case of a spy ring, for example. There are no doubt hundreds of compartments at the NSA where none of them know who the targets are of the other groups and there are no doubt “generic” analyst functions that pull routine information and create analysis that feeds these compartments without the “generic” analyst knowing what they are contributing to. Also, there are very likely to be highly restricted hunts where any related records to such targets are pulled or filtered from the process so that no one but the cleared compartment can ever see them.
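I am speculating about whether this is done at all, but one common way to obfuscate identifiers like this is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the choice of algorithm, the function name, and the sample key and numbers are all my assumptions, not anything documented about NSA systems.

```python
import hashlib
import hmac

def pseudonymize(phone_number, key):
    """Replace a phone number with a keyed, deterministic token.

    The same number under the same key always yields the same token,
    so chaining and matching still work, but an analyst without the
    key cannot read the number off the token.
    """
    digest = hmac.new(key, phone_number.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical key and number, for illustration only.
compartment_key = b"example-key-not-a-real-secret"
token = pseudonymize("+12025550100", compartment_key)
```

The catch is that there are only so many possible phone numbers. Anyone who holds the key can hash all of them and reverse every token, so this protects against the analyst at the terminal, not against whoever runs the system.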
Several hundred, if not low thousands of people, are necessary to perform such tasks and make sure that these processes work and that requires access to raw information. There is no way of avoiding it.
Plus there is one more aspect related to the whole idea of there being some third-party holding the data. It likely means that the ownership of such data would be virtual rather than physical. In that whatever entity it was would have to be given or lease NSA facilities because transport of that data has to occur across secure lines rather than the open internet which means the whole clown car of NSA protocols and TRANSSEC, LINKSEC, and INFOSEC encryption while in transit. Much of the FHE work I have seen out there was based on using RSA EC cryptology which we now know to be compromised so that wouldn’t be secure across the wires for NSA operational security purposes.
Like you say, the dragnet has got to go. We have a huge problem with it simply due to the Olsen Memorandum way back in the early 1980s that claims any data collected lawfully under a court order, such as information about every single phone call ever made in the country, can be searched by the government without 4th Amendment implications in other contexts because the government now owns or possesses that data. There are already Supreme Court decisions that support such an opinion. All of this having been laid out in the Mukasey Memorandum.
The problem is that the government has the data in the first place. What they can do with it after that is virtually unlimited unless we get a Court that actually affirms the 4th Amendment in the people’s favor.