Two months ago, former U.S. National Security Agency contractor Edward Snowden leaked details about several of the agency’s controversial domestic and international surveillance programs. The disclosures implicated the U.S. government in a spying operation of almost unimaginable breadth, one that captures citizens’ emails, chats, phone data, and Internet browsing histories. In many cases, however, the documents give only vague insight into the actual amount of raw data being collected by the agency. 

In the absence of such information, analysts have turned to the NSA’s soon-to-be-completed data center in the Utah desert for clues about the agency’s surveillance intentions. They crunch the scant few details publicly known about the classified Utah Data Center—particularly its 100,000 square feet of allocated server space—in the hope of answering what has become the essential question for citizens, journalists, and lawmakers on both sides of the Internet privacy debate: Is the NSA actually collecting everything?

Since the data center’s groundbreaking in 2011, analysts have estimated that it will have a storage capacity of anywhere from exabytes to yottabytes. That’s a huge range: An exabyte is a billion gigabytes; a yottabyte is a million of those exabytes. The discrepancy stems from the simple fact that so little is known about the agency or its technological capabilities to process the intelligence gathered by hundreds of classified programs around the world.
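
To make that range concrete, the short Python sketch below works through the unit arithmetic using decimal (SI) prefixes. The constant and function names are illustrative only and do not come from any of the reports discussed here.

```python
# Decimal (SI) byte units. All names here are illustrative.
GIGABYTE = 10**9
EXABYTE = 10**18
YOTTABYTE = 10**24


def units_per(larger, smaller):
    """Return how many of the smaller unit fit into one of the larger unit."""
    return larger // smaller


gigabytes_per_exabyte = units_per(EXABYTE, GIGABYTE)
exabytes_per_yottabyte = units_per(YOTTABYTE, EXABYTE)

print(f"{gigabytes_per_exabyte:,} gigabytes in an exabyte")
print(f"{exabytes_per_yottabyte:,} exabytes in a yottabyte")
```

In other words, the high end of the published estimates is a million times larger than the low end.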

The program revealed by Snowden in early June, known as PRISM, collects information from nine major American Internet companies: Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, and Apple. According to one top secret NSA slide about the program, it collects emails, chats, videos, logins, file transfers, stored data, and “special requests” for further information made by the agency through warrants from the secret Foreign Intelligence Surveillance Court.

Beginning in September 2013, when the Utah Data Center is scheduled to be completed, this information, along with that harvested directly from fiber-optic and transcontinental submarine cables through a program known as FAIRVIEW, will all likely be routed to the Utah Data Center. Based on interviews the Daily Dot has conducted with two NSA whistleblowers over the past two months, these programs appear to be as comprehensive as technologically possible. According to Mark Klein, the former AT&T computer network associate who blew the whistle on the NSA when it installed splitters directly into the company’s Internet cables, “What I know of the splitters is that they get everything.”

“They have established the technological infrastructure to spy on everyone,” Klein said.

Thomas Drake, a former NSA senior executive who revealed the agency’s invasive surveillance and reckless spending in 2006, believes the data center is almost certainly not focused on metadata. “I can put the metadata of the world in one room,” he said.

Of course, the technological access to information and the ability to store it are two different things. The most recent estimate of the storage capacity of the NSA was published in Forbes in July: “Based on blueprints of the facility obtained by Forbes—and published here for the first time—experts estimate that the storage capacity of the data center is lower than has previously been reported given the technology currently available and the square footage that the center has allocated for its servers.”

The article estimated that the center could hold about 12 exabytes, which “would only allow the NSA to create Beyoncé-style archives for the 13 million people living in the Los Angeles metro area.”
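
The arithmetic behind that comparison can be roughed out as follows. This sketch simply divides the two figures quoted above, 12 exabytes and 13 million people, using decimal units. What exactly counts as a “Beyoncé-style archive” is not specified here, so the result is only an even per-person share, and the function name is made up for illustration.

```python
EXABYTE = 10**18
GIGABYTE = 10**9

capacity_bytes = 12 * EXABYTE  # storage estimate quoted above
population = 13_000_000        # Los Angeles metro figure quoted above


def per_person_gigabytes(capacity, people):
    """Split a storage capacity evenly and report each share in gigabytes."""
    return capacity / people / GIGABYTE


share = per_person_gigabytes(capacity_bytes, population)
print(f"{share:.0f} GB per person")
```

Under those assumptions, each person’s share comes to a little under a terabyte.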

Bruce Schneier, a fellow at the Berkman Center for Internet and Society at Harvard Law School and a board member of the Electronic Frontier Foundation, cautioned against putting too much stock in storage estimates based on the center’s square footage. “It’s trying to get a lot of information from very little,” he told the Daily Dot.

The journalist James Bamford, who has covered the NSA for decades and last year authored a massive profile of the data center for Wired, was similarly skeptical of the Forbes estimate: “It's very hard to determine how much data can be held in the data center,” he told the Daily Dot. “So everyone's invited to give their best guess.”

In the Forbes article, Bamford is cited as claiming the center could hold data on the order of yottabytes. “I never said the data center would hold a yottabyte,” he explained. Indeed, Bamford never directly estimated the center’s storage capacity. In fact, he was already aware of the data center’s physical size years before the blueprints were made public. In November of 2009, the amount of space allocated to servers in the data center was published by the Army Corps of Engineers, in a presentation whose purpose was to “Generate Interest in the Utah Data Center.” 

Yottabyte-sized estimates appear to stem not from speculative calculations but from a 2007 Department of Defense (DoD) report titled “Global Information Grid Architecture Vision.” In the report, the DoD wrote that it is becoming a “net-centric force.” In a section listing “key targets” for DoD technologies, the report included “very large scale data storage, delivery, and transmission technologies that support ... capacities exceeding exabytes (10^18 bytes) and possibly yottabytes (10^24 bytes) of data.”

Now six years old, a lifetime on a technological time scale, the report stands more as an artifact of the DoD’s ambitions than as a statement of its capabilities. But if the DoD’s goal is to store yottabytes of data, the NSA would likely be in charge of fulfilling it. According to the NSA’s website, the agency “provides products and services to the Department of Defense.”

So what explains the discrepancy between estimates of the data center’s capabilities made by Forbes and the target goal outlined in the DoD report (published years before ground was broken on the center)?

“One factor that must be considered, but is not,” Bamford explained, “is that the NSA spends a great deal of money on R&D, all of it very secret, and given that their job is to intercept and store vast amounts of data, it seems logical that a good portion of that R&D will be devoted to new and innovative ways to compress data. If so, then using commercial, unclassified storage estimates would be invalid.”

In essence, compression is the process of removing nonessential information from a file to reduce its size. One useful analogy is the security cameras in convenience stores. These cameras often run 24 hours a day, which, from a storage perspective, would quickly add up to an unmanageable amount of footage if they recorded at the standard digital video rate of about 30 frames per second. To get around that problem, the cameras record at a much lower frame rate, say 10 fps. This makes for choppy footage, but that hardly matters when the purpose of the film is just to capture a suspect’s face. In that sense, the essential information (the faces of those who enter the store) is not lost by “compressing” the film with a lower frame rate. Similarly, the uncompressed audio on commercially bought CDs, typically copied to a computer as WAV files, is compressed into much smaller MP3s by discarding the parts of the sound that listeners are least likely to notice.
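
The security camera analogy can be put in numbers with a few lines of Python. This is a simplified sketch: it assumes every frame takes up the same amount of space and ignores any other compression a real camera might apply. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def frames_per_day(fps):
    """Count the frames recorded in 24 hours at a given frame rate."""
    return fps * SECONDS_PER_DAY


full_rate = frames_per_day(30)     # standard rate mentioned above
reduced_rate = frames_per_day(10)  # lower rate mentioned above

print(f"{full_rate:,} frames per day at 30 fps")
print(f"{reduced_rate:,} frames per day at 10 fps")
print(f"Storage needed falls to {reduced_rate / full_rate:.0%} of the original")
```

Dropping from 30 to 10 frames per second cuts the footage to a third of its size while the faces remain on film.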

Standard MP3 compression can shrink a WAV file to less than a tenth of its original size. But such assumptions are difficult to make in the case of the NSA. As Bamford pointed out, we have no sense of how far ahead of commercially available technologies the NSA’s compression capabilities really are.
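
As a rough check on that MP3 figure, the sketch below compares the bit rate of CD-quality audio (44,100 samples per second, 16 bits per sample, two channels) with that of a 128-kilobit-per-second MP3. The 128 kbps rate is an assumption, chosen because it is a common encoding setting; it says nothing about what the NSA actually uses.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Bits per second of uncompressed (PCM) audio."""
    return sample_rate_hz * bits_per_sample * channels


cd_bps = pcm_bitrate(44_100, 16, 2)  # CD-quality stereo audio
mp3_bps = 128_000                    # assumed MP3 bit rate

ratio = cd_bps / mp3_bps
print(f"Uncompressed: {cd_bps:,} bits per second")
print(f"MP3 (assumed): {mp3_bps:,} bits per second")
print(f"Compression ratio: about {ratio:.1f} to 1")
```

Higher MP3 bit rates would give a smaller ratio, and lower ones a larger ratio, so the figure is a ballpark rather than a fixed property of the format.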

There are other ways, too, that the essential information of these massive audio and video files can be stored. In many cases, the relevant aspect of an audio or video file is the content of the conversation, not the voice or video recording. In those instances, audio and video files can be stored as simple text transcripts. 
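
A back-of-the-envelope comparison shows how much space a transcript could save. Every number in this sketch is an assumption made for illustration: telephone-quality audio at 8,000 bytes per second, speech at 150 words per minute, and about six bytes of plain text per word. None of these figures come from the NSA or from the sources quoted in this article.

```python
def audio_bytes(seconds, bytes_per_second=8_000):
    """Size of a recording, assuming telephone-quality audio."""
    return seconds * bytes_per_second


def transcript_bytes(seconds, words_per_minute=150, bytes_per_word=6):
    """Size of a plain-text transcript, assuming an average speaking pace."""
    return seconds / 60 * words_per_minute * bytes_per_word


one_hour = 60 * 60
audio = audio_bytes(one_hour)
text = transcript_bytes(one_hour)

print(f"One hour of audio: {audio:,} bytes")
print(f"One hour of transcript: {text:,.0f} bytes")
print(f"The audio is roughly {audio / text:.0f} times larger")
```

Under these assumptions, an hour-long call that takes up tens of megabytes as audio shrinks to tens of kilobytes as text.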

And indeed, there is evidence to suggest the NSA has concerned itself with speech-to-text translation since the late ’90s. One patent filed by the NSA in 1997 describes a method for “automatically generating topical descriptions of text.” In a list of applications for the technology, the NSA lists “machine transcribed speech,” implying the agency had the ability to automatically transcribe audio.

Schneier said he suspects the agency does utilize machine transcription.

Advanced file compression and speech-to-text translation technologies would certainly allow the agency to reduce the actual amount of data being stored. But they are still mere speculation. And for Schneier, speculating on the Utah Data Center’s capacity misses the point. To the suggestion that the data center’s floor plan implies the NSA could not actually store every bit of information it has access to, Schneier said, “I think that makes no sense.

“It’s their job to spy on the world.” 

Photo by Jason Reed