Remember that Wikipedia wormhole you fell down the other night? Maybe it started with you looking at a list of every single episode of Glee and somehow ended with reading the biographical entry on John Wayne Gacy. You might not remember, and you might not want anyone else to know about it. But your history might still be on Wikipedia’s servers.
The free online encyclopedia is tracking the viewing patterns of some, but not all, of its users. And although the records are to be kept no longer than 90 days, it may retain an altered version of that data indefinitely.
Earlier this year, the Wikimedia Foundation (WMF), the non-profit organization that backs the user-edited online encyclopedia, issued new data retention guidelines to let users know what kind of information it is tracking. Like many Web entities, the WMF has turned to transparency in order to respond to the growing public backlash against big data.
“The Foundation’s overall aim is to retain the minimum amount of information necessary in order to support the needs of the Foundation and the wider Wikimedia movement,” said Jay Walsh, a WMF spokesman.
In supporting “the needs of the Foundation,” Wikimedia automatically collects some personal information from visitors—even if they aren’t logged into a Wikipedia account. Such personal information includes visitors’ IP addresses and other data that “could be used to personally identify you.”
However, the policy states that all personal information will be kept for a maximum of 90 days, before it is either “deleted, aggregated, or anonymized.” In aggregating data, user information is combined with other data to illustrate broader trends, while anonymizing data removes the parts of the information that can identify particular users. In both cases, this allows WMF to keep data for more than 90 days. But the foundation admits that neither of these two processes can “completely eliminate the risk of re-identification.”
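To make the difference between the two processes concrete, here is a minimal Python sketch. It is an illustration only, not the WMF's actual pipeline: the log format, the sample records, and the function names `aggregate` and `anonymize` are all assumptions, and masking the last part of the address is just one simple way to anonymize.

```python
from collections import Counter

# Hypothetical raw log records: (IP address, page title).
# The addresses come from ranges reserved for documentation.
raw_log = [
    ("203.0.113.42", "Glee"),
    ("203.0.113.42", "John Wayne Gacy"),
    ("198.51.100.7", "Glee"),
]

def aggregate(records):
    """Aggregation: count views per page and discard who viewed it."""
    return Counter(page for _ip, page in records)

def anonymize(records):
    """Anonymization (assumed scheme): zero the last part of each IPv4 address."""
    return [(ip.rsplit(".", 1)[0] + ".0", page) for ip, page in records]

page_views = aggregate(raw_log)
masked_log = anonymize(raw_log)
```

In this sketch the aggregated counts carry no per-visitor rows at all, while the masked log still keeps one row per page view. Two rows from the same visitor share the same masked address, so a reading pattern stays linked, which is one reason anonymized data can still carry some risk of re-identification.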
And this is the part that concerns some users, like Wikipedian Wnt.
“[A]ccording to the policy, not only do they retain it 90 days, but they then can retain it indefinitely by ‘anonymizing’ the IP addresses by ‘encrypting’ the ‘most specific’ part of the IP address, a process which they admit may not actually protect identity,” Wnt wrote during a recent exchange with Wikipedia cofounder Jimmy Wales on his talk page.
Wnt is concerned that the anonymizing/aggregating policy leaves a loophole for Wikipedia to permanently retain data that is vulnerable to hackers or possible subpoena by law enforcement. He argues that even after data has been anonymized, it would be possible for technically well-equipped individuals to reconstruct IP addresses and identify users.
“With these records acknowledged and their existence legitimized, there is no reason why they can’t start filing papers, cracking codes, and lining up access dates for whatever reasons they may have.”
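The reconstruction worry can be sketched in a few lines of Python. This is a hypothetical scheme, not the method the WMF uses: the policy, as quoted, says only that the "most specific" part of the address is "encrypted." The sketch assumes that part is the last number of an IPv4 address and that it is replaced with an unkeyed SHA-256 hash. The function names are invented for illustration.

```python
import hashlib
from typing import Optional

def anonymize_ip(ip: str) -> str:
    """Assumed scheme: keep the first three numbers, hash the last one."""
    prefix, last = ip.rsplit(".", 1)
    digest = hashlib.sha256(last.encode()).hexdigest()
    return f"{prefix}.{digest}"

def recover_ip(anonymized: str) -> Optional[str]:
    """Undo the scheme above by hashing all 256 possible last numbers."""
    prefix, digest = anonymized.rsplit(".", 1)
    for candidate in range(256):
        if hashlib.sha256(str(candidate).encode()).hexdigest() == digest:
            return f"{prefix}.{candidate}"
    return None

masked = anonymize_ip("203.0.113.42")
recovered = recover_ip(masked)
```

Because the hidden part can take only 256 values, trying every one of them is trivial under this assumed scheme. A design that mixes in a secret key would resist this particular attack for as long as the key stays secret.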
In responding to Wnt’s concerns, Jimmy Wales noted that only a fraction of user actions, about 1 out of every 1,000, are randomly selected for tracking. He also pointed out that 90 days is merely a maximum, and that most personal data is processed in a couple of days.
The Wikimedia Foundation would not reveal to the Daily Dot, however, what percentage of this data is actually deleted and what percentage is anonymized or aggregated and kept past 90 days.
Wikimedia was also tight-lipped about how often it is compelled to turn over information to law enforcement, with Walsh saying that “the Foundation complies with legitimate and lawful requests from enforcement agencies when it is necessary to do so.”
In April, Wikimedia published guidelines pertaining to requests for user information. In them, WMF states that information requests are “relatively rare” and that each one is handled with discretion. The Foundation also has a policy of notifying users of a request for their information before it’s given out, but it sometimes forgoes that notification at the behest of law enforcement.
When it comes to the question of whether Wikimedia should be tracking user information at all, Walsh defended the practice, saying it was important for the overall growth of the site.
“The data the Foundation collects is critical in the development of new products that help them reach a wider international audience of readers, and to make it easier and more rewarding for people to begin contributing to the projects,” he said.
Photo via Cary Bass/Flickr (CC BY-SA 2.0)