AOL Publishes, Withdraws User Search Data
David Utter Staff Writer
2006-08-07
Search data from hundreds of thousands of users was released on an AOL Research site and reached the wider Internet before AOL promptly withdrew it.
Information on searches performed by 650,000 users was published on the AOL Research site. Although the project is called "500k User Queries Sampled Over 3 Months," a text file included with the data indicated that the dataset contains 20 million queries from 650,000 users. The original page has been taken down, but Adam D'Angelo of Caltech posted a link to the Google cache of the page.
D'Angelo wrote that "if you happened to be randomly chosen as one of these users, everything you searched for from March to May (2006) is now public information on the Internet." His comment that the disclosure of the 20 million queries by those 650,000 users was "a blatant violation of users' privacy" was echoed by TechCrunch blogger Michael Arrington, who blasted AOL:
The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.
D'Angelo listed a small sample of the kinds of keyword searches users performed, and commented that he was leaving out personally identifying and illegal information found in the dataset.
"I'm leaving out the worst of it - searches for names of specific people, addresses, telephone numbers, illegal drugs, and more. There is no question that law enforcement, employers, or friends could figure out who some of these people are," D'Angelo wrote.
The data takes up 439 MB in compressed form. AOL included a text file that described the contents of the dataset:
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
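To illustrate why a random ID does not prevent one user's searches from being read together, here is a minimal Python sketch that parses records with the five fields named above and groups the queries by AnonID. The description quoted here does not state the file's delimiter or timestamp format, so the tab-separated layout, the header row, and every sample row below are assumptions invented for illustration. None of the rows come from the actual dataset.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One log row, using the five fields named in AOL's description."""
    anon_id: str
    query: str
    query_time: str          # kept as text; the real timestamp format is not stated
    item_rank: Optional[int]  # empty when the user did not click a result
    click_url: Optional[str]


def parse_records(stream):
    """Yield QueryRecord objects from a tab-separated stream (assumed layout)."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or short lines and the assumed header row.
        if len(row) < 3 or row[0] == "AnonID":
            continue
        row = row + [""] * (5 - len(row))  # pad missing click fields
        anon_id, query, query_time, item_rank, click_url = row[:5]
        yield QueryRecord(
            anon_id,
            query,
            query_time,
            int(item_rank) if item_rank else None,
            click_url or None,
        )


def group_by_user(records):
    """Collect each anonymous ID's queries in the order they appear."""
    histories = defaultdict(list)
    for record in records:
        histories[record.anon_id].append(record.query)
    return dict(histories)


# Invented sample rows, for illustration only.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1001\tweather forecast\t2006-03-01 10:00:00\t\t\n"
    "1001\tlocal pizza delivery\t2006-03-02 18:30:00\t1\thttp://example.com\n"
    "1002\tused cars\t2006-03-05 09:15:00\t\t\n"
)

histories = group_by_user(parse_records(io.StringIO(SAMPLE)))
print(histories)
```

Because the file is sorted by anonymous user ID, a single pass like this is enough to reconstruct each user's full search history, which is the concern Arrington and D'Angelo raise.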
Several mirror sites now host the data file, and many other people will likely download it as well. The collection looks like it could be the kind of information the Department of Justice requested last year as part of an inquiry that became public only after the DOJ sued Google for refusing to comply with a subpoena. Other major sites, including AOL, Yahoo, and Microsoft, complied with the request.
---
About the Author:
David Utter is a business and technology writer for SecurityProNews and WebProNews.