This is a short series not so much about Amazon, Facebook and Google themselves, but rather about the America that has fallen into those companies’ lengthening shadows.
For an introduction to this series and “PART I – Amazon is a country, and we are all its citizens”, click here.
PART II
The new/not new Facebook “leak”
17 April 2021 – During this “wonderful week that was”, reports surfaced that the personal data of more than 533 million Facebook users had been made publicly available on a hacker forum. Facebook responded that the data had been stolen and made public in what reports called a “data breach” dating to early 2018. In fact, the company said, no “breach” had occurred: it was a theft. Facebook believes it has no case to answer.
Lately we keep seeing the same kind of headline pop up about big social networks. “Facebook Had Years to Fix the Flaw That Leaked 500M Users’ Data,” Wired reported. “Clubhouse data leak: 1.3 million scraped user records leaked online for free,” CyberNews told us last week. And “500M LinkedIn accounts leaked,” Security blared.
All of a sudden, it seems, the platforms have more leaks than Trump’s White House did. What’s going on? For starters, these publications are using the wrong verb. The data didn’t “leak,” at least not in the typical journalistic sense of a source telling you something you’re not supposed to know. Rather, the data was scraped – actively gathered in bulk from the websites where, importantly, it had been publicly posted.
Scraping is one of the more interesting quandaries in platform policies, because it comes with both obvious harms and significant benefits. Scraping is what enabled the malign facial recognition software dystopia known as Clearview AI to gather more than 3 billion images of people and sell them to law enforcement agencies. And scraping is also what enabled the NYU Ad Observatory to collect, with users’ permission, noteworthy evidence about political advertisements on Facebook for academic research.
Scraping is possible because the World Wide Web is made of text, and text can be copied and pasted. If you are reading this post on your desktop, you could write code to scrape this entire article and post it as a series of tweets. If you’re one step more technologically sophisticated, you could write a script to scrape the entire archive of any blog and publish it as an e-book.
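To make that concrete, here is a minimal sketch of such a script in Python. The URL and the `<article>` element are hypothetical placeholders – any real blog would need its own selectors (to say nothing of its terms of service):

```python
# Minimal scraping sketch: fetch a (hypothetical) blog post and chop its text
# into tweet-sized chunks. Uses the third-party requests and BeautifulSoup
# libraries; the URL and the <article> selector are illustrative only.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example-blog.com/some-post", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
article = soup.find("article")  # most blog themes wrap the post body in <article>

if article is not None:
    text = article.get_text(separator="\n", strip=True)
    # Split the text into 280-character chunks – the "series of tweets".
    tweets = [text[i:i + 280] for i in range(0, len(text), 280)]
    for chunk in tweets:
        print(chunk)
        print("---")
```

Point a loop at an archive page instead of a single post and you have the e-book scraper: the web hands over its text to anyone who asks politely, and to plenty who don’t.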
NOTE: my pen test company would often do this for fellow bloggers to show them the danger and to help them install software to prevent scraping. And I must emphasise: we only wear a white hat, not a black hat. For instance, if you were at Legalweek 2019 we showed a few select eD contacts how easy it was to hack into the Hilton wi-fi and access your mobile. Rule #1 when traveling: never, evah use the hotel wi-fi.
Technology companies know all this scraping stuff, of course, and so (most) have long since implemented tools designed to prevent scripts and other tools from collecting data in bulk. The companies may limit the rate at which you can load new pages, for example, making any attempted scraping infeasible. They may track the IP address from which you are sending all of your requests. They may employ a CAPTCHA, and attempt to make you prove your humanity before showing you any more pages.
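As a rough illustration of the first of those defences, here is a toy per-IP rate limiter of the kind a platform might run server-side. It is a sketch under simplified assumptions (in-memory state, a single process), not any company’s actual implementation:

```python
# Toy per-IP rate limiter illustrating the "limit the rate at which you can
# load new pages" defence. Sketch only: real platforms use distributed
# counters, device fingerprinting and more.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look at the last minute of traffic
MAX_REQUESTS = 30     # allow at most 30 page loads per IP per window

_history_by_ip: dict = defaultdict(deque)

def allow_request(ip_address: str) -> bool:
    """Return True if this IP may load another page, False if it is throttled."""
    now = time.monotonic()
    history = _history_by_ip[ip_address]

    # Drop timestamps that have fallen outside the sliding window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()

    if len(history) >= MAX_REQUESTS:
        return False  # over the limit: serve a CAPTCHA or a 429 instead

    history.append(now)
    return True
```

When allow_request() returns False, the server falls back to the other defences mentioned above: challenge the visitor with a CAPTCHA, or block the offending IP address outright.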
But this is a cat-and-mouse game, and over the years the mice who scrape have proven to be more inventive than the cats attempting to swat them away. A few years back, a San Francisco “talent management algorithm” company called hiQ Labs scraped public LinkedIn data to build its products. LinkedIn sent the company a cease-and-desist order, alleging that hiQ’s scraping was unauthorized.
The case made it to the U.S. Court of Appeals for the Ninth Circuit, which in 2019 ruled in favor of hiQ. Judge Marsha Berzon found that hiQ faced irreparable harm if it weren’t allowed to go on scraping, Reuters reported at the time:
She also said giving companies such as LinkedIn “free rein” over who can use public user data risked creating “information monopolies” that harm the public interest.
“LinkedIn has no protected property interest in the data contributed by its users, as the users retain ownership over their profiles,” Berzon wrote. “And as to the publicly available profiles, the users quite evidently intend them to be accessed by others,” including prospective employers.
In practice, the ruling means that companies cannot rely on the courts to prevent scrapers like hiQ and Clearview from doing their dirty work. Instead, platforms have to build technology to thwart them. The stories about platforms in the past week are largely about those technologies failing.
In Clubhouse’s case, there appears not to have been much anti-scraping technology in place at all. The company shrugged off concerns, taking time to trash the press for even mentioning them. Here’s Kim Lyons at The Verge:
Cyber News reported that a SQL database with users’ IDs, names, usernames, Twitter and Instagram handles, and follower counts was posted to an online hacker forum. According to Cyber News, it did not appear that sensitive user information such as credit card numbers was among the leaked info.
Clubhouse did not immediately reply to a request for more information from The Verge on Sunday. But CEO Paul Davison said in response to a question during a town hall that the platform had not suffered a data breach: “No, this is misleading and false, it is a clickbait article, we were not hacked. The data referred to was all public profile information from our app. So the answer to that is a definitive ‘no.’”
The scraping may appear innocuous, but as this Twitter thread notes, it enables hackers to connect private Twitter accounts to public Instagram accounts, which could enable at least some harms.
The LinkedIn data, on the other hand, was aggregated from a number of sources, none of which the poster specified. We know that LinkedIn uses anti-scraping tools, though, because here is a comprehensive guide to getting around those tools. In any case, the data here was all publicly posted, and while LinkedIn was clearly unhappy and went to court to prevent this kind of thing from happening again, it lost.
Of the three companies here, Facebook seems to have taken the most precautions to prevent scraping – but still lost control of data for hundreds of millions of people. At Wired, Lily Hay Newman explains the hackers’ methods:
Attackers were able to “scrape” Facebook by enumerating batches of possible phone numbers from more than 100 countries, submitting them to the contact import tool, and manipulating it to return the names, Facebook IDs, and other data users had posted on their profiles. The lapse spoke to the potential for the contact import tool to access sensitive data and the need to look carefully for bugs and inadvertent behavior in the feature.
This looks and feels the most like a true hack of anything presented here, yet even so the hackers only managed to grab data that was publicly posted to user profiles. (The enumeration pattern Wired describes is sketched below.) Like almost any data, it could still conceivably be used for harm. But whatever harms arise are unlikely to be any greater than the seemingly minimal effects of the scraping of “most” of the company’s then-2 billion public profiles disclosed during the Cambridge Analytica scandal.
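In outline, what Wired describes is brute-force enumeration: generate plausible phone numbers, submit them to the contact import tool in batches, and keep whatever profile data comes back. Here is a deliberately abstract sketch of that pattern – the upload_contacts function and its response format are invented stand-ins, not Facebook’s actual (and since locked-down) endpoint:

```python
# Abstract sketch of the enumeration pattern described above. upload_contacts
# is an invented stand-in for a contact-import endpoint: given a batch of
# numbers, it returns a dict mapping each number that matched an account to
# that account's public profile data (name, ID, and so on).
from itertools import islice

def candidate_numbers(country_code: str, start: int, count: int):
    """Yield sequential candidate phone numbers for one country code."""
    for n in range(start, start + count):
        yield f"{country_code}{n:010d}"

def harvest_profiles(upload_contacts, country_code: str, total: int,
                     batch_size: int = 1000) -> dict:
    """Submit candidate numbers in batches; keep whatever matches come back."""
    matches: dict = {}
    numbers = candidate_numbers(country_code, 0, total)
    while True:
        batch = list(islice(numbers, batch_size))
        if not batch:
            break
        matches.update(upload_contacts(batch))
    return matches
```

The sobering part is how little sophistication this requires: the hard part is not the code but evading rate limits of the sort sketched earlier, which the attackers evidently managed at scale.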
And as always it seems we are back to balancing: what to forbid, what to permit. Some forms of scraping, such as academic research in the public interest, ought to be explicitly allowed, particularly in the extremely careful and sensitive way the NYU Ad Observatory was doing it.
But stopping commercial scraping? Good luck with that. The trade-offs between privacy and the possibilities created by an open web are real, the companies doing the scraping know it, and any legislation will likely end in tears. And given the lack of any national privacy law in the United States (a pipe dream), platforms don’t even know where to start – if they even wanted to do something.
In the meantime, though, the cat-and-mouse game of bulk data collection continues – and the mice will keep racking up victories like the ones we’ve seen over the past week. If platforms want the press to stop reporting erroneously on “leaking”, they ought to get out there and talk about scraping.
Some final thoughts on scraping … and Facebook
I can almost hear that collective yawn: “is there really anything left to be revealed about the extent and the frequency with which large volumes of personal data leak from Facebook?!”
A collective yawn seems to be the appropriate response. As Richard Waters, the tech columnist for the Financial Times, noted:
If the information about users’ social networks that leaked out in the Cambridge Analytica scandal was like the plutonium of social media, then this latest slip involved a decidedly low-grade fuel. Details such as names, phone numbers and birth dates of more than 530m people had been scraped from the site, in what amounted to a mass harvesting of data that was already publicly available.
And, in lockstep, the U.S. and EU regulators were right on cue: they would “seriously investigate”. Even the Irish data protection officials, who take the lead in overseeing Facebook in Europe, said “we have concerns” and that they would “investigate” … adding it to the 15 other “active” reviews they already have going into Facebook apps.
But I should not make light of it. As I have noted in many previous posts, even public material like this, combined with other data sets to build fuller profiles on people, can be used for malicious ends. And the case touches on a deeper issue: the growing volume of data that people release publicly as part of their digital lives – often after being nudged by the very companies which benefit from the disclosures – which can later be used in ways that hurt their own interests.
Scraping has been around since the early days of the internet, when potentially valuable information was first left in plain sight on public pages. But recently, the incentives and the opportunities have multiplied. Social networks have become ever-larger repositories, presenting attractive targets for harvesters operating at scale. And the rise of machine learning has brought new incentives, as AI has turned the raw material into potential gold. Clearview AI, for instance, has some of the most advanced facial recognition AI in the world, and it continues to feed it with huge databases of scraped images – the raw material for its service.
NOTE: Clearview has been banned by many law enforcement entities. But apparently a number of U.S. police departments have still been using it, without disclosure and without thinking about how to deal with false positives.
As I noted, there are many ways to scrape in volume. Many companies now make their data available through APIs, the digital “hooks” that others can use to connect to their systems. This reflects the creeping automation in the information realm, as well as a common business strategy. These days, companies often set their sights on becoming platforms, making themselves an indispensable resource for others. Becoming the go-to source for data on any subject is one way to achieve that. This might raise few misgivings for a company such as eBay, which wants to be seen as the definitive source for all product listings. But it is more troubling when personal information is at stake.
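The difference between scraping and an API is worth seeing side by side. Compare the scraping sketch earlier with a sanctioned call to a public API – GitHub’s REST API here, chosen only because it is a well-known public example:

```python
# Fetching public profile data through a sanctioned API instead of scraping.
# GitHub's public REST API returns structured JSON for any public user –
# no HTML parsing, no cat-and-mouse.
import requests

resp = requests.get(
    "https://api.github.com/users/octocat",
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()

profile = resp.json()
print(profile["login"], profile.get("name"), profile.get("public_repos"))
```

The data arrives neatly structured because the company has decided to offer it. That is exactly the business strategy described above – and exactly why it is troubling when what is on offer is personal information.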
It is not only scammers who have seen the opportunities. The commercial value in publicly available data has also led to creative — and unwanted — uses. As I noted, the data analytics company hiQ trawled LinkedIn, looking for tell-tale signs of who among the professional network’s users might be looking for a new job – then reported it back to the users’ employers.
In short, this looks like yet another instance where the design of today’s mass information systems has not always put users first, and where the guardians of the data have allowed their own interests to cloud their decisions. SURPRISE!!
Facebook has tacitly put some of the blame on its own users, saying they could protect themselves better by thinking more about what information they share publicly, and by doing regular “privacy check-ups” to make sure they are not compromised. This ignores the fact that few people have the time or inclination to indulge in such digital hygiene, and that most are in no position to judge how what they disclose today might be used against them tomorrow. Facebook – where impunity reigns.
* * * * * * * * * * * * *
Next up …
PART III
Google gets the U.S. Supreme Court to rewrite copyright law
Google vs Oracle. The United States Supreme Court recently ruled that Google’s use of another software company’s code in building its Android platform was permissible under the fair use doctrine. The decision comes after more than 10 years of litigation on the issue. The long-awaited ruling establishes a groundbreaking precedent for coding and intellectual property as a whole.
However, certain questions remain unanswered as to how Big Tech may develop competing programs moving forward. And what happens in the market really should not weigh heavily when the Court restricts its opinions to questions of law. But this opinion makes clear that the Court was deeply concerned about broad market effects – something the Supreme Court is ill-suited to deal with. It’s a bad decision.