As OSINT has evolved, one of the most frequently asked questions is how we move from analysis to investigation and, at the same time, as the internet (both the surface and deep web) continues to grow at a staggering rate, how we maximise the huge volumes of data available in order to conduct those investigations. The simple answer is that this requires an enduring commitment to curiosity and creativity. The practical answer is that these qualities must be complemented by technical capability and understanding. As is often overlooked, in purist intelligence terms OSINT remains a collection discipline.

According to live-counter.com, the internet (at the time of writing) contains around 19 million petabytes of data. To make a copy of the internet you would need 19 billion one-terabyte hard drives. If those were modern solid-state drives weighing approximately 50 grams each, the internet would weigh around 950 thousand tonnes. To put that into some kind of context, that is roughly the launch mass of 475 space shuttles, nine and a half Nimitz-class aircraft carriers, or more than 6,000 blue whales. So, the internet contains a lot of data.

This means that an investigator often needs to be able to analyse large amounts of it quickly in order to make inferences and connect the dots. To do this, the data must first be obtained and stored somewhere analysis can be conducted on it. The act of “scraping” can be defined as using software to extract human-readable data from another program. An example of this would be the information on a social media page: it is available to see and read, but it is not easily downloaded. What this definition also implies is that the data being scraped is usually not designed to be easily obtained through conventional means, i.e., by being exported or downloaded. Instead, software is used to automate the process of reading and recording the information. Data scraping has a vast number of applications.
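As a minimal sketch of what that automation looks like in practice, the snippet below uses only Python's standard-library HTML parser to pull named fields out of page markup. The markup, class names and field values are all invented for illustration; a real scrape would first fetch the page with an HTTP client.

```python
from html.parser import HTMLParser

# Hypothetical snippet of a profile page; invented for illustration only.
PAGE = """
<div class="profile">
  <span class="username">jsmith_1984</span>
  <span class="location">Manchester, UK</span>
</div>
"""

class ProfileScraper(HTMLParser):
    """Collects the text of any element carrying a class we care about."""
    FIELDS = {"username", "location"}

    def __init__(self):
        super().__init__()
        self._current = None   # field currently being read, if any
        self.data = {}         # field name -> extracted text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.data[self._current] = data.strip()
            self._current = None

scraper = ProfileScraper()
scraper.feed(PAGE)
print(scraper.data)  # {'username': 'jsmith_1984', 'location': 'Manchester, UK'}
```

The same pattern scales up: the parser reads what a human would read on the page, and the software records it in a structured form ready for analysis.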
For OSINT, one common application is to establish relationships between individual entities on social media sites, a process otherwise known as network analysis.
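A sketch of the core of that analysis, using an invented edge list of (follower, followed) pairs as they might be scraped from a set of accounts: treating each relationship as a directed edge, a simple degree count is often enough to surface the most-connected account.

```python
from collections import Counter

# Invented (follower, followed) pairs, standing in for scraped data.
edges = [
    ("alice", "dave"), ("bob", "dave"), ("carol", "dave"),
    ("dave", "alice"), ("erin", "dave"), ("erin", "bob"),
    ("frank", "carol"), ("alice", "bob"),
]

# In-degree: how many accounts in the sample follow each account.
in_degree = Counter(followed for _, followed in edges)

key_node, count = in_degree.most_common(1)[0]
print(f"Key node: {key_node} ({count} followers in the sample)")
```

Real investigations typically feed the same edge list into a graph-visualisation tool, but even this crude count illustrates why scraped, structured data beats manually trawling follower lists.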
The above image shows the followers of 20 Twitter accounts. All the accounts involved in this investigation were highly active and had thousands of followers, and manually trawling through each follower’s account to establish who followed whom would have taken a staggering amount of time. Scraping the data not only made the process substantially quicker and more efficient, but visualising the connections also allowed a far more thorough analysis. This ultimately led to the identification of the key node in the network, something that would almost certainly not have been possible otherwise.

Another common application for data scraping is the preservation of evidence. Whilst the internet is a vast place (and increasingly so), data can also be deleted, lost or hidden. If a user decides to remove content from their social media page, that information is no longer available for an investigation. It is commonplace to read stories about celebrities deleting Tweets and removing posts because they have publicly made incriminating comments. These scandals only come to light because people have scraped the data, more often than not journalists looking for a scoop, and this may or may not raise ethical concerns over the right to privacy. The same concept can be applied to a criminal who has posted a photo on their Facebook page that suggests involvement in a crime. The difference is that if the person involved is not a public figure, the data can be removed without anyone noticing and is then gone forever. Many websites are also under increasing pressure to remove content that has broken the site rules, which usually means it is considered offensive or illegal. But it is often the offensive and illegal content that is of most interest to an online investigator. This means treating all information as ephemeral and recording the relevant material for posterity.
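A sketch of that preservation step, assuming the page content has already been retrieved (the URL and content here are invented): recording a capture timestamp and a cryptographic digest alongside the content means an archived copy can later be shown to be unaltered, even after the original has been deleted.

```python
import hashlib
import json
from datetime import datetime, timezone

def snapshot(url: str, content: bytes) -> dict:
    """Bundle scraped content with a capture time and a SHA-256 digest.

    The digest lets an investigator demonstrate later that the archived
    copy has not been altered since the moment of capture.
    """
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "content": content.decode("utf-8", errors="replace"),
    }

# Hypothetical post; in practice this would come from an HTTP response body.
record = snapshot("https://example.com/post/123", b"<html>incriminating post</html>")
print(json.dumps(record, indent=2))
```

Writing such records out as they are collected, rather than after the fact, is what makes the "treat everything as ephemeral" approach workable.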
So this outlines some of the many reasons why scraping is increasingly the bedrock of good OSINT. Fundamentally, the more data there is, the greater the need to see the big picture, structure that data, and derive real value from it. Scraping is therefore the pivot between the quantitative and qualitative stages of real analysis: the means by which information is garnered before it can be triaged by an analyst to become intelligence.