corte.si

For the last fortnight I've been hard at work on a new project that aims to examine trust and security on the web at scale. The basic idea is to use a browser instance to render a URL, and then to extract all persistent state with browser forensic techniques afterwards. This gives you a dump of cookies, cache contents, Flash storage, HTML5 databases, and so on. At the same time, all traffic is routed through a specialised version of mitmproxy, and captured for later analysis. The result is a very detailed snapshot of what viewing a given URL actually does. The next step is to do this "at scale" - this means running many instances of this process in parallel on headless servers, decoupling things using queues, backing it all onto a database, and then spending days and days fine-tuning. I'm happy with my progress so far - my infrastructure is now now scanning all the URLs passing through Hacker News, Reddit, Digg, Delicious and Pinboard in realtime, without breaking a sweat.

I am pretty excited about the possibilities for this project, and I'm exploring plans for the future with like-minded security folk. Get in touch if this interests you, and keep an eye on my blog for more news.

After my pilot run, I had 150 gigs of data covering about 120 thousand URLs. Below is a quick peek at one tiny slice of this data - an appetizer for things to come.

Neighborhoods of trust

This graph shows structures that emerge from the way sites use third-party executable resources. In this context, "executable" means means JavaScript, Flash and HTML, and "third-party" means domains other than the URL's own. The nodes in this graph are the third-party domains, and the edges are associations between them via the URLs I crawled. For example, if a site loaded scripts from both Google Analytics and from Doubleclick, that would create (or reinforce) an edge between the nodes "google-analytics.com" and "doubleclick.com". Using this data, I calculated a co-occurrence coefficient for the third-party sources, and then extracted the resulting neighbourhood structures algorithmically. The neighbourhood information was used to colour and lay out the graph, trying to keep nodes that are closely correlated together. Finally, nodes are scaled based on how many URLs reference them.

The result is a rather stunning graph showing neighborhoods of trust - areas of the Internet bound together based on the third parties allowed to run code in users' browsers. I've spent a few hours playing with this data, and the sheer range of interesting structure is surprising. At one end of the spectrum, you can zoom in to the individual node relationships, and find small clusters of surprising sites that cross-load resources from each other, often because they are owned by the same entity. At the other end, countries, language groups, and broad fields of interest aggregate in huge tribes of kinship.

Here are a few of the larger-scale features from the graph.

Mainstream

The most widely used resources dominate in the neighbourhood extraction algorithm, which causes them to cluster together in their own super-community. The top nodes in this cluster, descending order of occurrence are: google-analytics.com, facebook.com, doubleclick.net, fbcdn.net, quantserve.com, twitter.com, google.com, googlesyndication.com, googleapis.com, scorecardresearch.net, facebook.net, addthis.com. These are also the top nodes overall.

Japanese

The main resources are hatena.ne.jp, microad.jp, mixi.jp, yahoo.co.jp, nakanohito.jp. More surprisingly, also in this cluster are topsy.com, appspot.com and postrank.com. Perhaps these resources are especially commonly used on Japanese sites.

Russian

Top resources are yadro.ru, yandex.ru, rambler.ru, vkontakte.ru, openstat.net, userapi.com, shinystat.net, and dt00.net

Porn

And here we have a portion of the web dedicated to porn. The top resources are awempire.com, clickbank.net, picadmedia.com, getresponse.com, adultfriendfinder.com, adultadword.com, phcdn.com, juicyads.com, brazzers.com, etology.com, data-ero-advertising.com and viddler.com. A more surprising inclusion in this group is wufoo.com - I wonder if this is an artifact, or whether Wufoo really does have a use in the adult content world.

Misc

Just to show that it's not all clear-cut, here's an example of a neighbourhood I find harder to explain. The top resources are netdna-cdn.com, amgdgt.com, trafficmp.com, ooyala.com, suitesmart.com, demdex.net, adfrontiers.com, lycos.com and break.com. I speculate that this group might be loosely aligned around a number of big CDNs and analysis suites.

Tech

The graph in this post was created, analyzed and pre-processed using graph-tool, a great Python library for dealing with large graphs. The visualization and modularity analysis was done using the ever-wonderful Gephi. If these aren't both in your arsenal of analysis tools, you're missing out.