cortesi

Malware

05 January 2012

Hover and click for more.

The images above are entropy visualizations of samples from a malware database - black is zero entropy, with colour ranging through blue, up to hot pink for maximum entropy. Large areas of very high entropy are usually sections that are packed - encrypted or obfuscated by the malware authors to make the malware hard to detect and reverse engineer. Smaller areas might be keys, passwords, or other chunks of data meant to be hidden from view.

When you hover over an image, you see a character class visualization with the following colors:

  0x00
  0xFF
  Printable characters
  Everything else

Clicking will show you high-detail versions of both visualizations, and let you look up the binary hash to see what it is. I've used a square Hilbert curve layout - the files start in the top-left corner, and pass through the quadrants clockwise.

I spent hours looking through thousands these visualizations today. I find them eerie and rather beautiful - an entirely different perspective from my day-to-day interactions with malware.

Last week, I wrote about visualizing binary files using space-filling curves, a technique I use when I need to get a quick overview of the broad structure of a file. Today, I'll show you an elaboration of the same basic idea - still based on space-filling curves, but this time using a colour function that measures local entropy.

Before I get to the details, let's quickly talk about the motivation for a visualization like this. We can think of entropy as the degree to which a chunk of data is disordered. If we have a data set where all the elements have the same value, the amount of disorder is nil, and the entropy is zero. If the data set has the maximum amount of heterogeneity (i.e. all possible symbols are represented equally), then we also have the maximum amount of disorder, and thus the maximum amount of entropy. There are two common types of high-entropy data that are of special interest to reverse engineers and penetration testers. The first is compressed data - finding and extracting compressed sections is a common task in many security audits. The second is cryptographic material - which is obviously at the heart of most security work. Here, I'm referring not only to key material and certificates, but also to hashes and actual encrypted data. As I show below, a tool like the one I'm describing today can be highly useful in spotting this type of information.

For this visualization, I use the Shannon entropy measure to calculate byte entropy over a sliding window. This gives us a "local entropy" value for each byte, even though the concept doesn't really apply to single symbols.

With that out of the way, let's look at some pretty pictures.

Visualizing the OSX ksh binary

In my previous post, I used the ksh binary as a guinea pig, and I'll do the same here. On the left is the entropy visualization with colours ranging from black for zero entropy, through shades of blue as entropy increases, to hot pink for maximum entropy. On the right is the Hilbert curve visualization from the last post for comparison - see the post itself for an explanation of the colour scheme. Click for larger versions with much more detail:

Entropy Byte class

Note that this is a dual-architecture Mach-O file, containing code for both i386 and x86_64. You can see this if you squint somewhat at these images - some broad structures in the file are repeated twice. We can see that there are a number of different sections of the ksh binary that have very high entropy. It's not immediately obvious why a system binary would contain either compressed sections or cryptographic material. As it happens, the explanation in this case is quite interesting. Let's have a closer look:

Sections 1 and 2 are a lovely validation of the central idea of this post. These two areas do indeed contain cryptographic material - in this case, code signing hashes and certificates. Rather satisfyingly, they stand out like a sore thumb. It turns out that all of the official OSX binaries are signed by Apple. This is then used in turn to apply a variety of policies, depending on who the signatory is, and whether they are trusted.

You can dump some rudimentary data about a binary's signature using the codesign command (which you can also use to sign binaries yourself):

> codesign -dvv /bin/ksh 
Executable=/bin/ksh
Identifier=com.apple.ksh
Format=Mach-O universal (i386 x86_64)
CodeDirectory v=20100 size=5662 flags=0x0(none) hashes=278+2 location=embedded
Signature size=4064
Authority=Software Signing
Authority=Apple Code Signing Certification Authority
Authority=Apple Root CA
Info.plist=not bound
Sealed Resources=none
Internal requirements count=1 size=92

Section 3 (the two occurrences are the same data repeated for each architecture) is interesting for a different reason - it's a cautionary example of how the simple entropy measure we're using sometimes detects high entropy in highly structured data. A hex dump of the start of the region looks like this:

000d1f00  00 01 00 00 00 02 00 00  00 06 00 00 00 00 00 00  |................|
000d1f10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000d1f20  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
000d1f30  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
000d1f40  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
000d1f50  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
000d1f60  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
000d1f70  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
000d1f80  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
000d1f90  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
000d1fa0  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|
000d1fb0  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f  |................|
000d1fc0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  |................|
000d1fd0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |................|
000d1fe0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |................|
000d1ff0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |................|
000d2000  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |................|
000d2010  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |................|

We see that this section contains each byte value from 0x00 to 0xff in order - furthermore this whole block is repeated with minor variations a number of times. There are two things to explain here - why is this detected as "high entropy" data, and what the heck is it doing in the file?

First, we need to understand that the Shannon entropy measure looks only at the relative occurrence frequencies of individual symbols (in this case, bytes). A chunk of data like the one above therefore looks like it has high entropy, because each symbol occurs once and only once, making the data highly heterogeneous.

Now, what earthly use would chunks of data like this be? With a bit of digging, I found the answer in the ksh source code. These sections are maps used for translation between various character encodings. If you're interested, here's the culprit in all its repetitive glory.

The code

As usual, the code for generating all of the images in this post is up on GitHub. The entropy visualizations were created with binvis, a new addition to scurve, my compendium of code related to space-filling curves.

A personal link mill

30 December 2011

I posted a link to an interesting visualization paper on Twitter today, prompting someone to ask me where I had found it. Sadly, I had to admit that I had no clue where I first saw it referenced, due to the way I consume links I find on the net. So, I thought I'd write a quick blog post to explain myself, and then pitch a product idea that could make my life (and maybe yours) much easier.

First, the problem statement: my aim is to efficiently discover links to interesting stuff on the net. Simple as that. A few years ago, my flow of links came mostly from social news sites (Hacker News and Reddit), and items shared by people I follow on social networks. Over time, I became more and more disenchanted with this way of doing things. The social news approach is to take a torrent of very low quality links (user submissions), and then crowd-source the filtration process through voting. But popularity is not a good measure of information quality, and the result is a bland, lowest-common-denominator view of the world that has no room for anything that doesn't make it to the front page. Don't get me wrong - Reddit and HN do a lot of other things well - but they just don't cut it as primary information sources. Mining links from social networks is a more promising approach, but still problematic. None of the social networks provide the tools needed to extract shared links from the update stream and consume them efficiently. There is also a structural issue - I don't necessarily want to mix my social ties and my information sources, and I definitely don't want to be limited to just one platform. These are separate functions that I feel require separate tools.

My personal link mill

Eventually, I took matters into my own hands. First, I hugely broadened the number of information sources I consumed. The tool I use for this is Google Reader - I now subscribe to about 800 individual feeds, and this number is growing daily. The trick here is to find high-quality, low-volume link sources. The motherlode of good links for me was to be found on social bookmarking sites. About 700 of my subscriptions are to the RSS feeds of individual users on Pinboard and Delicious. This gives me very fine control and a great mix of interests. Plus, getting links from individual curators handily sidesteps the social news group-think problem. The remainder of my subscriptions are split between blogs, some sub-Reddits, a few Twitter users and subsections of arXiv.

So much for how my intake works. Just as important is the way that I consume it. I do my "filtering" in batches, usually in the evening. Using Reeder on my iPad works well for me, letting me flick quickly and comfortably through all the new links of the day. When I find something that looks interesting, I resist the temptation to read it then and there - instead, I batch up all my reading for later. If it's a web page, it goes to Instapaper. If it's a PDF, it gets downloaded into a DropBox folder, which is synced to GoodReader.

Finally, the actual reading. Every morning, I toddle off to a nice cafe with my iPad, and read all the interesting stuff I saved the previous day in a single sitting. I'm ruthless about just skimming things that don't warrant careful attention. If I find something particularly interesting I save it permanently, and perhaps tweet it or mail it to someone I think might be interested.

Problems - and a product idea?

This system works for me, but it has many problems. There's no end-to-end coordination, so by the time I sit down to actually read something, I have no easy way to tell which feed it came from. Google Reader sucks at managing hundreds of low-volume subscriptions. Reeder is a great, but is not tailored to consuming redundant information from many sources. The end result is that maintaining the system I have is a time-consuming pain in the ass. The fact that it's still worth it despite this, makes me think there might be commercial room for a better solution.

Which brings me to a rough product idea - a formalized version of this link mill for people who want to take direct control of their information intake. The business end is a generalized feed consumer, letting you subscribe to RSS feeds, Twitter users, Google+ updates, sub-Reddits and other information sources. Links are extracted from these feeds, keeping track of which links appeared where. The user is then presented with a stream of links to consume, de-duplicated so that those appearing in multiple feeds are presented only once. The system keeps track of links the user marks as "interesting", batching them for later consumption. It also uses this information to score the feeds, letting the user see which feeds are low quality, and should be ditched. Given the right tools, the time needed for a user to maintain and tend their link feed garden would be quite modest, and the rewards would be great.

If someone built this, I for one would gladly fork over some of my hard-earned doubloons to use it. In fact, with some validation of the idea and a few collaborators I might think of building it myself. Does this sound useful to anyone else?

In my day job I often come across binary files with unknown content. I have a set of standard avenues of attack when I confront such a beast - use "file" to see if it's a known file type, "strings" to see if there's readable text, run some in-house code to extract compressed sections, and, of course, fire up a hex editor to take a direct look. There's something missing in that list, though - I have no way to get a quick view of the overall structure of the file. Using a hex editor for this is not much chop - if the first section of the file looks random (i.e. probably compressed or encrypted), who's to say that there isn't a chunk of non-random information a meg further down? Ideally, we want to do this type of broad pattern-finding by eye, so a visualization seems to be in order.

First, lets begin by picking a colour scheme. We have 256 different byte values, but for a first-pass look at a file, we can compress that down into a few common classes:

  0x00
  0xFF
  Printable characters
  Everything else

This covers the most common padding bytes, nicely highlights strings, and lumps everything else into a miscellaneous bucket. The broad outline of what we need to do next is clear - we sample the file at regular intervals, translate each sampled byte to a colour, and write the corresponding pixel to our image. This brings us to the big question - what's the best way to arrange the pixels? A first stab might be to lay the pixels out row by row, snaking to and fro to make sure each pixel is always adjacent to its predecessor. It turns out, however, that this zig-zag pattern is not very satisfying - small scale features (i.e. features that take up only a few lines) tend to get lost. What we want is a layout that maps our one-dimensional sequence of samples onto the 2-d image, while keeping elements that are close together in one dimension as near as possible to each other in two dimensions. This is called "locality preservation", and the space-filling curves are a family of mathematical constructs that have precisely this property. If you're a regular reader of this blog, you may know that I have an almost unseemly fondness for these critters. So, lets add a couple of space-filling curves to the mix to see how they stack up. The Z-Order curve has found wide practical use in computer science. It's not the best in terms of locality preservation, but it's easy and quick to compute. The Hilbert curve, on the other hand, is (nearly) as good as it gets at locality preservation, but is much more complicated to generate. Here's what our three candidate curves look like - in each case, the traversal starts in the top-left corner:

Zigzag Z-order Hilbert

And here they are, visualizing the ksh (Mach-O, dual-architecture) binary distributed with OSX - click for the significantly more spectacular larger versions of the images:

Zigzag Z-order Hilbert

The classical Hilbert and Z-Order curves are actually square, so for these visualizations I've unrolled them, stacking four sub-curves on top of each other. To my eye, the Hilbert curve is the clear winner here. Local features are prominent because they are nicely clumped together. The Z-order curve shows some annoying artifacts with contiguous chunks of data sometimes split between two or more visual blocks.

The downside of the space-filling curve visualizations is that we can't look at a feature in the image and tell where, exactly, it can be found in the file. I'm toying with the idea (though not very seriously) of writing an interactive binary file viewer with a space-filling curve navigation pane. This would let the user click on or hover over a patch of structure and see the file offset and the corresponding hex.

More detail

We can get more detail in these images by increasing the granularity of the colour mapping. One way to do this is to use a trick I first concocted to visualize the Hilbert Curve at scale. The basic idea is to use a 3-d Hilbert curve traversal of the RGB colour cube to create a palette of colours. This makes use of the locality-preserving properties of the Hilbert curve to make sure that similar elements have similar colours in the visualization. See the original post for more.

So, here's a Hilbert curve mapping of a binary file, using a Hilbert-order traversal of the RGB cube as a colour palette. Again, click on the image for the much nicer large scale version:

This shows significantly more fine-grained structure, which might be good for a deep dive into a binary. On the other hand, the colours don't map cleanly to distinct byte classes, so the image is harder to interpret. An ideal hex viewer would let you flick between the two palettes for navigation.

The code

As usual, I'm publishing the code for generating all of the images in this post. The binary visualizations were created with binvis, which is a new addition to scurve, my space-filling curve project. The curve diagrams were made with the "drawcurve" utility to be found in the same place.

Today, I'm launching Netograph, a new privacy-related site that I've been hacking on over the past few months. The goal of the project is to provide you with a quick overview of the privacy picture for a URL, before you've clicked on the link. At the moment, Netograph scans Reddit, Hacker News, Pinboard, Delicous and Digg - links on these sites should show up within a few minutes of submission.

For more details, head over to netograph.com. There you will also find Firefox and Chrome browser addons that let you view the Netograph report for a URL instantly with a right-click. Enjoy!

guardian.co.uk techcrunch.com reddit.com

What's next?

This is just the first step. As I hinted in a previous post, the most interesting results from Netograph are likely to come from aggregating and cross-correlating the data for individual URLs. I'm already hard at work on this - the next iteration of Netograph will aim to shine some light on the sometimes shadowy network of third-parties that track and analyze nearly every URL we visit. I will also be publishing some interesting tidbits from this data corpus on my blog as I go along, so watch this space.

Otago Polytechnic Talk

31 October 2011

Further reading for the guest lecture I'm giving at Otago Polytechnic today:

For the last fortnight I've been hard at work on a new project that aims to examine trust and security on the web at scale. The basic idea is to use a browser instance to render a URL, and then to extract all persistent state with browser forensic techniques afterwards. This gives you a dump of cookies, cache contents, Flash storage, HTML5 databases, and so on. At the same time, all traffic is routed through a specialised version of mitmproxy, and captured for later analysis. The result is a very detailed snapshot of what viewing a given URL actually does. The next step is to do this "at scale" - this means running many instances of this process in parallel on headless servers, decoupling things using queues, backing it all onto a database, and then spending days and days fine-tuning. I'm happy with my progress so far - my infrastructure is now now scanning all the URLs passing through Hacker News, Reddit, Digg, Delicious and Pinboard in realtime, without breaking a sweat.

I am pretty excited about the possibilities for this project, and I'm exploring plans for the future with like-minded security folk. Get in touch if this interests you, and keep an eye on my blog for more news.

After my pilot run, I had 150 gigs of data covering about 120 thousand URLs. Below is a quick peek at one tiny slice of this data - an appetizer for things to come.

Neighborhoods of trust

This graph shows structures that emerge from the way sites use third-party executable resources. In this context, "executable" means means JavaScript, Flash and HTML, and "third-party" means domains other than the URL's own. The nodes in this graph are the third-party domains, and the edges are associations between them via the URLs I crawled. For example, if a site loaded scripts from both Google Analytics and from Doubleclick, that would create (or reinforce) an edge between the nodes "google-analytics.com" and "doubleclick.com". Using this data, I calculated a co-occurrence coefficient for the third-party sources, and then extracted the resulting neighbourhood structures algorithmically. The neighbourhood information was used to colour and lay out the graph, trying to keep nodes that are closely correlated together. Finally, nodes are scaled based on how many URLs reference them.

The result is a rather stunning graph showing neighborhoods of trust - areas of the Internet bound together based on the third parties allowed to run code in users' browsers. I've spent a few hours playing with this data, and the sheer range of interesting structure is surprising. At one end of the spectrum, you can zoom in to the individual node relationships, and find small clusters of surprising sites that cross-load resources from each other, often because they are owned by the same entity. At the other end, countries, language groups, and broad fields of interest aggregate in huge tribes of kinship.

Here are a few of the larger-scale features from the graph:

Mainstream

The most widely used resources dominate in the neighbourhood extraction algorithm, which causes them to cluster together in their own super-community. The top nodes in this cluster, descending order of occurrence are: google-analytics.com, facebook.com, doubleclick.net, fbcdn.net, quantserve.com, twitter.com, google.com, googlesyndication.com, googleapis.com, scorecardresearch.net, facebook.net, addthis.com. These are also the top nodes overall.

Japanese

The main resources are hatena.ne.jp, microad.jp, mixi.jp, yahoo.co.jp, nakanohito.jp. More surprisingly, also in this cluster are topsy.com, appspot.com and postrank.com. Perhaps these resources are especially commonly used on Japanese sites.

Russian

Top resources are yadro.ru, yandex.ru, rambler.ru, vkontakte.ru, openstat.net, userapi.com, shinystat.net, and dt00.net

Porn

And here we have a portion of the web dedicated to pron. The top resources are awempire.com, clickbank.net, picadmedia.com, getresponse.com, adultfriendfinder.com, adultadword.com, phcdn.com, juicyads.com, brazzers.com, etology.com, data-ero-advertising.com and viddler.com. A more surprising inclusion in this group is wufoo.com - I wonder if this is an artifact, or whether Wufoo really does have a use in the adult content world.

Misc

Just to show that it's not all clear-cut, here's an example of a neighbourhood I find harder to explain. The top resources are netdna-cdn.com, amgdgt.com, trafficmp.com, ooyala.com, suitesmart.com, demdex.net, adfrontiers.com, lycos.com and break.com. I speculate that this group might be loosely aligned around a number of big CDNs and analysis suites.

The graph in this post was created, analyzed and pre-processed using graph-tool, a great Python library for dealing with large graphs. The visualization and modularity analysis was done using the ever-wonderful Gephi. If these aren't both in your arsenal of analysis tools, you're missing out.

Why the Apple UDID had to die

09 September 2011

EDIT: A WSJ Digits article is now up, containing a responses from Zynga and Chillingo. Other networks declined to comment.

A UDID is a "Unique Device Identifier" - you can think of it as a serial number burned permanently into every iPhone, iPad and iPod Touch. Any installed app can access the UDID without requiring the user's knowledge or consent. We know that UDIDs are very widely used - in a sample of 94 apps I tested, 74% silently sent the UDID to one or more servers on the Internet, often without encryption. This means that UDIDs are not secret values - if you use an Apple device regularly, it's certain that your UDID has found its way into scores of databases you're entirely unaware of. Developers often assume UDIDs are anonymous values, and routinely use them to aggregate detailed and sensitive user behavioural information. One example is Flurry, a mobile analytics firm used by 15% of apps I tested, which can monitor application startup, shutdown, scores achieved, and a host of other application-specific events, all linked to the user's UDID. I recently showed that it was possible to use OpenFeint, a large mobile social gaming network, to de-anonymize UDIDs, linking them to usernames, email addresses, GPS locations, and even Facebook profiles.

This post looks at the way UDIDs are used in the broader social gaming ecosystem. The work is based on a simple question: what happens if we swap our UDID for another while communicating with the network? There are a number of ways to do this - in my case I used mitmproxy, an intercepting HTTP/S proxy I developed which lets me re-write the traffic leaving a device on the fly. In most cases this was a simple matter of replacing one string with another, but two networks (Scoreloop and Crystal) prevented UDID substitution using cryptography. Unfortunately, both networks relied on the secrecy of key material distributed in the application binaries to every device. I have verified that it is possible to reverse engineer the application binaries to extract the key material and circumvent the cryptographic protection.

The outcome of this experiment shows that social gaming networks systematically misuse UDIDs, resulting in serious privacy breaches for their users. All the networks I tested allowed UDIDs to be linked to potentially identifying user information, ranging from usernames to email addresses, friends lists and private messages. Furthermore, 5 of the 7 networks allow an attacker to log in as a user using only their UDID, giving the attacker complete control of the user's account. Two networks had further problems that compromised a user's Facebook and Twitter accounts - Crystal lets an attacker take control of a user accounts by leaking API keys, while Scoreloop partially discloses users' friends lists, even if they are private.

Data leaked Log in as user Social Media Accounts
Crystal Username, friends, Facebook, Twitter, games played, location, email address Yes Control of Facebook, Twitter accounts
GameLoft Username, email address, games played, nationality, friends Yes No
Geocade Username, email address, games played, location Yes No
OpenFeint Username, last played game, online status, friends Yes No
Scoreloop Email address, gender, username, nationality, friends Yes Access private Facebook and Twitter friends lists
Plus+ Username No No
Zynga First name, username, friends*, in-game messages*, mobile number* Yes* No

* The starred Zynga findings rely on the fact that other networks can be used to obtain the user's email address using the UDID.

There are two caveats to keep in mind while considering these results. First, the findings are based on the default settings for each social network - some networks may have settings that reduce the amount of information exposed. Second, some of the data leaked is optional - for instance, it's not mandatory for a user to link Facebook or Twitter accounts with any of the networks.

All the affected companies and Apple were notified 5 weeks ago. The Crystal and Scoreloop teams have both repaired the problems that could lead to a follow-on compromise of a user's social network accounts. At the time of writing, it is still possible to log in as a user using only a UDID on five of the vulnerable networks.

The future

A few days after I notified the companies involved, it was revealed that Apple was quietly killing the UDID API. It will still be present in IOS5, but is marked deprecated, and will probably be removed in future. I recommend that developers shift away from using UDIDs now, rather than wait for formal removal of the API.

We can now expect a frenzy of activity as developers look for alternatives. The challenge will be to make sure that the cure isn't as bad as the disease - Apple's recommendation to "create a unique identifier specific to your app" could tempt developers to replicate the UDID mechanism on a smaller scale, flaws and all. Expect more blog posts on this topic soon.

RSS Feed ---
subscribers
Add to Google

    Copyright 2008 Aldo Cortesi