cortesi

I've just released pathod, a pathological HTTP/S daemon useful for testing and torturing HTTP clients. At its core is a tiny, terse language for crafting HTTP responses. It also has a built-in web interface that lets you play with the response spec language, inspect logs, and access pathod's full help document.

The rest of this post is a quick teaser showing some of pathod's abilities. See the detailed documentation on the pathod site if you want more.

The simplest possible response

The easiest way to craft a response is to specify it directly in the request URL. Let's start with the simplest possible example. Start pathod, and then visit this URL:

http://localhost:9999/p/200

The "/p/" path is the location of the response generator in pathod's default configuration - everything after that is a response specification in pathod's mini-language. The general form of a response spec is as follows:

code[MESSAGE]:[colon-separated list of features]

In this case, we're specifying only the HTTP response code - that is, an HTTP 200 OK with no headers and no content, resulting in a response like this:

HTTP/1.1 200 OK

Specifying features

One example of a "feature" is a response header. Let's embellish our response by adding one:

200:h"Etag"="foo"

The first letter of the feature - "h", in this case - is a mnemonic indicating the type of feature we're adding. The full response to this spec looks like this:

HTTP/1.1 200 OK
Etag: foo

Both "Etag" and "foo" are Value Specifiers, a syntax used throughout the response specification language. In this case they are literal values, as indicated by the fact that they are quoted strings. The Value Specification syntax also lets us load values from files or generate random data. For instance, here is a specification that generates 100k of random binary data for the header value:

200:h"Etag"=@100k

Now, binary data in the header value will probably break things in interesting ways, but is unlikely to be read by the client as a valid (but over-long) value. To see if the client really drops off its perch if we feed it a single 100k header, we have to constrain the random data. Here's the same response, but with data generated only from ASCII letters:

200:h"Etag"=@100k,ascii_letters

pathod has a large number of built-in character classes from which random data can be generated.

Pauses and Disconnects

Next, we can disrupt the communications in various ways. At the moment, this means adding pauses and disconnects to a response. Let's start with an HTTP 404 response with a body consisting of 100k of random binary data:

404:b@100k

Here's the same response, but with a 120 second pause after sending 100 bytes:

404:b@100k:p120,100

And the same response again, but with a hard disconnect after sending 100 bytes:

404:b@100k:d100

Instead of specifying an offset explicitly, we can ask pathod to just disconnect at a random point of its choosing:

404:b@100k:dr
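When specs are generated from test code rather than typed by hand, it can help to compose them programmatically. Here's a minimal sketch that builds the spec strings shown above. The helper names (literal, generated, header, body, pause, disconnect, spec, url) are hypothetical and not part of pathod, and they cover only the syntax that appears in this post.

```python
# Hypothetical helpers for composing pathod response specs as strings.
# They mirror only the syntax shown in this post; they are not pathod's API.

def literal(text):
    """A literal Value Specifier: a quoted string (no escaping is attempted)."""
    return f'"{text}"'

def generated(size, charclass=None):
    """A generated Value Specifier, e.g. @100k or @100k,ascii_letters."""
    return f"@{size}" if charclass is None else f"@{size},{charclass}"

def header(name, value):
    """Header feature: h<name>=<value>, both given as Value Specifiers."""
    return f"h{name}={value}"

def body(value):
    """Body feature: b<value>."""
    return f"b{value}"

def pause(seconds, offset):
    """Pause feature: p<seconds>,<offset in bytes>."""
    return f"p{seconds},{offset}"

def disconnect(offset="r"):
    """Disconnect feature: d<offset in bytes>, or dr for a random point."""
    return f"d{offset}"

def spec(code, *features):
    """Join a response code and its features with colons."""
    return ":".join([str(code), *features])

def url(response_spec, base="http://localhost:9999/p/"):
    """Append a spec to the default response generator path."""
    return base + response_spec

print(url(spec(404, body(generated("100k")), pause(120, 100))))
```

The sketch does no validation and no URL encoding, so whether the quotes in a literal value need percent-encoding depends on the client you use to make the request.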

That's it for the teaser - hopefully it's enough to entice you into looking at pathod's full documentation.

What's next?

pathod is an "airport project" - the first draft was written in its entirety during a 40-hour trip back home from New York (I drew a bad lot in stopovers). I've now firmed it up a bit, but there's still work to be done. In the next month, mitmproxy's test suite will move to pathod, after which there will be a simple, well-documented way to unit test. I also plan to build out the JSON API (which is used to drive pathod in test suites), and expand the mini-language with convenient ways to generate pathological cookies, authentication headers, SSL errors, and cache control.

mitmproxy 0.8

09 April 2012

I'm happy to announce the release of mitmproxy 0.8. This release has a few major new features, big speedups, and many, many small bugfixes and improvements. Here are the headlines:

Android interception

The most prominent new feature is that we now have a supported way to intercept Android traffic. What's more, we can do this without a cumbersome transparent proxying rig - see the Android section in the documentation for the details. Special thanks go to Jim Cheetham for lending me an Android device and helping to get this feature off the ground.

Replacement patterns

Another exceedingly useful new feature is replacement patterns. These consist of a filter, a regular expression and a replacement string, and run continuously while mitmproxy processes requests and responses. You can pass these either on the command-line, or using a built-in replacement pattern editor.

I'm sure you can immediately think of many uses for this flexible feature, but my favourite is to use it during testing as a way to conveniently inject complicated exploits into web traffic. I do this by setting a replacement pattern that swaps a short but likely unique string (say MYXSS) for a long exploit, and then I use simple interaction and front-end tools like Firebug to inject exploits into requests manually based on the short string marker.
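To make the filter / regular expression / replacement triple concrete, here's a rough sketch of the idea in plain Python. This is not mitmproxy's implementation or API: the Message and Replacement classes are invented for illustration, and the filter is modelled as an ordinary predicate function rather than a mitmproxy filter expression.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    """Stand-in for a request or response (hypothetical, not a mitmproxy type)."""
    url: str
    body: str

@dataclass
class Replacement:
    matches: Callable[[Message], bool]  # the filter: which messages to touch
    pattern: str                        # the regular expression to look for
    replacement: str                    # the text to swap in

    def apply(self, msg: Message) -> Message:
        if self.matches(msg):
            # A function replacement keeps backslashes in the payload literal.
            msg.body = re.sub(self.pattern, lambda m: self.replacement, msg.body)
        return msg

# Swap a short marker for a longer payload, but only on one host.
rule = Replacement(
    matches=lambda m: "example.com" in m.url,
    pattern=r"MYXSS",
    replacement="<script>alert(1)</script>",
)

print(rule.apply(Message("http://example.com/search", "q=MYXSS")).body)
```

Using a function as the replacement means the payload is inserted verbatim, at the cost of not supporting group references like \1 in the replacement string.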

Improved pretty-printing of request and response contents

This release of mitmproxy has a completely redesigned subsystem for pretty-printing request and response bodies. For instance, we now extract EXIF tags and other basic information to give you something better than a hex dump when looking at an image:

We also have much improved HTML indenting (using lxml), and a built-in JavaScript beautifier (thanks to JSBeautifier) that teases out compressed and obfuscated scripts into something readable.

Changelog

  • Detailed tutorial for Android interception. Some features that land in this release have finally made reliable Android interception possible.
  • Upstream-cert mode, which uses information from the upstream server to generate interception certificates.
  • Replacement patterns that let you easily do global replacements in flows matching filter patterns. Can be specified on the command-line, or edited interactively.
  • Much more sophisticated and usable pretty printing of request bodies. Support for auto-indentation of JavaScript, inspection of image EXIF data, and more.
  • Details view for flows, showing connection and SSL cert information (X keyboard shortcut).
  • Server certificates are now stored and serialized in saved traffic for later analysis. This means that the 0.8 serialization format is NOT compatible with 0.7.
  • Add a shortcut key ("f") to load the remainder of a request or response body, if it is abbreviated.
  • Many other improvements, including bugfixes, an expanded scripting API, and more sophisticated certificate handling.

mitmproxy 0.7

27 February 2012

I'm happy to announce the release of mitmproxy 0.7. The biggest visible change is a new structured editor for headers, query strings and form fields. Other new features include a reverse proxy mode, an extended script API that makes many common tasks much easier, and a myriad of improvements to the interface (including a massive increase in speed). Everybody still on 0.6 should upgrade - get it here:

mitmproxy-0.7.tar.gz (docs)

You can also now install mitmproxy using pip, like so:

pip install mitmproxy

In other news, the project has had an amazing month, after a rash of high-profile results obtained using mitmproxy were published. It started with Arun Thampi's discovery that Path uploads users' address books to their servers. Things snowballed from there, and for a few days mitmproxy seemed to be everywhere. Similar findings were made for Hipster, The Verge did a mitmproxy-driven AddressbookGate expose (including vaguely threatening background shots of mitmproxy doing its dastardly work), and lots of people said nice things on Twitter.

To see the impact of all this on the mitmproxy project, you need only look at the GitHub page - watchers of the repo went from about 200 a month ago to 950 at the time of this post.

Changelog

  • New built-in key/value editor. This lets you interactively edit URL query strings, headers and URL-encoded form data.
  • Extend script API to allow duplication and replay of flows.
  • API for easy manipulation of URL-encoded forms and query strings.
  • Add "D" shortcut in mitmproxy to duplicate a flow.
  • Reverse proxy mode. In this mode mitmproxy acts as an HTTP server, forwarding all traffic to a specified upstream server.
  • UI improvements - use Unicode characters to make GUI more compact, improve spacing and layout throughout.
  • Add support for filtering by HTTP method.
  • Add the ability to specify an HTTP body size limit.
  • Move to typed netstrings for serialization format - this makes 0.7 backwards-incompatible with serialized data from 0.6!
  • Significant improvements in speed and responsiveness of UI.
  • Many minor bugfixes and improvements.

OpenBSD in decline?

26 February 2012

My leisurely Sunday activity today is to set up a new OpenBSD firewall for my mobile app testing lab. I haven't done a from-scratch OpenBSD install for years, so I spent some time reading through the change logs for the last few versions to catch up with what's changed. Although the project is clearly still making steady, well-engineered progress, I had the nagging feeling that the rate of change wasn't what it used to be. So, I pulled some numbers from CVS commit message list archives, and graphed them. Here is the number of commits per month from January 2001 to January 2012. The orange line is a simple 12-month moving average:

Now, we should be cautious about interpreting this - the number of commits doesn't tell us anything about the quality, importance or magnitude of code change. Even if it did all of these things, there are other and perhaps better measures of a project's health. Still, the trend is clear, and suggests a sustained decline in activity.
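For anyone wanting to reproduce the smoothing, a simple moving average is only a few lines. This is a sketch that assumes a trailing window (each point averages the current month and the eleven before it); the post doesn't say whether the plotted line was trailing or centred, and the function name and sample numbers are made up.

```python
def moving_average(values, window=12):
    """Trailing simple moving average: one output per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that has just slid out of the window.
            running -= values[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages

# Made-up monthly counts, purely to show the call.
monthly_commits = [1, 2, 3, 4]
print(moving_average(monthly_commits, window=2))
```

The result is shorter than the input by window - 1 entries, since the first few months don't have a full window behind them.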

I just bought some T-shirts to help support one of my favourite open source projects. You should too.

Malware

05 January 2012

Hover and click for more.

The images above are entropy visualizations of samples from a malware database - black is zero entropy, with colour ranging through blue, up to hot pink for maximum entropy. Large areas of very high entropy are usually sections that are packed - encrypted or obfuscated by the malware authors to make the malware hard to detect and reverse engineer. Smaller areas might be keys, passwords, or other chunks of data meant to be hidden from view.

When you hover over an image, you see a character class visualization with the following colors:

  0x00
  0xFF
  Printable characters
  Everything else

Clicking will show you high-detail versions of both visualizations, and let you look up the binary hash to see what it is. I've used a square Hilbert curve layout - the files start in the top-left corner, and pass through the quadrants clockwise.

I spent hours looking through thousands of these visualizations today. I find them eerie and rather beautiful - an entirely different perspective from my day-to-day interactions with malware.

Last week, I wrote about visualizing binary files using space-filling curves, a technique I use when I need to get a quick overview of the broad structure of a file. Today, I'll show you an elaboration of the same basic idea - still based on space-filling curves, but this time using a colour function that measures local entropy.

Before I get to the details, let's quickly talk about the motivation for a visualization like this. We can think of entropy as the degree to which a chunk of data is disordered. If we have a data set where all the elements have the same value, the amount of disorder is nil, and the entropy is zero. If the data set has the maximum amount of heterogeneity (i.e. all possible symbols are represented equally), then we also have the maximum amount of disorder, and thus the maximum amount of entropy.

There are two common types of high-entropy data that are of special interest to reverse engineers and penetration testers. The first is compressed data - finding and extracting compressed sections is a common task in many security audits. The second is cryptographic material - which is obviously at the heart of most security work. Here, I'm referring not only to key material and certificates, but also to hashes and actual encrypted data. As I show below, a tool like the one I'm describing today can be highly useful in spotting this type of information.

For this visualization, I use the Shannon entropy measure to calculate byte entropy over a sliding window. This gives us a "local entropy" value for each byte, even though the concept doesn't really apply to single symbols.
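Here's a minimal sketch of that calculation. It isn't the binvis code: the function names, the 32-byte default window, and the way the window is clamped at the ends of the data are all assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(chunk):
    """Shannon entropy of a bytes object, in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(chunk).values()
    )

def local_entropy(data, window=32):
    """One entropy value per byte offset, taken over a window around it."""
    half = window // 2
    values = []
    for i in range(len(data)):
        start = max(0, i - half)
        end = min(len(data), i - half + window)
        values.append(shannon_entropy(data[start:end]))
    return values

sample = b"\x00" * 64 + bytes(range(64))
print(local_entropy(sample, window=16)[::16])
```

Note that a window of w bytes can never score higher than log2(w) bits, so the values need to be scaled against that ceiling (rather than 8 bits) before being mapped to a colour.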

With that out of the way, let's look at some pretty pictures.

Visualizing the OSX ksh binary

In my previous post, I used the ksh binary as a guinea pig, and I'll do the same here. On the left is the entropy visualization with colours ranging from black for zero entropy, through shades of blue as entropy increases, to hot pink for maximum entropy. On the right is the Hilbert curve visualization from the last post for comparison - see the post itself for an explanation of the colour scheme. Click for larger versions with much more detail:

Entropy Byte class

Note that this is a dual-architecture Mach-O file, containing code for both i386 and x86_64. You can see this if you squint somewhat at these images - some broad structures in the file are repeated twice. We can see that there are a number of different sections of the ksh binary that have very high entropy. It's not immediately obvious why a system binary would contain either compressed sections or cryptographic material. As it happens, the explanation in this case is quite interesting. Let's have a closer look:

Sections 1 and 2 are a lovely validation of the central idea of this post. These two areas do indeed contain cryptographic material - in this case, code signing hashes and certificates. Rather satisfyingly, they stand out like a sore thumb. It turns out that all of the official OSX binaries are signed by Apple. This is then used in turn to apply a variety of policies, depending on who the signatory is, and whether they are trusted.

You can dump some rudimentary data about a binary's signature using the codesign command (which you can also use to sign binaries yourself):

> codesign -dvv /bin/ksh 
Executable=/bin/ksh
Identifier=com.apple.ksh
Format=Mach-O universal (i386 x86_64)
CodeDirectory v=20100 size=5662 flags=0x0(none) hashes=278+2 location=embedded
Signature size=4064
Authority=Software Signing
Authority=Apple Code Signing Certification Authority
Authority=Apple Root CA
Info.plist=not bound
Sealed Resources=none
Internal requirements count=1 size=92

Section 3 (the two occurrences are the same data repeated for each architecture) is interesting for a different reason - it's a cautionary example of how the simple entropy measure we're using sometimes detects high entropy in highly structured data. A hex dump of the start of the region looks like this:

000d1f00  00 01 00 00 00 02 00 00  00 06 00 00 00 00 00 00  |................|
000d1f10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000d1f20  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|
000d1f30  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  |................|
000d1f40  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
000d1f50  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
000d1f60  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
000d1f70  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
000d1f80  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
000d1f90  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|
000d1fa0  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|
000d1fb0  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f  |................|
000d1fc0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af  |................|
000d1fd0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  |................|
000d1fe0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf  |................|
000d1ff0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df  |................|
000d2000  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef  |................|
000d2010  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe ff  |................|

We see that this section contains each byte value from 0x00 to 0xff in order - furthermore this whole block is repeated with minor variations a number of times. There are two things to explain here - why is this detected as "high entropy" data, and what the heck is it doing in the file?

First, we need to understand that the Shannon entropy measure looks only at the relative occurrence frequencies of individual symbols (in this case, bytes). A chunk of data like the one above therefore looks like it has high entropy, because each symbol occurs once and only once, making the data highly heterogeneous.
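A quick way to convince yourself of this is to compare the ordered table with a shuffled copy of the same bytes. The sketch below uses a hypothetical byte_entropy helper and a fixed random seed; since only symbol counts feed into the measure, both inputs get the same score.

```python
import math
import random
from collections import Counter

def byte_entropy(data):
    """Shannon entropy in bits per byte, from symbol frequencies only."""
    total = len(data)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(data).values()
    )

ordered = bytes(range(256))                                   # 00 01 02 ... ff
shuffled = bytes(random.Random(0).sample(range(256), 256))    # same bytes, random order

print(byte_entropy(ordered), byte_entropy(shuffled))
```

Both come out at the 8-bit maximum, even though the ordered table is about as structured as data can be.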

Now, what earthly use would chunks of data like this be? With a bit of digging, I found the answer in the ksh source code. These sections are maps used for translation between various character encodings. If you're interested, here's the culprit in all its repetitive glory.

The code

As usual, the code for generating all of the images in this post is up on GitHub. The entropy visualizations were created with binvis, a new addition to scurve, my compendium of code related to space-filling curves.

A personal link mill

30 December 2011

I posted a link to an interesting visualization paper on Twitter today, prompting someone to ask me where I had found it. Sadly, I had to admit that I had no clue where I first saw it referenced, due to the way I consume links I find on the net. So, I thought I'd write a quick blog post to explain myself, and then pitch a product idea that could make my life (and maybe yours) much easier.

First, the problem statement: my aim is to efficiently discover links to interesting stuff on the net. Simple as that. A few years ago, my flow of links came mostly from social news sites (Hacker News and Reddit), and items shared by people I follow on social networks. Over time, I became more and more disenchanted with this way of doing things. The social news approach is to take a torrent of very low quality links (user submissions), and then crowd-source the filtration process through voting. But popularity is not a good measure of information quality, and the result is a bland, lowest-common-denominator view of the world that has no room for anything that doesn't make it to the front page. Don't get me wrong - Reddit and HN do a lot of other things well - but they just don't cut it as primary information sources. Mining links from social networks is a more promising approach, but still problematic. None of the social networks provide the tools needed to extract shared links from the update stream and consume them efficiently. There is also a structural issue - I don't necessarily want to mix my social ties and my information sources, and I definitely don't want to be limited to just one platform. These are separate functions that I feel require separate tools.

My personal link mill

Eventually, I took matters into my own hands. First, I hugely broadened the number of information sources I consumed. The tool I use for this is Google Reader - I now subscribe to about 800 individual feeds, and this number is growing daily. The trick here is to find high-quality, low-volume link sources. The motherlode of good links for me was to be found on social bookmarking sites. About 700 of my subscriptions are to the RSS feeds of individual users on Pinboard and Delicious. This gives me very fine control and a great mix of interests. Plus, getting links from individual curators handily sidesteps the social news group-think problem. The remainder of my subscriptions are split between blogs, some sub-Reddits, a few Twitter users and subsections of arXiv.

So much for how my intake works. Just as important is the way that I consume it. I do my "filtering" in batches, usually in the evening. Using Reeder on my iPad works well for me, letting me flick quickly and comfortably through all the new links of the day. When I find something that looks interesting, I resist the temptation to read it then and there - instead, I batch up all my reading for later. If it's a web page, it goes to Instapaper. If it's a PDF, it gets downloaded into a DropBox folder, which is synced to GoodReader.

Finally, the actual reading. Every morning, I toddle off to a nice cafe with my iPad, and read all the interesting stuff I saved the previous day in a single sitting. I'm ruthless about just skimming things that don't warrant careful attention. If I find something particularly interesting I save it permanently, and perhaps tweet it or mail it to someone I think might be interested.

Problems - and a product idea?

This system works for me, but it has many problems. There's no end-to-end coordination, so by the time I sit down to actually read something, I have no easy way to tell which feed it came from. Google Reader sucks at managing hundreds of low-volume subscriptions. Reeder is great, but is not tailored to consuming redundant information from many sources. The end result is that maintaining the system I have is a time-consuming pain in the ass. The fact that it's still worth it despite this makes me think there might be commercial room for a better solution.

Which brings me to a rough product idea - a formalized version of this link mill for people who want to take direct control of their information intake. The business end is a generalized feed consumer, letting you subscribe to RSS feeds, Twitter users, Google+ updates, sub-Reddits and other information sources. Links are extracted from these feeds, keeping track of which links appeared where. The user is then presented with a stream of links to consume, de-duplicated so that those appearing in multiple feeds are presented only once. The system keeps track of links the user marks as "interesting", batching them for later consumption. It also uses this information to score the feeds, letting the user see which feeds are low quality, and should be ditched. Given the right tools, the time needed for a user to maintain and tend their link feed garden would be quite modest, and the rewards would be great.
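To show how little machinery the core needs, here's a toy sketch of the bookkeeping: de-duplication, source tracking, and feed scoring. Everything in it is hypothetical, including the LinkMill class, the scoring rule (the fraction of a feed's links that were marked interesting), and the assumption that links arrive as already-normalized URL strings.

```python
from collections import defaultdict

class LinkMill:
    """Toy model of the link mill: dedupe links, remember sources, score feeds."""

    def __init__(self):
        self.sources = defaultdict(set)  # link -> names of feeds it appeared in
        self.queue = []                  # de-duplicated links, in arrival order
        self.interesting = set()

    def ingest(self, feed, links):
        for link in links:
            if link not in self.sources:
                self.queue.append(link)  # first sighting: show it to the user once
            self.sources[link].add(feed)

    def mark_interesting(self, link):
        if link not in self.sources:
            raise KeyError(link)
        self.interesting.add(link)

    def feed_scores(self):
        """Fraction of each feed's links that the user marked as interesting."""
        seen = defaultdict(int)
        hits = defaultdict(int)
        for link, feeds in self.sources.items():
            for feed in feeds:
                seen[feed] += 1
                if link in self.interesting:
                    hits[feed] += 1
        return {feed: hits[feed] / seen[feed] for feed in seen}

mill = LinkMill()
mill.ingest("pinboard/alice", ["http://a.example/x", "http://a.example/y"])
mill.ingest("twitter/bob", ["http://a.example/y", "http://a.example/z"])
mill.mark_interesting("http://a.example/y")
print(mill.queue, mill.feed_scores())
```

The hard parts of a real version are elsewhere: fetching from each kind of source, and deciding when two different URLs point at the same thing.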

If someone built this, I for one would gladly fork over some of my hard-earned doubloons to use it. In fact, with some validation of the idea and a few collaborators I might think of building it myself. Does this sound useful to anyone else?

In my day job I often come across binary files with unknown content. I have a set of standard avenues of attack when I confront such a beast - use "file" to see if it's a known file type, "strings" to see if there's readable text, run some in-house code to extract compressed sections, and, of course, fire up a hex editor to take a direct look. There's something missing in that list, though - I have no way to get a quick view of the overall structure of the file. Using a hex editor for this is not much chop - if the first section of the file looks random (i.e. probably compressed or encrypted), who's to say that there isn't a chunk of non-random information a meg further down? Ideally, we want to do this type of broad pattern-finding by eye, so a visualization seems to be in order.

First, let's begin by picking a colour scheme. We have 256 different byte values, but for a first-pass look at a file, we can compress that down into a few common classes:

  0x00
  0xFF
  Printable characters
  Everything else

This covers the most common padding bytes, nicely highlights strings, and lumps everything else into a miscellaneous bucket. The broad outline of what we need to do next is clear - we sample the file at regular intervals, translate each sampled byte to a colour, and write the corresponding pixel to our image.

This brings us to the big question - what's the best way to arrange the pixels? A first stab might be to lay the pixels out row by row, snaking to and fro to make sure each pixel is always adjacent to its predecessor. It turns out, however, that this zig-zag pattern is not very satisfying - small scale features (i.e. features that take up only a few lines) tend to get lost. What we want is a layout that maps our one-dimensional sequence of samples onto the 2-d image, while keeping elements that are close together in one dimension as near as possible to each other in two dimensions. This is called "locality preservation", and the space-filling curves are a family of mathematical constructs that have precisely this property. If you're a regular reader of this blog, you may know that I have an almost unseemly fondness for these critters.

So, let's add a couple of space-filling curves to the mix to see how they stack up. The Z-Order curve has found wide practical use in computer science. It's not the best in terms of locality preservation, but it's easy and quick to compute. The Hilbert curve, on the other hand, is (nearly) as good as it gets at locality preservation, but is much more complicated to generate. Here's what our three candidate curves look like - in each case, the traversal starts in the top-left corner:

Zigzag Z-order Hilbert

And here they are, visualizing the ksh (Mach-O, dual-architecture) binary distributed with OSX - click for the significantly more spectacular larger versions of the images:

Zigzag Z-order Hilbert

The classical Hilbert and Z-Order curves are actually square, so for these visualizations I've unrolled them, stacking four sub-curves on top of each other. To my eye, the Hilbert curve is the clear winner here. Local features are prominent because they are nicely clumped together. The Z-order curve shows some annoying artifacts with contiguous chunks of data sometimes split between two or more visual blocks.
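For the curious, here's a compact sketch of the whole pipeline: sample the file, classify each byte, and place it on a grid using one of the three layouts. It is not the binvis code. The function names are invented, "printable" is assumed to mean ASCII 0x20 to 0x7E, the Hilbert mapping is the standard iterative index-to-coordinate conversion (its orientation may differ from the images here), and the grid is a plain square rather than the unrolled stack of sub-curves.

```python
def byte_class(b):
    """Collapse a byte value into one of the four colour classes."""
    if b == 0x00:
        return "0x00"
    if b == 0xFF:
        return "0xFF"
    if 0x20 <= b <= 0x7E:  # assumption: printable means ASCII space to tilde
        return "printable"
    return "other"

def sample(data, count):
    """Pick `count` bytes at regular intervals (data must be non-empty)."""
    step = len(data) / count
    return [data[int(i * step)] for i in range(count)]

def zigzag(d, width):
    """Row by row, reversing direction on every other row."""
    y, x = divmod(d, width)
    return (x if y % 2 == 0 else width - 1 - x, y)

def zorder(d):
    """De-interleave the bits of d: even bits give x, odd bits give y."""
    x = y = bit = 0
    while d:
        x |= (d & 1) << bit
        y |= ((d >> 1) & 1) << bit
        d >>= 2
        bit += 1
    return (x, y)

def hilbert(d, order):
    """Map index d to (x, y) on a Hilbert curve with side 2**order."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate the sub-square
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return (x, y)

def layout(data, order):
    """Square grid of byte classes, filled along the Hilbert curve."""
    side = 1 << order
    grid = [[None] * side for _ in range(side)]
    for d, b in enumerate(sample(data, side * side)):
        x, y = hilbert(d, order)
        grid[y][x] = byte_class(b)
    return grid

grid = layout(b"\x00" * 32 + b"A" * 32, order=3)
print(sum(row.count("printable") for row in grid))
```

Swapping hilbert for zorder or zigzag in layout is all it takes to compare the three layouts on the same data.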

The downside of the space-filling curve visualizations is that we can't look at a feature in the image and tell where, exactly, it can be found in the file. I'm toying with the idea (though not very seriously) of writing an interactive binary file viewer with a space-filling curve navigation pane. This would let the user click on or hover over a patch of structure and see the file offset and the corresponding hex.

More detail

We can get more detail in these images by increasing the granularity of the colour mapping. One way to do this is to use a trick I first concocted to visualize the Hilbert Curve at scale. The basic idea is to use a 3-d Hilbert curve traversal of the RGB colour cube to create a palette of colours. This makes use of the locality-preserving properties of the Hilbert curve to make sure that similar elements have similar colours in the visualization. See the original post for more.

So, here's a Hilbert curve mapping of a binary file, using a Hilbert-order traversal of the RGB cube as a colour palette. Again, click on the image for the much nicer large scale version:

This shows significantly more fine-grained structure, which might be good for a deep dive into a binary. On the other hand, the colours don't map cleanly to distinct byte classes, so the image is harder to interpret. An ideal hex viewer would let you flick between the two palettes for navigation.

The code

As usual, I'm publishing the code for generating all of the images in this post. The binary visualizations were created with binvis, which is a new addition to scurve, my space-filling curve project. The curve diagrams were made with the "drawcurve" utility to be found in the same place.


    Copyright 2008 Aldo Cortesi