How mitmproxy works

I started work on mitmproxy because I was frustrated with the available interception tools. I had a long list of minor complaints - they were insufficiently flexible, not programmable enough, mostly written in Java (a language I don't enjoy), and so forth. My most serious problem, though, was opacity. The best tools were all closed source and commercial. SSL interception is a complicated and delicate process, and after a certain point, not understanding precisely what your proxy is doing just doesn't fly.

The text below is now part of the official documentation of mitmproxy. It's a detailed description of mitmproxy's interception process, and is more or less the overview document I wish I had when I first started the project. I proceed by example, starting with the simplest unencrypted explicit proxying, and working up to the most complicated interaction - transparent proxying of SSL-protected traffic[1] in the presence of SNI.

Explicit HTTP

Configuring the client to use mitmproxy as an explicit proxy is the simplest and most reliable way to intercept traffic. The proxy protocol is codified in the HTTP RFC, so the behaviour of both the client and the server is well defined, and usually reliable. In the simplest possible interaction with mitmproxy, a client connects directly to the proxy and makes a request that looks like this:

GET http://example.com/index.html HTTP/1.1

This is a proxy GET request - an extended form of the vanilla HTTP GET request that includes a scheme and host specification, and so carries all the information mitmproxy needs to relay the request upstream.

1 The client connects to the proxy and makes a request.
2 Mitmproxy connects to the upstream server and simply forwards the request on.
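As a rough illustration of why the proxy form of the request is sufficient, the sketch below pulls the upstream address out of a proxy request line using only the Python standard library. The function name split_proxy_request is made up for this example, and the rewritten request line it returns is just one thing a proxy might send upstream - it is not a description of mitmproxy's internals.

```python
from urllib.parse import urlsplit


def split_proxy_request(request_line):
    """Split a proxy-form request line into ((host, port), vanilla request line)."""
    method, target, version = request_line.split(" ", 2)
    url = urlsplit(target)
    if not url.scheme or not url.hostname:
        raise ValueError("not a proxy-form request: %r" % request_line)
    # Fall back to the scheme's default port when none is given.
    port = url.port or (443 if url.scheme == "https" else 80)
    path = url.path or "/"
    if url.query:
        path += "?" + url.query
    return (url.hostname, port), "%s %s %s" % (method, path, version)


if __name__ == "__main__":
    print(split_proxy_request("GET http://example.com/index.html HTTP/1.1"))
```

The point is simply that the scheme and host embedded in the request target are enough to decide where to connect.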

Explicit HTTPS

The process for an explicitly proxied HTTPS connection is quite different. The client connects to the proxy and makes a request that looks like this:

CONNECT example.com:443 HTTP/1.1

A conventional proxy can neither view nor manipulate an SSL-encrypted data stream, so a CONNECT request simply asks the proxy to open a pipe between the client and server. The proxy here is just a facilitator - it blindly forwards data in both directions without knowing anything about the contents. The negotiation of the SSL connection happens over this pipe, and the subsequent flow of requests and responses is completely opaque to the proxy.

The MITM in mitmproxy

This is where mitmproxy's fundamental trick comes into play. The MITM in its name stands for Man-In-The-Middle - a reference to the process we use to intercept and interfere with these theoretically opaque data streams. The basic idea is to pretend to be the server to the client, and pretend to be the client to the server, while we sit in the middle decoding traffic from both sides. The tricky part is that the Certificate Authority system is designed to prevent exactly this attack, by allowing a trusted third-party to cryptographically sign a server's SSL certificates to verify that they are legit. If this signature doesn't match or is from a non-trusted party, a secure client will simply drop the connection and refuse to proceed. Despite the many shortcomings of the CA system as it exists today, this is usually fatal to attempts to MITM an SSL connection for analysis. Our answer to this conundrum is to become a trusted Certificate Authority ourselves. Mitmproxy includes a full CA implementation that generates interception certificates on the fly. To get the client to trust these certificates, we register mitmproxy as a trusted CA with the device manually.

Complication 1: What's the remote hostname?

To proceed with this plan, we need to know the domain name to use in the interception certificate - the client will verify that the certificate is for the domain it's connecting to, and abort if this is not the case. At first blush, it seems that the CONNECT request above gives us all we need - in this example, both of these values are "example.com". But what if the client had initiated the connection as follows:

CONNECT 10.1.1.1:443 HTTP/1.1

Using the IP address is perfectly legitimate because it gives us enough information to initiate the pipe, even though it doesn't reveal the remote hostname.
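The difference between the two CONNECT forms can be checked mechanically. The following is a minimal sketch using the Python standard library; parse_connect and reveals_hostname are hypothetical helper names invented for this example, not part of mitmproxy.

```python
import ipaddress


def parse_connect(request_line):
    """Return (host, port) from a CONNECT request line."""
    method, target, _version = request_line.split()
    if method.upper() != "CONNECT":
        raise ValueError("not a CONNECT request")
    host, _, port = target.rpartition(":")
    # Strip the brackets used around IPv6 literals.
    return host.strip("[]"), int(port)


def reveals_hostname(host):
    """True if the CONNECT target is a name rather than an IP literal."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    for line in ("CONNECT example.com:443 HTTP/1.1",
                 "CONNECT 10.1.1.1:443 HTTP/1.1"):
        host, port = parse_connect(line)
        print(host, port, reveals_hostname(host))
```

When reveals_hostname is false, the request tells us where to connect but not which name to put in the certificate.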

Mitmproxy has a cunning mechanism that smooths this over - upstream certificate sniffing. As soon as we see the CONNECT request, we pause the client part of the conversation, and initiate a simultaneous connection to the server. We complete the SSL handshake with the server, and inspect the certificates it used. Now, we use the Common Name in the upstream SSL certificates to generate the dummy certificate for the client. Voila, we have the correct hostname to present to the client, even if it was never specified.

Complication 2: Subject Alternative Name

Enter the next complication. Sometimes, the certificate Common Name is not, in fact, the hostname that the client is connecting to. This is because of the optional Subject Alternative Name field in the SSL certificate that allows an arbitrary number of alternative domains to be specified. If the expected domain matches any of these, the client will proceed, even though the domain doesn't match the certificate Common Name. The answer here is simple: when we extract the CN from the upstream cert, we also extract the SANs, and add them to the generated dummy certificate.
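To make the sniffing step concrete, here is a small sketch of fetching an upstream certificate and reading the CN and SANs from it with Python's ssl module. It is not mitmproxy's implementation. The names sniff_upstream_cert and names_from_cert are invented for this example, and the sketch assumes the upstream certificate chains to a CA the local machine trusts, because the standard library only returns a decoded certificate when the chain has been validated.

```python
import socket
import ssl


def sniff_upstream_cert(host, port=443, sni=None, timeout=5.0):
    """Connect upstream, complete the handshake, and return the decoded cert."""
    ctx = ssl.create_default_context()
    # We may only know an IP address, so validate the chain but skip the
    # hostname check. This is an assumption of the sketch.
    ctx.check_hostname = False
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni) as tls:
            return tls.getpeercert()


def names_from_cert(cert):
    """Return (common_name, [dns_sans]) from a getpeercert()-style dict."""
    cn = None
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                cn = value
    sans = [value for kind, value in cert.get("subjectAltName", ())
            if kind == "DNS"]
    return cn, sans
```

Calling names_from_cert(sniff_upstream_cert("10.1.1.1")) would yield the names to copy into the dummy certificate, even though the client only gave an IP address.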

Complication 3: Server Name Indication

One of the big limitations of vanilla SSL is that each certificate requires its own IP address. This means that you couldn't do virtual hosting where multiple domains with independent certificates share the same IP address. In a world with a rapidly shrinking IPv4 address pool this is a problem, and we have a solution in the form of the Server Name Indication extension to the SSL and TLS protocols. This lets the client specify the remote server name at the start of the SSL handshake, which then lets the server select the right certificate to complete the process.

SNI breaks our upstream certificate sniffing process, because when we connect without using SNI, we get served a default certificate that may have nothing to do with the certificate expected by the client. The solution is another tricky complication to the client connection process. After the client connects, we allow the SSL handshake to continue until just after the SNI value has been passed to us. Now we can pause the conversation, and initiate an upstream connection using the correct SNI value, which then serves us the correct upstream certificate, from which we can extract the expected CN and SANs.

There's another wrinkle here. Due to a limitation of the SSL library mitmproxy uses, we can't detect that a connection hasn't sent an SNI request until it's too late for upstream certificate sniffing. In practice, we therefore make a vanilla SSL connection upstream to sniff non-SNI certificates, and then discard the connection if the client sends an SNI notification. If you're watching your traffic with a packet sniffer, you'll see two connections to the server when an SNI request is made, the first of which is immediately closed after the SSL handshake. Luckily, this is almost never an issue in practice.
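For readers curious about where the SNI value lives on the wire, the sketch below reads it directly out of the raw bytes of a ClientHello. This is a different technique from the one described above, which relies on the SSL library to hand over the value. It assumes the whole ClientHello arrives in a single TLS record, and the names extract_sni and sample_client_hello are made up for the example.

```python
import ssl
import struct


def extract_sni(data):
    """Return the SNI hostname in a raw TLS ClientHello record, or None."""
    try:
        # Record header: type(1) version(2) length(2). 0x16 means handshake.
        if data[0] != 0x16:
            return None
        pos = 5
        # Handshake header: type(1) length(3). 0x01 means ClientHello.
        if data[pos] != 0x01:
            return None
        pos += 4
        pos += 2 + 32                       # client version and random
        pos += 1 + data[pos]                # session id
        (suites_len,) = struct.unpack_from("!H", data, pos)
        pos += 2 + suites_len               # cipher suites
        pos += 1 + data[pos]                # compression methods
        (ext_total,) = struct.unpack_from("!H", data, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", data, pos)
            pos += 4
            if ext_type == 0:               # server_name extension
                # list length(2), name type(1), name length(2), name
                name_type = data[pos + 2]
                (name_len,) = struct.unpack_from("!H", data, pos + 3)
                if name_type != 0:
                    return None
                return data[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error):
        return None
    return None


def sample_client_hello(hostname):
    """Produce a real ClientHello in memory, without touching the network."""
    ctx = ssl.create_default_context()
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass                                # expected: no server has replied
    return outgoing.read()


if __name__ == "__main__":
    print(extract_sni(sample_client_hello("example.com")))
```

The name sits in the very first message the client sends, which is why the client handshake only needs to run a short way before the upstream connection can be made with the right SNI value.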

Putting it all together

Let's put all of this together into the complete explicitly proxied HTTPS flow.

1 The client makes a connection to mitmproxy, and issues an HTTP CONNECT request.
2 Mitmproxy responds with a 200 Connection Established, as if it has set up the CONNECT pipe.
3 The client believes it's talking to the remote server, and initiates the SSL connection. It uses SNI to indicate the hostname it is connecting to.
4 Mitmproxy connects to the server, and establishes an SSL connection using the SNI hostname indicated by the client.
5 The server responds with the matching SSL certificate, which contains the CN and SAN values needed to generate the interception certificate.
6 Mitmproxy generates the interception cert, and continues the client SSL handshake paused in step 3.
7 The client sends the request over the established SSL connection.
8 Mitmproxy passes the request on to the server over the SSL connection initiated in step 4.

Transparent HTTP

When a transparent proxy is used, the HTTP/S connection is redirected into a proxy at the network layer, without any client configuration being required. This makes transparent proxying ideal for those situations where you can't change client behaviour - proxy-oblivious Android applications being a common example.

To achieve this, we need to introduce two extra components. The first is a redirection mechanism that transparently reroutes a TCP connection destined for a server on the Internet to a listening proxy server. This usually takes the form of a firewall on the same host as the proxy server - iptables on Linux or pf on OSX. Once the client has initiated the connection, it makes a vanilla HTTP request, which might look something like this:

GET /index.html HTTP/1.1

Note that this request differs from the explicit proxy variation, in that it omits the scheme and hostname. How, then, do we know which upstream host to forward the request to? The routing mechanism that has performed the redirection keeps track of the original destination for us. Each routing mechanism has a different way of exposing this data, so this introduces the second component required for working transparent proxying: a host module that knows how to retrieve the original destination address from the router. In mitmproxy, this takes the form of a built-in set of modules that know how to talk to each platform's redirection mechanism. Once we have this information, the process is fairly straightforward.

1 The client makes a connection to the server.
2 The router redirects the connection to mitmproxy, which is typically listening on a local port of the same host. Mitmproxy then consults the routing mechanism to establish what the original destination was.
3 Now, we simply read the client's request...
4 ... and forward it upstream.
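On Linux with iptables redirection, the "consult the routing mechanism" step can be sketched as below. This is an illustration for IPv4 only, assuming the connection was redirected by an iptables REDIRECT rule; the function names are invented for the example. SO_ORIGINAL_DST is the socket option netfilter exposes for this purpose, and its value (80) comes from the Linux kernel headers.

```python
import socket
import struct

# From <linux/netfilter_ipv4.h>. Python's socket module does not export it.
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into (ip, port)."""
    # Layout: family(2) port(2, network order) address(4) padding(8)
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(client_sock):
    """Ask netfilter where a redirected connection was originally headed."""
    raw = client_sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

A transparent proxy would call original_destination on each accepted socket and then connect to the returned address to forward the request. Other platforms, such as pf on OSX, expose the same information through entirely different interfaces.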

Transparent HTTPS

The first step is to determine whether we should treat an incoming connection as HTTPS. The mechanism for doing this is simple - we use the routing mechanism to find out what the original destination port is. By default, we treat all traffic destined for ports 443 and 8443 as SSL.

From here, the process is a merger of the methods we've described for transparently proxying HTTP, and explicitly proxying HTTPS. We use the routing mechanism to establish the upstream server address, and then proceed as for explicit HTTPS connections to establish the CN and SANs, and cope with SNI.

1 The client makes a connection to the server.
2 The router redirects the connection to mitmproxy, which is typically listening on a local port of the same host. Mitmproxy then consults the routing mechanism to establish what the original destination was.
3 The client believes it's talking to the remote server, and initiates the SSL connection. It uses SNI to indicate the hostname it is connecting to.
4 Mitmproxy connects to the server, and establishes an SSL connection using the SNI hostname indicated by the client.
5 The server responds with the matching SSL certificate, which contains the CN and SAN values needed to generate the interception certificate.
6 Mitmproxy generates the interception cert, and continues the client SSL handshake paused in step 3.
7 The client sends the request over the established SSL connection.
8 Mitmproxy passes the request on to the server over the SSL connection initiated in step 4.

  1. I use "SSL" to refer to both SSL and TLS in the generic sense, unless otherwise specified. 

pathod 0.9

I've just released pathod 0.9, my toolset for crafting malicious and interesting HTTP traffic. Apart from the usual range of stability improvements and bugfixes, this release introduces a major new set of features: proxy support. Pathoc, the client, has sprouted support for vanilla proxy connections, and is also able to tunnel through proxies using CONNECT. Pathod, the server, will now respond to proxy requests as well as straight HTTP, and will treat CONNECT requests as SSL with on-the-fly generation of dummy certificates.

The Pathod changes in particular open a whole new range of possibilities for fuzzing and other mischief. Any client with proxy support can be directed at Pathod, which can then impersonate the upstream server and return the creatively malicious response of your choice.

There have also been some organizational changes. This is the first release based on netlib, the gonzo networking library pathod now shares with mitmproxy. Over the next while, pathod and mitmproxy will move closer together. As a sign of this, the major version numbers between these projects are now synchronized.

mitmproxy 0.9

I'm happy to announce the release of mitmproxy 0.9. This is a major release, with huge improvements to mitmproxy pretty much across the board. So much has happened in the year since the last release that it's difficult to pick out the headlines. Mitmproxy is now faster, more scalable, and works in more tricky corner cases than ever before. Full transparent mode support has landed for both Linux and OSX. Content decoding is much nicer, with a slew of new targets like AMF and Protocol Buffers. We now have a WSGI container that allows you to host web apps right in the proxy. In addition to this, there is a myriad of new features, bugfixes and other small improvements.

There are also changes afoot in the project itself. As a first step, I've moved mitmproxy from the GPLv3 to an MIT license. I hope that this will make it easier for people to use the project in more contexts. Keep an eye out for more changes along these lines soon, geared to broadening participation in the project.

Changelog

  • Upstream certs mode is now the default.
  • Add a WSGI container that lets you host in-proxy web applications.
  • Full transparent proxy support for Linux and OSX.
  • Introduce netlib, a common codebase for mitmproxy and pathod (http://github.com/cortesi/netlib).
  • Full support for SNI.
  • Color palettes for mitmproxy, tailored for light and dark terminal backgrounds.
  • Stream flows to file as responses arrive with the "W" shortcut in mitmproxy.
  • Extend the filter language, including ~d domain match operator, ~a to match asset flows (js, images, css).
  • Follow mode in mitmproxy ("F" shortcut) to "tail" flows as they arrive.
  • --dummy-certs option to specify and preserve the dummy certificate directory.
  • Server replay from the current captured buffer.
  • Huge improvements in content views. We now have viewers for AMF, HTML, JSON, Javascript, images, XML, URL-encoded forms, as well as hexadecimal and raw views.
  • Add Set Headers, analogous to replacement hooks. Defines headers that are set on flows, based on a matching pattern.
  • A graphical editor for path components in mitmproxy.
  • A small set of standard user-agent strings, which can be used easily in the header editor.
  • Proxy authentication to limit access to mitmproxy.

Google has finally shut down a service I actually care about - Google Reader will die a graceless, undignified death on July 1, 2013. The only way Google could inconvenience me more would be to shut down search itself, and yet - I'm not angry that Google is shutting Reader down. I'm furious that they ever entered the RSS game at all. Consider this quote from a TechCrunch article in January 2006. Here, Michael Arrington ends an article about the shutdown of a feed reader service with a statement that seems truly bizarre today:

The RSS reader space is becoming hyper competitive, with dozens of different choices for readers.

A hyper competitive space with dozens of choices? Reader made its first public appearance a couple of months before this, in October 2005. I remember this period well - it was a time of immense excitement, when RSS seemed to be the future, the news ecosystem was vibrant, and this thing called the blogosphere, fueled by peer subscription, was doubling in size every six months. It was into this magic garden that Google wandered, like a giant toddler leaving destruction in its wake. Reader was undeniably a good product, but its best quality was also its worst: it was free. Subsidized by Google's immense search profits, it never had to earn its keep, and its competitors started to die. Over time, the "hyper competitive" RSS reader market turned into a monoculture. Today, on the eve of its shutdown, RSS more or less means "Google Reader" to a large fraction of readers, to the extent where even the best feed readers on iOS are just Google Reader clients[1].

The sudden shock of Reader's closure will harm a news ecosystem that I already believe to be deeply ill. Google Reader is not just a core part of my information diet - it's also the most direct channel I have to readers of this blog. As of today, the Reader subscriber count for corte.si stands at about 3 times the total number of other subscribers combined. Some of these readers will migrate to other services and stay in touch, but many will inevitably abandon the idea of direct subscription to blogs entirely. In the next few months, tens of thousands of small blogs will lose direct contact with a large fraction of their readers.

The truth is this: Google destroyed the RSS feed reader ecosystem with a subsidized product, stifling its competitors and killing innovation. It then neglected Google Reader itself for years, after it had effectively become the only player. Today it does further damage by buggering up the already beleaguered links between publishers and readers. It would have been better for the Internet if Reader had never been at all.


  1. Yes, I'm aware that there are a few hardy outliers still playing in this place. My own logs show that their reach is insignificant, though, and when I tried to shift my subscriptions about a year ago, there was nothing as good as Reader itself. Once NewsBlur's servers have recovered, I definitely plan to give it another shot. 

I've been doing a series of posts looking at data gathered with ghrabber, a simple tool I wrote that lets you grab files matching a search specification from GitHub. Last week, I looked at shell history in the broad, and then specifically at pipe chains. Today, I move on to something different - custom aspell dictionaries. When aspell finds a word it doesn't recognize, the user is prompted to correct it, ignore it, or add it to a custom dictionary so that it will be recognized as correct in future. These words are written to the user's custom dictionary - a file named .aspell_en_pw that lives in the user's home directory. It turns out that 30 people have checked aspell dictionaries into GitHub, containing a total of 9501 custom words. The chart below shows the top 50 words, with the X-axis showing the percentage of files the word appeared in.

There were a few requests for the raw data behind the previous two posts, so this time round you can also download a CSV file with the occurrence totals for each word in the dataset.

Earlier this week I published ghrabber, a simple tool that lets you grab files matching an arbitrary search specification from GitHub. I used ghrabber to retrieve all the bash_history and zsh_history files accidentally checked in to repos, and took a light look at the dataset with some simple graphs. In total, I obtained 234 shell history files with 165k individual command entries. This is a very rare opportunity to "shoulder-surf", to actually see what people do at the command prompt, and perhaps get some insights into how to improve things.

Along those lines, today's post looks at pipe chains - that is, compound commands that pipe the output of one command to another. The pipe operator lies at the core of the Unix command-line philosophy. The fact that we can easily compose complex operations is the reason why we are able to write small tools that "do one thing well" without losing generality. The shell history data on GitHub can give us some real data about what people do with composed commands, and how they do it.

It turns out that about 2% of all commands issued on the command-line use pipes. The graph above shows the prevalence of the most common pipe chains - that is, what percentage of the users in my sample used each chain. There's a lot of fascinating stuff we can read straight from this image.
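For anyone who wants to try this on their own history files, here is a rough sketch of how pipe chains and their per-user prevalence could be computed with the standard library. It is not the code used to produce the graph. The function names are invented, each history is assumed to be a list of plain command strings, and the tokenising is heuristic (a quoted argument that is exactly "|" would be miscounted, for instance).

```python
import shlex
from collections import Counter


def pipe_chain(command):
    """Return the program names in a pipeline, e.g. ('ps', 'grep'), or None."""
    try:
        tokens = list(shlex.shlex(command, posix=True, punctuation_chars="|"))
    except ValueError:
        return None                 # unbalanced quotes and similar junk
    segments, current = [], []
    for token in tokens:
        if token == "|":            # "||" arrives as a single token, so it is skipped
            segments.append(current)
            current = []
        else:
            current.append(token)
    segments.append(current)
    if len(segments) < 2 or any(not seg for seg in segments):
        return None
    return tuple(seg[0] for seg in segments)


def chain_prevalence(histories):
    """Map each chain to the percentage of history files it appears in."""
    users = Counter()
    for commands in histories.values():
        # Count each chain once per file, however often it was used.
        users.update({pipe_chain(c) for c in commands} - {None})
    return {chain: 100.0 * n / len(histories) for chain, n in users.items()}


if __name__ == "__main__":
    sample = {
        "alice": ["ps aux | grep ssh", "ls"],
        "bob": ["ls | grep py", "ps -ef | grep vim"],
    }
    print(chain_prevalence(sample))
```

Counting a chain once per file rather than once per invocation is what makes the result a measure of how many users use a chain, not how often.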

Starting at the top, the first thing we notice is how widely used the ps | grep chain is. About 17% of users in my sample used this chain - given the type of data we have, the real-world prevalence would surely be higher still. I've just been extolling the virtues of small tools and composability, but in this case practicality should beat purity. I suggest that everyone should have a command-alias similar to this in their shell configuration:

alias pg="ps aux | grep"

I've added this to my .zshrc today, and I've already used it twice.

Next up, we have the ls | grep pipes. The vast majority of uses here could actually be accomplished using the shell's filename generation mechanism. This ranges from simple redundancies like grepping for file extensions, to performing quite complex matching operations that could be done using the shell's advanced glob operations. I'm guilty of this myself - I rarely use features like recursive globbing, expansions using character ranges, case insensitive globbing, and so forth. I've brushed up on filename expansion for my chosen shell, and perhaps you should too.

The last thing I want to point out is a pattern that's genuinely dangerous - curl | bash, along with its cousins curl | sh and wget | sh. Unfortunately, this has become the recommended installation pattern for some tools - the vast majority of invocations here are for RVM and Yeoman. I don't think it's a good idea to pipe anything from the web straight into a local shell, but the situation is made particularly dire by the fact that almost half of these invocations are either over plain HTTP or explicitly turn certificate validation off.

I'll stop here, although there are interesting things to say about nearly every entry in the graph above. Next week, I'll move on from the shell history sample, and look at some other juicy datasets extracted using ghrabber.

GitHub recently introduced hugely improved code search, one of those rare moments when a service I use adds a feature that directly and measurably improves my life. Predictably, there was soon a flurry of breathless stories about the security implications. This shouldn't have been news to anyone - by now, it should be clear that better search in almost any context has security or privacy implications, a law of the universe almost as solid as the second law of thermodynamics. We saw this with Google's own code search, as well as Google proper, Facebook's Graph Search and even Bing. A certain fraction of people will always make mistakes, and any sufficiently powerful search will allow bad guys to find and take advantage of the outliers.

After the dust had settled a bit I started wondering what else we could do with GitHub's search - other than snookering schmucks who checked in their private keys. I'm always enticed by data, and the combination of search and the ability to download raw checked-in files seemed like a promising avenue to explore. Let's see what we can come up with.

ghrabber - grab files from GitHub

First, some tooling. I've just released ghrabber, a simple tool that lets you grab all files matching a search specification from GitHub. Here, for instance, is an obvious wheeze - fetching all files with the extension ".key":

./ghrabber.py "extension:key"

Downloaded files are saved locally to files named user.repository. Existing files with the same name are skipped, which means that you can reasonably efficiently stop and resume a ghrab.

Shell history files

I've been having a lot of fun exploring GitHub with ghrabber. I'll return to this in future posts - today I'll start with a quick illustration of what can be done. One type of difficult-to-find information that is sometimes checked in to repos is shell history. Two simple ghrabber commands for the two most popular shells are all we need:

./ghrabber.py "path:.bash_history"

and

./ghrabber.py "path:.zsh_history"

After cleaning the data a bit, I had 234 history files varying in length from 1 line to just over 10 thousand, containing a total of 165k entries. I fed this into Pandas for analysis, parsing each command using a combination of hand-hacked heuristics and the built-in shlex module. The remainder of this post is a light exploration of some approaches to this dataset, steering clear of the obvious and tediously well-covered security implications.

One way to slice the data is to look at the percentage of history files a given command appears in. This gives us a nice listing of the top commands by user prevalence, which you can see in the graph on the left above. On the right, I've taken the same list of commands, and checked how many invocations are preceded by a man lookup for the command. This gives us an idea of which commonly-used commands have difficult or unintuitive interfaces. It's interesting that ln is right at the top of the list, considering how simple the command syntax is. My theory is that everyone forgets the order of the source and target files.
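The man-lookup measure is easy to approximate. The sketch below is not the Pandas code behind the graphs, just a standard-library illustration under some assumptions: each history is a list of raw lines, zsh extended-history prefixes of the form ": 1362000000:0;" are stripped with a regular expression, and "preceded by a man lookup" is taken to mean that the immediately previous line was a man command naming the same program. All function names are invented.

```python
import re
import shlex
from collections import Counter

# Timestamp prefix written by zsh when extended history is enabled.
ZSH_PREFIX = re.compile(r"^: \d+:\d+;")


def parse(line):
    """Split a history line into tokens, falling back to a naive split."""
    line = ZSH_PREFIX.sub("", line.strip())
    try:
        return shlex.split(line)
    except ValueError:
        return line.split()


def man_lookup_rate(histories):
    """Fraction of each command's invocations that directly follow 'man <cmd>'."""
    invocations, after_man = Counter(), Counter()
    for lines in histories:
        previous = []
        for line in lines:
            tokens = parse(line)
            if not tokens:
                continue
            command = tokens[0]
            invocations[command] += 1
            if previous[:1] == ["man"] and command in previous[1:]:
                after_man[command] += 1
            previous = tokens
    return {c: after_man[c] / invocations[c]
            for c in invocations if c != "man"}


if __name__ == "__main__":
    sample = [
        ["man ln", "ln -s a b", "ln -s c d"],
        [": 1362000000:0;ls -la"],
    ]
    print(man_lookup_rate(sample))
```

On a real dataset you would probably want to restrict the result to the most widely used commands, since rare commands with a single man lookup will otherwise dominate the top of the list.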

Since we have a list of the most widely used commands, it's also trivial to do silly popularity comparisons. Above is the obvious look at the state of the editor wars (vim is winning, folks), and a check on how tmux is doing in supplanting screen (the faster the better).

Another interesting thing to do is to look at the most commonly used flags to commands. I think having "real data" of command use may well guide us to design better command-line interfaces. I'd love to know the most common invocation flags for some of the tools I write.

I'll stop there. The data pool in this case is very deep, and there is a huge range of interesting bits of command-line ethnography that could be done. Stay posted for more in the coming weeks.

There is something terribly awry with the social news ecosystem. This is a feeling that's been growing on me over the last few years, and is the reason why I've cut both Reddit and Hacker News (which together constitute pretty much all of "social news") out of my information diet. Although I've mulled over things in various conversations, I've never actually tried to put my feeling of unease in writing, until today. What's spurring me into action is a proposal by Yann LeCun that a model similar to social news be adopted for scientific peer review - self-assembled Reviewing Entities voting on streams of submitted papers, regulated by a reputation system for authors and reviewers. Basically, this is science a la Reddit: complete with subreddits, karma and upboats. I find the idea frankly terrifying.

I guess it's time, then, to put finger to keyboard and lay out what disquiets me about social news.

Karma Corrupts

You start by introducing a reputation mechanism like karma to improve some outcome - say, to increase the quality of comments, or to apply a threshold to restrict voting to trustworthy community members. This seems like a plausible and even elegant mechanism at first, until you discover the terrible side-effects.

Humans are fundamentally status-seeking social apes, and you've now introduced a visible measure of social worth that people will be driven to maximize. In the real world, we have a word for those who spend their lives accumulating karma - we call them politicians. And so, within karma communities, we see the rise of a political class - persuasive centrists who cater (perhaps unconsciously) to a constituency, and who express (perhaps eloquently) opinions calculated to appeal to the masses and avoid controversy. Hacker News and many subreddits are dominated by people like this, whose comments are largely predictable and rarely add anything new or unexpected to the conversation.

At the bottom end of the food chain, we have a different class of creature with the same basic aim as the politicians, but without the persuasive charm needed to pull off the political approach. These are the karma whores, who use a mixture of frank pandering, provocation and calculated outrage to achieve the same aims.

The karma maximization game often acts contrary to the goals we aimed to achieve by introducing karma in the first place: the tenor of the community suffers, the diversity of opinion declines, and the karma whores post pictures of their cats everywhere.

The Lossy Sieve

Go and have a look at the new story submission queue on Hacker News. Scroll through a few pages, and pay attention to the stories stuck at one vote - they will most likely never receive another upvote and will die in obscurity. Now, go look at the front page. When I do this exercise I'm struck by the fact that there's plenty of crap on the front page, and quite a bit of good stuff in the submission queue languishing in obscurity. So, quality can't be the sole metric here - what determines what gets onto the front page and what doesn't?

Let's try a thought experiment. First, set up a small number of voting accounts - say, 10 or so. Now, in the new submission queue, pick 5 random stories every hour, and give them a small number of upvotes soon after they are submitted. I predict that you will find that stories that received this small initial boost are vastly more likely to end up on the front page. If I'm right, then chance dominates story selection - as long as an article exceeds some basic quality threshold, it all depends on who happens to see the story soon after it is submitted, and whether the spirit moves them to vote. Note that this is not the case at the extremes - frankly bad content won't be upvoted, and really important stories will usually find their way to the top. The lossy sieve phenomenon affects everything in between.

What this boils down to is that social news doesn't provide an effective filter - good content gets lost, and mediocre content finds its way onto our screens.

The Pinhole Effect

In social news, the front page is king. Most users never go beyond the first or second page of top stories. However, front-page real estate is incredibly limited compared to the volume of submissions on most popular subreddits and on Hacker News. The effect of this is that we're looking at a fast-flowing river of information through a pinhole. Even assuming that the selection mechanism works flawlessly, what you see on the front page is a small sliver of the total, chosen through a consensus mechanism that takes no account of individual variation in tastes and interests. The news you see is not tailored to you - it's tailored to some abstract, average participant, with all the rough edges of individuality smoothed away. The effect of this is that even at its best, the stories that emerge from the social news system feel like a predictable pablum dished up by the hivemind. The subreddit system tries to improve this by allowing communities to self-assemble around interests, but the pinhole effect still dominates in busy subreddits like /r/programming.

Gaming The System

Social news systems are eminently gameable, and cheating is rife. Part of the reason for this is that a story's destiny depends on a relatively small number of votes. If your story has any merit at all, you significantly increase the likelihood that it will end up on the front page by giving it a small nudge at the beginning of its life. If it has no merit whatsoever, you can still force it onto people's screens with a few tens or hundreds of votes. Conversely, you can use the same effect to censor and oppress views you disagree with if your social news site has downvotes. Anyone who's kept an eye on these things can rattle off examples of gaming in action: the voting rings, the "social media consultants", the vigilante thought-polizei, the political operators, and dozens of other types of manipulation and villainy. What's more - these visible scandals are just the tip of the iceberg. Eyeballs are valuable, and there's an active arms race with social news sites on the one side, and a dark army of spammers, scammers and true believers on the other. How much of what we see is affected by this type of cheating? We just don't know, but my suspicion is that the effect is significant.

The point here is broader than any particular instance of gaming. It's that social news sites are structurally susceptible to manipulation in ways that can't be fixed without changing the core of their operation. A system like this might be good enough to deliver rage comics, but I feel queasy trusting it any further.

Community Collapse Disorder

My final beef with social news is a problem that it shares with pretty much all online communities, especially technical ones. We're all familiar with the life-cycle of technical forums. They start with a small community of insiders who create value, which then attracts more people to participate, which then dilutes the quality of the contributions (and often introduces a few pathological bad actors), which then causes the good contributors to move on, which causes the magic well to dry up. Everyone then takes their toys and moves to the next community, and the cycle repeats. We saw this with Usenet and the original C2 wiki, and we are seeing it now with Hacker News and many technical subreddits, all at various points in this life-cycle.

I believe that Community Collapse Disorder is one of the Big Problems online that we don't yet have a satisfactory solution to. People are trying, though. Hacker News, for instance, seems to be rather poignantly aware of its own decline, with some of the best of the old-timers calling for an alternative. Paul Graham himself recognizes the issue, and has been tweaking things in various ways to combat the phenomenon, without much success.

At the moment, we just don't know how to build online communities that are both inclusive and stable. Democracy, here, seems to lead inevitably to decline, and social news sites are no exception.

A better way forward?

A big part of the reason I don't use social news anymore is that my existing social networks have become so much more effective at turning up good content. The absolute best source of news for me is simply the set of links shared by the folks I follow on Twitter. I follow people who post interesting content, and whom I trust to act as information filters for me. Most of them share my technical interests, but some are interesting because they are from my home town, or because they share some more esoteric pursuit with me. So, the news stream I see is exactly tailored to me. At the same time, there is also room for idiosyncrasy - if someone I follow shares something left-field that tickles their fancy, I'll see it. In turn, I try to be a responsible information filter for those who follow me - I find a link or two worth tweeting on most days.

There are still things I miss - Twitter is great for sharing links, but is an awful medium for technical discussion. Google+ could be a better alternative, but just doesn't seem to have achieved liftoff for me. I would also love better tools for aggregating and harvesting links from my social network. At the moment I use Flipboard and Prismatic, but I have issues with both. On the whole, though, these are quibbles. It seems to me that using social networks to filter news is a better way forward - if I was tackling the social news problem, I'd be building tools to support this process.