A couple of months ago, I wrote a post following one of my blog posts through the social news grist mill. In it, I bemoaned the fact that social news seems to be displacing more old-fashioned person-to-person connections on the 'net - 33,000 unique visitors to my blog resulted in only 41 new Google Reader subscribers. Google Reader is so completely dominant in this space that ignoring everything else was good enough as a first approximation, but I made a mental note to come up with a more complete figure.

So, yesterday I hacked up a little tool called subscount to help. It parses Apache-style logs to make a best guess at feed subscriber numbers, and emits a snippet of JavaScript that can be used to show subscriber numbers on statically rendered sites like my blog.

Estimating feed subscribers from web server logs

Broadly speaking, there are four different groups of feed retrievers we need to deal with:

  • Well behaved aggregators that report a feed ID and the number of end subscribers in the user agent string. In my case, these are Google Reader, FriendFeed and NetVibes. There's no standard governing this, but there are so few significant players that I just catered manually for all the variations.

  • Poorly behaved aggregators that report a subscriber number, but no feed ID. An example here is PostRank. Again, there are a small number of these, so subscount handles them with a hand-coded set of rules.

  • Individual subscribers using tools like Akregator and NetNewsWire. In this case, we distinguish between subscribers by IP address, which should be good enough as long as we keep the analysis time window to a day or so.

  • A myriad of automated feed consumers. These are mostly poorly behaved, and rarely identify themselves properly. Weeding them out would be nearly impossible, so we treat them just like individual subscribers.
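
To make the four groups concrete, here is a minimal sketch of how one log entry might be classified. It is not subscount's actual code: the function name classify, the regular expressions, and the fallback of keying ID-less aggregators by agent name are all assumptions for illustration, and real user agent formats vary between aggregators.

```javascript
// Hypothetical patterns: real aggregators each format these fields differently.
const SUBSCRIBERS = /(\d+)\s+subscribers?/i;
const FEED_ID = /feed-id=([\w-]+)/i;

// Map one feed request to { id, subscribers }.
function classify(ip, userAgent) {
  const subs = SUBSCRIBERS.exec(userAgent);
  if (!subs) {
    // No reported count: an individual reader or an unidentified bot,
    // distinguished by IP address and counted as one subscriber.
    return { id: "ip:" + ip, subscribers: 1 };
  }
  const count = parseInt(subs[1], 10);
  const feed = FEED_ID.exec(userAgent);
  if (feed) {
    // Well behaved aggregator: the feed ID is the unique identifier.
    return { id: "feed:" + feed[1], subscribers: count };
  }
  // Count but no feed ID: key on the agent name (an assumed fallback).
  const name = userAgent.split(/[\/;\s(]/)[0];
  return { id: "agent:" + name, subscribers: count };
}
```

In practice each aggregator would probably need its own hand-written rule rather than one generic pattern, which is the approach described above.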

When subscount traverses a log file, it calculates a unique identifier and a number of subscribers for each retriever of the feed. For individual subscribers, the ID is the IP address, and the number of subscribers is 1. For aggregators, we use the reported feed ID, and the reported number of subscribers. We use the unique ID to make sure we don't count anyone more than once, and simply tot up the numbers at the end.
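
The counting step can be sketched roughly as follows. This is an illustration under assumptions, not the tool's real implementation: totalSubscribers is a made-up name, and keeping the largest count seen for a repeated ID is my guess at a sensible rule, since the same aggregator may report slightly different numbers across fetches.

```javascript
// Each retriever is an object { id, subscribers }, one per log entry.
function totalSubscribers(retrievers) {
  const seen = new Map();
  for (const { id, subscribers } of retrievers) {
    // Count each unique ID once. Assumption: if an ID reports several
    // different counts, keep the largest one.
    const prev = seen.get(id) || 0;
    seen.set(id, Math.max(prev, subscribers));
  }
  // Tot up the per-ID counts.
  let total = 0;
  for (const n of seen.values()) total += n;
  return total;
}

const example = [
  { id: "feed:1", subscribers: 12 },
  { id: "feed:1", subscribers: 14 },    // same feed fetched again later
  { id: "ip:1.2.3.4", subscribers: 1 },
  { id: "ip:1.2.3.4", subscribers: 1 }, // same individual, second fetch
  { id: "ip:5.6.7.8", subscribers: 1 },
];
console.log(totalSubscribers(example)); // prints 16
```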

Needless to say, the figure we come up with is just an estimate - but I think it's probably a reasonable one. As expected, the figures show that Google Reader alone is responsible for 68% of my subscribers.

Deploying subscount

I thought it would be neat to report the number of feed subscribers I have next to the feed icon on my blog, so I extended subscount to help. The -j flag to subscount takes a DOM element ID, like so:

./subscount -p "/rss.xml" -j subscriber_div /var/log/mylog

subscount then prints a snippet that sets the content of the specified element to the subscriber number, like this:

function _subs(){
    var subsdiv = document.getElementById("subscriber_div");
    if (subsdiv)
        subsdiv.innerHTML = ("947");
};
window.onload = _subs;
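
For illustration, a snippet of this shape could be produced by a small template function like the one below. This is a hedged sketch rather than how subscount itself generates its output; renderSnippet is a hypothetical name, and quoting the values with JSON.stringify is my own choice.

```javascript
// Build the JavaScript snippet for a given element ID and subscriber count.
function renderSnippet(elementId, count) {
  // JSON.stringify produces correctly quoted and escaped string literals.
  const id = JSON.stringify(String(elementId));
  const text = JSON.stringify(String(count));
  return [
    "function _subs(){",
    "    var subsdiv = document.getElementById(" + id + ");",
    "    if (subsdiv)",
    "        subsdiv.innerHTML = (" + text + ");",
    "};",
    "window.onload = _subs;",
  ].join("\n");
}

console.log(renderSnippet("subscriber_div", 947));
```

One thing to be aware of with a snippet of this form: assigning to window.onload replaces any handler a page has already set there, so a page with other load-time scripts might prefer window.addEventListener("load", _subs) instead.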

I run subscount from cron just after log rotation every night (it can read gzipped log files directly), and pipe the output to a file in my blog's web root. I then simply load this file with a script tag, and voilà! - dynamically updated subscriber numbers for my statically rendered site. You can see the results in my sidebar.

The code

As usual, the code is available on GitHub.