The rest of this post takes a basic look at the numbers for 12 languages. I had to leave some out for lack of data. Haskell, for example, didn't make the cut with only 18 projects. Ah, well.
Let's look at the numbers.
Let's start with a quick overview of the basics of the dataset.
First, the sample size. Clearly, GitHub is very popular with the Ruby crowd, with more than four times as many projects as Python, the runner-up. The sample sizes for C#, Erlang and Scala are pretty small, so the results for these languages aren't as firm as for the others.
Here we see the median number of commits for projects in each language - in some senses, we can view this as a proxy for project age. Most projects have around 75 commits. The Perl and C++ data, however, stands out - projects in these languages typically have a much longer commit history. I suspect that this is due to a decline in the popularity of these languages. Recall that I collected data only for projects that had recent commits. If fewer new projects are created in C++ and Perl, we would expect projects in these languages to be older, on average.
This chart shows the median commit size, in lines of code. We take the total commit size to be the sum of the lines inserted and the lines deleted, as reported by "git log --shortstat". Most commits touch around 19 lines of code. The C# outlier is probably due to the small sample size. I suspect that the differences in this graph are a reflection of basic language verbosity, with Objective-C, C++ and Java being more verbose, and Perl, Python and Ruby being less so.
Most commits touch about 4 files, with C++ touching somewhat more, and Perl, Python and Ruby somewhat less. The C# outlier is probably due to small sample size.
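For anyone who wants to reproduce the commit size and file count figures, here is a minimal sketch of how the summary lines printed by "git log --shortstat" could be turned into numbers. This is not the script I used for the post; the names SHORTSTAT and commit_sizes are made up for illustration, and the sketch assumes that no commit message consists solely of text that looks like a shortstat summary.

```python
import re
import statistics

# Matches summary lines such as "3 files changed, 10 insertions(+), 2 deletions(-)".
# Git leaves out the insertions or deletions part when that count is zero.
SHORTSTAT = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)

def commit_sizes(log_text):
    """Return a (files_changed, lines_touched) pair for each shortstat line."""
    sizes = []
    for line in log_text.splitlines():
        match = SHORTSTAT.fullmatch(line.strip())
        if match:
            files = int(match.group(1))
            inserted = int(match.group(2) or 0)
            deleted = int(match.group(3) or 0)
            # Commit size is lines inserted plus lines deleted.
            sizes.append((files, inserted + deleted))
    return sizes

# Hypothetical log excerpt, only to show the three shapes the summary line takes.
sample = """\
commit abc
 3 files changed, 10 insertions(+), 2 deletions(-)
commit def
 1 file changed, 1 insertion(+)
commit ghi
 2 files changed, 5 deletions(-)
"""

sizes = commit_sizes(sample)
median_lines = statistics.median(lines for _, lines in sizes)
median_files = statistics.median(files for files, _ in sizes)
```

The median is used rather than the mean because a handful of enormous commits (vendored libraries, generated files) would otherwise dominate the result.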
An average of only 5 commits - that's not much. Let's look at this from a different perspective - graphing the percentage of the total commits to a project made by contributors.
The percentage of commits by contributors is shown on the Y axis, and the matching f-value on the X axis. An f-value of 25 is the bottom quartile, 50 is the median, and 75 is the upper quartile. Looking at the Python graph, for example, we can see that the bottom 75% of contributors provided a bit less than 20% of the commits. The shape of these graphs gives us our first take-away: For all languages, a small fraction of the committers do the vast majority of the work. This won't be news to anyone in the Open Source community. More interesting, though, is the fact that C, C++ and Perl projects are significantly more "top-heavy" than those in other languages, with a smaller core of contributors doing more of the work.
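To make the f-value idea concrete, here is a small sketch of the calculation for a single project. It assumes you already have a list of commit counts, one per contributor; the function name commit_share and the example numbers are invented for illustration, and how the per-project curves are combined into one curve per language is not shown here.

```python
def commit_share(commit_counts, f):
    """Fraction of all commits made by the bottom f percent of contributors."""
    ordered = sorted(commit_counts)        # least active contributors first
    cutoff = len(ordered) * f // 100       # how many contributors fall below f
    total = sum(ordered)
    return sum(ordered[:cutoff]) / total if total else 0.0

# Hypothetical project: eight contributors, 100 commits in total.
counts = [50, 1, 3, 30, 1, 8, 2, 5]

share_at_75 = commit_share(counts, 75)   # bottom six contributors: 20 of 100 commits
share_at_50 = commit_share(counts, 50)   # bottom four contributors: 7 of 100 commits
```

Plotting commit_share for every f from 0 to 100 gives a curve of the kind described above. The further it hugs the bottom of the graph before shooting up at the right-hand edge, the more top-heavy the project.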
How projects evolve
This graph shows the number of commits per day, over the first 300 days of a project's life. To prevent skew, I only included projects that are at least 300 days old. The red line is a smoothed curve. C and Perl projects show a marked decline in activity over their first year. I suspect that the Perl result is due to the fact that the bigger a Perl codebase gets, the harder it becomes to contribute to it. The C result is more of a mystery.
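As a rough sketch of how a commits-per-day series like this could be built for one project: bucket each commit by the number of days since the project's first commit, then smooth the result. The smoothing below is a plain centered moving average, which is a stand-in and not necessarily the method behind the red line; the function names and the 7-day window are assumptions for illustration, and the commit dates must be a non-empty list.

```python
from collections import Counter
from datetime import date

def daily_commits(commit_dates, days=300):
    """Commits per day for the first `days` days after the first commit."""
    start = min(commit_dates)
    counts = Counter((d - start).days for d in commit_dates)
    # Days without any commit count as zero; commits after `days` are ignored.
    return [counts.get(offset, 0) for offset in range(days)]

def moving_average(values, window=7):
    """Centered moving average; the window shrinks near both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical project with three commits in its first few days.
dates = [date(2010, 1, 1), date(2010, 1, 1), date(2010, 1, 3)]
series = daily_commits(dates, days=4)
trend = moving_average(series, window=3)
```

Averaging the per-project series across all projects in a language, day by day, would then give one curve per language.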
And now for something silly.
This shows the number of swearwords used per 1000 commits. Objective-C and Perl programmers are the most foul-mouthed. Java coders are more restrained, possibly because the language is more corporate, and they're afraid of having their pay docked.
You can play too
I had fun playing with this dataset, and I've barely scratched the surface of what could be done with it. I'll probably squeeze another blog post or two out of the data, but in the meantime, I'm making the full database available so people can point out the many mistakes and shortcomings of my analysis. At the time of writing, I still have the checked-out repositories, so if you have suggestions for refinements or expansions to the data, let me know.
You can check the database out here. Be warned, though - it's about 100 MB of data.