The impact of language choice on GitHub projects


Although I spend a lot of my play-time fooling about with other languages, my professional and released code consists of Python, C, C++ and, alas, Javascript. I've lived in this tiny corner of the magic garden of modern software development for 10 years, and I'm itching to strike out in a different direction for my next project. With this in mind, I've started to wonder about the impact of language choice on the development process. Are there major differences between projects in different languages? Is it possible to quantify these differences?

I decided to try to gather some hard numbers. I started by writing a small script to watch the public timeline on GitHub. Over a period of weeks, I collected a list of about 30 thousand active projects. Using the GitHub API, I eliminated projects with fewer than 3 watchers, on the basis that these are likely to be small personal repositories like dotfiles, programming exercises and so forth. After this, I was left with some 5000 repositories, which I checked out, giving me about 55 GB of data to work with. The next step was to analyse the data, extracting commits, committers and line counts for each file type contained in each project. Lastly, I got rid of duplicate projects by looking for matching commit hashes. From start to end, this process took more than a week to complete.

The end result is a database consisting of 3 400 repositories, 20 000 authors, and 1.5 million commits. I'm releasing the dataset for others to play with - see the bottom of this post for information.
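The de-duplication step, dropping projects that share commit hashes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the function name, the input shape (a dict of repository name to commit hashes) and the rule of keeping the repository with the longest history are all hypothetical.

```python
def dedupe_repos(repo_commits):
    """Return repo names with forks/duplicates removed.

    repo_commits: dict mapping repo name -> iterable of commit hashes.
    Assumption: two repos are duplicates if they share any commit hash,
    and the copy with the most commits is the one worth keeping.
    """
    seen = set()   # every commit hash belonging to a repo we kept
    kept = []
    # Visit the largest histories first; ties are broken by name.
    order = sorted(repo_commits, key=lambda n: (-len(set(repo_commits[n])), n))
    for name in order:
        hashes = set(repo_commits[name])
        if hashes & seen:
            continue  # shares history with a repo we already kept
        seen |= hashes
        kept.append(name)
    return kept


repos = {
    "a/proj": ["h1", "h2", "h3"],
    "b/proj-fork": ["h1", "h2"],
    "c/other": ["x1"],
}
kept = dedupe_repos(repos)
```

Here the fork is dropped because its hashes already appear in the larger repository.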

The rest of this post takes a basic look at the numbers for 12 languages. I had to leave some out for lack of data. Haskell, for example, didn't make the cut with only 18 projects. Ah, well.

Let's look at the numbers.

The Basics

Let's start with a quick overview of the basics of the dataset.

Sample size

First, the sample size. Clearly, GitHub is very popular with the Ruby crowd: there are more than four times as many Ruby projects as there are projects in Python, the runner-up. The sample sizes for C#, Erlang and Scala are pretty small, so the results for these languages aren't as firm as for the others.

Median contributors

This graph shows the median number of contributors to projects in each language. The red line here and in the graphs below is the median for all projects in the dataset. Most projects have around 3 contributors, with Perl and Java projects having about 5, and Javascript and Objective C around 2.

Median commits

Here we see the median number of commits for projects in each language - in some senses, we can view this as a proxy for project age. Most projects have around 75 commits. The Perl and C++ data, however, seem significant - projects in these languages typically have a much longer commit history. I suspect that this is due to a decline in the popularity of these languages. Recall that I collected data only for projects that had recent commits. If fewer new projects are created in C++ and Perl, we would expect projects in these languages to be older, on average.

Median commit size

This chart shows the median commit size, in lines of code. We take the total commit size to be the sum of lines inserted and the lines deleted, as reported by "git log --shortstat". Most commits touch around 19 lines of code. The C# outlier is probably due to the small sample set. I suspect that the differences in this graph are a reflection of basic language verbosity, with Objective C, C++ and Java being more verbose, and Perl, Python and Ruby being less so.
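As a rough sketch of how commit size can be pulled out of "git log --shortstat", the Python below parses the summary line git prints for each commit and takes insertions plus deletions as the size. The function name and the sample lines are illustrative assumptions; only the general shape of the shortstat summary line is relied on.

```python
import re
from statistics import median

# A shortstat summary looks like:
#   " 3 files changed, 10 insertions(+), 2 deletions(-)"
# The insertions or deletions part is omitted when it is zero.
_FILES = re.compile(r"(\d+) files? changed")
_INSERTED = re.compile(r"(\d+) insertions?\(\+\)")
_DELETED = re.compile(r"(\d+) deletions?\(-\)")


def parse_shortstat(line):
    """Return (files_touched, lines_inserted + lines_deleted), or None
    if the line is not a shortstat summary."""
    files = _FILES.search(line)
    if files is None:
        return None
    inserted = _INSERTED.search(line)
    deleted = _DELETED.search(line)
    size = (int(inserted.group(1)) if inserted else 0) + \
           (int(deleted.group(1)) if deleted else 0)
    return int(files.group(1)), size


log_lines = [
    "commit 1234abcd",
    " 3 files changed, 10 insertions(+), 2 deletions(-)",
    " 1 file changed, 1 insertion(+)",
    " 2 files changed, 30 deletions(-)",
]
stats = [s for s in map(parse_shortstat, log_lines) if s is not None]
median_size = median(size for _, size in stats)
median_files = median(files for files, _ in stats)
```

The same parsed tuples give the number of files touched per commit, so one pass over the log covers both measures.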

Median files touched per commit

Most commits touch about 4 files, with C++ touching somewhat more, and Perl, Python and Ruby somewhat less. The C# outlier is probably due to small sample size.

The Contributors

Median commits per contributor

This shows the median number of commits each contributor makes to a project. The typical contributor makes about 5 commits. C, Objective C and Ruby developers contribute somewhat fewer, and PHP, C#, Java and Javascript developers somewhat more. I suspect the results for C and Ruby are due to projects in these languages receiving more one-off contributions.

A median of only 5 commits - that's not much. Let's look at this from a different perspective - graphing the percentage of the total commits to a project made by contributors.

% commits vs % contributors

The percentage of commits by contributors is shown on the Y axis, and the matching f-value on the X axis. An f-value of 25 is the bottom quartile, 50 is the median, and 75 is the upper quartile. Looking at the Python graph, for example, we can see that the bottom 75% of contributors provided a bit less than 20% of the commits. The shape of these graphs gives us our first take-away: For all languages, a small fraction of the committers do the vast majority of the work. This won't be news to anyone in the Open Source community. More interesting, though, is the fact that C, C++ and Perl projects are significantly more "top-heavy" than those in other languages, with a smaller core of contributors doing more of the work.
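To make the reading of these graphs concrete, here is a small Python sketch that computes one point on such a curve for a single project: the share of commits made by the bottom N% of contributors. The function name and the sample counts are made up for illustration, and the way contributors are rounded into a percentile is an assumption.

```python
def bottom_share(commit_counts, percent):
    """Percentage of all commits made by the least active `percent`
    of contributors.

    commit_counts: one entry per contributor, the number of commits
    that contributor made to the project.
    """
    counts = sorted(commit_counts)            # least active first
    k = int(len(counts) * percent / 100)      # contributors in the bottom slice
    total = sum(counts)
    if total == 0:
        return 0.0
    return 100.0 * sum(counts[:k]) / total


# Hypothetical project: three drive-by contributors and one core developer.
counts = [1, 1, 1, 17]
share_75 = bottom_share(counts, 75)   # bottom 75% of contributors
share_50 = bottom_share(counts, 50)   # bottom half
```

In this toy project the bottom 75% of contributors account for 15% of the commits, the same kind of top-heavy shape the graphs show.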

How projects evolve

Contributors vs Commits

This dot plot shows the total number of contributors vs the total number of commits for each project. I've restricted the X and Y values - we're effectively looking at the bottom-left corner of a larger dataset. The red line is a loess fitted curve. Over a large number of projects, we can consider the number of commits to be a measure of time - the graph effectively shows how quickly projects tend to accumulate contributors over their lifespan. Ruby projects recruit contributors astoundingly well, with Python a close second. Java, Javascript and PHP projects, on the other hand, do particularly badly. The fact that the fitted curve is a nice straight line with a consistent slope shows that these results hold for young and old projects alike. Note that the Scala data is not significant - that nice straight line is an extrapolation by the curve fitting algorithm, which is not backed up by information.

Commit age

This graph shows the number of commits per day, over the first 300 days of a project's life. To prevent skew, I only included projects that are 300 days or older. The red line is a smoothed curve. C and Perl projects show a marked decline in activity over their first year. I suspect that the Perl result is due to the fact that it becomes harder and harder to contribute to a Perl codebase, the bigger it gets. The C result is more of a mystery.

The Silly

And now for something silly.

Swearwords per 1000 commits

This shows the number of swearwords used per 1000 commits. Objective C and Perl programmers are the most foul-mouthed. Java coders are more restrained, possibly because the language is more corporate, and they're afraid of having their pay docked.
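For the curious, a rate like this takes only a few lines of Python to compute. The word list below is a deliberately tame, hypothetical stand-in, and the function name is likewise illustrative; the list actually used for the chart is not reproduced here.

```python
import re

# Hypothetical, very mild word list standing in for the real one.
SWEARWORDS = {"damn", "crap"}


def swears_per_1000(messages):
    """Number of swearwords per 1000 commit messages."""
    if not messages:
        return 0.0
    hits = sum(
        1
        for message in messages
        for word in re.findall(r"[a-z']+", message.lower())
        if word in SWEARWORDS
    )
    return 1000.0 * hits / len(messages)


messages = [
    "Fix the damn build",
    "Add tests",
    "crap, damn typo",
    "Update README",
]
rate = swears_per_1000(messages)
```

Note that this counts every occurrence, so one especially frustrated commit message can contribute several hits.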

The Caveats

There are all sorts of reasons why you should take all of this with a grain of salt. There are many factors that make GitHub projects atypical - not least of which is the use of Git for source control. The way that I collected data skews the dataset in favor of projects with recent commits - unfortunately dead projects aren't included. I detected a project's primary language based purely on line counts per file extension. Due to the large number of projects that include Javascript libraries in their repos wholesale, I had to apply a fudge-factor weighting to .js files to get reasonably sensible results.
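The language detection described above might look roughly like the Python sketch below. Everything specific in it is an assumption made for illustration: the extension table is a tiny hypothetical subset, and the 0.25 weight on .js files is an invented value, not the fudge factor actually used.

```python
# Hypothetical subset of an extension -> language table.
EXT_LANG = {
    ".py": "Python",
    ".rb": "Ruby",
    ".js": "Javascript",
    ".c": "C",
    ".h": "C",
}

# Invented weight: count each line of .js as a quarter of a line, to
# offset Javascript libraries that are bundled into repos wholesale.
WEIGHTS = {".js": 0.25}


def primary_language(lines_by_ext):
    """Pick a project's primary language from its line counts.

    lines_by_ext: dict mapping file extension -> total lines of code.
    Returns the language with the largest weighted line count, or None
    if no known extension is present.
    """
    totals = {}
    for ext, lines in lines_by_ext.items():
        lang = EXT_LANG.get(ext)
        if lang is None:
            continue  # unknown file type, ignore it
        totals[lang] = totals.get(lang, 0.0) + lines * WEIGHTS.get(ext, 1.0)
    if not totals:
        return None
    return max(totals, key=totals.get)


# A Ruby project that vendors a large Javascript library.
project = {".rb": 3000, ".js": 8000, ".txt": 500}
language = primary_language(project)
```

Without the weighting, this example project would be classified as Javascript on raw line count alone.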

You can play too

I had fun playing with this dataset, and I've barely scratched the surface of what could be done with it. I'll probably squeeze another blog post or two out of the data, but in the meantime, I'm making the full database available so people can point out the many mistakes and shortcomings of my analysis. At the time of writing, I still have the checked out repositories, so if you have suggestions for refinements or expansions to the data, let me know.

You can check the database out here. Be warned, though - it's about 100 MB of data.