In this post, we’re going to have a look at some data about the tags used on Stack Overflow to label questions, their frequencies and what we can measure around them.
For heavy Stack Overflow users, which probably means the vast majority of people working in tech, having posts appropriately tagged is vital to navigate the immense archive that the site is. At the time of writing (January 2018), Stack Overflow contains more than 15 million questions, with around 54000 unique tags. The help page here is meant to give some guidelines on how best to use this feature, as it’s super important to tag your question as well as possible so that it can be found by the relevant people who can answer it. Tags are names of technologies, like programming languages, frameworks and software, or of technical elements and components, like coding techniques or general computer science topics.
Bits and bobs of history
Stack Overflow was launched in September 2008, according to the Wikipedia page, and is now well established as the go-to Q&A resource for programming. One year later, in September 2009 (still according to Wikipedia), its successful model was extended to Stack Exchange, a general Q&A network divided into sub-sites about many topics (many of which are outside tech). All in all then, the network has been around for about 8 or 9 years now, so we’re all preparing to celebrate its 10th anniversary soon. On this historical note, it’s quite interesting to read this article called Find the answer to anything with StackExchange, from back in 2009 when the network was just born.
Fetching some data
To query the Stack Overflow API, we’ll use the Python package StackAPI. Note that the API allows 10000 requests every 24 hours.
We’ll fetch the 5000 most frequent tags (those with the highest usage counts), because the API limits mean we can’t fetch them all in one go: if we split the requests over multiple days to fetch everything, the usage counts would change in the meantime and we wouldn’t have a consistent picture. Anyway, 5000 seems way more than enough to see something interesting.
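As a rough sketch of this step, here is how the fetch could look with StackAPI. It assumes the client's `page_size` and `max_pages` attributes and its `fetch` method; the helper names `fetch_top_tags` and `make_site` are ours for illustration and are not taken from the repo.

```python
def fetch_top_tags(site, n_tags=5000, page_size=100):
    """Return the n_tags most used tags as (name, count) pairs.

    `site` is expected to behave like a stackapi.StackAPI instance:
    it exposes `page_size`, `max_pages` and a `fetch` method.
    """
    site.page_size = page_size
    # Ceiling division: number of pages needed to cover n_tags.
    site.max_pages = -(-n_tags // page_size)
    # sort="popular" orders the tags endpoint by usage count.
    response = site.fetch("tags", sort="popular", order="desc")
    items = response["items"][:n_tags]
    return [(item["name"], item["count"]) for item in items]


def make_site():
    """Build a real client (needs the StackAPI package and network access)."""
    from stackapi import StackAPI  # imported lazily so the sketch loads without it
    return StackAPI("stackoverflow")
```

Calling `fetch_top_tags(make_site())` would then issue 50 paged requests of 100 tags each, which is a small fraction of the daily quota.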
All the methodologies followed and the results are reported in this repo, where there is also a Jupyter notebook for those interested. For the figures, note that we use the XKCD theme for Matplotlib and that we tried to reproduce Stack Overflow’s colours!
Tags and their usage - rich gets richer
In fact, the distribution of frequencies follows a power law: if you fit the trend of frequency vs. rank (Zipf’s law) over the inner bulk of the distribution, you end up with a power-law trend. Does this hint at a rich-gets-richer phenomenon?
Zipf’s law is what governs the frequency vs. rank trend for words in a language: a power-law trend with slope -1 means that the second most frequent word is half as frequent as the first one, the third is one third as frequent, and so on. We are seeing something similar in the frequency of tags here.
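To make the slope concrete, here is a minimal sketch of a frequency vs. rank fit: an ordinary least-squares line through the points in log-log space. It is not necessarily the fitting procedure used for the figures (it fits all points rather than only the inner bulk), and the function name is ours.

```python
import math


def fit_power_law_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies must be strictly positive; rank 1 is the most frequent item.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts following an exact Zipf law: frequency = 10000 / rank.
zipf_counts = [10000 / rank for rank in range(1, 501)]
slope = fit_power_law_slope(zipf_counts)  # very close to -1 by construction
```

On real tag counts you would pass the list of usage counts and, if you want to mimic the fit over the bulk, slice off the head and tail of the ranked list first.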
In the preferential attachment (also known more informally as “rich gets richer”) model for text generation, when you choose which words to use, you choose an existing word with probability proportional to its frequency, so that more frequent words have a higher probability of being reused, and this generates a power-law trend for the frequencies. In the case of our tags, we infer there might be a similar mechanism underlying what we see, and this could depend on a number of reasons.
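The mechanism is easy to simulate. The sketch below is a toy version in the spirit of Simon's model: at each step a brand-new word appears with a small probability, otherwise an existing word is reused with probability proportional to its current frequency. The parameter values and names are illustrative assumptions, not something measured on the tags.

```python
import random
from collections import Counter


def simulate_rich_gets_richer(n_tokens, new_word_prob, seed=0):
    """Generate word counts where reuse is proportional to frequency."""
    rng = random.Random(seed)
    tokens = [0]          # start with a single word, labelled 0
    next_word = 1
    while len(tokens) < n_tokens:
        if rng.random() < new_word_prob:
            tokens.append(next_word)   # introduce a brand-new word
            next_word += 1
        else:
            # Picking a past token uniformly picks a word with
            # probability proportional to how often it was used.
            tokens.append(rng.choice(tokens))
    return Counter(tokens)


counts = simulate_rich_gets_richer(n_tokens=20000, new_word_prob=0.05)
top = [freq for _, freq in counts.most_common(5)]
```

Sorting `counts.values()` in decreasing order and plotting them against rank on log-log axes should show the heavy-tailed shape discussed above.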
Old gets richer?
We can also ask the data whether old gets richer as well, that is to say, whether older tags attract more posts. This seems reasonable, as over time there will simply be more and more posts on existing tags. As simple as that, but how true is it?
On this, we need to note that the tags/ endpoint of the API does not provide the creation date of a tag, even though the info page of a tag does. So we assigned each tag the creation date of the first question that appeared with that tag (querying the search/ endpoint). Note that questions can always be edited, including being retagged, so an old question may have its tags modified when a new tag first appears. This is why some tags have a creation date that is older than the technology they represent (see for instance AngularJS, born in 2010 but whose tag is from 2009).
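A possible way to code this lookup is sketched below. It assumes a StackAPI-like client (`page_size`, `max_pages`, `fetch`) and that the search endpoint accepts `tagged`, `sort="creation"` and `order="asc"`; the function names are hypothetical.

```python
from datetime import datetime, timezone


def tag_first_seen(site, tag):
    """Creation date (UTC) of the oldest question carrying `tag`, or None."""
    site.page_size = 1   # only the oldest question is needed
    site.max_pages = 1
    response = site.fetch("search", tagged=tag, sort="creation", order="asc")
    items = response["items"]
    if not items:
        return None
    # The API reports creation_date as Unix epoch seconds.
    return datetime.fromtimestamp(items[0]["creation_date"], tz=timezone.utc)


def age_in_days(first_seen, reference=datetime(2018, 1, 11, tzinfo=timezone.utc)):
    """Whole days between the tag's first question and the retrieval date."""
    return (reference - first_seen).days
```

This costs one request per tag, so 5000 tags fit within the 10000 requests allowed in 24 hours.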
The plot below shows a scatter of the usage counts of tags vs. their age (in days; again, this data was retrieved on 11 January 2018, so it’s the number of days until then). It also displays a binned average of the usage counts, with bins 200 days wide.
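For clarity, a binned average of this kind can be computed as in the following sketch, where each tag falls into a 200-day bin according to its age and the usage counts in each bin are averaged. The input numbers are invented purely to show the shape of the output.

```python
def binned_mean(ages, counts, bin_width=200):
    """Average usage count per age bin; returns {bin_start: mean_count}."""
    sums, sizes = {}, {}
    for age, count in zip(ages, counts):
        start = (age // bin_width) * bin_width   # left edge of the bin
        sums[start] = sums.get(start, 0) + count
        sizes[start] = sizes.get(start, 0) + 1
    return {start: sums[start] / sizes[start] for start in sorted(sums)}


# Made-up ages (days) and usage counts, not real tag data.
example = binned_mean([10, 150, 250, 390, 3300], [100, 300, 50, 150, 9000])
```

Bins that contain no tags are simply absent from the result.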
What is noticeable is the large jump in the last bin, which holds the oldest tags. Apart from some jumpy behaviour, there is the expected effect of old tags getting richer. But the oldest ones get way richer than the rest, so there is a spike there: the average goes from around 3000 in the second-to-last bin to 15000 in the last one! This suggests that in roughly the first 200 days of the site’s life, the tags covering the vast majority of popular topics were created, and they keep getting huge traction. Within around a year of its birth, Stack Overflow had seen the creation of most of the topics that are still super popular and of interest today.
Tags can have synonyms. When a question gets tagged with something that is just a variation of an existing tag, it gets mapped to the mother tag automatically, so the usage counts go towards the mother tags, which are the ones we retrieved. Note that when you query the API for a tag which is a synonym of a mother tag, you get the usage counts for the mother tag.
The distribution of the number of synonyms per tag is shown in the figure: again a very skewed situation, where just a few tags have large sets of synonyms.
For reference, the tag with the most synonyms (25) is css (things like font-weight, dynamic-css, inline-block).
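Counting synonyms per tag is a one-liner once the synonym records are at hand. The sketch below assumes each record carries `from_tag` and `to_tag` fields, which is how we recall the API's synonym objects; treat the field names as an assumption to check against the API documentation. The last sample record is a placeholder, not a real synonym.

```python
from collections import Counter


def synonyms_per_tag(synonym_items):
    """Count how many synonyms point at each target (mother) tag."""
    return Counter(item["to_tag"] for item in synonym_items)


# Hand-written records in the shape we assume the API returns.
sample = [
    {"from_tag": "font-weight", "to_tag": "css"},
    {"from_tag": "inline-block", "to_tag": "css"},
    {"from_tag": "some-alias", "to_tag": "some-tag"},   # hypothetical placeholder
]
per_tag = synonyms_per_tag(sample)
```

`per_tag.most_common(1)` then gives the tag with the largest set of synonyms in the records supplied.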
Now, would you expect tags that have synonyms to receive higher frequency counts? Intuitively, a tag with synonyms is a tag that can be expressed in multiple flavours, maybe a broad and composite technology, which would mean that people might be interested in different parts of it. We ran a quick t-test to determine whether the means of the distributions of usage counts for tags with and without synonyms are significantly different, and they actually are, with said means sitting at 21500 and 3200 respectively: in general terms, tags with synonyms are about 7 times as popular as those with no synonyms.
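For readers who want to see what such a comparison involves, here is a standard-library sketch of a two-sample test on the means. It assumes Welch's unequal-variance variant, which may not be the exact test we ran, and it approximates the p-value with a normal distribution, which is only reasonable for large samples such as thousands of tags. The sample numbers are invented.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Welch's t statistic, degrees of freedom and an approximate p-value."""
    var_a = variance(sample_a) / len(sample_a)   # squared standard errors
    var_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    # Normal approximation to the t distribution (assumption: large samples).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, dof, p_value


# Invented usage counts, tiny on purpose, only to exercise the function.
with_synonyms = [21000, 18000, 25000, 22000, 21500]
without_synonyms = [3000, 3500, 2900, 3300, 3300]
t_stat, dof, p_value = welch_t_test(with_synonyms, without_synonyms)
```

With a library such as SciPy available, its t-test routine would give an exact p-value from the t distribution instead of this approximation.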
A note on time trends
A small word of caution
The assumptions behind taking these temporal trend results at face value are that Stack Overflow is a representative community of the technologists around, that questions are correctly tagged and that the data is not biased. As an extreme case, the popularity of a tag might be due to the lack of good documentation for the related tool, and not necessarily to how popular and interesting it is among users: lots of users on the site are just popping in to get help with a problem for which they can’t find good references anywhere else.
All in all though, a great tool to see how the world of tech is moving!