The human brain has 100 billion neurons, each neuron connected to 10 thousand other neurons. Sitting on your shoulders is the most complicated object in the known universe.
Michio Kaku, in an interview (2014)
For a long time, scientists believed the human brain was a single, continuous chunk of matter, organised as a reticulum with small floating organs behaving in an undifferentiated way. It was Santiago Ramón y Cajal (1852-1934), a Spanish medical researcher, who illustrated, quite literally, that it is instead composed of individual unit cells connected in such a way as to form a hierarchical system capable of performing highly specialised tasks.
Ramón y Cajal wasn’t born into a privileged or well-known family and he came from a small village, but his father - a doctor - introduced him to the science and practice of medicine, although this was still the time when surgeons and barbers were one and the same: you went for a haircut to the same place where you went for a wound suture or a tooth removal - indeed, the “barber pole” with its white and red colours owes its meaning to this. Santiago Ramón y Cajal moved to Zaragoza to study medicine as a young man and then spent his career in several notable cities in Spain, most importantly Madrid and Barcelona. He was also called to serve in the Ten Years’ War (1868-1878) against Cuban rebels, performing medical duties for the army.
From an early age he displayed an amazing talent for drawing, and over the years he produced many very detailed and beautiful illustrations of brain matter which became pivotal to the emergence of what became known as the “neuron theory”. His illustrations are used to this day, and indeed they are beautiful.
Ramón y Cajal perfected Camillo Golgi’s staining method, which allowed for the clear visual isolation of just a subset of the cells in neural matter by virtue of a process known at the time as the “black reaction”, which made just a few cells appear coloured in black. This way, he could put what he saw on paper, which allowed him to understand how neural matter is really structured.
His work wasn’t really appreciated at the start - quite the opposite, indeed. On top of the fact that it proved really hard to dislodge the common conviction within the scientific community that the brain wasn’t made of individual cells, he also suffered outright discrimination within scientists’ circles (which were very elitist) due to his modest upbringing. It took a while for the neuron theory to be accepted: Golgi himself, in his Nobel lecture (text available here), states “I shall therefore confine myself to saying that, while I admire the brilliancy of the doctrine which is a worthy product of the high intellect of my illustrious Spanish colleague, I cannot agree with him on some points of an anatomical nature …” and delivered a whole speech critical of the idea.
In any case, his is a story of resilience, talent and rigour. In 1893 he published a book called “Nuevo concepto de la histología de los centros nerviosos”, which was immediately translated into several languages and became the basis for a new way of looking at the study of the nervous system.
The word “neuron” was coined around the same time by Wilhelm Waldeyer after studying these new developments. In 1906 Ramón y Cajal shared the Nobel Prize in Physiology or Medicine with Golgi himself (who, as we saw, still wasn’t convinced by his work …). After that, he became a national legend, but with his humble, service-oriented personality he always kept a low profile, devoting his time to teaching and research and setting up organisations to help young Spanish researchers advance in their work and gain international recognition. He really believed in science; he looks to me like a genuinely passionate scientist who didn’t care for politics and fame.
I don’t think he is very well known to the general public, and that’s a shame given the importance of his discoveries, so I thought I’d write a few lines.
The neurobiology of the brain has inspired the idea to create “artificial” networks of “artificial” neurons that could function as mechanisms to learn patterns. Neurons transmit information by virtue of electrical and chemical signals that propagate through the cell and act at the interfaces, the synapses. In an artificial paradigm aimed at creating a modelled representation of a neuron, we can envisage units that emit an impulse (“fire”), hence passing information to neighbours, or stay quiet.
🚨 It is super important to stress though that artificial neurons and networks have never been designed to “imitate” the brain. As F Chollet writes in his book “Deep Learning with Python”,
“The term neural network is a reference to neurobiology, but although some of the central concepts in deep learning were developed in part by drawing inspiration from our understanding of the brain, deep-learning models are not models of the brain. There’s no evidence that the brain implements anything like the learning mechanisms used in modern deep-learning models. You may come across pop-science articles proclaiming that deep learning works like the brain or was modeled after the brain, but that isn’t the case.”
Biology worked as an inspiration, not as something to resemble.
You know I like to doodle to show concepts. In the following, rather than doodling on paper I’ve used a brilliant tool called Excalidraw, which allows you to create quick illustrations with a handmade feel.
An artificial neuron can be thought of, in its bare bones, as an entity that takes several inputs, say \(n\) of them, call them \((x_1, x_2, \ldots, x_n)\), and generates one output by applying a function \(f\) to them:
\(f\) is known as the activation function and it can vary depending on the type of neuron one is building - we will see a very simple version in the following section. Before being passed to the activation function, the inputs are added up, so effectively the output is given by \(f(x_1 + x_2 + \cdots + x_n)\).
In 1943, W. McCulloch and W. Pitts (yep, it’s the prehistory of AI) developed a simple mathematical model for an artificial computing unit which used the idea illustrated above, but assumed that inputs and outputs are binary (0 or 1) and that the activation function is a step function with a threshold \(t\).
This is the step function:
which mathematically writes as
\[f(x) = \begin{cases} 1 & x \geq t \\ 0 & x < t \end{cases}\]

The McCulloch-Pitts unit was the basis of the simplest neural network built, the Perceptron.
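To make this concrete, here is a minimal sketch (my own illustrative code, not taken from any library) of such a unit in Python: the binary inputs are summed and compared against the threshold.

```python
def mcp_unit(inputs, threshold):
    """McCulloch-Pitts unit: fire (1) if the sum of the binary inputs
    reaches the threshold, stay quiet (0) otherwise."""
    return 1 if sum(inputs) >= threshold else 0

print(mcp_unit([1, 1], threshold=2))  # 1: the unit fires
print(mcp_unit([1, 0], threshold=2))  # 0: the unit stays quiet
```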
This paradigm is really simplistic and no longer the one used today, but it is nevertheless extremely educational to see visually how we can compute simple functions, like logical ones. I have used Rojas’ book (see references) as the basis for this part.
In the following, we’ll use the standard convention whereby 1 means True and 0 means False.
Let’s walk through the basic logical functions. When a function takes more than one input we will show the simplest case with just two, but the reasoning can easily be extended to a generic number of inputs. We will represent the McCulloch-Pitts computing unit/neuron as we did above, but we will split the circle into two halves, the second of which is black - this is to signify that the white part receives the inputs and the black part emits the output. Also, the threshold will be indicated on the white half. I’ve learned this convention from Rojas’ book, which itself borrows from Minsky’s work (see references).
We’ll look at the NOT, the AND and the OR.
The NOT function works on a single input - it simply flips the value of what’s in input. In terms of a logical proposition, “I go to the gym” becomes “I don’t go to the gym” and vice versa. The truth table is:
\(x\) | NOT output |
---|---|
1 | 0 |
0 | 1 |
This can be easily encoded with an artificial unit by using an inhibitory input (note the circle on the edge) and a threshold of 0:
The AND logical function (logical conjunction) with two inputs yields a True only if both are True. From the perspective of logical propositions, if I say “I go to the supermarket and I buy apples”, it means I do both things. If any of the propositions or both are false, the result is false. Truth table:
\(x_1\) | \(x_2\) | AND output |
---|---|---|
1 | 1 | 1 |
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 0 |
This can be encoded as a unit whose threshold is 2 and with two inputs:
The OR (logical disjunction) with two inputs yields a True if any or both of the inputs are True, False otherwise.
\(x_1\) | \(x_2\) | OR output |
---|---|---|
1 | 1 | 1 |
1 | 0 | 1 |
0 | 1 | 1 |
0 | 0 | 0 |
The OR can be encoded by a unit with a threshold of 1:
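Putting the three gates together, here is a small sketch of how the units above could be written in Python. Treating any active inhibitory input as silencing the unit outright ("absolute inhibition") is my reading of the McCulloch-Pitts convention, so take this as an illustration rather than a reference implementation.

```python
def mcp_unit(excitatory, threshold, inhibitory=()):
    """McCulloch-Pitts unit with absolute inhibition: any active inhibitory
    input silences the unit; otherwise it fires when the sum of the
    excitatory inputs reaches the threshold."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

def NOT(x):
    return mcp_unit([], threshold=0, inhibitory=[x])  # inhibitory input, threshold 0

def AND(x1, x2):
    return mcp_unit([x1, x2], threshold=2)

def OR(x1, x2):
    return mcp_unit([x1, x2], threshold=1)

# Reproduce the truth tables above
for x1 in (1, 0):
    print(f"NOT({x1}) = {NOT(x1)}")
for x1 in (1, 0):
    for x2 in (1, 0):
        print(f"AND({x1},{x2}) = {AND(x1, x2)}  OR({x1},{x2}) = {OR(x1, x2)}")
```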
Watch out: in everyday language, when we use an “or” to connect two propositions we often actually mean the XOR operation (exclusive OR), which does not yield True when both inputs are True (it admits only one of the two) - like when in English you say “I’ll go to the gym or to the cinema”, better and more precisely phrased as “Either I’ll go to the gym or to the cinema”. We’d enter a linguistic discussion with this, which is out of scope.
But an important note about the XOR function is that it cannot be encoded with a single computing unit! This is because the boundary between output values cannot be a line and our little McCulloch-Pitts unit is only capable of separating outputs linearly by value. Let’s see what we mean. The truth table of the XOR reads like this:
\(x_1\) | \(x_2\) | XOR output |
---|---|---|
1 | 1 | 0 |
1 | 0 | 1 |
0 | 1 | 1 |
0 | 0 | 0 |
If we look at the decision boundaries of the previous logic functions and this one, which are the divides between output values, we have this situation:
For the XOR to be encoded, a more complex structure is needed that adds non-linearity. I’ll probably expand on this point in another post at some point, but for now, I hope you enjoyed this and as always any feedback is appreciated! Read more on all this in the great references I refer to below.
His subjects are mostly religious/biblical (well, he worked primarily under commission) and are defined by a masterful use of light contrast (a technique known as chiaroscuro) that generates powerful visual effects.
A while ago, I did a data card on Renoir’s colours, so I thought I would replicate the work for another artist very distant in time and style from the Impressionists. I have used the exact same approach: downloading images of paintings from Wikipedia (this page) and analysing their colour segmentation with a chosen palette - more details below.
As part of my data cards I (nearly) always try to recommend things that go along with them and help give some context.
Most of Caravaggio’s works are in Rome as he spent several years there. Naples however hosts one of my favourites, the “Sette opere di Misericordia” (“Seven works of Mercy”) in the Pio Monte della Misericordia, a palace (now museum) dating from the seventeenth century built by a group of youngsters engaged in charity endeavours aimed at helping the city’s underprivileged.
At some point in the early part of the century, this group commissioned a depiction of the Gospel’s works of charity from Caravaggio, who was new in town. By the way, young Caravaggio must have been quite the character: he had to flee Rome because he committed a homicide during a brawl. Really, he was a bit of a madcap and kept quite a high profile overall, living between painting and violent rioting. A letter published in The Lancet in 2018 shows evidence that he died of sepsis after a probable infection contracted as a result of a fight. He was 38 - imagine what he could have still produced had he lived a little longer.
The artwork is hosted in the chapel within the palace, which you can visit together with the museum, and I would highly recommend it. You can find the location by walking along the colourful cobblestone-paved road that is Via dei Tribunali, in the very heart of the city’s historical centre, which is also home to some of the most ancient pizzerias, likely surrounded by mopeds and various people masterfully and quickly shifting pizza doughs and espressos from one side to the other. Honestly, it takes skill and practice to do that.
At the top we see the Virgin Mary with the baby Christ and some angels; below them, the seven works of mercy are depicted.
Many of Caravaggio’s works contain several figures involved in complex acts so it is always worth taking the time to observe the details.
I’ve represented all the paintings I could extract data for (from this list on Wikipedia), amounting to a total of 92; there are a few more in the list but some failed in either downloading or passing through the colour-extraction routine.
Each painting is a line and the colours represent the fraction of pixels in the image which are assigned that colour cluster (colour occupancy). The palette used, the 12 colours shown at the bottom of the card, is the one from Chamorro-Martínez et al. (like in the case of Renoir).
On the vertical axis is time, as embodied by the three decades Caravaggio was active, from circa 1592 to circa 1610.
We can see that there are a lot of warm hues, which may be slightly counterintuitive given the dark tones of many of his backgrounds, but with this palette and this method those tones map mostly to oranges and yellows.
There are decidedly fewer blues and greens in Caravaggio than in Renoir, given the Impressionists’ typically flowery, light-filled outdoor scenes. Also, note that in the case of Renoir I had many more paintings.
All in all I would expect to retrieve similar results for any artist - the palette would have to be made more specific to actually see differentiating colour schemes.
As in the case of the Renoir card, I have been able to do this very easily thanks to the great Python library colour-segmentation - you can see all code details in this Jupyter notebook.
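For the curious, the general idea behind the colour-occupancy numbers can be sketched in a few lines of NumPy: assign every pixel to its nearest palette colour and count. This is a rough simplification of what the colour-segmentation library actually does (it uses fuzzy colour models), and the palette below is a made-up placeholder, not the Chamorro-Martínez one.

```python
import numpy as np
from PIL import Image

# Placeholder palette (RGB) - the card actually uses the 12-colour
# Chamorro-Martínez palette via the colour-segmentation library
palette = np.array([
    [0, 0, 0], [255, 255, 255], [200, 60, 30],
    [220, 160, 40], [40, 90, 160], [50, 120, 60],
], dtype=float)

def colour_occupancy(path):
    """Fraction of pixels assigned to each palette colour (nearest neighbour in RGB)."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float).reshape(-1, 3)
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(palette))
    return counts / counts.sum()

# occupancy = colour_occupancy("painting.jpg")  # one fraction per palette colour
```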
The same caveats I surfaced for Renoir apply here.
Obviously, I’ve used pictures of paintings which have likely been taken with different devices and/or edited differently, which means there is no homogeneity. Also, the list of paintings is not necessarily comprehensive of Caravaggio’s work (and several of his paintings are thought to have been lost in time) and the palette used is arbitrary.
Liked this? I have a newsletter if you want to get things like this and more in your inbox. It’s free.
A few years ago I produced some illustrative notebooks outlining the power of the Python data stack, the main one being this one, devoted to NumPy and SciPy. In there I go through some of the many features of these two libraries, mostly for the sake of giving a flavour to people who maybe haven’t used them much yet. The notebook is rendered in nbviewer at the link, but you can download it from Github here and play with it locally - just note that if you do, it will lack access to the functions I’ve placed in a common_functions module within the repo; you can scrap that part as it’s just notebook-styling choices.
I once gave a talk about data science tooling to an audience of Python engineers, and SciPy was where I had decided to focus - I used parts of that notebook. SciPy is awesome. In the notebook, you can see how easily you can use it for statistical work, for solving equations, for integrating functions, et cetera (a big et cetera: there’s a lot you can do). I still maintain that the SciPy lecture notes are wonderful and a great way to familiarise yourself with both Python and data science. To this day in 2024, to people asking for a list of things to study to get good at data science, I suggest giving yourself the opportunity to try things with your hands and understand first principles, without going into all the details at once. The basics of stats and maths can be explored exactly with, e.g., SciPy. In any case, I also maintain a list of resources (many free) on this blog, which I think are fantastic.
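As a tiny taster of the kind of things the notebook covers, here is a small sketch of SciPy in action - descriptive stats, numerical integration and root finding. The specific numbers are arbitrary and only for illustration; this isn’t code from the notebook itself.

```python
import numpy as np
from scipy import stats, integrate, optimize

# Statistics: sample from a normal distribution and run a normality test
sample = stats.norm.rvs(loc=170, scale=10, size=1000, random_state=0)
print(sample.mean(), sample.std())
print(stats.shapiro(sample))  # Shapiro-Wilk test statistic and p-value

# Integration: area under the standard normal pdf between -1.96 and 1.96
area, _ = integrate.quad(stats.norm.pdf, -1.96, 1.96)
print(area)  # ~0.95

# Root finding: solve x^3 - x - 2 = 0 on the interval [1, 2]
print(optimize.brentq(lambda x: x**3 - x - 2, 1, 2))
```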
One thing I’ve encountered in the fast.ai course is the interact Jupyter (well, IPython) widget, which allows you to add immediate interactivity to your notebook cells, enabling some simple UI controls. In the fast.ai course, the instructor uses it to illustrate the agreement between some points, extracted from a parabolic trend with added random noise, and the analytical form of a parametric parabola, the parameters being what you can change in the UI with the interact controls. I think it’s a great way to give people a sense of what a function looks like when you change the numbers.
I’ve added a quick section to the NumPy/SciPy notebook (given this is still about mathematical manipulations) where I did the same, but playing around with a simple linear function - so, two parameters:
\[f(x) = ax + b\]

Note that you can also have the plot title change according to the choice.
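For reference, a minimal sketch of what such a cell could look like - this is the general ipywidgets pattern, not necessarily the exact code in my notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

@interact(a=(-5.0, 5.0, 0.5), b=(-5.0, 5.0, 0.5))
def plot_line(a=1.0, b=0.0):
    """Redraw f(x) = ax + b every time the sliders for a and b move."""
    x = np.linspace(-10, 10, 200)
    plt.plot(x, a * x + b)
    plt.ylim(-20, 20)
    plt.title(f"f(x) = {a}x + {b}")
    plt.show()
```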
Isn’t this wonderful?
This concert screams celebration, opulence, culture, history and lots of light. It has a long and prestigious pedigree, albeit not without shadows. Every year, it is led by a famous conductor and presents a series of pieces from the heyday of the Austro-Hungarian empire, most notably featuring a lot of the Strauss family.
Incidentally, if like me you forget who’s the father and who’s the son, this is the genealogy:
Note: Richard Strauss has nothing to do with them; he was German and one of the major representatives of Romanticism.
Music from the Strausses was overall considered “popular”: they composed a lot of polkas, quadrilles, galops, waltzes, etc. - folk dance tunes from Central Europe, from Poland to Bohemia and beyond. The New Year’s concert is a big celebration of these musical forms, even though we may now perceive it as quite “aristocratic”. The show always ends with three encores: a fast polka, the “Blue Danube” (“An der schönen blauen Donau” in the original) and the “Radetzky March”.
I thought of many things before settling on this card, but eventually resorted to drawing a quick one where I show the counts of musical pieces by composer for the last five years’ worth of concerts. By piece here I mean the occurrence, not the unique title. Programmes for past concerts are easily retrievable online.
It’s fairly simple: Johann Strauss II (the son) eclipses everyone else with 34 pieces overall - note that the “Blue Danube” is his. Following is his brother Josef with 20 pieces, and the other one (Eduard) with just 8. The Vater only has 6 mentions overall, and considering that 5 (one per year) are for his “Radetzky March”, it’s quite a surprisingly low score - the sixth piece is “Venetianer-Galopp, Op. 74”, played in 2021, by the way.
As for some of the others, I personally didn’t know their names. Interestingly, Mr. Hans Christian Lumbye was a Dane who, according to Wikipedia, “In 1839, heard a Viennese orchestra play music by Johann Strauss I, after which he composed in the style of Strauss, eventually earning the nickname ‘The Strauss of the North’”. The other big outsider is of course the magnificent Beethoven, who got a mention in 2020 with his “Twelve Contredanses, WoO 14”, also dance pieces.
Happy 2024 to all of you 🎉!
Oh, I have a newsletter (see link in navigation above), powered by Buttondown; if you want to get things like this and more in your inbox, you can subscribe from here by entering your email. It’s free.
Hey there,
you're getting this because at some point you signed up to Doodling Data on its previous Substack home, which isn't active anymore. I've just migrated it here to Buttondown, so welcome again!
Quite simply: Substack has openly clarified they won't censor/ban hate speech and I'm not OK with that. Some of you may not be aware of what Substack is or what happened in the recent (prior-to-Christmas) days, so here's a quick recap:
End of Nov. 2023. The Atlantic published "Substack has a Nazi problem": the author found several Substacks (at least some of which were paid) spreading racist, hateful, white-supremacist-kind-of content. Some have clear Nazi symbols (😮). These are (mercifully) a tiny portion of the world of Substack, but they make noise (and $$, including for Substack itself). However, I bet there are more than just the ones mentioned in the article. Note: the problem isn't new.
Mid Dec. 2023. Some Substack writers (including big, notable names) wrote an open letter to the company urging an explanation.
A few days ago. The company responded, essentially saying that while they don't like Nazis either, they stand for free speech, can't be the referees of who-writes-what, and that suppressing hate speech makes the problem worse (I think that's the weakest of their arguments).
Personally, I am of the opinion that hate speech of all sorts, speech promoting the idea that some types of humans are better, speech that promulgates fake, anti-scientific and/or anti-facts stuff must be banned immediately and put in a condition not to harm society. It's the paradox of tolerance.
Heck, in places like Italy, Germany, Austria (not sure about other countries) using Nazi/Fascist symbols and paraphernalia is a criminal offence, for good reason.
There are at least two schools of thought - and this is real free speech in action, a good thing:
Those who think that if you tolerate hate speech in the name of free speech, you're in error - as per above, I stand with this lot. An excellent piece with a great analogy is "Leaving the Nazi bar" by Ben Werdmuller, who also moved to Buttondown. In the same crowd, I recommend this other piece (on Substack!). Note that this school of thought stresses that this isn't a zero-sum game, because given all this, Substack is profiting from hate speech;
Those who think that Substack is right in not taking a stance. The argument goes that in the vast sea of publications hateful ones are just a few and you, the reader, can decide whom to follow in much the same way as for other (online and non) groups; it shouldn't be a platform's role to police content. As I said, the debate is interesting (see comments in the article) but I disagree.
About a year ago, I had chosen Substack because it presented itself as a refreshing place to have both a blog and a newsletter for civil conversations undisturbed by the all-present stream of ads that pollutes most other digital experiences. Naive of me maybe, but I don't regret it - I've had a great time and I've met lots of great publications I will continue to follow nevertheless. And it has allowed me to reach many people with my writing and doodling.
But, like B. Werdmuller above, I have now moved it to Buttondown, an indie newsletter platform created by Justin Duke. It has nice and friendly takes on open source and climate contributions and a minimalistic design I am really a fan of. It publishes a public roadmap (in fact, I am looking forward to some improvements) and is transparent on costs. Plus, I can attest Justin is very responsive and friendly when you have a question!
It is paid (for me, the user) because it has to cover costs of shipping emails, and that's fair. I am more than happy to pay for good software, especially if independent, and I intend to keep my newsletter free to readers, but I may open a Ko-fi page or another possibility for you to contribute to my work in the future.
I will now use this as the newsletter in tandem with my website which will host all posts (and of which you can follow the RSS feed). I still have to migrate some old Substack posts there but bear with me. You can also just follow this newsletter via RSS feed and not email. All old Substack posts are, by the way, on the migrated archive here.
I've brought all of you Substack subscribers here and I hope you stay for the ride, but if you find this annoying, if you had maybe signed up from Substack out of recommendations (a feature I never loved as it encourages "blind" subscriptions) and aren't interested, if you disagree with any of the above... please feel free to unsubscribe, no offence taken; I want this to be valuable, not a burden.
Until next time,
Martina @ Doodling Data
I will likely use this data (I’ve got Pierluigi’s permission!) to produce other cards, as there’s a bunch of stories you can draw from it, but for now, let’s focus on countries and cuisines.
The Michelin guide is the Scripture of food critique; it first saw the light in 1900, in the early days of the automobile, when the French tyre company created it as a way to encourage more driving and, consequently, more tyre consumption. From its early days, when it only listed hotel-bound restaurants and practical information about where to buy petrol and parts, it has grown into the ~3,500-item list we have today, expanding country by country. We will do a quick dive into the highest star-decorated restaurants, by country!
Looking at things to recommend you for a synaesthetic journey into exquisite food, I thought of:
The latest edition of the guide lists 3453 restaurants, distributed in 42 countries.
The country with the highest number of Michelin-decorated restaurants is… France 🇫🇷. I don’t think that’s a surprise! Japan, Italy, Germany, Spain and the USA follow to build up the top 6 we show in the data card: I was quite surprised to see that the UK isn’t there - in fact, it appears to be at position 7. I will use these top-6 countries for the representation, which means the UK is unfortunately just out.
Country | Restaurants 2023 | Restaurants 2024 |
---|---|---|
France | 622 | 621 |
Japan | 411 | 411 |
Italy | 377 | 395 |
Germany | 327 | 325 |
Spain | 246 | 245 |
USA | 225 | 231 |
Looking at the counts of starred restaurants for these 6 countries, it is interesting to note that while France, Germany and Spain have all lost one or two, Italy had a significant gain of 18 restaurants! The USA gained 6 and Japan kept the same count. Note: I have not checked which specific restaurants stayed the same, maybe that’s material for another card later.
The counts of restaurants for each country are illustrated in the following plot - you can see the long tail of those countries with just a handful. Bear in mind that the Michelin guide expanded gradually (and still does) from France to other countries though, so it is not a fair representation of the global food scene!
The restaurant counts shown in the table above are displayed within the donuts, so these 6 countries are ordered by the size of their presence in the guide. Within each donut, I’m showing the distribution of stars amongst these restaurants: while it is arguably obvious that most places (anywhere) will have just one star, you can see that e.g. Italy has a lower proportion of 2s and 3s with respect to e.g. France.
Pierluigi Vinciguerra had given me the first hint of an idea: looking at how different countries differ in terms of the variety of cuisines represented in the guide. I’ve decided to explore this in terms of the split between local and foreign cuisine - I, arguably naively, would be expecting that countries with a strong food tradition, such as Italy, would see very few restaurants of foreign cuisine. I was partly right.
For this part, look at the donut charts.
The cuisines data comes from tags each restaurant is equipped with (it can be more than one). Unfortunately for me, but in a way that makes lots of sense because these are high-end places, some of the most frequent tags are “Creative” and “Contemporary”, which do not relate to an origin. These are the reasons behind the grey portions in the donuts.
For local and foreign cuisines, I have manually mapped each tag to whether it is local or foreign. For example, Italian local cuisines encompass “Campanian”, “Ligurian”, “Cuisine from Abruzzo”, etc. I’ve also put “Regional Cuisine” amongst the local ones for each country that displayed it. Foreign cuisines can be “Japanese” for France or “French” for the USA. As for “Mediterranean Cuisine”, I’ve ignored it (which means it ends up in the grey part) except for countries outside of the Mediterranean, for which it is a foreign cuisine. I’ve also chosen to put “Italian-American” amongst the local cuisines for the USA: I understand this is probably a controversial choice, largely driven by my being Italian, but I think of Italian-American food as mostly American with Italian inspiration or ingredients. The logic I used to class a restaurant checks whether it has any tag amongst the local ones and, failing that, whether it has any tag amongst the foreign ones. Failing that too, the restaurant was classed as “Other/Mixed” (grey part). This way, a local tag trumps everything else - see the small sketch of this logic below.
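In Python terms, the classification step might look something like this; the tag sets here are tiny illustrative placeholders, not the full hand-made mapping described above.

```python
# Hypothetical, heavily truncated tag lists - the real mapping was done by hand
LOCAL_TAGS = {
    "Italy": {"Campanian", "Ligurian", "Cuisine from Abruzzo", "Regional Cuisine"},
    "USA": {"Italian-American", "Regional Cuisine"},
}
FOREIGN_TAGS = {
    "Italy": {"Japanese", "French"},
    "USA": {"French", "Japanese"},
}

def classify(country, tags):
    """A local tag trumps everything, then a foreign tag,
    otherwise the restaurant falls in the grey 'Other/Mixed' bucket."""
    if tags & LOCAL_TAGS.get(country, set()):
        return "local"
    if tags & FOREIGN_TAGS.get(country, set()):
        return "foreign"
    return "Other/Mixed"

print(classify("Italy", {"Creative", "Ligurian"}))  # local
print(classify("Italy", {"Japanese"}))              # foreign
print(classify("USA", {"Contemporary"}))            # Other/Mixed
```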
You can see that Italy and Japan have a substantial portion of restaurants classed as local cuisine, and I was quite surprised to see that that’s not the case for France, which in fact is for the vast majority inhabited by creative and not-place-related cuisines.
Liked this? I have a newsletter if you want to get things like this and more in your inbox. It’s free.
No structure, even an artificial one, enjoys the process of entropy. It is the ultimate fate of everything, and everything resists it.
— P K Dick, Galactic Pot-Healer
The concept of entropy has its roots in thermodynamics, where it emerged in the middle of the 19th century as part of the vast intellectual exploit that followed the industrial revolution, and was defined from the study of thermal exchange. In an isolated system, which will naturally tend to reach thermal equilibrium, entropy increases as a result - this is the second law of thermodynamics.
L. Boltzmann linked the macroscopic definition of entropy to the microscopic components of the system, rendering entropy a statistical concept, and J. W. Gibbs developed the idea further, defining it in terms of an aggregation over all possible microstates of a system (the constant \(k_B\) is the Boltzmann constant) as:
\[S = - k_B \sum_i p_i \ln p_i\]

where \(i\) is the index of each microstate and \(p_i\) its probability. A microstate is a configuration of the microscopic elements of the system (e.g. molecules, in the case of physical matter). Effectively, the entropy is proportional to the negative expected value of the logarithm of the probability.
The popular notion of entropy as a measure of disorder stems from all this: entropy measures (statistically) the ways in which a system can arise from the configuration of its microscopic components.
Rocks crumble, iron rusts, some metals corrode, wood rots, leather disintegrates, paint peels, and people age. All these processes involve the transition from some sort of “orderliness“ to a greater disorder. This transition is expressed in the language of classical thermodynamics by the statement that the entropy of the Universe increases.
— W M Zemansky, R H Dittman, “Heat and Thermodynamics“
Fast-forward a few decades and we have C E Shannon who in 1948, while employed at Bell Labs, wrote the crucial paper “A Mathematical Theory of Communication“, that defined entropy in the realm of Information Theory as a way to measure the level of information in a message. This definition follows the same paradigm:
\[H(X) = - \sum_{x \in A} p(x) \log_2 p(x) = - \mathbb E[\log_2 p(X)] \ ,\]

where \(X\) is a random variable spanning the space \(A\); the base-2 logarithm is used because the unit of information is the bit, which can have two states. Note that Shannon used the letter H to refer to entropy; you can read more about why here. The Shannon entropy is again linked to disorder, or uncertainty: the more similar the probabilities are, the higher the entropy.
Let’s work through a simple example. Suppose we have a fair coin (one with the same probability of yielding heads or tails, that is, 50%). H would then simply be:
\[H = - \frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = - \log_2 \frac{1}{2} = 1\]

In the case of a coin that is maximally unfair on the side of heads (it yields heads every time), we would have H = 0, and the same with a coin maximally unfair on the side of tails. In fact, if p is the probability of yielding heads, then (1-p) would be the probability of yielding tails, and the entropy becomes:
\[H = - p \log_2 p - (1-p) \log_2 (1-p) \ ,\]

which plots as
You can see the graph reaches its maximum for the fair coin situation (p=0.5) and is 0 at the extremes: these are the levels of minimum information as the result is directly predictable, with certainty. The fairer the coin, the less predictable the result is, and the higher the informational content. The equiprobable situation is the one with the highest informational content, and highest entropy.
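If you want to reproduce the curve yourself, a minimal sketch with NumPy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)  # avoid log2(0) at the extremes
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("p (probability of heads)")
plt.ylabel("H (bits)")
plt.show()
```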
You can extend this reasoning to more complex trials with a higher number of possible states, e.g. a die, which has 6 possible states. In fact, the entropy reaches its maximum value for \(p=1/n\) (a uniform distribution), n being the number of possible states, and that maximum value is the logarithm of n.
Now why is this relevant to the world of data? Because you can use the entropy as an accessory evaluation metric for the reliability of your ML classifier - it will inform you on how sure your classifier is.
One important note: entropy is not an alien concept to Machine Learning as it is used for instance in tree-based models to decide on each split. But here we are discussing the use of entropy ex post facto, at the inference step.
Suppose you have a probabilistic classifier (one that spits out the inferred class alongside a probability value), examples are tree-based models or logistic regression. Normally in these cases the class inferred with the highest probability is the one furnished in output. But there is information in the probability values! Are they very different, or are they roughly the same, with one of the classes just barely topping the others? In short: is the probability distribution over the classes resembling a uniform or is it far from it?
Intuitively, if for example we had a binary classification (classes A, B) and our inferred result on a new instance were class B with probability 0.8 (which means class A has probability 0.2), we would feel like the classifier is quite sure of its choice. If instead, on another instance, the result were class B with probability 0.52 we wouldn’t feel that convinced. The same extends to multi-class problems.
Knowing that the entropy is maximal and equal to the logarithm of the number of classes in the most confused situation where all classes have the same probability helps us: we can compute the entropy over all our inference instances and analyse results in a statistical fashion. We can for example draw a histogram of the entropy on all the instances, to check how many times the classifier has been quite confused/unsure - maybe these instances are not good enough to keep.
This paper goes further and creates the concept of an “entropy score”, the entropy normalised to its maximum value and then explores an application to the Naive Bayes classifier.
I searched for a simple dataset that could illustrate all this well - the requirements were that it had to be suitable for a multi-class classification, ideally with a small number of classes. I could have used a binary classification problem one too, but it would have been a less interesting example, with just two probabilities to calculate the entropy on.
I finally found a 3-class problem in this dataset1 hosted on Kaggle: diabetes health indicators. It’s free to download. I am only using it as a toy example to calculate the entropy, so I will not care about the general quality of the classification; for the sake of simplicity I will use a base Random Forest classifier and I will keep track of the probabilities associated with the classified class on a portion of the dataset.
I’ve put all the code in this gist if you’re interested.
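The gist has the full version; the core of it boils down to something like the sketch below. The file and column names are from memory and may need adapting to the CSV you download from Kaggle.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file/column names - adapt to the Kaggle CSV you downloaded
df = pd.read_csv("diabetes_health_indicators.csv")
X, y = df.drop(columns="Diabetes_012"), df["Diabetes_012"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # shape (n_samples, 3)
# Entropy of the predicted class distribution, one value per instance
entropies = -(proba * np.log2(np.clip(proba, 1e-12, 1))).sum(axis=1)

max_entropy = np.log2(proba.shape[1])  # log2(3) ~ 1.58 for three classes
print("rather confused instances:", (entropies > 0.9 * max_entropy).sum())
```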
When you plot the histogram of the probability of the classified class (the one chosen by the model, which means the one with the highest probability amongst the three possible ones), you obtain this plot:
And, correspondingly, the entropies of each classification display this histogram:
You can see that there are many items where the probability is exactly 1: the classifier is sure here, and the entropy is 0.
The probabilities are generally quite high, which means this classifier is quite confident in its choices - that’s good. There are, however, a handful of instances in which the max probability is around 0.45 - still higher than in a uniform situation (which would see all probabilities at ~0.33), but nevertheless not very good. Here the entropy is the highest (around 0.99; compare to the logarithm of 3, ~1.58, which we would get with a uniform distribution): these are the most confused instances - you may want to discard them in inference and not output a classification at all.
To conclude, it can be very useful to do a quick entropy analysis when you are dealing with a classification problem, on top of model evaluation for quality.
Finding a decent dataset for this task proved to be not that trivial, actually. Most toy datasets in circulation are for binary classification problems, or are very small and not interesting from the point of view of the classified probabilities (this is the case for the famous Iris dataset, embedded in Scikit-learn, which gave me all very high probabilities for the classified class, which means low entropy). ↩
As a follow-up to that, I’ve decided to explore how different countries fare in terms of both their wealth and their inequality. As a matter of fact, the two things do not necessarily go hand in hand: you can, for instance (and we will see this), have highly wealthy countries with large income differences within the population.
As a measure of wealth, I have used the GDP per capita; as a measure of inequality the Gini index (or coefficient), a classic indicator of income spread in a population - it goes from 0 to 1: the higher it is, the more unequal the population is.
In this scatter plot, I’m displaying the GDP per capita (at constant 2015 US dollars) on the x-axis and the Gini index on the y-axis, a point per country. Points are sized based on their (bucketed) population and coloured based on the continent.
Quick note: the definition of continent is not necessarily standard and has not been in time either - I’ve used the 7-continents convention (North America, South America, Europe, Africa, Asia, Oceania, Antarctica - this last one is of course not represented). Turkey and Russia have been coloured with two hues because they are classed as belonging to both Europe and Asia.
As you can see, African countries are in the low-wealth area of the plot, and Western countries are quite spread out but with no representation in the lowest GDP per capita part. No surprises here - note especially the Northern European clique of wealthy/quite equal countries (e.g. the Netherlands, Sweden, Belgium). South Africa is interesting as it has the highest inequality in the lot, an egregious 0.63.
As a rule of thumb, a Gini coefficient above 0.4 represents a significant inequality (this is more of a guideline than a standard divide): the USA are a prominent outlier in the West, with high wealth but also high inequality. I guess this is not particularly surprising either.
Asian countries, where most of the global population resides (see the large bubbles of China and India), mostly sit at low wealth, but some, like Japan, distance themselves from the group.
Oceania is represented by only Australia in the viz, at high wealth of course. South America is mostly in the least fortunate area of the plot.
For a detailed overview of the data used to produce this viz, and caveats, please see the next section. As a general note, you should know that I’ve only displayed countries exceeding the 10 million mark in population (we would have had too many otherwise) and that data points refer to the last year available, which may be different for each country and for each of the three metrics; unfortunately, because of how data is compiled from different sources, we don’t have info updated at the same time for each country.
The data I’ve used comes directly from Our World in Data, specifically these sources (you can download CSVs):
All three datasets report historical data, meaning there is a row per country per year: I’ve selected the latest year for each country, so as to have the most up-to-date info available. Of course, this implies that not every data point will refer to the same year. Specifically, GDP data spans the years 2015-2021; Gini data spans the years 2003-2021 (a very large range), population data refers all to 2021. This means ignoring temporal changes.
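For the curious, a rough sketch of how such a chart can be assembled with pandas and matplotlib; file and column names here are hypothetical placeholders (the actual OWID CSVs have their own naming), so treat it as an outline rather than the exact code behind the card.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical filenames and column names - adapt to the downloaded OWID CSVs
gdp = pd.read_csv("gdp-per-capita.csv")    # columns: country, year, gdp_per_capita
gini = pd.read_csv("gini-index.csv")       # columns: country, year, gini
pop = pd.read_csv("population.csv")        # columns: country, year, population

def latest(df):
    """Keep the most recent row available for each country."""
    return (df.sort_values("year")
              .groupby("country", as_index=False).last()
              .drop(columns="year"))

data = latest(gdp).merge(latest(gini), on="country").merge(latest(pop), on="country")
data = data[data["population"] > 10_000_000]  # only countries above 10M people

plt.scatter(data["gdp_per_capita"], data["gini"], s=data["population"] / 1e6)
plt.xlabel("GDP per capita (constant 2015 US$)")
plt.ylabel("Gini index")
plt.show()
```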
The late Professor Hans Rosling was an absolute master of presenting data in a way that is engaging, digestible and fun. This is a pretty famous TED talk he gave in 2006, discussing trends in global development.
He worked a lot in the area of fighting the misconceptions we all have about the world, and of educating the public to think about phenomena under a quantitative lens. The book “Factfulness”, which he co-authored with his son and daughter-in-law, has a cult following amongst data communities. They also founded the non-profit Gapminder, an organisation devoted to increasing public knowledge on issues of global development. If you head to the website you can test your perception (and likely prejudices) on a variety of topics too.
Liked this? I have a newsletter if you want to get things like this and more in your inbox. It’s free.
“She was beautiful - but especially she was without mercy.”
― F. Scott Fitzgerald, “The Beautiful and Damned”
For anyone who worked in science or generally with data, power law distributions are the norm (pun not intended). They are an actual feature of the natural and social world, present in all sorts of phenomena: wealth distributions (Pareto), word frequencies (Zipf), number of node connections in a social network, sales on Amazon…
Newman’s review paper on this topic, “Power laws, Pareto distributions and Zipf’s law”, is a joy to read; you can do so as it’s free and open-access on arXiv. We will use it to outline why these kinds of distributions are so alluring and deserve thought.
Newman says, at the start of the paper, that
Most adult human beings are about 180cm tall.
— M E J Newman, “Power laws, Pareto distributions and Zipf’s law”
The crux is in the “about”, but this sounds like an exaggeration to me (and that’s not just because I am barely 152cm tall)! Let’s check this statement. I found the SOCR dataset of human height and weight, kindly scraped and organised in a CSV by Kaggle. SOCR stands for “Statistics Online Computational Resource”, it’s a site (with a wonderful early-2000s interface) set up by UCLA for educational purposes. The chosen dataset apparently comes from data simulated based on children’s real heights as measured in a survey. We end up with this distribution of adult heights:
It is very clearly normal, with the mean at 172cm. The third quartile is at 175cm, so I’m still right, based on this data, that 180cm is too high to be a representative human height! Note that SOCR doesn’t specify any features of the data, e.g. gender or ethnicity, which may determine differences. Newman himself uses data to make his point, though - data from the National Health Examination Survey - and he effectively finds a distribution mean around 180cm.
Back to power laws now.
Power laws are relations of the form (A is a constant and \(\alpha\) is a positive number1):
\[f(x) = A x^{-\alpha}\]

which, if we apply a logarithm to both sides, become
\[\log f(x) = \log A - \alpha \log x \ .\]

This is why they appear linear in a log-log plot, which makes them easier to visualise.
In a probabilistic context, a variable distributed in a power law way is one where there are very many elements with low values and some with extremely large values, so not only do you see no characteristic “peak”, but you also cannot easily describe the distribution by a simple representative number.
Citing directly from Newman’s paper, where he uses genuine data, examples of things distributed in a power-law way include word frequencies, citations of scientific papers, city populations and earthquake magnitudes (see figure below).
There’s more examples and more details in the paper.
Note that it isn’t completely correct to say that “a lot of phenomena follow a power-law distribution”, because usually it’s only the tail that does. Real-world data distributions can be tricky to analyse, and simply drawing them in log-log is a poor way to ascertain a power-law behaviour, as there can be several other possibilities. However, the power law model works well as a guideline.
Power laws benefit from scale invariance, which means that if x scales by a multiplicative factor, the shape of the function remains the same bar a multiplier:
\[f(Bx) = A B^{-\alpha} x^{-\alpha} = B^{-\alpha} f(x) \ .\]

This property is very interesting in areas like the physics of materials and statistical mechanics, where it describes the emergence of phase transitions. It implies the power law relation lacks a typical scale.
Because \(f(x)\) diverges for \(x=0\), this means that in practical terms, as a probability density function

\[p(x) = A x^{-\alpha}\]

it must have a minimum value greater than 0. Let’s indicate it with the suffix min.
The expected value calculates then as
\[\begin{align} \mathbb{E}[X] &= \int_{x_{min}}^{\infty} x \, p(x) \, \mathrm{d} x \\ &= \int_{x_{min}}^{\infty} A x^{1-\alpha} \, \mathrm{d} x \\ &= \frac{A}{2-\alpha} x^{2-\alpha} \Big\vert_{x_{min}}^{\infty} \end{align}\]

which diverges if \(\alpha \leq 2\): power law distributions can have infinite means.
Consequences of this are non trivial. If you have a finite sample of values you can always (obviously) calculate the mean, even when the population is distributed with a power law. But in that case a sample mean may not be in line with the population mean: different samples may lead to very different sample means, because some will be characterised by extremely large values. The mean is not a representative value for the distribution.
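You can see this instability numerically with a quick simulation - sketched here with NumPy, repeatedly sampling from a classical Pareto distribution whose exponent implies an infinite population mean:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, x_min = 1.8, 1.0  # alpha <= 2: the population mean diverges

# numpy's pareto() draws Lomax variates; (value + 1) * x_min gives the classical
# Pareto with density ~ x^-(a+1), i.e. exponent alpha = a + 1 in our notation
a = alpha - 1
sample_means = [((rng.pareto(a, 10_000) + 1) * x_min).mean() for _ in range(10)]
print(np.round(sample_means, 1))  # sample means jump around wildly
```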
Similar reasoning applies to higher moments; for instance, you can calculate that the variance (hence the standard deviation) diverges if \(\alpha \leq 3\).
The power law distribution is also known, in jargon, as a “fat-tail”. To be correct though, a fat-tail distribution is one whose tail behaves like a power-law.
If a distribution has a fat-tail it means that the probability of extreme values is quite chunky (compared to e.g. a normal distribution). “Fat-tail” and “heavy-tail” are often used interchangeably, but in reality the first is a subset of the second - a “heavy” tail is one which is chunkier in the tail than an exponential, so decreases more slowly. A “fat” tail is “heavy” but the reverse isn’t always true - an example is the log-normal distribution, chunkier than an exponential but slimmer than a power (decreases faster). In the references you find a link to a very good discussion on Cross Validated on this difference.
To have a reference mental model, a fat tail is fatter than the tail of a normal distribution, where we know that the probability of staying within one/two/three standard deviations of the mean is respectively ~68%/~95%/~99%. In a fat tail, large values can appear with higher chance, making it hard(er) to call them “outliers”.
Because these distributions are so ubiquitous, this matter isn’t just an erudite academic one. The scholar N N Taleb has done much work on the concept of the “black swan” (hard-to-predict, different-from-the-rest and consequential events) and on the fact that in fields like economics the blind use of the canonical instruments of descriptive statistics, usually developed for Gaussians (e.g. using means to represent a distribution, calculating standard deviations, …), can lead to disaster. He says:
The traditional statisticians approach to thick tails has been to claim to assume a different distribution but keep doing business as usual, using same metrics, tests, and statements of significance. Once we leave the yellow zone, for which statistical techniques were designed (even then), things no longer work as planned.
— N N Taleb, “Statistical consequences of fat tails”
So we need to watch out when we analyse data. I’m leaving some good stuff in the references.
We’ll likely expand on this topic in some other posts in this section.
\(\alpha\) could very well be a negative number too, for a general power-law relation. However, for the sake of what follows, we are specialising to relations where the exponent is negative. ↩
To track my reads I use Goodreads. I don’t love its old-fashioned interface, and I wish it were still under active development (it’s not), but it remains the place where I have invested most of my time tracking what I read and creating wish lists. Goodreads was bought by Amazon in 2013 and since then it’s been pretty much abandoned - it has allegedly got to the point where it would need a complete architectural overhaul, something Amazon apparently deemed not worth its time and effort. Of course it also suffers from problems common to many social media: issues with online abuse and fake reviews. If you’re interested, there are many great alternatives, some of which are in the Fediverse.
Anyway, back to the point here. Since I was very young I have always loved reading so-called “classics of literature”, but in later years (a couple of decades, really) I started spending most of my time with either contemporary fiction or non-narrative texts. I got curious to see the distribution of my reads per decade of publication.
Luckily, you can easily dump your data out of Goodreads (as a CSV), and so here is the plot.
With regard to the scope of the “data cards” section of this publication, I realise I’m cheating a bit here - this isn’t a hand-drawn viz. Sorry, but really there was not much to draw, and a simple 5-minute (make it 10) effort with a Google sheet was more than enough for what I needed. At some point, I will talk more about choosing the path with the best compromise between effort and value when you do data work.
You just have to filter out everything not in the “read” shelf, create a column for the publication decade and then use a GROUP BY to count the titles for each. If you want, I can pass you a copy of the sheet you can use as a template to do the same.
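If you’d rather do it in Python, the equivalent is a few lines of pandas on the exported CSV - the column names below are as I recall them from the Goodreads export, so double-check yours.

```python
import pandas as pd

books = pd.read_csv("goodreads_library_export.csv")
read = books[books["Exclusive Shelf"] == "read"].copy()

# Bucket the original publication year into decades, then count titles per decade
read = read.dropna(subset=["Original Publication Year"])
read["Decade"] = (read["Original Publication Year"] // 10 * 10).astype(int)
print(read.groupby("Decade")["Title"].count().sort_index())
```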
Important: this is not a really good visualisation because given the x-axis is a time variable it may be perceived as a time series. It’s not, it’s a histogram: it just shows the counts per category, in this case the decades, and this is why there is no representation for e.g. decades between 1750 and 1810 - apparently I’ve got nothing read that was published in those years.
Anyway, you can see most of my reads are books of recent publication. I want to populate older times a bit more as I am missing so many great “classics”, so I will occasionally post this plot again, say in a few months, and see how things have changed. I have now resumed reading older stuff, starting with some Steinbeck (whom I had never read before and am really, really liking).
For another time: looking at the geographical distribution of my reads, based on the author’s nationality/biography. It won’t take a visualization to discover that the vast (vast, vast) majority of what I’ve read is Western.