<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://martinapugliese.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://martinapugliese.github.io/" rel="alternate" type="text/html" /><updated>2026-02-28T14:03:25+00:00</updated><id>https://martinapugliese.github.io/feed.xml</id><title type="html">Clearly Erroneous</title><subtitle>Tech, data and 1000s wonderful things</subtitle><author><name>Martina Pugliese</name></author><entry><title type="html">Underrated ways to widen your horizons</title><link href="https://martinapugliese.github.io/underrated-ways-horizon/" rel="alternate" type="text/html" title="Underrated ways to widen your horizons" /><published>2025-10-11T00:00:00+01:00</published><updated>2025-10-11T00:00:00+01:00</updated><id>https://martinapugliese.github.io/underrated-ways-horizon</id><content type="html" xml:base="https://martinapugliese.github.io/underrated-ways-horizon/"><![CDATA[<p>This post was inspired by the excellent <a href="https://www.experimental-history.com/p/underrated-ways-to-change-the-world">“Underrated ways to change the world”</a> by A Mastroianni and focuses specifically on some easy, usually accessible and free/cheap ways to expand one’s understanding of the world, recognising the complexity of human interactions, texture and diversity more.</p>

<p>We live in an age where “we have the most sophisticated information technology in human history yet we have lost the ability to talk to each other”, as <a href="https://www.facebook.com/Prof.Yuval.Noah.Harari/posts/how-come-we-dont-talk-or-listen-to-each-other-anymore-despite-having-the-most-so/930359131780267/">Y N Harari famously said</a>. We may blame social media, AI, autocrats, capitalism, or all of those things and more; on the flip side we may look at the past and decide that nothing’s really changed, that maybe a lot of us have <em>always</em> been dumb/racist/selfish, only they wouldn’t show it <em>globally</em> because they didn’t have broadcasting platforms. I don’t know, but I think there’s something each of us can do to increase our own texture, the quality of our thoughts, the knowledge we hold about other places and the people that inhabit them. And as a bonus, we get a deeper appreciation of the value of diversity and a fuller realisation that you don’t have to <em>know</em> stuff, you just have to be open to learn it. Racism and discrimination can and must be fought by the force of the law, but their roots can only be extirpated by openness and recognition.</p>

<p>This list is far from complete, but it’s a start. It is what I try to do myself, but there’s of course plenty more ideas than these.</p>

<h2 id="listen-and-read-news-from-outlets-from-other-geographies">Listen and read news from outlets from other geographies</h2>

<p>The farther from you, the better. Find reputable TV news channels and/or newspapers and magazines from places and cultures far apart from yours. If you’re Western, chances are most of your news intake is from Western sources and while there are of course illustrious outlets that cover the whole world, the base perspective can nevertheless be… Western. Mixing up and listening to the same piece of news from other places can be immensely educational. Many outlets from all over the world have English versions.</p>

<h2 id="read-authors-from-other-countries-and-less-privileged-backgrounds">Read authors from other countries and less privileged backgrounds</h2>

<p>Read books in general, especially the great literature of the world. Investigate great authors from countries you know little about and read their production. If you’re able to, read in the original language (but otherwise translators usually do an amazing job!).
I found <a href="https://www.instagram.com/p/DKaA_nBPBAQ/?hl=en-gb">this person</a> who’s reading a book from each country in the world and I think it’s a delightful project.
Also, read women authors, black authors, queer authors, politically-persecuted authors, authors from backgrounds and walks of life that are far from yours. Read quality thoughts about race, freedom, systemic discrimination, get deep into what it can mean to be treated like you’re a lesser human being.</p>

<p>Use your public library (if you have the privilege to have one, I realise not everyone does), frequent your local bookstore: they’ll likely also hold events and presentations. If you’re a bookworm, entering a library can be like being in the quintessential candy store - there’ll be shelves with new acquisitions, shelves organised by topic, shelves celebrating a particular author or theme. Explore.</p>

<h2 id="watch-movies-and-shows-from-same-as-above">Watch movies and shows from [same as above]</h2>

<p>The exact same as above applies - make an effort to look for the good stuff from abroad. Maybe you have a nearby cinema that regularly hosts foreign film festivals. Or if you pay for a streaming platform there may be sections of material from various countries, it may be worth having a look next time you’re in doubt as to what to watch.</p>

<h2 id="learn-a-new-language">Learn a new language</h2>

<p>In the age of machine translation, is learning a language still a worthy enterprise? I’m convinced it is, not so much because of practical repercussions, although being able to ask things and sustain a conversation without the need for a phone mediator is clearly empowering, but because it helps widen your mind, and to a good extent. Language is culture and it is history. Of course becoming fluent would help a lot with the two above ideas, but that’s not what I mean here - you don’t have to be fluent to enjoy how different languages may structure things differently or to see that grammatical concepts do not necessarily overlap 1:1 (and in fact, the farther apart the languages, the smaller the chances that they do). For example, Italian has a conditional mood for verbs, which German renders with Konjunktiv II; Mandarin Chinese does not “conjugate” verbs, or even have the same strict concept of a “verb” like in European languages.</p>

<p>You also don’t need to be fluent to make connections between words, explore possible common roots and intertwine the development of a language with the history of peoples who migrated, mixed, were conquered. Some languages don’t share the same alphabet or writing system, some do but with variations. Linguistics is fascinating and it follows the very human history itself - getting some familiarity with it is a wonderful way to embrace the complexity of humanity.</p>

<h2 id="study-some-maths">Study some maths</h2>

<p>While on the topic of learning, learn mathematics. Maths powers a lot of things - we’d have no tech, no AI, no coding, no bridges and no bookkeeping without it. But we’d also struggle to understand much of the wonder of the world, like for instance how <a href="https://www.quantamagazine.org/math-of-the-penguins-20200817/">penguins find the most efficient configurations when battling a storm</a>.</p>

<p>Maths is the ultimate logic-trainer. There’s just nothing else like it. Go to the basics, train your brain, it’ll help in so many ways. You don’t need to enrol at a university or buy expensive textbooks, you just need to have the curiosity and start small - there’s plenty of great, free and entertaining material online. Mathematics is powerful and it is beautiful and, as many have said, it is more often than not butchered when it is taught. It is a giant shame. However, learning it does require focus and some effort. Is it hard? Yes, but the reward is huge. As they say, no pain, no gain - and this applies to pretty much anything in life.</p>

<h2 id="ask-people-be-painfully-curious">Ask people, be painfully curious</h2>

<p>When meeting someone that comes from the other side of the world, it is very easy to keep the conversation on the superficial level of the “oh great/oh I want to go there/oh I was there last year”. Go deeper, be curious (without being creepy of course!): ask them to recommend something, their favourite dish or one they don’t like, what do families and friends do when they spend social time together… get that texture, engage, learn.</p>

<hr />

<p>It is true that the world is on fire right now, that consumerism has fully shown its ugly face, that places that prided themselves on being the proverbial beacon of democracy (?) are turning autocratic, that we are all more digitally connected yet more actually lonely. But it doesn’t have to be this way, and it’s up to all of us to do something to keep educating ourselves, show kindness, connect with those different from us, be humble and learn. Maybe if more and more of us did this we’d naturally attract more people to do the same and we’d actually make a dent.</p>]]></content><author><name>Martina Pugliese</name></author><category term="learning" /><summary type="html"><![CDATA[Learning about the world doesn't have to be a hard job]]></summary></entry><entry><title type="html">Excellent dystopias in literature - A Scanner Darkly</title><link href="https://martinapugliese.github.io/excellent-dystopian-literature/" rel="alternate" type="text/html" title="Excellent dystopias in literature - A Scanner Darkly" /><published>2025-08-31T00:00:00+01:00</published><updated>2025-08-31T00:00:00+01:00</updated><id>https://martinapugliese.github.io/excellent-dystopian-literature</id><content type="html" xml:base="https://martinapugliese.github.io/excellent-dystopian-literature/"><![CDATA[<p>Sometimes, my fiction-reading follows a trail of books of the same genre or with somewhat similar motifs. Lately, the shared motif has been dystopias and I’ve read these ones in quick succession:</p>
<ul>
  <li>A Scanner Darkly, P K Dick</li>
  <li>The Testaments, M Atwood</li>
  <li>The Man in the High Castle, P K Dick</li>
  <li>Brave New World, A Huxley</li>
  <li>Do Androids dream of Electric Sheep?, P K Dick</li>
</ul>

<p>They’re all beyond excellent - this post pens some thoughts about the first one.
<strong>Warning</strong>: There are spoilers here, so if you want to read the book you know what to do.</p>

<p>I’d never read anything by Dick before and I can’t remember how I ended up starting with this one (it was probably a recommendation from somewhere) but it was brilliant: themes are the surveillance state, substance addiction, personal identity and the subtle tension between truth and fabricated piles of lies. The book is intensely symbolic and multilayered.</p>

<h2 id="the-title">The title</h2>

<p>The title is inspired by a biblical passage, <a href="https://philipdick.com/literary-criticism/frank-views-archive/digressions-on-allusions-in-p-k-dicks-a-scanner-darkly/">possibly</a> taken from the King James version of the Bible (a translation commissioned by King James VI/I of Scotland/England &amp; Ireland):</p>

<blockquote>
  <p>“For now we see through a glass, darkly; but then face to face: now I know in part; but then shall I know even as also I am known.”<br />
– 1 Corinthians 13:12 <a href="https://www.biblegateway.com/passage/?search=1%20Corinthians%2013%3A12&amp;version=KJV">(King James version)</a></p>
</blockquote>

<p>The passage reflects on the fact that while at the moment we don’t see and understand clearly, we will at some point (in the afterlife?). Here’s the identity theme front and center.</p>

<h2 id="the-setting">The setting</h2>

<p>This wasn’t clear to me immediately but it gradually emerged with reading: we are in a violent authoritarian society, in an unspecified near future, near enough that it preserves common traits with Dick’s present (the book was written in the ’70s) but far enough that it exacerbates them into a sinister atmosphere.
The main factor is a widespread drug addiction, which the government fights via several branches of law enforcement and (purported) addict rehabilitation programs.</p>

<p>It is also a heavily consumeristic society flattened onto the repetition of the same offerings over and over again, a setting that must have been very much inspired by reality. Two of my favourite quotes:</p>

<blockquote>
  <p>In Southern California it didn’t make any difference anyhow where you went: there was always the same McDonaldburger place over and over; like a circular strip that turned past you as you pretended to go somewhere.</p>
</blockquote>

<blockquote>
  <p>Life in Anaheim, California, was a commercial for itself, endlessly replayed.</p>
</blockquote>

<p>There’s a pervasive sense that nobody knows who others really are: addicts (the “dopes”) lose perception of themselves and others, but non-addicts (the “straights”) are also partially in the dark with respect to identities. A significant element is the special suit that governmental narcotic agents (the “narcs”) wear in order to conceal their identity from everyone (including other narcs) - the suit hides someone’s features by showing a rotation and overlap of random faces: narcs have no identity, they are slaves to the system as much as dopes are slaves to the drugs.</p>

<p>The most popular drug is “Substance D” (never made explicit but likely standing for “death”, though it could also be in combination with “destruction” or “damage”), which has extreme addiction power and ability to destroy: users will gradually lose use of their brains because the connections between hemispheres get severed. People get onto weird obsessions (the book opens with a guy who’s convinced there’s insects all over him and his place and takes one shower after another). The whole vibe is, appropriately to the title, dark.</p>

<p>The protagonist Bob is a narc, but he’s also a user, because he’s undercover and needs to fit in. He thought he was above abuse, but he’s not - he gradually turns into one of the same people he’s supposed to investigate. Bob is tasked with discovering who produces Substance D, as it is known there’s only one root source and the drug is organic.</p>

<p>There’s lots of clever turns of the story, like when Bob, who lives in a house with other dopes, finds himself having to investigate himself - the narcs have installed cameras all over the house, they know there is a narc in there but not who’s who, and Bob has to watch the recordings and report, reporting himself. His degradation follows a spiral, and he does realise he’s losing his cognition, medics give him tests which prove it, yet he has to keep going. Who is who becomes even more mixed-up.</p>

<p>Bob’s surname, Arctor, is reminiscent of “actor”, <a href="https://philipdick.com/literary-criticism/frank-views-archive/digressions-on-allusions-in-p-k-dicks-a-scanner-darkly/">perhaps a reference to his acting of himself</a>.</p>

<p>Dick was a heavy drug user himself. In the appendix he writes</p>

<blockquote>
  <p>This has been a novel about some people who were punished entirely too much for what they did.</p>
</blockquote>

<blockquote>
  <p>There is no moral in this novel, it is not bourgeois […] it just tells what the consequences were.</p>
</blockquote>

<blockquote>
  <p>If there was any “sin”, it was that these people wanted to keep on having a good time forever, and were punished for that, but […] the punishment was far too great</p>
</blockquote>

<p>There’s also a list of names with the “punishment” they received, in which a “Phil” with “permanent pancreatic damage” is himself.</p>

<h2 id="the-language">The language</h2>

<p>The rhythm is fast and the prose is raw, at times jargony, and the conversations reported are often trippy, which is a testament to Dick’s writing ability: despite the addicts’ nonsense (words and actions), you as a reader never fail to understand what is happening.</p>

<p>There’s lots of science-like talk about the human brain too, which proves research.</p>

<p>Then, there’s some German quotes, which is a constant in Dick. He <a href="https://archive.nytimes.com/opinionator.blogs.nytimes.com/2012/05/20/philip-k-dick-sci-fi-philosopher-part-1/">appreciated philosophy</a>, including German thinkers, and <a href="https://philipdick.com/mirror/websites/pkdweb/The%20Mainstream%20that%20through%20the%20ghetto%20flows.htm?utm_source=chatgpt.com">understood the language</a>.</p>

<p>It’d be largely reductive to brand him as “just” a science-fiction writer. There are plenty of cultural allusions and references you can pick up, the more I read reviews and opinion pieces about this book the more I see them myself.</p>

<h2 id="the-ending">The ending</h2>

<p>Bob ends up in a rehab clinic, one of those cruel government-sponsored places for the supposed recuperation of addicts. Except that they get enslaved instead. He’s been “betrayed” by his girlfriend who is in fact an upper-level governmental agent herself.
In the very final scene, while there in a working farm, in a rare epiphany now that his brain is completely gone to mush, he finds the source of Substance D (some flowers) and hence realises that the government itself is the producer.</p>

<p>Authoritarian regimes always need control systems, and control systems need to control each other. It is a chain of power where each element only has partial information and acts on a limited sphere. The narcs are actually a cover-up the government uses to pretend to be on the side of citizens and have superior moral status while in fact they, unbeknownst to themselves, are just pawns, serving the purpose of pretend law-enforcement and really doing the dirty job of citizen surveillance. This has Soviet-style elements and points of connection to Orwell’s 1984.</p>

<p>The oligarchs in power perpetuate a rich-gets-richer way of living and for that it is essential to both create the addiction problem and make sure it can never be solved, while at the same time pretending to tackle it. Everyone else is unknowingly playing into the game. This is redolent of <a href="https://www.goodreads.com/book/show/69807523-pain-killer">Pain Killer: An Empire of Deceit and the Origin of America’s Opioid Epidemic</a>.</p>

<h2 id="the-derived-movie">The derived movie</h2>

<p>There is a <a href="https://www.rottentomatoes.com/m/scanner_darkly">movie</a> (2006) but IMHO its only noteworthy element is the fact that it’s fully rotoscoped. Interesting visuals, but it’s not up to the book at all. It felt to me like it was made into a “story” stripped of all the deeper elements and reduced to a quite shallow rendition. But <a href="https://www.rottentomatoes.com/m/scanner_darkly">not everyone agrees</a>.</p>]]></content><author><name>Martina Pugliese</name></author><category term="books" /><category term="literature" /><summary type="html"><![CDATA[Some thoughts on a novel with many layers]]></summary></entry><entry><title type="html">Elizabeth Evelyn Wright</title><link href="https://martinapugliese.github.io/data/elizabeth-evelyn-wright/" rel="alternate" type="text/html" title="Elizabeth Evelyn Wright" /><published>2025-06-21T00:00:00+01:00</published><updated>2025-06-21T00:00:00+01:00</updated><id>https://martinapugliese.github.io/data/elizabeth-evelyn-wright</id><content type="html" xml:base="https://martinapugliese.github.io/data/elizabeth-evelyn-wright/"><![CDATA[<p>Elizabeth Evelyn Wright was the first black woman to found a place of higher education in the USA: <a href="https://voorhees.edu">Voorhees University</a>, South Carolina, 1897. Voorhees University is classified as an <a href="https://en.wikipedia.org/wiki/Historically_black_colleges_and_universities">HBCU</a> (one of the “Historically Black Colleges and Universities”), named and recognised in 1965 as part of the <a href="https://en.wikipedia.org/wiki/Higher_Education_Act_of_1965">Higher Education Act</a>, which also provided dedicated funding.</p>

<p>I found 102 of them, currently active - there’s probably more depending on how you count, with merges and renamings, and I’ve collected data about their year of creation, state and founder.</p>

<figure style="width: 600px" class="align-center">
  <img src="https://martinapugliese.github.io/assets/posts_images/hbcus.jpg" alt="Hand-drawn diagram of the timeline of founding of all HBCUs per state. Each one is represented by a circle, coloured according to the founder type (religious, state, private individual); circles are placed on a horizontal timeline." />
  <figcaption>Foundations of HBCUs in time, active ones represented, for each US state and colour-coded by the type of founder. Some notable ones are marked. The list is as comprehensive as can be and includes currently active HBCUs.</figcaption>
</figure>

<p>At the age of 16, Wright moved from her native town of Talbotton, Georgia to Tuskegee, Alabama after learning of the opening there of a new university for black students, <a href="https://www.tuskegee.edu/">Tuskegee University</a>, founded in 1881 with the name of <a href="https://en.wikipedia.org/wiki/Tuskegee_University#Planning_and_establishment">Tuskegee Normal School for Colored Teachers</a>. The year was 1888 and she enrolled there.</p>

<p>Wright was the child of formerly enslaved parents, an African-American man and a Cherokee woman. The family was poor - conditions were hard for black folks, especially in the South. She likely used trains for the ride to Alabama as the railway was the most popular mode of transportation for long distances at the time: a train journey in 1888 cost circa <a href="https://truewestmagazine.com/train-and-stagecoach-ticket-prices/">2-3 cents/mile</a>, which makes for a total expenditure for Wright between $2 and $3, using an estimate of a total of 100 miles covered. In <a href="https://www.in2013dollars.com/us/inflation/1880?amount=3">today’s money</a>, that’s a ballpark of $60-$100 - likely a lot of money for someone of Wright’s background.
It is very interesting to note that the same train journey today wouldn’t be possible - several railway lines that used to operate in the area have been discontinued or operate just as freight services, like for instance the <a href="https://en.wikipedia.org/wiki/Western_Railway_of_Alabama">Western Railway of Alabama</a>.</p>

<p>She worked as a servant/domestic during the day, frequented courses at night and found herself mentors who would support her education, eventually graduating in 1894. Then she moved to South Carolina, held teaching jobs and decided to give back to the community by founding schools for black students herself, with the special aim to provide industrial and agricultural education: she was keen to equip black folks with the abilities to succeed in life beyond the menial jobs they had been enslaved to do before abolition.
Things really didn’t go well initially: not only did she encounter huge resistance by many white people, but several of her attempts at building schools were burned down. Until the one in <a href="https://en.wikipedia.org/wiki/Denmark,_South_Carolina">Denmark</a>, with an early name of Denmark Industrial School, succeeded. It was 1897 and the school started as a single room on top of a grocery store. She then secured a generous donation from benefactor <a href="https://www.newnetherlandinstitute.org/history-and-heritage/dutch_americans/ralph-voorhees">Ralph Voorhees</a>, a wealthy local man with an interest in spending his large fortune in good causes, which enabled the school to move to a new location and expand, and eventually take his name (it was Wright’s decision to do so). Since then, the school has offered education to many people, remains predominantly black-frequented and has largely expanded its offering in terms of programs and degrees.</p>

<p>Wright was raised in the Episcopal faith, and religion was a strong driver behind her ambition; the church, as well as individual figures, also supported her all along and financed her endeavours. She travelled around a lot all her life, often on foot, to raise money and awareness, and her legacy remains huge.</p>

<p>Of a sick disposition all her life, she died aged only 34 in 1906. Her grave lies on the university campus.</p>

<h1 id="sources">Sources</h1>

<p><em>I’ve used a combination of generative AI tools to aid the collecting of information about this, specifically ChatGPT and Gemini with Deep Research. I’ve checked information and sources for reliability.</em></p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Elizabeth_Evelyn_Wright">Wikipedia page for E E Wright</a></li>
  <li><a href="https://voorhees.edu/elizabeth-evelyn-wright/">Page about her on the Voorhees University website</a></li>
  <li><a href="https://www.youtube.com/watch?v=P2FSh0ymEsk&amp;ab_channel=VoorheesUniversity1897">Voorhees University documentary about E E Wright</a></li>
  <li><a href="https://uncficb.org/about/timeline/">A brief timeline of HBCUs on UNCF</a></li>
  <li><a href="https://www.scencyclopedia.org/sce/entries/wright-elizabeth-evelyn/">E E Wright on the South Carolina Encyclopedia</a></li>
  <li><a href="http://www.nationalregister.sc.gov/bamberg/S10817705009/S10817705009.pdf">The Nomination Form of Voorhees University</a> on the National Register of Historic Places</li>
  <li>There’s a <a href="https://www.hmdb.org/m.asp?m=206971">marker plaque</a> in Talbotton about E E Wright</li>
</ul>]]></content><author><name>Martina Pugliese</name></author><category term="data" /><category term="pathbreakers" /><summary type="html"><![CDATA[How a black woman from Georgia founded a university in 1897]]></summary></entry><entry><title type="html">We made an AI agent</title><link href="https://martinapugliese.github.io/tech/we-made-an-agent/" rel="alternate" type="text/html" title="We made an AI agent" /><published>2025-05-25T00:00:00+01:00</published><updated>2025-05-25T00:00:00+01:00</updated><id>https://martinapugliese.github.io/tech/we-made-an-agent</id><content type="html" xml:base="https://martinapugliese.github.io/tech/we-made-an-agent/"><![CDATA[<p>Creating AI agents is every techie’s favourite pastime these days, so my <a href="https://www.linkedin.com/in/bernardo-monechi-25699359/">friend</a> and I also made one. It aims at aiding the finding and consumption of scientific research and is aptly (we think, at least) called <a href="https://github.com/martinapugliese/askademic">askademic</a>. It works off of the arXiv API, so it can only really consider papers submitted to arXiv - unfortunately, and we’re aware, this leaves out a bunch of research in fields whose communities use other platforms to share research, but unless we’re wrong there’s no other freely available repository that’s also equipped with an API - if you know any, let us know! 
Of course, the other caveat is that arXiv is for preprints only, some stuff there hasn’t been peer-reviewed at all, it’s incomplete as some journals prohibit pre-publication, and so on. But again, it’s the best we (as society) have.</p>

<p>The amount of research appearing is on a clear <a href="https://arxiv.org/stats/monthly_submissions">upward trend</a> and <a href="https://info.arxiv.org/about/reports/submission_category_by_year.html">some fields are particularly popular</a>, especially, and maybe unsurprisingly, Artificial Intelligence. It’s really hard, if not virtually impossible, to keep up with the pace, so we built something that could help us distill and easily find information.</p>

<h2 id="what-is-it-what-can-it-do">What is it, what can it do</h2>

<p>Askademic is a CLI tool - we’ve chosen to start with the <a href="https://en.wikipedia.org/wiki/Path_of_least_resistance">path of least resistance</a>, we may create a UI later on, maybe when the tool has more features. It’s powered by <a href="https://ai.pydantic.dev/">PydanticAI</a> as its agent framework and currently based on Gemini (2.0 Flash) as its underlying LLM, though we’re shipping support for other families too.</p>

<p>Askademic can:</p>
<ul>
  <li>summarise recent papers (those from the latest available day) published in an <a href="https://arxiv.org/category_taxonomy">arXiv category</a></li>
  <li>find specific information in the literature by answering questions</li>
  <li>talk about a given paper</li>
</ul>

<p>All instructions are on GitHub but more details on what it can do and how are on the excellent <a href="https://deepwiki.com/martinapugliese/askademic">DeepWiki</a> (which IMHO is itself one of the best applications of LLMs).</p>

<p>The whole process of developing agents and orchestrating them is made very easy by PydanticAI, which has some rough edges being a juvenile project itself, but it is overall really good.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/askademic_flow.png" alt="Flowchart showing the working of askademic with the agent orchestration described in the text." />
  <figcaption>High-level overview of askademic. Image by DeepWiki.</figcaption>
</figure>

<h3 id="science-or-pun">Science or pun</h3>

<p>Your question first passes through an agent (<strong>“allower”</strong>) that decides whether you’re asking something scientific or not - in the latter case, it gives you a pun related to what you asked, e.g.:</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/askademic_pun.png" alt="Question: how's the weather today? Answer: I don't mean to rain on your parade, but that's not exactly a scientific enquiry!" />
  <figcaption>An example of a pun you get for a non-scientific request.</figcaption>
</figure>

<p>Other things I’ve seen are</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What is the capital of Italy?
I <span class="s2">"Rome"</span> around looking <span class="k">for </span>science <span class="k">in </span>this question, but it<span class="s1">'s just geography!
</span></code></pre></div></div>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What is the meaning of life?
I<span class="s1">'m not sure, but I bet it'</span>s not something you can find <span class="k">in </span>a <span class="nb">test </span>tube!
</code></pre></div></div>

<p>… and so on.</p>

<h3 id="orchestrating">Orchestrating</h3>

<p>Past this first layer there’s an <strong>“orchestrator”</strong> agent that triages the request to one of 3 sub-agents, depending on its nature: requests for summaries of recent papers by category go to a <strong>“summary”</strong> agent, a particular research question is given to a <strong>“question”</strong> agent, and anything about a specific paper, given by title or URL, goes to an <strong>“article”</strong> agent.</p>
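
<p>As a rough illustration of the triage step only: in askademic the choice of sub-agent is made by the orchestrator LLM, whereas this sketch takes the chosen label as given and just dispatches on it. The <code>Route</code> class, the handler functions and the <code>dispatch</code> helper are hypothetical names made up for the example, not askademic’s or PydanticAI’s API.</p>

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Labels for the three sub-agents named in the text.
SUMMARY, QUESTION, ARTICLE = "summary", "question", "article"


@dataclass
class Route:
    """Hypothetical orchestrator decision: which sub-agent gets which request."""
    agent: str
    request: str


# Placeholder sub-agents: the real ones would call an LLM and the arXiv API.
def handle_summary(request: str) -> str:
    return f"[summary agent] {request}"


def handle_question(request: str) -> str:
    return f"[question agent] {request}"


def handle_article(request: str) -> str:
    return f"[article agent] {request}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    SUMMARY: handle_summary,
    QUESTION: handle_question,
    ARTICLE: handle_article,
}


def dispatch(route: Route) -> str:
    """Send the request to the sub-agent the orchestrator picked."""
    try:
        handler = HANDLERS[route.agent]
    except KeyError:
        raise ValueError(f"unknown agent: {route.agent!r}") from None
    return handler(route.request)


if __name__ == "__main__":
    print(dispatch(Route(SUMMARY, "Give me the latest papers in group theory")))
```

<p>Keeping the routing table as a plain dictionary means a new sub-agent only needs a new label and a new handler.</p>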

<p>The orchestrator has memory (message history), so as to make use of what was discussed before in case you want to ask for follow-ups (which is particularly useful for the second use case, the questions one). The memory feature, courtesy of PydanticAI, works of course by attaching all previous messages exchanged to the prompt so that the LLM has context. This is where Gemini comes in handy with its 2M context window - it’s pretty hard to exceed it. We are adding support for Claude and one of the pain points is the limited (in comparison) context window - we are still experimenting with the best choices.</p>
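
<p>To make the idea of memory concrete, here is a minimal sketch of attaching previous messages to a new question. The text says askademic attaches <em>all</em> previous messages; the trimming to a token budget is an assumption added here to show one possible way of coping with a smaller context window, not what the tool does. The names <code>estimate_tokens</code> and <code>build_prompt</code>, and the 4-characters-per-token rule, are made up for the example.</p>

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text)


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_prompt(history: List[Message], question: str, budget: int) -> List[Message]:
    """Attach as much recent history as fits in the budget, then the new question."""
    remaining = budget - estimate_tokens(question)
    kept: List[Message] = []
    # Walk backwards so the most recent exchanges are kept first.
    for role, text in reversed(history):
        cost = estimate_tokens(text)
        if cost > remaining:
            break
        kept.append((role, text))
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept + [("user", question)]


if __name__ == "__main__":
    history = [
        ("user", "What is known about nilpotent bicyclic groups?"),
        ("assistant", "A recent paper characterises them."),
    ]
    for role, text in build_prompt(history, "Can you expand on that?", budget=1000):
        print(role, ":", text)
```

<p>With a very large budget nothing is dropped, which matches the behaviour described above; with a small one, the oldest messages go first.</p>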

<h3 id="summarising">Summarising</h3>

<p>Summarising works by using the <a href="https://arxiv.org/category_taxonomy">taxonomy of arXiv categories</a> and is based on abstracts. We feed the categories in as a hardcoded list, each one given as an ID-description pair (e.g. “hep-ex” - “High Energy Physics - Experiment”).
We were originally scraping the arXiv taxonomy page but that obviously proved to be the best way to get banned by arXiv (that is, to get given a captcha wall). Because there isn’t an endpoint for categories, we resorted to just hard-coding the list.</p>

<p>The summary agent is instructed to:</p>
<ol>
  <li>first, choose the most relevant category according to your request</li>
  <li>then, run a first API query to figure out what’s the most recent available day of publications</li>
  <li>finally, do another query for <em>all</em> abstracts of papers published in that category, which we then curtail to the ones belonging to the day from step 2.</li>
</ol>

<p>We say <em>all</em> abstracts but we actually use a max of 300: this comes from a heuristic, as the max we’ve seen published in a category in a day is below that mark so it seemed sensible, and a cap is useful not to explode the context too much anyway. Certainly a point of improvement.
We use abstracts rather than full papers for the same reason: they’d add too much text.</p>
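
<p>As a rough illustration of the last step, here is a minimal sketch of curtailing a batch of results to the most recent day and capping it. The record layout (<code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">published</code>, <code class="language-plaintext highlighter-rouge">abstract</code> keys) and the function name are made up for the example, they are not taken from the actual codebase, and the two API queries described above are collapsed into one in-memory list for simplicity.</p>

```python
from datetime import date

MAX_ABSTRACTS = 300  # the cap quoted in the post


def latest_day_abstracts(records, cap=MAX_ABSTRACTS):
    """Keep only the records from the most recent publication day, up to `cap`."""
    if not records:
        return []
    latest = max(r["published"] for r in records)
    same_day = [r for r in records if r["published"] == latest]
    return same_day[:cap]


# Hypothetical records; the real ones would come from the arXiv API
records = [
    {"id": "a", "published": date(2025, 5, 8), "abstract": "..."},
    {"id": "b", "published": date(2025, 5, 7), "abstract": "..."},
    {"id": "c", "published": date(2025, 5, 8), "abstract": "..."},
]
print([r["id"] for r in latest_day_abstracts(records)])  # ['a', 'c']
```

<p>Keeping this part as plain code means the cap and the date filter behave the same way on every run, whatever the LLM does before or after.</p>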

<p>You can e.g. ask</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Give me the latest papers in group theory
</code></pre></div></div>

<p>and you get this kind of output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>category: category_id='math.GR' category_name='Group Theory'
latest_published_day: 2025-05-08
summary: The latest research covers several topics in group theory and related areas.
One paper focuses on proving a fixed point theorem for the action of $SL_n$ over local
fields on symmetric spaces of infinite dimension and finite rank
(http://arxiv.org/pdf/2505.05220v1). Another paper deals with the conjugate generation
of sporadic almost simple groups, determining a specific value related to their
automorphisms (http://arxiv.org/pdf/2505.05173v1). There is also a classification of
proper partial linear spaces affording imprimitive rank 3 automorphism groups,
including the construction of infinite families and individual examples
(http://arxiv.org/pdf/2505.05124v1). Further research characterizes nilpotent bicyclic
groups, generalizing a previous result related to abelian bicyclic groups
(http://arxiv.org/pdf/2505.05065v1). Finally, one article studies flag-transitive
automorphism groups of 2-designs with prime λ, identifying specific designs with
exceptional or sporadic simple groups as socles (http://arxiv.org/pdf/2505.04985v1).
recent_papers_url: https://arxiv.org/list/math.GR/new
</code></pre></div></div>

<h3 id="answering-questions">Answering questions</h3>

<p>This agent answers questions by looking in the literature. It’s made in such a way to:</p>
<ol>
  <li>decide what queries to run against the API, searching for matches within the abstract</li>
  <li>run said queries and collect abstracts</li>
  <li>evaluate all abstracts for relevance to the question and only pick the most relevant ones</li>
  <li>finally, read the papers related to those abstracts (but we curtail the number of them)</li>
  <li>produce an answer with all this info</li>
</ol>

<p>Step 1 works pretty well: because the arXiv API is based on Apache Lucene, the relevance retrieval is pretty good.</p>

<p>The paper retrieval is done by pulling from the site because the whole paper text isn’t reachable via API - this is something we need to improve too, as again you may get locked out.</p>

<p>An example, with question <code class="language-plaintext highlighter-rouge">How is the performance of LLMs measured when it comes to mathematical reasoning?</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>response: LLMs\' performance in mathematical reasoning is measured using various metrics and
benchmarks. Here\'s a breakdown of how the articles describe it:

*   **Metrics**: The primary metrics used are Accuracy (ACC), Reasoning Score (RS), and Clarity Score
(CS), reflecting different dimensions of mathematical understanding and communication
(http://arxiv.org/pdf/2503.10573v2.pdf).
*   **Benchmarks**: Several benchmark datasets are used to evaluate LLMs, including Math Competition
(MATH), Grade School Math (GSM8K), and Massive Multitask Language Understanding (MMLU) math subset
(http://arxiv.org/pdf/2503.10573v2.pdf).
*   **Evaluation Framework**: A process-oriented framework is used to evaluate LLMs\' ability to
construct mathematical models, using solvers to compare outputs with ground truth
(http://arxiv.org/pdf/2405.13144v3.pdf).
*   **MATHHAY Benchmark**: MATHHAY is an automated benchmark designed to assess the long-context
mathematical reasoning capabilities of LLMs. It includes questions of varying difficulty levels to
assess LLMs’ reasoning abilities across different input lengths (32K, 64K, 128K)
(http://arxiv.org/pdf/2410.04698v1.pdf).

article_list: ['http://arxiv.org/pdf/2503.10573v2.pdf', 'http://arxiv.org/pdf/2405.13144v3.pdf',
'http://arxiv.org/pdf/2410.04698v1.pdf']
</code></pre></div></div>

<p>Note that these are the API queries it ran (sorting parameters are not left for the LLM to choose, they’re imposed):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://export.arxiv.org/api/query?search_query=abs:llm mathematical reasoning performance metrics&amp;start=0&amp;max_results=20&amp;sortBy=relevance&amp;sortOrder=descending
http://export.arxiv.org/api/query?search_query=abs:evaluating mathematical abilities of large language models&amp;start=0&amp;max_results=20&amp;sortBy=relevance&amp;sortOrder=descending
http://export.arxiv.org/api/query?search_query=abs:mathematical reasoning benchmarks for llms&amp;start=0&amp;max_results=20&amp;sortBy=relevance&amp;sortOrder=descending
http://export.arxiv.org/api/query?search_query=abs:llm performance on math problems&amp;start=0&amp;max_results=20&amp;sortBy=relevance&amp;sortOrder=descending
http://export.arxiv.org/api/query?search_query=abs:measuring mathematical understanding in llms&amp;start=0&amp;max_results=20&amp;sortBy=relevance&amp;sortOrder=descending
</code></pre></div></div>
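
<p>The queries above are shown with raw spaces in them. As a small sketch of how such a URL could be assembled with the standard library so that the search terms get encoded, something like the following would do. The helper name is hypothetical, while the parameter names and values are the ones visible in the queries above.</p>

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(terms, max_results=20):
    """Build an abstract-search URL; sorting is fixed, not chosen by the LLM."""
    params = {
        "search_query": f"abs:{terms}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "relevance",
        "sortOrder": "descending",
    }
    # safe=":" keeps the field prefix readable; spaces are encoded as "+"
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_query_url("mathematical reasoning benchmarks for llms"))
```

<p>The LLM only has to supply the search terms, everything else stays hardcoded in the function.</p>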

<h3 id="talking-about-an-article">Talking about an article</h3>

<p>This agent really is quite simple. It retrieves a paper by link or title, “reads it” as whole text (although we also curtail the input here, to a maximum number of characters) and responds to whatever you asked.</p>

<p>For instance, asking <code class="language-plaintext highlighter-rouge">what was the point of paper 'attention is all you need'?</code></p>

<p>will lead to something like</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>response: The point of the paper 'Attention is all you need' is to introduce a new and simple network architecture called the Transformer. This architecture relies solely on attention mechanisms,
dispensing with recurrence and convolutions entirely, to draw global dependencies between input and
output. The paper demonstrates that the Transformer is superior in quality, more parallelizable, and
requires less training time compared to complex recurrent or convolutional neural networks, achieving
state-of-the-art results on machine translation tasks.
article_title: Attention is all you need
article_link: http://arxiv.org/pdf/1706.03762v7
</code></pre></div></div>

<h2 id="reducing-agency">Reducing agency</h2>

<p>Initially, each of the agents was doing all the steps on its own in a full pipeline. They had been conceived as single agents, with a detailed prompt giving all instructions at once.</p>

<p>We’ve now moved to a system where the summary and question agents are actually cascades of small agents each doing a step. Some of the steps may not be “agentic” at all: for example, for the summary agent, there’s first an agent that chooses the matching category, then some other code runs the API queries (as opposed to an agent equipped with a tool doing this), finally passing the results to an agent that summarises them.
The same applies to the question agent: small agents in cascade “decide” what API queries to run, judge relevance of abstracts and then summarise the whole thing from the papers retrieved.</p>

<p>Why this? Because we realised that sometimes the pipeline as executed the whole way by a single macro-agent would go on forever, it would still run queries even when the answer/response was clearly there. <em>We had to remove some agency</em>!</p>
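
<p>To make the idea concrete, here is a minimal sketch of what a fixed cascade looks like for the summary case. It is not the project’s actual code: the function names are invented and the three steps are passed in as plain callables, with stubs standing in for the LLM calls and the API query.</p>

```python
def run_summary_pipeline(request, choose_category, fetch_abstracts, summarise):
    """Run a fixed three-step cascade; every step is called exactly once."""
    category = choose_category(request)    # small agent: one LLM call
    abstracts = fetch_abstracts(category)  # plain code: no LLM involved
    return summarise(abstracts)            # small agent: one LLM call


# Stand-ins for the real agents and API call, only to make the sketch runnable
calls = []


def fake_choose_category(request):
    calls.append("choose")
    return "math.GR"


def fake_fetch_abstracts(category):
    calls.append("fetch")
    return [f"abstract 1 in {category}", f"abstract 2 in {category}"]


def fake_summarise(abstracts):
    calls.append("summarise")
    return f"summary of {len(abstracts)} abstracts"


print(run_summary_pipeline("latest in group theory", fake_choose_category,
                           fake_fetch_abstracts, fake_summarise))
print(calls)  # ['choose', 'fetch', 'summarise']
```

<p>Because the order and number of steps live in ordinary code, there is no way for the pipeline to keep issuing queries: the only decisions left to the LLM are the ones inside each step.</p>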

<p>This process is ongoing and we’re experimenting with the best compromises, also noting that there may be differences depending on what the LLM family is. For example, we’re currently in the process of reducing the agency of the “article” agent too, because when asked for a paper that doesn’t exist it tends to go on forever trying to find it. The old adage of <strong>do things the easiest and most reliable way, do not complicate your life adding uncertainty, still stands strong</strong>. Agents/LLMs are great for things that you can’t really do otherwise, things like searching for info within text or creating summaries or matching up content, but if they have too much to do at once the chances that they bomb it may not be worth the risk.</p>

<p>It is known that LLMs tend to be <a href="https://www.reddit.com/r/ChatGPT/comments/1ev21jm/is_chatgpt_and_any_other_llm_for_that_matter/">yes-minions</a>: they’ve been trained to produce answers for a positive query (as in, when the answer exists) rather than for judging whether the query makes sense in the first place.
We are learning about when it is best to leave them to decide things on their own and when a pipeline of intertwined agents/IFTTT statements suits our needs much better instead. This point will likely be explored in another post, hope you enjoyed this!</p>]]></content><author><name>Martina Pugliese</name></author><category term="tech" /><category term="GenAI" /><category term="agent" /><category term="research" /><category term="llm" /><category term="AI" /><summary type="html"><![CDATA[Genesis and some considerations on agents' agency]]></summary></entry><entry><title type="html">Building myself a tool to summarise AI papers with Gemini</title><link href="https://martinapugliese.github.io/tech/gemini-summarise-papers/" rel="alternate" type="text/html" title="Building myself a tool to summarise AI papers with Gemini" /><published>2025-02-15T00:00:00+00:00</published><updated>2025-02-15T00:00:00+00:00</updated><id>https://martinapugliese.github.io/tech/gemini-summarise-papers</id><content type="html" xml:base="https://martinapugliese.github.io/tech/gemini-summarise-papers/"><![CDATA[<p>One of the problems I have is a lack of time and energy to read all papers that I’d like to read, especially in the area of AI. There is just …a lot of material. Of course the most significant papers will naturally find their way to me by virtue of word of mouth, socials, media coverage, newsletters etc, but many others get lost if you don’t make a proactive effort to look for them.</p>

<p>So I thought that it may be a great idea to go a bit meta (not the company, the original meaning) and use GenAI to summarise papers which will mostly be …about GenAI. Summarising text is one of the most popular use cases for LLMs, and a growing area of research deals with their effectiveness on this, see this <a href="https://arxiv.org/abs/2406.11289">relatively recent review paper</a>.</p>

<p>Wanting to also try out the Gemini LLM family, I’ve decided to build a lil’ tool - which for now is nothing more than a <a href="https://colab.research.google.com/drive/1vSsKggNon9HwiY4qx-45H3JySsz7oonF?usp=sharing">Colab notebook</a> (see notes below) - that parses the <a href="https://arxiv.org/list/cs.AI/recent">AI “recent papers” page from the ArXiv</a> and spits out a digestible summary of all the papers published on the latest available day.</p>

<p>The ArXiv was built in the early 1990s by the physics community as a way to share preprints and gather feedback before submitting them for publication to a journal. It has since grown to host a plethora of disciplines and sub-disciplines and is the main general reference for scientific “papers” - the quotes are because what’s there are preprints, largely material that’s not been peer-reviewed, so always to take with a pinch of salt.
The ArXiv is however undoubtedly a great manifestation of the power of open science/access/source, and allows lots of people, not just scientists but software engineers, practitioners and just about anyone to access knowledge as it comes. It doesn’t have everything but it has a lot.</p>

<p>For AI in particular, I have the feeling that it is <em>the</em> main reference. 
Thankfully each discipline is organised in classes and subclasses which each have a “recent entries” page with listings per day, from Monday to Friday. I don’t know for a fact but wouldn’t be surprised to learn that the AI subclass (subclass of Computer Science) is the most popular these days - the last few days have exceeded 150 papers/day. Physics as a whole class has seen 49 entries on Feb 14.</p>

<h2 id="procedure">Procedure</h2>

<p>My Colab notebook is <a href="https://colab.research.google.com/drive/1vSsKggNon9HwiY4qx-45H3JySsz7oonF?usp=sharing">here</a>, versioned on Github <a href="https://github.com/martinapugliese/summarise-sci-literature/tree/main">here</a>. My idea for now is to run this daily, check out what it comes up with and maybe later on move it to an automated tool (it can be something as simple as a cron job or scheduled on the cloud, maybe allowing it to send myself daily emails with the summaries).</p>

<p>The whole idea is to let Gemini “read” papers published to the ArXiv in the latest available day and summarise them in a succinct, clear and to-the-point way. Of course, there is no substitute for reading a paper yourself, understanding and appreciating it in detail. Further, LLMs can hallucinate, fail to point out the main points or just do a bad job, I’m aware. But still, caveats considered, this is very valuable to me because I wouldn’t be able to ingest all the daily info myself - it’s been a pain point for a while.</p>

<p>I’ve used this prompt, which seemed to work well after a few anecdotal tests:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sys_instruct</span> <span class="o">=</span> <span class="s">"""
                You are an experienced reader of academic literature and
                an expert in distilling important findings in a way that is understandable and clear.
               """</span>

<span class="n">prompt</span> <span class="o">=</span> <span class="s">"""This is a paper on AI. 
            Summarise its results in 3 lines, avoiding obscure jargon and going to the point.
            If there are valuable examples that aid understanding, report them in a nutshell.
            """</span>
</code></pre></div></div>

<p>The paper will be given as a file directly, without the need to first extract its text - this is easily doable via the Gemini API.</p>

<p>You may have thought “why don’t you just read the abstracts that the ArXiv aggregation page already presents”? Well, often they’re not enough, they don’t necessarily present a clear idea devoid of jargon that you’d only learn when reading the whole thing. Also, I am very keen on getting the example(s) in the summary, they significantly speed up my understanding. But I can test prompts to check how much quick info that’s not just the abstract I can get - the more I run this the more I’ll see.</p>

<p>I initially thought of using an LLM to parse the “recent papers” page but decided against it as it felt like a waste: it is easy enough to extract info with a bit of good old <code class="language-plaintext highlighter-rouge">requests</code> and <code class="language-plaintext highlighter-rouge">BeautifulSoup</code>. The page is well structured and I don’t expect it to change in time, so it’s a case of finding the ordered list of paper IDs, URLs and titles. After figuring out what’s the latest day of data available and how many entries it has (from parsing the first H3 header, e.g. “Fri, 14 Feb 2025 (showing first 50 of 159 entries )”), this is it:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># find the URLs to these papers for the latest day only (up to n_entries as per above)
</span><span class="n">paper_links</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="p">{</span><span class="s">"title"</span><span class="p">:</span> <span class="s">"Download PDF"</span><span class="p">})[:</span><span class="n">n_entries</span><span class="p">]</span>

<span class="c1"># Extract the paper IDs and links
</span><span class="n">paper_ids</span><span class="p">,</span> <span class="n">paper_urls</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">paper_links</span><span class="p">:</span>
    <span class="n">paper_url</span> <span class="o">=</span> <span class="s">"https://arxiv.org"</span> <span class="o">+</span> <span class="n">link</span><span class="p">[</span><span class="s">"href"</span><span class="p">]</span>
    <span class="n">paper_id</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="s">"href"</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"/"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"v"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>  <span class="c1"># Extract the ID
</span>
    <span class="n">paper_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">paper_id</span><span class="p">)</span>
    <span class="n">paper_urls</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">paper_url</span><span class="p">)</span>

<span class="c1"># separately, find all titles (this is due to how the DOM is structured)
# they'll appear in the same order so order counts
</span><span class="n">paper_title_divs</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"list-title mathjax"</span><span class="p">})[:</span><span class="n">n_entries</span><span class="p">]</span>

<span class="n">paper_titles</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">title_div</span> <span class="ow">in</span> <span class="n">paper_title_divs</span><span class="p">:</span>
    <span class="n">paper_titles</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">title_div</span><span class="p">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)[</span><span class="mi">1</span><span class="p">].</span><span class="n">lstrip</span><span class="p">())</span>
</code></pre></div></div>
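
<p>The header-parsing step mentioned before the snippet is not shown there; a minimal sketch of it could look like the following. The regular expression and the function name are my own assumptions based on the single example header quoted above, so the real page may need a more lenient pattern. The date parsing also assumes an English locale.</p>

```python
import re
from datetime import datetime

# Matches e.g. "Fri, 14 Feb 2025 (showing first 50 of 159 entries )"
HEADER_PATTERN = re.compile(
    r"(\w{3}, \d{1,2} \w{3} \d{4}) \(showing (?:first )?(\d+) of (\d+) entries\s*\)"
)


def parse_header(text):
    """Return (day, entries shown on the page, total entries for that day)."""
    match = HEADER_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unrecognised header: {text!r}")
    day_text, shown, total = match.groups()
    day = datetime.strptime(day_text, "%a, %d %b %Y").date()
    return day, int(shown), int(total)


print(parse_header("Fri, 14 Feb 2025 (showing first 50 of 159 entries )"))
```

<p>Raising on an unrecognised header is deliberate: if the page layout ever changes, it is better to fail loudly than to silently summarise the wrong set of papers.</p>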

<p>Then, I download all the papers locally (to the file system Colab is giving me, which, note, is ephemeral) with <code class="language-plaintext highlighter-rouge">urllib.request</code> - this is very fast, takes about 20 seconds for 150 papers (files are rather small).</p>

<p>Finally, I can ask Gemini to process them with the prompt above (and ingesting the file itself), one by one with:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">client</span> <span class="o">=</span> <span class="n">genai</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">userdata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'GEMINI_API_KEY'</span><span class="p">))</span>

<span class="c1"># configure which Gemini to run 
</span><span class="n">model</span> <span class="o">=</span> <span class="s">"gemini-2.0-flash"</span>

<span class="n">file_</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">files</span><span class="p">.</span><span class="n">upload</span><span class="p">(</span><span class="nb">file</span><span class="o">=</span><span class="sa">f</span><span class="s">'pdfs/</span><span class="si">{</span><span class="n">filename</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">()</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">generate_content</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">config</span><span class="o">=</span><span class="n">types</span><span class="p">.</span><span class="n">GenerateContentConfig</span><span class="p">(</span><span class="n">system_instruction</span><span class="o">=</span><span class="n">sys_instruct</span><span class="p">,</span>
                                       <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>  <span class="c1"># use greedy decoding
</span>    <span class="n">contents</span><span class="o">=</span><span class="p">[</span><span class="n">prompt</span><span class="p">,</span> <span class="n">file_</span><span class="p">])</span>
</code></pre></div></div>

<p>I’ve used Gemini 2.0 Flash but I’ll check out if Lite can do the same job and maybe faster. About the API key, you can store it in the <a href="https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75">“secrets” feature</a> in Colab which works a bit like environment variables.
Gemini has a free tier of 15 requests/minute which I’ve not exceeded with my daily run, because a document gets processed in about 5 seconds (median across all), which means in a minute I can expect to send ~12 requests. I’ll see how it plays out in time, more than happy to pay a little for this, pricing is <a href="https://ai.google.dev/gemini-api/docs/rate-limits#free-tier">quite friendly</a> too. I may be wrong but Gemini is the only proprietary (non open-source) LLM family with a free API tier.</p>
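
<p>The back-of-the-envelope arithmetic above can be written down as a tiny sketch, which also shows how long one would need to pause between calls if the model got faster. The two constants are the figures quoted in this post and the function names are just illustrative.</p>

```python
FREE_TIER_RPM = 15        # requests/minute quoted in the post
SECONDS_PER_DOCUMENT = 5  # median latency quoted in the post


def sequential_rate(seconds_per_request):
    """Requests per minute when calls are made strictly one after the other."""
    return 60 / seconds_per_request


def extra_wait(seconds_per_request, rpm_limit):
    """Seconds to sleep after each call so that the limit is never exceeded."""
    return max(0.0, 60 / rpm_limit - seconds_per_request)


print(sequential_rate(SECONDS_PER_DOCUMENT))            # 12.0
print(extra_wait(SECONDS_PER_DOCUMENT, FREE_TIER_RPM))  # 0.0
print(extra_wait(2, FREE_TIER_RPM))                     # 2.0
```

<p>With a 5-second call no pause is needed, but a 2-second call would need roughly 2 more seconds of sleep to stay under 15 requests/minute.</p>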

<p>I save the summaries as they come, keep track of tokens consumed in input and output and latency just for the sake of having stats. At the end, I generate a lil’ HTML page with the summary, title and href to URL of each paper (I’ll spare you the code for that here, check the notebook in case). For Feb 14, the summary page looks like this</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/gemini-paper-summaries.png" alt="A very basic HTML page with title and link of each paper and the summary below, no design element applied." />
  <figcaption>First 3 papers from Feb 14, 2025 as summarised via Gemini 2.0 Flash, from the HTML page I am generating.</figcaption>
</figure>

<p>It’s bare, but I don’t need anything else - all I need is to read/skim it every day and, if desired, go check out the paper itself (hence the links are important). I could make it send it to me via email I suppose.</p>

<h2 id="what-can-be-improved">What can be improved</h2>

<p>A lot of things. What comes to mind immediately:</p>
<ul>
  <li>The design of the HTML summary page!</li>
  <li>Send the summary somewhere, e.g. via email</li>
  <li>Add a preliminary check to see if the latest available day of papers has been already summarised</li>
  <li>Automate it as a job - for now a daily manual run is good enough while I check results out</li>
  <li>A better prompt?</li>
  <li>The Colab runtime has a time cutoff - for now running the notebook to summarise ~150 papers takes about 20 mins overall so it’s well within limits (of course the slowest part is the LLM call, but Gemini seems fast enough)</li>
  <li>The LLM doesn’t have to be Gemini (although I have to say I quite like it!), I could experiment with other ones, and I’d particularly love to use open source ones</li>
</ul>

<h2 id="new-additional-features-i-could-add">New additional “features” I could add</h2>

<p>AI papers don’t get published just under the “AI” subclass, another popular one is e.g. <a href="https://arxiv.org/list/cs.CL/recent">“Computation and Language”</a> - I could do the same for each section of interest.</p>

<p>Inevitably, these’d be too many papers. I could then add a second layer that collates all the daily paper summaries and gives me an overview of the themes.</p>

<hr />

<p>Let me know what you think - find me on BlueSky or LinkedIn (see sidebar). Any feedback is greatly appreciated, including if you find mistakes.</p>

<p>Also, if you’re one for newsletters, my posts can get regularly sent to you:</p>

<iframe scrolling="no" style="width:100%!important;height:220px;border:1px #ccc solid !important" src="https://buttondown.email/martinapugliese?as_embed=true"></iframe>
<p><br /><br /></p>]]></content><author><name>Martina Pugliese</name></author><category term="tech" /><category term="python" /><category term="GenAI" /><category term="Gemini" /><category term="science" /><category term="llm" /><summary type="html"><![CDATA[And little learnings along the way.]]></summary></entry><entry><title type="html">Gambling, probability and Bernoulli trials</title><link href="https://martinapugliese.github.io/excursus/probability/" rel="alternate" type="text/html" title="Gambling, probability and Bernoulli trials" /><published>2025-01-24T00:00:00+00:00</published><updated>2025-01-24T00:00:00+00:00</updated><id>https://martinapugliese.github.io/excursus/probability</id><content type="html" xml:base="https://martinapugliese.github.io/excursus/probability/"><![CDATA[<blockquote>
  <p>It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge.<br />
<strong>Pierre-Simon, Marquis de Laplace</strong>, Théorie Analytique des Probabilités, 1812</p>
</blockquote>

<p>The developments of gambling and probability theory have gone hand in hand throughout history: humans have been playing dice and cards for centuries, humans have then started formalising strategies for wins and losses.</p>

<p>Primitive forms of dice are among <a href="https://daily.jstor.org/the-ancient-origins-of-dice/">the first playthings invented by humanity</a>; card games were likely invented in <a href="https://en.wikipedia.org/wiki/Playing_card#History">ancient China in the 9th century</a>. There are references to games in numerous pieces of literature, for instance the “Zara” dice game, from the Middle Ages, is mentioned in <a href="https://it.wikipedia.org/wiki/Zara_(gioco)">Dante Alighieri’s Comedy and other works</a>.
The Zara game is very simple: you throw three 6-sided dice at once after choosing a number 3-18 (respectively the minimum and maximum achievable) - because results in the middle of the scale are obtainable with a larger number of dice combinations (you get a 3 with 1-1-1 only, an 18 with 6-6-6 only, a 4 with 1-1-2/1-2-1/2-1-1, a 17 with 6-6-5/6-5-6/5-6-6, …, but a 10 or 11 with 27 combos each, out of the 216 possible ones), obviously it’s convenient to bet on them. It does feel like an uninteresting game because of this, but maybe they hadn’t figured it out yet back then! This game is very useful as a toy example to <a href="http://www.syllogismos.it/education/Mcots2.pdf">teach concepts in probability</a>.</p>
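
<p>The counts quoted for the Zara game are easy to check by brute force. Here is a short sketch that enumerates all the ordered outcomes of three dice and tallies the sums (variable names are just illustrative):</p>

```python
from collections import Counter
from itertools import product

# All ordered outcomes of three 6-sided dice, tallied by their sum
outcomes = Counter(sum(dice) for dice in product(range(1, 7), repeat=3))
total = sum(outcomes.values())

print(total)  # 216
for target in (3, 4, 10, 11, 17, 18):
    print(target, outcomes[target], round(outcomes[target] / total, 4))
```

<p>This gives 1 combination each for 3 and 18, 3 each for 4 and 17 and 27 each for 10 and 11, so betting on 10 or 11 wins with probability 27/216 = 0.125.</p>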

<p>Starting from the 17th century a whole new area of mathematics developed to investigate and formalise the concepts and measurements around chance, odds, wins and losses. Fermat, Pascal, Huygens, Bernoulli, Cardano, de Moivre, Laplace are some of the famous names involved in the founding of this new intellectual endeavour. I’d love to know if the pioneers of probability were also gamblers themselves, but I’m not sure I found reliable information. Either way, it was during their times that playing and gambling <a href="https://theconversation.com/how-the-18th-century-probability-revolution-fueled-the-casino-gambling-craze-228347">took on a whole new level in Europe</a>. 
It is not uncommon anyway to have people involved in mathematics and forms of gambling or investing to this day - <a href="https://en.wikipedia.org/wiki/Jim_Simons#">Jim Simons</a> is a brilliant example: a mathematician by background and career with important academic contributions and prizes won, he also worked at the NSA and eventually launched a hedge fund company which made him a billionaire. He’s also been a great philanthropist who contributed all his life, monetarily and with time and energy, to causes related to the dissemination of scientific education and research, establishing and funding, amongst other nonprofits, <a href="https://www.mathforamerica.org/">“Math for America”</a>.</p>

<p>We’ll give a brief overview of something from the early era of probability theory: Bernoulli trials and the related probability distributions.</p>

<h2 id="the-bernoulli-distribution">The Bernoulli distribution</h2>

<p>Let’s consider a binary variable \(X\), that is, one that can take only 2 values, say “1” (usually representing <em>success</em>) with probability \(p\) and “0” (usually representing <em>failure</em>), with probability \(1-p\). Phenomena behaving like this are known as Bernoulli trials - examples are a coin flip or randomly asking people in the street whether they had cereal for breakfast or not.</p>

<p>We can write the <a href="https://en.wikipedia.org/wiki/Probability_mass_function">probability mass function</a> (the mathematical form of probability for each value) as</p>

\[P (X=x) = p^x(1-p)^{1-x}\]

<p>because when \(x=1\) we are left with \(p\) and when \(x=0\) with \(1-p\). This is the Bernoulli distribution.</p>

<p>Expected value (the mean where each value is weighted by its probability) and variance (measuring the spread of values) calculate respectively as</p>

\[\mathbb{E}[X] = \sum_{x \in \{0,1\}} x p^x(1-p)^{1-x} = 0 + 1p(1-p)^0 = p\]

<p>and</p>

\[Var[X] = \sum_{x \in \{0,1\}} x^2 p^x(1-p)^{1-x} - p^2 = p - p^2 = p(1-p)\]
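
<p>As a quick sanity check of the two sums above, here is a small sketch that computes them directly from the probability mass function. The function names are made up for the example and \(p=0.3\) is an arbitrary choice.</p>

```python
import math


def bernoulli_pmf(x, p):
    """P(X = x) for x in {0, 1}."""
    return p**x * (1 - p) ** (1 - x)


def bernoulli_mean_var(p):
    """Mean and variance computed by summing over the two possible values."""
    mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
    second_moment = sum(x**2 * bernoulli_pmf(x, p) for x in (0, 1))
    return mean, second_moment - mean**2


p = 0.3
mean, var = bernoulli_mean_var(p)
print(math.isclose(mean, p), math.isclose(var, p * (1 - p)))  # True True
```

<p>The comparison uses <code class="language-plaintext highlighter-rouge">math.isclose</code> rather than equality because of floating point rounding.</p>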

<p>The Bernoulli responsible for this formalisation was Jakob (there are many famous Bernoullis), son of a pharmacist and brother to Johann. The boys were a bit naughty and disobeyed their father who wanted them to study medicine and theology and <a href="https://www.britannica.com/topic/Bernoulli-family">did mathematics instead</a>. We must be very thankful they did as they left us with a huge corpus of important work. Also, Johann’s sons and their sons continued the mathematical dynasty: the Bernoulli family was <a href="https://en.wikipedia.org/wiki/Bernoulli_family#Family_Tree_of_the_Basler_Bernoullis">quite gifted</a> for sure.</p>

<h2 id="more-successes-and-more-failures-the-binomial">More successes and more failures: the Binomial</h2>

<blockquote>
  <p>In this business it’s easy to confuse luck with brains.<br />
<strong>Jim Simons</strong> [undocumented]</p>
</blockquote>

<p>The Bernoulli is a special case of a more general distribution, the Binomial: you have \(k\) successes in a set of \(n\) Bernoulli trials. The probability mass function writes as</p>

\[P(X=k) = {n\choose k} p^k (1-p)^{n-k} \ .\]

<p>The first element, \({n \choose k}\), quite uncreatively called the <a href="https://en.wikipedia.org/wiki/Binomial_coefficient">“binomial coefficient”</a>, tells you how many ways there are of creating groups of \(k\) from \(n\) values and is written as (fully expanded)</p>

\[\begin{split}
{n\choose k} &amp;= \frac{n!}{k!(n-k)!} \\
 &amp;= \frac{n(n-1)(n-2) \ldots 1}{(k(k-1)(k-2)\ldots1)((n-k)(n-k-1)(n-k-2)\ldots1)} \\
 &amp;= \frac{n(n-1)(n-2) \ldots (n-k+1)}{k (k-1)(k-2)\ldots1}
\end{split}\]

<p>If you have a bag with 3 balls, named say A, B and C, and want to extract 2 of them (at the same time, that is, with no replacement), you can pick any of 3 different sets: AB, AC, BC, \({3 \choose 2} = \frac{3\cdot2}{2}=3\). If the balls are 4 (A,B,C,D) and you still want a group of 2, you can pick AB, AC, AD, BC, BD, CD, which makes 6 groups or \({4 \choose 2} = \frac{4\cdot3}{2\cdot1} = 6\). For a group of 3 with 4 balls you can do ABC, ABD, ACD, BCD, \({4\choose 3} = \frac{4\cdot3\cdot2}{3\cdot2\cdot1} = 4\). And so on.</p>
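
<p>The ball examples can be reproduced with the Python standard library: <code class="language-plaintext highlighter-rouge">itertools.combinations</code> lists the groups and <code class="language-plaintext highlighter-rouge">math.comb</code> counts them. This is just an illustrative sketch, with variable names of my choosing.</p>

```python
from itertools import combinations
from math import comb

balls = ["A", "B", "C", "D"]

# list every group of 2 and of 3 balls, ignoring order and without replacement
pairs = ["".join(group) for group in combinations(balls, 2)]
triples = ["".join(group) for group in combinations(balls, 3)]

print(pairs)    # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
print(triples)  # ['ABC', 'ABD', 'ACD', 'BCD']

# the binomial coefficient gives the same counts without listing the groups
print(comb(3, 2), comb(4, 2), comb(4, 3))  # 3 6 4
```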

<p>The expected value and variance of the Binomial are, respectively, \(\mathbb{E}[X] = n p\) and \(Var[X] = n p (1- p)\) (see <a href="https://en.wikipedia.org/wiki/Binomial_distribution#Expected_value_and_variance">Wikipedia</a> for the proofs).</p>
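
<p>Short of a proof, the two formulas can be checked numerically by summing over the probability mass function. A minimal sketch, where <code class="language-plaintext highlighter-rouge">binomial_pmf</code> is a helper name I made up and \(n=10\), \(p=0.3\) are arbitrary illustrative values:</p>

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3   # illustrative values, not taken from any dataset

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                            # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))              # compare with n * p
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2  # compare with n * p * (1 - p)

print(total, mean, var)
```

<p>Up to floating-point error, the three numbers should match 1, \(np = 3\) and \(np(1-p) = 2.1\).</p>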

<p>Let’s see this with a bit of code.</p>

<p>First the imports and setting up of a pseudo-random number generator in Numpy:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">()</span>
</code></pre></div></div>

<p>then we simulate 1000 trials of a fair coin flip, that is, one where each face is equiprobable (probability 0.5)</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span>    <span class="c1"># choose how many trials
</span><span class="n">trials</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>   <span class="c1"># rng.choice uses a uniform distribution over the values by default
</span></code></pre></div></div>

<p>and we count successes and their probability</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">successes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">trials</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">successes</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>which we can plot</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">p</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'p_success'</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'g'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'p_failure'</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Probability of success and failure'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Number of trials'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span>
</code></pre></div></div>

<p>obtaining these trends:</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/bernoulli-trials-05.png" alt="A plot showing the probabilities of success and failure in 1000 fair-coin trials, you see that they stabilise around 0.5 after about 200 trials." />
  <figcaption>Trials with a fair coin. The probabilities of success and failure take about 200 trials to stabilise around 0.5.</figcaption>
</figure>

<p>If we vary the probability of success (using the argument <code class="language-plaintext highlighter-rouge">p</code> in <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html"><code class="language-plaintext highlighter-rouge">numpy.random.choice</code></a>), so as to represent unfair coins, we get these trends</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/bernoulli-trials-p.png" alt="Plots showing the probabilities of success and failure in 1000 trials with coins of varying success probability." />
  <figcaption>Trials with unfair coin of specified probability.</figcaption>
</figure>
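
<p>For reference, a minimal sketch of how such unfair-coin runs could be generated. The helper name <code class="language-plaintext highlighter-rouge">running_success_fraction</code>, the seed and the probabilities in the loop are my own choices for illustration, not necessarily those behind the plot above.</p>

```python
import numpy as np

def running_success_fraction(p, n=1000, seed=None):
    """Simulate n flips of a coin with success probability p and
    return the fraction of successes observed after each flip."""
    rng = np.random.default_rng(seed)
    # p lists the probability of each value, in the same order as [0, 1]
    trials = rng.choice([0, 1], size=n, p=[1 - p, p])
    return np.cumsum(trials) / np.arange(1, n + 1)

for p in (0.1, 0.5, 0.9):   # illustrative success probabilities
    frac = running_success_fraction(p, n=1000, seed=0)
    print(p, round(frac[-1], 3))   # final estimate, expected to be near p
```

<p>Each returned array can be passed to the same plotting code used for the fair coin.</p>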

<p>The Binomial distribution generalises into a <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">Multinomial</a> when possible results are more than two.</p>

<p><em>Probability theory is full of nice and fun questions.</em></p>

<h2 id="good-reads">Good reads</h2>

<ul>
  <li><a href="https://www.britannica.com/science/probability">Probability and Statistics</a> on <strong>Encyclopedia Britannica</strong></li>
  <li>J MacDonald, <a href="https://daily.jstor.org/the-ancient-origins-of-dice/">The ancient origins of Dice</a>, <strong>JSTOR Daily</strong> 2018 - with lots of links to archeology studies</li>
  <li>J Roebke, <a href="https://web.archive.org/web/20090217190816/http://seedmagazine.com/news/2006/09/putting_his_money_where_his_ma.php">Putting his money where his math is</a>, <strong>Seed Magazine</strong> (defunct) 2006, via Web Archive - nice article about J Simons and his passion for maths education</li>
  <li>D T Max, <a href="https://www.newyorker.com/magazine/2017/12/18/jim-simons-the-numbers-king">Jim Simons, the numbers King</a>, <strong>The New Yorker</strong> 2017</li>
  <li>J Eglin, <a href="https://theconversation.com/how-the-18th-century-probability-revolution-fueled-the-casino-gambling-craze-228347">How the 18th-century ‘probability revolution’ fueled the casino gambling craze</a>, <strong>The Conversation</strong> 2024</li>
  <li>J I García-García, N A Fernández Coronado,E H Arredondo, I A Imilpán Rivera, <a href="https://www.mdpi.com/2227-7390/10/15/2680#B41-mathematics-10-02680">The Binomial Distribution: Historical Origin and Evolution of Its Problem Situations</a>, <strong>Mathematics</strong> 2022 (open access)</li>
</ul>]]></content><author><name>Martina Pugliese</name></author><category term="excursus" /><category term="probability" /><category term="history" /><summary type="html"><![CDATA[Exploring some of the simple concepts in the early theory of probability]]></summary></entry><entry><title type="html">Initials of German verbs</title><link href="https://martinapugliese.github.io/data/german-verbs/" rel="alternate" type="text/html" title="Initials of German verbs" /><published>2025-01-04T00:00:00+00:00</published><updated>2025-01-04T00:00:00+00:00</updated><id>https://martinapugliese.github.io/data/german-verbs</id><content type="html" xml:base="https://martinapugliese.github.io/data/german-verbs/"><![CDATA[<p>I am learning German and amongst the things I do is keeping track of the verbs I master. I do it the old school way: on paper, alphabetically, one sheet per initial letter. It sounds crazy but it is really satisfying to see my verbs pile up and count them.</p>

<p>German builds some of its verbs via prefixes, which either add nuance to the original meaning or alter it completely. Those in the latter class are the hard ones to absorb, at least for me. Out of struggling to distinguish my aufmachen from my ausmachen, I started this exercise: at the very start, I made a list of all verbs I could think of whose meaning and usage I was sure of; now, I just add one every time I feel like I’ve learned it, as in, I’ve stored it in my (hopefully) long-term memory and I can comfortably use it.</p>

<p>The whole of German is wonderfully Lego-brickable, but I only do this for verbs because with no other part of speech have I had such struggles, and also because without a rock-solid foundation in verbs it’s hard to build coherent discourse.</p>

<p>I am between B1-B2 levels now and the below is what I’ve got. It doesn’t look like many, but there will be several (many?) verbs that I somewhat know but I’m not really sure about. Also, there <del>may</del> will be verbs I forgot to add. And, I am not tracking all possible easy combinations from a root (the first class from above): I haven’t added <em>wegrennen</em> (run away) for instance, despite understanding it very well and intuitively.</p>

<p>These days the adding-to-list frequency is quite low, maybe I’m approaching the stage where I need to level up and/or diversify sources to have a new burst of novelty like in the early glorious days of this.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/german-verbs.jpg" alt="A bar plot, hand-drawn on yellow paper. Alphabetically-ordered letters on the x and count of verbs on the y, each letter has the count shown. Counts are A:36, B:27, C:0, D:8, E:21, F:23, G:14, H:13, I:3, J:1, K:18, L:13, M:12, N:4, O:4, P:8, Q:1, R:14, S:44, T:13, U:11, V:27, W:15, X:0, Y:0, Z:7" />
  <figcaption>My verbs, counted per initial. Some letters have also examples.</figcaption>
</figure>

<p>S really is queen! We have all the <em>sein</em>, <em>singen</em>, <em>stattfinden</em>, <em>schenken</em>, <em>schicken</em>, …
There’s 0, nichts, for C - if I really think hard now, nothing more than an English loan like <em>chatten</em>, should it exist at all (not sure), comes to mind. X and Y have the same fate but that’s not surprising, most words there will be actual loans. 
Only 1 entry for J, that’s <em>joggen</em>, also a loan.</p>

<p>Likely, and this is important, my verbs are the most common ones (because I’m a learner and don’t yet consistently consume texts aimed at natives) so this isn’t necessarily representative of the whole vocabulary of German verbs.</p>

<h2 id="can-i-pull-this-data-from-actual-german">Can I pull this data from actual German?</h2>

<p>What if I want to compare my totally personal German verbs distribution with one derived from German general texts, not mostly those aimed at learners? There’s lots of sources I could use for this but I thought I’d query the mighty Wikipedia. Oh, how lucky, Python has a <a href="https://wikipedia.readthedocs.io/en/latest/code.html">great package</a> that lets you do just that easily-peasily. And spaCy has <a href="https://spacy.io/models/de">models trained for German</a> (and many other languages), that I can use for POS-tagging (POS: part of speech) off the shelf. I don’t necessarily need much power so I’ve used the small model.</p>

<p>You can just do this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">spacy</span>
<span class="kn">import</span> <span class="nn">wikipedia</span>

<span class="c1"># settings
</span><span class="n">wikipedia</span><span class="p">.</span><span class="n">set_lang</span><span class="p">(</span><span class="s">"de"</span><span class="p">)</span>  <span class="c1"># set it to use Wikipedia German
</span>
<span class="n">lm</span> <span class="o">=</span> <span class="n">spacy</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'de_core_news_sm'</span><span class="p">)</span>  <span class="c1"># choose the spaCy language model
</span>
<span class="n">page</span> <span class="o">=</span> <span class="s">'Deutschland'</span>   <span class="c1"># title of a Wikipedia page
</span><span class="n">wikipedia</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>  <span class="c1"># this will give you best guesses at matching pages
</span>
<span class="c1"># to get text content
</span><span class="n">dl</span> <span class="o">=</span> <span class="n">wikipedia</span><span class="p">.</span><span class="n">page</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>

<span class="c1"># and to create a spaCy document
</span><span class="n">doc</span> <span class="o">=</span> <span class="n">lm</span><span class="p">(</span><span class="n">dl</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>

<p>Initially I thought I’d pull content from some specific long and detailed pages, like the <a href="https://de.wikipedia.org/wiki/Deutschland">one about Germany</a> itself (with all the history), some <a href="https://de.wikipedia.org/wiki/Das_Leben_der_Anderen">film ones</a>, some about the <a href="https://de.wikipedia.org/wiki/H%C3%A4nsel_und_Gretel">brothers Grimm Märchen</a> (fairytales) and so on. These are all pages rich in text and prone to contain narration, hence verbs. But the counts would be too few to be somewhat representative, and also the set would be biased by whatever set of topics I could think of. 
So I’ve added a bunch (many) random pages, which you can easily pull with this API:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wikipedia</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">pages</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>   <span class="c1"># ask for 100 random Wikipedia page titles
</span></code></pre></div></div>

<p>To extract verbs, following the spaCy POS glossary <a href="https://github.com/explosion/spaCy/blob/master/spacy/glossary.py">here</a> and with the <code class="language-plaintext highlighter-rouge">doc</code> from above, I do</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">verbs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">doc</span><span class="p">:</span>
    <span class="c1"># take all the verb POS tags
</span>    <span class="k">if</span> <span class="n">token</span><span class="p">.</span><span class="n">pos_</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'AUX'</span><span class="p">,</span> <span class="s">'VERB'</span><span class="p">,</span> <span class="s">'MD'</span><span class="p">,</span> <span class="s">'VB'</span><span class="p">,</span> 
                      <span class="s">'VBD'</span><span class="p">,</span> <span class="s">'VBG'</span><span class="p">,</span> <span class="s">'VBN'</span><span class="p">,</span> <span class="s">'VBP'</span><span class="p">,</span> <span class="s">'VBZ'</span><span class="p">,</span> 
                      <span class="s">'VAFIN'</span><span class="p">,</span> <span class="s">'VMFIN'</span><span class="p">,</span> <span class="s">'VVFIN'</span><span class="p">,</span> 
                       <span class="s">'VV'</span><span class="p">]:</span>
        <span class="n">verbs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">token</span><span class="p">.</span><span class="n">lemma_</span><span class="p">)</span>  <span class="c1"># get the lemma, it'll be the infinitive of the verb
</span><span class="n">verbs</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">verbs</span><span class="p">]</span> <span class="c1"># lower case as some will have a capitalised start
</span></code></pre></div></div>

<p>I’ve done a few rounds with random pages, considering that not all pages are rich in text (in fact, many are short, there’s also stubs, pages linking to others etc). After a while I ended up with</p>
<ul>
  <li>content from 3712 pages</li>
  <li>which sums up to 2899268 tokens (words)</li>
  <li>of which I have 10351 unique verbs</li>
</ul>

<p>That’s very different from my total 336 verbs. It’s been interesting to notice how the count of unique verbs grows quite slowly, it hadn’t saturated yet when I stopped (and it’s another question as to when it would do, maybe deserves another post) but it certainly decelerates. I end up with this (I’m not hand-drawing it this time):</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/german-verbs-wikipedia.jpg" alt="The same bar plot as before but with verb counts from the Wikipedia pages." />
  <figcaption>Wikipedia pages verbs, counted per initial.</figcaption>
</figure>

<p>S is not dominating anymore although it’s still quite high, and in fact (this was unexpected for me), A is the top one! There’s a few prefixes starting with A so maybe that’s why, and maybe I haven’t encountered - or learned - those verbs yet. My K seems overinflated, maybe it’s the common verbs that are particularly concentrated there. For the rest, it’s arguably not that different.</p>

<p>As a note: spaCy makes mistakes of course, so there are some things in there that aren’t verbs. When isolating the leading letters I also made sure to remove anything that wasn’t alphabetic, wasn’t a Latin character or was a punctuation sign, but some non-verbs still remain.</p>
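
<p>A minimal sketch of what counting per initial can look like, using only the standard library. The helper name <code class="language-plaintext highlighter-rouge">count_initials</code> and the sample lemmas are hypothetical, and the filter is an assumption: it keeps only initials in a-z, so verbs starting with an umlaut are dropped here, which may differ from the cleaning described above.</p>

```python
from collections import Counter
import string

def count_initials(lemmas):
    """Count unique lemmas per initial letter, keeping only initials in a-z."""
    counts = Counter()
    for lemma in set(lemmas):                       # count each verb once
        word = lemma.strip().lower()
        if not word.isalpha():                      # drops digits, punctuation, empty strings
            continue
        if word[0] not in string.ascii_lowercase:   # assumption: umlaut initials are skipped
            continue
        counts[word[0]] += 1
    return counts

# hypothetical sample, including a duplicate and some tagging noise
sample = ["sein", "singen", "schicken", "sein", "aufmachen",
          "ausmachen", "joggen", "2019", "über-", "ändern"]
counts = count_initials(sample)
print(sorted(counts.items()))   # [('a', 2), ('j', 1), ('s', 3)]
```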

<p>Well, this was quite a fun exercise and as I said I may dig some more. If you have any comments, please tell me <a href="https://bsky.app/profile/martinapugliese.bsky.social/post/3lewjivcwgc2d">here on Bluesky</a>.</p>

<p>I also have a funky newsletter where I share this kind of thing, you can check it out and subscribe <a href="https://buttondown.email/martinapugliese">here</a> if you want.</p>]]></content><author><name>Martina Pugliese</name></author><category term="data" /><category term="linguistics" /><category term="verbs" /><category term="language" /><category term="german" /><summary type="html"><![CDATA[A personal empirical measurement]]></summary></entry><entry><title type="html">The best things in 2024</title><link href="https://martinapugliese.github.io/best-things-2024/" rel="alternate" type="text/html" title="The best things in 2024" /><published>2024-12-31T00:00:00+00:00</published><updated>2024-12-31T00:00:00+00:00</updated><id>https://martinapugliese.github.io/best-things-2024</id><content type="html" xml:base="https://martinapugliese.github.io/best-things-2024/"><![CDATA[<p>Inspired by a <a href="https://werd.io/2024/stuff-i-loved-in-2024">post</a> by B Werdmuller, I’ve decided to also draw up a list of shows, books, movies and tools I really enjoyed in 2024.
I hope I’ve listed everything I wanted, but there’s a chance I forgot some stuff.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/2024-books.png" alt="Image of 3 books mentioned in the text" />
</figure>

<h2 id="books">Books</h2>

<h3 id="war-and-peace-l-tolstoy">War and Peace, L Tolstoy</h3>

<p>This was the year I took part in a year-long readalong of War and Peace. It was hosted by <a href="https://substack.com/home/post/p-151884015?source=queue">Simon Haisell on Substack</a>, and you can join it for 2025. The book is long but the chapters are short (it was first published in instalments), so it’s a wonderful idea for a yearly project. 
Simon writes a nice daily overview sparking comments and feedback, and participants can have a chat. I really enjoyed the experience and, as many said, it’s one of those books that I’d always wanted to read but was put off by the sheer size.</p>

<p>War and Peace draws on themes of human nature, belonging, hypocrisy and truth, but it also contains deep reflections on history and the Russian soul. I’ve read the Briggs translation (into English), which is apparently a bit “britishisized”, something I didn’t love. I wish I’d picked up an Italian translation instead, but the other thing about this book is that you can read it over and over again and pick up on something else every time, so maybe one day.</p>

<h3 id="poverty-by-america-m-desmond">Poverty, by America, M. Desmond</h3>

<p>Tragic and informative. The tale of how social and economic inequality in the USA came to be a feature, not a bug. You may or may not agree with all suggested causes and the political stance, but it’s nevertheless a great read.</p>

<h3 id="guns-germs-and-steel-the-fate-of-human-societies-j-diamond">Guns, germs and Steel: the fate of human societies, J Diamond</h3>

<p>It’s a cult classic now, criticised for being simplistic and/or too speculative. Regardless, I’ve learned a lot. Geography and local characteristics played a huge role in making some societies develop faster and more profoundly than others, ending up dominating.</p>

<h3 id="the-patriarchs-the-origins-of-inequality-a-saini">The Patriarchs: the origins of inequality, A Saini</h3>

<p>This is great - in fact, I’ve drawn many <a href="https://martinapugliese.github.io/quotes/">quotes</a> from it. How one gender ended up giving itself the right to oppress the other didn’t have to be a necessary development, nor was it always and everywhere the case.</p>

<h3 id="animal-farm-g-orwell">Animal Farm, G Orwell</h3>

<p>Another classic of course. Doesn’t age.</p>

<h2 id="movies">Movies</h2>

<h3 id="conclave">Conclave</h3>

<p>So good. Power, centuries-old tradition, patriarchy and a lot of twists with some hilarious moments. Clever too.</p>

<h3 id="one-life">One Life</h3>

<p>The real story of a British young man who saved many Jewish kids from Central Europe organising the Kindertransport - trains that would bring them to the UK in a scheme that allowed them to be temporarily hosted by British families. Many stayed.</p>

<h3 id="small-things-like-these">Small things like these</h3>

<p>Patriarchy and religion in rural Ireland, the ’80s. Inspired by the real events of the Magdalene Laundries. Cillian Murphy is amazing and this, despite being a bit slow, is really good.</p>

<h2 id="shows">Shows</h2>

<h3 id="fallout">Fallout</h3>

<p>It’s on <a href="https://www.amazon.co.uk/gp/video/detail/amzn1.dv.gti.8276269a-402e-4ece-a2b0-4eb5e2504a05?autoplay=0&amp;ref_=atv_cf_strg_wb">Prime</a>. You don’t have to know about the game before watching it. Dystopian, disturbing. A critique of excessively unregulated capitalism, because we humans can’t moderate ourselves?</p>

<h3 id="hornblower">Hornblower</h3>

<p>This is an old (’90s) British show, about the naval battles of the Napoleonic period. I like period dramas but I was surprised at how gripping I found this.</p>

<h3 id="young-sheldon">Young Sheldon</h3>

<p>There was something light I watched too! This is really good, I’ve been late to the party as I felt like it would be hard to replicate TBBT but this, despite being very different in style, is very funny, lighthearted and just enjoyable.</p>

<h2 id="tools">Tools</h2>

<h3 id="readwise--reader">Readwise &amp; Reader</h3>

<p>A year after I signed up, still very happy with it. Good for bookmarking articles, signing up to newsletters, annotating.</p>

<h3 id="buttondown">Buttondown</h3>

<p>Ever since <a href="https://martinapugliese.github.io/doodling-data-reloaded/">I moved from Substack</a>, very happy. Smooth UI, easy customisation, and most importantly, focus on the writing.</p>

<p><em>Here’s to another year of new great stuff.</em></p>]]></content><author><name>Martina Pugliese</name></author><category term="shows" /><category term="movies" /><category term="books" /><category term="tools" /><summary type="html"><![CDATA[Personal favourites, of course]]></summary></entry><entry><title type="html">A simple movie recommender based on similarity</title><link href="https://martinapugliese.github.io/tech/movielens-cf/" rel="alternate" type="text/html" title="A simple movie recommender based on similarity" /><published>2024-12-28T00:00:00+00:00</published><updated>2024-12-28T00:00:00+00:00</updated><id>https://martinapugliese.github.io/tech/movielens-cf</id><content type="html" xml:base="https://martinapugliese.github.io/tech/movielens-cf/"><![CDATA[<p>The fast.ai <a href="https://course.fast.ai/">“Practical Deep Learning for Coders” course</a> is a great start to DL, but it’s also packed full of great clever side ideas and insights here and there, so that every rewatch/re-read is valuable. I was recently looking at the Collaborative Filtering module, specifically this <a href="https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive">notebook</a> which gave me something to piggy-back for an old pet peeve I have.</p>

<p>I don’t normally find movie recommenders valuable because of a variety of reasons:</p>
<ul>
  <li>I want to make sure a movie is highly rated before even considering it for a watch, Rotten Tomatoes is still my go-to source - which means manual checks</li>
  <li>Usually, I want a movie that’s somewhat <em>similar</em> to a specific one I have in mind, depending on mood and feel. I’m not normally interested in lists of movies recommended to me because of an overall estimation based on everything I rated at once. Sometimes I’m in the mood for another ’50s comedy, sometimes for a new history film or a biopic, other times I just want a new stunning nature documentary.</li>
</ul>

<p>This means it takes me time before deciding what to watch and if I could cut that time it’d be awesome.</p>

<p>That fast.ai module uses a subset of Movielens data as a toy dataset to illustrate how to build a recommender, from scratch as well as using fast.ai utilities. I’ll do just the same but will use the latest edition of the Movielens <strong>full</strong> data dump and curate it to my needs (more on this below), looking at how results can help me with the above goal. Collaborative filtering is a methodology that essentially matches you with new items based on your ratings on items as well as the ratings of many other users - it learns your preferences and who’s similar to you.</p>

<p>I’ll run a quick model to get a sense of what I end up with. I will favour speed over quality, so long as the result is decent, and leave potential improvements to another day.</p>

<h2 id="movielens-data">Movielens data</h2>

<p><a href="https://movielens.org/explore/top-picks">Movielens</a> is a University of Minnesota project. It is a movie recommender based on collaborative filtering whose <a href="https://grouplens.org/datasets/movielens/">datasets</a> are regularly released freely. It has existed since the late ’90s, which means its data is the longest-standing one in this area and for this reason it is used for research and to teach ML concepts.</p>

<h3 id="downloading-the-data">Downloading the data</h3>

<p>You can download the latest full version of Movielens movie-ratings dataset from <a href="https://grouplens.org/datasets/movielens/latest/">here</a>. At the time of writing, it contains data up until mid-2023.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">zipfile</span>
<span class="kn">import</span> <span class="nn">io</span>

<span class="n">MOVIELENS_LATEST_URL</span> <span class="o">=</span> <span class="s">"https://files.grouplens.org/datasets/movielens/ml-latest.zip"</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">MOVIELENS_LATEST_URL</span><span class="p">)</span>
<span class="n">response</span><span class="p">.</span><span class="n">status_code</span>  <span class="c1"># just to make sure we're getting a 200
</span>
<span class="c1"># Unzip to local folder
</span><span class="k">with</span> <span class="n">zipfile</span><span class="p">.</span><span class="n">ZipFile</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">))</span> <span class="k">as</span> <span class="n">zfile</span><span class="p">:</span>
  <span class="n">zfile</span><span class="p">.</span><span class="n">extractall</span><span class="p">(</span><span class="s">'.'</span><span class="p">)</span>   <span class="c1"># will create a local folder ml-latest
</span></code></pre></div></div>

<p>This will download a bunch of CSVs with movies info (tags for genre), ratings-movies mappings (what I care about) and some other ones. Metadata is taken from <a href="https://www.themoviedb.org/">The Movie Database</a>.</p>

<p>Note: I won’t report the whole code here but you can follow this notebook: <a href="https://colab.research.google.com/drive/1-R55oOpN1vzZD1t5zx2Q9AmbaZFHYzFI?usp=sharing">Colab</a> (view only, I may change things in time), <a href="https://github.com/martinapugliese/doodling-data-cards/blob/master/culture/movies_tv_shows/movielens_recomms.ipynb">Github</a> (versioned).</p>

<h3 id="basic-eda">Basic EDA</h3>

<p>There are ~33.8M ratings in the set, over ~83k movies, from ~330k unique users. The data, as per the README, contains ratings from the start of 1995 to July 2023.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/movielens-ratings-time.png" alt="A time series plot of the number of ratings in time where you see peaks around 2000, 2005, 2016 and 2020." />
  <figcaption>Ratings in time (month by month).</figcaption>
</figure>

<p>It is out of scope to look at whether peaks correspond to a larger number of movies released, but it’s nevertheless interesting to see that Movielens is still very much used by raters.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/movielens-hists.png" alt="Histogram plots, ratings per movie and per user, you can see a power-law behaviour with many having very few ratings and a few having many." />
  <figcaption>Histograms of the number of ratings per movie and per user (semilog scale). Unsurprisingly, it's power laws and yep, there's someone who created more than 30k ratings!</figcaption>
</figure>

<p>“The Shawshank Redemption” (released in 1994) has more than 122k ratings - obviously there’s inverse recency bias, in that older films will have been rated more simply because they’ve been around longer.</p>

<h3 id="curating-the-set">Curating the set</h3>

<p>For reasons mostly related to being able to run a model quickly and ideally without the need for a GPU, I do some cleansing/curation as the data is quite large:</p>
<ol>
  <li>remove movies with an avg score below 3.5, because I’m only interested in movies that are generally considered good. Of course, this will remove a lot and will bias the set;</li>
  <li>remove movies with fewer than 1000 ratings, to exclude less reliable data - there’s plenty of movies with a lot of ratings so I can afford this. I want to use the results as a filter so I’d rather lose potential recommendations than have bad ones;</li>
  <li>remove all ratings by users who rated fewer than 500 movies in total - this is to favour opinions from more committed raters who are power users of Movielens. It will bias the set because I may end up with a userbase that’s less representative of the general population, but I feel that’s fine, I’d rather have qualified opinions (the idea is that people who rate a lot are more likely to rate honestly).</li>
</ol>
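
<p>The three steps above can be sketched in pandas roughly as follows. This is a minimal sketch rather than the notebook’s exact code: it assumes a <code class="language-plaintext highlighter-rouge">ratings</code> dataframe with the <code class="language-plaintext highlighter-rouge">userId</code>, <code class="language-plaintext highlighter-rouge">movieId</code> and <code class="language-plaintext highlighter-rouge">rating</code> columns of the Movielens ratings CSV, the function name <code class="language-plaintext highlighter-rouge">curate</code> is made up for illustration, and it assumes that users’ rating counts are taken on the full set, before any movie is removed.</p>

```python
import pandas as pd


def curate(ratings, min_avg=3.5, min_movie_ratings=1000, min_user_ratings=500):
    """Return only the ratings that survive the three curation filters."""
    # Per-movie average score and number of ratings.
    per_movie = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
    # Steps 1 and 2: keep well-liked movies that have enough ratings.
    good = per_movie[per_movie["mean"] >= min_avg]
    good = good[good["count"] >= min_movie_ratings]
    # Step 3: keep committed raters, counted on the full set (an assumption).
    per_user = ratings.groupby("userId").size()
    committed = per_user[per_user >= min_user_ratings].index
    kept = ratings[ratings["movieId"].isin(good.index)]
    return kept[kept["userId"].isin(committed)]
```

<p>The thresholds are keyword arguments so that relaxing any of the filters later only means changing a number.</p>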

<p>Rating scores in Movielens range from 0.5 to 5, in half-star increments.</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/movielens-ratings-hist.png" alt="Histogram of the avg rating per movie before/after cleansing: the first image is much noisier, with a lot of peaks." />
  <figcaption>Histogram of the avg score per movie before and after operations 2 and 3 from above (1 not applied). You can see how it polishes up noise - movies with few ratings have unreliable avg scores.</figcaption>
</figure>

<p>I end up with 2145 movies with ~2.1M total ratings from ~3k users. These movie-ratings may still be a bit too many for a quick simple model though (read: I tried the below and it was very slow), so I decided to sample them in such a way as to preserve the fraction of ratings in each score, taking 10% of all ratings per movie.</p>
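
<p>One way to do this kind of sampling in pandas is to sample within each (movie, score) group. This is a sketch under assumptions, not necessarily what the notebook does: the function name <code class="language-plaintext highlighter-rouge">sample_ratings</code> and the seed are made up, and grouping by both <code class="language-plaintext highlighter-rouge">movieId</code> and <code class="language-plaintext highlighter-rouge">rating</code> is just one reading of “preserve the fraction of ratings in each score”.</p>

```python
import pandas as pd


def sample_ratings(ratings, frac=0.1, seed=42):
    """Take the same fraction of ratings from every (movie, score) group."""
    # Sampling inside each group keeps the score mix of every movie roughly
    # unchanged; very small groups can round down to zero sampled rows.
    groups = ratings.groupby(["movieId", "rating"])
    return groups.sample(frac=frac, random_state=seed)
```

<p>Fixing the seed makes the sample reproducible between runs, which helps when comparing models trained on it.</p>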

<h2 id="collaborative-filtering">Collaborative filtering</h2>

<p>I’ve then used <a href="https://docs.fast.ai/collab.html"><code class="language-plaintext highlighter-rouge">collab_learner</code></a> from fast.ai in the non-neural net modality, which performs gradient descent to learn embeddings for movies and users in the form of weight factors and biases over a latent space of chosen dimensionality. Embeddings for movies in this latent space encode features of the movies like their vibe, subgenres, quality. They are learned from the users’ ratings only and not informed by any metadata, which is the fascinating part of collaborative filtering. I won’t go into the details of this but again the fast.ai lesson is great.</p>

<p>With <code class="language-plaintext highlighter-rouge">df</code> being our dataframe of movie-ratings (the matrix), I run</p>

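<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fastai.collab</span> <span class="kn">import</span> <span class="n">CollabDataLoaders</span><span class="p">,</span> <span class="n">collab_learner</span>

<span class="c1"># build dataloaders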
</span><span class="n">dls</span> <span class="o">=</span> <span class="n">CollabDataLoaders</span><span class="p">.</span><span class="n">from_df</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> 
                                <span class="n">user_name</span><span class="o">=</span><span class="s">'userId'</span><span class="p">,</span> 
                                <span class="n">item_name</span><span class="o">=</span><span class="s">'title'</span><span class="p">,</span>
                                <span class="n">rating_name</span><span class="o">=</span><span class="s">'rating'</span><span class="p">,</span>
                                <span class="n">valid_pct</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
                                <span class="n">bs</span><span class="o">=</span><span class="mi">64</span><span class="p">)</span>

<span class="c1"># run a collab_learner (EmbeddingDotBias model) with 50 latent factors
</span><span class="n">learn</span> <span class="o">=</span> <span class="n">collab_learner</span><span class="p">(</span><span class="n">dls</span><span class="p">,</span> <span class="n">n_factors</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">y_range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">5.5</span><span class="p">),</span> <span class="n">use_nn</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>   <span class="c1"># 5.5 because 5 is max score in the data
</span>
<span class="c1"># and fit it
</span><span class="n">learn</span><span class="p">.</span><span class="n">fit_one_cycle</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mf">5e-3</span><span class="p">,</span> <span class="n">wd</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
</code></pre></div></div>

<p>The model learns decently, doesn’t overfit and is relatively quick: I can run an epoch in about 30s on CPU (I’ve been using Google Colab). I run it for 5 epochs which gives me decent stats.</p>

<p>The bias for a movie encodes information about how much that movie is generally liked; the bias for users encodes how much that user generally appreciates movies, so these are important terms to use for a general assessment of what I’ve got.</p>
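
<p>To make the role of the two biases concrete, here is a minimal sketch of how a dot-product-plus-bias model turns embeddings into a predicted score. It is plain Python for illustration, not fast.ai’s code: the function name and arguments are made up, and the sigmoid squashing into <code class="language-plaintext highlighter-rouge">y_range</code> reflects my understanding of what the model does with that parameter.</p>

```python
import math


def predict_rating(user_factors, movie_factors, user_bias, movie_bias,
                   y_range=(0.0, 5.5)):
    """Predicted score: dot product of the factors plus the two biases."""
    # How well this particular user matches this particular movie.
    dot = sum(u * m for u, m in zip(user_factors, movie_factors))
    # The biases shift the score regardless of the match.
    raw = dot + user_bias + movie_bias
    # Squash the raw value into the allowed range with a sigmoid.
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))
```

<p>With everything else fixed, a larger movie bias always pushes the prediction up, which is why it works as a measure of general appreciation.</p>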

<h3 id="results">Results</h3>

<p>I get these as the movies with the highest movie bias, that is, the most generally appreciated ones:</p>

<table>
  <thead>
    <tr>
      <th>movieId</th>
      <th>title</th>
      <th>genres</th>
      <th>year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>296</td>
      <td>Pulp Fiction</td>
      <td>Comedy,Crime,Drama,Thriller</td>
      <td>1994</td>
    </tr>
    <tr>
      <td>318</td>
      <td>Shawshank Redemption, The</td>
      <td>Crime,Drama</td>
      <td>1994</td>
    </tr>
    <tr>
      <td>593</td>
      <td>Silence of the Lambs, The</td>
      <td>Crime,Horror,Thriller</td>
      <td>1991</td>
    </tr>
    <tr>
      <td>858</td>
      <td>Godfather, The</td>
      <td>Crime,Drama</td>
      <td>1972</td>
    </tr>
    <tr>
      <td>2571</td>
      <td>Matrix, The</td>
      <td>Action,Sci-Fi,Thriller</td>
      <td>1999</td>
    </tr>
  </tbody>
</table>

<p>and these are the ones with the least appreciation (remember, these are still good movies overall because of how I curated the dataset):</p>

<table>
  <thead>
    <tr>
      <th>movieId</th>
      <th>title</th>
      <th>genres</th>
      <th>year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>334</td>
      <td>While You Were Sleeping</td>
      <td>Comedy,Romance</td>
      <td>1995</td>
    </tr>
    <tr>
      <td>1688</td>
      <td>Anastasia</td>
      <td>Adventure,Animation,Children,Drama,Musical</td>
      <td>1997</td>
    </tr>
    <tr>
      <td>5991</td>
      <td>Chicago</td>
      <td>Comedy,Crime,Drama,Musical</td>
      <td>2002</td>
    </tr>
    <tr>
      <td>49286</td>
      <td>Holiday, The</td>
      <td>Comedy,Romance</td>
      <td>2006</td>
    </tr>
    <tr>
      <td>98243</td>
      <td>Rise of the Guardians</td>
      <td>Adventure,Animation,Children,Fantasy,IMAX</td>
      <td>2012</td>
    </tr>
  </tbody>
</table>

<p>This sounds credible enough. Now I can use cosine similarity between movie embeddings to retrieve movies most similar to a given one.</p>
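
<p>The retrieval step can be sketched like this, with NumPy and a toy embedding matrix. The function name <code class="language-plaintext highlighter-rouge">most_similar</code> and its arguments are made up for illustration; in fast.ai I believe the trained movie factors would come from something like <code class="language-plaintext highlighter-rouge">learn.model.i_weight.weight</code>, but treat that as an assumption and check the notebook for the real thing.</p>

```python
import numpy as np


def most_similar(title, titles, embeddings, k=5):
    """Return the k titles whose embeddings are closest by cosine similarity."""
    idx = titles.index(title)
    # Normalise the rows so that a dot product equals the cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    # Never return the query movie itself.
    sims[idx] = -np.inf
    top = np.argsort(-sims)[:k]
    return [titles[i] for i in top]
```

<p>Cosine similarity only looks at the direction of the vectors, so a movie with large factors doesn’t dominate the results just because of its magnitude.</p>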

<p>For “Mary Poppins” (the 1964 original), I get (asking for 5):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- 'Sleeping Beauty (1959)'
- 'Lady and the Tramp (1955)'
- 'Cinderella (1950)'
- 'King and I, The (1956)'
- 'Beauty and the Beast (1991)'
</code></pre></div></div>

<p>which are all old animation/family ones, looks good.</p>

<p>For “Paddington” (2014, the first):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> - 'Paddington 2 (2017)'
 - 'Kubo and the Two Strings (2016)'
 - "Dr. Horrible's Sing-Along Blog (2008)"
 - 'Dope (2015)'
 - 'Bo Burnham: Inside (2021)'
</code></pre></div></div>

<p>it’s good it’s pulling the sequel - these should all be family-friendly except possibly the <a href="https://www.theguardian.com/stage/2022/may/31/bo-burnhams-inside-outtakes-netflix-standup-comedy">last one</a>, but the model may have picked up on the fun/quirky element and/or it’s just not always good.</p>

<p>For “Hidden Figures” (one of my favourite movies of all time):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> - 'Best Exotic Marigold Hotel, The (2011)'
 - 'Hacksaw Ridge (2016)'
 - 'Sully (2016)'
 - 'Impossible, The (Imposible, Lo) (2012)'
 - 'Saving Mr. Banks (2013)'
</code></pre></div></div>

<p>this is interesting: the first seems to be a comedy, the fourth is a disaster movie (not a fit, I’d say), the rest are biopics/historical ones which seem like a decent fit.</p>

<p>I could definitely improve the model (see below) but this is already a very good filtering system I can use as-is whenever I’m in the mood for something that would be hard to articulate in a search! Of course, these days I could <a href="https://chatgpt.com/share/6770296f-f86c-8010-b989-724b8a80f092">use ChatGPT</a> or the like for this purpose and chances are I’d get decent recommendations, but where would be the fun in that?</p>

<h3 id="using-pca">Using PCA</h3>

<p>Principal Component Analysis is an algebraic technique to reduce the dimensionality of large matrices while preserving most of the information and variability. It can be very useful even just to visualise data, like we’ll do here. This part is also piggy-backed from the fast.ai lesson.
I have vectors of movie embeddings in the 50-dimensional latent space I’ve asked for, but if I run PCA on them and plot the 2 components with the highest variance I get this plot, which is a good proxy to visually see similarity “clusters”:</p>

<figure class="responsive">
  <img src="https://martinapugliese.github.io/assets/posts_images/movielens-pca.png" alt="A scatter plot of just some of the movies after PCA." />
  <figcaption>PCA applied to the weights of the movie embeddings in order to visualise how movies "cluster" together, showing some notion of similarity to one another. Visualising the 2 highest-variance components for the movies with the highest number of ratings.</figcaption>
</figure>
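
<p>For reference, the projection behind a plot like this can be sketched with NumPy alone. This is a minimal sketch using an SVD, not the helper used in the fast.ai lesson; the function name <code class="language-plaintext highlighter-rouge">pca_2d</code> is made up and <code class="language-plaintext highlighter-rouge">embeddings</code> is assumed to be an array with one row per movie.</p>

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their 2 highest-variance principal components."""
    # PCA works on centred data, so remove the mean of every dimension first.
    centred = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T
```

<p>The result has one (x, y) pair per movie, ready to be passed to a scatter plot.</p>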

<p>The clustering isn’t amazing but you can see some patterns. For instance, the model seems to have picked up that “The Big Lebowski” and “Fargo” (both Coen brothers films) are similar; there is a cult sci-fi area (with occasional infiltrators like “Die Hard”); there’s an area of quirky crime ones and so on.</p>

<h2 id="conclusions-and-things-i-could-improve">Conclusions and things I could improve</h2>

<p>It wasn’t my aim to create and tweak a great model, but I’ve (quickly, thanks to fast.ai!) got my filter, which I can now use to get a list of movies similar to one I like with no effort. The movies are all highly rated by the Movielens community, which doesn’t necessarily mean they’d be highly rated on Rotten Tomatoes, but I’m sure the correlation will be high (it would be interesting to find cases where this isn’t true).
Every time Movielens produces a new data dump I can update my model.</p>

<p>A lot of things could be improved:</p>
<ul>
  <li>first of all, reduce the draconian size-reduction operations on the dataset I had to do to keep this fast in training</li>
  <li>train on GPU - I didn’t invest time or money into this and Google Colab isn’t reliable in the free tier because one moment you have a GPU and the next you don’t anymore, but with a little expenditure you can train more properly. Or, you can use the Kaggle free GPUs too.</li>
  <li>tie the results to Rotten Tomatoes (doesn’t have an API AFAIK though, so you’d have to scrape) as a further post-processing filter</li>
</ul>

<p>Hope you enjoyed this, chat to me on <a href="https://bsky.app/profile/martinapugliese.bsky.social">Bluesky</a>.</p>]]></content><author><name>Martina Pugliese</name></author><category term="tech" /><category term="python" /><category term="machine learning" /><category term="collaborative filtering" /><category term="recommender" /><category term="movies" /><summary type="html"><![CDATA[Piggy-backing an idea from fast.ai]]></summary></entry><entry><title type="html">LLMs, reliability &amp;amp; the scientific process</title><link href="https://martinapugliese.github.io/llms-reliability/" rel="alternate" type="text/html" title="LLMs, reliability &amp;amp; the scientific process" /><published>2024-12-23T00:00:00+00:00</published><updated>2024-12-23T00:00:00+00:00</updated><id>https://martinapugliese.github.io/llms-reliability</id><content type="html" xml:base="https://martinapugliese.github.io/llms-reliability/"><![CDATA[<p>In a recent post titled <a href="https://jfkirk.github.io/posts/trustworthiness-ai/">“Trustworthiness in the age of AI”</a> (highly recommended, go read it!), J. Kirk says</p>

<blockquote>
  <p>“Now, more often than not, I find myself explaining to colleagues, to bosses, to investors, and to customers that the AI <em>actually is wrong</em> even while it <em>sounds right</em>.”</p>
</blockquote>

<p>This is an interesting situation indeed. Working in data science/Machine Learning has always been a fun ride where rapid change is the norm and the best skill is adaptation, but generative AI has partially flipped the very nature of some workflows.</p>

<p>If you work with LLMs/VLMs you are likely in a situation where:</p>
<ol>
  <li>building <em>some</em> things is easier and quicker</li>
  <li>testing results for quality is shakier</li>
  <li>convincing others that what you built is valuable is <strong>much</strong> easier (that’s not a good thing)</li>
</ol>

<h2 id="building-some-things-is-easier">Building <em>some</em> things is easier</h2>

<p><em>Some</em> classes of problems and <em>some</em> instances of those classes can now be well solved with an LLM/VLM. Trivial examples are classifying a piece of text or an image (it won’t necessarily apply to all use cases though).
If you’re devising a cookery app and need to group recipes into cuisines, chances are an LLM can do that for you very well. I’ve asked ChatGPT to extract text from <a href="https://www.bbcgoodfood.com/recipes/pasta-salmon-peas">this BBC Good Food recipe</a>, create a Markdown for it and then tell me what kind of cuisine it is; the response was</p>

<p><em>“The “Pasta with Salmon &amp; Peas” recipe combines elements from both Italian and British cuisines. The use of pasta and crème fraîche reflects Italian culinary traditions, while the inclusion of salmon and peas is common in British cooking. Therefore, this dish can be considered a fusion of Italian and British cuisines.”</em></p>

<p>which, though I’m not sure about the crème fraîche part, I don’t overall disagree with.</p>

<p>However, I’m not sure what to call it (conventional?), but non-GenAI-based ML still has a lot to offer - in fact, I bet most of the ML around the business world is still linear regressions and random forests. Not to mention it’s often the best bang for your buck: the LLM fauna now has many cheap-enough options and the general thinking is that prices of established models will keep decreasing because that’s how basic economics works. But there’ll always be shinier, more capable models that are expensive due to the need to recover research and training spend. Also, LLMs can’t help with everything.</p>

<h2 id="testing-quality-is-shakier">Testing quality is shakier</h2>

<p>This is my main point of nervousness with LLMs. You usually test their behaviour on a bunch of data. If you’ve been careful, you selected the data to be representative of your domain, stratified, large enough and unbiased. You make this dataset into an eval set which you run periodically to check for robustness.
But the fact that results may look good doesn’t tell you how the LLM will behave <em>at scale</em>. Hallucinations, gibberish and garbage may arise, and there’s always the risk of <a href="https://arxiv.org/abs/2309.00770">discriminatory language</a>, something to be very wary of especially when building tooling that impacts on people’s lives. For some things, there’s <a href="https://github.com/guardrails-ai/gibberish_text">ways to cope</a>.</p>

<p>Confirmation bias looms above us humans in every circumstance but I wouldn’t be surprised to learn that we fall victim to it more easily when LLMs are involved. It can be easier to see goodness when responses are in the form of human speech rather than categories and numbers.</p>

<p>Further, similarly to what happens with clinical trials for drugs, side effects may be rare enough to only appear when the product is out in the population at large. Things like these are not exclusive to LLMs, but the chance for non-reproducibility is higher. The only way<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> is to monitor, build alerts, trace and fix when possible.</p>

<p>It doesn’t help that for closed LLM models we don’t know the details of their structure. The scientific process is in part violated: you don’t control the tool, you can’t inspect it, you can only see what it does ex-post and infer what it is capable of.</p>

<h2 id="convincing-stakeholders-is-much-easier">Convincing “stakeholders” is much easier</h2>

<p>Model interpretability has always been an important part of ML/stats work. In fact, it’s a field of its own.
Back in the pre-LLM days, when you as a data/ML scientist built an algorithm, other people (the “stakeholders”) may have had a skeptical attitude towards it unless they had a sense of what determined its results. They wanted to know why the churn model predicted Mr. Smith will cancel the subscription in 3 months, or what was the driver behind that forecasted growth rate of app downloads next quarter. Often, their skepticism was well meant and actually useful to the data/ML team, as a large part of building good ML tooling is being able to communicate its value. It is hard to trust a system which gives you back arid numbers and nothing else.
The job of the data team is to build something good enough that is wrong a small fraction of the time and that, ideally, when it’s wrong it doesn’t screw things up, and to communicate this in a friendly, factual and metrics-oriented way.</p>

<p>Fast-forward to now and the same people, as well as (almost) everyone else, seem to just trust LLMs. I think there’s two reasons for this:</p>
<ol>
  <li>everyone can interrogate LLMs themselves, non-tech teams don’t feel at the mercy of the data/ML team anymore and this is empowering;</li>
  <li>LLMs talk to you in natural, polite and human-like ways -  after all, <a href="https://bigthink.com/high-culture/7-of-the-greatest-public-speakers-in-history/">who wouldn’t trust a competent orator?</a></li>
</ol>

<p>It is hard to distrust a system that gives you back nicely packaged natural-language explanations for its choices.</p>

<p>The data/ML team now has the added job of restoring a healthy dose of the skepticism we lost, so that everyone can do better than accepting things at face value.</p>

<p>The script-flipping means we should now be even more focused on extensive testing, even more alert about possible bad output, even more solid in our skepticism-bearing and question-asking.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Somebody give me better ideas please. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Martina Pugliese</name></author><category term="GenAI" /><category term="llm" /><category term="science" /><summary type="html"><![CDATA[Are we becoming too lenient?]]></summary></entry></feed>