My summer of data science

This summer (it’s not technically over yet, but the warm part of it is, though someone could argue that we never see summer in Edinburgh anyway) I’ve participated in a project called “Summer of Data Science”. It was launched by Renee Teate, the animator behind the community at Becoming a Data Scientist, which promotes the sharing of resources around the field and helps lots of people around the world who are interested in it. I’m pretty sure anyone working in this field knows her and her blog/website quite well, maybe mainly because of the podcasts she does.

Anyway, every year they launch this idea of doing something DS-related and sharing the progress on Twitter under the hashtag #SoDS, so I thought I’d make my contribution.

I had already started dumping all the stuff I know/am learning/am refreshing into a GitHub repo containing Jupyter notebooks. The idea is quite simple: I had lots of material from when I was studying at university, plus more from recent times when I sat down to learn some things on my own, and it was quickly growing too untidy. I’ve always been the type that needs to do a brain dump every time I learn stuff, to get it organised my own way. Back in the day it was paper notebooks, now it’s digital. In addition, I wanted to add to and complement topics all the time with new things, so paper snippets started stacking up one on top of the other, which didn’t really help the readability and overall experience.

I chose Jupyter because it allows using markdown and naturally supports LaTeX for mathematics, but it also lets you write code for demonstrations and plotting, so it seemed quite apt for the job. Initially, the idea was a GitBook, which however seemed quite clunky when it came to combining all these elements. Furthermore, I didn’t want to write a book, as that seems too much of an academic endeavour, which isn’t quite the point. After a few iterations of looking for the best tool, I think I’m quite happy with the choice.

There is a notebook for each topic of interest, and the file structure is set up so that each notebook links to the related topics in other notebooks. This way I can easily improve a notebook, or change it completely, the next time I want to add something. Also, the references at the bottom of each notebook tell me where to look for more detailed and/or better material (including the research papers for algorithms and such) on each topic, which matters as a way to organise the relevant literature.
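To give an idea of how that kind of cross-referencing can be checked, here is a minimal sketch, not the actual tooling of the project. It assumes the links between notebooks are written as ordinary markdown links to `.ipynb` files inside markdown cells; the function name `notebook_links` is made up for illustration. It only relies on the fact that a notebook file is plain JSON.

```python
import json
import re
from pathlib import Path

# Assumption: cross-references are markdown links such as
# [Regression](../ml/regression.ipynb) or [PCA](pca.ipynb#intro).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.ipynb)(?:#[^)]*)?\)")


def notebook_links(notebook_path):
    """Return the .ipynb targets linked from a notebook's markdown cells.

    Hypothetical helper: reads the notebook as plain JSON.
    """
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    targets = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # only markdown cells carry prose links
        source = cell.get("source", "")
        # The cell source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        targets.extend(LINK_RE.findall(source))
    return targets
```

Running this over every notebook in the repo would give a rough map of which topics point to which, and a quick way to spot links to notebooks that have since been renamed.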

I called the project Tales of Science & Data as a sort of homage to (or mockery of) Poe’s “Tales of Mystery and Imagination”, as this Data Science field is a mix of several things. It can be rather confusing and, all in all, requires good imagination in terms of business acumen to be actually useful to an organisation. I’m not completely sold on this title though, and I’d like to change it, but I haven’t found a better-suited one yet. So it stays like that until I do.

I started the project before the summer and still haven’t finished (and never will, as it’s meant to grow with me and with the time I can dedicate to it), but I think I got it to a good point. There are now sections on several topics which intersect, one way or another, with Data Science. Some are quite clean and relatively good (though nothing is comprehensive, as clearly stated in the disclaimer!), some are just shaping up. The “summer of data science” idea gave me a way to commit a bit more through the months from June to now, and I occasionally posted my progress under the hashtag on Twitter. I’ve mainly devoted some time in the evenings and at weekends (alongside other things I am trying to carry forward), so not really much time, but hey, it’s all good.

It’s been a good journey so far and I’m looking forward to continuing it for an “all year round of data science”.