Technical tasks for data science interviews

Note: the content here applies to hiring for junior positions.

When interviewing for a tech position, you typically have to pass a “technical” stage, the part where your domain knowledge and ability with technical work are meant to be evaluated. This usually comes after an initial screening, which discusses the general fit and introduces you to the process while chatting about who you are and what you have done.

My opinion on technical assessments for a data science interview is that they should be emailed beforehand and carried out as take-home exercises, rather than on the fly, and then discussed at the interview. This is the approach I have personally used to hire people, and I have always found it helpful. Note that the advice I express here is solely about data science interviews (and non-senior ones), not tech ones in general; data science has its peculiarities and frailties and the hiring process reflects that. Plus, I am not qualified enough to talk about other areas of tech!

A data science technical task is never something to be tackled on the go: it deserves thinking and a process. The interviewer should not be testing knowledge of notions or how well people recall the API of a library. They should test the reasoning that leads from framing a question to finding an answer with data, and then to evaluating its quality, possibly repeating the cycle - something that cannot be achieved when working under clock-ticking stress. Data science is nothing other than a problem-solving exercise: the aim of the technical task has to be to assess the ability to be creative, reason critically and self-judge results. Last but not least, storytelling skills will naturally come into play with this procedure, because the candidate has to present all the findings in a coherent form.

A good data task should be open-ended and non-prescriptive: it should provide a dataset and a broad question to tackle, and ask for analysis and modelling. It should set no strict choice of libraries and packages (unless the job requires the candidate to be well versed in something in particular) and it should give no indication as to what methods to use. An example could be something along the lines of: “given this dataset of house features and prices, try to predict the price of a new property with any method you find suitable, analyse the quality of the data and feel free to retrieve other sources of data to aid/complement the resolution if you wish”. The candidate then has the time and calm to reflect, address the question with all the artillery known to them and package it up in the best way (I have personally always asked for a submission in the form of a well commented and organised Jupyter notebook).

The interviewer will review the submission and, if it is good, the interview will focus on discussing it, digging into the details of the choices made, the methods used and the quality of the results presented, as well as the blockers and, in many cases, the data weaknesses. It is crucial that the candidate assesses how good the provided data is for the goal, so “bad” results (when data is not informative, too incomplete or simply of debatable quality) are not failures, they are analyses - this is an often overlooked part that an interview should encourage!

A common objection to this methodology is: “but people can cheat!”. Let’s see - what does cheating mean? That they would get someone else to do the task in their place? This is a risk the interviewer has to run: the subsequent interview would clearly surface it, there would be an awkward five minutes, and then it would all be over. A bit of a waste of time for the interviewer, but a risk worth taking.

If cheating means, instead, that candidates would make extensive use of Google to research tools and find answers or help, I think that is a welcome thing and something to actually encourage. In data science, there is no way you can know everything and you simply cannot remember the APIs of every library. The ability to Google effectively and learn quickly is a skill - in data science it is probably the one you will use the most! Again, the interview will explore this and make it clear where the candidate was less familiar with concepts and tools, and the outcome will follow accordingly. I would be very happy to hire someone who had never used a certain method but researched it and attempted to use it to solve the problem, developing a narrative and showing that they can learn efficiently. One of the nicest things I remember from a recent interview is a candidate writing back to thank me for the opportunity - she said she had never worked with JSON datasets before (the data was supplied in that format) and was happy to have been given the chance to practice dealing with them.

It is very important to make clear that the data task should not be set to take the candidate more than a few hours. People have lives, families to take care of and potentially other interviews to go through: the interview process should always be respectful of this (and when it is not, that speaks volumes about the company culture). For candidates: the worst tasks are those that quite subtly ask you to solve a problem in the realm the company operates in, which smells like a request for free work and ideas (and it does happen, unfortunately). For interviewers: always frame the data task in an area different from your industry. Even when you are not intending to exploit people, it will look better and will also show some imagination on your part.

Is the time taken to work on the task an important variable in assessing the candidate’s expertise, from the interviewer’s perspective? Yes and no. Obviously, every company wants results to be delivered as quickly as possible, but this should not compromise quality. Data science is still largely a research-like endeavour, so it requires time for deep thinking and time to fail and retry. However, a good data scientist understands when something needs a sophisticated approach and when a quick heuristic/Fermi calculation would do instead. It is paramount to know how to strike the right balance between depth of work and speed of delivery - this is a topic out of scope here, but in the context of tech assessments it will be the interviewer’s job to judge this based on the submission and how the candidate discusses it at interview time.

Setting up the task should not take the interviewer long either, but some work is required to conceive a fair and balanced one. First of all, a dataset must be provided to start with: a link to a public one (from Google Datasets, for example) will do, since the data normally has to be public anyway. It is good practice (and fair) for the interviewer to also have a look at the data and evaluate its quality for the goal: an ideal set is large, varied and not sparse. Even with weaknesses that the candidate is meant to catch, the dataset should be usable.
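
As a rough illustration of that quick look at the data, here is a minimal sketch in plain Python. It is only one possible way to check “large, varied and not sparse”, not a prescribed method: the helper name `profile_records`, the choice to treat `None` and empty strings as missing, and the toy house records are all assumptions made up for this example.

```python
from collections import Counter


def profile_records(records):
    """Summarise size, sparsity and variety of a list of dict records.

    Hypothetical helper, not from any library: `records` is assumed to be
    a list of flat dicts, for instance as loaded from a JSON file.
    """
    n_rows = len(records)
    # Collect every column name seen in any record.
    columns = sorted({key for row in records for key in row})
    report = {"n_rows": n_rows, "columns": {}}
    for col in columns:
        values = [row.get(col) for row in records]
        # Assumption: None and empty strings both count as missing.
        present = [v for v in values if v is not None and v != ""]
        counts = Counter(map(str, present))
        report["columns"][col] = {
            # Share of rows with no usable value (sparsity).
            "missing_fraction": 1 - len(present) / n_rows if n_rows else 1.0,
            # Number of different values observed (variety).
            "n_distinct": len(counts),
        }
    return report


# Toy data invented for illustration only.
houses = [
    {"rooms": 3, "area_m2": 80, "price": 250000},
    {"rooms": 2, "area_m2": None, "price": 180000},
    {"rooms": 3, "area_m2": 95, "price": 310000},
    {"rooms": 4, "price": 400000},
]

report = profile_records(houses)
print("rows:", report["n_rows"])
for name, stats in report["columns"].items():
    print(name, stats)
```

A column with a high missing fraction or very few distinct values is not necessarily a reason to discard the dataset, since the candidate is meant to catch such weaknesses, but it is worth knowing about before the task is sent out.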

A final thought: do tech tasks turn people away, generally speaking (fear, impostor syndrome, unwillingness to put in hours on something that may not lead to a practical outcome)? I wish there were a way to avoid doing them, but there is not - the company needs to screen the candidates’ technical fit. I think the approach proposed here is the least painful one, and it is fair. For the candidate who does not get the job, it will not be a waste of time: there is not only job-seeking experience acquired along the way, but also data science experience, and the chance to learn more about the way one’s brain processes information too.