Things you should know before you become a Data Scientist

A friend of mine called me the other day to ask about the questions that really matter. He was thinking about switching careers to Data Science. I’ve been working on Machine Learning projects for about 5 years now, and I currently work as a senior Data Scientist at an Indian start-up. After about 20 minutes of discussion, I realised that our chat could help a lot of folks who’re thinking of becoming a Data Scientist, ML Engineer or equivalent.

So, here we are. Let’s begin. I’ll try to keep it to the point, so it’s in a simple format of questions and answers.

1. Can you give me a summary of what you actually do as a Data Scientist?

Answer. There are 4 sides to the work of a Data Scientist.

1. Research (5%): Regardless of whether you’re a Data Scientist (more hardcore) or an ML engineer, some part of your time will go into reading research papers in the field. Mostly these papers will be related to the problem you’re solving for the business. This part is mandatory and expected of the role.

2. Data Pre-processing (55%): The key component of any Data Science project is neither the data scientist nor the ML algorithm. It is the data. You need proper data for your ML algorithm to perform well. The data available to you is usually filled with imperfections such as missing rows, missing values, or incorrect labels.
There are a bunch of steps that a Data Scientist usually takes to transform the data before feeding it to an ML algorithm. These are called data pre-processing steps. Essentially, you’ll be writing scripts in Python, R or some other language to analyse and fix your data. And this step takes up the majority of your time!
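To make the pre-processing idea concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the field names (`age`, `income`, `label`), the sample rows, and the two cleaning choices (dropping rows with an unusable label, filling missing numbers with the column median). In a real project you would more likely reach for a library such as pandas, and the right cleaning rules depend entirely on your data.

```python
from statistics import median

# Hypothetical raw records; the field names and values are invented for this sketch.
RAW_ROWS = [
    {"age": 34, "income": 52000, "label": "Yes"},
    {"age": None, "income": 61000, "label": "no"},    # missing value
    {"age": 45, "income": None, "label": " YES "},    # missing value, messy label
    {"age": 29, "income": 48000, "label": None},      # missing label
    {"age": 51, "income": 75000, "label": "maybe"},   # label outside the known set
]

VALID_LABELS = {"yes", "no"}        # assumption: a binary labelling scheme
NUMERIC_FIELDS = ("age", "income")


def clean_rows(rows):
    """Drop rows with unusable labels, then fill missing numbers with the median."""
    # Step 1: normalise label text and keep only rows whose label is recognised.
    kept = []
    for row in rows:
        label = row.get("label")
        if label is None:
            continue
        label = label.strip().lower()
        if label not in VALID_LABELS:
            continue
        kept.append({**row, "label": label})  # copy, so the input stays untouched

    # Step 2: impute missing numeric values with the median of the observed ones.
    for field in NUMERIC_FIELDS:
        observed = [r[field] for r in kept if r[field] is not None]
        fill = median(observed) if observed else 0
        for r in kept:
            if r[field] is None:
                r[field] = fill
    return kept


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
```

Even in this toy version, most of the code is about deciding what counts as "bad" data and what to do with it, which is a fair picture of where the 55% goes.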

3. ML Modelling (25%): This is the “cool” part of the job where you write an algorithm that learns to solve the problem on its own. This is the part that makes Machine Learning Machine Learning.
You’re usually coding this in Python, using one of the frameworks: scikit-learn, Keras, TensorFlow, or PyTorch. Once you’re done coding it, you feed this algorithm the data you pre-processed in the earlier step. As your algorithm eats this data, it trains to solve the problem.
The process here is iterative, though. The first ML algorithm you code might not work well, so you try another one. And maybe one after that too. Some find this iteration frustrating, while others like it. There are lots of knobs and dials here that you keep turning until you have an ML algorithm that works well once data is fed to it.
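Here is a rough sketch of what "turning a knob" can look like. It is plain Python with no framework, and it uses a tiny k-nearest-neighbours classifier where the knob is `k`, the number of neighbours that vote. The data points, the labels and the candidate values of `k` are all invented for illustration. In practice you would use one of the frameworks above and also keep a separate test set for the final check.

```python
from collections import Counter

# Made-up 2-D points in two classes; the last training point is deliberately
# mislabelled noise sitting inside the "b" cluster.
TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"),
    ((2.95, 3.0), "a"),
]
VALIDATION = [
    ((1.1, 0.9), "a"), ((0.9, 1.2), "a"),
    ((2.9, 3.0), "b"), ((3.1, 3.1), "b"),
]


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    def squared_distance(item):
        (x, y), _ = item
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points that are predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def pick_best_k(train, validation, candidates):
    """Try each candidate k and return the best one along with all scores."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # on a tie, the earliest candidate wins
    return best_k, scores


if __name__ == "__main__":
    best_k, scores = pick_best_k(TRAIN, VALIDATION, (1, 3, 5))
    print(scores, "best k:", best_k)
```

With `k = 1` the single noisy point gets to decide on its own and one validation point is misclassified; with a larger `k` the vote smooths that out. Real projects have many more dials than this, but the loop of "try, measure, adjust" is the same.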

4. Communication (10%): Look, unless you work at one of the bigger companies, communication is going to be a very important aspect of your work, because with all the hype around Machine Learning, miracles will be expected of you. I use the term ‘miracles’ not because you’ll be solving problems that are otherwise unsolved, but because most people around you, including engineers, product managers and analysts, will not really understand what you do.
At the end of the day, you’re in a field that is super recent and involves a lot of math and programming. So, for most people, Data Science is rocket science. This creates a problem: since many folks won’t understand the nitty-gritty of your work, communicating your challenges, tasks and timeline estimates in a way that seems justified becomes difficult.
Moreover, Data Science has the word Science in it for a reason. It is experimental in nature. Nothing is solved or known a priori. This is very different from the engineering workflow, where once the design has been discussed, you are sure it will work. In Data Science, once you’ve figured out an approach, you try it, but it may not work. Many of the business folks who will assign and expect work from you have not yet adjusted their mindset to the uncertainties of the field. The only way I’ve seen data scientists deal with this well is through good communication with stakeholders.

2. What do you wish someone had told you before you decided to take up the field full-time?

Answer. I’ve already mentioned how there’s a lack of understanding of how the field of data science works. The uncertainty and its reasons are not thoroughly understood, which can sometimes lead to unrealistic management expectations.

This can cause frustration at times, but it falls to you to spread awareness about the nature of the work and improve people’s understanding of how the field works. So, don’t come in expecting your CEO or your product manager to be okay when your sprint tasks spill over because you didn’t get results in the first iteration. You have to fix that mindset, slowly and gradually.

Be very careful about promising things. Promise only the next one or two steps you’re certain about, not the entire pipeline.

3. A thing that you really like, and dislike about your profession?

Like:
Your work resembles that of a scientist. There is a lot of experimentation and a lot of uncertainty, but at the same time a genuine promise and hope of solving a complex problem that is just ridiculously hard to solve via programming.

Dislike:
A lot of times, after countless iterations of data pre-processing, research, and tuning the dials and knobs of ML modelling, you finally get to an ML model that works well and solves the problem, and then the project just gets paused indefinitely.
This causes some data scientists to feel dejected, which is understandable. It happens because Data Science is often an R&D team, and not everything an R&D team builds goes into production. You work on POCs (Proofs of Concept), showcase them, and if there are clients and revenue streams for them, you move ahead. Otherwise, the project just gets paused, and mostly you never open that codebase again.

4. Common misconceptions regarding the work of a data scientist?

Answer. Let me bullet-point this answer, else it can get too long.

  • People assume they’ll get to work on state-of-the-art models from day 1. That doesn’t happen. You’re in a company, not in a Google Research team. There’s limited funding and limited time. Hence you usually try something that is more deterministic, more basic and far away from the state of the art. Even today, most companies reach for OpenCV as their first approach to computer vision problems instead of Deep Learning.
  • Most of the time is spent working and iterating on ML models. Wrong! The amount of time spent on miscellaneous tasks like data clean-up is highly underestimated.
  • The role ‘Data Scientist’ means you’ll be working on data science projects. Wrong! I’ve seen recruiters throw in this title solely to attract more candidates when it was ultimately a Data Analyst role. So, read the job description carefully, my friend.

5. What kind of people excel in this field?

Answer. Honestly, I believe that if you still feel excited after reading all of the above, you should be good to enter the field. It is this determination and passion that matters the most, and it will get you past the problems of programming, algebra, statistics and communication if they ever block your path.
However, candidates with a solid math background (rooted in statistics and linear algebra) and sound programming fundamentals do have an edge, as the field is practically tailor-made for them.
Willingness to learn from and implement research papers helps if you wish to stay updated with the hottest field of the decade.

— —

Well, that was it. I hope there was something useful in there for you. If you’ve made it this far, please help me by liking/applauding/sharing this article, as this is one of my firsts. Criticism of any sort is also welcome.

I occasionally write poetry, thoughts and tech blogs when there are thought surges. Other times, I code, travel, build stuff with AI and enjoy art.