Data scientists are in high demand. Data is generated everywhere: from smartphones and TVs to car sensors, industrial machines, and cameras. And the amount of data will rise tremendously (Industry 4.0, the Internet of Things, etc.). Some companies already invest money and people; others wait and observe. I am convinced we will see many more technologies generating data and then revolutionizing entire industries. Nevertheless, unprocessed data on its own has no value; it only generates costs for connectivity (sensors to servers), storage, and manpower. For many firms, the key contribution of a data analyst or data scientist is generating business insights from data. I will therefore focus on the business side.
The Skills a Data Scientist Needs
So, what are the key skills of a data juggler? In short, they are:
- Finding valuable use cases for the firm.
- Understanding data/statistics/mathematics.
- Deriving and communicating business insights.
As you can see, I put emphasis on the business side, because that is where most data scientists will work. Only a few will develop new estimators, libraries, etc. and work for Google, Facebook, and so on. Most of us will work for companies and, let's be honest, will help maximize the profit of our employer. Hence, you need to advertise your skills and the added value of your work (visualizations, analyses, estimators, etc.). Beyond your personal preferences, the analysis most valuable for the firm will have the most impact and visibility.
How do you grow here? Basically, think about the firm you want to apply to and the surplus you can generate using data. Be realistic: not everything can be observed, and brick-and-mortar industries in particular have trouble gathering data automatically. However, most companies do not base their management decisions on data. Your skills and expertise will be welcome in almost every department, because hard numbers overrule gut feelings and are easier for your boss to communicate upstream.
This is where your main skills lie, and I want to talk about all three points. After you have identified your business case, you need to think about the correct data set to apply your fancy estimators, neural nets, etc. to. But be cautious! Data selection and quality determine the reliability of your recommendations. When you want to analyse the churn rate of different subgroups, make sure that everyone could have ended up in each subgroup. If there is natural selection into one of the groups, your statistics become invalid.
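To make the selection problem concrete, here is a minimal simulated sketch (all numbers and variable names are invented for illustration): a latent trait drives both which subgroup a customer ends up in and whether they churn, so a naive churn comparison between the groups is biased even though group membership itself has no causal effect in this simulation.

```python
import numpy as np

rng = np.random.default_rng(0)  # made-up example data
n = 100_000

# A latent engagement trait drives BOTH plan choice and churn:
engagement = rng.normal(size=n)
premium = engagement + rng.normal(size=n) > 1.0   # engaged users self-select into premium
churn_prob = 1 / (1 + np.exp(2 * engagement))     # engaged users churn less
churned = rng.random(n) < churn_prob

# A naive comparison attributes the engagement effect to the plan:
naive_gap = churned[~premium].mean() - churned[premium].mean()
print(f"naive churn gap (basic minus premium): {naive_gap:.2f}")
```

The gap comes out large and positive even though the plan never enters the churn model; the selection into "premium" does all the work.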
How do you improve your skills in data thinking? Reflecting on your analysis beforehand helps a lot. This is linked to statistics, but think about selection biases, omitted variables, and the independence of your variables. Working with individual-level data mostly means you cannot infer a causal relationship between your predictors and your target variable (see my discussion here). You therefore have to be cautious when it comes to interpretation and management advice.
Statistics and mathematics go hand in hand; this is where the magic really happens. Why? Because anyone can use a Python script to score a model, apply a cluster analysis, and so on. Knowing what happens under the hood is what matters. Do you need to center your variables? Do you impute missing values? If so, with the mean or the median? How do different machine learning methods perform if you have only a few positive (success) cases in your data set? Which evaluation metric is appropriate for the method you use? These are the questions where you dig deep and earn the "scientist" in data scientist. Understanding fundamental concepts like matrices, probability (the frequentist and Bayesian approaches), and different optimization methods is a must.
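Two of these questions can be illustrated in a few lines. The sketch below (with made-up numbers) shows why the mean/median choice matters for a skewed feature, and why raw accuracy says little when positives are rare; `SimpleImputer` is scikit-learn's standard imputation helper.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up skewed feature (e.g. customer spend) with one missing value:
X = np.array([[10.0], [12.0], [11.0], [9.0], [500.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit_transform(X)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(X)[-1, 0]
print(mean_fill)    # 108.4 (dragged up by the outlier)
print(median_fill)  # 11.0 (robust to the outlier)

# With only 2% positives, a useless "always predict negative" model
# still reaches 98% accuracy, so accuracy alone is misleading here:
y = np.array([1] * 2 + [0] * 98)
accuracy_of_trivial_model = (y == 0).mean()
print(accuracy_of_trivial_model)  # 0.98
```

Whether the mean or the median is the right fill value, and whether precision/recall should replace accuracy, depends on the data and the business question; the point is that the defaults are a choice, not a given.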
So how do you improve your skills in statistics/mathematics? The simple answer is: study statistics, mathematics, or a social science with a statistics focus. But the field is quite new, and combining all the skills above is uncommon in most degree programs, so self-learning is the way to go. There are plenty of online courses helping you improve in statistics, math, and machine learning. I can recommend the courses on Coursera. One example is the course by Andrew Ng, the godfather of smart algorithms if you want to put it like that. Another source is the scikit-learn website, which has an extensive usage and recommendation section for every method. Read books like Python Machine Learning, 1st Edition, or The Elements of Statistical Learning, or start with books emphasising the language used, like Practical Data Science with R. There are many other very good books out there. Reading a lot and applying your knowledge to your real-world data set helps. This brings me to the programming skill.
The skill most people think of when it comes to data science is programming. As I said before, I personally value the statistical part even more, but programming is the second most important skill for a data scientist. And trust me, there are huge differences between good and average programmers (some people talk about the 10x programmer). Utilizing your PC/servers correctly is key in terms of efficiency. Good programmers use advanced libraries and find smart solutions to "simple" problems. I learn something new every day when browsing the web for efficient answers. Questioning the way you did something in the past, just because it worked, is important here.
Where to find resources, you may ask? Well, as mentioned above, Coursera, Udemy and other e-learning platforms, as well as books and even YouTube, are valuable resources. Make sure the speed of the course suits your needs, rely on the recommendations, and read the reviews. I took the R programming course, and it got me to switch from Stata 11 to R, which changed my perspective on working with data entirely. I also learned that working on your own projects helps you the most. Reading about subsetting and other basic concepts will not make you remember them when you need them. Your own projects, on the other hand, engage you far more and help you grow because of your intrinsic motivation.
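Subsetting is a good example of a basic that only sticks once you use it on your own data. Here is a minimal Python sketch with pandas (the toy data frame and column names are invented for illustration; R's `data.frame` indexing works analogously):

```python
import pandas as pd

# Invented toy data: customer segment and whether the customer churned
df = pd.DataFrame({"segment": ["basic", "premium", "basic", "premium"],
                   "churned": [1, 0, 0, 0]})

# Row subset by condition, column subset by label:
basic_churn = df.loc[df["segment"] == "basic", "churned"]
print(basic_churn.mean())  # 0.5
```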
Wait, which languages should I learn? Well, this is up to you, but I recommend R and Python as the go-to languages for data science. This could change in the future, of course, but as of 2017 you can do nothing wrong by learning these two. Google supports Python and its library TensorFlow intensively, and the community grows every day. R has been around for a long time and is backed by a scientific community. A lot of social scientists around the world use it and write libraries (packages) for it. You can find everything here, from OLS to the most advanced Bayesian model averaging methods. Most people will claim that Python is faster, and they are probably right, but your choice depends on your objective and your infrastructure.
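OLS, the starting point just mentioned, fits in a few lines in either language. A Python sketch on simulated data (the true coefficients, 2 and 3, are made up for the example; in R the equivalent one-liner is `lm(y ~ x)`):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)  # true intercept 2, slope 3

# Design matrix with an intercept column, then ordinary least squares:
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to [2.0, 3.0]
```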
Derive and Communicate Business Insights
I know, I know, the fun lies in points two and three, but only very few firms pay you just for that. Most companies want you to deliver business insights quickly. And of course you are the one who explains everything, so be prepared for generalizations and explanations of complex approaches. With all this, do not forget that communicating the results and their implications for the business is the most important thing to do. Nobody is interested in the problems you faced and overcame during the process (except your DS colleagues, maybe). Remember: a nice visualization is gold when talking to business people, because they can show it to their bosses. Therefore, use efficient tools to generate vivid visualizations/statistics. I can recommend Tableau, which is extremely efficient even on small machines and very fast to learn and customize. Plus, it is interactive; this dashboard approach impresses most managers more than any neural net. You may prefer open-source tools like Plotly or Bokeh, but as I said, efficiency is key. Programming every chart can take quite a while if the task is not repetitive (like a weekly report).
- Before you even start to analyse your data lake, talk to your client about his or her request. Usually, you have much more knowledge about the data and can therefore clarify things in advance! This can save you a lot of repetitive work.
- Try to make your work as reproducible as possible and comment your code! Your colleagues may need to conduct a second analysis using your code. Or you yourself may need to conduct a second, deeper analysis with it. So try to reduce the manual adjustments in advance. I have seen many times how a one-time analysis became a regular report.
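One way to reduce manual adjustments is to pull everything that changes between runs into explicit parameters instead of hard-coding it. A minimal Python sketch (the report, the field names, and the toy data are all invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical report parameters; keeping them out of the analysis code
# means a one-time analysis can be re-run for a new week or segment
# without editing the logic itself.
@dataclass
class ReportConfig:
    segment: str
    start_week: int
    end_week: int

def weekly_churn_report(rows, cfg: ReportConfig):
    """Average churn for one segment over a week range.

    rows: iterable of (segment, week, churned) tuples."""
    selected = [c for s, w, c in rows
                if s == cfg.segment and cfg.start_week <= w <= cfg.end_week]
    return sum(selected) / len(selected) if selected else None

rows = [("basic", 1, 1), ("basic", 1, 0), ("basic", 2, 0), ("premium", 1, 0)]
rate = weekly_churn_report(rows, ReportConfig("basic", 1, 2))
print(rate)  # 1/3 of basic customers in weeks 1-2 churned
```

When the one-time analysis turns into a weekly report, only the `ReportConfig` values change.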
I have described the most important skills of a DS working for a firm. Your role can, of course, differ. It is also possible that you will never talk to a business person, but having the skill to derive business insights helps you when applying for jobs. Some DSs will work more closely with the IT department and conduct jobs related to data quality and ETL processes; this can also be an important part of your position. Nevertheless, being patient and motivated are the two main characteristics of a DS.