I studied statistical models as an econometrician, which means I was trained to care a lot about assumptions and potential selection biases in the data. The ultimate aim was to find causation. In applied data science, people emphasize predicting class labels or values for out-of-sample observations (similar to time series econometrics), but they tend to focus much less on assumptions and causation. For me this is the most fundamental distinction between the two, but the devil is in the details, as you can see below. It is also an extremely dangerous one: forgetting about assumptions while still arguing with causal effects can lead to seriously wrong decisions.
A second, more minor difference concerns naming conventions. People from different fields have their own terminology, but mostly they talk about the same things as you do; for example, when people talk about scoring they simply want to predict out-of-sample observations. For me, "individual observation" was the universal term for a single row, meaning one observation. Some people call this a one-row data set, which is somewhat confusing. Other terms, such as boosting and bagging, mean the same thing in both fields.
Most economists are used to well-prepared data sets (at least as students). In microeconometric applications you often work with data collected from very different sources. Depending on your firm's size, it could be your job to merge these different data types. In practice, companies store their huge amounts of data in relational or non-relational databases, so when you want to estimate a model, you first need to prepare your data set. This means writing quite a few SQL queries to access the data at hand and collect all the information about your customers, products, etc. Depending on the company's database, this can consume a significant amount of your time.
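To make the data-preparation step concrete, here is a minimal sketch of pulling modelling data out of a relational database. The schema, table names, and figures are made up for illustration; in practice you would connect to your company's database rather than an in-memory SQLite instance.

```python
import sqlite3

# Hypothetical schema: customer attributes and purchase records live in
# separate tables, as they typically would in a relational database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, age INTEGER, city TEXT);
    CREATE TABLE purchases (customer_id INTEGER, product TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 34, 'Berlin'), (2, 51, 'Hamburg');
    INSERT INTO purchases VALUES (1, 'book', 12.99), (1, 'laptop', 899.0), (2, 'book', 12.99);
""")

# Join and aggregate: one row per customer with their attributes and total
# spend -- the kind of flat table an estimation routine expects.
rows = conn.execute("""
    SELECT c.customer_id, c.age, c.city, SUM(p.amount) AS total_spend
    FROM customers c
    JOIN purchases p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id
""").fetchall()

for row in rows:
    print(row)
```

The point is that the "data set" an econometrics course hands you ready-made is, in industry, the output of queries like this one.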
Prediction v. Causation
As mentioned in the introduction, most economists know the struggle of establishing causal relationships in models. Finding good quasi-experimental data used to be like hitting the lottery jackpot and led to a good publication. Such data is hard to come by, though, and there are numerous reasons why we fail to establish causality (omitted variables, correlation among the exogenous variables, and so on). When data scientists talk about prediction, their first objective is not to find causality like an economist. Don't get me wrong: it is still more valuable to argue with causality and derive opportunities for action. But in a first step, most data analysts want to use a training data set to estimate parameters and then apply them, e.g. to a more recent data set (prediction, scoring). So when you want to estimate a next best offer for your customers, you use a time-frame approach: take the customers who bought the product in the past and identify the most useful predictors (mostly the customers' attributes). You then take the estimated parameters and apply them to a more recent data set. The most prominent data science methods here are classifiers such as logistic regression, support vector machines, decision trees, and random forests. Your final objective is to address every customer with an individualized offer. Think of Amazon, for example: they could charge more for financially independent and busy people, whereas well-informed discount shoppers will compare prices and pay less. This helps you improve the conversion rate, reduce your marketing expenditure, and raise revenue at the same time.
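The train-then-score workflow above can be sketched in a few lines with scikit-learn. The features and data here are simulated stand-ins; in a real next-best-offer model the columns would be actual customer attributes from your database.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# "Past" time frame: customer attributes plus a 0/1 label for whether the
# customer bought the product. The two features stand in for things like
# standardized age and past spend.
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(size=200) > 0).astype(int)

# Estimate the parameters on the training window.
model = LogisticRegression().fit(X_train, y_train)

# "Recent" time frame: apply the estimated parameters to new customers.
# This out-of-sample step is what data scientists call scoring.
X_new = rng.normal(size=(5, 2))
purchase_prob = model.predict_proba(X_new)[:, 1]
print(purchase_prob.round(2))
```

Note that nothing here claims the coefficients are causal; the model is judged purely by how well the scores rank customers out of sample.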
The Window of Opportunities
Unlike in economics, it is much easier to carry out A/B testing: keeping a group of people who do not see your new web page, get no discount offer, etc. can easily be done in a controlled environment such as a company. You can then test whether your marketing campaign was effective and efficient. Using such control groups is the industry standard and comes highly recommended. Depending on your employer, you may have many more right-hand-side variables (features) than in the microeconometric applications from your studies. Your firm either buys this information from a data broker or collects these variables itself. Working with them is thrilling and interesting.
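Evaluating such an A/B test usually comes down to comparing conversion rates between the control and treatment groups. A minimal sketch, using a standard two-proportion z-test with made-up group sizes and conversion counts:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.
    conv_* are conversion counts, n_* are group sizes."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical campaign: control saw the old page, treatment the new one.
z, p = two_proportion_ztest(conv_a=120, n_a=2000, conv_b=160, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because treatment was randomly assigned, a significant difference here really can be read causally, which is exactly why these control groups are so valuable.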
Conclusion — Data Science and Econometrics
In conclusion, you can say that data science uses some parts of econometrics, but the objective is different: causal relations are useful, but they are not the priority. Being an economist in the field of data science is very helpful for understanding data structures and methods; however, new programming languages and data storage may be areas you need to invest effort in. If you want to learn more about data science, read this article.