## Introduction

Decision trees have been around since the 1970s. They are applied to many problems in modern machine learning and come in different implementations. Why should you use trees? Well, decision trees are celebrated for their interpretability: you can explain why a customer is highly likely to buy a product based on the independent variables (features) the algorithm has identified. Another reason why trees, or an ensemble of trees, are a valuable machine learning algorithm is their robustness against outliers, when properly specified and trained. In addition to a broad explanation of decision trees and random forests, I provide some code using Python's scikit-learn library to estimate both. But let's start with the basic idea of information content and gain.

## Basic Idea

The following explanation is taken from Kubat's book "An Introduction to Machine Learning". The algorithm tries to find the variables which separate the groups of the dependent variable "best", meaning it determines the variables with the highest **information gain**. To elaborate this in greater detail, we want to recall the concepts of information content, information gain and entropy.

### Information Content and Gain

The information content of a message depends on the distribution of the underlying variable. Let's assume you have a number of customers and you are interested in their buying behaviour. When all customers bought your product, the information content of the message that a customer bought your product is zero: even if you randomly draw someone from your sample, you are already certain that this person bought your product. If only 25% of your customers bought your product, you gain some insight when receiving the message. When only very few customers bought your product, knowing which did and which did not is very helpful, and you therefore gain information by receiving the message.
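This intuition can be made concrete. A minimal sketch, using the standard definition of information content in bits, $-\log_2 P(x)$ (the probabilities below are the ones from the text):

```python
import math

def information_content(p):
    """Information content (in bits) of observing an event with probability p."""
    return -math.log2(p)

# All customers bought the product: the message carries no information
assert information_content(1.0) == 0.0

# Only 25% bought: the message "this customer bought" carries 2 bits
assert information_content(0.25) == 2.0

# The rarer the event, the more informative the message
assert information_content(0.01) > information_content(0.25)
```

Note how the information content grows as the event becomes rarer, exactly as described above.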

### Entropy

The information content of a single message $x$ with probability $P(x)$ is defined as $I(x) = -\log_2 P(x)$, as we have just described above. Entropy is defined as the average information content of a message. So if we track the buying behaviour of 500 customers, we receive 500 messages, which we label $x_1, \dots, x_{500}$:

$$H = -\sum_i P(x_i) \log_2 P(x_i) \qquad (1)$$

To average the information content of every message, we use the probability of observing a buying/non-purchasing customer. So, if 100 customers bought our product, $P(\text{buy})$ is $100/500 = 0.2$ and the opposite probability is $0.8$. If we recall our example from above and assume there are only buying customers, the entropy defined in (1) yields $-1 \cdot \log_2(1) = 0$.
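Equation (1) can be sketched in a few lines of Python, plugging in the probabilities from the example:

```python
import math

def entropy(probabilities):
    """Average information content of a message, in bits (equation 1)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 100 of 500 customers bought: P(buy) = 0.2, P(no buy) = 0.8
h = entropy([0.2, 0.8])
assert abs(h - 0.7219) < 1e-3  # roughly 0.72 bits per message

# Only buying customers: the distribution is degenerate, so entropy is zero
assert entropy([1.0]) == 0.0
```

The maximum entropy of one bit is reached when both outcomes are equally likely, which is why a 50/50 split is the hardest case for the classifier.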

## Feature Selection by the Algorithm

After this brief excursion into information theory, we want to apply our knowledge and explain the way the algorithm finds new variables and decides which are helpful and which can be omitted in the estimation process. Let's consider a second variable, in addition to the dependent variable (customer buying behaviour), which is the gender of our customers, named $G$. The information gain of splitting the sample $T$ on $G$ is

$$\text{Gain}(G) = H(T) - \sum_{v} \frac{|T_v|}{|T|} \, H(T_v) \qquad (2)$$

whereas $H(T)$ is the entropy of the whole sample **and** $H(T_v)$ the entropy of the subsample $T_v$ sharing the gender $v$ of our buyers. The algorithm does this for all your variables and chooses the one with the highest information gain for the first split of the whole sample. This node is called the root (of the tree). Repeating this process and finding the next best split results in purer nodes. Pure means that the share of one group (buyers or non-purchasers) increases compared to the whole sample. Your objective is mostly pure nodes. This allows you to identify a subgroup, e.g. males between 25 and 35 years old with an income above 2000 Euro, as more likely to buy your product than others.
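Equation (2) can be sketched as follows; the tiny sample of buying labels and genders below is made up purely for illustration:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = (labels.count(value) for value in set(labels))
    return -sum((c / n) * math.log2(c / n) for c in counts)

def information_gain(labels, feature):
    """Parent entropy minus the weighted entropy of the child nodes (equation 2)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [l for l, f in zip(labels, feature) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical sample: 1 = bought, 0 = did not buy
bought = [1, 1, 0, 0, 1, 0]
gender = ["m", "m", "f", "f", "m", "f"]

# Gender perfectly separates buyers here, so the full parent entropy is gained
print(information_gain(bought, gender))  # 1.0
```

In this contrived sample every male bought and every female did not, so the split removes all uncertainty and the gain equals the parent entropy of one bit. Real features yield smaller gains, and the algorithm picks the largest one.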

How does this work with non-categorical, i.e. numeric, features? If you have metric variables such as income, the algorithm defines thresholds to split them into Booleans. So, the algorithm asks: is the actual value of your income below 100? The newly created variable can be "True" or "False". Be aware that not all occurring income values (attributes) need to be used as thresholds. Rather, the algorithm sorts the income values in ascending order and uses only those which are associated with a change in the dependent variable. In the end, the information-gain-maximizing threshold is chosen.
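The threshold-candidate step can be sketched in a few lines; the income values and labels below are made up for illustration:

```python
# Sort by the feature value and keep only the midpoints between adjacent
# observations where the dependent variable (bought: 1/0) changes.
incomes = [40, 80, 95, 110, 150, 300]
bought  = [0,  0,  1,  1,   0,   1]

pairs = sorted(zip(incomes, bought))
candidates = [(a + b) / 2                      # midpoint between adjacent values
              for (a, la), (b, lb) in zip(pairs, pairs[1:])
              if la != lb]                     # only where the label flips

print(candidates)  # [87.5, 130.0, 225.0]
```

Only three of the five possible boundaries survive, and the algorithm would then evaluate the information gain of each surviving candidate and keep the best one.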

**Pruning**

Because decision trees tend to overfit (noise or non-systematic events influencing the results are treated as important factors), it is recommended to restrict the number of features used to build a tree. This reduces the amount of modelled noise and improves the out-of-sample prediction performance. Additionally, the interpretability and readability of the tree increase. Pruning is "simply" the manual restriction of the number of attributes used. Restricting the number of branches of a tree may increase the error rate and therefore the number of misclassified individuals. The best amount of pruning can be determined by comparing the error rate after pruning with the error rate of the fully specified tree. For more information, read the brief Wikipedia article on pruning.
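A minimal sketch of the two pruning flavours scikit-learn offers, shown on a synthetic dataset (`make_classification` stands in for real data): pre-pruning by capping `max_depth`, and post-pruning via cost-complexity pruning with `ccp_alpha`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)       # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)   # post-pruning

# Both pruned trees are (much) smaller than the fully grown one
print(full.tree_.node_count, pre.tree_.node_count, post.tree_.node_count)
```

Comparing the test-set error of the pruned and the fully grown tree, as described above, tells you how much pruning the data tolerates.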

I recommend playing around with the code below and see the effect on the tree yourself.

### Remarks on the Algorithm

The attentive reader recalls that there is no way to infer a causal relationship between the dependent variable and the features. Why? Because buying behaviour is a complex process with many dependencies. Gaining information about your customers' choices based on a characteristic does not mean that there is a causal relationship; the variables are just associated. A third, underlying variable can be the root cause of a decision. This is **very** important, because managers tend to interpret the features as if they were causal, and management decisions are then based on this pseudo "insight". Be aware of this phenomenon when presenting a model or its results in front of any non-statistician. As mentioned above, the decision tree method tends to overfit if the tree is not pruned.

## Python Example

To reproduce this example, download this file and adapt the path in the script to read it.

```python
"""
Created on Sat Jan 20 20:37:19 2018
@author: Clemens from fitted-data.com
"""
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split

# credit to zackthoutt (see https://www.kaggle.com/zynicide) for this dataset!
DF = pd.read_csv("Your_Path/WineData.csv", index_col='Unnamed: 0')
DF.head(n=10)

''' Most machine learning algorithms cannot handle NaNs.
Dropping them is the easiest way for now '''
DF = DF.dropna()
DF = DF.drop("region_1", axis=1)
DF["Dep"] = np.where(DF["points"] >= 90, 1, 0)
DF["Dep"].describe()  # 38% are ones
DF = DF.drop("points", axis=1)

''' Additionally, you need to convert the strings. We want to use them as
single dummies. The easiest way of doing this is pandas.get_dummies() '''
def PrePreper(MyColumn):
    # This function generates the dummies and concatenates them
    Dummies = pd.get_dummies(MyColumn)
    Merging = [DF.drop(MyColumn.name, axis=1), Dummies]
    return pd.concat(Merging, axis=1)

DF = PrePreper(DF["country"])
DF = PrePreper(DF["province"])

# Split the data set into train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    DF.drop('Dep', axis=1), DF["Dep"], test_size=0.33, random_state=42)

# Finally, estimating the tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=5)
clf = clf.fit(X_train, y_train)

# Which features were used and how important were they?
FeatImp = pd.DataFrame({'Feat': X_train.columns.values.tolist(),
                        'Importance': clf.feature_importances_})
FeatImp.sort_values(by=['Importance'], ascending=False)

# Export the tree as a graph using graphviz, which needs to be installed!
tree.export_graphviz(clf, feature_names=X_train.columns.values.tolist(),
                     filled=True, rounded=True, class_names=["Okay", "Top"],
                     out_file="Your_path/tree.dot")
# In CMD or Terminal, navigate to Your_path and run "dot -Tpng tree.dot -o tree.png"
```

This is the decision tree generated by sklearn. Remember, its maximum depth was restricted to five, but we still have a bulky object. The tree used the price as the first criterion to separate Top from Okayish wines. On the right, blue-coloured side, we have the leaves containing mostly Top wines. Every leaf contains:

- the split criterion (e.g. price is below 65.5),
- the entropy (average information content of a message),
- the number of wines (samples) in this leaf, after all conditions have been fulfilled,
- the number of Okayish wines and of Top wines (value =),
- the dominating class in the leaf.

At the bottom of the tree, you can see that only 4 end-nodes are pure, meaning there is only one wine category in them. This is because of the restricted number of variables, i.e. the max_depth argument of 5. One objective in modelling is the purity of the leaves: if I randomly select a wine meeting all the criteria of the bottom-right end-node, I want it to be a Top wine, or a customer purchasing my good. However, this objective can be constrained by the desire to avoid overfitting.

```python
# Use one simple measurement to have a benchmark for the forest
y_pred = clf.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
# 74% of all labels were correctly classified.
```

With the code above we calculate one of many performance metrics. I will write a whole post about performance measurements for classifiers, but for now we stick to the **accuracy score**, which represents the fraction of correctly classified labels. We want to use this simple number as a benchmark for our random forest below.
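To see that the accuracy score really is just the share of matching labels, here is a quick sanity check with toy labels (made up for illustration):

```python
from sklearn import metrics

# Toy labels, purely for illustration
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# sklearn's metric and the share of matches computed by hand agree
print(metrics.accuracy_score(y_true, y_pred))                          # 0.75
print(sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true))       # 0.75
```

Three of the four predictions match the true labels, hence an accuracy of 0.75.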

```python
# Estimate the random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=25, max_depth=7,
                             random_state=0, criterion="entropy")
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
# This forest in particular also yields 74% accuracy

FeatImp = pd.DataFrame({'Feat': X_train.columns.values.tolist(),
                        'Importance': clf.feature_importances_})
FeatImp.sort_values(by=['Importance'], ascending=False)
''' Some provinces were not used by any of the 25 trees;
they may have no information gain '''
```

I have some remarks concerning the whole code above:

- I kept the data preparation part to show common problems in preparing data for data science (and those are only a few of them).
- You can download graphviz here, or install it on your Mac using Homebrew (brew install graphviz).
- One would tweak the hyperparameters of a random forest to find the optimal model regarding several performance metrics, however this is not the objective of this post.

It is quite interesting that the random forest did not perform better than the single tree (without hyperparameter optimization). When looking at the feature importances, you will notice that some features are never used by the forest. This can be the case when only a few or no attributes exist matching the positive label of the dependent variable. So in our example there are only seven "Top" wines from Massachusetts; very shocking indeed.

## Decision Trees — Conclusion

Decision trees are machine learning algorithms that are mostly easy to use and to communicate. Their extension, the random forest, is a reliable ensemble method. However, since the routine only looks at the information gain of a message, you **cannot infer any causal** relationships based on the feature selection process. Be aware of overfitting when not restricting the tree to a maximum number of variables. Additionally, the routine analyses the covariance of your features only indirectly: when the first feature makes the best split criterion, the second may not contribute enough information gain to be chosen. You could save computing time by pre-assessing your features' correlations. When modelling a classifier for an important business objective, take your time, estimate different models, and compare their performances.