Introduction

When you get your hands on a new dataset, describing the variables automatically can be extremely useful. Their distribution, quality (missing values) and correlation with other variables determine their suitability for different statistical methods. As mentioned in the R-intro, there are three important classes in R, and the class of a variable determines which descriptive statistics are available. Continuous numeric vectors can be plotted against other variables, and we can visualize their density function or a histogram. Categorical features (factors in R) are largely limited to their frequencies, although contingency tables can be helpful for multivariate analysis. The third class, strings or unstructured data, can also be analyzed and summarized with different measures, as is done in this post about sentiment analysis.
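To make the link between class and available statistics concrete, here is a minimal sketch in base R. The helper name describe_column and the choice of statistics are my own assumptions for illustration, not the code discussed later in this post.

```r
# Hypothetical helper: choose a summary based on the class of a column.
describe_column <- function(x) {
  if (is.numeric(x)) {
    # Numeric vectors: location, spread and number of missings
    c(mean = mean(x, na.rm = TRUE),
      sd = sd(x, na.rm = TRUE),
      missing = sum(is.na(x)))
  } else if (is.factor(x)) {
    # Factors: frequency per level, with NA counted if present
    table(x, useNA = "ifany")
  } else {
    # Strings: number of distinct non-missing values and number of missings
    c(distinct = length(unique(x[!is.na(x)])),
      missing = sum(is.na(x)))
  }
}

# Apply the helper to every column of the built-in iris data
summaries <- lapply(iris, describe_column)
summaries$Species
```

For a numeric column, calls such as hist(x) or plot(density(x)) would fit into the first branch. They are left out here to keep the sketch free of graphics output.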

Automatic Analysis using R-Markdown

The code below is as generic as possible, so that it automatically labels and describes your variables according to their class. It is written to be used in R-Markdown, an easy-to-use markup language that produces output such as HTML or PDF. Let me say a few things about the code and its result:

  • This code is far from finished; extensions to multivariate relations are possible. However, in large datasets the number of possible combinations may be too large to visualize.
  • The example below uses the built-in iris dataset. I tried several other datasets, which worked fine, but the classes of your columns need to be specified correctly when reading the data.
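As an illustration of the last point, the sketch below shows one way to fix column classes at read time with the colClasses argument of read.csv. The temporary CSV file and the variable names are assumptions made only to keep the example self-contained.

```r
# Write iris to a temporary CSV so the example needs no external file
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Without hints the class of Species depends on the R version:
# since R 4.0 the default is stringsAsFactors = FALSE, so it is read as character
raw <- read.csv(csv_path)

# colClasses declares the class of each column up front
typed <- read.csv(
  csv_path,
  colClasses = c(rep("numeric", 4), "factor")
)

# Inspect the resulting class of every column
sapply(typed, class)
```

If only a few columns need fixing, converting them after reading, for example with as.factor(), is an alternative.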

 

R-Markdown example Output


You can have a look at the R-Markdown output here. If you have any notes on the code or ideas for extensions, share them with me!

The Code