Recipe 6: Statistics? I thought we were doing ML!

Larry Wasserman, professor in the Department of Statistics & Data Science and in the Machine Learning Department at Carnegie Mellon, puts it best: “What is the difference between Statistics and Machine Learning? None. They are both concerned with the same question, how do we learn from data.”

However, as he notes, Machine Learning emphasizes high-dimensional prediction problems, whereas statistics emphasizes formal statistical inference in lower-dimensional problems. Therefore, if your problem isn’t high dimensional, perhaps it is best to explore some statistical techniques first.



Statistical tests

Useful Statistics for a Single Sample

A statistic, properly defined, is a function of your data. Oftentimes, if we have a single sample, such as the time between engine repairs, all we want is to understand the average value of our data. In this case it suffices to calculate the sample mean or the sample median. However, it is also often useful to understand how much the data vary on average. In this case, you might calculate the sample variance or the interquartile range. This gives you a feel not only for how long you expect to wait between engine repairs, but also for how wrong you could be if you give a prediction for the next time between repairs.
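As a quick illustration, here is a minimal Python sketch (using NumPy, with made-up repair-interval numbers) of these summary statistics:

```python
import numpy as np

# Hypothetical sample: weeks between successive engine repairs
weeks_between_repairs = np.array([3.1, 4.8, 2.5, 6.0, 3.9, 5.2, 4.4, 2.8, 7.1, 3.6])

# Central tendency: where is the "typical" value?
print("mean:  ", np.mean(weeks_between_repairs))
print("median:", np.median(weeks_between_repairs))

# Spread: how far from the typical value could we be?
print("sample variance:", np.var(weeks_between_repairs, ddof=1))  # ddof=1 -> sample variance
iqr = np.percentile(weeks_between_repairs, 75) - np.percentile(weeks_between_repairs, 25)
print("interquartile range:", iqr)
```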



Statistical Inference/Hypothesis Testing for a Single Sample

A key difference between statistics and machine learning is that statistics often focuses not only on estimating parameters such as the population mean, but also on computing confidence intervals or conducting hypothesis tests for those parameters. For instance, statistical tests provide answers to questions such as: are we confident that the time between engine repairs is no more than five weeks? Some popular tests here are the t-test (if you can assume your data are normally distributed), the z-test (if you have roughly 30 or more observations), or the Wilcoxon signed-rank test (fewer assumptions, but less statistical power). If your data are binary, you could use either a z-test or an exact test, both of which can be easily implemented in R or Python.
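For example, a minimal sketch of two of these one-sample tests in Python with SciPy, again on made-up repair-interval data (the `alternative=` keyword requires a reasonably recent SciPy):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: weeks between successive engine repairs
weeks = np.array([3.1, 4.8, 2.5, 6.0, 3.9, 5.2, 4.4, 2.8, 7.1, 3.6])

# One-sample t-test: is the mean time between repairs less than 5 weeks?
# (assumes the data are roughly normally distributed)
t_stat, p_val = stats.ttest_1samp(weeks, popmean=5, alternative="less")
print(f"t-test: t = {t_stat:.2f}, p = {p_val:.3f}")

# Wilcoxon signed-rank test on the same question (fewer assumptions, less power)
w_stat, p_val = stats.wilcoxon(weeks - 5, alternative="less")
print(f"Wilcoxon: W = {w_stat:.2f}, p = {p_val:.3f}")
```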



Statistics / Tests for Multiple Groups

More commonly, we have multiple groups that we are comparing. Within an ML framework, this also occurs whenever you have one-hot encoded predictors. For instance, we might be interested in whether salary differs depending on whether you have a high-school diploma only, a college degree, or a post-graduate degree. Here we have three groups whose differences we want to examine. The most useful statistic here is the F statistic, which, roughly, compares the variability between groups to the overall variability. This can be analyzed using an ANOVA test (assuming your data are normally distributed) or a Kruskal-Wallis test.

If you only have two groups, you could conduct a two-sample t-test; however, that is exactly equivalent to conducting an ANOVA with two groups.
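A small sketch with SciPy, using invented salary numbers for the three education groups, might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical salaries (in $1,000s) for three education groups
high_school = np.array([38, 42, 45, 40, 39, 47])
college     = np.array([52, 58, 55, 61, 49, 57])
graduate    = np.array([63, 70, 66, 72, 68, 65])

# One-way ANOVA: the F statistic compares between-group to within-group variability
f_stat, p_val = stats.f_oneway(high_school, college, graduate)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Kruskal-Wallis: a rank-based alternative when normality is doubtful
h_stat, p_val = stats.kruskal(high_school, college, graduate)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_val:.4f}")
```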



Regressions

While regressions are used in Machine Learning, they were actually first employed in statistics, often for prediction, but also to understand which factors significantly impact the outcome. They allow you to further characterize the data and explain how a dependent variable is affected by the independent variables, along with the strength of those relationships. These characteristics are captured by the coefficients and their p-values. When your outcome is continuous and roughly normally distributed, it is common to use a linear regression model. If your outcome is binary, a logistic regression model is more appropriate.
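As one possible sketch, here is how you might fit both model types with statsmodels on simulated salary-vs-experience data (the data and the $70k cutoff are invented for illustration) and read off the coefficients and p-values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: salary ($1,000s) driven by years of experience plus noise
experience = rng.uniform(0, 20, size=100)
salary = 40 + 2.5 * experience + rng.normal(0, 5, size=100)

# Linear regression: coefficients and p-values for a continuous outcome
X = sm.add_constant(experience)        # add the intercept term
linear_model = sm.OLS(salary, X).fit()
print(linear_model.summary())          # coefficients, p-values, R^2, ...

# Logistic regression: same idea for a binary outcome (e.g., salary above $70k?)
high_earner = (salary > 70).astype(int)
logit_model = sm.Logit(high_earner, X).fit()
print(logit_model.summary())
```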



Monte Carlo Simulations

If you want to analyze the risk of making various decisions and you don’t have a lot of “truth” data, Monte Carlo simulations may be what you’re looking for. For instance, you may be looking to systematically decide which weapon system to use in a given scenario, or how long it will take for soldiers to pass through a physical training course. The only things this analysis requires are enough data and/or subject-matter knowledge to estimate a probability distribution for the input variables and an equation for the dependent variable(s) of interest. Knowing the range/accuracy of various weapon systems and its confidence interval(s), or the amount of time the slowest/fastest/average soldier takes to cycle through each PT station, is the bare minimum data needed to conduct a Monte Carlo simulation. The simulator draws random values from these distributions to calculate an end state hundreds, if not thousands, of times over, producing the most likely outcome, range, and variance of the potential results of a given decision. Depending on the size of your simulator, a Monte Carlo simulation can even be easily implemented in Excel.
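As a rough sketch of the idea, here is a small Python Monte Carlo simulation of the PT-course example, with made-up per-station min/most-likely/max time estimates modeled as triangular distributions:

```python
import numpy as np

rng = np.random.default_rng(42)
n_runs = 10_000  # number of simulated soldiers

# Hypothetical min / most-likely / max times (minutes) per PT station,
# estimated from data or subject-matter expertise
stations = [
    (2.0, 3.0, 5.0),   # station 1
    (4.0, 6.0, 9.0),   # station 2
    (1.5, 2.5, 4.0),   # station 3
]

# Draw a random time for each station on every run and add them up
total_time = sum(
    rng.triangular(low, mode, high, size=n_runs) for low, mode, high in stations
)

print("most likely (mean) total time:", total_time.mean())
print("5th-95th percentile range:", np.percentile(total_time, [5, 95]))
print("variance of total time:", total_time.var())
```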



Optimization

If you have well-defined constraints and simply want to know what the “best” decision (objective) is, without any data to compare it to, modeling it as an optimization problem may be your best bet. All you need is your goal written as an objective equation and your rules modeled as constraint equations/inequalities. You can then apply various optimization algorithms depending on whether the problem is linear/nonlinear, integer/mixed-integer, etc. For example, if you want to figure out the best aircraft maintenance schedule cycle based on clear rules, you could model it as an integer program and estimate a solution using solver packages like Gurobi in Python, R, MATLAB, or, ideally, Julia. Technically, you can also solve small optimization problems in Excel… Fun fact: optimization problems are also behind the scenes of all those machine learning algorithms (which are really just error-minimization problems).
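To make this concrete, here is a toy integer program sketched with SciPy’s linprog (the integrality option needs SciPy 1.9 or newer); the numbers and the maintenance-slot framing are invented purely for illustration:

```python
from scipy.optimize import linprog

# Toy problem: choose how many maintenance slots x1, x2 to schedule on two
# aircraft types to minimize total downtime, subject to crew-hour limits.
# minimize   4*x1 + 3*x2         (downtime per slot)
# subject to x1 + x2   >= 10     (at least 10 slots total)
#            2*x1 + x2 <= 16     (crew hours available)
#            x1, x2 >= 0 and integer
c = [4, 3]
A_ub = [[-1, -1],   # -(x1 + x2) <= -10 encodes x1 + x2 >= 10
        [ 2,  1]]   # 2*x1 + x2 <= 16
b_ub = [-10, 16]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
                 integrality=[1, 1], method="highs")
print("optimal slots:", result.x, "objective:", result.fun)
```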



Network Analysis

When your data points are naturally interconnected with one another, modeling them as a network and conducting network analysis may be a useful technique. For example, if you’re trying to understand the spread of COVID across your unit, you may want to model people as nodes and edge weights as the closeness of their interactions. You can then identify key spreaders or people at risk using various network statistics (e.g., degree, PageRank, distance). You could also build directed networks to analyze the one-way effects of nodes/changes on one another, compare networks to one another, or measure the strength of a network. Using the natural structure of your data by modeling it as a network graph may illuminate unique characteristics.
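A minimal NetworkX sketch of this idea, with an invented contact network, might look like the following:

```python
import networkx as nx

# Hypothetical contact network: edge weights encode closeness of interaction
G = nx.Graph()
G.add_weighted_edges_from([
    ("Alice", "Bob", 0.9), ("Alice", "Carol", 0.4), ("Bob", "Carol", 0.7),
    ("Carol", "Dave", 0.8), ("Dave", "Erin", 0.3), ("Erin", "Frank", 0.6),
])

# Degree: how many direct contacts each person has
print("degree:", dict(G.degree()))

# PageRank (weighted): one rough measure of who the key spreaders are
print("pagerank:", nx.pagerank(G, weight="weight"))

# Distance: how many hops separate two people
print("shortest path Alice -> Frank:", nx.shortest_path(G, "Alice", "Frank"))
```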



Data Visualizations

So you don’t have a lot of data (or maybe you do), but your customer can’t digest what’s going on with just one look at a table. Sometimes, all an end user needs from their data scientist is an easy way to understand what’s going on. Choosing the right kind of graph for your data and problem is both an art and a science. To choose the right graph, you must know whether your data are quantitative or qualitative, categorical or ordinal, continuous or discrete, etc. As to the art of data visualization, you must know how to display your data without misrepresenting it while still telling your desired story: how to group the variables, color code, scale the y-axis, etc.
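For example, here is a small Matplotlib sketch for a categorical comparison (a made-up salary-by-education summary), where a simple bar chart with an honest y-axis is usually the right call:

```python
import matplotlib.pyplot as plt

# Hypothetical summary: average salary ($1,000s) by education level
levels = ["High school", "College", "Graduate"]
avg_salary = [42, 55, 67]

# Categorical comparison -> a simple bar chart is usually the right choice
fig, ax = plt.subplots()
ax.bar(levels, avg_salary, color="steelblue")
ax.set_ylabel("Average salary ($1,000s)")
ax.set_title("Average salary by education level")
ax.set_ylim(0, max(avg_salary) * 1.1)  # keep the y-axis honest (start at zero)
plt.show()
```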

If you want to build user-friendly and visually appealing dashboards, Power BI, Tableau, Streamlit, Dash, and Kibana (in order from least to most “coding” required) may be your friends. These are especially helpful if you want the dashboard to update live based on changes within a database. For more information about building and deploying a dashboard, check out Recipe 10.


