Recipe 5: Problem Exploration

Exploratory Data Analysis (EDA)

Before diving into a problem, you need to understand your data. EDA is the first hands-on-a-keyboard step and will frame how you proceed.

Here’s an excellent EDA checklist of questions to work through:



What question(s) am I trying to solve?

Don’t just dive into coding! If you don’t know what question you are trying to solve, you can’t begin. Getting the right answer to the wrong question might be cool machine learning, but doesn’t add value to your organization. You need to know whether it’s a classification or regression problem, if you have enough data, if the data supports the problem you’re trying to solve, and more! As we specified in Recipe 3, make sure the information you are seeking is actually present in the data. For example, if you are trying to predict engine failures based on engine telemetry, the label of ‘this engine broke and required a repair’ may be included in your data, or may need to be brought in separately and merged with the telemetry data.

Clarify your objectives and make sure you have the correct variables to support those objectives. For a more on the importance of data quality and an in-depth pipeline to cleaning data and clarifying objectives, check out this HBR article.



What data do I have?

Which features are numerical (e.g. price), which are categorical (e.g. sex), which are objects (e.g. text)? A domain expert or subject matter expert can be very helpful in order to understand what each feature means in relation to the problem. Understanding your data will allow you to manipulate it so you don’t violate any of the assumptions inherent in your model. (Seriously, read that assumptions link). Below are just two examples of datatypes (categorical and numerical) and the feature engineering required for each. For more examples on each datatype and feature engineering, see Recipe 7.

Categorical data

May need to be encoded into numbers, depending on the type of algorithm you use. Be careful how you do categorical encoding, it can greatly increase dimensionality (in the case of one-hot encoding) or you may encode improperly. For instance, ordinal values (strongly disagree, disagree, neutral, agree, strongly agree, can be encoded sequentially (0,1,2,3,4) but nominal values (e.g. cat, dog, bird, fish) should not be encoded sequentially because if 0 is cat and 2 is bird, an algorithm may perceive that 2 > 0 therefore bird > cat.

Numerical data

May be correlated with one another, making some features unnecessary or at least very little value-add. Removing those features could be useful if you’d like to keep your model simple without losing too much ability to explain variance of your predictor. You may want to build your first simple model using the features most correlated with your target variable. Normalizing numerical features can also improve algorithm performance. Some numerical features may actually only be nominal, so you have to treat them as categorical variables.

If you have data that’s not relevant to answering your problem question, feel free to drop it. This could be duplicate features (‘url’, ‘image_url’), unrelated features, or outliers that might bias your model.



What’s missing from my data?

Just as important as the data you have is the data you don’t have. Whether it’s a feature column filled with NaN (not a number) values, or a feature that doesn’t exist within your data, or one you can engineer. For NaN (null) values, you can impute, or replace them with substituted values like the mean or median, though there are other options. If you are missing a key feature you’d like to have, you may need to find that data from another source, or merge multiple data sources. For more missing data feature engineering examples, see Recipe 7.



What is the distribution of my data?

Analyzing your data distribution informs you of underlying characteristics and informs you on what you can (or should) do with it. Uninformed modeling or modeling that violates assumptions (most regression based methods assume normality) may result in unexpected and inaccurate results.

For starters, basic statistics (i.e. mean, median, maximum, minimum, standard deviation) will tell you the possible values in your dataset. These also provide initial information on possible skew and outliers within your data. Skews and outliers can then be easily validated through graphing the data. Box plots assess if outliers exist; histograms assess normality and skewness; scatterplots can be used to assess linearity; QQ plots compare the distribution of your data with the assumed distribution or another population’s data.

Once you’re armed with these characteristics, you can address any violated assumptions. You could remove outliers from your data to make it more normal and prevent your model from learning these values. You could apply transformations to the data to achieve linearity or normality assumptions. If the distribution of your data remains inconsistent across features, you will want to apply non-parametric statistical and modeling techniques.



Metrics

Choosing a metric to evaluate your model should be based on your desired end state and the data you’re using. Note that machine learning models never learn to actually answer your underlying questions (e.g., “Why does X happen”, “What happens to Y if I increase X?”). Instead, they learns rules and parameters such that their outputs will maximize or minimize a metric. In this sense, metrics are always proxies for your true questions, so choose them carefully! Here are some popular ones in machine learning:



Classification Metrics



Precision and Recall
A handy image to help remember precision and recall. Source



Regression Metrics

When finally evaluating your model, you should take the confidence interval of its performance (from different random samples of your training data) to ensure that one instance of poor or good performance isn’t just a fluke of your data.

Evaluating for unsupervised learning is understandably a more difficult task since you don’t have a truth to compare your model to. For clustering, many of the common measures are on how close together your clusters are internally and how far apart they are from each other (e.g. silhouette coefficient). Silhouette score is usually not a good metric for high dimensional data. Look at an “elbow,” where the slope of the silhouette plot is high, then flattens out. In the end, you should seek an unsupervised model that you can interpret well. For instance, you may have a very high silhouette coefficient, but if it does not translate to your final application then you may want to choose a different model.

A final, oft forgotten measurement is the amount of compute resources it takes to train and run your model. If you’re building a tool for end users that requires fast delivery of results or being able to be deployed on a local machine, you may have to sacrifice some degree of accuracy for speed and memory resources. For more details, check out the computation and memory tradeoff in this blog post.



Learn more



» Previous recipe
» Next recipe


» About
» Ingredients
» All recipes
» Resources
» Code examples