Exploratory Data Analysis (EDA)
Before diving into a problem, you need to understand your data. EDA is the first hands-on-a-keyboard step and will frame how you proceed.
Here’s an excellent EDA checklist of questions to work through:
- What question(s) am I trying to solve?
- What kind of data do I have?
  - Types, relationships between the data
- What’s missing from my data? How will I deal with it?
- What is the distribution of my data?
- Are there outliers?
- Do I need to change the distribution of the data for my algorithms?
- What feature engineering is needed?
What question(s) am I trying to solve?
Don’t just dive into coding! If you don’t know what question you are trying to solve, you can’t begin. Getting the right answer to the wrong question might be cool machine learning, but doesn’t add value to your organization. You need to know whether it’s a classification or regression problem, if you have enough data, if the data supports the problem you’re trying to solve, and more! As we specified in Recipe 3, make sure the information you are seeking is actually present in the data. For example, if you are trying to predict engine failures based on engine telemetry, the label of ‘this engine broke and required a repair’ may be included in your data, or may need to be brought in separately and merged with the telemetry data.
Clarify your objectives and make sure you have the correct variables to support those objectives. For more on the importance of data quality, and for an in-depth pipeline for cleaning data and clarifying objectives, check out this HBR article.
What data do I have?
Which features are numerical (e.g. price), which are categorical (e.g. sex), and which are objects (e.g. text)? A domain or subject matter expert can be very helpful for understanding what each feature means in relation to the problem. Understanding your data will allow you to manipulate it without violating any of the assumptions inherent in your model. (Seriously, read that assumptions link). Below are just two examples of data types (categorical and numerical) and the feature engineering required for each. For more examples of each data type and its feature engineering, see Recipe 7.
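A quick first step is simply to inspect the types pandas has inferred for each column. Here is a minimal sketch, assuming your data is already loaded into a dataframe; the dataframe and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical dataset for illustration
df = pd.DataFrame({
    "price": [9.99, 14.50, 3.25],        # numerical
    "sex": ["F", "M", "F"],              # categorical
    "review": ["great", "ok", "bad"],    # text / object
})

# Column dtypes, non-null counts, and memory usage
df.info()

# Separate numerical from non-numerical columns
numeric_cols = df.select_dtypes(include="number").columns
object_cols = df.select_dtypes(include="object").columns
print("numeric:", list(numeric_cols))
print("object:", list(object_cols))
```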
Categorical data
May need to be encoded into numbers, depending on the type of algorithm you use. Be careful how you do categorical encoding: it can greatly increase dimensionality (in the case of one-hot encoding), or you may encode improperly. For instance, ordinal values (strongly disagree, disagree, neutral, agree, strongly agree) can be encoded sequentially (0, 1, 2, 3, 4), but nominal values (e.g. cat, dog, bird, fish) should not be encoded sequentially, because if 0 is cat and 2 is bird, an algorithm may perceive that 2 > 0 and therefore bird > cat (see the sketch below).
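As a concrete illustration, here is a minimal sketch of both encodings using pandas; the column names, category ordering, and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "opinion": ["agree", "neutral", "strongly disagree"],  # ordinal
    "animal": ["cat", "dog", "bird"],                      # nominal
})

# Ordinal: map categories to integers that preserve their order
opinion_order = {"strongly disagree": 0, "disagree": 1, "neutral": 2,
                 "agree": 3, "strongly agree": 4}
df["opinion_encoded"] = df["opinion"].map(opinion_order)

# Nominal: one-hot encode so no artificial ordering is implied
df = pd.get_dummies(df, columns=["animal"], prefix="animal")
print(df)
```

Note how the one-hot encoding adds one column per category, which is where the dimensionality blow-up comes from.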
Numerical data
May be correlated with one another, making some features redundant or of very little added value. Removing those features can be useful if you'd like to keep your model simple without losing too much ability to explain the variance in your target. You may want to build your first simple model using only the features most correlated with your target variable. Normalizing numerical features can also improve algorithm performance. Some numerical features may actually be nominal (e.g. ZIP codes), so you have to treat them as categorical variables.
If you have data that's not relevant to answering your question, feel free to drop it. This could include duplicate features ('url', 'image_url'), unrelated features, or outliers that might bias your model.
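Here is a minimal sketch of these checks: dropping an irrelevant feature, inspecting correlations, and normalizing the numerical features with scikit-learn's StandardScaler. The dataset and column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2000],
    "rooms": [2, 3, 3, 4],
    "price": [200_000, 280_000, 330_000, 450_000],
    "listing_url": ["a", "b", "c", "d"],   # irrelevant to the prediction
})

# Drop features that can't help answer the question
df = df.drop(columns=["listing_url"])

# Check pairwise correlations, including correlation with the target
print(df.corr())

# Normalize the numerical features (zero mean, unit variance)
features = df.drop(columns=["price"])
scaled = StandardScaler().fit_transform(features)
print(scaled)
```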
What’s missing from my data?
Just as important as the data you have is the data you don't have, whether that's a feature column filled with NaN (not a number) values, a feature that doesn't exist in your data, or one you could engineer. For NaN (null) values, you can impute them, i.e. replace them with substituted values such as the mean or median, though there are other options (a minimal sketch follows below). If you are missing a key feature you'd like to have, you may need to find that data from another source or merge multiple data sources. For more missing-data feature engineering examples, see Recipe 7.
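A minimal sketch of imputing missing values, shown two ways (plain pandas and scikit-learn's SimpleImputer); the column and its values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31]})

# Option 1: pandas fillna with the column median
df["age_filled"] = df["age"].fillna(df["age"].median())

# Option 2: scikit-learn imputer (fit on training data, reusable on test data)
imputer = SimpleImputer(strategy="median")
df["age_imputed"] = imputer.fit_transform(df[["age"]]).ravel()
print(df)
```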
What is the distribution of my data?
Analyzing the distribution of your data reveals its underlying characteristics and tells you what you can (or should) do with it. Uninformed modeling, or modeling that violates assumptions (many regression-based methods assume normality), may produce unexpected and inaccurate results.
For starters, basic statistics (i.e. mean, median, maximum, minimum, standard deviation) tell you the range of values in your dataset and give initial hints about skew and outliers. Skew and outliers can then be easily validated by graphing the data: box plots reveal outliers; histograms show normality and skewness; scatterplots assess linearity; QQ plots compare the distribution of your data with an assumed distribution or another population's data.
Once you're armed with these characteristics, you can address any violated assumptions. You could remove outliers from your data to make it more normal and prevent your model from learning those values, or apply transformations to the data to better satisfy linearity or normality assumptions (sketched below). If the distribution of your data remains inconsistent across features, you will want to apply non-parametric statistical and modeling techniques.
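Here is a minimal sketch of this workflow on a synthetic right-skewed feature: compute basic statistics, check skewness, and apply a log transform to pull the distribution closer to normal. The plotting calls are left as comments:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature (e.g. incomes)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=1_000))

# Basic statistics: mean, std, min/max, quartiles
print(income.describe())
print("skewness:", income.skew())

# Plots to validate what the statistics suggest (require matplotlib):
# income.plot.box()                               # outliers
# income.plot.hist(bins=50)                       # skew / normality
# stats.probplot(income, dist="norm", plot=plt)   # QQ plot

# A log transform often pulls a right-skewed feature closer to normal
log_income = np.log1p(income)
print("skewness after log transform:", log_income.skew())
```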
Metrics
Choosing a metric to evaluate your model should be based on your desired end state and the data you're using. Note that machine learning models never learn to actually answer your underlying questions (e.g., "Why does X happen?", "What happens to Y if I increase X?"). Instead, they learn rules and parameters such that their outputs maximize or minimize a metric. In this sense, metrics are always proxies for your true questions, so choose them carefully! Here are some popular ones in machine learning:
Classification Metrics
- Accuracy: The proportion of items classified correctly. This is great when you have (near) perfectly balanced data but can be misleading when you have unbalanced data.
- Precision: How many selected items are relevant: true positives / (true positives + false positives)
- Recall: How many relevant items were selected: true positives / (true positives + false negatives)
- Example: You use a search engine to answer a question. Say 60 relevant pages exist, but your search engine returns 30 pages and only 20 of those are relevant, failing to return the other 40 relevant pages. Precision is 20/30 = 2/3, which tells us how valid the results are. Recall is 20/60 = 1/3, which tells us how complete the results are. (These metrics are computed in the sketch after this list.)
- F1: The harmonic mean of precision and recall; like them, it does not take true negatives into account. This is great when you have highly unbalanced data and a small positive class (e.g. identifying spam emails among regular emails).
- AUC/ROC: Area Under the Receiver Operating Characteristic Curve. The ROC is a plot of the True Positive Rate vs False Positive Rate at various decision thresholds. This allows you to compare models regardless of the chosen threshold.
- Log Loss (or cross-entropy loss): measures how close the predicted probabilities for each class are to the true labels, penalizing confident wrong predictions heavily. See more info here.
A handy image to help remember precision and recall. Source
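To make these definitions concrete, here is a minimal sketch that computes the metrics above with scikit-learn on a tiny set of hypothetical labels, predictions, and predicted probabilities:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

# Hypothetical labels and predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # P(class 1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("AUC      :", roc_auc_score(y_true, y_prob))    # threshold-independent
print("log loss :", log_loss(y_true, y_prob))         # penalizes confident mistakes
```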
Regression Metrics
- MAE: Mean Absolute Error. The average of the absolute values of all the errors. Taking the absolute value means positive and negative errors are weighted the same.
- MSE: Mean Squared Error. The average of the squared errors. Squaring removes the sign of the errors and penalizes larger errors more heavily. Be careful: MSE is in squared units (e.g. dollars squared), not the original units of the prediction (dollars).
- RMSE: Root Mean Squared Error. Square root of average of errors squared. More easily interpretable than MSE because error units match prediction units. Still punishes large errors because errors are squared, then averaged, then square rooted.
- R^2 and Adjusted R^2: used for explanatory purposes. They show how well your independent variables explain the variability of your dependent variable. (These regression metrics are computed in the sketch below.)
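A minimal sketch computing these regression metrics with scikit-learn and NumPy on hypothetical dollar-valued predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and model predictions (e.g. prices in dollars)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # back to the original units
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}  (dollars)")
print(f"MSE:  {mse:.2f}  (dollars squared)")
print(f"RMSE: {rmse:.2f}  (dollars)")
print(f"R^2:  {r2:.3f}")
```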
When finally evaluating your model, estimate a confidence interval for its performance (from different random samples of your training data) to ensure that one instance of poor or good performance isn't just a fluke of your data.
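One common way to get such an interval is to score the model across several cross-validation folds and report the spread. Here is a minimal sketch using a synthetic dataset and logistic regression purely for illustration; the ±1.96·std band is only a rough normal approximation, not a rigorous interval:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for your real dataset
X, y = make_classification(n_samples=500, random_state=0)

# Score the model on several random folds of the data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

# Rough ~95% interval around the mean accuracy (normal approximation)
mean, std = scores.mean(), scores.std()
print(f"accuracy: {mean:.3f} +/- {1.96 * std:.3f}")
```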
Evaluating unsupervised learning is understandably more difficult since you don't have ground truth to compare your model against. For clustering, many common measures quantify how cohesive your clusters are internally and how well separated they are from each other (e.g. the silhouette coefficient). The silhouette score is usually not a good metric for high-dimensional data. Look for an "elbow," where the slope of the silhouette plot is steep and then flattens out. In the end, you should seek an unsupervised model that you can interpret well: you may have a very high silhouette coefficient, but if it does not translate to your final application, you may want to choose a different model.
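As an illustration, here is a minimal sketch that scans candidate cluster counts with k-means on synthetic data and reports the silhouette coefficient for each; in practice you would pair this with an interpretation of the resulting clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compare the silhouette coefficient across candidate cluster counts
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```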
A final, oft-forgotten measurement is the amount of compute it takes to train and run your model. If you're building a tool for end users that requires fast delivery of results or deployment on a local machine, you may have to sacrifice some degree of accuracy for speed and memory. For more details, check out the computation and memory tradeoff in this blog post.
Learn more
- Towards Data Science Guides to EDA
- Kaggle EDA Notebooks
- Plotting with Seaborn (a popular data visualization library)
- Titanic EDA through to Prediction (Titanic is the most popular beginner classification data set used on Kaggle)
- Metrics
- More basic information on metrics
- “20 Popular Machine Learning Metrics. Part 1: Classification & Regression Evaluation Metrics” (Toward Data Science)
- “How To Evaluate Unsupervised Learning Models” (Toward Data Science)