Recipe 1: Defining The Problem



What is the business value of doing this?

If someone in your organization says “we should use machine learning to do [X]!” You should ask, “why?” and keep asking “why?” Machine learning is a means, not an end. Ask “Why are we solving this problem in the first place?” What is the underlying user pain point that we are trying to solve? Does it add value to the organization? Is it the most valuable thing to focus on now or are there other things that need to be solved first? How is it going to make a difference? Has this problem already been solved (or tried to be solved) by someone else? Don’t solve fake problems for the sake of doing something. That’s movement without progress. Also, check to see if your problem has been solved before either outside of DoD—with a publicly-available model/walk-through—or inside DoD by another unit. No need to replicate work!



Do I just need machine learning? Or just visualization and metrics capture?

Some problems first need to be better understood from a quantitative perspective before leaping to advanced analytics as the answer. This should be seen as a distinct step from machine learning because what you learn in this step in data discovery and visualization can greatly influence the machine learning approach to the problem. See the “Problem Exploration” and “Storytelling with Data” recipes for more on this.



Can I use rules instead?

Having a problem worth solving is not the same as having a problem that you need machine learning to solve. If your problem can be solved by a manageable set of rules, heuristics, or a few filters in a spreadsheet, don’t overcomplicate the problem with machine learning.



How should I think about bias?

Machine learning models are not a panacea that always yield “objective” results. The quality of machine learning models is limited by the data that they are fed. In other words, putting junk in gets junk out. Biased data is data that is fundamentally skewed in some way. It differs from reality and produces models that reinforce the bias embedded in the data. If someone says they can make an algorithm that is “unbiased,” they are either intentionally misleading you or (possibly worse and more dangerous) they don’t understand the data for the model they are building.

Bias exists in historical data, unbalanced data sets, and even problem framings. The gap between the data and reality can impact concrete conclusions such as troop allocations or imagery analysis. It can also have significant intangible effects on underrepresented demographic classes: Facial analysis, recidivism predictors, and mortgage algorithms have all been shown to have bias against black people. Hiring algorithms have been shown to have bias against women. Be careful in considering your use cases, your data, and your implementation. Do you have the proper data to solve the problem? Is your solution fair? What’s the potential downside of a bad recommendation?

Trying to apply voodo-magic (i.e., black box methods which people don’t understand) to get information from data that, fundamentally, doesn’t have it leads to bias. Such an example is the unfortunate application of AI that can “read your personality from a selfie”. Even though some aspects of your personality might show up in a selfie, one image does not contain enough information to relay entire personality traits. Instead, models trained on such data will tend to overfit and pick out obscure characteristics (e.g., race, hair color, backgrounds) as predictors that will propagate negative biases.



Does my model need to be explainable?

There’s often an interpretability—accuracy trade-off. Complex models, particularly deep learning neural networks, are often more opaque. Simpler models, like linear regression or decision trees, may be more transparent and allow you to inspect what input data affects the model. If you’re making a decision that’s not life altering—like predicting the price of a house—you may not care if you can understand how the number of windows affects the home price prediction. If your deep neural net is opaque but better at predicting home prices, you may not care. But if you’re making a very important decision that affects a human being’s life—like whether or not someone should receive a loan or an organ donation—you should probably not have a black box, even if it’s more accurate, (or maybe not use a machine learning model at all).



Do I have data?

No data means no machine learning. If it’s too hard to acquire usable data, then no amount of machine learning models will help. If this is the case, the problem that you and your team should solve is generating usable data in daily processes, going for low-hanging fruit like analyzing existing data. Then make a note on your calendar for sometime in the future to return and then start training models on that data that you set yourself up for success on.



Okay, so what can machine learning do?



Process

Listen to users, understand their problems and data, build quick prototypes, get user feedback, and iterate. Having an understanding of agile / scrum / design thinking can be helpful, but beware of the “innovation theater” of agile / scrum / human centered design for the sake of design and problem admiration instead of identifying problems to build solutions to solve those problems. Your end goal is always to add value for users.



Further resources



» Next recipe


» About
» Ingredients
» All recipes
» Resources
» Code examples