Creating A.I. Models Is An Exercise In Understanding Your Business
Knowing how to create relevant training datasets requires you to truly understand what matters
In the past few years, creating a new predictive model has become extremely easy. There are a ton of tools out there that let you easily take in a dataset and spit out a trained model. Training A.I. is the easy part (let’s ignore generative A.I. for now).
It’s commonly said that the most time-consuming part of a data scientist’s job is “data munging”: the aggregation, transformation, and cleaning of data that will eventually be used for model training and evaluation. While that’s true, I generally find that part of the job to be straightforward to implement.
One of the most important parts of creating a predictive algorithm is deciding the structure and content of the data that you use to train your models. Some would say this is feature engineering, and while I don’t disagree, I find that name to be overly limiting.
When I think of feature engineering, I think of mathematically combining or transforming known data points in ways that could help a model more accurately make predictions.
Rather, what I’m referring to is conceptualizing and creatively envisioning the data that makes the biggest impact on the metric you’re trying to predict.
An example
This sounds a bit vague, so let me provide an example.
Let’s say you’re a data scientist at a software company that sells to enterprises and you’ve been asked to create a model to predict which of your existing customers are most likely to spend more with you. The output of your model will be provided to your sales and marketing team, who will focus their efforts on the top 10% of accounts as ranked by your model’s predictions.
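To make the hand-off to sales and marketing concrete, here is a minimal sketch of turning model scores into that top-10% list. The account IDs and scores are made up, and `top_accounts` is a hypothetical helper name, not part of any particular library.

```python
import math

def top_accounts(scores: dict[str, float], fraction: float = 0.10) -> list[str]:
    """Return the account IDs in the top `fraction` by predicted score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    # Round before taking the ceiling so float noise (e.g. 30 * 0.1)
    # doesn't bump the count up by one; always return at least one account.
    k = max(1, math.ceil(round(len(scores) * fraction, 9)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical model output: account ID -> predicted likelihood of expansion.
predicted = {f"acct_{i:02d}": i / 20 for i in range(20)}
print(top_accounts(predicted))
```

In practice the cutoff would probably depend on how many accounts the sales team can realistically work, rather than on a fixed percentage.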
You as a data scientist now need to come up with a training dataset that is highly likely to predict future customer spend. In order to do so, you’ll need a deep understanding of why your customers are using your product, how big your customers are, what percentage of their employees are using your product, where there are opportunities for improvement, and how you can quantify all of these opportunities.
You’ll need to deeply understand your product, the value it delivers to customers, your internal sales and marketing processes, and the broader problem area that your product is solving for your customers. To do this effectively, you as a data scientist need to be extremely well-informed about all aspects of your business.
Data scientists who create the best, most predictive models tend to be the ones who best “understand the data,” and in many cases, they understand why that data is helpful in the first place.
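As a rough illustration of quantifying those business questions, the sketch below turns two of them (how big the customer is, and what share of their employees use the product) into candidate features. The `Account` fields and feature names are assumptions for the example; your own data will look different.

```python
from dataclasses import dataclass

@dataclass
class Account:
    account_id: str
    employee_count: int  # how big the customer is
    active_users: int    # employees currently using the product

def candidate_features(acct: Account) -> dict[str, float]:
    """Turn business questions into numeric candidate features."""
    if acct.employee_count <= 0:
        penetration = 0.0  # unknown company size: treat as no signal
    else:
        # Share of the customer's employees using the product, capped at 100%.
        penetration = min(acct.active_users / acct.employee_count, 1.0)
    # Employees not yet using the product: a crude proxy for room to grow.
    headroom = max(acct.employee_count - acct.active_users, 0)
    return {
        "employee_count": float(acct.employee_count),
        "penetration": penetration,
        "headroom": float(headroom),
    }

# Made-up example account.
example = Account("acct_01", employee_count=2000, active_users=150)
print(candidate_features(example))
```

The code itself is trivial. The hard part is knowing that penetration and headroom are worth computing at all, and that knowledge comes from understanding the business.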
The most creative part of building a predictive model is in deciding what data is most helpful to include. There are no guides to help you. There are no quantitative metrics to help you understand the impact of data before you even think to include it. You, as a data scientist, need to come up with a hypothesis of what works.
Once you have a list of ideas for potential features, you can work out which of those datapoints actually have predictive power, and then optimize them with feature engineering.
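One quick, admittedly crude way to check candidate features for predictive power is to rank them by how strongly they correlate with the label. The sketch below uses plain Pearson correlation, which only picks up linear relationships, so treat it as a first pass rather than a verdict. The feature names and numbers are made up for illustration.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(features: dict[str, list[float]],
                  label: list[float]) -> list[tuple[str, float]]:
    """Sort features by absolute correlation with the label, strongest first."""
    scored = [(name, pearson(values, label)) for name, values in features.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up toy data: one row per account.
features = {
    "headroom": [10, 200, 50, 400, 30, 350],
    "account_age_months": [12, 30, 7, 18, 25, 9],
}
spend_increase = [1, 20, 6, 41, 2, 33]

for name, r in rank_features(features, spend_increase):
    print(f"{name}: {r:+.2f}")
```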
But in order to even get there, you need to start with an understanding of the business.
My process for creating a new predictive model
It’s probably no surprise, then, that whenever I start creating a new model, the first thing I do is decide the objective (frequently I’ll start with the business or product objective before I even think about the label). Next, I come up with a list of potential features. Then I start analyzing how long it will take to get those features into an organized structure. Large-scale companies will frequently have feature stores: structured data stores of common features that make it quick and easy to look up feature values.
In order to come up with a list of potential features, I’ll frequently chat with a lot of stakeholders, and I’ll also do my own research to understand what could potentially affect the objective metric. It’s important to gain a more comprehensive understanding of all of the areas that could potentially affect my objective metric before I jump in.
Once I have a list of potential features, and if I have time (frequently not the case for smaller businesses and startups), I’ll do an exploratory data analysis (EDA) to better understand the data and see if I should be editing, removing, or transforming any of it.
Then I’ll try to build a basic model as quickly as possible and analyze the performance on validation data to see if it’s worth putting any more resources into the model.
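Here is a minimal sketch of what that "quick basic model" check can look like: fit the simplest possible model (a single-threshold rule on one feature), then compare its validation accuracy against always predicting the majority class. The data is made up, and the one-feature rule is a stand-in for whatever fast model you would actually reach for.

```python
def accuracy(preds: list[int], labels: list[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fit_threshold(xs: list[float], ys: list[int]) -> float:
    """Pick the threshold that best separates the training labels
    when predicting 1 for x >= threshold."""
    best_t, best_acc = xs[0], -1.0
    for t in sorted(set(xs)):
        acc = accuracy([int(x >= t) for x in xs], ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Made-up toy data: one feature, label 1 = customer expanded.
train_x = [5, 12, 40, 55, 8, 60, 35, 3]
train_y = [0, 0, 1, 1, 0, 1, 1, 0]
valid_x = [10, 50, 45, 2, 30, 20]
valid_y = [0, 1, 1, 0, 1, 0]

threshold = fit_threshold(train_x, train_y)
model_acc = accuracy([int(x >= threshold) for x in valid_x], valid_y)

# Naive baseline: always predict the training set's majority class.
majority = int(sum(train_y) * 2 >= len(train_y))
baseline_acc = accuracy([majority] * len(valid_y), valid_y)

print(f"model: {model_acc:.2f}, baseline: {baseline_acc:.2f}")
```

If the quick model can’t clearly beat the naive baseline on validation data, that’s usually a sign to revisit the features before investing in anything more sophisticated.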
Hopefully I’ll have an accurate model that I can then work on optimizing. If not, I’ll spend more time trying to understand what datapoints I may need to add to predict the objective metric more accurately. Once it’s all done, I’ll work to productionize the model (a beast of a topic itself) and then move on to monitoring and tweaking it.