What does it take to build a churn prediction model?

Customer retention is one of the most important aspects of business. Gaining new customers is not only much harder than retaining existing ones, it is also far more costly. With this in mind, we want to show you, in layman's terms, the steps and tools needed to create a basic churn prediction model.

Steps for creating a churn prediction model for your data team:

1. Data Preprocessing / Data Preparation

As its name suggests, data preprocessing refers to the transformations applied to data before it is fed to a machine learning (ML) algorithm; in short, it prepares the data for analysis. Raw data from different sources cannot be used directly, since it is often incomplete and inconsistent. For instance, if the same customer attribute is stored in abbreviated form in some records and in full in others, it is inconsistent and unreliable as input to an ML model. The same applies if your CRM stores client scores as numbers while your marketing automation tool stores them as words. To find discrepancies and to clean, organize and prepare your data, you can use tools such as Spark or Jupyter.

Data preprocessing is therefore the first, and an essential, step in building a churn prediction model. That is, if you don’t want to fail before you even start, of course.
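
To make the two inconsistencies described above concrete, here is a minimal sketch using the Pandas library. The column names, the sample records and the two lookup tables are all invented for illustration; real data will need its own mappings.

```python
import pandas as pd

# Hypothetical customer records merged from two systems; every column name
# and value here is invented for illustration.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["US", "United States", "DE", "Germany"],
    "score": ["3", "three", "5", None],
})

# Assumed lookup tables: abbreviations to full names, score words to digits.
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}
SCORE_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with one spelling per country and numeric scores."""
    out = df.copy()
    out["country"] = out["country"].replace(COUNTRY_NAMES)
    # Swap score words for digits, leave everything else untouched ...
    scores = out["score"].map(lambda value: SCORE_WORDS.get(value, value))
    # ... then convert to numbers; anything unparseable becomes NaN.
    out["score"] = pd.to_numeric(scores, errors="coerce")
    return out


cleaned = clean(raw)
print(cleaned)
```

Missing scores are deliberately left as NaN here, so that the decision to drop or fill them stays explicit and is made later.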

2. Exploratory Data Analysis (EDA)

The second step is Exploratory Data Analysis (EDA): investigating the data before you run a machine learning model. Your data is now clean, but you still need to make sense of it and gain some insights before you actually throw it in the ‘cooking pot’. Discovering patterns and testing hypotheses before running an ML model is important, so whenever you first encounter a data set, you need to perform EDA. It is usually done through numerical summaries or graphical representations.

Although EDA is the first thing you do with a newly cleaned data set, it is not the last time you will do it. Iteration is key. Just as you have to keep testing your beliefs about a topic of interest to see if they still hold, you have to repeat EDA with each new addition of data. Common tools for EDA are Jupyter and the Pandas library.
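
As a rough illustration of the numerical side of EDA, the sketch below summarizes a tiny, made-up customer table with Pandas and checks one simple hypothesis. The columns (`tenure_months`, `plan`, `churned`) and their values are assumptions, not data from a real project.

```python
import pandas as pd

# Hypothetical cleaned customer table; columns and values are invented.
customers = pd.DataFrame({
    "tenure_months": [2, 30, 5, 48, 12, 1],
    "plan": ["basic", "premium", "basic", "premium", "basic", "basic"],
    "churned": [1, 0, 1, 0, 0, 1],
})

# Numerical summary: count, mean, spread and quartiles per numeric column.
summary = customers.describe()

# Overall share of customers who churned.
churn_rate = customers["churned"].mean()

# Hypothesis check: do customers on the basic plan churn more often?
churn_by_plan = customers.groupby("plan")["churned"].mean()

print(summary)
print(f"Overall churn rate: {churn_rate:.2f}")
print(churn_by_plan)
```

Rerunning the same few lines whenever new data arrives is a cheap way to see whether earlier conclusions still hold.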

3. Model Development and Training

Reaching the third step means you can finally feed the model! For most data scientists, the job actually starts at this step. However, in smaller data teams, where colleagues regularly help each other out and wear multiple hats, the previous two steps are usually also done by data scientists.

What they need to do is find the right model and train the pipeline. There are a number of algorithms you can use (e.g. Logistic Regression, or tree-based models such as Random Forest), but keep in mind that whichever machine learning algorithm you choose, you always need to train it and evaluate it. This can easily be done in Jupyter.
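
The train-then-evaluate loop can be sketched as follows. scikit-learn is our choice for the example, not something the steps above require. The data is synthetic and the rule that generates the churn label is invented, so the scores say nothing about real customers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real customer features; the rule that generates
# the churn label is invented purely for illustration.
rng = np.random.default_rng(seed=0)
tenure = rng.integers(1, 60, size=500)
support_tickets = rng.integers(0, 10, size=500)
X = np.column_stack([tenure, support_tickets])
y = ((support_tickets > 5) & (tenure < 24)).astype(int)

# Hold back a quarter of the data so evaluation uses unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # train
    scores[name] = model.score(X_test, y_test)  # evaluate (accuracy)
    print(f"{name}: accuracy on held-out data = {scores[name]:.3f}")
```

Accuracy is used only because it is the default `score` for these classifiers; with few churners in the data, precision and recall are usually more informative.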

4. Model Testing

What does testing the model actually refer to? What do data teams have to test? They need to test the quality of the data, the features and, of course, the ML algorithm. This step can be performed in Jupyter by writing code that defines and monitors the model's quality metrics.

Based on these performance metrics, you can decide which model, with which set of parameters, performed best on the test data. You can then deploy that model to production using a model server that exposes a REST API on top of the model.
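
One way to compare candidates on test data is to compute precision, recall and F1 for each and keep the best. The sketch below does this in plain Python; the test labels, the two sets of predictions and the choice of F1 as the deciding metric are all assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class, labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical test labels (1 = churned) and predictions of two candidates.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 1, 1],  # flags every customer as a churner
}

results = {
    name: precision_recall_f1(y_test, y_pred)
    for name, y_pred in predictions.items()
}
# Pick the candidate with the highest F1 (index 2 of each result tuple).
best = max(results, key=lambda name: results[name][2])

for name, (precision, recall, f1) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("best model:", best)
```

The second candidate catches every churner but only by flagging everyone, which is why looking at recall alone would be misleading.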

5. Model Interpretability

Time to see how the model is making decisions and whether any features are not working properly or should be excluded. Interpretability is the ability to explain, even to a six-year-old, why a decision was made.

Why is this step so important? In some cases, knowing the ‘why’ behind the predictions is not really relevant. More often than not, though, knowing how a prediction was made helps you understand the problem, the data and the reason why a model might fail on new data (for example, because of overfitting). That is why data scientists have been focusing on developing solutions and methods to make model interpretability easier to grasp.
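
Permutation importance is one of several available interpretability techniques; we picked it for this sketch because scikit-learn ships it. It shuffles one feature at a time and measures how much the score on held-out data drops. The data is synthetic: `support_tickets` drives the invented churn label, while `random_noise` is deliberately unrelated to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: churn depends only on support_tickets (invented rule).
rng = np.random.default_rng(seed=0)
support_tickets = rng.integers(0, 10, size=400)
random_noise = rng.normal(size=400)  # deliberately unrelated to churn
X = np.column_stack([support_tickets, random_noise])
y = (support_tickets > 5).astype(int)
feature_names = ["support_tickets", "random_noise"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and average the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = dict(zip(feature_names, result.importances_mean))

for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: mean score drop = {value:.3f}")
```

A feature whose shuffling barely changes the score, like the noise column here, is a candidate for exclusion.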

6. Model Deployment

Deploying a model in production is only the start of a journey of tuning it to perform well on production data. Only rarely does a model developed on training data perform well in production from the get-go. Usually, fine-tuning the model takes around 6 months before one can confidently use it to improve decisions and automate tasks. During this period, the data team has to keep track of the model's performance metrics, be able to retrain the model easily and deploy new versions into production, all while maintaining traceability of the model.
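
As a small sketch of what keeping track of performance metrics could look like, the class below stores the scores observed for one deployed model version and flags when retraining seems due. The class name, the version label, the 0.05 tolerance and the three-measurement window are all assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    """One deployed model version plus the scores observed in production."""
    version: str                  # label kept for traceability
    baseline_f1: float            # F1 measured on test data before deployment
    production_f1: list = field(default_factory=list)

    def record(self, f1: float) -> None:
        """Store one new production measurement."""
        self.production_f1.append(f1)

    def needs_retraining(self, tolerance: float = 0.05, window: int = 3) -> bool:
        """True once the mean of the last `window` scores drops more than
        `tolerance` below the baseline."""
        if len(self.production_f1) < window:
            return False
        recent = self.production_f1[-window:]
        return sum(recent) / window < self.baseline_f1 - tolerance


# Hypothetical weekly F1 scores for a deployed model.
current = ModelVersion(version="churn-v1", baseline_f1=0.80)
for weekly_f1 in [0.79, 0.78, 0.74, 0.72, 0.70]:
    current.record(weekly_f1)

print(current.version, "needs retraining:", current.needs_retraining())
```

Averaging over a short window rather than reacting to a single bad week keeps one noisy measurement from triggering a retrain.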


As we can see, setting up a churn prediction model is a lot of work, sometimes done by multiple professionals, so it feels only natural to want to reuse the model rather than reinvent the wheel each time, doesn’t it? However, most data teams simply start from scratch when they need a new model. Let’s say you and your team went through all the hard work, crossed all your t’s and dotted all your i’s, tested it again and again, and it finally worked! Great job! Now you can do it all over again for another range of products or another client segment. Congratulations! You are the victim of your own success.

That simply seems like a waste of time to us.

What if, after you create a tested churn prediction model, it could be reused by other teams or departments? What if it didn’t even matter whether they work in different regions or in different clouds, because the platform is multi-cloud and supports AWS and GCP as well as on-premises deployments? Interconnectivity and collaboration at their best.

Moreover, what if you could create a workflow that runs the churn prediction model periodically? For instance: new data each Friday? No problem, simply automate model training for new data and set it to retrain every Friday. It’s as simple as that.
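
Outside of any particular platform, the scheduling part of such a workflow comes down to working out when the next run is due. Here is a minimal sketch in plain Python; the function name is hypothetical and the actual retraining call is left out.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts from Monday = 0, so Friday is 4


def next_retrain_date(today: date) -> date:
    """Return the first Friday on or after `today`."""
    days_ahead = (FRIDAY - today.weekday()) % 7
    return today + timedelta(days=days_ahead)


# 1 January 2024 was a Monday, so the next run falls on 5 January.
print(next_retrain_date(date(2024, 1, 1)))
```

In practice a scheduler would trigger the job; with cron, for example, the day-of-week field `5` stands for Friday.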

Now, imagine you could do all this in one single platform. Sounds surreal? It isn’t. Lentiq offers a platform where all these things are automated, so teams can focus on optimizing models rather than on DevOps.

Performing churn prediction is a must in this day and age, and it should be easier to accomplish. We, at Lentiq, have put together in one platform-as-a-service all the tools you need, plus the infrastructure management layer, so that data teams can focus on the scientific part without having to worry about data integrity, processing power or compatibility among different apps. On top of that, Lentiq makes all models future-proof by enabling users to transform them into reusable workflow tasks that can be shared, adapted, or reused. It’s the copy and paste of data science projects.

