The objective of the series:
One of the benefits of being a data scientist is that there are so many great resources available to learn from. Take linear algebra, for example. As one of the building blocks of machine learning, it’s a subject every data scientist should know about. Fortunately, there is a plethora of resources available online for free. Just watch Professor Gilbert Strang’s MIT lectures, and you’ll know more about linear algebra than you’ll ever need.
However, while technical resources can be found in abundance, there’s very little focus on data science frameworks. How could there even be any? Data science is still in its infancy, and people have yet to define a set of universal rules and best practices that all professionals can adhere to. As a result, I thought it would be a good idea to create a project from start to finish to illustrate how to think about data science projects in general, how to structure them and how to think about their role from a business perspective.
The use case I’ll be working on is predicting customer churn for a subscription-based music streaming service. While I would love to claim ownership of the idea, I have to confess that I’ll be basing my work on Will Koehrsen’s wonderful series on how to create value using machine learning.
Along with the articles, I’ll also post all of my code on GitHub so you’ll be able to follow along and run the code yourself. I’m using Python and Spark and intend to make the project accessible for anyone with a little bit of Python experience. To run the code, I’m using Google Cloud in combination with our new tool called Lentiq. If you’ve never worked in a cloud environment, then this project is perfect for you, as you’ll see just how easy it is to provision and run a Jupyter Notebook in the cloud when using Lentiq EdgeLake. All of the tools used are available for free. However, you will have to provide your credit or debit card details. Don’t worry though, as Google gives you $300 worth of credits, which you almost certainly won’t use up. Find out how to use Lentiq EdgeLake with Google Cloud.
What is churn prediction and how to think about it in a business context?
A. Why business acumen is an important skill to have:
As explained in a recent article by Scott Berinato, successful data science teams are built upon a carefully chosen mix of talents. And one of the most important talents a data science team must have is knowledge of the business and the strategy. Every team needs a person who can inform project design and data analysis and keep the team focused on business outcomes, not just on building models. The aim of the work, after all, is to bring value to the business. Unfortunately, many companies overlook how important this requirement is and fail to create a bridge between analytics and the enterprise. As a result, even well-run operations that generate strong analysis fail to capitalize on their insights.
As Scott perfectly put it, efforts often fall short on the last mile, when it comes time to explain results to decision makers. According to a recent Kaggle survey of data scientists, four of the top seven “barriers faced at work” were related to last mile issues, not technical ones:
- lack of management/financial support
- lack of clear questions to answer
- results not used by decision makers
- explaining data science to others
Although the perfect solution would be to rethink how data science teams are built, there are some steps that you, as a data scientist, can take as well. I would argue that three of the four last mile issues listed above can be tackled with appropriate business insights. To understand how, let’s have a look at our churn prediction example.
B. Churn prediction from a business perspective:
Churn, also called attrition, is a measure of the number of individuals or items moving out of a collective group over a specific timeframe. In a business context, it is the number of customers that stopped using a company's product or service during a certain period.
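In practice, churn is usually reported as a rate rather than a raw count, so that periods and companies of different sizes can be compared. The snippet below is a minimal sketch of that calculation; the function name and the subscriber numbers are made up for illustration and are not taken from the project’s code.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of the customers present at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical month: 2,000 subscribers on the 1st, 100 of them cancel by month end.
monthly_rate = churn_rate(customers_at_start=2000, customers_lost=100)
print(f"Monthly churn rate: {monthly_rate:.1%}")
```

Note that this version only counts customers who were already present at the start of the period. How to treat customers who join and leave within the same period is a definition you would need to agree on with the business.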
While it is easy to see why a subscription-based service might be interested in customers not using the platform anymore, retention is an issue that all companies are preoccupied with. Customers leaving is always bad for business as it costs more to acquire a new client than to retain an existing one. An increase in customer retention of just 5% can create at least a 25% increase in profit. This is because returning customers will likely spend much more on your company's products and services.
As a result, retaining existing customers is unquestionably critical. However, when measuring churn, it is important to look at specific segments individually and not just the entire customer base. For example, churn on new customers is always higher than it is for customers who have already proven to be loyal and who have come back to do business with you more than once.
To better understand this concept, let’s look at an example. Let’s say you are the data scientist at a small retail start-up and your CEO tells you to look into customer churn. You do some quick analysis, and you realize that your churn rate is 15% compared to your competitor’s 5%. While this sounds terrible, if you look at different lifetime segments, you’ll realize that within each segment you have the same churn rate as your competitor, but you have many more new customers (see Table 1 below).
This finding allows you to offer some valuable insights on how the output of your model should be used. Instead of focusing efforts on all customers that have a high probability to churn, it would be much more efficient to focus on recent clients.
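To make the segment effect concrete, here is a small sketch in plain Python. The customer counts are hypothetical and were chosen only so that the overall rates match the 15% and 5% figures from the example; they are not real data and the segment names are assumptions.

```python
# Illustrative numbers only: (customers at start of period, customers who churned).
# They are chosen to reproduce the 15% vs. 5% headline rates, not taken from real data.
segments = {
    "your company": {"new": (600, 138), "established": (400, 12)},
    "competitor": {"new": (100, 23), "established": (900, 27)},
}


def churn_by_segment(company_segments):
    """Return the churn rate of each segment plus the overall rate."""
    rates = {name: lost / total for name, (total, lost) in company_segments.items()}
    total_customers = sum(total for total, _ in company_segments.values())
    total_lost = sum(lost for _, lost in company_segments.values())
    rates["overall"] = total_lost / total_customers
    return rates


for company, company_segments in segments.items():
    rates = churn_by_segment(company_segments)
    summary = ", ".join(f"{name}: {rate:.0%}" for name, rate in rates.items())
    print(f"{company} -> {summary}")
```

With these numbers, both companies lose new and established customers at the same rates. The overall rates differ only because 60% of your customers are new, compared with 10% of the competitor’s.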
Another important concept is that not all churn is equally bad. All companies have unprofitable customers and losing an unprofitable customer is not nearly as bad as losing one of your best ones. As a result, it’s very important to look at churn by different customer value segments as well.
Continuing with the previous example, you might notice that valuable customers, those who drive a lot of the company’s revenue, have a higher churn rate than the rest of the clients (see Table 2 below). As a result, it would make much more sense to focus retention efforts on these clients.
If you can calculate the lifetime value of a customer, you can also help the marketing team decide how much money to spend on retaining each client. It just wouldn’t be wise to spend more money on retaining a customer than that customer’s lifetime value to the business.
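One common back-of-the-envelope way to estimate lifetime value for a subscription business is to multiply the monthly margin by the expected customer lifetime. The sketch below assumes a constant monthly churn rate, which makes the expected lifetime 1 / churn rate months, and ignores discounting. Both function names and all figures are hypothetical, and a real project would likely need a more careful model.

```python
def simple_lifetime_value(monthly_margin: float, monthly_churn_rate: float) -> float:
    """Rough lifetime value: monthly margin times expected lifetime in months.

    Assumes a constant monthly churn rate, so the expected lifetime is
    1 / monthly_churn_rate months. No discounting is applied.
    """
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return monthly_margin / monthly_churn_rate


def worth_retaining(retention_cost: float, lifetime_value: float) -> bool:
    """A retention offer only makes sense if it costs less than the value it protects."""
    return retention_cost < lifetime_value


# Hypothetical subscriber: 4.00 margin per month, 5% monthly churn.
ltv = simple_lifetime_value(monthly_margin=4.0, monthly_churn_rate=0.05)
print(f"Estimated lifetime value: {ltv:.2f}")
print("Offer costing 30 is worthwhile:", worth_retaining(30.0, ltv))
print("Offer costing 120 is worthwhile:", worth_retaining(120.0, ltv))
```

The comparison in `worth_retaining` is deliberately simple. It treats the lifetime value as a hard upper bound on retention spend, which is the rule of thumb described above.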
Knowing these details won’t affect the machine learning part of the project, but they can offer additional context on why predicting churn is important. You could use them in a presentation to the CEO to help ensure they understand how a churn prediction model can impact the business. After all, being an effective data scientist is less about building the most complex models and more about delivering real added value to your company.
Stay tuned for the second part of the series.
Andras Palfi, Data Scientist at Lentiq, has a passion for solving real-world business problems using data and machine learning.