
Customer Churn Prediction with Text and Interpretability

Predicting if and understanding why customers want to leave.

Notes from Industry

By Daniel Herkert, Tyler Mullenbach

The code repository accompanying this blog post can be found here.

Customer churn, the loss of current customers, is a problem faced by a wide range of companies. When trying to retain customers, it is in a company’s best interest to focus its efforts on those who are most likely to leave, but the company needs a way to detect such customers before they have made up their minds. Users prone to churn often leave clues to their disposition in user behavior and customer support chat logs, which can be detected and understood using Natural Language Processing (NLP) tools.

Here, we demonstrate how to build a churn prediction model that leverages both text and structured (numerical and categorical) data, an architecture we call bi-modal. We use Amazon SageMaker to prepare, build, and train the model. Detecting customers who are likely to churn is only part of the battle; finding the root cause is essential to actually solving the issue. Since we are interested not only in the likelihood of a customer churning but also in the driving factors, we complement the prediction model with an analysis of feature importance for both text and non-text inputs.

This solution centers on Amazon SageMaker, which we use to prepare the data, train the churn prediction model, and evaluate and interpret the trained model. SageMaker also stores the training data and model artifacts, while Amazon CloudWatch logs the data preparation and model training outputs (Fig. 1).

Figure 1: Architectural Diagram including AWS services [Image by Author]

State-of-the-art natural language models are harder to interpret than simpler models such as linear regression, and these interpretability issues can impede business adoption despite their top-of-the-line performance. In this post, we demonstrate some methods for extracting understanding from NLP models. We use the BERT sentence encoder [2][3] to process the text inputs, and we provide a way to attribute the model predictions to the input features. While there are different approaches to interpreting language models, we choose an ablation analysis of a subset of relevant keywords, which scales easily to the entire dataset and hence provides a global interpretation of the language model’s predictions.

Exploring and Preparing Data

To mimic a churn dataset containing both categorical and text data, we use the Kaggle Customer Churn Prediction 2020 dataset for the structured data and combine it with a synthetic text dataset created using GPT-2 [1]. The dataset comprises 21 columns, including categorical features (State, International Plan, VoiceMail Plan), numerical features (Account Length, Area Code, etc.), and one text column holding the chat logs between customer and agent generated with GPT-2.

The following image shows an excerpt of the data.

To prepare the data for modeling, we use one-hot encoding to transform the categorical feature values into numeric form and impute missing numerical feature values with their corresponding means. Since this post’s focus is on predicting with and interpreting language models, we won’t spend more time on exploration or feature engineering of the categorical and numerical features. Rather, we will focus on the customer-agent interactions.
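Here is a minimal scikit-learn sketch of this preprocessing step; the column names are assumptions based on the Kaggle dataset, and the accompanying repository may implement this differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")  # assumed file name for the Kaggle churn data

# Assumed column names; the dataset and the repository may use different ones.
categorical_cols = ["state", "international_plan", "voice_mail_plan"]
numerical_cols = ["account_length", "area_code", "number_vmail_messages",
                  "number_customer_service_calls"]

preprocessor = ColumnTransformer(transformers=[
    # One-hot encode categorical features; ignore categories unseen during training
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Impute missing numerical values with the column mean and add indicator columns
    ("num", SimpleImputer(strategy="mean", add_indicator=True), numerical_cols),
])

X_structured = preprocessor.fit_transform(df[categorical_cols + numerical_cols])
```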

As previously mentioned, the chat logs were generated with GPT-2 using a sample set of manually created customer-agent conversations. Here is an excerpt from a customer-agent conversation generated by GPT-2.

While the GPT-2-generated conversations have less breadth than actual conversations and (as seen above) sometimes fail to make perfect sense, we believe that, in the absence of a public customer-agent dataset, this generated data is a reasonable way to obtain a large customer-agent interaction dataset concerning churn.

We prepare the textual features on Amazon SageMaker by transforming each chat log into a vector representation using a pre-trained Sentence-BERT encoder (SBERT) from the Hugging Face models repository [4]. The Hugging Face repository provides open-source, pretrained natural language models that can be used, as in our case, to encode text without any further training of the model. SBERT is a modification of the pre-trained BERT network that uses the following network architecture to derive semantically meaningful sentence embeddings.

A pair of sentences is encoded using BERT, each sentence independently of the other, before a pooling operation is applied to generate a fixed-size sentence embedding per input sentence. As part of this siamese network structure, BERT is fine-tuned by updating the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity.

SBERT is trained using a combination of objectives, including classification and regression. For the objective of classifying the relation between two sentences (Fig. 2a), the sentence embeddings u and v are concatenated together with their element-wise difference |u − v|, multiplied with trainable weights, and passed into a softmax classification layer. For the regression objective (Fig. 2b), the cosine similarity between the two sentence embeddings is calculated.
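In the notation of the SBERT paper [3], with sentence embeddings u and v, the two objectives can be sketched as:

```latex
% Classification objective (Fig. 2a): concatenate u, v, and their element-wise
% difference, multiply with trainable weights W_t, train with cross-entropy loss.
o = \mathrm{softmax}\left( W_t \, [\, u ;\; v ;\; |u - v| \,] \right)

% Regression objective (Fig. 2b): cosine similarity between u and v,
% trained with a mean-squared-error loss.
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```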

Figure 2: Example of SBERT architecture with (a) classification objective function and (b) regression objective function [Image by Author]

The benefit of using SBERT over other embedding techniques (such as InferSent, Universal Sentence Encoder) is that it is more efficient and achieves better results in most semantic similarity tasks [3].

Since BERT is built to encode word pieces, there is little to no preprocessing required for our text data. We can directly transform each chat log into a 768-dimensional semantically meaningful embedding vector.
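A minimal sketch of this encoding step with the sentence-transformers library follows; the specific model name is an assumption, as the post only specifies a pre-trained SBERT encoder from the Hugging Face repository.

```python
from sentence_transformers import SentenceTransformer

# Assumed model name; any pre-trained SBERT model with 768-dimensional output works here.
encoder = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")

chat_logs = df["chat_log"].tolist()          # assumed name of the text column
text_embeddings = encoder.encode(chat_logs)  # numpy array of shape (num_customers, 768)
```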

Applying the aforementioned preprocessing steps to all categorical, numerical, and textual features leaves the data encoded in numeric form, ready to be processed by our neural network.

Creating a bi-modal ML model

The network architecture consists of three fully-connected layers and a Sigmoid activation function for binary classification (churn/no churn). First, the encoded categorical/numerical data is fed into a fully-connected layer before it is concatenated with the encoded textual data. The concatenated data is then fed into a second and third fully-connected layer before applying the Sigmoid activation for binary classification. The first fully-connected layer serves to reduce the dimensionality of the sparse categorical/numerical input data (see more details below). The second and third fully-connected layers serve as decoders of the encoded data in order to classify the inputs into churn/no churn. We call the architecture bi-modal because it takes both structured and unstructured data, i.e., categorical/numerical and textual data, as inputs in order to generate predictions.
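A minimal PyTorch sketch of this bi-modal architecture is shown below; the hidden-layer sizes are illustrative assumptions rather than the exact values used in the repository.

```python
import torch
import torch.nn as nn

class BiModalChurnModel(nn.Module):
    """Churn classifier combining structured features with an SBERT text embedding."""

    def __init__(self, num_structured_features, text_dim=768, hidden_dim=64):
        super().__init__()
        # First fully-connected layer: compresses the sparse structured input
        self.structured_fc = nn.Linear(num_structured_features, hidden_dim)
        # Second and third fully-connected layers: decode the concatenated representation
        self.fc2 = nn.Linear(hidden_dim + text_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)
        self.relu = nn.ReLU()

    def forward(self, structured, text_embedding):
        x = self.relu(self.structured_fc(structured))
        x = torch.cat([x, text_embedding], dim=1)  # concatenate the two modalities
        x = self.relu(self.fc2(x))
        return torch.sigmoid(self.fc3(x))          # churn probability
```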

The following diagram (Fig. 3) illustrates the model’s bi-modal architecture.

Figure 3: Bi-modal model architecture using structured data and text as inputs [Image by Author]

As discussed in the previous section, the categorical data has been transformed into numerical values using one-hot encoding, where each category of a feature is represented by binary values in a separate column. This leads to sparsity in the resulting encoded data (many zeros) when a feature has many categories, as with ‘State’, which has 51 categories (including District of Columbia). Additionally, the indicator columns created when imputing missing values of numerical features also contribute to the sparsity of the encoded data. The first fully-connected layer serves to reduce this sparsity, resulting in more efficient model training.

We train the model using the SGD (stochastic gradient descent) optimizer and BCE (binary cross entropy) loss function on Amazon SageMaker and achieve a performance of 0.98 AUC on the test dataset after around 8–10 epochs. For comparison, a model trained on just the categorical and numerical data achieves 0.93 AUC, about five percentage points lower than when the text data is included (Fig. 4). A similar improvement would be expected with real data, since the model with text data has more information with which to make a decision.
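A sketch of such a training loop is shown below; the learning rate and the data loader are assumptions, not the exact setup in the repository.

```python
import torch

model = BiModalChurnModel(num_structured_features=X_structured.shape[1])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate is an assumption
criterion = torch.nn.BCELoss()                            # binary cross entropy loss

for epoch in range(10):                                   # roughly 8-10 epochs per the post
    for structured, text_emb, labels in train_loader:     # assumed DataLoader yielding tensors
        optimizer.zero_grad()
        preds = model(structured, text_emb).squeeze(1)
        loss = criterion(preds, labels.float())
        loss.backward()
        optimizer.step()
```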

Figure 4: PR curves of the models trained with just categorical/numerical data, as well as with both categorical/numerical data and text [Image by Author]

Important Features

To prevent customers from churning, however, it is not sufficient to know how likely a churn event is. We also need to find out what the driving factors are so that preventive actions can be taken.

Categorical and numerical features

To find out which categorical/numerical features contribute most to our model’s predictions, we train an XGBoost model [5] on the categorical and numerical data, using the predicted labels of the trained neural network as target values. XGBoost’s built-in relative feature importance gives us an overview of the most important features, shown in Figure 5.
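A minimal sketch of this surrogate-model step; the hyperparameters are illustrative, and the arrays of structured features and predicted labels are assumed to come from the earlier steps.

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Surrogate model: fit XGBoost on the structured features, using the neural
# network's predicted churn labels (not the ground truth) as targets.
surrogate = xgb.XGBClassifier(n_estimators=100, max_depth=4)  # illustrative hyperparameters
surrogate.fit(X_structured, nn_predicted_labels)              # nn_predicted_labels is assumed

# Built-in relative feature importance, as plotted in Figure 5
xgb.plot_importance(surrogate, max_num_features=10)
plt.show()
```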

Figure 5: Top 10 most important categorical and numerical features [Image by Author]

From the above illustration, we can see that the top three features for determining whether a customer will churn are the number of vmail messages, the number of customer service calls, and not having an international plan (x2_no). The features x0_xx indicate the state.

Textual features

Now let’s focus on the customer-agent conversations and try to attribute the model’s predictions to its textual input features. While different approaches to interpreting deep learning models for natural language exist, their primary focus is on explaining each prediction individually. For example, the Captum library [6] implements techniques based on gradients or SHAP values for assessing each token/word in the text sequence. While these methods provide interpretability on a local level, they don’t easily scale to the entire dataset, which limits their use for a global interpretation of the trained model’s predictions.

Our approach to interpreting the trained model’s use of the SBERT-encoded textual features works well for both local interpretability (single chat logs) and global interpretability (entire dataset), and it scales efficiently. It consists of the following steps, which we perform on Amazon SageMaker. First, we subset the text into keywords using part-of-speech (POS) tagging and semantic similarity matching to churn. Then we perform an ablation analysis to determine the marginal contribution of each keyword to the model prediction. Finally, we combine semantic similarity, marginal contribution, and keyword frequency into a single score, which allows us to rank the keywords and surface the ones most relevant to churn.

The below flow chart (Fig. 6) illustrates our approach to extracting the keywords:

Figure 6: Flow chart describing the text transformations to find important features. Step 1: Obtain candidate keywords by applying POS filtering, lower casing, and lemmatization. Step 2: Obtain relevant keywords by applying semantic similarity matching of keywords to churn. Step 3: Calculate marginal contribution of keywords via ablation analysis. Circled area (red) indicates the reduction of the model’s churn prediction after removing the token cancel [Image by Author]

The approach starts with the raw text conversation; we reduce the size of our text body by focusing on a subset of candidate keywords obtained by applying several token filters, as shown in Figure 6, step 1. We apply spaCy’s POS tagging and keep only adjectives, verbs, and nouns [7]. We then remove stop words, lower-case the remaining tokens, and lemmatize them.
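A minimal sketch of this filtering step, assuming the small English spaCy model; the exact filters in the repository may differ.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with a tagger works

def candidate_keywords(chat_log):
    """Step 1: keep adjectives, verbs, and nouns; drop stop words; lower-case and lemmatize."""
    doc = nlp(chat_log)
    return {
        token.lemma_.lower()
        for token in doc
        if token.pos_ in {"ADJ", "VERB", "NOUN"} and not token.is_stop and token.is_alpha
    }

# Candidate keyword set across all chat logs
candidates = set().union(*(candidate_keywords(log) for log in chat_logs))
```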

Next, we rank the candidate keywords in order of semantic similarity to both class outcomes; here we will focus on churn (Fig. 6, step 2). Specifically, we encode each keyword using pre-trained SBERT and calculate each keyword’s cosine similarity to the average embedding of all SBERT-encoded chat logs that resulted in churn. This allows us to rank the keywords by similarity to churn, further reducing the candidate set to the keywords that are relevant to churn. From the above illustration you can see that ranking keywords by semantic similarity already gives us important insights into why customers may be churning. Many of the keywords, including ‘cancel’, ‘frustrated’, or ‘unhappy’, indicate a negative sentiment.
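A sketch of this ranking step, reusing the encoder and embeddings from above; the churn labels array and the number of top keywords to keep are assumptions.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Average SBERT embedding of all chat logs from customers who churned
# (churn_labels is an assumed numpy array of 0/1 labels aligned with chat_logs)
churn_centroid = text_embeddings[churn_labels == 1].mean(axis=0, keepdims=True)

# Embed each candidate keyword and compute its cosine similarity to the churn centroid
keyword_list = sorted(candidates)
keyword_embeddings = encoder.encode(keyword_list)
similarity_to_churn = cosine_similarity(keyword_embeddings, churn_centroid).ravel()

# Keep the keywords most semantically similar to churn (top-k cutoff is an assumption)
ranked = sorted(zip(keyword_list, similarity_to_churn), key=lambda kv: -kv[1])
relevant_keywords = [kw for kw, _ in ranked[:100]]
```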

In addition to semantic similarity to churn, we would like to further quantify the impact of keywords by measuring their marginal contribution to the prediction of the model (Fig. 6, step 3). We embed the chat logs with and without the relevant keywords (where they occur) and measure the average prediction difference. For example, the keyword cancel occurred 171 times across all churn chat logs and removing it results in a reduction of the model’s churn prediction by 4.18%, on average, across the 171 instances.
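A sketch of this ablation step follows; it re-embeds each chat log with the keyword removed and measures the average drop in the model’s predicted churn probability. The model, encoder, and the aligned chat logs and structured-feature tensor are assumed to be available from the earlier sketches.

```python
import re
import torch

def marginal_contribution(keyword, chat_logs, structured_tensor, model, encoder):
    """Average drop in predicted churn probability when `keyword` is removed
    from the chat logs in which it occurs, plus the occurrence count."""
    pattern = rf"\b{re.escape(keyword)}\b"
    idx = [i for i, log in enumerate(chat_logs) if re.search(pattern, log, re.IGNORECASE)]
    if not idx:
        return 0.0, 0

    originals = [chat_logs[i] for i in idx]
    ablated = [re.sub(pattern, "", log, flags=re.IGNORECASE) for log in originals]

    with torch.no_grad():
        emb_orig = torch.tensor(encoder.encode(originals), dtype=torch.float32)
        emb_abl = torch.tensor(encoder.encode(ablated), dtype=torch.float32)
        structured = structured_tensor[idx]
        drop = model(structured, emb_orig) - model(structured, emb_abl)

    return drop.mean().item(), len(idx)
```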

Finally, we merge all three scores, semantic similarity, marginal contribution, and keyword frequency, into one joint metric to achieve our final ranking of important keywords. The joint metric is calculated by bringing the individual metrics onto the same scale (via min-max scaling) and taking a weighted average.
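A sketch of this final scoring step, combining the outputs of the previous sketches; the equal weights are an assumption rather than the post’s exact choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One row per relevant keyword, combining the three metrics from the previous steps.
# structured_tensor is the assumed tensor of preprocessed features aligned with chat_logs.
sim_lookup = dict(ranked)  # keyword -> cosine similarity, from the step-2 sketch
rows = []
for kw in relevant_keywords:
    contribution, count = marginal_contribution(kw, chat_logs, structured_tensor, model, encoder)
    rows.append({"keyword": kw, "similarity": sim_lookup[kw],
                 "contribution": contribution, "frequency": count})
scores = pd.DataFrame(rows)

# Bring the three metrics onto the same [0, 1] scale, then take a weighted average
scaled = MinMaxScaler().fit_transform(scores[["similarity", "contribution", "frequency"]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])  # equal weights are an assumption
scores["joint_score"] = scaled @ weights

top_keywords = scores.sort_values("joint_score", ascending=False).head(20)
```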

Results

The following table (Fig. 7) shows the 20 most important keywords for predicting churn. They include ‘voicemail’, ‘cancel’, ‘spam’, ‘turnover’, ‘frustrated’, and ‘unhappy’ which are indicative of poor customer satisfaction or other issues.

Figure 7: Table with 20 most important keywords for predicting churn [Image by Author]

To provide more context around the keywords and to help better explain churn events, we added functionality to query the phrases in which the keywords were used in the customer-agent conversations. For example, the keyword ‘spam’ was used when customers complained about being flooded "with emails and phone calls, spamming me with thousands of phony invoices." Original chat logs mentioning spam:

"I just got some spam messages last night, and today it’s been getting a lot of texts that I ‘don’t have my SIM card’ and I need my SIM card."

"TelCom started to flood me with emails and phone calls, spamming me with thousands of phony invoices."

"Basically, I’m getting a lot of spam calls every day from a guy named Michael who’s calling from a really weird number."

Understanding the keywords and their context allows us to prescribe actions to address customer churn. For example, some customers seem to be receiving large numbers of spam calls and are therefore deciding to leave the service. We could devise a plan to reduce the spam-call issue, which could in turn lower customer churn.

Alternatively, we could gain more insight by categorizing the different keywords or churn phrases into distinct topics and formulating actions based on those topics. However, given the nature of our synthetic dataset, with its rather narrow range of conversation topics, we found this didn’t help in our case.

Conclusion

In this post, we showed how combining text data from customer-agent interactions with traditional customer account data can improve the performance of customer churn prediction. Furthermore, we introduced an approach that enables us to derive insights from the text, in particular, which keywords are most indicative of customer churn. Given our focus on global interpretability of the language model, our approach scales efficiently to the entire dataset, which means we are able to understand the main drivers of churn across all customer-agent conversations. All data transformation steps, as well as model training, evaluation, and interpretation, were performed on Amazon SageMaker.

References:

[1] Language Models are Unsupervised Multitask Learners, Radford, Wu, et al., 2019

[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin, Chang, et al., 2019

[3] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers, Gurevych, 2019

[4] Hugging Face Sentence Transformers

[5] XGBoost model

[6] Captum library for model interpretability

[7] Spacy library for text processing

