Customer Segmentation (Coeur de cible, February 2019)
We are in the middle of a digital transformation, and most of our daily activities, such as purchasing items (clothes, phones, food, etc.), travelling or searching the internet, generate a large amount of data. Companies today rely on this data to analyse and understand their customers’ behaviour and to segment those customers in order to improve their marketing campaigns. Customer segmentation is a very common method used by retailers.
RFM analysis is a technique often used to perform customer segmentation. It takes into account recency (how recently the customer placed their last order), frequency (how often the customer orders) and monetary value (the amount spent on the purchased items, either over a given period of time or on the last order) to establish the different customer segments.
As an example of RFM analysis, we will use retail customer data in this study, using Python and some of its visualization libraries and tools.
Our data set contains information about customers in different states of the US. These customers placed 5009 online orders between 2016-01-02 and 2019-12-30. The features (columns) are:
Order ID: Unique identifier of the online order.
Order Date: Date on which the customer ordered an item.
Customer ID: Unique customer identifier.
State: The name of the US state where each customer resides.
Product ID: Unique Product identifier.
Category: Product Category name.
Quantity: The quantity of each product per transaction.
Price: Unit price of a product.
As a reminder, the data has already been cleaned (i.e. we removed all erroneous entries and missing data from our set before using it). We do not detail this step, as it is not the main goal of the project.
Let’s do a little data visualization in order to better understand our data.
First of all, we import these libraries.
We import our data set to see how it looks.
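The original import and loading code is not reproduced in this text, so here is a minimal sketch of what that step could look like with pandas. The three sample rows, the `SAMPLE_CSV` string and the `load_orders` helper are made up for illustration; only the column names come from the feature list above.

```python
import io

import pandas as pd

# Hypothetical sample rows using the columns listed above.
# The real data set contains 5009 orders.
SAMPLE_CSV = """Order ID,Order Date,Customer ID,State,Product ID,Category,Quantity,Price
O-1001,2016-01-02,C-01,California,P-10,Furniture,2,120.50
O-1002,2017-06-15,C-02,Texas,P-11,Technology,1,899.99
O-1003,2019-12-30,C-01,California,P-12,Office Supplies,5,4.20
"""


def load_orders(source):
    """Read the orders and parse 'Order Date' as a datetime column."""
    return pd.read_csv(source, parse_dates=["Order Date"])


orders = load_orders(io.StringIO(SAMPLE_CSV))
print(orders.head())
print(orders.dtypes)
```

With a real file, `load_orders` would simply receive the path to the CSV instead of the in-memory string.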
Here is our map chart showing the number of customers by state.
As you can see, some states have a low number of customers. We decided to keep only the top 12 states, because they each account for at least twice as many customers as the remaining states.
But again, it all depends on the marketing strategy you decide to pursue.
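As an illustration of this selection step, the sketch below counts distinct customers per state and keeps the top `n` states. The toy data and the `top_states` helper are assumptions made for the example; the article keeps 12 states, while the toy data only supports `n=2`.

```python
import pandas as pd

# Toy data (hypothetical): one row per order line.
orders = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C3", "C4", "C5"],
    "State": ["California", "California", "California",
              "Texas", "Texas", "Ohio"],
})


def top_states(df, n):
    """Return the n states with the most distinct customers."""
    customers_per_state = df.groupby("State")["Customer ID"].nunique()
    return customers_per_state.sort_values(ascending=False).head(n)


top = top_states(orders, n=2)  # the article keeps n=12
# Keep only the orders placed in the selected states.
selected = orders[orders["State"].isin(top.index)]
print(top)
```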
A visualization of these selected states will be:
The result is:
Now, let’s start our RFM analysis.
- We start by calculating the recency (R) of a customer.
In order to determine how much time has passed since a customer’s last purchase, we need a reference date from which to start calculating. We then take the difference between the two dates to obtain the amount of time elapsed.
Suppose that we run our analysis 2 days after the last transaction date recorded in our data set; this date is our reference date.
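The recency calculation described above can be sketched as follows. The four toy orders are hypothetical; the 2-day offset for the reference date is the one chosen in the text.

```python
import pandas as pd

# Toy orders (hypothetical values).
orders = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C3"],
    "Order Date": pd.to_datetime(
        ["2019-10-01", "2019-12-30", "2019-11-30", "2016-01-02"]
    ),
})

# Reference date: 2 days after the last recorded transaction.
reference_date = orders["Order Date"].max() + pd.Timedelta(days=2)

# Most recent order per customer, then days elapsed up to the reference date.
last_order = orders.groupby("Customer ID")["Order Date"].max()
recency = (reference_date - last_order).dt.days.rename("Recency")
print(recency)
```

A low recency value means the customer ordered recently; here C1 ordered 2 days before the reference date.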
Let’s look at customer distribution based on recency.
Here, we can see that the histogram is concentrated on the left side, which is the sign of a right-skewed distribution. We can also see that the rug plot is crowded between 0 and 400. Based on that, we have a high concentration of customers whose last purchase falls within the last 400 days, i.e. roughly the last 13 months.
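The visual reading of the histogram can be complemented by a quick numeric check. The sketch below uses made-up recency values, not the article's data: a positive sample skewness indicates a longer tail on the right, and a simple mean of a boolean mask gives the share of customers within 400 days.

```python
import pandas as pd

# Hypothetical recency values in days (not the article's data).
recency = pd.Series(
    [5, 10, 20, 30, 45, 60, 90, 400, 900, 1400], name="Recency"
)

# Sample skewness: a value above 0 means a longer right tail.
skewness = recency.skew()

# Fraction of customers whose last order is at most 400 days old.
share_recent = (recency <= 400).mean()

print(f"skewness={skewness:.2f}, share within 400 days={share_recent:.0%}")
```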
- Now we’re going to determine the frequency (F) at which customers buy the products.
In our data set, we can see that the majority of customers do not buy more than 10 times. This is not very much, given that our data covers a period of 4 years.
- Finally, we’re going to determine the monetary value (M).
Let’s look at customer distribution based on monetary value.
We can see that not many customers spend more than $3,000.
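A minimal sketch of the frequency calculation follows. It assumes that an order can span several rows (one per product), so it counts distinct `Order ID` values per customer rather than rows; the toy data is hypothetical.

```python
import pandas as pd

# Toy order lines (hypothetical): order O1 contains two products.
orders = pd.DataFrame({
    "Order ID": ["O1", "O1", "O2", "O3", "O4"],
    "Customer ID": ["C1", "C1", "C1", "C2", "C3"],
})

# Frequency = number of distinct orders placed by each customer.
frequency = (
    orders.groupby("Customer ID")["Order ID"].nunique().rename("Frequency")
)
print(frequency)
```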
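The monetary value can be sketched as below. The `Amount` column is an assumption: it is taken here as Quantity × Price per row, summed over the whole period for each customer. The toy values are hypothetical.

```python
import pandas as pd

# Toy order lines (hypothetical values).
orders = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2"],
    "Quantity": [2, 1, 3],
    "Price": [10.0, 5.5, 100.0],
})

# Assumption: the amount of a row is quantity times unit price.
orders["Amount"] = orders["Quantity"] * orders["Price"]

# Monetary = total amount spent by each customer.
monetary = orders.groupby("Customer ID")["Amount"].sum().rename("Monetary")
print(monetary)
```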
Now let’s check the RFM table.
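One possible way to assemble the three measures into a single RFM table is a grouped aggregation, as sketched here. The toy data, the `Amount` column (Quantity × Price) and the output column names are assumptions made for the example.

```python
import pandas as pd

# Toy order lines (hypothetical values).
orders = pd.DataFrame({
    "Order ID": ["O1", "O2", "O3"],
    "Customer ID": ["C1", "C1", "C2"],
    "Order Date": pd.to_datetime(["2019-10-01", "2019-12-30", "2019-11-30"]),
    "Quantity": [2, 1, 3],
    "Price": [10.0, 5.5, 100.0],
})
orders["Amount"] = orders["Quantity"] * orders["Price"]

# Reference date: 2 days after the last recorded transaction.
reference_date = orders["Order Date"].max() + pd.Timedelta(days=2)

# One row per customer with the three RFM measures.
rfm = orders.groupby("Customer ID").agg(
    Recency=("Order Date", lambda d: (reference_date - d.max()).days),
    Frequency=("Order ID", "nunique"),
    Monetary=("Amount", "sum"),
)
print(rfm)
```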
Thanks to our RFM table, we are now going to propose a customer segmentation strategy.
Depending on the company’s objectives, customers can be segmented in several ways so that marketing campaigns remain financially viable. The ideal customers for e-commerce companies are generally those who purchased most recently relative to the date of the study (our reference date), who buy very frequently and who spend enough.
The RFM factors are therefore very important if you want to know which customers are the right ones to receive promotional e-mails or offers of new services.
Based on the RFM table, we will assign each customer a score between 1 and 3 for each of the R, F and M values.
3 is the best score and 1 is the worst. The RFM score of a customer is calculated by combining the three scores obtained for R, F and M. For example, the customer with ID-1 has a score of 3 in Recency, a score of 3 in Frequency and a score of 3 in Amount (Monetary). Their RFM score is therefore 3-3-3. In concrete terms, this is a customer who bought most recently, bought most often and spent the most.
To find the R score, we use the smallest monthly interval, i.e. 1 month, because our recency distribution is strongly right-skewed.
Below, we define a function that computes the recency score for each customer.
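The original function is not reproduced in this text, so the following is only a sketch of what such a scoring function might look like. The 30-day and 60-day thresholds are hypothetical stand-ins for one-month steps, not the article's actual cut-offs.

```python
import pandas as pd


def recency_score(days, best=30, middle=60):
    """Map a recency in days to a score from 3 (best) to 1 (worst).

    The thresholds are hypothetical: one-month steps of 30 days.
    """
    if days <= best:
        return 3
    if days <= middle:
        return 2
    return 1


# Toy recency values in days (hypothetical).
rfm = pd.DataFrame({"Recency": [2, 45, 400]}, index=["C1", "C2", "C3"])
rfm["R"] = rfm["Recency"].apply(recency_score)
print(rfm)
```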
We do the same for the frequency of customer purchases.
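For the frequency score, a comparable sketch can rely on `pd.cut` with fixed bin edges. The edges below (up to 3 orders, 4 to 6 orders, more than 6) are hypothetical and would need to be adapted to the real distribution.

```python
import numpy as np
import pandas as pd

# Toy order counts per customer (hypothetical).
frequency = pd.Series(
    [1, 3, 4, 6, 7, 12],
    index=["C1", "C2", "C3", "C4", "C5", "C6"],
    name="Frequency",
)

# Hypothetical bin edges: (0, 3] -> 1, (3, 6] -> 2, (6, inf) -> 3.
FREQUENCY_BINS = [0, 3, 6, np.inf]
f_score = pd.cut(frequency, bins=FREQUENCY_BINS, labels=[1, 2, 3]).astype(int)
print(f_score)
```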
Another simple way to calculate a score, used here for Monetary, is to use tertiles. The ‘qcut’ function of pandas splits the unique “Monetary” values into 3 quantile-based intervals, each holding roughly the same number of values (the intervals are highlighted in yellow on the image above).
For more information on the “qcut” function, see the pandas documentation for “pd.qcut()”.
These intervals will then be labelled from 1 to 3, in the same way as the recency and the frequency.
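Here is a small sketch of the tertile approach with `pd.qcut`, using made-up monetary values. Note that the bins are defined by quantiles, so each of them receives about the same number of values rather than covering the same range of amounts.

```python
import pandas as pd

# Toy monetary values per customer (hypothetical).
monetary = pd.Series(
    [50, 120, 300, 800, 1500, 4000],
    index=["C1", "C2", "C3", "C4", "C5", "C6"],
    name="Monetary",
)

# Three quantile-based bins, labelled 1 (lowest spend) to 3 (highest spend).
m_score = pd.qcut(monetary, q=3, labels=[1, 2, 3]).astype(int)
print(m_score)
```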
So we get our final table with all the recency, frequency and monetary scores.
The final RFM score is obtained by concatenating the R, F and M scores. This will allow us to define customer segments. As customer segmentation really depends on each company’s objectives, we will settle for the most common segments found in the marketing field. We were inspired by the segments defined by Joao Correia in his article, which you will find here.
The common key segments that we will identify in our data set are shown in the table below.
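The concatenation of the three scores can be sketched as follows. The score values and the `RFM_Score` column name are assumptions made for the example; the hyphen separator mirrors the "3-3-3" notation used earlier.

```python
import pandas as pd

# Toy scores per customer (hypothetical).
scores = pd.DataFrame(
    {"R": [3, 1, 2], "F": [3, 1, 3], "M": [3, 1, 2]},
    index=["C1", "C2", "C3"],
)

# Concatenate the three scores as strings, e.g. 3, 3, 3 -> "3-3-3".
scores["RFM_Score"] = (
    scores["R"].astype(str)
    + "-" + scores["F"].astype(str)
    + "-" + scores["M"].astype(str)
)
print(scores)
```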
The code below creates a new column, “segment”, which holds the segment each customer belongs to. We first identify the segments ‘Lost Cheap Customers’, ‘Lost Customers’, ‘Best Customers’ and ‘Almost Lost’. All remaining customers are assigned to ‘others’.
The customers in the ‘Loyal customers’ segment are then identified and assigned to ‘segment’ as well.
We do the same thing for the ‘Big Spenders’ segment.
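Since neither the segment table nor the original code is reproduced in this text, the sketch below relies on hypothetical rules: the exact-score mapping in `EXACT_SEGMENTS`, "F equals 3" for Loyal Customers and "M equals 3" for Big Spenders are all assumptions. It also assumes that the two broader segments are only applied to customers still labelled "others".

```python
import pandas as pd

# Toy scores per customer (hypothetical).
rfm = pd.DataFrame(
    {
        "R": [3, 1, 1, 2, 2, 1, 2],
        "F": [3, 1, 3, 3, 3, 2, 1],
        "M": [3, 1, 3, 3, 1, 3, 2],
    },
    index=["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
)
rfm["RFM_Score"] = (
    rfm["R"].astype(str) + "-" + rfm["F"].astype(str) + "-" + rfm["M"].astype(str)
)

# Hypothetical exact-score rules; the article's own table is not shown here.
EXACT_SEGMENTS = {
    "3-3-3": "Best Customers",
    "2-3-3": "Almost Lost",
    "1-3-3": "Lost Customers",
    "1-1-1": "Lost Cheap Customers",
}

# Step 1: exact matches, everything else falls into "others".
rfm["segment"] = rfm["RFM_Score"].map(EXACT_SEGMENTS).fillna("others")

# Step 2: Loyal Customers among those still labelled "others" (assumed rule).
is_other = rfm["segment"] == "others"
rfm.loc[is_other & (rfm["F"] == 3), "segment"] = "Loyal Customers"

# Step 3: Big Spenders among those still labelled "others" (assumed rule).
is_other = rfm["segment"] == "others"
rfm.loc[is_other & (rfm["M"] == 3), "segment"] = "Big Spenders"

print(rfm[["RFM_Score", "segment"]])
```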
We therefore have all our customers assigned to the segments that we have defined.
Of course, these segments can be defined differently according to the objectives that the company sets itself. Customers in the “others” segment could be split further into additional segments. Given that we have three scores (R, F, M), each labelled from 1 to 3, there are 3 × 3 × 3 = 27 possible segments.
We are going to do a small visualization of our segments. We will first plot the key segments and then give some marketing recommendations that the company could follow to retain those customers.
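The count of possible score combinations can be checked by enumerating them, as in this short sketch. The "R-F-M" string format is an assumption used for display.

```python
from itertools import product

SCORES = (1, 2, 3)

# Every possible (R, F, M) combination, formatted as "R-F-M".
all_rfm_scores = [f"{r}-{f}-{m}" for r, f, m in product(SCORES, repeat=3)]

print(len(all_rfm_scores))
print(all_rfm_scores[:3])
```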
Another, more expressive way to show the distribution of segments is a treemap drawn with the squarify library, which builds on matplotlib:
These two graphs show that the “Big Spenders” and “Lost Cheap Customers” segments are the largest and that “Almost Lost” is the smallest.
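The numbers behind such charts are simply the segment sizes. The sketch below computes them, along with each segment's share, from a made-up `segment` column; these counts do not reflect the article's results.

```python
import pandas as pd

# Toy segment labels (hypothetical, not the article's results).
segments = pd.Series(
    [
        "Big Spenders", "Big Spenders", "Big Spenders",
        "Lost Cheap Customers", "Lost Cheap Customers",
        "Best Customers",
    ],
    name="segment",
)

# Number of customers per segment, largest first.
sizes = segments.value_counts()

# Share of each segment in the customer base.
shares = (sizes / sizes.sum()).round(2)

print(sizes)
print(shares)
```

The values of `sizes` and its index can then be passed to a bar chart or treemap as the areas and labels.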
- Best Customers: Reward them for their multiple purchases. They can be early adopters of brand-new products. Suggest that they “Refer a friend”. They are also likely to be the most loyal customers, who are in the habit of ordering.
- Lost Cheap Customers: Send them personalized emails to encourage them to order.
- Big Spenders: Inform them about discounts to keep them spending on your products.
- Loyal Customers: Create loyalty cards with which they earn points on each purchase; these points can then be converted into a discount.
In our next articles, we will show how machine learning can help with customer segmentation, focusing on unsupervised learning algorithms such as k-means.
Jean Marie Madeng is a TBS student at the Toulouse campus pursuing his Master of Science in Big Data Marketing Management. This blog was first published on Analytics Vidhya’s Medium page.