View on GitHub

Customer-Segmentation-Project

A project I created implementing customer segmentation using an artificial dataset from Kaggle.

Customer-Segmentation-Project

A project I created implementing customer segmentation using an artificial dataset from Kaggle.

Author: Brian Docena

Introduction

The motivation for this project was during class when I learned K-Means Clustering. I thought the concept of creating clusters based on points’ feature values was simple, but interesting and I thought about how it would be used in a real life scenario. This situation led me to want to implement clustering based on customer data. It would be hard to come across a dataset for this application because of privacy concerns, so I just searched for data on Kaggle that had interesting features. Sadly, this dataset was artificially generated so the insights and analysis were not as interesting as I wanted it to be. ecommerce_customer_data_large is the data from Kaggle that was artificially generated, which contains 250,000 rows and the following columns:

Column	Description
‘Customer ID’	ID Number of the Customer
‘Purchase Date’	The Date and Time of the Purchase
‘Product Category’	The Type of Category the Product Belongs to
‘Product Price’	The Price of the Product
‘Quantity’	Amount the User Bought of the Product
‘Customer Age’	The Age of The Customer
‘Returns’	Values were assigned 1 or 0 where 0 represented No Return and 1 represented Return
‘Customer Name’	Name of the Customer
‘Age’	Age of the Customer
‘Gender’	Gender of the Customer
‘Churn’	Values were assigned 0 if they were retained or 1 if they were churned

Data Cleaning and Exploration

Since this dataset was artificially generated, there wasn’t much needed to do for data cleaning. However, I did transform the data with the following steps:

I used probabilistic imputation on the return column to fill missing values.
- I saw that the only column that had null values were the returns column. To handle this I wrote code to see how the values were distributed. I saw that the returns were randomly distributed between no return and returned. I thought the best decision was to fill in the return column using imputation, keeping the randomness intact.
I renamed the columns so that instead of spaces there were underscores.
- This was just a personal preference. I am used to words being separated by underscores instead of spaces.
I dropped Customer_Name and Age columns.
- I thought the Customer_Name was not valuable and I would have done this for privacy concerns. The Age column was repeated, so I dropped it.

The final dataframe has 250,000 rows and 13 columns. (Note: the following visualization is a portion of the columns)

Customer_ID	Purchase_Date	Product_Category	Product_Price	Quantity	Total_Purchase_Amount	Payment_Method	Customer_Age	Returns	Customer_Name	Age	Gender
44605	2023-05-03 21:30:02	Home	177	1	2427	PayPal	31	1	John Rivera	31	Female
44605	2021-05-16 13:57:44	Electronics	174	3	2448	PayPal	31	1	John Rivera	31	Female
44605	2020-07-13 06:16:57	Books	413	1	2345	Credit Card	31	1	John Rivera	31	Female
44605	2023-01-17 13:14:36	Electronics	396	3	937	Cash	31	0	John Rivera	31	Female
44605	2021-05-01 11:29:27	Books	259	4	2598	PayPal	31	1	John Rivera	31	Female

Exploratory Data Analysis

For analysis, I was disappointed with the results. Since the data was artificially generated, the analysis was not as insightful as it could have been.

The first thing I did was to see the statistics on the numeric columns of this dataset:

	count	mean	std	min	25%	50%	75%	max
Product_Price	250000	254.743	141.738	10	132	255	377	500
Quantity	250000	3.00494	1.41474	1	2	3	4	5
Total_Purchase_Amount	250000	2725.39	1442.58	100	1476	2725	3975	5350
Age	250000	43.7983	15.3649	18	30	44	57	70

Here we can see that the average transaction is around $255 and customers usually buy 3 items.

Next, I wanted to see the amount that customers bought from each category and the payment method used. Sadly, because of the nature of the dataset, the values within the columns were evenly distributed.

Here is the amount bought for each category:

Amount Bought For Each Category

Here is the amount of customers that use each payment method: Amount Used For Each Payment Method

Continuing with my analysis, I thought that it would be useful to use the purchase date of customers to see the season each customer bought their item. This provides us with insight on what seasons are popular amongst customers. Count of Seasons We can see based on this visualization that the warmer seasons (Spring and Summer) are more popular

Finally, I wanted to see how age correleated with how much a customer purchased in a single transaction. Age vs. Transaction

Here we can see that there is a positive correlation between age and amount purchased in a single transaction.

RFM Analysis

Since I was doing a customer segmentation project, I thought it was fitting to also do a RFM analysis on my dataset. RFM stands for recency, frequency, and monetray value. These values represent how recent customers bought their product, how often they buy, and how much they spend. This is useful in identifying valuable customers.

For this analysis I calculated recency based on today’s date and a customer’s most recent transaction date. I then used count to see how many transactions customers had and summed the total amount of their transactions.

This has led me to the result dataframe:

Recency	Frequency	Monetary	Recency_Score	Frequency_Score	Monetary_Score	RFM_Group	RFM_Score_Total
397	3	6290	2	1	1	2-1-1	4
181	6	16481	4	4	4	4-4-4	12
331	4	9423	3	2	2	3-2-2	7
550	5	7826	1	3	2	1-3-2	6
533	5	9769	2	3	2	2-3-2	7

I then used regular expressions and a function to map labels accordingly onto customers’ RFM Score.

Here is the final result:

Count of Labeled Customers

Here is the result as a pie chart:

Pie Chart

Feature Engineering

For more useful and insightful clusters, I did more feature engineering.

Category Features

These are features based on the category of the products customers bought.

I created a column called Category_Diversity which consists of the unique amount of different categories customers bought. This value represents how much each customer branches out to different categories. I created another column called Favorite_Category, which consists of the category that the customer buys products from the most.

Return Features

These are features based on the returns of customers.

I created Total_Return, which is the total amount of returns a customer has returned. I also created Return_Rate, which is the rate of return of customers.

Behavioral Features

These are features that pertain to a customer’s buying habits.

Using the purchase date, I found the favorite hour of each customer by finding the hour that the customer made the most purchases in. Similarly, using the same technique, I found the customer’s favorite day of the week.

Spending Trends

Using the purchase date, I found each customer’s favorite season by finding the month the customer made the most transactions in.

This is an example row of the final result of my dataframe after feature engineering:

Customer_ID	Purchase_Date	Total_Transactions	Total_Purchase_Amount	Total_Products	Average_Transaction	Average_Quantity_Per_Transaction	Age	Gender	Category_Diversity	Favorite_Category	Total_Return	Return_Rate	Favorite_Hour	Favorite_Day	Favorite_Season
1	397	3	6290	15	2096.67	5	67	Female	3	Books	1	0.0666667	6	1	Spring

Identifying Outliers

Since I was using K-Means Clustering for my customer segmentation project, it was important to identify and get rid of outliers within my dataset. This is because we are taking the mean distance from each centroid to determine each cluster and since mean is not robust to outliers, this will affect the results and quality of our clusters.

In order to identify outliers I used an Isolation Forest as this algorithm has an easier time isolating outliers due to outliers having distinct characteristics that the algorithm recognizes.

Here is what my dataframe looked like after identifying the outliers:

Customer_ID	Purchase_Date	Total_Transactions	Total_Purchase_Amount	Total_Products	Average_Transaction	Average_Quantity_Per_Transaction	Age	Gender	Category_Diversity	Favorite_Category	Total_Return	Return_Rate	Favorite_Hour	Favorite_Day	Favorite_Season	Outlier_Scores
1	397	3	6290	15	2096.67	5	67	Female	3	Books	1	0.0666667	6	1	Spring	1
2	181	6	16481	18	2746.83	3	42	Female	4	Electronics	4	0.222222	2	0	Summer	1
3	331	4	9423	15	2355.75	3.75	31	Male	3	Electronics	1	0.0666667	0	6	Winter	1
4	550	5	7826	19	1565.2	3.8	37	Male	4	Books	3	0.157895	1	2	Fall	1
5	533	5	9769	13	1953.8	2.6	24	Female	2	Home	5	0.384615	4	2	Spring	1

After running the algorithm this is the percentage of inliers / outliers within my data:

percentage of inliers / outliers

After identifying the outliers, I dropped them from my dataframe.

Determining Multicollinearity

To determine multicollinearity I created a heat map:

correlation matrix

From the heat map it is important to note some features that are highly correlated with each other:

Total_Transactions and Total_Purchase_Amount
Total_Transactions and Total_Products
Total_Purchase_Amount and Total_Products

Multicollinearity is important for our clusters because if there is multicollinearity within our data, this could affect the quality of our clusters as features do not provide unique information.

Finding Optimal Components for PCA

Since we identified multicollinearity within our data we can use PCA it helps reduce the effect of multicollinearity by tranforming the data into uncorrelated data.

PCA visualization

Here we can see the graph of the explained variance which shows how much of the variance is obtained by each principal component. I chose 7 as the number of components as this is the point where the cumulative variance slows down.

Utilizing PCA

After running PCA with 7 components this is the result:

PCA Results

The highlighted cells are the coefficients that correspond to each principal component.

Determining Optimal Number of Components

Elbow Method

The Elbow Method is a technique used to find the optimal number of clusters by generating clusters for different values of k and calculating the sum of squared difference of each point and the closest centroid. If we plot the sum of squared differences we can identify the point where adding more clusters does not significantly reduce the sum of squared difference, creating an “elbow”. However, the downside of this method is that the “elbow” is usually vague and hard to identify.

elbow method

Utilizing the Silhouette Method

The silhouette method is another technique used to find the optimal number of clusters in a dataset. This method calculates a silhouette score for each data point that measure how well each point is assigned to a cluster.

To use the silhouette method you have to choose a range of clusters, then use the resulting cluster as a parameter for the silhouette score.

I plotted the silhouette score and the silhouette cofficients to find the most optimal clusters.

Here is the final result:

silhouette method

Interpreting the Silhouette Visualization

To determine the most optimal amount of clusters I looked at the number of clusters with the highest mean silhouette score and looking at the silhouette plots that are roughly the same thickness and width.

Creating the Clusters

After performing clustering, here is the result of the distribution of the customers amongst the clusters.

cluster distribution

Evaluating Cluster Quality

To evaluate my clusters I’m going to use the Silhouette Score (higher values indicate better clustering), Calinski Harabasz Score (high score indicates better clustering), and Davies Bouldin Score (lower value indicates better clustering).

Here is the evaluation of my clusters:

+————————-+———————+ | Metric | Value | +————————-+———————+ | Number of Observations | 46862 | | Silhouette Score | 0.20510671991363183 | | Calinski Harabasz Score | 15007.420975858586 | | Davies Bouldin Score | 1.6801126746929693 | +————————-+———————+

From my evaluation we can see that my silhouette score is 0.20, which is not ideal, meaning that there could be some overlap between the clusters. My Calinski Harabasz score is super high, which indicates that my clusters are well-defined. My Davies Bouldin Score of 1.68 is reasonable indicating that my clusters are somewhat similar.

Clusters

Here are my resulting clusters:

clusters

Cluster Interpretation

Cluster 0: Customers in cluster 0 are usually males whose favorite shopping category are usually electronics or clothing. They have a moderate average transaction, which means that for each transaction they spend an average amount. These customers do not shop frequently shown in the high purchase_date.

Cluster 1:The customers in cluster 1 are usually females. These customers have purchased the most amount of items shown in the high total transaction, total purchase amount, and total products. These customers have made purchases recently as shown in the low purchase_date. These customers also have a high return rate and total return.