Customer segmentation using R

R is a versatile environment for statistical computing and graphics that suits many kinds of projects, including market research.

Customer segmentation is an important process in market research (and in other business planning, such as marketing or investment strategy) that requires a deeper understanding of customers. It groups customers by the characteristics they have in common. The four main types are demographic, behavioral, geographic, and psychographic segmentation. Many others are possible, depending on a company’s priorities, products or services, and other objectives. In this R project we use the Mall_Customers dataset. It is open to the public and available here.

I will use it to show how R easily performs customer segmentation projects and reports.

Before loading the dataset, I recommend changing the names of the variables “Annual Income” and “Spending Score” into “Annual_Income” and “Spending_Score”, because spaces in variable names can sometimes lead to syntax problems.
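
If you prefer not to edit the CSV file by hand, the renaming can also be done in R after loading. Here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical; only the column names follow the ones mentioned above). It uses base R to replace every space in the column names with an underscore.

```r
# Hypothetical toy data; check.names = FALSE keeps the spaces in the names
toy <- data.frame(
  "CustomerID" = c(1, 2),
  "Annual Income" = c(15, 16),
  "Spending Score" = c(39, 81),
  check.names = FALSE
)

# Replace every space in the column names with an underscore
names(toy) <- gsub(" ", "_", names(toy), fixed = TRUE)

names(toy)
```

The same `gsub()` line should work on the real data set right after `read_csv()`, assuming its column names contain plain spaces.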

For this analysis, I recommend loading the following packages from the beginning.

library(plotly)
library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(readr)
library(ggpubr)
#Load the dataset
Mall_Customers <- read_csv("Mall_Customers.csv")
#Open the data viewer (works in interactive sessions such as RStudio)
View(Mall_Customers)
#Copy the dataset under a shorter, more convenient name
customers_data <- Mall_Customers
#Print a preview of the dataset
customers_data
#List the column names
names(customers_data)

The last command lists the column names. Each observation in the data set is a customer. Each customer has an identification number (CustomerID), Gender, Age, Annual_Income (in thousands of dollars), and a Spending_Score. The spending score is assigned by the mall on a scale from 1 to 100 and reflects the customer’s spending behavior.

The str() function shows the structure of the data set: there are 200 observations and 5 variables (the columns we have just identified).

str(customers_data)

Check the first 6 observations from the top:

head(customers_data)

For each numeric variable, the summary() function returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. For example, the youngest customer is 18 years old and the oldest is 70. Annual income ranges from $15 k to $137 k, with a mean of $60.56 k.

summary(customers_data)

To see how much customers’ ages vary, we can compute the standard deviation:

sd(customers_data$Age)

The standard deviation is about 13.97 years. Together with the mean, it shows which ages lie within one standard deviation of the average. The same calculation can be applied to Annual_Income:

summary(customers_data$Annual_Income)
sd(customers_data$Annual_Income)
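
The idea of "within one standard deviation of the mean" can be made concrete with a few lines of base R. This is only a sketch on made-up ages (the vector `ages` is hypothetical and not taken from Mall_Customers); it computes the interval from mean minus one standard deviation to mean plus one standard deviation, and the share of values that fall inside it.

```r
# Hypothetical ages, made up for illustration (not taken from Mall_Customers)
ages <- c(19, 21, 23, 31, 35, 40, 47, 52, 64, 70)

age_mean <- mean(ages)
age_sd <- sd(ages)

# Bounds of the interval: mean minus and plus one standard deviation
lower <- age_mean - age_sd
upper <- age_mean + age_sd

# mean() of a logical vector gives the share of TRUE values
within_one_sd <- mean(ages >= lower & ages <= upper)

c(lower = lower, upper = upper, share = within_one_sd)
```

Replacing `ages` with `customers_data$Age` or `customers_data$Annual_Income` should give the corresponding interval for the real data.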

The standard deviation for annual income is about 26.26 (in thousands of dollars). I make a density plot of Annual_Income like this:

plot(density(customers_data$Annual_Income),
     col="yellow",
     main="Density Plot for Annual Income",
     xlab="Annual Income Class",
     ylab="Density")

#Fill the area under the density curve
polygon(density(customers_data$Annual_Income),
        col="#17becf")

Customers’ incomes range from $15 k to $137 k, with an average of $60.56 k. The distribution of annual income is unimodal and slightly skewed, with a longer tail toward higher incomes. This information helps decide which products are likely to appeal to these customers. Since few customers in this data set have incomes over $100 k, the data would not be very useful for planning high-end products, but it could be quite useful for mid-range products.
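
A claim such as "few customers have incomes over $100 k" is easy to check numerically. Here is a minimal sketch on made-up incomes (the vector `income_k` is hypothetical, in thousands of dollars) that counts the values above a threshold and computes their share.

```r
# Hypothetical incomes in thousands of dollars (made up for illustration)
income_k <- c(15, 28, 40, 54, 60, 63, 78, 87, 103, 137)

# The comparison returns TRUE/FALSE for each value
over_100 <- income_k > 100

# sum() counts the TRUE values, mean() gives their share
count_over_100 <- sum(over_100)
share_over_100 <- mean(over_100)

c(count = count_over_100, share = share_over_100)
```

With the real data, the same two lines applied to `customers_data$Annual_Income` would show how small the high-income group is.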

Next, I make a bar plot of the customers’ gender. The table() function below counts the customers of each gender:

gender_tab <- table(customers_data$Gender)
gender_tab 

There are more females than males, which matters if some products are aimed mainly at one gender.

barplot(gender_tab,main="BarPlot - Gender",
        ylab="Count",
        xlab="Gender",
        col=rainbow(2))

Customers’ age by gender can also be represented as follows:

library(ggplot2)
my_plot <- ggplot(customers_data, aes(x = Age, fill = Gender)) +
  geom_histogram(bins = 50)
my_plot +
  scale_fill_manual(values = c("Female" = "#1b98e0",
                               "Male" = "#f6f805"))

The proportions of females and males can be calculated as follows:

prop.table(table(customers_data$Gender))
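
To make clear what prop.table() returns, here is a small sketch on made-up values (the vector `gender` is hypothetical). It shows that prop.table() divides each count by the total, so the result is a set of shares that sum to 1, and it also computes a female-to-male ratio in the strict sense for comparison.

```r
# Hypothetical gender values, made up for illustration
gender <- c("Female", "Male", "Female", "Female", "Male")

counts <- table(gender)

# prop.table() divides each count by the total, so the shares sum to 1
props <- prop.table(counts)

# The same shares computed by hand
manual <- counts / sum(counts)

# A ratio in the strict sense compares one group with the other
female_to_male <- counts[["Female"]] / counts[["Male"]]

props
female_to_male
```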

These proportions can also be represented graphically as follows:

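pct <- round(gender_tab/sum(gender_tab)*100)
#Build labels such as "Female 56%" from the table's own category names
lbs <- paste0(names(gender_tab), " ", pct, "%")
#install.packages("plotrix")  #run once if plotrix is not installed yet
library(plotrix)
pie3D(gender_tab,labels=lbs,
      main="Pie Chart - Ratio of Female and Male")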

Another way to make the pie chart uses base R’s pie() function, with the rounded percentages from above entered by hand:

slices <- c(56, 44)
lbls <- c("Female", "Male")
pct <- round(slices/sum(slices)*100)                      
lbls <- paste(lbls, pct) 
lbls <- paste(lbls,"%",sep="")
pie(slices,labels = lbls, col = c("pink", "slategray4"),
    main="Pie Chart Depicting Ratio of Female and Male")

Now, I describe customers’ ages using histograms:

hist(customers_data$Age,
     col="blue",
     main="Histogram to Show Count of Age Class",
     xlab="Age Class",
     ylab="Frequency",
     labels=TRUE)

Here is another method that presents age by gender:

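#Cross-tabulate gender (rows) by age (columns)
age_tab <- table(customers_data$Gender, customers_data$Age)
age_tab

#Stacked bars: one bar per age, split by gender
barplot(age_tab,main="Bar Plot of Age by Gender",
        ylab="Count",
        xlab="Age",
        col=rainbow(2),
        legend.text=rownames(age_tab))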

I can also use a boxplot (even though it is a bit boring):

boxplot(customers_data$Age,
        col="blue",
        main="Boxplot for Descriptive Analysis of Age")

From these plots, I see that customers in their thirties constitute an especially large group in the data set.
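
The observation about age groups can also be checked with a table instead of reading it off a plot. This is a minimal sketch on made-up ages (the vector `ages` is hypothetical); it uses cut() to bin ages into decades and then counts the customers in each bin.

```r
# Hypothetical ages, made up for illustration (not taken from Mall_Customers)
ages <- c(18, 23, 27, 31, 32, 35, 38, 39, 45, 52, 67)

# Bin the ages into decades; right = FALSE makes bins like [30,40)
age_group <- cut(ages, breaks = seq(10, 80, by = 10), right = FALSE)

# Count the values in each decade
group_counts <- table(age_group)
group_counts

# Label of the largest group
largest_group <- names(group_counts)[which.max(group_counts)]
largest_group
```

Applied to `customers_data$Age`, the same steps would show how the thirties compare with the other decades.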

Annual Income can be represented in the same way:

library(ggplot2)
summary(customers_data$Annual_Income)
hist(customers_data$Annual_Income,
     col="#9467bd",
     main="Histogram for Annual Income",
     xlab="Annual Income Class",
     ylab="Frequency",
     labels=TRUE)

Annual income ranges from $15 k to $137 k, with an average of $60.56 k.

We can also describe annual income by gender as follows:

income_plot <- ggplot(customers_data, aes(x = Annual_Income, fill = Gender)) +
  geom_histogram(bins = 50)
income_plot
income_plot+
  scale_fill_manual(values = c("Female" = "#1b98e0",
                               "Male" = "#f6f805"))

This is the histogram of the spending score by gender (I change the colors to differentiate it from the previous image):

spscore_plot <- ggplot(customers_data, aes(x = Spending_Score, fill = Gender)) +
  geom_histogram(bins = 50)
spscore_plot
spscore_plot+
  scale_fill_manual(values = c("Female" = "#e377c2",
                               "Male" = "#1f77b4"))

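The histograms by gender can be complemented with a numeric summary per group. Here is a minimal sketch using base R’s tapply() on a made-up data frame (`toy` and its values are hypothetical; the column names follow the ones used in this post). It computes the mean spending score for each gender.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Spending_Score = c(39, 81, 6, 77, 40)
)

# Apply mean() to Spending_Score separately for each gender
score_by_gender <- tapply(toy$Spending_Score, toy$Gender, mean)

score_by_gender
```

Swapping `mean` for `median` or `sd`, or `Spending_Score` for `Annual_Income`, gives the other per-gender summaries in the same way.
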
References:

DataFlair Team. (2019, July 31). Data Science Project – Customer Segmentation using Machine Learning in R – DataFlair. DataFlair. https://data-flair.training/blogs/r-data-science-project-customer-segmentation/
