Customer Segmentation for Starbucks

Apr 28, 2021

Business Understanding:

Since Starbucks has a wide customer base spread around the world, it needs to segment its customers on various factors so as to offer the right product to the right customers at the right time. Starbucks is also expanding into more cities locally and more countries globally.

With the growing range of products and the increasingly diverse customer demographics, the objective here is to find patterns in customer behavior. Customers can be segmented in different ways: by demographic, psychographic, geographic, and behavioural factors.

Here, since we are provided with the customer data, we are going to segment the customers based on their Demographics and find the type of offer they mostly respond to.

For unsupervised learning, we will segment customers based on age and income to see how each age and income group responds to each offer type.
In real life, predicting customer response from demographic data gives very low accuracy, because the same customer will react differently to the same offer on different occasions depending on several factors. It would be much better to predict the probabilities of outcomes rather than just predicting success or failure.

Therefore, we are going to use these 5 classification algorithms that support probability estimation:

LogisticRegression
RandomForestClassifier
KNeighborsClassifier
GaussianNB
DecisionTreeClassifier.
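
To make "probability estimation" concrete, here is a minimal, dependency-free sketch of how a nearest-neighbours model can turn votes into class probabilities: the probability of each offer type is the share of the k closest training points that carry that label. The function name knn_predict_proba, the scaled (age, income) pairs, and the labels are all made up for illustration; this is not the code from the notebooks.

```python
from collections import Counter
import math

def knn_predict_proba(train_X, train_y, query, k=3):
    """Return {label: share of the k nearest neighbours carrying that label}."""
    # Rank training rows by Euclidean distance to the query point.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}

# Hypothetical, pre-scaled (age, income) pairs with made-up offer labels.
train_X = [(0.2, 0.3), (0.25, 0.35), (0.3, 0.2), (0.8, 0.9), (0.85, 0.8)]
train_y = ["bogo", "bogo", "discount", "discount", "informational"]

proba = knn_predict_proba(train_X, train_y, query=(0.22, 0.3), k=3)
print(proba)
```

Here two of the three nearest neighbours are labelled bogo and one discount, so the estimate is two thirds for bogo and one third for discount, which is more informative than a bare "bogo" prediction.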

After choosing the model with the best accuracy, we will find the best offer_type to send to a particular customer based on these 15 features: ‘time’, ‘gender’, ‘age’, ‘income’, ‘start_year’, ‘reward’, ‘difficulty’, ‘duration’, ‘email’, ‘mobile’, ‘social’, ‘web’, ‘bogo’, ‘discount’, ‘informational’.

You can follow along with the Jupyter notebooks from my GitHub repository.

Data Understanding:

The data, provided by Starbucks, is simulated data that mimics customer behavior and comes in three data files. Ten different offers were sent to customers through four channels: web, email, mobile, and social. They fall into three offer types: buy-one-get-one (BOGO), discount, and informational. The data contains information about 17,000 customers receiving offers, opening offers, completing offers, and making transactions.

The data consists of three JSON files:

portfolio.json — metadata about each offer (duration, type, etc.)
profile.json — demographic data for each customer
transcript.json — records for transactions, offers received, offers viewed, and offers complete

Here is the schema and explanation of each variable in the files:

portfolio.json- 10 rows, 6 columns

id (string) — offer id
offer_type (string) — type of offer ie BOGO, discount, informational
difficulty (int) — minimum required spend to complete an offer
reward (int) — reward given for completing an offer
duration (int) — time for offer to be open, in days
channels (list of strings)

profile.json- 17000 rows, 5 columns.

age (int) — age of the customer
became_member_on (int) — date when customer created an app account
gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
id (str) — customer id
income (float) — customer’s income

transcript.json- 306534 rows, 4 columns.

event (str) — record description (ie transaction, offer received, offer viewed, etc.)
person (str) — customer id
time (int) — time in hours since start of test. The data begins at time t=0
value — (dict of strings) — either an offer id or transaction amount depending on the record

Cleaning each dataset:

All three data files are cleaned and converted into formats suitable for further analysis.

Clean Portfolio Data:

One-hot encode channels
One-hot encode offer_type column
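
As a rough illustration of the two one-hot steps above, the sketch below turns the channels list and the offer_type string into 0/1 indicator columns using plain Python. The three rows and their ids are invented, and the helper name one_hot_offer is hypothetical; the notebooks may well do this with a dataframe library instead.

```python
# Hypothetical rows shaped like portfolio.json; the values are made up.
portfolio = [
    {"id": "offer_a", "offer_type": "bogo", "channels": ["email", "mobile", "social"]},
    {"id": "offer_b", "offer_type": "discount", "channels": ["web", "email"]},
    {"id": "offer_c", "offer_type": "informational", "channels": ["web", "email", "mobile"]},
]

CHANNELS = ["email", "mobile", "social", "web"]
OFFER_TYPES = ["bogo", "discount", "informational"]

def one_hot_offer(row):
    """Replace the channels list with indicator columns and add one per offer type."""
    encoded = {key: value for key, value in row.items() if key != "channels"}
    for channel in CHANNELS:
        encoded[channel] = int(channel in row["channels"])
    # offer_type itself is kept, since it is used later as the target.
    for offer_type in OFFER_TYPES:
        encoded[offer_type] = int(row["offer_type"] == offer_type)
    return encoded

cleaned_portfolio = [one_hot_offer(row) for row in portfolio]
print(cleaned_portfolio[0])
```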

Cleaned Portfolio data frame:

Clean Profile Data

Check for null values
Check the age column for extreme values (118)
Drop rows with no gender, no income, or an age of 118
Convert the became_member_on column to a readable date format
Extract the year from became_member_on into a new start_year column (for further analysis)
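
The profile steps above could look roughly like this in plain Python. The sample rows are invented, the helper name clean_profile is hypothetical, and the code assumes that became_member_on is an integer in YYYYMMDD form, which the schema above does not spell out.

```python
from datetime import datetime

# Hypothetical rows shaped like profile.json; the values are made up.
profile = [
    {"id": "c1", "age": 55, "gender": "F", "income": 72000.0, "became_member_on": 20170715},
    {"id": "c2", "age": 118, "gender": None, "income": None, "became_member_on": 20180212},
    {"id": "c3", "age": 41, "gender": "M", "income": 51000.0, "became_member_on": 20151103},
]

def clean_profile(rows):
    cleaned = []
    for row in rows:
        # Drop placeholder rows: age 118 or a missing gender or income.
        if row["age"] == 118 or row["gender"] is None or row["income"] is None:
            continue
        # Assumption: became_member_on is an integer in YYYYMMDD form.
        joined = datetime.strptime(str(row["became_member_on"]), "%Y%m%d").date()
        cleaned.append({**row, "became_member_on": joined, "start_year": joined.year})
    return cleaned

cleaned_profile = clean_profile(profile)
print(cleaned_profile)
```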

Cleaned Profile data frame:

Clean Transcript Data

Create separate amount and offer_id columns from the value column dictionary
Merge the three datasets on their common columns
Separate the offer data from the transaction data in the transcript
Label-encode the columns offer_id, offer_type, gender, and the unique customer ids to convert them into the integer data type
Create an offers dataframe by separating offer events from transactions in the event column
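
A minimal sketch of the first and last steps above, splitting the value dictionary and then separating offers from transactions, might look like this. The rows are invented, the helper name flatten_value is hypothetical, and the dictionary keys "offer id", "offer_id", and "amount" are assumptions about the raw file rather than something stated in the schema.

```python
# Hypothetical rows shaped like transcript.json. The dictionary keys
# "offer id", "offer_id", and "amount" are assumptions about the raw data.
transcript = [
    {"person": "c1", "event": "offer received", "time": 0, "value": {"offer id": "offer_a"}},
    {"person": "c1", "event": "transaction", "time": 6, "value": {"amount": 12.5}},
    {"person": "c1", "event": "offer completed", "time": 6, "value": {"offer_id": "offer_a"}},
]

def flatten_value(row):
    """Replace the value dictionary with explicit offer_id and amount columns."""
    value = row["value"]
    flat = {key: val for key, val in row.items() if key != "value"}
    flat["offer_id"] = value.get("offer id", value.get("offer_id"))
    flat["amount"] = value.get("amount")
    return flat

flat_rows = [flatten_value(row) for row in transcript]

# Split the event column into offer events and plain transactions.
offers = [row for row in flat_rows if row["event"] != "transaction"]
transactions = [row for row in flat_rows if row["event"] == "transaction"]
print(len(offers), len(transactions))
```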

Cleaned Transcript data frame:

Offer dataframe:

Exploratory Data Analysis

We carry out exploratory data analysis on the above datasets to answer the following questions about customer segmentation.

Here, since we want to find which offer type (BOGO, discount, or informational) customers respond to most, the offer_type column is our target variable and the remaining columns are our feature variables.

1. What is the Gender Distribution of Starbucks Customers?

Ans1: The number of males (around 9000) is higher than the number of females (around 6000), with a very small number of others.

2. What is the Age Distribution and average age of Starbucks Customers?

Ans2: Customers aged 40–70 visit Starbucks most frequently, with an average age of 54 years. One possible reason is a more settled life after 40.

3. What is the Income Distribution and average Income of Starbucks Customers?

Ans3: The number of customers decreases above an income of 70k, which suggests that people spend less on coffee as their income increases. The average income is 65k.

4. How many customers enrolled yearly?

Ans4: Starbucks membership grew exponentially from 2013 and peaked in 2017, when 5599 customers enrolled, after which it declines steadily.

5. Which gender has the highest yearly membership?

Ans5: As Starbucks grew in popularity, yearly enrollment rose exponentially and reached its zenith in 2017.
Every year, more men joined than women, and very few customers of other genders joined.

6. Which gender has the highest Annual income?

Ans6: The highest and lowest incomes are approximately the same for males and females; for others, the range is narrower at both ends.
The median income (the white dot) for females (around 70k) is higher than for males and others (around 60k).
For females, the income spreads from 40k to 100k.
For males, most of the spread is around 40k to 70k, which is close to the median.
For others, the spread is around 60k.
The count of male customers at low income levels is slightly higher than that of female and other customers.

7. What is the distribution of events in the transcript?

Ans7: We can see that most of the transcript records are transactions.
Around 75% of the offers received were viewed, and nearly 50% of the viewed offers were completed.

8. What is the percentage of transactions and offers in the events?

Ans8: Nearly 45.5% are transactions and 54.5% are offers.
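
The percentages in the two answers above are simple ratios of event counts. As a small sketch with an invented event list (not the real 306534 rows), they could be computed like this:

```python
from collections import Counter

# Hypothetical event log; the counts are made up for illustration.
events = (
    ["transaction"] * 5
    + ["offer received"] * 3
    + ["offer viewed"] * 2
    + ["offer completed"] * 1
)

counts = Counter(events)
total = sum(counts.values())

# Share of all records that are transactions versus offer events.
transaction_share = counts["transaction"] / total
offer_share = 1 - transaction_share

# Funnel ratios: viewed out of received, completed out of viewed.
view_rate = counts["offer viewed"] / counts["offer received"]
completion_rate = counts["offer completed"] / counts["offer viewed"]

print(f"transactions: {transaction_share:.1%}, offers: {offer_share:.1%}")
print(f"viewed/received: {view_rate:.1%}, completed/viewed: {completion_rate:.1%}")
```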

9. What are the types of offers received, viewed, and completed?

Ans9: Customers received more BOGO and discount offers than informational offers.
BOGO offers were viewed the most.
Discount offers were completed the most, and no informational offers were completed.
Hence, to get more offers completed, more discount offers should be sent to customers.
BOGO has also been a good offer, since a high number of customers view such offers.

10. What is the Income Distribution for the Offer Events?

Ans10: The 50–60k income group received the most offers and the 110–120k group the fewest.
The 50–60k group also completed the most offers; completions decrease on either side, falling more steeply toward the higher income groups.
Starbucks has fewer customers in the higher income groups.

11. What are the Offer types amongst ages, gender and income groups?

Ans11: We can see from the above graphs that BOGO is slightly more popular across the age, gender, and income groups.
The 50–59 age group responds to these offers more than the other groups.
Also, in the income distribution, the informational offer amounts to roughly 50% of each of the other two.
More males than females respond to these offers, with BOGO as the leading type.
To sum up, the active Starbucks respondents come mostly from the 50–59 age group, are more often male, and have an annual income of 50–60k.

12. What is the highest completed offer?

Ans12: Among the completed offers, the offer_id that was completed most often is ‘fafdcd668e3743c1bb461111dcafc2a4’, with a total of 4957 completions.

13. What is the lowest completed offer?

Ans13: Among the completed offers, the offer_id that was completed least often is ‘4d5c57ea9a6940dd891ad53e9dbe8da0’, with a total of 3281 completions.

Data Modelling

Unsupervised Learning Using Kmeans Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). It is an iterative algorithm that partitions the unlabeled data into K distinct non-overlapping clusters. Each data point belongs to only one cluster.

The graph makes an elbow at 2, so the optimal number of clusters for the dataset is 2.
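
To make the elbow idea concrete, here is a small, dependency-free sketch of K-means (Lloyd's algorithm) that also reports the inertia, i.e. the sum of squared distances from each point to its nearest centroid. The (age, income in thousands) points are invented, and starting from the first k points is a simplification chosen to keep the example deterministic; it is not necessarily how the notebooks initialise or fit the clusters.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as the initial centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Hypothetical (age, income in thousands) pairs forming two loose groups.
points = [(35, 40), (62, 95), (40, 45), (38, 50), (58, 100), (65, 90), (42, 42), (60, 105)]

inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3, 4)}
centroids_2, _ = kmeans(points, 2)
for k, value in inertia_by_k.items():
    print(k, round(value, 2))
```

On this toy data the inertia drops sharply from k=1 to k=2 and only a little afterwards, which is the "elbow" shape that suggests two clusters.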

We fit and predict using the offers_df dataset, and examine the offer_type with respect to the annual income and gender of the customers using the 2 clusters obtained from K-means clustering.

For Cluster 1, the income ranges from 30000.0 to 68000.0.
For Cluster 2, the income ranges from 69000.0 to 120000.0.

Compared to the BOGO and discount offers, the informational offers are much less popular.
In a few cases the discount offers are used more than the BOGO offers:
in cluster 1 at income=51000 and income=52000, and
in cluster 2 at income=76000 and income=77000.
Since the income is unevenly distributed, it can also be concluded that the annual income is independent of the purchasing behaviour.

Compared to the BOGO and discount offers, the informational offers are less popular.
It can thus be concluded that males in the above income range tend to spend more than females and other genders on the BOGO and discount offers.
It can thus be concluded that females in the income range 71000.0 to 120000.0 tend to spend more than males and other genders on the BOGO and discount offers.

Supervised Learning

Supervised learning will help to predict the correct offer_type to send to each customer.

The offers_df is split into features and a target: the target is the offer_type column, and the features are the remaining columns (time, gender, age, income, start_year, reward, difficulty, duration, email, mobile, social, web, bogo, discount, informational).

Training and testing sets are created using train_test_split.

Since it is a classification problem, we will use accuracy to evaluate the models.
We compare the number of correct predictions with the total number of predictions to determine the accuracy of each model and choose the best.
Five different ML algorithms are tested on the dataset:
1. Decision Trees
2. Logistic Regression
3. Nearest Neighbours (KNN)
4. Naive Bayes
5. Random Forest
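
The accuracy measure described above is just correct predictions divided by total predictions. A minimal sketch, with invented labels that use the BOGO = 1, discount = 2, informational = 3 encoding:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: BOGO = 1, discount = 2, informational = 3.
y_test = [1, 2, 3, 1, 2, 1, 3, 2, 1, 2]
y_pred = [1, 2, 3, 2, 2, 1, 1, 2, 1, 2]

test_accuracy = accuracy(y_test, y_pred)
print(test_accuracy)  # 8 of the 10 predictions match
```

Computing the same number on both the training and the testing split for each model gives the comparison used to pick a classifier.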

We get the accuracy results:

The accuracy score is 100% on both the training and testing datasets for RandomForestClassifier, GaussianNB, and DecisionTreeClassifier, which can indicate overfitting. A perfect score may also reflect the bogo, discount, and informational feature columns, which are one-hot encodings of offer_type.
Logistic regression has a very low training accuracy of 0.50 and testing accuracy of 0.52.
So, we choose KNeighborsClassifier.
It has good results: 0.93 on the training dataset and 0.82 on the testing dataset.
Since we have only a few discrete outcomes (BOGO = 1, discount = 2, informational = 3), KNeighborsClassifier is a suitable choice.

Hyperparameter tuning of KNeighborsClassifier to increase the accuracy

It is possible to improve the performance of a model over its base instance by tuning the hyperparameters of the algorithm.
We define a range of values to evaluate in the hyperparameter space of the KNeighborsClassifier model using RandomizedSearchCV.

The best scores after tuning were achieved with the hyperparameters {‘p’: 1, ‘n_neighbors’: 21, ‘leaf_size’: 4}: training accuracy 0.93 and testing accuracy 0.93.
Testing accuracy has increased from 0.82 to 0.93 after hyperparameter tuning.
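
The idea behind a randomized search can be sketched without any library: sample a handful of parameter combinations at random, score each one, and keep the best. The version below is a simplification under stated assumptions. It scores on a single hold-out set rather than with cross-validation, uses a brute-force nearest-neighbours vote (so leaf_size has no equivalent here), and runs on invented data; all function names are hypothetical.

```python
import random
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, query, n_neighbors, p):
    """Majority vote among the n_neighbors closest (features, label) pairs."""
    ranked = sorted(train, key=lambda row: minkowski(row[0], query, p))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]

def holdout_accuracy(train, valid, n_neighbors, p):
    hits = sum(knn_predict(train, x, n_neighbors, p) == y for x, y in valid)
    return hits / len(valid)

def randomized_search(train, valid, space, n_iter, seed=0):
    """Score n_iter randomly sampled parameter sets and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = holdout_accuracy(train, valid, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical, pre-scaled feature pairs with labels 1 (BOGO) and 2 (discount).
train = [((0.1, 0.2), 1), ((0.2, 0.1), 1), ((0.15, 0.25), 1),
         ((0.8, 0.9), 2), ((0.9, 0.8), 2), ((0.85, 0.75), 2)]
valid = [((0.12, 0.18), 1), ((0.88, 0.82), 2)]
space = {"n_neighbors": [1, 3, 5], "p": [1, 2]}

best_params, best_score = randomized_search(train, valid, space, n_iter=5)
print(best_params, best_score)
```

Sampling a fixed number of combinations keeps the cost bounded even when the parameter grid is large, which is the main reason to prefer it over an exhaustive grid search.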

Evaluate the model accuracy:

Let's pick a random customer from our data and test the model on it. We have taken a random customer from the offers_df dataset and checked its offer_type, which is BOGO.

Let's evaluate the feature parameters of this customer, check the predicted offer type, and thereby check the model's accuracy.

The model has correctly predicted that the customer will likely respond to the BOGO offer type, with an accuracy of 90%.
Hence our model has good accuracy for prediction.

Conclusion:

Segmentation of Starbucks Customers:

The customers can be segmented on various parameters according to the campaign chosen.
On analysing the data using supervised and unsupervised learning (K-means), we can conclude that:
Different segments of customers react to offers differently.
The count of male customers at low income levels is slightly higher than that of female and other customers.
Though the average income of females is greater than that of males, females spend less at Starbucks than males.
Starbucks has more of the young crowd than the older ones.
The offer_type was predicted by training a supervised classifier.

Results:

Customers are attracted to BOGO and discount offers more than to informational offers.
The buying behaviour of a customer is independent of their annual income.
Starbucks has more male customers than female and other-gender customers.
KNeighborsClassifier turned out to be the best algorithm for this task and predicts customer response with an accuracy of almost 93% after hyperparameter tuning, even though the same customer may react differently to the same offer.