
K Means Algorithm With Real Life Problem

It has a complicated name, but it is simple and a popular unsupervised machine learning technique.

The idea is to create k centroids and then allocate every data point to the cluster of the nearest centroid, while keeping the number of centroids fixed.

               


Let’s explore this technique with an example. Here we have data from an online tea store, with details of each customer, their account creation date, and their purchase patterns. We are interested in what makes customers come back to the store. Retention is one of the biggest mysteries in any industry.

Let us quickly open our Jupyter notebook.

# Import Necessary Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

Read the file in Pandas,

# Importing Data

#Import Dataset

cr = pd.read_csv("D:\\storedata_total.csv", encoding="ISO-8859-1")

cr.head()



Explore the dataset and understand it before treating it. Then treat and manipulate the data according to the need. Check for nulls, NAs, and redundancy.
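A minimal sketch of these checks, shown here on a small stand-in frame (in the notebook you would run the same calls on the cr dataframe loaded above):

```python
import pandas as pd

# Stand-in frame; in the notebook this is the cr dataframe loaded above
cr = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "retained": [1, 0, 0, None],
})

print(cr.isnull().sum())       # null count per column
print(cr.duplicated().sum())   # number of fully duplicated rows

# Drop exact duplicates and rows with missing values
cr = cr.drop_duplicates().dropna()
print(cr.shape)
```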

Building a barplot of overall customer retention status


#building a barplot of overall customer retention status

plt.figure(figsize=(9,9))

ax = sns.countplot(x='retained', data=cr)

plt.title("Overall Customer Retention")

plt.xlabel("Retention Status")

plt.ylabel("Number of Customers")

for p in ax.patches:

    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()),

            fontsize=12, color='red', ha='center', va='bottom')   

 

#majority of the records have retained customers


                            

Roughly 80% of the customers are retained, i.e., assumed to be active in the system.
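The scaling step below operates on Log_Tfd_Data, a log-transformed RFM (Recency, Frequency, Monetary) table built in steps not shown in this post. A minimal sketch of how such a table could be derived, using hypothetical order-level columns as a stand-in for the store data:

```python
import numpy as np
import pandas as pd

# Hypothetical order-level data standing in for the store dataset
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_value": [10.0, 20.0, 5.0, 8.0, 12.0, 30.0],
    "days_since_order": [5, 40, 100, 2, 60, 200],
})

# Recency (days since last order), Frequency (order count), Monetary (total spend)
RFMScores = orders.groupby("customer_id").agg(
    Recency=("days_since_order", "min"),
    Frequency=("order_value", "count"),
    Monetary=("order_value", "sum"),
)

# Log-transform to reduce skew; log1p is safe for zero values
Log_Tfd_Data = np.log1p(RFMScores)
print(Log_Tfd_Data.round(3))
```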

Bring the data on same scale.

from sklearn.preprocessing import StandardScaler

#Bring the data on same scale

scaleobj = StandardScaler()

Scaled_Data = scaleobj.fit_transform(Log_Tfd_Data)

#Transform it back to dataframe

Scaled_Data = pd.DataFrame(Scaled_Data, index = RFMScores.index, columns = Log_Tfd_Data.columns)

The main objective of the K-Means algorithm is to minimize the sum of distances between the data points and their respective cluster centroids. A data point is initially selected at random as the centroid of each cluster, and the remaining data points are grouped with whichever centroid is nearest.

We start by creating an elbow graph in order to understand how many clusters of customers are optimal.
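The code that produced the elbow graph is not shown in the post; a sketch of how it could be built, plotting the within-cluster sum of squares (inertia) against k (synthetic blobs stand in for the Scaled_Data computed above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for the Scaled_Data built above: four well-separated blobs
rng = np.random.RandomState(200)
Scaled_Data = np.vstack([rng.randn(50, 3) + c for c in (0, 5, 10, 15)])

# Within-cluster sum of squares (inertia) for each candidate k
wcss = []
for k in range(1, 13):
    km = KMeans(n_clusters=k, init="k-means++", random_state=200)
    km.fit(Scaled_Data)
    wcss.append(km.inertia_)

plt.plot(range(1, 13), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Graph")
plt.show()
```

The "elbow" is the k after which adding more clusters stops reducing the inertia substantially.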



We can see that our graph has a smooth bend starting at point 4, which suggests four as the number of clusters; to reconfirm, we also compute silhouette scores.

from sklearn.cluster import KMeans

import sklearn.metrics as metrics 

# Silhouette score for k(clusters)

 

for i in range(2,13):

    labels=KMeans(n_clusters=i,init="k-means++",random_state=200).fit(Scaled_Data).labels_

    print ("Silhouette score for k(clusters) = "+str(i)+" is "

           +str(metrics.silhouette_score(Scaled_Data,labels,metric="euclidean",random_state=200)))

Silhouette score for k(clusters) = 2 is 0.9733454076816606

Silhouette score for k(clusters) = 3 is 0.6007113113327599

Silhouette score for k(clusters) = 4 is 0.5242326198755917

Silhouette score for k(clusters) = 5 is 0.5443578715214975

Silhouette score for k(clusters) = 6 is 0.4679776839296599

Silhouette score for k(clusters) = 7 is 0.47054469398650856

Silhouette score for k(clusters) = 8 is 0.4789560824402456

Silhouette score for k(clusters) = 9 is 0.49011300723523726

Silhouette score for k(clusters) = 10 is 0.4921726855136204

Silhouette score for k(clusters) = 11 is 0.49249945906775494

Silhouette score for k(clusters) = 12 is 0.49105683613109047

After calculating the silhouette score for each k, we compare the elbow curve and the silhouette scores, and we choose K = 4.

Perform K-Means clustering, i.e., build the K-Means clustering model
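The model fit itself is not shown in the post; a minimal sketch of this step with the chosen k = 4, including a hypothetical cluster-to-color mapping that produces the Color column plotted below (synthetic blobs stand in for Scaled_Data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the Scaled_Data built above: four separated blobs
rng = np.random.RandomState(200)
Scaled_Data = pd.DataFrame(
    np.vstack([rng.randn(40, 3) + c for c in (0, 4, 8, 12)]),
    columns=["Recency", "Frequency", "Monetary"],
)

# Fit K-Means with the chosen k = 4 and label every customer
km = KMeans(n_clusters=4, init="k-means++", random_state=200)
clusters = km.fit_predict(Scaled_Data)

# Hypothetical cluster-to-color mapping, matching the labels plotted below
color_map = {0: "blue", 1: "yellow", 2: "red", 3: "green"}
RFMScores = Scaled_Data.copy()
RFMScores["Color"] = pd.Series(clusters, index=Scaled_Data.index).map(color_map)
print(RFMScores["Color"].value_counts())
```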

label = RFMScores.Color.value_counts()

sns.set_style("darkgrid")

plt.figure(figsize=(10,4))

colors = ['#0000FF','#FFFF00',"#ff0000", "#00FF00"]

# Set your custom color palette

sns.set_palette(sns.color_palette(colors))

sns.barplot(x=label.index, y=label.values)

plt.show()





RFMScores['Color'].value_counts()


blue      17100

yellow     9364

red        4290

green         9

Name: Color, dtype: int64

Further along, we can see that we have four clusters, coded blue, yellow, red, and green: the data points concentrated near the first centroid are coded blue, and similarly for the rest. Since we have already explored the data and know that most of our customers are retained, it is safe to say that the data points around the first centroid represent our most loyal customers with high average orders. This brings us to the wrap of our K-Means walkthrough. To model further, visit my GitHub.

Below is the link to my GitHub page for the dataset and detailed code.

GITHUB-Devd

Hope you find it useful and informative. Follow my blog for more Data related quests.


-DataDevil

Honey Saini
