K Means Algorithm With Real Life Problem

K-Means creates k centroids and then allocates every data point to the cluster of its nearest centroid, while keeping the number of centroids fixed.
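As a minimal illustration of the two alternating steps (assign each point to its nearest centroid, then recompute each centroid as the mean of its points), here is a toy sketch in NumPy; the data and iteration count are made up for demonstration:

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Toy K-Means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs -> two clean clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # first three points share one label, last three the other
```

scikit-learn's `KMeans`, used later in this post, implements the same idea with smarter initialisation (`k-means++`).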

Let us quickly open our Jupyter notebook.
# Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Read the file with pandas:
# Import dataset
cr = pd.read_csv("D:\\storedata_total.csv", encoding="ISO-8859-1")
cr.head()

Explore the dataset and understand it before treating it; clean and manipulate the data according to your needs. Check for nulls, NAs, and redundancy.
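Those checks might look like the following; the tiny stand-in frame and its columns are assumptions for illustration, since the real file is read with `pd.read_csv` above:

```python
import pandas as pd

# Tiny stand-in for the store data (hypothetical columns)
cr = pd.DataFrame({
    "custid":   [1, 2, 2, 3],
    "retained": [1, 0, 0, None],
})

print(cr.isnull().sum())      # null/NA count per column
print(cr.duplicated().sum())  # number of fully duplicated rows

# Drop redundant rows and rows with missing values
cr = cr.drop_duplicates().dropna()
print(cr.shape)
```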
# Build a barplot of the overall customer retention status
plt.figure(figsize=(9,9))
ax = sns.countplot('retained', data=cr)
plt.title("Overall Customer Retention")
plt.xlabel("Retention Status")
plt.ylabel("Number of Customers")
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2.,
            p.get_height(), '%d' % int(p.get_height()),
            fontsize=12, color='red',
            ha='center', va='bottom')
# Majority of the records are retained customers

Bring the data onto the same scale.
from sklearn.preprocessing import StandardScaler

# Bring the data onto the same scale
scaleobj = StandardScaler()
Scaled_Data = scaleobj.fit_transform(Log_Tfd_Data)

# Transform it back to a DataFrame
Scaled_Data = pd.DataFrame(Scaled_Data, index=RFMScores.index,
                           columns=Log_Tfd_Data.columns)
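The `Log_Tfd_Data` frame above comes from a log transform of the RFM scores, a step not shown in this excerpt. As a sketch of what it could look like (the `RFMScores` table and its column names here are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical RFM score table; in the post, RFMScores is built from the store data
RFMScores = pd.DataFrame({
    "Recency":   [10, 200, 35],
    "Frequency": [50, 1, 12],
    "Monetary":  [400.0, 20.0, 150.0],
})

# Log-transform to tame the skew before standardising
Log_Tfd_Data = RFMScores[["Recency", "Frequency", "Monetary"]].apply(np.log1p)
print(Log_Tfd_Data.round(2))
```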
The main objective of the K-Means algorithm is to minimize the sum of distances between the data points and their respective cluster centroid. A random data point is selected as the centroid of each cluster, and the remaining data points are grouped with the centroid they are closest to.
We started by creating an elbow graph in order to understand how many clusters of customers are optimal.
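The elbow graph itself can be produced by plotting inertia (within-cluster sum of squares) against k; a sketch under the assumption that `Scaled_Data` is available from the previous step (synthetic blobs stand in for it here):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for Scaled_Data: four well-separated synthetic blobs
rng = np.random.default_rng(42)
Scaled_Data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                         for c in ([0, 0], [5, 0], [0, 5], [5, 5])])

# Inertia for each candidate k
inertias = []
ks = range(1, 13)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++",
                random_state=200, n_init=10).fit(Scaled_Data)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow curve")
plt.savefig("elbow.png")
```

The "elbow" is the k beyond which adding more clusters yields only marginal inertia reduction.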

We can see that our graph has a smooth curve that flattens from point 4, which implies four clusters; to reconfirm, we compute the silhouette score.
import sklearn.metrics as metrics
from sklearn.cluster import KMeans

# Silhouette score for each k (clusters)
for i in range(2, 13):
    labels = KMeans(n_clusters=i, init="k-means++", random_state=200).fit(Scaled_Data).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(Scaled_Data, labels, metric="euclidean", random_state=200)))
Silhouette score for k(clusters) = 2 is 0.9733454076816606
Silhouette score for k(clusters) = 3 is 0.6007113113327599
Silhouette score for k(clusters) = 4 is 0.5242326198755917
Silhouette score for k(clusters) = 5 is 0.5443578715214975
Silhouette score for k(clusters) = 6 is 0.4679776839296599
Silhouette score for k(clusters) = 7 is 0.47054469398650856
Silhouette score for k(clusters) = 8 is 0.4789560824402456
Silhouette score for k(clusters) = 9 is 0.49011300723523726
Silhouette score for k(clusters) = 10 is 0.4921726855136204
Silhouette score for k(clusters) = 11 is 0.49249945906775494
Silhouette score for k(clusters) = 12 is 0.49105683613109047
After calculating the silhouette score for each k, we compare the elbow curve with the silhouette scores and choose K=4.
Perform K-Means clustering, i.e. build the K-Means clustering model.
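The fitting step itself is not shown in this excerpt; a sketch of what it might look like, where the synthetic data stands in for `Scaled_Data` and the cluster-to-color mapping is an assumption chosen to match the colors plotted next:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the scaled RFM data: four well-separated 3-D blobs
rng = np.random.default_rng(0)
Scaled_Data = np.vstack([rng.normal(loc=c, scale=0.2, size=(25, 3))
                         for c in ([0, 0, 0], [4, 0, 0], [0, 4, 0], [0, 0, 4])])
RFMScores = pd.DataFrame(Scaled_Data, columns=["Recency", "Frequency", "Monetary"])

# Fit K-Means with the chosen K=4 and attach the labels to the score table
km = KMeans(n_clusters=4, init="k-means++",
            random_state=200, n_init=10).fit(Scaled_Data)
RFMScores["Cluster"] = km.labels_

# Hypothetical label-to-color mapping for the barplot
color_map = {0: "blue", 1: "yellow", 2: "red", 3: "green"}
RFMScores["Color"] = RFMScores["Cluster"].map(color_map)
print(RFMScores["Color"].value_counts())
```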
label = RFMScores.Color.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
colors = ['#0000FF', '#FFFF00', "#ff0000", "#00FF00"]

# Set your custom color palette
sns.set_palette(sns.color_palette(colors))
sns.barplot(x=label.index, y=label.values)
plt.show()

RFMScores['Color'].value_counts()
blue      17100
yellow     9364
red        4290
green         9
Name: Color, dtype: int64
Further along, we can see that we have four clusters, namely blue, yellow, red, and green: most of the data points concentrated near our first centroid are coded blue, and similarly for yellow, red, and green. As we have already explored the data, we know most of our customers are retained, so it's safe to say that the data points around the first centroid are mostly loyal customers with high average orders. This wraps up our K-Means walkthrough. To model further, visit my GitHub.
Below is the link to my GitHub page for the dataset and detailed code.
Hope you find it useful and informative. Follow my blog for more Data related quests.
-DataDevil
Honey Saini