
K Means Algorithm With Real Life Problem

It has a complicated name, but it is simple and a popular unsupervised machine learning technique.

The idea is to create k centroids and then allocate every data point to the cluster of the nearest centroid, while keeping the number of centroids fixed.

               


Let’s explore this technique with an example. Here we have data from an online tea store, with details of each customer, their account creation date, and their purchase patterns. We are interested in what makes customers come back to the store. Retention is one of the biggest mysteries in any industry.

Let us quickly open our Jupyter notebook.

# Import Necessary Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

Read the file in Pandas,

# Importing Data

#Import Dataset

cr = pd.read_csv("D:\\storedata_total.csv", encoding="ISO-8859-1")

cr.head()



Explore the dataset and understand it before treating it. Then treat and manipulate the data according to the need. Check for nulls, NAs, and redundancy.
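A minimal sketch of these checks, shown here on a small stand-in frame (in the notebook you would run the same calls on the cr dataframe loaded above):

```python
import pandas as pd

# Stand-in frame; in the notebook this is the cr dataframe loaded above
cr = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "retained": [1, 0, 0, None],
})

print(cr.isnull().sum())       # null count per column
print(cr.duplicated().sum())   # number of fully duplicated rows

# Drop exact duplicates and rows with missing values
cr = cr.drop_duplicates().dropna()
print(cr.shape)
```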

Building a barplot of overall customer retention status


#building a barplot of overall customer retention status

plt.figure(figsize=(9,9))

ax = sns.countplot(x='retained', data=cr)

plt.title("Overall Customer Retention")

plt.xlabel("Retention Status")

plt.ylabel("Number of Customers")

for p in ax.patches:

    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()),

            fontsize=12, color='red', ha='center', va='bottom')   

 

#majority of the records have retained customers


                            

Roughly 80% of the customers are retained, i.e., assumed to be active in the system.
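The scaling step below operates on Log_Tfd_Data, a log-transformed RFM (Recency, Frequency, Monetary) table built in steps not shown in this post. A minimal sketch of how such a table could be derived, using hypothetical order-level columns as a stand-in for the store data:

```python
import numpy as np
import pandas as pd

# Hypothetical order-level data standing in for the store dataset
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_value": [10.0, 20.0, 5.0, 8.0, 12.0, 30.0],
    "days_since_order": [5, 40, 100, 2, 60, 200],
})

# Recency (days since last order), Frequency (order count), Monetary (total spend)
RFMScores = orders.groupby("customer_id").agg(
    Recency=("days_since_order", "min"),
    Frequency=("order_value", "count"),
    Monetary=("order_value", "sum"),
)

# Log-transform to reduce skew; log1p is safe for zero values
Log_Tfd_Data = np.log1p(RFMScores)
print(Log_Tfd_Data.round(3))
```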

Bring the data on same scale.

from sklearn.preprocessing import StandardScaler

#Bring the data on same scale

scaleobj = StandardScaler()

Scaled_Data = scaleobj.fit_transform(Log_Tfd_Data)

#Transform it back to dataframe

Scaled_Data = pd.DataFrame(Scaled_Data, index = RFMScores.index, columns = Log_Tfd_Data.columns)

The main objective of the K-Means algorithm is to minimize the sum of distances between the data points and their respective cluster centroids. A data point is initially selected at random as the centroid of each cluster, and the remaining data points are grouped with whichever centroid is nearest.

We start by creating an elbow graph in order to understand how many clusters of customers are optimal.
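The code that produced the elbow graph is not shown in the post; a sketch of how it could be built, plotting the within-cluster sum of squares (inertia) against k (synthetic blobs stand in for the Scaled_Data computed above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for the Scaled_Data built above: four well-separated blobs
rng = np.random.RandomState(200)
Scaled_Data = np.vstack([rng.randn(50, 3) + c for c in (0, 5, 10, 15)])

# Within-cluster sum of squares (inertia) for each candidate k
wcss = []
for k in range(1, 13):
    km = KMeans(n_clusters=k, init="k-means++", random_state=200)
    km.fit(Scaled_Data)
    wcss.append(km.inertia_)

plt.plot(range(1, 13), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Graph")
plt.show()
```

The "elbow" is the k after which adding more clusters stops reducing the inertia substantially.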



We can see that our graph has a smooth bend starting at point 4, which suggests four as the number of clusters; to reconfirm, we also compute silhouette scores.

from sklearn.cluster import KMeans

import sklearn.metrics as metrics 

# Silhouette score for k(clusters)

 

for i in range(2,13):

    labels=KMeans(n_clusters=i,init="k-means++",random_state=200).fit(Scaled_Data).labels_

    print ("Silhouette score for k(clusters) = "+str(i)+" is "

           +str(metrics.silhouette_score(Scaled_Data,labels,metric="euclidean",random_state=200)))

Silhouette score for k(clusters) = 2 is 0.9733454076816606

Silhouette score for k(clusters) = 3 is 0.6007113113327599

Silhouette score for k(clusters) = 4 is 0.5242326198755917

Silhouette score for k(clusters) = 5 is 0.5443578715214975

Silhouette score for k(clusters) = 6 is 0.4679776839296599

Silhouette score for k(clusters) = 7 is 0.47054469398650856

Silhouette score for k(clusters) = 8 is 0.4789560824402456

Silhouette score for k(clusters) = 9 is 0.49011300723523726

Silhouette score for k(clusters) = 10 is 0.4921726855136204

Silhouette score for k(clusters) = 11 is 0.49249945906775494

Silhouette score for k(clusters) = 12 is 0.49105683613109047

After calculating the silhouette score for each k, we compare the elbow curve and the silhouette scores, and we choose K = 4.

Perform K-Means clustering, i.e., build the K-Means clustering model
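The model fit itself is not shown in the post; a minimal sketch of this step with the chosen k = 4, including a hypothetical cluster-to-color mapping that produces the Color column plotted below (synthetic blobs stand in for Scaled_Data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the Scaled_Data built above: four separated blobs
rng = np.random.RandomState(200)
Scaled_Data = pd.DataFrame(
    np.vstack([rng.randn(40, 3) + c for c in (0, 4, 8, 12)]),
    columns=["Recency", "Frequency", "Monetary"],
)

# Fit K-Means with the chosen k = 4 and label every customer
km = KMeans(n_clusters=4, init="k-means++", random_state=200)
clusters = km.fit_predict(Scaled_Data)

# Hypothetical cluster-to-color mapping, matching the labels plotted below
color_map = {0: "blue", 1: "yellow", 2: "red", 3: "green"}
RFMScores = Scaled_Data.copy()
RFMScores["Color"] = pd.Series(clusters, index=Scaled_Data.index).map(color_map)
print(RFMScores["Color"].value_counts())
```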

label = RFMScores.Color.value_counts()

sns.set_style("darkgrid")

plt.figure(figsize=(10,4))

colors = ['#0000FF','#FFFF00',"#ff0000", "#00FF00"]

# Set your custom color palette

sns.set_palette(sns.color_palette(colors))

sns.barplot(x=label.index, y=label.values)

plt.show()





RFMScores['Color'].value_counts()


blue      17100

yellow     9364

red        4290

green         9

Name: Color, dtype: int64

Further along, we can see that we have four clusters, coded blue, yellow, red, and green: the data points concentrated near the first centroid are coded blue, and similarly for the rest. Since we have already explored the data and know that most of our customers are retained, it is safe to say that the data points around the first centroid represent our most loyal customers with high average orders. This brings us to the wrap of our K-Means walkthrough. To model further, visit my GitHub.

Below is the link to my GitHub page for the dataset and detailed code.

GITHUB-Devd

Hope you find it useful and informative. Follow my blog for more Data related quests.


-DataDevil

Honey Saini
