In chapter 2, we built the RFM segments simply by binning each variable into quantiles independently. KMeans, by contrast, takes the relationships among the variables into account.
from sklearn.cluster import KMeans
import pandas as pd
datamart_rfm = pd.read_csv("data/chapter_4/datamart_rfm.csv")
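# Fit a three-cluster model; random_state makes the result reproducible
kmeans = KMeans(n_clusters=3, random_state=1)
# Note: fit() uses every column of datamart_rfm as loaded from the CSV
kmeans.fit(datamart_rfm)
# Attach each customer's cluster label to the RFM table
cluster_labels = kmeans.labels_
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)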
datamart_rfm_k3.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)
The clustering looks reasonable. Let's compare it with the quantile-based segmentation from chapter 2.
pd.read_csv("data/chapter_2/datamart_rfm_scores_named_segment.csv").groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)
Unlike the quantile-based segments, the KMeans clusters are not constrained to be of similar size.
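The difference in group sizes can be illustrated on toy data. This is a minimal sketch, not the chapter's dataset: the `toy` table and its 90/10 split of low and high spenders are invented for illustration, and `pd.qcut` stands in for the quantile binning used in chapter 2.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: 90 low spenders and 10 high spenders
rng = np.random.default_rng(0)
spend = np.concatenate([rng.normal(10, 1, 90), rng.normal(100, 1, 10)])
toy = pd.DataFrame({"MonetaryValue": spend})

# Quantile binning splits customers into equally sized groups
toy["QuantileBin"] = pd.qcut(toy["MonetaryValue"], q=2, labels=False)

# KMeans groups customers by distance, so sizes follow the data
model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy["Cluster"] = model.fit_predict(toy[["MonetaryValue"]])

quantile_sizes = sorted(toy["QuantileBin"].value_counts().tolist())
cluster_sizes = sorted(toy["Cluster"].value_counts().tolist())
print("Quantile bin sizes:", quantile_sizes)
print("KMeans cluster sizes:", cluster_sizes)
```

The quantile bins cut through the middle of the large group to keep the counts equal, while KMeans keeps the small high-spending group on its own.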
However, we cannot be sure that three clusters is the best choice. Loop over a range of k values and record the SSE for each one to check.
datamart_normalized = pd.read_csv("data/chapter_4/datamart_normalized_df.csv")
datamart_normalized.head()
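The normalized table is loaded from disk, and the steps that produced it are not shown here. One common way to prepare RFM data for KMeans, sketched below purely as an assumption, is a log transform to reduce skew followed by standardization to zero mean and unit variance. The `raw` values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw RFM values; the real table is not reproduced here
raw = pd.DataFrame(
    {
        "Recency": [5, 40, 200, 12],
        "Frequency": [30, 4, 1, 12],
        "MonetaryValue": [900.0, 80.0, 15.0, 300.0],
    },
    index=pd.Index([1, 2, 3, 4], name="CustomerID"),
)

# Assumption: a log transform reduces skew (requires positive values)
logged = np.log(raw)

# Standardize each column to mean 0 and standard deviation 1
scaler = StandardScaler()
normalized = pd.DataFrame(
    scaler.fit_transform(logged), index=raw.index, columns=raw.columns
)
print(normalized.round(2))
```

Because KMeans relies on Euclidean distance, putting the variables on a common scale keeps any single one, such as MonetaryValue, from dominating the clustering.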
sse = {}
for k in range(1, 21):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(datamart_normalized)
    sse[k] = kmeans.inertia_
import seaborn as sns
import matplotlib.pyplot as plt
plt.title("The Elbow Method")
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x = list(sse.keys()), y = list(sse.values()))
plt.show()
Here the elbow of the curve appears at around k = 3, so three clusters is a reasonable choice.
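To make the SSE concrete, the sketch below recomputes it by hand as the sum of squared distances from each point to its assigned centroid and compares it with `inertia_`. It also measures how much the SSE drops at each step, which is what the elbow shows visually. The three synthetic blobs in `X` are invented for illustration and are not the chapter's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs of 50 points each
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

sse = {}
manual_sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse[k] = model.inertia_
    # Look up the centroid assigned to each point, then sum squared distances
    assigned = model.cluster_centers_[model.labels_]
    manual_sse[k] = float(((X - assigned) ** 2).sum())

# Reduction in SSE gained by going from k-1 to k clusters
drops = {k: sse[k - 1] - sse[k] for k in range(2, 7)}
for k in range(1, 7):
    print(k, round(sse[k], 1), round(manual_sse[k], 1))
```

In this toy case the drop in SSE is large up to k = 3 and small afterwards, which is the pattern the elbow plot is meant to reveal.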
Next, present the clustering results in a form that is easy for readers to interpret.
datamart_rfm_k3.head()
# Reshape from wide to long: one row per customer and metric
datamart_melt = pd.melt(
    datamart_rfm_k3,
    id_vars=['CustomerID', 'Cluster'],
    value_vars=['Recency', 'Frequency', 'MonetaryValue'],
    var_name='Metric', value_name='Value'  # 'Value' matches the column plotted below
)
pd.melt works like tidyr::gather in R.
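A small example shows what the reshape does. The two-customer `wide` table below is hypothetical, and the value column is named `Value` to match the column used in the snake plot.

```python
import pandas as pd

# Hypothetical wide table: one row per customer
wide = pd.DataFrame({
    "CustomerID": [1, 2],
    "Cluster": [0, 1],
    "Recency": [10, 200],
    "Frequency": [25, 2],
    "MonetaryValue": [500.0, 30.0],
})

# Long table: one row per customer and metric
long = pd.melt(
    wide,
    id_vars=["CustomerID", "Cluster"],
    value_vars=["Recency", "Frequency", "MonetaryValue"],
    var_name="Metric",
    value_name="Value",
)
print(long)
```

Each of the two customers now contributes three rows, one per metric, which is the shape that seaborn expects when the metric goes on the x-axis.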
Draw a snake plot.
datamart_melt = pd.read_csv("data/chapter_4/datamart_melt.csv")
plt.title('Snake plot of normalized variables')
plt.xlabel('Metric')
plt.ylabel('Value')
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()
Here, cluster 2 is close to average on every metric.
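With its default settings, `sns.lineplot` aggregates repeated x values by their mean, so each line in the snake plot traces the per-cluster average of each metric. The table behind the plot can be sketched as follows, using a small hypothetical `long` table of normalized values.

```python
import pandas as pd

# Hypothetical long-format table of normalized values
long = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "Metric": ["Recency", "Frequency"] * 4,
    "Value": [-1.0, 1.2, -0.8, 1.0, 0.9, -1.1, 1.1, -0.9],
})

# One row per cluster, one column per metric: the points on each line
snake = long.groupby(["Cluster", "Metric"])["Value"].mean().unstack("Metric")
print(snake.round(2))
```

A cluster whose averages all sit near zero on normalized data is close to the overall population on every metric.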
Another way to present the results is to compare each cluster's averages with those of the whole population.
# Ratio of each cluster's average to the population average, minus 1
relative_imp = \
    datamart_rfm_k3.drop('CustomerID', axis=1).groupby('Cluster').mean() / \
    datamart_rfm_k3[['Frequency', 'MonetaryValue', 'Recency']].mean() - 1
relative_imp
# Initialize a plot with a figure size of 8 by 2 inches
plt.figure(figsize=(8, 2))
# Add the plot title
plt.title('Relative importance of attributes')
# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()
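The relative importance calculation can be followed by hand on a tiny table. The `toy` values below are hypothetical. A score of 0 means the cluster average equals the population average, and the further a score is from 0, the more that attribute distinguishes the cluster.

```python
import pandas as pd

# Hypothetical table with two clusters of two customers each
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1],
    "Recency": [10, 30, 100, 60],
    "Frequency": [8, 12, 2, 2],
})

cluster_avg = toy.groupby("Cluster").mean()
population_avg = toy[["Recency", "Frequency"]].mean()

# The division aligns on column names, so column order does not matter
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

Cluster 0 has an average Recency of 20 against a population average of 50, giving 20 / 50 - 1 = -0.6, which means it is 60% below the population on that attribute.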