In chapter 2 we built the RFM segments simply by splitting each variable into quantiles independently. Here the KMeans model takes the relationships among the variables into account.

In [77]:
from sklearn.cluster import KMeans
import pandas as pd
In [78]:
datamart_rfm = pd.read_csv("data/chapter_4/datamart_rfm.csv")
In [79]:
kmeans = KMeans(n_clusters=3,random_state=1)
kmeans.fit(datamart_rfm)
Out[79]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=1, tol=0.0001, verbose=0)
In [80]:
cluster_labels = kmeans.labels_
In [81]:
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)
In [82]:
datamart_rfm_k3.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)
Out[82]:
Recency Frequency MonetaryValue
mean mean mean count
Cluster
0 92.7 17.9 320.6 1801
1 88.8 18.0 281.7 1825
2 22.7 175.6 15231.8 17

The clustering looks reasonable; let's compare it to the quantile-based segmentation from chapter 2.

In [22]:
pd.read_csv("data/chapter_2/datamart_rfm_scores_named_segment.csv").groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)
Out[22]:
Recency Frequency MonetaryValue
mean mean mean count
RFM_Level
Low 180.8 3.2 52.7 1075
Middle 73.9 10.7 202.9 1547
Top 20.3 47.1 959.7 1021

Unlike the quantile-based segments, KMeans does not produce groups of similar size: cluster 2 holds only 17 customers with extreme values.

However, we are not sure that three clusters is the best choice. Let's loop over different values of k and apply the elbow method.

In [25]:
datamart_normalized = pd.read_csv("data/chapter_4/datamart_normalized_df.csv")
In [26]:
datamart_normalized.head()
Out[26]:
CustomerID Recency Frequency MonetaryValue
0 12747 -2.002202 0.865157 1.464940
1 12748 -2.814518 3.815272 2.994692
2 12749 -1.789490 1.189117 1.347598
3 12820 -1.789490 0.546468 0.500595
4 12822 0.337315 0.020925 0.037943
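
The preprocessing that produced this file is not shown in this notebook. A minimal sketch of how such a normalized datamart is typically built (assuming a log transform of the skewed RFM variables followed by standardization; datamart_normalized_sketch is a hypothetical name):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical reconstruction: log-transform the skewed RFM variables,
# then centre and scale them to zero mean and unit variance.
rfm_log = np.log(datamart_rfm[['Recency', 'Frequency', 'MonetaryValue']])
rfm_scaled = StandardScaler().fit_transform(rfm_log)

datamart_normalized_sketch = pd.DataFrame(
    rfm_scaled, columns=['Recency', 'Frequency', 'MonetaryValue'])
datamart_normalized_sketch.insert(0, 'CustomerID', datamart_rfm['CustomerID'])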
In [29]:
sse = {}
for k in range(1,21):
    kmeans = KMeans(n_clusters = k, random_state = 1)
    kmeans.fit(datamart_normalized)
    sse[k] = kmeans.inertia_
In [30]:
import seaborn as sns
import matplotlib.pyplot as plt
In [35]:
plt.title("The Elbow Method")
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x = list(sse.keys()), y = list(sse.values()))
plt.show()

Here the elbow sits around k = 3, so three clusters is a reasonable choice.
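
As an extra numeric check (not part of the original run), we can print the relative drop in SSE for each additional cluster; values close to zero mean the extra cluster buys little improvement:

# Relative SSE decrease per added cluster; small magnitudes mean diminishing returns.
sse_series = pd.Series(sse).sort_index()
print(sse_series.pct_change().round(3))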

Now let's turn to presenting the model's results clearly for the readers.

In [59]:
datamart_rfm_k3.head()
Out[59]:
CustomerID Recency Frequency MonetaryValue Cluster
0 12747 3 25 948.70 0
1 12748 1 888 7046.16 0
2 12749 4 37 813.45 0
3 12820 4 17 268.02 0
4 12822 71 9 146.15 0
In [60]:
datamart_melt = pd.melt(
    datamart_rfm_k3,
    id_vars = ['CustomerID','Cluster'],
    value_vars = ['Recency','Frequency','MonetaryValue'],
    var_name = 'Metric', value_name = 'value'
)

pd.melt works like tidyr::gather in R.

Now let's draw a snake plot.
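
The CSV loaded below already contains the melted normalized values (columns Metric and Value). A minimal sketch of how it could be built from datamart_normalized, assuming its rows are in the same order as the cluster labels (datamart_melt_sketch is a hypothetical name):

# Attach the KMeans labels to the normalized RFM values,
# then reshape to long format so each metric becomes a row.
datamart_normalized_k3 = datamart_normalized.assign(Cluster=cluster_labels)
datamart_melt_sketch = pd.melt(
    datamart_normalized_k3,
    id_vars=['CustomerID', 'Cluster'],
    value_vars=['Recency', 'Frequency', 'MonetaryValue'],
    var_name='Metric', value_name='Value'
)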

In [62]:
datamart_melt = pd.read_csv("data/chapter_4/datamart_melt.csv")
In [64]:
plt.title('Snake plot of normalized variables')
plt.xlabel('Metric')
plt.ylabel('Value')
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()

Here customer type 2 clearly diverges from the other clusters across all three metrics.

Another way to present the results is to compare each cluster's average to the population average.

In [90]:
relative_imp = \
datamart_rfm_k3.drop('CustomerID',axis = 1).groupby('Cluster').mean() / \
datamart_rfm_k3[['Frequency','MonetaryValue','Recency']].mean() - 1
In [92]:
relative_imp
Out[92]:
Frequency MonetaryValue Recency
Cluster
0 -0.041074 -0.135118 0.024997
1 -0.037551 -0.240098 -0.017692
2 8.382597 40.089783 -0.748928
In [93]:
# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8, 2))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()
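
In the heatmap, the further a value is from zero, the more that cluster deviates from the population average: clusters 0 and 1 sit close to it, while the small cluster 2 is far above average on Frequency and MonetaryValue and well below average on Recency.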