In chapter 2, we built the RFM segments by cutting each variable into quantiles independently. A KMeans model, by contrast, takes the relationships among the variables into account.

import pandas as pd
from sklearn.cluster import KMeans

# Load the RFM data
datamart_rfm = pd.read_csv("data/chapter_4/datamart_rfm.csv")

# Fit a three-cluster model; random_state makes the result reproducible
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(datamart_rfm)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=1, tol=0.0001, verbose=0)

cluster_labels = kmeans.labels_

datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)

datamart_rfm_k3.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)
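
KMeans measures Euclidean distance, so columns with large raw values dominate the fit, and an identifier such as CustomerID is treated as a feature if it is passed in. The following is a minimal sketch, not the pipeline used in this chapter, of fitting on standardized RFM columns only. The toy DataFrame and the choice of StandardScaler are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same column names as datamart_rfm
toy_rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Recency": [3, 5, 90, 100, 200, 210],
    "Frequency": [40, 35, 10, 12, 2, 1],
    "MonetaryValue": [900.0, 850.0, 200.0, 220.0, 30.0, 25.0],
})

# Keep only the behavioural columns; the ID is a label, not a feature
feature_cols = ["Recency", "Frequency", "MonetaryValue"]
scaled = StandardScaler().fit_transform(toy_rfm[feature_cols])

# n_init is set explicitly because its default differs across scikit-learn versions
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(scaled)
toy_clustered = toy_rfm.assign(Cluster=toy_kmeans.labels_)
print(toy_clustered.groupby("Cluster")[feature_cols].mean().round(1))
```

After scaling, each of the three columns contributes on a comparable scale, so no single variable decides the cluster assignment on its own.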

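The model isolates a small group of 17 very high-value customers (cluster 2), while clusters 0 and 1 have nearly identical averages. Let's compare this with the segmentation from chapter 2.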

pd.read_csv("data/chapter_2/datamart_rfm_scores_named_segment.csv").groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)

Unlike the quantile-based segments, the KMeans clusters differ widely in size: cluster 2 contains only 17 customers, while the other two hold about 1,800 each.
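
The difference in group sizes follows from how the two methods work: quantile cuts produce bins of nearly equal size by construction, whereas KMeans places no constraint on cluster size. A small sketch with made-up values, using pd.qcut as the quantile step (an assumption about how the chapter 2 segments were built):

```python
import pandas as pd

# Made-up monetary values: eleven ordinary customers and one extreme outlier
monetary = pd.Series([10, 12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 5000])

# Quantile-based cut: three bins with the same number of customers in each
levels = pd.qcut(monetary, q=3, labels=["Low", "Middle", "Top"])
level_counts = levels.value_counts().sort_index()
print(level_counts)
```

The outlier lands in the same bin as three ordinary customers, whereas a distance-based method would tend to isolate it.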

We cannot be sure that three clusters is the best choice, so fit the model for a range of k values in a loop and compare the SSE.

datamart_normalized = pd.read_csv("data/chapter_4/datamart_normalized_df.csv")

datamart_normalized.head()

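# Assumption: CustomerID is an identifier, so it is excluded from the features
features = datamart_normalized.drop(columns=['CustomerID'], errors='ignore')

# Fit KMeans for k = 1..20 and record the SSE (inertia_) for each k
sse = {}
for k in range(1, 21):
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(features)
    sse[k] = model.inertia_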

import seaborn as sns
import matplotlib.pyplot as plt

plt.title("The Elbow Method")
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x = list(sse.keys()), y = list(sse.values()))
plt.show()

The elbow appears around k = 3, so three clusters is a reasonable choice.
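
For reference, the inertia_ attribute used as SSE above is the sum of squared distances from each sample to its closest cluster center. A minimal sketch of that quantity on made-up points, computed with NumPy (the helper name sum_squared_error is hypothetical):

```python
import numpy as np

def sum_squared_error(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())

# Four made-up points forming two tight pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: every point is assigned to the single overall centroid
labels_k1 = np.zeros(4, dtype=int)
centers_k1 = points.mean(axis=0, keepdims=True)
sse_k1 = sum_squared_error(points, labels_k1, centers_k1)

# k = 2: one centroid per pair
labels_k2 = np.array([0, 0, 1, 1])
centers_k2 = np.array([points[:2].mean(axis=0), points[2:].mean(axis=0)])
sse_k2 = sum_squared_error(points, labels_k2, centers_k2)

print(sse_k1, sse_k2)
```

SSE generally keeps falling as k grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.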

Next, present the clustering results in a form that is easy for readers to interpret.

datamart_rfm_k3.head()

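# Reshape from wide to long: one row per (customer, metric) pair.
# The value column is named 'Value' to match the plotting code below.
datamart_melt = pd.melt(
    datamart_rfm_k3,
    id_vars=['CustomerID', 'Cluster'],
    value_vars=['Recency', 'Frequency', 'MonetaryValue'],
    var_name='Metric', value_name='Value'
)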

pd.melt reshapes the data from wide to long format, like tidyr::gather in R.
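
To make the reshaping concrete, here is a small sketch on a made-up two-row frame; each wide row becomes one long row per metric.

```python
import pandas as pd

# Made-up wide data: one row per customer
wide_df = pd.DataFrame({
    "CustomerID": [1, 2],
    "Cluster": [0, 1],
    "Recency": [3, 71],
    "Frequency": [25, 9],
    "MonetaryValue": [948.7, 146.15],
})

# Long format: one row per (customer, metric) pair
long_df = pd.melt(
    wide_df,
    id_vars=["CustomerID", "Cluster"],
    value_vars=["Recency", "Frequency", "MonetaryValue"],
    var_name="Metric",
    value_name="Value",
)
print(long_df)
```

Two customers and three metrics give six rows, with the identifier columns repeated on each row.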

Next, draw a snake plot.

datamart_melt = pd.read_csv("data/chapter_4/datamart_melt.csv")

plt.title('Snake plot of normalized variables')
plt.xlabel('Metric')
plt.ylabel('Value')
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()

Here, cluster 2 is close to average on all three metrics.
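
With several customers per cluster and metric, sns.lineplot aggregates the repeated observations (by default it draws the mean with a confidence band). The points on each line can therefore be reproduced with a groupby. A sketch on made-up long-format data, assuming the same Cluster, Metric, and Value columns as datamart_melt:

```python
import pandas as pd

# Made-up long-format data in the same layout as datamart_melt
toy_melt = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "Metric": ["Recency", "Recency", "Frequency", "Frequency"] * 2,
    "Value": [1.0, 0.5, -1.0, -0.5, -1.0, -2.0, 1.0, 2.0],
})

# One mean per (Cluster, Metric): these are the heights of the snake-plot lines
snake_points = (
    toy_melt.groupby(["Cluster", "Metric"])["Value"].mean().unstack("Metric")
)
print(snake_points)
```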

Another way to present the results is to compare each cluster's averages with the population averages.

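# Average of each metric within each cluster
cluster_avg = datamart_rfm_k3.drop('CustomerID', axis=1).groupby('Cluster').mean()
# Average of each metric over all customers
population_avg = datamart_rfm_k3[['Frequency', 'MonetaryValue', 'Recency']].mean()
# Proportional deviation from the population: 0 means "same as average"
relative_imp = cluster_avg / population_avg - 1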

relative_imp

# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8, 2))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()
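
As a check on how to read these numbers, the sketch below rebuilds the MonetaryValue entries from the rounded cluster means (320.6, 281.7, 15231.8) and counts (1801, 1825, 17) in the k = 3 summary output. A value of 0 means the cluster matches the population average. The results differ slightly from the exact relative_imp output because the inputs are rounded.

```python
import pandas as pd

# Rounded per-cluster MonetaryValue means and customer counts from the k = 3 summary
summary = pd.DataFrame(
    {"mean": [320.6, 281.7, 15231.8], "count": [1801, 1825, 17]},
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Population mean = count-weighted average of the cluster means
population_mean = (summary["mean"] * summary["count"]).sum() / summary["count"].sum()

# Relative importance: proportional deviation of each cluster from the population
relative = summary["mean"] / population_mean - 1
print(relative.round(2))
```

Cluster 2 spends roughly 40 times more than the average customer, which is why it dominates the heatmap's colour scale.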

Cluster summary for k = 3 (output of the groupby on datamart_rfm_k3):

         Recency  Frequency  MonetaryValue
            mean       mean           mean  count
Cluster
0           92.7       17.9          320.6   1801
1           88.8       18.0          281.7   1825
2           22.7      175.6        15231.8     17

Segment summary from chapter 2 (grouped by RFM_Level):

           Recency  Frequency  MonetaryValue
              mean       mean           mean  count
RFM_Level
Low          180.8        3.2           52.7   1075
Middle        73.9       10.7          202.9   1547
Top           20.3       47.1          959.7   1021

First rows of datamart_normalized:

   CustomerID    Recency  Frequency  MonetaryValue
0       12747  -2.002202   0.865157       1.464940
1       12748  -2.814518   3.815272       2.994692
2       12749  -1.789490   1.189117       1.347598
3       12820  -1.789490   0.546468       0.500595
4       12822   0.337315   0.020925       0.037943

First rows of the raw RFM data:

   CustomerID  Recency  Frequency  MonetaryValue
0       12747        3         25         948.70
1       12748        1        888        7046.16
2       12749        4         37         813.45
3       12820        4         17         268.02
4       12822       71          9         146.15

Relative importance (relative_imp):

         Frequency  MonetaryValue    Recency
Cluster
0        -0.041074      -0.135118   0.024997
1        -0.037551      -0.240098  -0.017692
2         8.382597      40.089783  -0.748928