# Print the average values of the variables in the dataset
print(data.mean())

# Print the standard deviation of the variables in the dataset
print(data.std())

# Get the key statistics of the dataset
print(data.describe())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-1d1a794d059f> in <module>
      1 # Print the average values of the variables in the dataset
----> 2 print(data.mean())
      3 
      4 # Print the standard deviation of the variables in the dataset
      5 print(data.std())

NameError: name 'data' is not defined

import pandas as pd

# Load the RFM datamart so that `data` is defined before the summary calls
data = pd.read_csv("data/chapter_3/rfm_datamart.csv")

print(data.mean())
print(data.std())
print(data.describe())

CustomerID       15551.620642
Recency             90.435630
Frequency           18.714247
MonetaryValue      370.694387
dtype: float64
CustomerID       1562.587958
Recency            94.446510
Frequency          43.754468
MonetaryValue    1347.443451
dtype: float64
         CustomerID     Recency    Frequency  MonetaryValue
count   3643.000000  3643.00000  3643.000000    3643.000000
mean   15551.620642    90.43563    18.714247     370.694387
std     1562.587958    94.44651    43.754468    1347.443451
min    12747.000000     1.00000     1.000000       0.650000
25%    14209.500000    19.00000     4.000000      58.705000
50%    15557.000000    51.00000     9.000000     136.370000
75%    16890.000000   139.00000    21.000000     334.350000
max    18287.000000   365.00000  1497.000000   48060.350000

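These pandas methods compute the statistic for every numeric column in the dataset. Note that CustomerID is an identifier, so its mean and standard deviation carry no meaning.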

Next, we check the dataset for skewness.

import seaborn as sns
import matplotlib.pyplot as plt

# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
for i, col in enumerate(['Recency', 'Frequency', 'MonetaryValue'], start=1):
    plt.subplot(3, 1, i)
    sns.histplot(data[col], kde=True, stat='density')
plt.show()

All three variables are clearly right-skewed, which is common in fintech data.
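The visual impression of right skew can also be checked numerically. The sketch below is only an illustration: it uses a synthetic, hypothetical stand-in for the RFM table (log-normal draws, named `demo`) rather than the real CSV, and relies on pandas' `DataFrame.skew()`, which returns the sample skewness of each column. Values well above 0 indicate a long right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the RFM table (not the real data):
# log-normal draws are right-skewed, like the three variables plotted above.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Recency": rng.lognormal(mean=4.0, sigma=1.0, size=5000),
    "Frequency": rng.lognormal(mean=2.0, sigma=1.0, size=5000),
    "MonetaryValue": rng.lognormal(mean=5.0, sigma=1.2, size=5000),
})

# Sample skewness per column; positive values mean a longer right tail.
skew_raw = demo.skew()

# Skewness after a log transform, for comparison.
skew_log = np.log(demo).skew()

print(pd.DataFrame({"raw": skew_raw, "log": skew_log}).round(2))
```

On the real dataset the same check would simply be `data[['Recency', 'Frequency', 'MonetaryValue']].skew()`.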

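import numpy as np

# Log-transform to reduce right skew; np.log requires strictly positive values,
# which holds here because every column's minimum is above 0.
data_log = np.log(data)

# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
for i, col in enumerate(['Recency', 'Frequency', 'MonetaryValue'], start=1):
    plt.subplot(3, 1, i)
    sns.histplot(data_log[col], kde=True, stat='density')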
plt.show()

The log transform reduces the skew, but the variables still have different means and spreads. Let's normalize them.

data_normalized = (data_log - data_log.mean()) / data_log.std()

data_normalized.describe().round(2)

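describe() is a quick way to check the mean and standard deviation. The 25% and 75% quartiles also give a rough sense of skewness: in a symmetric distribution they lie at about the same distance from the median (50%).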
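One way to turn the quartile comparison into a single number is Bowley's quartile skewness coefficient, (Q3 + Q1 - 2*Q2) / (Q3 - Q1). This measure is not used in the original notebook; it is introduced here only as an illustration. The sketch below applies it to the quartiles copied from the describe() output of the raw data shown earlier; the helper name `quartile_skewness` is hypothetical.

```python
import pandas as pd

# Quartiles copied from the describe() output of the raw (untransformed) data.
quartiles = pd.DataFrame(
    {
        "Recency": [19.0, 51.0, 139.0],
        "Frequency": [4.0, 9.0, 21.0],
        "MonetaryValue": [58.705, 136.37, 334.35],
    },
    index=["25%", "50%", "75%"],
)


def quartile_skewness(q: pd.DataFrame) -> pd.Series:
    """Bowley's coefficient per column: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = q.loc["25%"], q.loc["50%"], q.loc["75%"]
    return (q3 + q1 - 2 * q2) / (q3 - q1)


bowley = quartile_skewness(quartiles)
print(bowley.round(2))
```

The coefficient lies between -1 and 1. A positive value means the upper quartile is farther from the median than the lower one, which matches the right skew seen in the raw histograms.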

Alternatively, scikit-learn can do the same in fewer steps.

conda install scikit-learn

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_normalized = scaler.fit_transform(data_log)

data_normalized = pd.DataFrame(data_normalized, index=data.index, columns=data.columns)

data_normalized.describe().round(2)
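The manual formula and StandardScaler are close but not numerically identical: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while StandardScaler divides by the population standard deviation (`ddof=0`). The sketch below shows this on a small synthetic frame (`demo_log`, a hypothetical stand-in for `data_log`). With thousands of rows the difference is negligible, which is why both approaches round to the same describe() output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for data_log (not the real data).
rng = np.random.default_rng(1)
demo_log = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["Recency", "Frequency", "MonetaryValue"],
)

# Manual z-scores with the pandas default (sample std, ddof=1).
manual_sample = (demo_log - demo_log.mean()) / demo_log.std()

# Manual z-scores with the population std (ddof=0).
manual_population = (demo_log - demo_log.mean()) / demo_log.std(ddof=0)

# StandardScaler returns a NumPy array, so wrap it back into a DataFrame.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo_log),
    index=demo_log.index,
    columns=demo_log.columns,
)

print("matches ddof=0:", np.allclose(scaled, manual_population))
print("matches ddof=1:", np.allclose(scaled, manual_sample))
```

The two versions differ only by a constant factor of sqrt(n / (n - 1)) per column, so the choice does not change the shape of the distributions.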

# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
for i, col in enumerate(['Recency', 'Frequency', 'MonetaryValue'], start=1):
    plt.subplot(3, 1, i)
    sns.histplot(data_normalized[col], kde=True, stat='density')
plt.show()

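Some skewness remains after the transformation, in both directions. In the summary below, Recency has a longer left tail (min -2.81, max 1.55), while Frequency has a longer right tail (min -1.79, max 4.25).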

       CustomerID  Recency  Frequency  MonetaryValue
count     3643.00  3643.00    3643.00        3643.00
mean        -0.00    -0.00      -0.00          -0.00
std          1.00     1.00       1.00           1.00
min         -1.91    -2.81      -1.79          -4.09
25%         -0.84    -0.64      -0.65          -0.66
50%          0.05     0.09       0.02          -0.01
75%          0.87     0.83       0.72           0.67
max          1.65     1.55       4.25           4.46