In [4]:
import pandas as pd
data = pd.read_csv("data/chapter_3/rfm_datamart.csv")
In [8]:
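# Print the average value of each variable in the dataset
print(data.mean())

# Print the standard deviation of each variable in the dataset
print(data.std())

# Get the key statistics of the dataset
print(data.describe())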
CustomerID       15551.620642
Recency             90.435630
Frequency           18.714247
MonetaryValue      370.694387
dtype: float64
CustomerID       1562.587958
Recency            94.446510
Frequency          43.754468
MonetaryValue    1347.443451
dtype: float64
         CustomerID     Recency    Frequency  MonetaryValue
count   3643.000000  3643.00000  3643.000000    3643.000000
mean   15551.620642    90.43563    18.714247     370.694387
std     1562.587958    94.44651    43.754468    1347.443451
min    12747.000000     1.00000     1.000000       0.650000
25%    14209.500000    19.00000     4.000000      58.705000
50%    15557.000000    51.00000     9.000000     136.370000
75%    16890.000000   139.00000    21.000000     334.350000
max    18287.000000   365.00000  1497.000000   48060.350000

As shown, each of these pandas methods computes its statistic separately for every variable (column) in the dataset.
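
CustomerID is an identifier rather than a measurement, so its mean and standard deviation carry no meaning. Here is a minimal sketch of two ways to keep it out of the summary. It uses a tiny made-up frame: the column names match the datamart, but the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the RFM datamart.
demo = pd.DataFrame({
    "CustomerID": [12747, 12748, 12749],
    "Recency": [3, 1, 4],
    "Frequency": [25, 888, 37],
    "MonetaryValue": [948.70, 7046.16, 813.45],
})

# Option 1: drop the identifier before summarizing.
stats = demo.drop(columns="CustomerID").describe()

# Option 2: move the identifier into the index so it is never treated as a variable.
indexed = demo.set_index("CustomerID")

print(stats)
print(indexed.mean())
```

With the real file, passing index_col="CustomerID" to pd.read_csv should have the same effect as option 2.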

Next, we check the dataset for skewness.

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt
In [14]:
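# distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement
plt.subplot(3, 1, 1); sns.histplot(data['Recency'], kde=True, stat='density')
plt.subplot(3, 1, 2); sns.histplot(data['Frequency'], kde=True, stat='density')
plt.subplot(3, 1, 3); sns.histplot(data['MonetaryValue'], kde=True, stat='density')
plt.tight_layout()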
plt.show()

All three variables are clearly right-skewed, which is common in fintech customer data.

In [18]:
import numpy as np
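# Log-transform to reduce the right skew (every value is positive; CustomerID is transformed too)
data_log = np.log(data)
# distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement
plt.subplot(3, 1, 1); sns.histplot(data_log['Recency'], kde=True, stat='density')
plt.subplot(3, 1, 2); sns.histplot(data_log['Frequency'], kde=True, stat='density')
plt.subplot(3, 1, 3); sns.histplot(data_log['MonetaryValue'], kde=True, stat='density')
plt.tight_layout()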
plt.show()
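
To put a number on what the plots show, the sample skewness can be compared before and after the log transform. This is a hedged sketch on synthetic log-normal data, which is an assumption standing in for the real columns. The name `demo` and the distribution parameters are made up.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for two RFM columns (not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Recency": rng.lognormal(mean=4.0, sigma=1.0, size=5000),
    "MonetaryValue": rng.lognormal(mean=5.0, sigma=1.3, size=5000),
})

# Sample skewness per column: values well above 0 mean a long right tail.
skew_raw = demo.skew()

# After the log transform the skewness should be close to 0.
skew_log = np.log(demo).skew()

comparison = pd.DataFrame({"raw": skew_raw, "log": skew_log})
print(comparison.round(2))
```

On the real data the same two calls, data.skew() and data_log.skew(), would give the corresponding figures.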

The log transform reduces the skewness, but the variables still have different means and variances. Let's normalize them.

In [19]:
data_normalized = (data_log - data_log.mean()) / data_log.std()
In [22]:
data_normalized.describe().round(2)
Out[22]:
       CustomerID  Recency  Frequency  MonetaryValue
count     3643.00  3643.00    3643.00        3643.00
mean        -0.00    -0.00      -0.00          -0.00
std          1.00     1.00       1.00           1.00
min         -1.91    -2.81      -1.79          -4.09
25%         -0.84    -0.64      -0.65          -0.66
50%          0.05     0.09       0.02          -0.01
75%          0.87     0.83       0.72           0.67
max          1.65     1.55       4.25           4.46

describe() is a quick way to check the mean and standard deviation, and the 25% and 75% quartiles give a hint about skewness.
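
One way to turn the quartiles into a single number is the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2*Q2) / (Q3 - Q1). The sketch below applies it to the quartiles printed in the describe() outputs above. The helper name `bowley_skew` is hypothetical, and the normalized quartiles are only available to two decimals, so those results are approximate.

```python
import pandas as pd

# Quartiles copied from the describe() outputs above.
raw_q = pd.DataFrame(
    {"Recency": [19.0, 51.0, 139.0],
     "Frequency": [4.0, 9.0, 21.0],
     "MonetaryValue": [58.705, 136.37, 334.35]},
    index=["25%", "50%", "75%"],
)
norm_q = pd.DataFrame(
    {"Recency": [-0.64, 0.09, 0.83],
     "Frequency": [-0.65, 0.02, 0.72],
     "MonetaryValue": [-0.66, -0.01, 0.67]},
    index=["25%", "50%", "75%"],
)

def bowley_skew(q):
    """Quartile skewness per column: 0 when the quartiles are symmetric around the median."""
    q1, q2, q3 = q.loc["25%"], q.loc["50%"], q.loc["75%"]
    return (q3 + q1 - 2 * q2) / (q3 - q1)

bowley_raw = bowley_skew(raw_q)
bowley_norm = bowley_skew(norm_q)
print(bowley_raw.round(2))
print(bowley_norm.round(2))
```

For the raw quartiles this gives roughly 0.41 to 0.47, and for the log-transformed, normalized ones about 0.01 to 0.02. The middle half of each distribution is therefore close to symmetric after the transform, although this measure says nothing about the tails.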

Alternatively, scikit-learn's StandardScaler performs the same normalization in fewer steps.

conda install scikit-learn
In [25]:
from sklearn.preprocessing import StandardScaler
In [27]:
scaler = StandardScaler()
data_normalized = scaler.fit_transform(data_log)
In [28]:
data_normalized = pd.DataFrame(data_normalized, index=data.index, columns=data.columns)
In [30]:
data_normalized.describe().round(2)
Out[30]:
       CustomerID  Recency  Frequency  MonetaryValue
count     3643.00  3643.00    3643.00        3643.00
mean         0.00    -0.00       0.00           0.00
std          1.00     1.00       1.00           1.00
min         -1.91    -2.81      -1.79          -4.09
25%         -0.84    -0.64      -0.65          -0.66
50%          0.05     0.09       0.02          -0.01
75%          0.87     0.83       0.72           0.67
max          1.65     1.55       4.25           4.46
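
The two approaches are not exactly identical. pandas .std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The sketch below reproduces both in plain pandas on a small made-up sample, so it does not call scikit-learn itself. The name `demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up sample, not the RFM data.
demo = pd.DataFrame({"x": [1.0, 2.0, 4.0, 8.0, 16.0]})

# Manual z-score as in the earlier cell: pandas .std() defaults to ddof=1.
z_sample = (demo - demo.mean()) / demo.std()

# Population version (ddof=0), the convention StandardScaler follows.
z_population = (demo - demo.mean()) / demo.std(ddof=0)

# The two results differ by the constant factor sqrt(n / (n - 1)).
n = len(demo)
ratio = (z_population / z_sample)["x"].iloc[0]
print(ratio, np.sqrt(n / (n - 1)))
```

With n = 3643 rows the factor is about 1.0001, which is why the two describe() tables agree to two decimals.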
In [32]:
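# distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement
plt.subplot(3, 1, 1); sns.histplot(data_normalized['Recency'], kde=True, stat='density')
plt.subplot(3, 1, 2); sns.histplot(data_normalized['Frequency'], kde=True, stat='density')
plt.subplot(3, 1, 3); sns.histplot(data_normalized['MonetaryValue'], kde=True, stat='density')
plt.tight_layout()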
plt.show()

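Some skewness, both positive and negative, remains after the transformation. For example, Frequency now ranges from -1.79 to 4.25 (a longer right tail), while Recency ranges from -2.81 to 1.55 (a longer left tail).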