Fraud Detection Models: Study Notes
2020-05-19
- These documents are stitched together using the child parameter of RMarkdown.
- Notes combined this way are easier to review.
- Please submit related questions as an Issue.
1 Topic Models
Reference: Werger (2020)
In this exercise you’re going to work with text data, containing emails from Enron employees. The Enron scandal is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as “sell enron stock”.
This exercise uses the email text data from the Enron scandal.
1.1 EDA
(2090, 6)
| | Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|---|
| 0 | <8345058.1075840404046.JavaMail.evans@thyme> | ('advdfeedback@investools.com') | ('advdfeedback@investools.com') | 2002-01-29 23:20:55 | INVESTools AdvisoryFree Digest of Trusted … | investools advisory free digest trusted invest… |
| 1 | <1512159.1075863666797.JavaMail.evans@thyme> | ('richard.sanders@enron.com') | ('richard.sanders@enron.com') | 2000-09-20 19:07:00 | ----- Forwarded by Richard B Sanders/HOU/ECT o… | forwarded richard b sanders hou ect pm justin … |
| 2 | <26118676.1075862176383.JavaMail.evans@thyme> | ('m..love@enron.com') | ('m..love@enron.com') | 2001-10-30 16:15:17 | hey you are not wearing your target purple shi… | hey wearing target purple shirt today mine wan… |
| 3 | <10369289.1075860831062.JavaMail.evans@thyme> | ('leslie.milosevich@kp.org') | ('leslie.milosevich@kp.org') | 2002-01-30 17:54:18 | Leslie Milosevich1042 Santa Clara Avenue… | leslie milosevich santa clara avenue alameda c… |
| 4 | <26728895.1075860815046.JavaMail.evans@thyme> | ('rtwait@graphicaljazz.com') | ('rtwait@graphicaljazz.com') | 2002-01-30 19:36:01 | Rini Twait1010 E 5th Ave, CO 80501… | rini twait e th ave longmont co rtwait graphic… |
# Find all cleaned emails that contain 'sell enron stock'
mask = df['clean_content'].str.contains('sell enron stock', na=False)
# Select the data from df using the mask
df.loc[mask]
| | Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|---|
| 154 | <6336501.1075841154311.JavaMail.evans@thyme> | ('sarah.palmer@enron.com') | ('sarah.palmer@enron.com') | 2002-02-01 14:53:35 | Venture: A 1997 Enron Meeting Belies O… | joint venture enron meeting belies officers cl… |
# Create a list of terms to search for
searchfor = ['enron stock', 'stock bonus', 'sell enron stock', 'sell stock']
# Filter cleaned emails on searchfor list and select from df
filtered_emails = df.loc[df['clean_content'].str.contains('|'.join(searchfor), na=False)]
filtered_emails.head()
| | Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|---|
| 0 | <8345058.1075840404046.JavaMail.evans@thyme> | ('advdfeedback@investools.com') | ('advdfeedback@investools.com') | 2002-01-29 23:20:55 | INVESTools AdvisoryFree Digest of Trusted … | investools advisory free digest trusted invest… |
| 1 | <1512159.1075863666797.JavaMail.evans@thyme> | ('richard.sanders@enron.com') | ('richard.sanders@enron.com') | 2000-09-20 19:07:00 | ----- Forwarded by Richard B Sanders/HOU/ECT o… | forwarded richard b sanders hou ect pm justin … |
| 2 | <26118676.1075862176383.JavaMail.evans@thyme> | ('m..love@enron.com') | ('m..love@enron.com') | 2001-10-30 16:15:17 | hey you are not wearing your target purple shi… | hey wearing target purple shirt today mine wan… |
| 3 | <10369289.1075860831062.JavaMail.evans@thyme> | ('leslie.milosevich@kp.org') | ('leslie.milosevich@kp.org') | 2002-01-30 17:54:18 | Leslie Milosevich1042 Santa Clara Avenue… | leslie milosevich santa clara avenue alameda c… |
| 4 | <26728895.1075860815046.JavaMail.evans@thyme> | ('rtwait@graphicaljazz.com') | ('rtwait@graphicaljazz.com') | 2002-01-30 19:36:01 | Rini Twait1010 E 5th Ave, CO 80501… | rini twait e th ave longmont co rtwait graphic… |
(314, 6)
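Joining the search terms with `'|'` builds a regex alternation. The terms above are plain words, but if a term ever contained regex metacharacters (a `.`, `+`, etc.), it should be escaped first. A minimal sketch, using a toy Series rather than the Enron data:

```python
import re

import pandas as pd

# Toy stand-in for df['clean_content'] (hypothetical data, not the Enron set)
content = pd.Series(["sell enron stock now", "quarterly report", None])

# re.escape guards each term against regex metacharacters;
# na=False treats missing content as a non-match
searchfor = ['enron stock', 'sell stock']
pattern = '|'.join(re.escape(term) for term in searchfor)

mask = content.str.contains(pattern, na=False)
print(mask.tolist())  # [True, False, False]
```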
import numpy as np

# Create flag variable where the emails match the searchfor terms
df['flag'] = np.where(df['clean_content'].str.contains('|'.join(searchfor), na=False), 1, 0)
# Count the values of the flag variable
count = df['flag'].value_counts()
print(count)
0    1776
1     314
Name: flag, dtype: int64
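`np.where(... == True, 1, 0)` works, but the boolean mask from `str.contains` can also be cast to integers directly. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical miniature of df['clean_content']
clean_content = pd.Series(["sell enron stock", "lunch plans", None])
searchfor = ['enron stock', 'stock bonus']

# Cast the boolean mask directly instead of np.where(... == True, 1, 0)
flag = clean_content.str.contains('|'.join(searchfor), na=False).astype(int)
print(flag.value_counts().to_dict())  # {0: 2, 1: 1}
```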
1.2 LDA
1.2.1 Defining stopwords
# Import nltk packages and string
from nltk.corpus import stopwords
import string
# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))
# Define punctuations to exclude and lemmatizer
exclude = set(string.punctuation)
# Import the lemmatizer from nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
# Define word cleaning function
def clean(text, stop):
    text = str(text).rstrip()
    stop_free = " ".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    # normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())
    normalized = punc_free
    return normalized
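The clean() function above depends on the NLTK stopword list. A self-contained sketch of the same pipeline, with a small hard-coded stopword set standing in for NLTK (an assumption for illustration only):

```python
import string

# Hard-coded stand-in for the NLTK English stopword set (illustration only)
stop = {"the", "to", "a", "and"}
exclude = set(string.punctuation)

def clean_simple(text, stop):
    """Lowercase, drop stopwords and digit tokens, then strip punctuation."""
    text = str(text).rstrip()
    stop_free = " ".join(
        w for w in text.lower().split() if w not in stop and not w.isdigit()
    )
    return ''.join(ch for ch in stop_free if ch not in exclude)

print(clean_simple("Sell the Enron stock, 100 shares!", stop))
# sell enron stock shares
```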
# Clean the emails in df and print results
text_clean = []
for text in df['clean_content']:
    text_clean.append(clean(text, stop).split())
print(text_clean)
# Import the packages
import gensim
from gensim import corpora
# Define the dictionary
dictionary = corpora.Dictionary(text_clean)
# Define the corpus
corpus = [dictionary.doc2bow(text) for text in text_clean]
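doc2bow turns each tokenized document into a sparse list of (token_id, count) pairs. A pure-Python sketch of what the Dictionary/doc2bow pair produces (gensim may assign different ids, but the shape is the same):

```python
from collections import Counter

# Minimal stand-in for corpora.Dictionary / doc2bow (illustration only)
docs = [["enron", "stock", "enron"], ["stock", "bonus"]]

# Assign an integer id to each unique token, in first-seen order
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

bows = [doc2bow(d) for d in docs]
print(bows)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```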
(list, gensim.corpora.dictionary.Dictionary)
(2090, 36544)
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)
# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)
# Print the results
for topic in topics:
    print(topic)
(0, '0.020*"enron" + 0.006*"company" + 0.004*"said" + 0.004*"new" + 0.004*"would"')
(1, '0.025*"enron" + 0.009*"energy" + 0.009*"pm" + 0.007*"employees" + 0.006*"company"')
(2, '0.010*"bakernet" + 0.008*"wj" + 0.007*"mailto" + 0.007*"conference" + 0.006*"mail"')
(3, '0.035*"enron" + 0.011*"message" + 0.010*"hou" + 0.010*"original" + 0.009*"pm"')
(4, '0.040*"td" + 0.029*"image" + 0.029*"net" + 0.027*"money" + 0.026*"tr"')
Fantastic work! You have now successfully created your first topic model on the Enron email data. However, the print of words doesn’t really give you enough information to find a topic that might lead you to signs of fraud. You’ll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data. You’ll learn more about this in the next video.
The printed words do not clearly reveal a fraud-related topic.
Topic 1 seems to discuss the employee share option program, and seems to point to internal conversation (with “please, may, know” etc), so this is more likely to be related to the internal accounting fraud and trading stock with insider knowledge. Topic 3 seems to be more related to general news around Enron.
A PCA visualization shows the distances between topics, which also helps judge whether a topic might be fraud-related; for example, topic 1 sits relatively far from the others.
Reference: Prabhakaran (2018)
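Tools like pyLDAvis draw such intertopic distance maps by reducing pairwise topic distances to 2-D. The distances underneath can be sketched with Jensen-Shannon divergence between topic-word distributions (the metric here is my assumption; the topic vectors below are toy values, not the fitted model's):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # wherever a > 0, the mixture m is also > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy topic-word distributions (rows sum to 1); real ones come from the LDA model
topics = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.20, 0.70],
    [0.68, 0.22, 0.10],
])

dist = np.array([[js_divergence(a, b) for b in topics] for a in topics])
# Topics 0 and 2 are close together; topic 1 stands apart
print(dist.round(2))
```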
import pandas as pd

def format_topics_sentences(ldamodel=ldamodel, corpus=corpus, texts=text_clean):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: x[1], reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
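Inside the loop above, only the first entry of the sorted row is kept; for a single document this reduces to taking the highest-probability pair. A one-line equivalent (the numbers are hypothetical, shaped like one row of ldamodel[corpus]):

```python
# One document's (topic_id, probability) list, as ldamodel[corpus] would yield
# (hypothetical values for illustration)
row = [(0, 0.12), (1, 0.55), (3, 0.33)]

# The dominant topic is simply the highest-probability entry
topic_num, prop_topic = max(row, key=lambda x: x[1])
print(topic_num, round(prop_topic, 4))  # 1 0.55
```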
# Build the per-document dominant-topic table
df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamodel, corpus=corpus, texts=text_clean)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.7949 | enron, company, said, new, would, energy, gas,… | [investools, advisory, free, digest, trusted, … |
| 1 | 1 | 0.0 | 0.8600 | enron, company, said, new, would, energy, gas,… | [forwarded, richard, b, sanders, hou, pm, just… |
| 2 | 2 | 0.0 | 0.6064 | enron, company, said, new, would, energy, gas,… | [hey, wearing, target, purple, shirt, today, m… |
| 3 | 3 | 1.0 | 0.9935 | enron, energy, pm, employees, company, schedul… | [leslie, milosevich, santa, clara, avenue, ala… |
| 4 | 4 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [rini, twait, e, th, ave, longmont, co, rtwait… |
| 5 | 5 | 1.0 | 0.9933 | enron, energy, pm, employees, company, schedul… | [laurent, meillier, jackson, stq, oakland, ca,… |
| 6 | 6 | 0.0 | 0.8824 | enron, company, said, new, would, energy, gas,… | [dear, mr, whalley, hope, take, time, read, co… |
| 7 | 7 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [jan, nissl, e, state, boise, id, jnissl, heal… |
| 8 | 8 | 0.0 | 0.6629 | enron, company, said, new, would, energy, gas,… | [much, nice, machine, original, message, baumb… |
| 9 | 9 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [leigh, clemons, seyburn, drive, baton, rouge,… |
# Group the most representative document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat(
        [sent_topics_sorteddf_mallet,
         grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)],
        axis=0)
# Reset Index
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)
# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', 'Topic_Perc_Contrib', 'Keywords', 'Text']
| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 1.0000 | enron, company, said, new, would, energy, gas,… | [joint, venture, enron, meeting, belies, offic… |
| 1 | 1.0 | 0.9991 | enron, energy, pm, employees, company, schedul… | [inline, attachment, follows, scasey, tfsbroke… |
| 2 | 2.0 | 0.9981 | bakernet, wj, mailto, conference, mail, e, ama… | [transmission, expansion, systems, transition,… |
| 3 | 3.0 | 0.9999 | enron, message, hou, original, pm, thanks, ple… | [original, message, received, thu, aug, cdt, e… |
| 4 | 4.0 | 0.9996 | td, image, net, money, tr, width, class, heigh… | [unsubscribe, mailing, please, go, money, net,… |
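The sort-then-head(1) pattern above can also be written with groupby + idxmax, which avoids sorting each group. A sketch on toy data (the values are hypothetical, shaped like df_topic_sents_keywords):

```python
import pandas as pd

# Toy stand-in for df_topic_sents_keywords (hypothetical values)
df = pd.DataFrame({
    'Dominant_Topic': [0.0, 0.0, 1.0, 1.0],
    'Perc_Contribution': [0.91, 0.99, 0.85, 0.97],
    'Text': ['a', 'b', 'c', 'd'],
})

# idxmax picks the row with the highest contribution per topic
top_docs = df.loc[df.groupby('Dominant_Topic')['Perc_Contribution'].idxmax()]
print(top_docs['Text'].tolist())  # ['b', 'd']
```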
1.3 Next steps
Topic modelling: discover hidden patterns in text data
- Discovering topics in text data
- “What is the text about”
- Conceptually similar to clustering data
- Compare topics of fraud cases to non-fraud cases and use as a feature or flag
- Or: is there a particular topic in the data that seems to point to fraud?
This is EDA performed on the fitted topic model.
- Are there any suspicious topics? (no labels)
- Are the topics in fraud and non-fraud cases similar? (with labels)
- Are fraud cases associated more with certain topics? (with labels)
This can naturally be framed as supervised learning: use the topic-model output as features and train a supervised classifier on top of it.
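Turning the per-document topic proportions into a feature matrix is a straightforward preprocessing step. A minimal sketch (the doc_topics list and labels are hypothetical stand-ins for ldamodel[corpus] and the searchfor flag):

```python
import numpy as np
import pandas as pd

# Hypothetical per-document topic proportions from the LDA model
doc_topics = [
    [(0, 0.8), (1, 0.2)],
    [(1, 0.6), (2, 0.4)],
    [(0, 0.1), (2, 0.9)],
]
num_topics = 3

# Dense feature matrix: one column per topic, missing topics filled with 0
X = np.zeros((len(doc_topics), num_topics))
for i, row in enumerate(doc_topics):
    for topic_id, prob in row:
        X[i, topic_id] = prob

features = pd.DataFrame(X, columns=[f"topic_{k}" for k in range(num_topics)])
features['flag'] = [1, 0, 0]  # labels, e.g. the searchfor flag from above
print(features.shape)  # (3, 4)
```

Any standard classifier can then be trained on these topic columns against the flag label.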
Appendix
References
Prabhakaran, Selva. 2018. “Topic Modeling with Gensim (Python).” Machine Learning Plus. 2018. machinelearningplus.com/nlp/topic-modeling-gensim-python/#18dominanttopicineachsentence.
Werger, Charlotte. 2020. “Fraud Detection in Python.” DataCamp. 2020. https://learn.datacamp.com/courses/fraud-detection-in-python.