  1. Use the child parameter of RMarkdown to stitch documents together.
  2. Notes stitched together this way are easier to review.
  3. Submit related questions as an Issue.

1 Topic models

See Werger (2020).

In this exercise you’re going to work with text data, containing emails from Enron employees. The Enron scandal is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as “sell enron stock”.

The exercise uses email text data from the Enron scandal.
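A minimal sketch of that keyword search, assuming the cleaned email text is available as plain strings. The sample emails and the `search_terms` list are made up for illustration (only "sell enron stock" comes from the text above). In the notebook itself this would more likely be a pandas `str.contains` filter on the `clean_content` column.

```python
import re
from collections import Counter

# Hypothetical cleaned emails (lowercased, punctuation removed).
emails = [
    "please sell enron stock before the end of the quarter",
    "hey wearing target purple shirt today",
    "question about my stock bonus and vesting date",
    "meeting moved to friday afternoon",
]

# Illustrative search terms; only the first one appears in the course text.
search_terms = ["sell enron stock", "stock bonus"]

# One alternation pattern, with every term escaped so it is matched literally.
pattern = re.compile("|".join(re.escape(term) for term in search_terms))

# flag = 1 when an email mentions at least one term, otherwise 0.
flags = [1 if pattern.search(text) else 0 for text in emails]
flagged_emails = [text for text, flag in zip(emails, flags) if flag == 1]

flag_counts = Counter(flags)
print(flag_counts)
```

Counting the flag values afterwards gives the same kind of summary as a `value_counts()` call on a `flag` column.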

1.1 EDA

(2090, 6)
| | Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|---|
| 0 | < | ('') | ('') | 2002-01-29 23:20:55 | INVESTools AdvisoryFree Digest of Trusted … | investools advisory free digest trusted invest… |
| 1 | < | ('') | ('') | 2000-09-20 19:07:00 | ----- Forwarded by Richard B Sanders/HOU/ECT o… | forwarded richard b sanders hou ect pm justin … |
| 2 | < | ('m..love@enron.com') | ('m..love@enron.com') | 2001-10-30 16:15:17 | hey you are not wearing your target purple shi… | hey wearing target purple shirt today mine wan… |
| 3 | < | ('') | ('') | 2002-01-30 17:54:18 | Leslie Milosevich1042 Santa Clara Avenue… | leslie milosevich santa clara avenue alameda c… |
| 4 | < | ('') | ('') | 2002-01-30 19:36:01 | Rini Twait1010 E 5th Ave, CO 80501… | rini twait e th ave longmont co rtwait graphic… |

| | Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|---|
| 154 | < | ('') | ('') | 2002-02-01 14:53:35 | Venture: A 1997 Enron Meeting Belies O… | joint venture enron meeting belies officers cl… |

(314, 6)

0    1776
1     314
Name: flag, dtype: int64

1.2 LDA

1.2.1 Defining stopwords
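A minimal sketch of this step, assuming stopwords are removed after lowercasing and stripping everything except letters. The stopword set is a small hand-written placeholder rather than the list used in the notebook, `clean_text` is a hypothetical helper, and the input is a made-up sentence modelled on row 2 of the table above.

```python
import re

# Placeholder stopword set; a real run would use a much larger list
# plus corpus-specific terms (for example email header words).
stop_words = {"a", "an", "and", "are", "is", "not", "of", "the", "to", "you", "your"}

def clean_text(text, stop_words):
    """Lowercase, keep runs of letters only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

raw = "Hey, you are not wearing your target purple shirt today!"
tokens = clean_text(raw, stop_words)
print(tokens)
```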

(list, gensim.corpora.dictionary.Dictionary)

(2090, 36544)
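The two outputs above look like the types of a token list and a gensim `Dictionary`, followed by the shape of a document-term representation (2090 documents, 36544 distinct tokens); that reading is an assumption. To make the idea concrete without depending on gensim, here is a plain-Python sketch of a bag-of-words encoding. `build_vocab` and `to_bow` are hypothetical helpers that only imitate the idea and are not gensim's implementation.

```python
from collections import Counter

# Two tiny tokenized documents, invented for illustration.
docs = [
    ["enron", "stock", "sell", "stock"],
    ["energy", "company", "enron"],
]

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Encode one document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]

# (number of documents, vocabulary size), analogous to the shape printed above.
shape = (len(corpus), len(vocab))
print(shape)
```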

(0, '0.020*"enron" + 0.006*"company" + 0.004*"said" + 0.004*"new" + 0.004*"would"')
(1, '0.025*"enron" + 0.009*"energy" + 0.009*"pm" + 0.007*"employees" + 0.006*"company"')
(2, '0.010*"bakernet" + 0.008*"wj" + 0.007*"mailto" + 0.007*"conference" + 0.006*"mail"')
(3, '0.035*"enron" + 0.011*"message" + 0.010*"hou" + 0.010*"original" + 0.009*"pm"')
(4, '0.040*"td" + 0.029*"image" + 0.029*"net" + 0.027*"money" + 0.026*"tr"')
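Each printed topic is a string of weight-times-word terms joined by plus signs. A small sketch for turning such a string into (word, weight) pairs, which makes topics easier to compare programmatically. `parse_topic` is a hypothetical helper, and the input is topic 1 from the output above.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_1 = '0.025*"enron" + 0.009*"energy" + 0.009*"pm" + 0.007*"employees" + 0.006*"company"'
terms = parse_topic(topic_1)
print(terms)
```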

Fantastic work! You have now successfully created your first topic model on the Enron email data. However, the print of words doesn’t really give you enough information to find a topic that might lead you to signs of fraud. You’ll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data. You’ll learn more about this in the next video.

The printed topics do not clearly reveal a fraud-related topic.

Topic 1 seems to discuss the employee share option program, and seems to point to internal conversation (with “please, may, know” etc), so this is more likely to be related to the internal accounting fraud and trading stock with insider knowledge. Topic 3 seems to be more related to general news around Enron.

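A PCA visualization shows how far apart the topics are, which can also help judge whether a topic relates to fraud. For example, topic 1 sits relatively apart from the others.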
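The distances behind such a plot can also be inspected directly. The sketch below assumes each topic is a probability distribution over the same vocabulary and uses the Jensen-Shannon divergence as the distance. That metric is one common choice for inter-topic distance maps and is an assumption here, since the notes do not say which metric was used. The three toy topics are invented.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and at most 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Invented topic-word distributions over a 4-word vocabulary.
topics = {
    0: [0.50, 0.30, 0.15, 0.05],
    1: [0.05, 0.05, 0.10, 0.80],
    2: [0.45, 0.35, 0.15, 0.05],
}

def mean_distance(topic_id):
    """Average divergence from one topic to all the others."""
    others = [t for t in topics if t != topic_id]
    return sum(js_divergence(topics[topic_id], topics[t]) for t in others) / len(others)

# The topic that is, on average, farthest from the rest.
most_isolated = max(topics, key=mean_distance)
print(most_isolated)
```

A topic with a large average distance to the others is the kind that appears isolated on a 2-D projection.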

See Prabhakaran (2018).

| | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.7949 | enron, company, said, new, would, energy, gas,… | [investools, advisory, free, digest, trusted, … |
| 1 | 1 | 0.0 | 0.8600 | enron, company, said, new, would, energy, gas,… | [forwarded, richard, b, sanders, hou, pm, just… |
| 2 | 2 | 0.0 | 0.6064 | enron, company, said, new, would, energy, gas,… | [hey, wearing, target, purple, shirt, today, m… |
| 3 | 3 | 1.0 | 0.9935 | enron, energy, pm, employees, company, schedul… | [leslie, milosevich, santa, clara, avenue, ala… |
| 4 | 4 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [rini, twait, e, th, ave, longmont, co, rtwait… |
| 5 | 5 | 1.0 | 0.9933 | enron, energy, pm, employees, company, schedul… | [laurent, meillier, jackson, stq, oakland, ca,… |
| 6 | 6 | 0.0 | 0.8824 | enron, company, said, new, would, energy, gas,… | [dear, mr, whalley, hope, take, time, read, co… |
| 7 | 7 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [jan, nissl, e, state, boise, id, jnissl, heal… |
| 8 | 8 | 0.0 | 0.6629 | enron, company, said, new, would, energy, gas,… | [much, nice, machine, original, message, baumb… |
| 9 | 9 | 1.0 | 0.9934 | enron, energy, pm, employees, company, schedul… | [leigh, clemons, seyburn, drive, baton, rouge,… |

| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 1.0000 | enron, company, said, new, would, energy, gas,… | [joint, venture, enron, meeting, belies, offic… |
| 1 | 1.0 | 0.9991 | enron, energy, pm, employees, company, schedul… | [inline, attachment, follows, scasey, tfsbroke… |
| 2 | 2.0 | 0.9981 | bakernet, wj, mailto, conference, mail, e, ama… | [transmission, expansion, systems, transition,… |
| 3 | 3.0 | 0.9999 | enron, message, hou, original, pm, thanks, ple… | [original, message, received, thu, aug, cdt, e… |
| 4 | 4.0 | 0.9996 | td, image, net, money, tr, width, class, heigh… | [unsubscribe, mailing, please, go, money, net,… |
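Both tables can be derived from each document's topic distribution: the dominant topic is the one with the highest share, and the most representative document for a topic is the one where that share is largest. This is a sketch under the assumption that each document's distribution is a list of `(topic_id, probability)` pairs, which is the shape a gensim LDA model returns for a document. The numbers are invented.

```python
# Invented per-document topic distributions as (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.79), (1, 0.15), (3, 0.06)],
    [(0, 0.10), (1, 0.86), (2, 0.04)],
    [(1, 0.99), (4, 0.01)],
]

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(distribution, key=lambda pair: pair[1])

# One row per document: (document_no, dominant_topic, contribution).
dominant = [(doc_no, *dominant_topic(dist)) for doc_no, dist in enumerate(doc_topics)]

# For each topic, keep the document in which it contributes the most.
representative = {}
for doc_no, topic_id, share in dominant:
    if topic_id not in representative or share > representative[topic_id][1]:
        representative[topic_id] = (doc_no, share)

print(dominant)
print(representative)
```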

1.3 Further improvements

Topic modelling: discover hidden patterns in text data

  1. Discovering topics in text data
  2. “What is the text about”
  3. Conceptually similar to clustering data
  4. Compare topics of fraud cases to non-fraud cases and use as a feature or flag
  5. Or is there a particular topic in the data that seems to point to fraud?

With a topic model in hand, the EDA comes down to these questions:

  1. Are there any suspicious topics? (no labels)
  2. Are the topics in fraud and non-fraud cases similar? (with labels)
  3. Are fraud cases associated more with certain topics? (with labels)
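When labels are available, the second and third questions reduce to comparing topic usage between the two groups. This is a sketch assuming each document has a dominant topic and a 0/1 `flag`. The data is invented and `topic_rate_by_flag` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Invented (dominant_topic, flag) pairs; flag 1 marks a suspected fraud email.
rows = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (3, 0), (3, 0), (0, 1)]

def topic_rate_by_flag(rows):
    """For each flag value, return the share of documents per dominant topic."""
    counts = defaultdict(Counter)
    for topic, flag in rows:
        counts[flag][topic] += 1
    return {
        flag: {topic: n / sum(c.values()) for topic, n in c.items()}
        for flag, c in counts.items()
    }

rates = topic_rate_by_flag(rows)
print(rates)
```

A topic whose share is much higher among flagged documents than among unflagged ones is a candidate for closer inspection.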

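Clearly, this can be framed as a supervised learning problem: use the topic model's output as features and train a supervised model on them.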
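A dependency-free sketch of that idea, assuming each document's topic distribution is expanded into a fixed-length vector and fed to a classifier together with the 0/1 flag. The notes do not name a classifier; logistic regression trained by gradient descent is used here only because it fits in a few lines without extra libraries. All data and helper names (`to_dense`, `train_logistic`, `predict`) are invented for illustration.

```python
import math

def to_dense(distribution, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length vector."""
    vec = [0.0] * num_topics
    for topic_id, prob in distribution:
        vec[topic_id] = prob
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this document.
            err = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) - yi
            for j, x in enumerate(xi):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(x, weights, bias):
    """Return 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0

# Invented topic distributions and flags: flagged emails lean on topic 1.
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.8), (2, 0.2)],
    [(1, 0.9), (2, 0.1)],
    [(1, 0.7), (0, 0.3)],
    [(2, 0.9), (0, 0.1)],
    [(1, 0.95), (2, 0.05)],
]
flags = [0, 0, 1, 1, 0, 1]

X = [to_dense(d, num_topics=3) for d in doc_topics]
weights, bias = train_logistic(X, flags)
predictions = [predict(x, weights, bias) for x in X]
print(predictions)
```

In practice the features would come from the fitted LDA model, and the model would be evaluated on held-out emails rather than on the training set as in this toy example.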

Appendix

References

Prabhakaran, Selva. 2018. “Topic Modeling with Gensim (Python).” Machine Learning Plus. 2018. machinelearningplus.com/nlp/topic-modeling-gensim-python/#18dominanttopicineachsentence.

Werger, Charlotte. 2020. “Fraud Detection in Python.” DataCamp. 2020. https://learn.datacamp.com/courses/fraud-detection-in-python.