---
title: Title
keywords: fastai
sidebar: home_sidebar
summary: "summary"
---

Preprocessing

make_df[source]

make_df(csv_name)

Read csv_name into a data frame and add a text column holding the Content column segmented with jieba.
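The source of make_df is not shown on this page, so the following is only a minimal sketch inferred from the head() output: a Content column plus a text column of space-separated tokens. The cut and read_csv_kwargs parameters are hypothetical additions for illustration, and the Content column name is an assumption taken from that output.

```python
def make_df(csv_name, cut=None, **read_csv_kwargs):
    """Read a CSV and add a 'text' column with space-separated tokens.

    Assumptions: the CSV has a 'Content' column; `cut` and
    `read_csv_kwargs` are hypothetical extras, not part of the original API.
    """
    import pandas as pd  # third-party, imported lazily

    if cut is None:
        import jieba  # third-party Chinese word segmenter
        cut = jieba.cut  # yields the tokens of a string

    df = pd.read_csv(csv_name, **read_csv_kwargs)
    # Join the tokens with spaces so a bag-of-words vectorizer can split them later.
    df["text"] = df["Content"].astype(str).apply(lambda s: " ".join(cut(s)))
    return df
```

If a file is not UTF-8, an encoding can be passed through to pandas.read_csv, for example make_df("data/affirmative.csv", encoding="gbk"); whether these files need that is not stated here.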

affirmative = make_df("data/affirmative.csv")
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\LIJIAX~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.083 seconds.
Prefix dict has been built succesfully.
affirmative.head()
Group Students Content text
[5 rows: Content holds the original Chinese text and text holds the same text segmented by jieba into space-separated tokens. The cell values are unreadable in this export because of a character-encoding error.]
negative = make_df("data/negative.csv")
negative.head()
Group Students Content text
[5 rows: Content holds the original Chinese text and text holds the same text segmented by jieba into space-separated tokens. The cell values are unreadable in this export because of a character-encoding error.]

LDA

chinese_word_cut[source]

chinese_word_cut(mytext)

print_top_words(model, feature_names, n_top_words)
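The body of print_top_words is not shown here. A minimal sketch, assuming model exposes a components_ matrix with one row of term weights per topic (as scikit-learn's LDA does) and that the output follows the "Topic #k:" format seen in the results on this page. The top_words helper is hypothetical and exists only to make the ranking step easy to check.

```python
def top_words(components, feature_names, n_top_words):
    """Return the n_top_words highest-weighted terms for each topic row."""
    result = []
    for weights in components:
        # Indices of the largest weights first.
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([feature_names[i] for i in order[:n_top_words]])
    return result


def print_top_words(model, feature_names, n_top_words):
    """Print one 'Topic #k:' header and one keyword line per topic."""
    for k, words in enumerate(top_words(model.components_, feature_names, n_top_words)):
        print("Topic #%d:" % k)
        print(" ".join(words))
```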

get_custom_stopwords[source]

get_custom_stopwords(stop_words_file, encoding='utf-8')
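A minimal sketch of get_custom_stopwords, assuming the stop word file lists one word per line. The file format is an assumption; the original implementation is not shown on this page.

```python
def get_custom_stopwords(stop_words_file, encoding="utf-8"):
    """Read one stop word per line and return them as a list."""
    with open(stop_words_file, encoding=encoding) as f:
        # Strip surrounding whitespace and skip blank lines.
        return [line.strip() for line in f if line.strip()]
```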

stopwords = get_custom_stopwords("data/stopwords.txt", encoding='utf-8') # HIT stop word list
max_df = 0.9 # drop keywords that appear in more than this proportion of documents (too common)
min_df = 5 # drop keywords that appear in fewer than this many documents (too rare)
n_features = 1000 # maximum number of features to extract
n_top_words = 20 # number of keywords to show for each topic
col_content = "text" # name of the column that holds the text
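To make the effect of max_df, min_df, and n_features concrete, here is a small standard-library sketch of the same kind of vocabulary pruning. It assumes whitespace tokenization and breaks frequency ties alphabetically; filter_vocabulary is a hypothetical name, and the code approximates rather than reproduces what a library vectorizer does.

```python
from collections import Counter


def filter_vocabulary(docs, max_df=0.9, min_df=5, n_features=1000):
    """Prune a vocabulary by document frequency, then cap its size."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    high = max_df * len(docs)  # max_df is a proportion of the documents
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= high]
    # Keep the n_features most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:n_features])
```

With four documents and max_df = 0.9, a term that occurs in all four is dropped because 4 exceeds 0.9 * 4 = 3.6, while min_df removes terms seen in too few documents.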

lda_on_chinese_articles_with_param[source]

lda_on_chinese_articles_with_param(df, n_topics, col_content, stopwords, n_features, max_df, min_df, n_top_words)
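The function body is not shown on this page. The sketch below is one plausible shape for it, assuming scikit-learn's CountVectorizer and LatentDirichletAllocation, and assuming the return order (lda, tf, vect) used by the calls further down. The choice of random_state=0 and the default LDA settings are illustrative, not taken from the original.

```python
def lda_on_chinese_articles_with_param(df, n_topics, col_content, stopwords,
                                       n_features, max_df, min_df, n_top_words):
    """Vectorize df[col_content], fit LDA, and print the top words per topic."""
    # Third-party imports are local so scikit-learn is only needed when called.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vect = CountVectorizer(max_features=n_features, stop_words=stopwords,
                           max_df=max_df, min_df=min_df)
    tf = vect.fit_transform(df[col_content])  # document-term count matrix

    # Recent scikit-learn releases name the topic count n_components.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(tf)

    # get_feature_names_out exists in newer releases; fall back for older ones.
    get_names = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
    names = list(get_names())
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[::-1][:n_top_words]  # largest weights first
        print("Topic #%d:" % k)
        print(" ".join(names[i] for i in top))
    return lda, tf, vect
```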

lda_on_chinese_articles[source]

lda_on_chinese_articles(df, n_topics)

lda, tf, vect = lda_on_chinese_articles(df = affirmative, n_topics = 3)
pyLDAvis.sklearn.prepare(lda, tf, vect)
D:\install\miniconda\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['lex', ...] not in stop_words.
  'stop_words.' % sorted(inconsistent))
Topic #0:
[20 keywords, unreadable in this export because of a character-encoding error]
Topic #1:
[20 keywords, unreadable in this export because of a character-encoding error]
Topic #2:
[20 keywords, unreadable in this export because of a character-encoding error]
D:\install\miniconda\lib\site-packages\pyLDAvis\_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
[Interactive pyLDAvis output: Intertopic Distance Map (via multidimensional scaling) on axes PC1 and PC2, showing the marginal topic distribution for topics 1 to 3; a slider adjusts the relevance metric from 0.0 to 1.0; bars show overall term frequency and estimated term frequency within the selected topic.]
Top-30 Most Salient Terms: 学习 教育 学生 发展 交流 学习者 传统 大学 方式 已经 我国 老师 过程 社会 讨论 教师 视频 时间 网络 不能 时代 一种 国家 兴趣 目前 自主 现在 取代 合作 课堂
1. saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for topics t; see Chuang et al. (2012)
2. relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w); see Sievert & Shirley (2014)
TypeError: __init__() got an unexpected keyword argument 'n_topics'

This kind of error usually means that a keyword argument name was misspelled or is missing letters.

Copyright notice: this is an original article by CSDN blogger zhuimengshaonian66, licensed under CC 4.0 BY-SA; reposts must include the original source link and this notice. Original link: https://blog.csdn.net/zhuimengshaonian66/article/details/81700959

Here the cause is that the parameter has been renamed to n_components.
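In scikit-learn, LatentDirichletAllocation takes the number of topics as n_components; the older n_topics keyword was deprecated and later removed, which is what produces the TypeError above. A minimal sketch of the fix, where make_lda is a hypothetical wrapper that keeps an n_topics argument for callers and forwards it under the new name:

```python
def make_lda(n_topics, **kwargs):
    """Build an LDA model, mapping the old n_topics name to n_components."""
    from sklearn.decomposition import LatentDirichletAllocation  # third-party

    # Passing n_topics=... directly raises the TypeError on recent releases.
    return LatentDirichletAllocation(n_components=n_topics, **kwargs)
```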

lda, tf, vect = lda_on_chinese_articles(df = negative, n_topics = 3)
pyLDAvis.sklearn.prepare(lda, tf, vect)
D:\install\miniconda\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['lex', ...] not in stop_words.
  'stop_words.' % sorted(inconsistent))
Topic #0:
[20 keywords, unreadable in this export because of a character-encoding error]
Topic #1:
[20 keywords, unreadable in this export because of a character-encoding error]
Topic #2:
[20 keywords, unreadable in this export because of a character-encoding error]
D:\install\miniconda\lib\site-packages\pyLDAvis\_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
[Interactive pyLDAvis output: Intertopic Distance Map (via multidimensional scaling) on axes PC1 and PC2, showing the marginal topic distribution for topics 1 to 3; a slider adjusts the relevance metric from 0.0 to 1.0; bars show overall term frequency and estimated term frequency within the selected topic.]
Top-30 Most Salient Terms: 课堂 传统 课程 学生 现在 没有 发展 取代 高校 教育 互动 交流 无法 韩高英 学习 学习者 进步 效果 师生 老师 网上 不是 实际 中国 情况 讨论 大学生 线下 面对面 时间
1. saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for topics t; see Chuang et al. (2012)
2. relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w); see Sievert & Shirley (2014)

Reference: https://github.com/bmabey/pyLDAvis/issues/132

D:\install\miniconda\lib\site-packages\pyLDAvis\_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))

Reinstalling did not resolve the warning.
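The FutureWarning is emitted by a pd.concat call inside pyLDAvis itself, so it does not depend on how the package was installed. If the visualization is otherwise correct and the message is only noise, one option is to filter it with the standard warnings module before calling pyLDAvis.sklearn.prepare. This is a sketch under that assumption; silence_concat_sort_warning is a hypothetical helper name.

```python
import warnings


def silence_concat_sort_warning():
    """Ignore the pandas FutureWarning about sorting in pd.concat."""
    warnings.filterwarnings(
        "ignore",
        # Matched as a regular expression against the start of the message.
        message="Sorting because non-concatenation axis is not aligned",
        category=FutureWarning,
    )
```

Matching on the message text keeps other FutureWarning messages visible.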

pyLDAvis.__version__
'2.1.2'
pd.__version__
'0.24.2'
# !pip install pyldavis
import pickle as pkl
with open("model/sklearn-lda.pkl", 'wb') as fp:
    pkl.dump(lda, fp)
with open("model/sklearn-lda.pkl", 'rb') as fp:
    model0 = pkl.load(fp)
    print(model0.__class__)
<class 'sklearn.decomposition.online_lda.LatentDirichletAllocation'>