Python Literate Programming: Study Notes
2020-09-07
- These notes are stitched together with the child option of RMarkdown.
- Notes combined this way are easier to review.
- Please report related problems as an Issue.
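nbdev (footnote 1) and nbconvert both support literate programming in Python. Both are built on Jupyter Notebook, so the notes below cover three topics: nbdev, nbconvert, and Jupyter Notebook itself.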
1 Installation
2 Editing the metadata
First edit settings.ini:
# All sections below are required unless otherwise specified
lib_name = test_nbdev
user = JiaxiangBU
description = Test nbdev
keywords = nbdev
author = Jiaxiang Li
author_email = alex.lijiaxiang@foxmail.com
copyright = Jiaxiang Li
Otherwise nbdev_build_lib raises an error.
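Because a missing field makes nbdev_build_lib fail, it can help to check settings.ini before building. The sketch below is a rough pre-flight check that uses only the standard library. The list of required keys is an assumption copied from the fields shown above (it is not nbdev's authoritative list), and a [DEFAULT] section header is assumed.

```python
import configparser

# Assumption: these are the keys shown in the notes above, not nbdev's official list.
REQUIRED_KEYS = ["lib_name", "user", "description", "keywords",
                 "author", "author_email", "copyright"]

def missing_settings(ini_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in settings.ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    defaults = parser.defaults()  # keys under the [DEFAULT] section
    return [key for key in required if not defaults.get(key, "").strip()]

example = """
[DEFAULT]
lib_name = test_nbdev
user = JiaxiangBU
description = Test nbdev
keywords = nbdev
author = Jiaxiang Li
"""

print(missing_settings(example))  # ['author_email', 'copyright']
```

An empty list means every field in the assumed list is filled in.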
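3 Creating a function
Declare the target module at the top of the notebook.
Exported functions are then saved to lib_name/core.py.
Generate the function.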
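As a sketch of what such a notebook might contain, assuming the comment flags of nbdev v1 (default_exp in the first cell, export on each cell that should be saved): the function say_hello is hypothetical, and three notebook cells are shown here as one listing. In a real notebook each flag is the first line of its own cell.

```python
# Cell 1: declare the target module, so exports go to lib_name/core.py
# default_exp core

# Cell 2: the export flag marks this cell for nbdev_build_lib
# export
def say_hello(name):
    "Return a greeting for `name` (hypothetical example function)."
    return f"Hello, {name}!"

# Cell 3: no flag, so this cell stays in the notebook as an example or test
assert say_hello("nbdev") == "Hello, nbdev!"
```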
5 nbconvert
5.1 Converting to Markdown
Specify the output location:
$ jupyter nbconvert ../wei_lda_debate/dtm.ipynb --to markdown --output ../learn_nlp/output/dtm.md
[NbConvertApp] Converting notebook ../wei_lda_debate/dtm.ipynb to markdown
[NbConvertApp] Writing 138233 bytes to ../wei_lda_debate\../learn_nlp/output/dtm.md
Converting to Markdown with nbdev did not work:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 1293: illegal multibyte sequence
This is a Unicode (encoding) problem.
Reference: jupyter notebook convert to markdown using command line
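The gbk in the message suggests that some file was read with the Windows locale encoding rather than UTF-8. A minimal sketch of the usual workaround, assuming that is the cause here, is to pass the encoding explicitly whenever notebook-related text is read or written. The file name is a placeholder.

```python
import tempfile
from pathlib import Path

text = "# 标题\nprint('你好')\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo.md"  # placeholder file name
    # Always state the encoding; otherwise Python may fall back to the
    # locale encoding (GBK on many Chinese-language Windows setups).
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

print(round_trip == text)  # True
```

Setting the environment variable PYTHONUTF8=1 (Python 3.7 and later) is another option, since it makes UTF-8 the default for file I/O.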
5.2 Converting to a script
5.3 Specifying the output directory
$ jupyter nbconvert --to markdown analysis/200212_varmod_gaowenxin.ipynb --output-dir tmp
[NbConvertApp] Converting notebook analysis/200212_varmod_gaowenxin.ipynb to markdown
[NbConvertApp] Support files will be in 200212_varmod_gaowenxin_files\
[NbConvertApp] Making directory tmp\200212_varmod_gaowenxin_files
[NbConvertApp] Making directory tmp\200212_varmod_gaowenxin_files
[NbConvertApp] Making directory tmp\200212_varmod_gaowenxin_files
[NbConvertApp] Making directory tmp\200212_varmod_gaowenxin_files
[NbConvertApp] Writing 68756 bytes to tmp\200212_varmod_gaowenxin.md
--output-dir tmp writes the output to the tmp directory under the current working directory.
5.4 Keeping all Ex files under libs
See https://nbconvert.readthedocs.io/en/latest/config_options.html
5.5 Printing only the results
See https://stackoverflow.com/questions/49907455/hide-code-when-exporting-jupyter-notebook-to-html
5.6 Markdown without outputs
paste(
"jupyter nbconvert --to markdown --output-dir tmp",
"--TemplateExporter.exclude_output=True",
"*.ipynb"
)
## [1] "jupyter nbconvert --to markdown --output-dir tmp --TemplateExporter.exclude_output=True *.ipynb"
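The same command can be assembled in Python. This sketch only builds and prints the command, using the flags from the R snippet above. Running it (for example with subprocess.run) is left out, and the shell expansion of *.ipynb is replaced by an explicit glob. The function name nbconvert_command is made up for the illustration.

```python
import glob
import shlex

def nbconvert_command(notebooks, output_dir="tmp", exclude_output=True):
    """Build a `jupyter nbconvert` argument list (the command is not executed)."""
    cmd = ["jupyter", "nbconvert", "--to", "markdown", "--output-dir", output_dir]
    if exclude_output:
        # Drop cell outputs from the exported Markdown
        cmd.append("--TemplateExporter.exclude_output=True")
    return cmd + list(notebooks)

# glob expands *.ipynb here, since no shell is involved
cmd = nbconvert_command(sorted(glob.glob("*.ipynb")))
print(shlex.join(cmd))
```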
6 blackcellmagic
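Formats code in Jupyter Notebook.
See https://github.com/csurfer/blackcellmagic
Load the extension with %load_ext blackcellmagic,
then add %%black at the top of a code block.
After the block is run, its code is reformatted.
Reformatting all the code at once is also possible, but it tends to raise errors.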
7 notedown
notedown input.Rmd --knit > output.ipynb
notedown index.md --run > index2.ipynb
notedown index.md > index2.ipynb
%% applies to a whole block and must appear on its first line; % applies to a single line.
“Jupyter Notebooks run on the browser on the localhost. Therefore, they’re OS-independent. In other words, the experience will be the same regardless of whether you’re on a Mac, a PC, or a Linux box.” (Banik 2018)
OS independence is one area where Jupyter Notebook has an edge over tools such as RStudio.
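8 Installing Jupyter
9 Opening Jupyter Notebook
10 Changing the default startup directory
On Windows 7, for example, storing files on drive C can make the machine sluggish.
So how can the notebooks be moved to drive D?
Following this post, enter the following command in cmd (footnote 2):
jupyter notebook --generate-config
This creates a .py file in which every line is commented out, so it does not apply any setting yet.
Open the file at C:\Users\username\.jupyter\jupyter_notebook_config.py and search for the comment
## The directory to use for notebooks and kernels.
It describes the directory used for notebooks. Below it is the line
#c.NotebookApp.notebook_dir = u''
First change it to
#c.NotebookApp.notebook_dir = u'D:\jupyter'
which points to a folder on drive D. Then remove the leading # so that it becomes
c.NotebookApp.notebook_dir = u'D:\jupyter'
Uncommenting the line makes the setting take effect, and the change is done.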
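Note, following Stack Overflow, that the path must not be written as 'D:\jupyter\': the trailing \' is read as an escaped ' rather than as the end of the string.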
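The trailing-backslash problem is ordinary Python string escaping, because the config file is a Python script. The sketch below shows spellings that avoid it. The folder D:\jupyter is the one used above, and the variable names are only illustrative.

```python
# Three ways to write the same Windows folder safely in Python source.
raw_path = r"D:\jupyter"          # raw string: backslashes are kept literally
escaped_path = "D:\\jupyter"      # doubled backslash
forward_path = "D:/jupyter"       # forward slashes usually work on Windows too

print(raw_path == escaped_path)   # True: both spell the same text

# A quote right after a backslash is escaped, so the string never closes.
bad_source = "path = 'D:\\jupyter\\'"   # the text: path = 'D:\jupyter\'
try:
    compile(bad_source, "<config>", "exec")
except SyntaxError as err:
    print("SyntaxError:", err.msg)
```

The form u'D:\jupyter' happens to work because \j is not an escape sequence, but recent Python versions warn about it, so a raw string is the safer habit.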
Or directly
11 Showing a table of contents
Install:
conda install -c conda-forge jupyter_contrib_nbextensions
Restart jupyter, open the Nbextensions tab, filter by toc, and select Table of Contents.
12 Important keyboard shortcuts
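Number keys: turn the cell into a heading; the larger the number, the smaller the heading
L: show line numbers for the code
m: turn the cell into a Markdown cell, in which documentation can be written
In any cell, press Esc (top left of the keyboard) and then m to switch to Markdown mode. # marks a level-1 heading and ## a level-2 heading.
y: turn the cell into a code cell, in which Python code can be entered
d: press d twice to delete the cell
z: undo the last cell deletion
If you go to “Edit”, there’s an option for “Undo Delete Cells”. (Stack Overflow)
a: create a new cell above the current cell
b: create a new cell below the current cell
x: cut the current cell
c: copy the current cell
v: paste the cell from the clipboard below the current cell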
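shift+k: extend the selection to the cell above (in current Jupyter Notebook this selects rather than moves the cell)
shift+j: extend the selection to the cell below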
shift+m: merge with the cell below
ctrl+/: comment out the selected lines of code (Stack Overflow)
13 Toggle code
Purpose: collapse the code and keep only the Markdown, body text, and figures.
14 Hiding code
This works like code_folding in RMarkdown (Masnick 2015).
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)
# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)
# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)
See Rogozhnikov (2016) and 李中梁 (2019).
15 Calling external Python scripts
This section covers %run and %load.
“%run can execute python code from .py files – this is well-documented behavior. Lesser known is the fact that it can also execute other jupyter notebooks, which can quite useful.” (Rogozhnikov 2016)
On %load: “This will replace the contents of the cell with an external script. You can either use a file on your computer as a source, or alternatively a URL.” (Rogozhnikov 2016)
%run can execute both Python scripts and notebooks.
Because this is well-documented behavior, it is reasonable to ask everyone to make it a long-term habit.
# this will execute and show the output from
# all code cells of the specified notebook
%run ./two-histograms.ipynb
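15.1 Using multiple kernels
R and Python can run in the same notebook, although the %R line magic limits the R code to a single line (the %%R cell magic takes a whole block).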
%load_ext rpy2.ipython
%R require(ggplot2)
array([1], dtype=int32)
import pandas as pd
df = pd.DataFrame({ 'Letter': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'X': [4, 3, 5, 2, 1, 7, 7, 5, 9], 'Y': [0, 4, 3, 6, 7, 10, 11, 9, 13], 'Z': [1, 2, 3, 1, 2, 3, 1, 2, 3] })
%%R -i df
ggplot(data = df) + geom_point(aes(x = X, y= Y, color = Letter, size = Z))
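15.2 Passing objects between notebooks
The %store command lets you pass variables between two different notebooks. (Rogozhnikov 2016)
This avoids repeating the same code in several notebooks, which helps remove duplicated-code smells.
16 Listing the current objects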
The %who command without any arguments will list all variables that existing in the global scope. Passing a parameter like str will list only variables of that type. (Rogozhnikov 2016)
This prints the current variables, much like dir.
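For comparison, here is a rough plain-Python analogue of %who with a type filter. It is not how IPython implements the magic; the helper name who and the sample variables are made up for the illustration.

```python
def who(namespace, type_=None):
    """List public names in `namespace`, optionally only those of `type_`."""
    return sorted(
        name for name, value in namespace.items()
        if not name.startswith("_")
        and (type_ is None or isinstance(value, type_))
    )

sample = {"title": "notes", "count": 3, "author": "JL", "_hidden": "x"}
print(who(sample))        # ['author', 'count', 'title']
print(who(sample, str))   # ['author', 'title']
```

Passing globals() instead of sample lists the names defined in the current module.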
17 Timing the current block
%%time will give you information about a single run of the code in your cell. (Rogozhnikov 2016)
This is more convenient than calling time.time by hand, and it makes slow code (a code smell) easy to spot.
%%time
import time
for _ in range(1000):
    time.sleep(0.01) # sleep for 0.01 seconds
CPU times: user 21.5 ms, sys: 14.8 ms, total: 36.3 ms
Wall time: 11.6 s
%%timeit uses the Python timeit module which runs a statement 100,000 times (by default) and then provides the mean of the fastest three times. (Rogozhnikov 2016)
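%timeit times a single line, while %%timeit times a whole cell. Both rerun the code many times, so use them with care or they will be slow.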
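Outside a notebook, similar measurements can be sketched with the standard library: time.perf_counter for a single run (in the spirit of %%time) and timeit.repeat for repeated runs (in the spirit of %timeit). The workload function and the repeat counts below are arbitrary choices for the illustration.

```python
import time
import timeit

def workload():
    """A small stand-in task to measure."""
    return sum(i * i for i in range(10_000))

# One run, wall-clock time: comparable in spirit to %%time
start = time.perf_counter()
workload()
single_run = time.perf_counter() - start

# Several batches of repeated runs: comparable in spirit to %timeit
batches = timeit.repeat(workload, number=20, repeat=3)
best_per_call = min(batches) / 20

print(f"single run: {single_run:.6f} s")
print(f"best of 3 batches: {best_per_call:.6f} s per call")
```

Taking the minimum over batches reduces noise from other processes, which is why repeated timing is slower but more stable than a single run.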
18 Exporting a block
Using the %%writefile magic saves the contents of that cell to an external file. %pycat does the opposite, and shows you (in a popup) the syntax highlighted contents of an external file. (Rogozhnikov 2016)
This makes it very easy to export a Python script from a notebook so that other code can call it.
19 Profiling the time spent inside a function
%prun some_useless_slow_function()
26324 function calls in 0.556 seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    10000    0.527    0.000    0.528    0.000  :2(append_if_not_exists)
    10000    0.022    0.000    0.022    0.000  {method 'randint' of 'mtrand.RandomState' objects}
        1    0.006    0.006    0.556    0.556  :6(some_useless_slow_function)
     6320    0.001    0.000    0.001    0.000  {method 'append' of 'list' objects}
        1    0.000    0.000    0.556    0.556  :1()
        1    0.000    0.000    0.556    0.556  {built-in method exec}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
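The same kind of report can be produced without the magic by using cProfile directly. The two functions below are a reconstruction inferred from the names in the output above, so their bodies are an assumption; the standard random module stands in for NumPy.

```python
import cProfile
import io
import pstats
import random

def append_if_not_exists(items, value):
    # Linear membership test on a list: the slow part
    if value not in items:
        items.append(value)

def some_useless_slow_function(n=10_000):
    items = []
    for _ in range(n):
        append_if_not_exists(items, random.randint(0, n))
    return items

profiler = cProfile.Profile()
profiler.enable()
some_useless_slow_function()
profiler.disable()

# Sort by internal time, like "Ordered by: internal time" above
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("tottime").print_stats(5)
print(report.getvalue())
```

The tottime column shows that nearly all the time is spent in append_if_not_exists, which points to the list membership test as the thing to replace (for example with a set).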
20 rmarkdown convert_ipynb
See https://rmarkdown.rstudio.com/docs/reference/convert_ipynb.html
- Converting a notebook to Rmd this way can replace notedown.
Starting the R kernel
“Then, you will have to make Jupyter see the newly installed R kernel by installing a kernel spec. To install system-wide, set user to False in the installspec command.” (Phuriphanvichai 2019)
21 AttributeError: module ‘attr’ has no attribute ‘s’
See abarnert (2018).
$ jupyter nbconvert --to markdown --output-dir . --NbConvertApp.output_files_dir=libs \
> analysis/numpy_irr.ipynb
Traceback (most recent call last):
File "D:\install\miniconda\Scripts\jupyter-nbconvert-script.py", line 5, in <module>
from nbconvert.nbconvertapp import main
File "D:\install\miniconda\lib\site-packages\nbconvert\__init__.py", line 4, in <module>
from .exporters import *
File "D:\install\miniconda\lib\site-packages\nbconvert\exporters\__init__.py", line 1, in <module>
from .base import (export, get_exporter,
File "D:\install\miniconda\lib\site-packages\nbconvert\exporters\base.py", line 13, in <module>
from nbformat import NotebookNode
File "D:\install\miniconda\lib\site-packages\nbformat\__init__.py", line 33, in <module>
from .validator import validate, ValidationError
File "D:\install\miniconda\lib\site-packages\nbformat\validator.py", line 12, in <module>
from jsonschema import ValidationError
File "D:\install\miniconda\lib\site-packages\jsonschema\__init__.py", line 12, in <module>
from jsonschema.exceptions import (
File "D:\install\miniconda\lib\site-packages\jsonschema\exceptions.py", line 141, in <module>
@attr.s(hash=True)
AttributeError: module 'attr' has no attribute 's'
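A quick diagnostic sketch for this error: the traceback shows jsonschema calling attr.s, which is provided by the attrs distribution (imported as attr), so it is worth checking which attr module Python actually finds. A common cause, assumed here rather than confirmed for this environment, is that another package or a local file named attr shadows it. The function name describe_attr_module is made up for the illustration.

```python
import importlib
import importlib.util

def describe_attr_module():
    """Report where `attr` is imported from and whether it provides `attr.s`."""
    spec = importlib.util.find_spec("attr")
    if spec is None:
        return {"installed": False, "location": None, "has_s": False}
    module = importlib.import_module("attr")
    return {
        "installed": True,
        "location": getattr(module, "__file__", None),
        "has_s": hasattr(module, "s"),  # the decorator used in the traceback
    }

print(describe_attr_module())
```

If has_s is False, the location shows which file is being picked up instead of attrs; removing that package or file and reinstalling attrs is the usual next step to try.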
22 jupytext
$ ls
LICENSE README.md cosupload.py preprocess.py utils.py
$ jupytext --to notebook *.py
[jupytext] Reading cosupload.py
[jupytext] Writing cosupload.ipynb
[jupytext] Reading preprocess.py
[jupytext] Writing preprocess.ipynb
[jupytext] Reading utils.py
[jupytext] Writing utils.ipynb
Use jupytext to convert the scripts to notebooks.
See https://github.com/mwouts/jupytext#command-line-conversion and https://stackoverflow.com/a/59568527/8625228
23 Preview Notebook
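- One option is to convert the notebook to a Markdown document with jupyter nbconvert, so that everyone can preview it on GitHub.
- Another is to enable GitHub Pages on master, then open the notebook in nbviewer using the github.io address plus the file path.
25 Opening paths and pages quickly
Ctrl + F + Ctrl + Enter
This quickly opens files and paths from the Jupyter Notebook interface.
26 Running a notebook directly
27 Pair Notebook
See https://jupytext.readthedocs.io/en/latest/paired-notebooks.html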
Paired notebooks Jupytext can write a given notebook to multiple files. In addition to the original notebook file, Jupytext can save the input cells to a text file — either a script or a Markdown document. Put the text file under version control for a clear commit history. Or refactor the paired script, and reimport the updated input cells by simply refreshing the notebook in Jupyter.
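The py file stays in the same directory and a new ipynb file is generated next to it. The advantage is that when the py file changes, the ipynb is updated automatically and keeps its outputs, which makes them easy to consult. We can also run the py file in Jupyter Notebook, and it is easier to discuss the py file on GitHub. A notebook is stored as JSON, which is awkward to review. GitHub does offer paid services for this, but we are not at that scale yet, so this is the compromise for now.
In Jupyter Notebook, click File -> jupytext -> Pair Notebook with ipynb document.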
28 papermill
import papermill as pm

# Parameter values to sweep over, with zero-padded suffixes such as "07"
value_list = [7, 8, 9]
suffix_list = [str(i).zfill(2) for i in value_list]

# One executed notebook and one CSV output path per value
ipynb_notebook_path_list = [
    "lda-question-lenleq" + j + "-output.ipynb" for j in suffix_list
]
df_output_path_list = [
    "data/" + "document_with_topic_with_question-lenleq" + j + ".csv"
    for j in suffix_list
]
value_list, ipynb_notebook_path_list, df_output_path_list  # display in the notebook

for n_len_leq_value, filename, df_output_path_value in zip(
    value_list, ipynb_notebook_path_list, df_output_path_list
):
    # Step 1: rerun the preprocessing notebook with the current threshold
    pm.execute_notebook(
        "lda-short-text-preprocess.ipynb",
        "lda-short-text-preprocess-pm.ipynb",
        kernel_name="python3",
        parameters={"n_len_leq": n_len_leq_value},
        cwd=".",
    )
    # Step 2: run the LDA notebook and save the executed copy under `filename`
    pm.execute_notebook(
        "lda-short-text.ipynb",
        filename,
        kernel_name="python3",
        parameters={"df_output_path": df_output_path_value},
        cwd=".",
    )
29 nbdime
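Notebook merge responds slowly, but it works reasonably well.
The merge asks you to choose among local, base, and remote: base is the common ancestor, local is your own version, and remote is the incoming version.
30 ImportError: DLL load failed
Error message: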
$ "D:\install\miniconda\Scripts\jupyter-notebook.exe"
Traceback (most recent call last):
File "d:\install\miniconda\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "d:\install\miniconda\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\install\miniconda\Scripts\jupyter-notebook.exe\__main__.py", line 4,in <module>
File "d:\install\miniconda\lib\site-packages\notebook\notebookapp.py", line 64, in <module>
from tornado import httpserver
File "d:\install\miniconda\lib\site-packages\tornado\httpserver.py", line 29,in <module>
import ssl
File "d:\install\miniconda\lib\ssl.py", line 98, in <module>
import _ssl # if we can't import it, let the error propagate
ImportError: DLL load failed: xxx
Add the following directories to the PATH environment variable, then restart:
D:\install\miniconda\Library\mingw-w64\bin
D:\install\miniconda\usr\bin
D:\install\miniconda\Library\bin
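To confirm that the variables took effect after the restart, a small check like the following may help. It assumes the three directories above belong on PATH and uses the install prefix from the traceback; adjust both to the actual machine. The function name missing_from_path is made up for the illustration.

```python
import os

# Assumption: directories taken from the notes above (Miniconda under D:\install)
NEEDED_DIRS = [
    r"D:\install\miniconda\Library\mingw-w64\bin",
    r"D:\install\miniconda\usr\bin",
    r"D:\install\miniconda\Library\bin",
]

def missing_from_path(needed, path_value=None, sep=os.pathsep):
    """Return the directories in `needed` that are not listed in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Compare case-insensitively and ignore trailing slashes
    present = {
        part.strip().rstrip("\\/").lower()
        for part in path_value.split(sep)
        if part.strip()
    }
    return [d for d in needed if d.rstrip("\\/").lower() not in present]

print(missing_from_path(NEEDED_DIRS))
```

An empty list means all three directories are visible to the process that runs the check.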
Appendix
30.1 labs
30.1.1 sample_df.ipynb
30.1.2 sample_df.py
30.1.3 use_store.ipynb
References
abarnert. 2018. “AttributeError: Module ’Attr’ Has No Attribute ’S’.” Stack Overflow. 2018. https://stackoverflow.com/a/49228822/862522.
Banik, Rounak. 2018. Hands-on Recommendation Systems with Python. Packt Publishing.
Masnick, Max. 2015. “How to Hide Code from Cells in Ipython Notebook Visualized with Nbviewer?” 2015. https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer.
Phuriphanvichai, Jirapongse. 2019. “Using R on Jupyter Notebook.” Big Data Zone. 2019. https://dzone.com/articles/using-r-on-jupyternbspnotebook.
Rogozhnikov, Alex. 2016. “28 Jupyter Notebook Tips, Tricks, and Shortcuts.” Dataquest Labs, Inc. 2016. https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/.
李中梁. 2019. “Python编程神器Jupyter Notebook使用的28个秘诀.” 机器学习算法工程师. 2019. https://mp.weixin.qq.com/s/TCErnnnJkcPkEK3kf7gF2A.
1. Implemented from Jupyter Notebook with fastai's nbdev package. See https://github.com/fastai/nbdev
2. The setup is the same on a Mac.