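These notes are stitched together from separate files using the child chunk option of RMarkdown.
Combining them this way makes the notes easier to review.
Please submit related questions as an Issue.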

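1 Finding tags with regular expressions

Use rvest to make the corrections.
write_lines("tmp.txt") is a good approach: write the page source out to a file, then search it with regular expressions.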
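The write-then-search idea can be sketched as follows. This is only a minimal illustration: the HTML string is a made-up stand-in for a real page, a temporary file is used in place of tmp.txt, and the pattern (class or id attributes) is just one assumed example of what one might look for.

```r
library(xml2)
library(readr)

# Hypothetical page source; in practice this would come from read_html() on a URL.
page <- read_html(
  '<html><body>
     <div class="infobox"><a href="https://example.com">this is a test</a></div>
     <p id="firstHeading">Title</p>
   </body></html>'
)

# Write the serialized HTML to a scratch file (a temp file stands in for tmp.txt).
tmp <- tempfile(fileext = ".txt")
write_lines(as.character(page), tmp)

# Read the file back and pull out every class="..." or id="..." marker.
src <- read_lines(tmp)
markers <- unlist(regmatches(src, gregexpr('(class|id)="[^"]+"', src)))
print(markers)
```

The markers found this way can then be turned into CSS selectors for rvest.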

2 Misc

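These notes are based on the rvest package. Wickham and Keyes (2017, Chapter 4.2) explain the main structure of HTML; what is typically scraped is the text, the attributes, and the tag¹ names. For more detail, see the explanation by louwill and 布丁 (2018).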

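From a data science perspective, extracting information from a website by scraping mainly comes down to two things:

class, identified with html_node(s)
attr, extracted with html_attr(s)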
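A small sketch of the singular and plural forms of these functions is given here. The HTML fragment, the class name ref, and the URLs are invented for illustration and do not come from the original notes.

```r
library(rvest)

# Hypothetical fragment with two links sharing a class and one without it.
page <- read_html(
  '<div>
     <a class="ref" href="https://example.com/a" title="first">A</a>
     <a class="ref" href="https://example.com/b">B</a>
     <a href="https://example.com/c">C</a>
   </div>'
)

first_ref <- html_node(page, ".ref")    # first element with class "ref"
all_refs  <- html_nodes(page, ".ref")   # every element with class "ref"

one_href  <- html_attr(first_ref, "href")  # one attribute of one node
all_hrefs <- html_attr(all_refs, "href")   # the same attribute across all nodes
all_attrs <- html_attrs(first_ref)         # every attribute of the first node

print(all_hrefs)
print(names(all_attrs))
```

In recent rvest releases html_element() and html_elements() are the preferred names for html_node() and html_nodes(); the older names used in these notes are still available.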

I have posted related examples on Stack Overflow.

For example,

a is the tag,
href is the attribute,
"this is a test" is the text.

Each of these is extracted with the corresponding function below.

html_text(x = ___) - get text contents
html_attr(x = ___, name = ___) - get specific attribute
html_name(x = ___) - get tag name
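The three functions can be tried on a single link element. This is a sketch only: the href value below is a placeholder assumption, since the original example's URL is not shown in these notes.

```r
library(rvest)

# A link element; the href value is a placeholder, not the original one.
doc  <- read_html('<a href="https://example.com">this is a test</a>')
link <- html_node(doc, "a")

txt  <- html_text(x = link)                 # the text content
href <- html_attr(x = link, name = "href")  # the value of the href attribute
tag  <- html_name(x = link)                 # the tag name

print(c(text = txt, href = href, tag = tag))
```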

Tags are usually located with a CSS selector, written in CSS selector syntax.

Use the CSS selector ".infobox" to select all elements that have the attribute class = "infobox".

Use the CSS selector "#firstHeading" to select all elements that have the attribute id = "firstHeading".

Use the CSS selector "a" to select all elements with the tag name a. (Wickham and Keyes 2017, Chapter 4.2)
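The three selector forms quoted above can be compared side by side on a toy page. The markup here is an assumption made up for the example; only the selectors ".infobox", "#firstHeading", and "a" come from the text.

```r
library(rvest)

# Hypothetical page containing one id, two elements sharing a class, and one link.
page <- read_html(
  '<html><body>
     <h1 id="firstHeading">Title</h1>
     <table class="infobox"><tr><td><a href="https://example.com">link</a></td></tr></table>
     <div class="infobox">second box</div>
   </body></html>'
)

boxes   <- html_nodes(page, ".infobox")       # by class
heading <- html_nodes(page, "#firstHeading")  # by id
links   <- html_nodes(page, "a")              # by tag name

print(html_name(boxes))
print(html_text(heading))
print(html_attr(links, "href"))
```

A class selector can match several elements of different tag types, whereas an id is expected to be unique within a page.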

Appendix

References

louwill, and 布丁. 2018. “R语千寻: HTML基础与R语言解析.” 狗熊会. 2018. https://mp.weixin.qq.com/s/Zo9SeBtY4n7LJnzABdBUJA.

Wickham, Charlotte, and Oliver Keyes. 2017. “Working with Web Data in R.” DataCamp. 2017. https://www.datacamp.com/courses/working-with-web-data-in-r.

¹ HTML can be thought of as a tag-based markup language.

rvest Cookbook
