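These notes are stitched together from separate files using the child chunk option of RMarkdown.
Notes combined this way are easier to review.
Please report problems as an Issue: http://127.0.0.1:43572/rmd_output/0/#number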

1 Chapter 1

source(here::here("code/load.R"))

See Wickham (2019)

1.1 Quotes

The backslash \ starts an escape sequence.
To type a literal backslash, write \\.

"\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table." %>% cat

## "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.

'"There\'s plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.' %>% cat

## "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
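The difference between how a string is stored and how it is printed can be sketched in base R. The string `x` below is a made-up example, not taken from the course.

```r
# Hypothetical example string: escaped double quotes plus an escaped backslash
x <- "She said \"hi\" and typed C:\\temp"

print(x)       # shows the escaped form, wrapped in quotes
writeLines(x)  # shows the actual characters: She said "hi" and typed C:\temp

# Each escape sequence is stored as a single character
nchar("\"")  # 1
nchar("\\")  # 1
```

writeLines() (or cat()) is therefore the safer way to check what a string really contains.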

1.2 Emoji

cat("hello\n\U1F30D")

## hello
## <U+0001F30D>

\U followed by up to 8 hex digits denotes a particular Unicode code point.

Hex digits are written in base 16, using the digits 0-9 and the letters A-F. Code charts: http://www.unicode.org/charts/
When the code point needs at most four hex digits, the alternative escape sequence \u can be used.
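To see the link between an escape sequence and its code point, here is a minimal base-R sketch (not part of the course material) using utf8ToInt() and intToUtf8().

```r
# utf8ToInt() returns the code point of a single character
earth <- "\U1F30D"
code_point <- utf8ToInt(earth)
code_point == 0x1F30D      # TRUE
sprintf("%X", code_point)  # "1F30D"

# For code points of at most four hex digits, \u and \U are interchangeable
"\u00e8" == "\U000000e8"   # TRUE

# intToUtf8() goes the other way, from code point to character
intToUtf8(0x00e8) == "\u00e8"  # TRUE
```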

1.3 number

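The fixed and scientific representations feel rather complicated.

Significant digits: use format(..., digits = n). All elements share one format: enough decimal places are used that the element smallest in magnitude shows at least n significant digits, and every element is padded to a common width (see the examples).
1. The common width makes it easy to display data in a column; when the padding is unwanted, for instance when the text is centered, trim = TRUE removes it.
To turn off scientific notation, use format(..., scientific = FALSE).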

## Some vectors of numbers
percent_change  <- c(4, -1.91, 3.00, -5.002)
income <-  c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

## Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)

## [1] "0.001" "0.011" "1.000"

## Format c(1.0011, 2.011, 1) with digits = 1
format(c(1.0011, 2.011, 1), digits = 1)

## [1] "1" "2" "1"

## Format percent_change to one place after the decimal point
format(percent_change, digits = 2)

## [1] " 4.0" "-1.9" " 3.0" "-5.0"

## Format income to whole numbers
format(income, digits = 2)

## [1] "     72" "   1030" "  10292" "1189192"

## stretch the small number

## Format p_values in fixed format
format(p_values, scientific = FALSE)

## [1] "0.12000000000" "0.98000000000" "0.00001910000" "0.00000000002"

Thousands grouping: big.mark = ',' sets the separator and big.interval = 3 sets how many digits go in each group.

formatted_income <- format(income, digits = 2)

## Print formatted_income
formatted_income

## [1] "     72" "   1030" "  10292" "1189192"

## Call writeLines() on the formatted income
writeLines(formatted_income)

## Define trimmed_income
trimmed_income <- format(income, digits =2, trim = TRUE)

## Call writeLines() on the trimmed_income
writeLines(trimmed_income)

## 72
## 1030
## 10292
## 1189192

## Define pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",",big.interval = 4)
## Chinese convention groups digits in fours

## Call writeLines() on the pretty_income
writeLines(pretty_income)

##       72
##     1030
##   1,0292
## 118,9192

1.4 formatC

based on C style syntax

"f" for fixed, digits is the number of digits after the decimal point.
"e" for scientific, and
"g" for fixed unless scientific saves space
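flag = "+" always prints the sign, for positive as well as negative numbers
format = "g", digits = 2 keeps two significant digits and picks fixed or scientific notation, whichever is shorter
flag = "0" pads with leading zeros, but only when width is also set; this is easy to miss, so str_pad() is a clearer alternative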

## From the format() exercise
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)

## formatC() on x with format = "f", digits = 1
formatC(x, format = "f", digits = 1)

## [1] "0.0" "0.0" "1.0"

format(x, digits = 1)

## [1] "0.001" "0.011" "1.000"

## formatC() on y with format = "f", digits = 1
formatC(y, format = "f", digits = 1)

## [1] "1.0" "2.0" "1.0"

format(y, format = "f", digits = 1)

## [1] "1" "2" "1"

## Format percent_change to one place after the decimal point
formatC(percent_change, format = "f", digits = 1)

## [1] "4.0"  "-1.9" "3.0"  "-5.0"

## percent_change with flag = "+"
formatC(percent_change, format = "f", digits = 1, flag = "+")

## [1] "+4.0" "-1.9" "+3.0" "-5.0"

## Format p_values using format = "g" and digits = 2
formatC(p_values, format = "g", digits = 2)

## [1] "0.12"    "0.98"    "1.9e-05" "2e-11"

formatC(income,digits = 0,format = 'f',flag = '0')

## [1] "72"      "1030"    "10292"   "1189192"
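Because formatC() follows C's printf conventions, the same formats can be sketched with base R's sprintf(). This comparison is an addition to the notes, not something the course shows.

```r
percent_change <- c(4, -1.91, 3.00, -5.002)

# "f" format with one digit after the decimal point
sprintf("%.1f", percent_change)   # "4.0" "-1.9" "3.0" "-5.0"

# The "+" flag always prints the sign
sprintf("%+.1f", percent_change)  # "+4.0" "-1.9" "+3.0" "-5.0"

# The "0" flag only pads when a width is given (here 7)
sprintf("%07.0f", 72.19)          # "0000072"
```

This also explains the last result above: flag = '0' without width has nothing to pad.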

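Accounting-style display can be built with format() and paste().
The justify argument aligns text to the left, to the right or in the centre; justify = "right" is used here.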

## Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")

## Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ',')

## Create dollar_income
dollar_income <- paste('$', pretty_income, sep = '')

## Create formatted_names
formatted_names <- format(income_names, justify = "right")

## Create rows
rows <- paste(formatted_names, dollar_income, sep = '   ')

## Write rows
writeLines(rows)

##           Year 0   $       72
##           Year 1   $    1,030
##           Year 2   $   10,292
## Project Lifetime   $1,189,192

income %>% 
    round %>% 
    as.character %>% 
    str_pad(string = .,width = max(str_count(.)),pad = '0')

## [1] "0000072" "0001030" "0010292" "1189192"

formatC(income, width = 7, format = "d", flag = "0")

## [1] "0000072" "0001030" "0010291" "1189192"

formatC() is a bit simpler (Richie Cotton).
Zero padding is handy for naming files.

0:999 %>% 
    as.character %>% 
    str_pad(string = .,width = max(str_count(.)),pad = '0') %>% 
    head

## [1] "000" "001" "002" "003" "004" "005"
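As a sketch of the file-naming use case, zero-padded ids keep alphabetical order equal to numeric order. The ids and the chunk_ prefix are hypothetical.

```r
# Hypothetical file ids
ids <- c(1, 2, 10, 100)

# Pad to a common width of 3 with leading zeros
padded <- formatC(ids, width = 3, format = "d", flag = "0")
file_names <- paste0("chunk_", padded, ".csv")
file_names  # "chunk_001.csv" "chunk_002.csv" "chunk_010.csv" "chunk_100.csv"

# Without padding, sorting as text puts 10 and 100 before 2
sort(paste0("chunk_", ids, ".csv"))
```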

Bookmark: https://campus.datacamp.com/courses/string-manipulation-in-r-with-stringr/pattern-matching-with-regular-expressions?ex=3

2 Chapter 2

source(here::here("code/load.R"))

See Wickham (2019)

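2.1 str_c handles NA better

When str_c() meets an NA while combining or collapsing, it propagates the NA instead of turning it into the string "NA".
In this respect it is better than paste() and paste0().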

library(stringr)
library(tidyverse)

my_toppings <- c("cheese", NA, NA)
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")

## Print my_toppings_and
my_toppings_and

## [1] "cheese" "NA"     "and NA"

## Use str_c() instead of paste(): my_toppings_str
my_toppings_str <- str_c(c("", "", "and "), my_toppings)

## Print my_toppings_str
my_toppings_str

## [1] "cheese" NA       NA

## paste() my_toppings_and with collapse = ", "
paste(my_toppings_and,collapse = ", ")

## [1] "cheese, NA, and NA"

## str_c() my_toppings_str with collapse = ", "
str_c(my_toppings_str,collapse = ", ")

## [1] NA

str_c(my_toppings_str,collapse = ", ") %>% str_replace_na('There is no value.')

## [1] "There is no value."

2.2 str_length

str_length() also works on factors.

str_length(factor('haha'))

## [1] 4
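For comparison, a small base-R sketch (an addition to the notes): nchar() raises an error when given a factor, so the factor has to be converted first.

```r
f <- factor("haha")

# Base R's nchar() refuses factors...
result <- try(nchar(f), silent = TRUE)
inherits(result, "try-error")  # TRUE

# ...so convert to character first
nchar(as.character(f))         # 4
```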

2.3 Characteristics of English names

library(babynames)
data(babynames)

babynames %>% dim

## [1] 1924665       5

babynames_sub <- 
    babynames %>% 
    filter(year >= 1990)

Note: this step raised an error saying that year could not be found.

babynames_sub %>% 
    transmute(
        first_letter = str_sub(name,1,1)
        ,last_letter = str_sub(name,-1,-1)
    ) %>% 
    {
        list(
            first_letter = table(.$first_letter)
            ,last_letter = table(.$last_letter)
        )
            
    }

## $first_letter
## 
##      A      B      C      D      E      F      G      H      I      J 
## 101494  32632  50036  57301  31795   9904  17624  18412  13880  77706 
##      K      L      M      N      O      P      Q      R      S      T 
##  77323  41851  68839  30200   8171  12109   3835  39261  63938  50452 
##      U      V      W      X      Y      Z 
##   1774   8647   6119   2276  13123  16358 
## 
## $last_letter
## 
##      a      b      c      d      e      f      g      h      i      j 
## 213857   2946   3121  13632 117122   1809   2992  51317  51723   1433 
##      k      l      m      n      o      p      q      r      s      t 
##   9203  38067  11083 170144  23668   1075    612  29900  31689  11871 
##      u      v      w      x      y      z 
##   3417   1477   1088   2219  55225   4370

So names that start with A or end with a are quite common among babies born since 1990.

2.4 str_count

babynames_sub %>% 
    mutate(contain_n_a = str_count(name,'A|a')) %>% 
    summarise(
        mean(contain_n_a)
        ,sum(contain_n_a>0)
        ,sum(contain_n_a>1)
        ,sum(contain_n_a>2)
        ,sum(contain_n_a>3)
        ,sum(contain_n_a>4)
        ,sum(contain_n_a>5)
    ) %>% 
    gather

The letter a turns out to be used quite often.
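The counting step can be sketched in base R on a few made-up names (not the babynames data): gregexpr() finds every match and lengths() counts them per string.

```r
# Made-up sample names
name <- c("Anna", "Barbara", "Bob")

# Find every upper- or lower-case a, then count matches per name
matches <- gregexpr("[Aa]", name)
n_a <- lengths(regmatches(name, matches))
n_a            # 2 3 0

# Share of names containing at least one a
mean(n_a > 0)
```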

2.5 str_split

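simplify = TRUE returns a character matrix instead of a list. Every element has the same type, so a matrix is a natural fit and is more efficient to compute with.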

both_names <- c("Box, George", "Cox, David")

## Split both_names into first_names and last_names
both_names_split <- str_split(both_names,", ",n=2,simplify = T)
both_names_split

##      [,1]  [,2]    
## [1,] "Box" "George"
## [2,] "Cox" "David"

## Get first names
both_names_split[,2]

## [1] "George" "David"

## Get last names
both_names_split[,1]

## [1] "Box" "Cox"

2.6 lapply

lines <- 
    c(
        "The table was a large one, but the three were all crowded together at one corner of it:"
        ,"\"No room! No room!\" they cried out when they saw Alice coming."
        ,"\"There’s plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."
    )

## Split lines into words
words <- str_split(lines,' ')

## Number of words per line
lapply(words,length)

## [[1]]
## [1] 18
## 
## [[2]]
## [1] 12
## 
## [[3]]
## [1] 21

## Number of characters in each word
word_lengths <- lapply(words,str_length)
  
## Average word length per line
lapply(word_lengths,mean)

## [[1]]
## [1] 3.888889
## 
## [[2]]
## [1] 4.25
## 
## [[3]]
## [1] 4.380952
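When each result is a single number, vapply() is a possible alternative to lapply() because it returns a plain vector. The sketch below uses base R's strsplit() and two shortened lines as assumed inputs.

```r
lines <- c(
  "The table was a large one",
  "No room! No room!"
)
words <- strsplit(lines, " ", fixed = TRUE)

# vapply() returns a vector and checks that each result is one integer
n_words <- vapply(words, length, integer(1))
n_words  # 6 4

# Average word length per line, again as a numeric vector
avg_len <- vapply(words, function(w) mean(nchar(w)), numeric(1))
avg_len
```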

2.7 Task: extracting the first-name initial

## Define some full names
names <- c("Diana Prince", "Clark Kent")

## Split into first and last names
names_split <- str_split(names, fixed(" "), simplify = TRUE)

## Extract the first letter in the first name
abb_first <- str_sub(names_split[, 1], 1, 1)

## Combine the first letter ". " and last name
str_c(abb_first, ". ", names_split[, 2])

## [1] "D. Prince" "C. Kent"

3 Chapter 3

source(here::here("code/load.R"))

gr(?:e|a)y

See Wickham (2019)

See GitHub Issue 152; this works now.

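3.1 The rebus package

Makes complex regular expressions easier to manage, because a pattern can be split across indented lines.
Shared sub-patterns can be stored in variables and reused.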
Alternation or()
1. or('aaa','bbb') -> (?:aaa|bbb)
2. "gr" %R% or("e", "a") %R% "y" -> gr(?:e|a)y
Character classes char_class()
1. char_class("ae") -> [ae]
2. negated_char_class("ae") -> [^ae]
Repetition: optional(), one_or_more(), zero_or_more(), repeated(x, lo, hi); these wrap the pattern they repeat instead of being chained with %R%
exactly(...) = START %R% ... %R% END
Shortcuts
1. DGT = \d
2. SPC = \s
3. WRD = \w (roughly [a-zA-Z0-9_])
Special characters
1. BACKSLASH = \
2. CARET = ^
3. DOLLAR = $
4. DOT = .
5. PIPE = |
6. QUESTION = ?
7. STAR = *
8. PLUS = +
9. OPEN_PAREN = (
10. CLOSE_PAREN = )
11. OPEN_BRACKET = [
12. CLOSE_BRACKET = ]
13. OPEN_BRACE = {
Date-time regexes
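The raw regular expressions that these helpers stand for can be tried directly in base R. This is a sketch with made-up test strings, not rebus code.

```r
colours <- c("grey", "gray", "groy")

# or("e", "a") corresponds to the non-capturing group (?:e|a)
grepl("gr(?:e|a)y", colours, perl = TRUE)       # TRUE TRUE FALSE

# char_class("ae") corresponds to [ae]; negated_char_class("ae") to [^ae]
grepl("gr[ae]y", colours)                       # TRUE TRUE FALSE
grepl("gr[^ae]y", colours)                      # FALSE FALSE TRUE

# repeated(DGT, 3) corresponds to \d{3}
grepl("^\\d{3}$", c("555", "55"), perl = TRUE)  # TRUE FALSE
```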

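match = TRUE in str_view() displays only the elements that had a match, which is useful when you are searching over many strings. str_view() highlights only the first match in each string; use str_view_all() to highlight every match. Use both to check that a pattern behaves as intended.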

library(tidyverse)
library(rebus)  # provides %R%, DGT, optional(), repeated(), char_class()
contact <- 
c('Call me at 555-555-0191','123 Main St','(555) 555 0191','Phone: 555.555.0191 Mobile: 555.555.0192')

optional('(') %R%
    repeated(DGT,3) %R%
    optional(')') %R%
    char_class("-.() ") %R%
    repeated(DGT,3) %R%
    char_class("-.() ") %R%
    repeated(DGT,4) %>% 
    str_extract_all(contact,.)

## [[1]]
## [1] "555-555-0191"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "(555) 555 0191"
## 
## [[4]]
## [1] "555.555.0191" "555.555.0192"

Build the pattern in pieces so that it is easier to manage.

narratives <- 
c("19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS ","31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI ","ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED ","TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*","10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB ","53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION ","13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION","14YR M PLAYING FOOTBALL; DX KNEE SPRAIN ","55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE ","5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN")

## Use these patterns
age <- dgt(1,2)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
gender <- optional(SPC) %R% or("M", "F")

## Extract age, unit, gender
age_gender <- str_extract(narratives,age %R% unit %R% gender)
age_gender

##  [1] "19YOM"   "31 YOF"  "82 YOM"  "33 YOF"  "10YOM"   "53 YO F" "13 MOF" 
##  [8] "14YR M"  "55YOM"   "5 YOM"

## The same patterns can then be reused to extract variables

age_gender %>% 
    tibble() %>% 
    set_names('raw') %>% 
    mutate(
        gender = str_extract(raw,gender)
    ) %>% 
    mutate(
        is_year = str_extract(raw,unit) %>% str_trim %>% str_sub(1,1)
        ,age_num = str_extract(raw,age)
        ,age = 
            # if_else() is vectorized; a plain if () would only test one value
            if_else(
                is_year == 'Y'
                ,as.numeric(age_num)
                ,as.numeric(age_num)/12
            )
    ) %>% 
    select(raw,gender,age)

ls.str() can also be used to inspect the structure of each object.

4 Chapter 4

source(here::here("code/load.R"))

See Wickham (2019)

4.1 capture + str_match

capture() is used together with str_match().

hero_contacts <- 
    c("(wolverine@xmen.com)","wonderwoman@justiceleague.org","thor@avengers.com")
capture(one_or_more(WRD)) %R% 
    "@" %R% 
    capture(one_or_more(WRD)) %R% 
    DOT %R% 
    capture(one_or_more(WRD)) %>% 
    str_match(hero_contacts,.)

##      [,1]                            [,2]          [,3]            [,4] 
## [1,] "wolverine@xmen.com"            "wolverine"   "xmen"          "com"
## [2,] "wonderwoman@justiceleague.org" "wonderwoman" "justiceleague" "org"
## [3,] "thor@avengers.com"             "thor"        "avengers"      "com"

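The same capture-group behavior is available in Impala's regexp_extract function.

Email addresses are hard to match fully with a regular expression, though; see the discussion.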
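The capture-group idea can also be sketched without rebus or stringr, using base R's regexec() and regmatches(). The pattern is a deliberately simple assumption and is not a complete email matcher.

```r
hero_contacts <- c("(wolverine@xmen.com)", "thor@avengers.com")

# Three parenthesised groups: user, domain name, top-level domain
pattern <- "(\\w+)@(\\w+)\\.(\\w+)"
parts <- regmatches(hero_contacts, regexec(pattern, hero_contacts))

# Stack the per-string results into a matrix, one row per input;
# column 1 is the whole match, columns 2 to 4 are the captured groups
email_parts <- do.call(rbind, parts)
email_parts
```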

Another example

narratives <- 
c("19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS ","31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI ","ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED ","TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*","10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB ","53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION ","13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION","14YR M PLAYING FOOTBALL; DX KNEE SPRAIN ","55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE ","5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN")

## narratives has been pre-defined
narratives

##  [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS "                      
##  [2] "31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI "                                    
##  [3] "ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED "                                      
##  [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"            
##  [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                          
##  [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                      
##  [7] "13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
##  [8] "14YR M PLAYING FOOTBALL; DX KNEE SPRAIN "                                                  
##  [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                      
## [10] "5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"

## Add capture() to get age, unit and sex
pattern <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("YO", "YR", "MO")) %R%
  optional(SPC) %R% capture(or("M", "F"))

## Pull out from narratives
str_match(narratives,pattern)

##       [,1]      [,2] [,3] [,4]
##  [1,] "19YOM"   "19" "YO" "M" 
##  [2,] "31 YOF"  "31" "YO" "F" 
##  [3,] "82 YOM"  "82" "YO" "M" 
##  [4,] "33 YOF"  "33" "YO" "F" 
##  [5,] "10YOM"   "10" "YO" "M" 
##  [6,] "53 YO F" "53" "YO" "F" 
##  [7,] "13 MOF"  "13" "MO" "F" 
##  [8,] "14YR M"  "14" "YR" "M" 
##  [9,] "55YOM"   "55" "YO" "M" 
## [10,] "5 YOM"   "5"  "YO" "M"

A refinement: capture only the first letter of the unit (Y for years, M for months).

## Edit to capture just Y and M in units
pattern2 <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O","R")) %R%
  optional(SPC) %R% capture(or("M", "F"))

## Check pattern
str_view(narratives, pattern2)

## Pull out pieces
str_match(narratives, pattern2)

##       [,1]      [,2] [,3] [,4]
##  [1,] "19YOM"   "19" "Y"  "M" 
##  [2,] "31 YOF"  "31" "Y"  "F" 
##  [3,] "82 YOM"  "82" "Y"  "M" 
##  [4,] "33 YOF"  "33" "Y"  "F" 
##  [5,] "10YOM"   "10" "Y"  "M" 
##  [6,] "53 YO F" "53" "Y"  "F" 
##  [7,] "13 MOF"  "13" "M"  "F" 
##  [8,] "14YR M"  "14" "Y"  "M" 
##  [9,] "55YOM"   "55" "Y"  "M" 
## [10,] "5 YOM"   "5"  "Y"  "M"

4.2 backreferences

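Handy for removing repeated words.
Backreferences can match characters that repeat in the same order or in reverse order.
1. This is enough to answer quite a few Stack Overflow questions.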

str_subset(c("hello", "sweet", "kitten"), 
  pattern = capture(LOWER) %R% REF1)

## [1] "hello"  "sweet"  "kitten"

This is like LOWER %R% LOWER, except that the second letter must be identical to the first.

abab

## Names with a pair of repeated letters
pair_of_repeated <- capture((LOWER %R% LOWER)) %R% REF1

## Test it
str_view('abab', pattern = pair_of_repeated, match = TRUE)

abba

## Names with a pair that reverses
pair_that_reverses <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1

## Test it
str_view('abba', pattern = pair_that_reverses, match = TRUE)

Exact match

## Four letter palindrome names
four_letter_palindrome <- exactly(
  capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
)

## Test it
str_view('abba', pattern = four_letter_palindrome, match = TRUE)
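The three backreference patterns above correspond to the raw regular expressions below. This is a base-R sketch with made-up test strings.

```r
x <- c("abab", "abba", "abcd")

# capture(LOWER %R% LOWER) %R% REF1: a pair of letters repeated in order
grepl("([a-z][a-z])\\1", x)       # TRUE FALSE FALSE

# capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1: a pair that reverses
grepl("([a-z])([a-z])\\2\\1", x)  # FALSE TRUE FALSE

# exactly(...) adds the anchors ^ and $
grepl("^([a-z])([a-z])\\2\\1$", c("abba", "xabba"))  # TRUE FALSE
```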

## View text containing phone numbers
contact <- 
c('Call me at 555-555-0191','123 Main St','(555) 555 0191','Phone: 555.555.0191 Mobile: 555.555.0192')


## Replace digits with "X"
str_replace(contact, DGT, "X")

## [1] "Call me at X55-555-0191"                 
## [2] "X23 Main St"                             
## [3] "(X55) 555 0191"                          
## [4] "Phone: X55.555.0191 Mobile: 555.555.0192"

## Replace all digits with "X"
str_replace_all(contact, DGT, "X")

## [1] "Call me at XXX-XXX-XXXX"                 
## [2] "XXX Main St"                             
## [3] "(XXX) XXX XXXX"                          
## [4] "Phone: XXX.XXX.XXXX Mobile: XXX.XXX.XXXX"

## Replace all digits with different symbol
str_replace_all(contact, DGT, c("X", ".", "*", "_"))

## [1] "Call me at XXX-XXX-XXXX"                 
## [2] "... Main St"                             
## [3] "(***) *** ****"                          
## [4] "Phone: ___.___.____ Mobile: ___.___.____"

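This works because str_replace_all() is vectorized over the replacement: each string gets the corresponding replacement.

Removing a repeated word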

str_replace_all(
    'I am the the aquaman'
    ,SPC %R% 
        capture(one_or_more(WRD)) %R% 
        SPC %R% 
        REF1
    ,' ' %R% REF1)

## [1] "I am the aquaman"
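The same replacement can be sketched with base R's gsub(). The second call adds a word boundary \b, which is an assumption beyond the course pattern; it avoids treating "the theory" as a repeated word.

```r
text <- "I am the the aquaman"

# \\1 in the pattern must repeat the captured word;
# \\1 in the replacement reuses it
gsub("\\s(\\w+)\\s\\1", " \\1", text, perl = TRUE)
# "I am the aquaman"

# Without a word boundary, a word that merely starts with the
# previous word is damaged; \\b prevents that
gsub("\\s(\\w+)\\s\\1", " \\1", "see the theory", perl = TRUE)     # "see theory"
gsub("\\s(\\w+)\\s\\1\\b", " \\1", "see the theory", perl = TRUE)  # "see the theory"
```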

The same idea works in RStudio's find-and-replace with regex enabled: each match can be replaced using its own captured value.

SPC %R% 
    capture(one_or_more(WRD)) %R% 
    SPC %R% 
    REF1

## <regex> \s([\w]+)\s\1

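REF1 stands for the text matched by the first capture, so the pattern has the same shape as the one below, except that the second word must be identical to the first: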

SPC %R% 
    capture(one_or_more(WRD)) %R% 
    SPC %R% 
    capture(one_or_more(WRD))

## <regex> \s([\w]+)\s([\w]+)

adverbs <- 
c("ABNORMALLY","ABSENTMINDEDLY","ACCIDENTALLY","ACIDLY","ACTUALLY","ADVENTUROUSLY","AFTERWARDS","ALMOST","ALWAYS","ANGRILY","ANNUALLY","ANXIOUSLY","ARROGANTLY","AWKWARDLY","BADLY","BASHFULLY","BEAUTIFULLY","BITTERLY","BLEAKLY","BLINDLY","BLISSFULLY","BOASTFULLY","BOLDLY","BRAVELY","BRIEFLY","BRIGHTLY","BRISKLY","BROADLY","BUSILY","CALMLY","CAREFULLY","CARELESSLY","CAUTIOUSLY","CERTAINLY","CHEERFULLY","CLEARLY","CLEVERLY","CLOSELY","COAXINGLY","COLORFULLY","COMMONLY","CONTINUALLY","COOLLY","CORRECTLY","COURAGEOUSLY","CROSSLY","CRUELLY","CURIOUSLY","DAILY","DAINTILY","DEARLY","DECEIVINGLY","DEEPLY","DEFIANTLY","DELIBERATELY","DELIGHTFULLY","DILIGENTLY","DIMLY","DOUBTFULLY","DREAMILY","EASILY","ELEGANTLY","ENERGETICALLY","ENORMOUSLY","ENTHUSIASTICALLY","EQUALLY","ESPECIALLY","EVEN","EVENLY","EVENTUALLY","EXACTLY","EXCITEDLY","EXTREMELY","FAIRLY","FAITHFULLY","FAMOUSLY","FAR","FAST","FATALLY","FEROCIOUSLY","FERVENTLY","FIERCELY","FONDLY","FOOLISHLY","FORTUNATELY","FRANKLY","FRANTICALLY","FREELY","FRENETICALLY","FRIGHTFULLY","FULLY","FURIOUSLY","GENERALLY","GENEROUSLY","GENTLY","GLADLY","GLEEFULLY","GRACEFULLY","GRATEFULLY","GREATLY","GREEDILY","HAPPILY","HASTILY","HEALTHILY","HEAVILY","HELPFULLY","HELPLESSLY","HIGHLY","HONESTLY","HOPELESSLY","HOURLY","HUNGRILY","IMMEDIATELY","INNOCENTLY","INQUISITIVELY","INSTANTLY","INTENSELY","INTENTLY","INTERESTINGLY","INWARDLY","IRRITABLY","JAGGEDLY","JEALOUSLY","JOSHINGLY","JOVIALLY","JOYFULLY","JOYOUSLY","JUBILANTLY","JUDGEMENTALLY","JUSTLY","KEENLY","KIDDINGLY","KINDHEARTEDLY","KINDLY","KISSINGLY","KNAVISHLY","KNOTTILY","KNOWINGLY","KNOWLEDGEABLY","KOOKILY","LAZILY","LESS","LIGHTLY","LIKELY","LIMPLY","LIVELY","LOFTILY","LONGINGLY","LOOSELY","LOUDLY","LOVINGLY","LOYALLY","MADLY","MAJESTICALLY","MEANINGFULLY","MECHANICALLY","MERRILY","MISERABLY","MOCKINGLY","MONTHLY","MORE","MORTALLY","MOSTLY","MYSTERIOUSLY","NATURALLY","NEARLY","NEATLY","NEEDILY","NERVOUSLY","NEVER","NICELY","NOISILY","NOT","OBEDIENTLY","OBNOXIOUSLY","ODDLY","OFFEN
SIVELY","OFFICIALLY","OFTEN","ONLY","OPENLY","OPTIMISTICALLY","OVERCONFIDENTLY","OWLISHLY","PAINFULLY","PARTIALLY","PATIENTLY","PERFECTLY","PHYSICALLY","PLAYFULLY","POLITELY","POORLY","POSITIVELY","POTENTIALLY","POWERFULLY","PROMPTLY","PROPERLY","PUNCTUALLY","QUAINTLY","QUARRELSOMELY","QUEASILY","QUEERLY","QUESTIONABLY","QUESTIONINGLY","QUICKER","QUICKLY","QUIETLY","QUIRKILY","QUIZZICALLY","RAPIDLY","RARELY","READILY","REALLY","REASSURINGLY","RECKLESSLY","REGULARLY","RELUCTANTLY","REPEATEDLY","REPROACHFULLY","RESTFULLY","RIGHTEOUSLY","RIGHTFULLY","RIGIDLY","ROUGHLY","RUDELY","SADLY","SAFELY","SCARCELY","SCARILY","SEARCHINGLY","SEDATELY","SEEMINGLY","SELDOM","SELFISHLY","SEPARATELY","SERIOUSLY","SHAKILY","SHARPLY","SHEEPISHLY","SHRILLY","SHYLY","SILENTLY","SLEEPILY","SLOWLY","SMOOTHLY","SOFTLY","SOLEMNLY","SOLIDLY","SOMETIMES","SOON","SPEEDILY","STEALTHILY","STERNLY","STRICTLY","SUCCESSFULLY","SUDDENLY","SURPRISINGLY","SUSPICIOUSLY","SWEETLY","SWIFTLY","SYMPATHETICALLY","TENDERLY","TENSELY","TERRIBLY","THANKFULLY","THOROUGHLY","THOUGHTFULLY","TIGHTLY","TOMORROW","TOO","TREMENDOUSLY","TRIUMPHANTLY","TRULY","TRUTHFULLY","ULTIMATELY","UNABASHEDLY","UNACCOUNTABLY","UNBEARABLY","UNETHICALLY","UNEXPECTEDLY","UNFORTUNATELY","UNIMPRESSIVELY","UNNATURALLY","UNNECESSARILY","UPBEAT","UPLIFTINGLY","UPRIGHT","UPSIDE-DOWN","UPWARD","UPWARDLY","URGENTLY","USEFULLY","USELESSLY","USUALLY","UTTERLY","VACANTLY","VAGUELY","VAINLY","VALIANTLY","VASTLY","VERBALLY","VERY","VICIOUSLY","VICTORIOUSLY","VIOLENTLY","VIVACIOUSLY","VOLUNTARILY","WARMLY","WEAKLY","WEARILY","WELL","WETLY","WHOLLY","WILDLY","WILLFULLY","WISELY","WOEFULLY","WONDERFULLY","WORRIEDLY","WRONGLY","YAWNINGLY","YEARLY","YEARNINGLY","YESTERDAY","YIELDINGLY","YOUTHFULLY")

Randomly add an adverb in front of each verb

## Build pattern to match words ending in "ING"
pattern <- one_or_more(WRD) %R% "ING"
str_view(narratives, pattern)

## Test replacement
str_replace(narratives, capture(pattern), 
  str_c("CARELESSLY", REF1, sep = " "))

##  [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE CARELESSLY PLAYING FOOTBALL W/ FRIENDS "                      
##  [2] "31 YOF FELL FROM TOILET HITITNG HEAD CARELESSLY SUSTAINING A CHI "                                    
##  [3] "ANKLE STR. 82 YOM STRAINED ANKLE CARELESSLY GETTING OUT OF BED "                                      
##  [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"                       
##  [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                                     
##  [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                                 
##  [7] "13 MOF CARELESSLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
##  [8] "14YR M CARELESSLY PLAYING FOOTBALL; DX KNEE SPRAIN "                                                  
##  [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                                 
## [10] "5 YOM CARELESSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"

## One adverb per narrative
adverbs_10 <- sample(adverbs, 10)

## Replace "***ING" with "adverb ***ING"
str_replace(narratives, 
  capture(pattern),
  str_c(adverbs_10, REF1, sep = " "))

##  [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE KINDHEARTEDLY PLAYING FOOTBALL W/ FRIENDS "               
##  [2] "31 YOF FELL FROM TOILET HITITNG HEAD UNIMPRESSIVELY SUSTAINING A CHI "                            
##  [3] "ANKLE STR. 82 YOM STRAINED ANKLE READILY GETTING OUT OF BED "                                     
##  [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"                   
##  [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                                 
##  [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                             
##  [7] "13 MOF LAZILY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
##  [8] "14YR M MONTHLY PLAYING FOOTBALL; DX KNEE SPRAIN "                                                 
##  [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                             
## [10] "5 YOM GRACEFULLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"

Replacement combined with backreferences can be really useful for reformatting text data.

Appendix

The Unicode section can go in the appendix.

Matching via \p{name} is probably not supported by Impala; see https://grokbase.com/t/cloudera/impala-user/137tdg0thq/unicode-string-issue-with-impala

Vietnamese

## Names with builtin accents
(tay_son_builtin <- c(
  "Nguy\u1ec5n Nh\u1ea1c", 
  "Nguy\u1ec5n Hu\u1ec7",
  "Nguy\u1ec5n Quang To\u1ea3n"
))

## [1] "Nguy<U+1EC5>n Nh<U+1EA1>c" "Nguy<U+1EC5>n Hu<U+1EC7>" "Nguy<U+1EC5>n Quang To<U+1EA3>n"

## Convert to separate accents
library(stringi)
tay_son_separate <- stri_trans_nfd(tay_son_builtin)

## Verify that the string prints the same
tay_son_separate

## [1] "Nguye<U+0302><U+0303>n Nha<U+0323>c" "Nguye<U+0302><U+0303>n Hue<U+0323><U+0302>" "Nguye<U+0302><U+0303>n Quang Toa<U+0309>n"

## Match all accents
library(stringr)
library(rebus)
str_view_all(tay_son_separate, UP_DIACRITIC)

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")
writeLines(x)

## Adele
## Adèle
## Ade<U+0300>le

str_view(x, "Ad" %R% ANY_CHAR %R% "le")

str_view(x, "Ad" %R% '\\X' %R% "le")

str_view(x, "Ad" %R% GRAPHEME %R% "le")

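When a character is made of two Unicode code points (a base letter plus a combining accent), . matches only one code point and so fails here; \X (GRAPHEME) matches the whole grapheme.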
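Why a single-character wildcard fails can be seen by counting code points. A minimal base-R sketch (an addition to the notes):

```r
precomposed <- "Ad\u00e8le"   # è as one code point
decomposed  <- "Ade\u0300le"  # e followed by a combining grave accent

# Both print the same, but they differ in the number of code points
length(utf8ToInt(precomposed))  # 5
length(utf8ToInt(decomposed))   # 6

# So they are not equal as strings until they are normalised
precomposed == decomposed       # FALSE
```

Normalising both strings first, for example with stri_trans_nfd() as used earlier in this section, is one way to make such comparisons consistent.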

5 Chapter 5

source(here::here("code/load.R"))

See Wickham (2019)

library(stringi)
suppressMessages(library(tidyverse))
library(rebus)
earnest_file <- '../data/importance-of-being-earnest.txt'

The instructor's text processing is excellent and worth following.

She has indeed analysed novels before.

## Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)

## Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))

## Get rid of gutenberg intro text
earnest_sub  <- earnest[(start + 1):(end - 1)]

## Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))

## Set up index
intro_line_index <- 1:(lines_start - 1)

## Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]

## Take a look at the first 20 lines
writeLines(play_text[1:20])

## FIRST ACT
## 
## 
## SCENE
## 
## 
## Morning-room in Algernon's flat in Half-Moon Street.  The room is
## luxuriously and artistically furnished.  The sound of a piano is heard in
## the adjoining room.
## 
## [Lane is arranging afternoon tea on the table, and after the music has
## ceased, Algernon enters.]
## 
## Algernon.  Did you hear what I was playing, Lane?
## 
## Lane.  I didn't think it polite to listen, sir.
## 
## Algernon.  I'm sorry for that, for your sake.  I don't play
## accurately--any one can play accurately--but I play with wonderful
## expression.  As far as the piano is concerned, sentiment is my forte.  I

Why does Charlotte Wickham say stri_read_lines() is better than readLines()?

stringi provides functions for working on the text directly, without having to go through a data.frame.

## Get rid of empty strings
empty <- stri_isempty(play_text)
play_lines <- play_text[!empty]

play_lines[10:15]

## [1] "Algernon.  I'm sorry for that, for your sake.  I don't play"             
## [2] "accurately--any one can play accurately--but I play with wonderful"      
## [3] "expression.  As far as the piano is concerned, sentiment is my forte.  I"
## [4] "keep science for Life."                                                  
## [5] "Lane.  Yes, sir."                                                        
## [6] "Algernon.  And, speaking of the science of Life, have you got the"

## Pattern from last step
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

## Get subset of lines that match
lines <- str_subset(play_lines,pattern_2)

## Extract match from lines
who <- str_extract(lines,pattern_2)

## Let's see what we have
unique(who)

##  [1] "Algernon."   "Lane."       "Jack."       "Cecily."     "Ernest."    
##  [6] "University." "Gwendolen."  "July."       "Chasuble."   "Merriman."  
## [11] "Sunday."     "Mr."         "London."     "Cardew."     "Opera."     
## [16] "Markby."     "Oxonian."

To do: summarize why stri_read_lines() is better.

or takes multiple character vector inputs and returns a character vector of the inputs separated by pipes. %|% is an operator interface to this function. or1 takes a single character vector and returns a string collapsed by pipes.

## Variables from previous step
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble", 
  "Merriman", "Lady Bracknell", "Miss Prism")
pattern_3 <- START %R% or1(characters) %R% DOT

## Pull out matches
lines <- str_subset(play_lines, pattern_3)

## Extract match from lines
who <- str_extract(lines, pattern_3)

## Let's see what we have
unique(who)

## [1] "Algernon."       "Lane."           "Jack."           "Cecily."        
## [5] "Gwendolen."      "Lady Bracknell." "Miss Prism."     "Chasuble."      
## [9] "Merriman."

## Count lines per character
table(who)

## who
##       Algernon.         Cecily.       Chasuble.      Gwendolen. 
##             201             154              42             102 
##           Jack. Lady Bracknell.           Lane.       Merriman. 
##             219              84              21              17 
##     Miss Prism. 
##              41
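The alternation that or1() builds can be sketched in base R by collapsing the names with |. The last two sample lines below are invented for illustration; they are not quoted from the play.

```r
characters <- c("Algernon", "Lane", "Lady Bracknell")

# START %R% or1(characters) %R% DOT as a raw regex
pattern <- paste0("^(?:", paste(characters, collapse = "|"), ")\\.")

play_lines <- c(
  "Algernon.  Did you hear what I was playing, Lane?",
  "Lane.  I didn't think it polite to listen, sir.",
  "University.  An invented line that is not a speaker.",
  "Lady Bracknell.  An invented line with a two-word name."
)

# Keep only lines that start with a known character, then pull out the name
speaker_lines <- grep(pattern, play_lines, value = TRUE, perl = TRUE)
who <- regmatches(speaker_lines, regexpr(pattern, speaker_lines, perl = TRUE))
table(who)
```

Listing the names explicitly is what removes false positives such as "University." from the earlier result.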

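Lines such as Jack and Algernon [Speaking together.] are not handled yet; this still needs to be solved.

This processing is not robust enough. I need to understand the logic of every step in order to improve it.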

5.1 ignore case

library(readr)
catcidents <- read_rds('../data/catcidents.rds')

## catcidents has been pre-defined
head(catcidents)

## [1] "79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*"                                                               
## [2] "21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%"                         
## [3] "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION "                                                                            
## [4] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"
## [5] "42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA"                                                        
## [6] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"

## Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")

## See matches to word DOG
str_view(catcidents,whole_dog_pattern,match=T)

x <- c("Cat", "CAT", "cAt") 
str_subset(x, "cat")

## character(0)

str_subset(x, stringr::regex("cat", ignore_case = TRUE))

## [1] "Cat" "CAT" "cAt"
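Base R offers the same two controls through ignore.case and the \b word boundary. A sketch with made-up strings:

```r
x <- c("Cat", "CAT", "cAt", "concatenate")

# Case-insensitive match; note that it also hits "concatenate"
grepl("cat", x, ignore.case = TRUE)
# TRUE TRUE TRUE TRUE

# Word boundaries restrict the match to the whole word, like whole_word()
grepl("\\bcat\\b", x, ignore.case = TRUE, perl = TRUE)
# TRUE TRUE TRUE FALSE
```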

5.2 stri_trans_totitle()

library(stringi)

## Get first five catcidents
cat5 <- catcidents[1:5]

## Take a look at original
writeLines(cat5)

## 79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*
## 21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%
## 87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION 
## bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS
## 42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA

## Transform to title case
str_to_title(cat5)

## [1] "79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*"                                                               
## [2] "21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%"                         
## [3] "87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion "                                                                            
## [4] "Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs"
## [5] "42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica"

## Transform to title case with stringi
stri_trans_totitle(cat5)

## [1] "79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*"                                                               
## [2] "21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%"                         
## [3] "87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion "                                                                            
## [4] "Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs"
## [5] "42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica"

## Transform to sentence case with stringi
stri_trans_totitle(cat5,type = 'sentence')

## [1] "79yof fractured finger tripped over cat and fell to floor last night at home*"                                                               
## [2] "21 yof reports sus laceration of her left hand when she was opening a can of cat food just pta. Dx hand laceration%"                         
## [3] "87yof tripped over cat, hit leg on step. Dx lower leg contusion "                                                                            
## [4] "Blunt chest trauma, r/o rib fx, r/o cartilage inj to rib cage; 32yom walking dog, dog took off after cat,fell,struck chest on steps,hit ribs"
## [5] "42yof to er for back pain after putting down some cat litter dx: back pain, sciatica"

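To do: tidy up these notes; this course was genuinely useful.

The books recommended in the slides are worth a look.

5.3 Certificate of completion

See DataCamp

Appendix

References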

Wickham, Charlotte. 2019. “String Manipulation in R with Stringr.” DataCamp. 2019. https://www.datacamp.com/courses/string-manipulation-in-r-with-stringr.

Regular Expressions: DataCamp Notes
