Regular Expressions: DataCamp Notes
2020-01-31
- Use RMarkdown's `child` option to stitch the documents together.
- Notes stitched together this way are easier to review.
- Report related problems in an Issue: http://127.0.0.1:43572/rmd_output/0/#number
1 Chapter 1
See Wickham (2019).
1.1 Quotes
- `\` starts an escape sequence.
- To type a literal backslash, write `\\`.
## "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
## "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
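A minimal sketch of the two escape rules above, using a shortened version of the Alice line. The variable names are illustrative and are not taken from the course.

```r
# The same text written with double and with single quotes as delimiters.
# Inside double quotes, " is escaped as \"; inside single quotes, ' is escaped as \'.
line_double <- "\"There's plenty of room!\" said Alice."
line_single <- '"There\'s plenty of room!" said Alice.'

# Both literals describe the same string
identical(line_double, line_single)

# A literal backslash is typed as two backslashes but is a single character
one_backslash <- "\\"
nchar(one_backslash)

# writeLines() shows the string itself; print() would show the escaped form
writeLines(line_double)
writeLines(one_backslash)
```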
1.2 Emoji
## hello
## <U+0001F30D>
`\U` followed by up to 8 hex digits denotes a particular Unicode character.
- Hex digits are base 16, written with the digits 0-9 and A-F. See http://www.unicode.org/charts/
- When the code point fits in four hex digits, `\u` is an alternative escape sequence.
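A short sketch of both escapes. The globe code point is the one shown in the output above; the `è` example is an added illustration.

```r
# \U takes up to 8 hex digits; \u takes up to 4
globe <- "\U0001F30D"
e_grave_u <- "\u00e8"
e_grave_U <- "\U000000e8"

# Both escapes give the same character when the code point fits in 4 digits
identical(e_grave_u, e_grave_U)

# utf8ToInt() recovers the code point from the character
utf8ToInt(globe) == 0x1F30D

# How the character is displayed depends on the platform and locale
writeLines(c("hello", globe))
```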
1.3 number
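The fixed and scientific representations feel rather complicated.
- Significant digits: `format(..., digits = n)` uses enough decimal places that the smallest number (in magnitude) shows n significant digits, and every element gets that same number of decimal places (see the examples).
- The results are padded to a common width, which is convenient for showing data in a column; when the padding is not wanted, `trim = TRUE` removes it.
- To turn off scientific notation, use `format(..., scientific = FALSE)`.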
## Some vectors of numbers
percent_change <- c(4, -1.91, 3.00, -5.002)
income <- c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)
## Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)
## [1] "0.001" "0.011" "1.000"
## [1] "1" "2" "1"
## [1] " 4.0" "-1.9" " 3.0" "-5.0"
## [1] " 72" " 1030" " 10292" "1189192"
## [1] "0.12000000000" "0.98000000000" "0.00001910000" "0.00000000002"
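Several of the outputs above are printed without the calls that produced them. The sketch below shows calls that give the same results; that these are the exact calls used in the notes is an assumption.

```r
percent_change <- c(4, -1.91, 3.00, -5.002)
income <- c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

# digits = n: significant digits, with a shared number of decimal places
one_digit <- format(c(1.0011, 2.011, 1), digits = 1)
change_fmt <- format(percent_change, digits = 2)

# Elements are padded to a common width, which suits column output
income_fmt <- format(income, digits = 2)

# scientific = FALSE forces fixed notation
p_fixed <- format(p_values, scientific = FALSE)

one_digit
change_fmt
income_fmt
p_fixed
```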
- Digit grouping: `big.mark = ','` sets the separator and `big.interval = 3` sets how many digits go in each group.
## [1] " 72" " 1030" " 10292" "1189192"
## 72
## 1030
## 10292
## 1189192
## Define trimmed_income
trimmed_income <- format(income, digits =2, trim = TRUE)
## Call writeLines() on the trimmed_income
writeLines(trimmed_income)
## 72
## 1030
## 10292
## 1189192
## Define pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",", big.interval = 4)
## Chinese number grouping uses four-digit groups
## Call writeLines() on the pretty_income
writeLines(pretty_income)
## 72
## 1030
## 1,0292
## 118,9192
1.4 formatC
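Based on C-style syntax.
- `"f"` for fixed; `digits` is the number of digits after the decimal point.
- `"e"` for scientific.
- `"g"` for fixed unless scientific saves space.
- `flag = "+"` always shows the sign of the number.
- `format = "g", digits = 2` shows two significant digits in as few characters as possible.
- `flag = "0"` pads with leading zeros; if that reads unclearly, `str_pad` can be used instead.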
## From the format() exercise
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)
## formatC() on x with format = "f", digits = 1
formatC(x, format = "f", digits = 1)
## [1] "0.0" "0.0" "1.0"
## [1] "0.001" "0.011" "1.000"
## [1] "1.0" "2.0" "1.0"
## [1] "1" "2" "1"
## Format percent_change to one place after the decimal point
formatC(percent_change, format = "f", digits = 1)
## [1] "4.0" "-1.9" "3.0" "-5.0"
## [1] "+4.0" "-1.9" "+3.0" "-5.0"
## [1] "0.12" "0.98" "1.9e-05" "2e-11"
## [1] "72" "1030" "10292" "1189192"
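The signed and compact outputs above are shown without their calls. A sketch of `formatC()` calls that reproduce them, assuming the same input vectors as in the `format()` section:

```r
percent_change <- c(4, -1.91, 3.00, -5.002)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

# flag = "+" always prints the sign
with_sign <- formatC(percent_change, format = "f", digits = 1, flag = "+")

# format = "g": digits counts significant digits, and scientific notation
# is used only where it is shorter
compact_p <- formatC(p_values, format = "g", digits = 2)

with_sign
compact_p
```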
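- Accounting-style display: build it with `format` and `paste`.
- `justify` controls text alignment ("left", "right", or "centre"); `justify = "right"` right-aligns.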
## Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")
## Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ',')
## Create dollar_income
dollar_income <- paste('$', pretty_income, sep = '')
## Create formatted_names
formatted_names <- format(income_names, justify = "right")
## Create rows
rows <- paste(formatted_names, dollar_income, sep = ' ')
## Write rows
writeLines(rows)
## Year 0 $ 72
## Year 1 $ 1,030
## Year 2 $ 10,292
## Project Lifetime $1,189,192
## [1] "0000072" "0001030" "0010292" "1189192"
## [1] "0000072" "0001030" "0010291" "1189192"
`formatC` is somewhat simpler (Richie Cotton).
- It is handy for generating file names.
## [1] "000" "001" "002" "003" "004" "005"
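A sketch of the file-naming use. The `run_` prefix and `.csv` extension are hypothetical; only the zero-padded numbers match the output above.

```r
# Zero-padded sequence numbers keep files in order when sorted by name
ids <- formatC(0:5, width = 3, flag = "0")

# Hypothetical naming scheme built from the padded ids
file_names <- paste0("run_", ids, ".csv")
file_names
```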
2 Chapter 2
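See Wickham (2019).
2.1 str_c handles NA better
- When `str_c` meets an `NA` while combining or collapsing, the result stays `NA` instead of being turned into the text "NA".
- In this respect it is better than `paste` and `paste0`.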
library(stringr)
library(tidyverse)
my_toppings <- c("cheese", NA, NA)
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")
## Print my_toppings_and
my_toppings_and
## [1] "cheese" "NA" "and NA"
## Use str_c() instead of paste(): my_toppings_str
my_toppings_str <- str_c(c("", "", "and "), my_toppings)
## Print my_toppings_str
my_toppings_str
## [1] "cheese" NA NA
## [1] "cheese, NA, and NA"
## [1] NA
## [1] "There is no value."
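The collapsed outputs above are shown without their calls. The sketch below contrasts `paste()` and `str_c()` when collapsing; the "nothing" replacement text is an illustrative choice and not the one used in the notes.

```r
library(stringr)

my_toppings <- c("cheese", NA, NA)

# paste() turns NA into the text "NA"
pasted <- paste(my_toppings, collapse = ", ")

# str_c() propagates the missing value, so the whole result is NA
collapsed <- str_c(my_toppings, collapse = ", ")

# To print something meaningful, replace NA explicitly before collapsing
filled <- str_c(str_replace_na(my_toppings, replacement = "nothing"), collapse = ", ")

pasted
collapsed
filled
```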
2.3 Characteristics of English names
## [1] 1924665 5
An error is raised when `year` cannot be found.
babynames_sub %>%
transmute(
first_letter = str_sub(name,1,1)
,last_letter = str_sub(name,-1,-1)
) %>%
{
list(
first_letter = table(.$first_letter)
,last_letter = table(.$last_letter)
)
}
## $first_letter
##
## A B C D E F G H I J
## 101494 32632 50036 57301 31795 9904 17624 18412 13880 77706
## K L M N O P Q R S T
## 77323 41851 68839 30200 8171 12109 3835 39261 63938 50452
## U V W X Y Z
## 1774 8647 6119 2276 13123 16358
##
## $last_letter
##
## a b c d e f g h i j
## 213857 2946 3121 13632 117122 1809 2992 51317 51723 1433
## k l m n o p q r s t
## 9203 38067 11083 170144 23668 1075 612 29900 31689 11871
## u v w x y z
## 3417 1477 1088 2219 55225 4370
So names that begin with A, and names that end with a, are quite common among people born in the 1990s and 2000s.
2.4 str_count
babynames_sub %>%
mutate(contain_n_a = str_count(name,'A|a')) %>%
summarise(
mean(contain_n_a)
,sum(contain_n_a>0)
,sum(contain_n_a>1)
,sum(contain_n_a>2)
,sum(contain_n_a>3)
,sum(contain_n_a>4)
,sum(contain_n_a>5)
) %>%
gather
The letter a turns out to be used quite often.
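A small sketch of what `str_count()` returns, using three made-up names in place of the `babynames` data:

```r
library(stringr)

# Made-up sample; the notes use the babynames data instead
sample_names <- c("Aaliyah", "Bob", "Anna")

# Number of matches of "A" or "a" in each name
a_counts <- str_count(sample_names, "A|a")
a_counts

# How many names contain the letter at least once, and the average count
sum(a_counts > 0)
mean(a_counts)
```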
2.5 str_split
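`simplify = TRUE` returns a character matrix instead of a list. Because every piece has the same type, the matrix form is more efficient to compute with.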
both_names <- c("Box, George", "Cox, David")
## Split both_names into first_names and last_names
both_names_split <- str_split(both_names, ", ", n = 2, simplify = TRUE)
both_names_split
## [,1] [,2]
## [1,] "Box" "George"
## [2,] "Cox" "David"
## [1] "George" "David"
## [1] "Box" "Cox"
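A sketch comparing the two return shapes. The column indexing at the end is what yields the two name vectors printed above.

```r
library(stringr)

both_names <- c("Box, George", "Cox, David")

# Default: a list with one character vector per input string
as_list <- str_split(both_names, ", ", n = 2)

# simplify = TRUE: a character matrix with one row per input string
as_matrix <- str_split(both_names, ", ", n = 2, simplify = TRUE)

# Columns of the matrix are plain character vectors
first_names <- as_matrix[, 2]
last_names <- as_matrix[, 1]
```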
2.6 lapply
lines <-
c(
"The table was a large one, but the three were all crowded together at one corner of it:"
,"\"No room! No room!\" they cried out when they saw Alice coming."
,"\"There’s plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."
)
## Split lines into words
words <- str_split(lines,' ')
## Number of words per line
lapply(words,length)
## [[1]]
## [1] 18
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 21
## Number of characters in each word
word_lengths <- lapply(words,str_length)
## Average word length per line
lapply(word_lengths,mean)
## [[1]]
## [1] 3.888889
##
## [[2]]
## [1] 4.25
##
## [[3]]
## [1] 4.380952
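As an alternative sketch (not the approach used in the notes), base R's `vapply()` returns a plain vector instead of a list, which is often easier to work with. The first sample line is shortened for brevity.

```r
lines <- c(
  "The table was a large one",
  "\"No room! No room!\" they cried out when they saw Alice coming."
)

# Split on single spaces, as in the notes
words <- strsplit(lines, " ", fixed = TRUE)

# vapply() checks the type and length of every result
words_per_line <- vapply(words, length, integer(1))
mean_word_length <- vapply(words, function(w) mean(nchar(w)), numeric(1))

words_per_line
mean_word_length
```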
2.7 Task: extract the initial of the first name
## Define some full names
names <- c("Diana Prince", "Clark Kent")
## Split into first and last names
names_split <- str_split(names, fixed(" "), simplify = TRUE)
## Extract the first letter in the first name
abb_first <- str_sub(names_split[, 1], 1, 1)
## Combine the first letter ". " and last name
str_c(abb_first, ". ", names_split[, 2])
## [1] "D. Prince" "C. Kent"
3 Chapter 3
See Wickham (2019).
See GitHub Issue 152; this works now.
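3.1 The rebus package
It makes complex regular expressions easier to manage, because a pattern can be split over indented lines.
Shared sub-patterns can be stored in variables and reused.
Alternation
- `or()`: `or('aaa','bbb')` -> `(?:aaa|bbb)`
- `"gr" %R% or("e", "a") %R% "y"` -> `gr(?:e|a)y`
Character classes
- `char_class()`: `char_class("ae")` -> `[ae]`
- `negated_char_class("ae")` -> `[^ae]`
Repetition
- `optional()`, `one_or_more()`, `zero_or_more()`, `repeated(lo, hi)`
- These take the pattern as an argument; the repetition itself is not written with `%R%`.
- `exactly()` = `START %R% ... %R% END`
Shortcuts
- `DGT` = `\d`
- `SPC` = `\s`
- `WRD` = `\w` = `[a-zA-Z0-9_]`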
Special characters
- BACKSLASH = \
- CARET = ^
- DOLLAR = $
- DOT = .
- PIPE = |
- QUESTION = ?
- STAR = *
- PLUS = +
- OPEN_PAREN = (
- CLOSE_PAREN = )
- OPEN_BRACKET = [
- CLOSE_BRACKET = ]
- OPEN_BRACE = {
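The sketch below uses the plain regular expressions that the rebus helpers above stand for, so it needs only stringr. The test strings are made up.

```r
library(stringr)

# Alternation: the regex that "gr" %R% or("e", "a") %R% "y" stands for
grey_or_gray <- str_detect(c("grey", "gray", "groy"), "gr(?:e|a)y")

# Character class and negated character class
first_a_or_e <- str_extract("gray", "[ae]")
first_other <- str_extract("gray", "[^ae]")

# Special characters such as $ and . must be escaped to match literally
is_price <- str_detect("$5.00", "\\$\\d\\.\\d{2}")

grey_or_gray
first_a_or_e
first_other
is_price
```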
Date-time regexes
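`match = TRUE` in `str_view()`:
- This will only display elements that had a match, which is useful when you are searching over many strings.
- `str_view()` highlights only the first match in each string; use `str_view_all` to highlight every match.
- Use it to check whether your pattern does what you intended.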
library(tidyverse)
library(rebus)
contact <-
c('Call me at 555-555-0191','123 Main St','(555) 555 0191','Phone: 555.555.0191 Mobile: 555.555.0192')
optional('(') %R%
repeated(DGT,3) %R%
optional(')') %R%
char_class("-.() ") %R%
repeated(DGT,3) %R%
char_class("-.() ") %R%
repeated(DGT,4) %>%
str_extract_all(contact,.)
## [[1]]
## [1] "555-555-0191"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "(555) 555 0191"
##
## [[4]]
## [1] "555.555.0191" "555.555.0192"
Build the pattern in pieces, which makes it easier to manage.
narratives <-
c("19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS ","31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI ","ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED ","TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*","10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB ","53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION ","13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION","14YR M PLAYING FOOTBALL; DX KNEE SPRAIN ","55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE ","5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN")
## Use these patterns
age <- dgt(1,2)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
gender <- optional(SPC) %R% or("M", "F")
## Extract age, unit, gender
age_gender <- str_extract(narratives,age %R% unit %R% gender)
age_gender
## [1] "19YOM" "31 YOF" "82 YOM" "33 YOF" "10YOM" "53 YO F" "13 MOF"
## [8] "14YR M" "55YOM" "5 YOM"
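## The same pieces can then be reused to extract variables
age_gender %>%
tibble() %>%
set_names('raw') %>%
mutate(
## Anchor at the end so that "MOF" yields "F" rather than the "M" of "MO"
gender = str_extract(raw, gender %R% END) %>% str_trim()
) %>%
mutate(
is_year = str_extract(raw, unit) %>% str_trim() %>% str_sub(1, 1)
,age_num = as.numeric(str_extract(raw, age))
## if_else() is vectorised; a plain if () ... else accepts only a single value
,age = if_else(is_year == 'Y', age_num, age_num / 12)
) %>%
select(raw, gender, age)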
`ls.str()` also shows the detailed structure of each object.
4 Chapter 4
See Wickham (2019).
4.1 capture + str_match
Use `capture()` together with `str_match()`.
hero_contacts <-
c("(wolverine@xmen.com)","wonderwoman@justiceleague.org","thor@avengers.com")
capture(one_or_more(WRD)) %R%
"@" %R%
capture(one_or_more(WRD)) %R%
DOT %R%
capture(one_or_more(WRD)) %>%
str_match(hero_contacts,.)
## [,1] [,2] [,3] [,4]
## [1,] "wolverine@xmen.com" "wolverine" "xmen" "com"
## [2,] "wonderwoman@justiceleague.org" "wonderwoman" "justiceleague" "org"
## [3,] "thor@avengers.com" "thor" "avengers" "com"
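For comparison, a sketch of the same idea with a hand-written regex in place of the rebus helpers: each pair of parentheses is a capture group, and `str_match()` returns one column per group after the full match.

```r
library(stringr)

hero_contacts <- c("(wolverine@xmen.com)", "thor@avengers.com")

# Three capture groups: user, domain name, top-level domain
email_parts <- str_match(hero_contacts, "(\\w+)@(\\w+)\\.(\\w+)")

# Column 1 is the whole match; columns 2 to 4 are the groups
users <- email_parts[, 2]
top_level <- email_parts[, 4]

email_parts
```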
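The same capture-group feature is available in Impala's `regexp_extract` function.
Email addresses are nevertheless hard to extract reliably with a regular expression; see the discussion.
Another example: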
narratives <-
c("19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS ","31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI ","ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED ","TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*","10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB ","53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION ","13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION","14YR M PLAYING FOOTBALL; DX KNEE SPRAIN ","55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE ","5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN")
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
## Add capture() to get age, unit and sex
pattern <- capture(optional(DGT) %R% DGT) %R%
optional(SPC) %R% capture(or("YO", "YR", "MO")) %R%
optional(SPC) %R% capture(or("M", "F"))
## Pull out from narratives
str_match(narratives,pattern)
## [,1] [,2] [,3] [,4]
## [1,] "19YOM" "19" "YO" "M"
## [2,] "31 YOF" "31" "YO" "F"
## [3,] "82 YOM" "82" "YO" "M"
## [4,] "33 YOF" "33" "YO" "F"
## [5,] "10YOM" "10" "YO" "M"
## [6,] "53 YO F" "53" "YO" "F"
## [7,] "13 MOF" "13" "MO" "F"
## [8,] "14YR M" "14" "YR" "M"
## [9,] "55YOM" "55" "YO" "M"
## [10,] "5 YOM" "5" "YO" "M"
As a refinement, capture only the first letter of the unit to tell years from months.
## Edit to capture just Y and M in units
pattern2 <- capture(optional(DGT) %R% DGT) %R%
optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O","R")) %R%
optional(SPC) %R% capture(or("M", "F"))
## Check pattern
str_view(narratives, pattern2)
## Pull out pieces
str_match(narratives, pattern2)
## [,1] [,2] [,3] [,4]
## [1,] "19YOM" "19" "Y" "M"
## [2,] "31 YOF" "31" "Y" "F"
## [3,] "82 YOM" "82" "Y" "M"
## [4,] "33 YOF" "33" "Y" "F"
## [5,] "10YOM" "10" "Y" "M"
## [6,] "53 YO F" "53" "Y" "F"
## [7,] "13 MOF" "13" "M" "F"
## [8,] "14YR M" "14" "Y" "M"
## [9,] "55YOM" "55" "Y" "M"
## [10,] "5 YOM" "5" "Y" "M"
4.2 backreferences
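- Handy for removing repeated words.
- Back-references can match characters repeated in the same order or in reverse order.
- With this, one could answer a lot of Stack Overflow questions.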
## [1] "hello" "sweet" "kitten"
Here the captured group is equivalent to `LOWER %R% LOWER`, so a repeated pair such as the following matches:
abab
## Names with a pair of repeated letters
pair_of_repeated <- capture((LOWER %R% LOWER)) %R% REF1
## Test it
str_view('abab', pattern = pair_of_repeated, match = TRUE)
abba
## Names with a pair that reverses
pair_that_reverses <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
## Test it
str_view('abba', pattern = pair_that_reverses, match = TRUE)
Exact match
## Four letter palindrome names
four_letter_palindrome <- exactly(
capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
)
## Test it
str_view('abba', pattern = four_letter_palindrome, match = TRUE)
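The same three ideas written as plain regular expressions, as a sketch: `\\1` and `\\2` are the back-references that `REF1` and `REF2` stand for. The test words are made up.

```r
library(stringr)

candidates <- c("abab", "abba", "hello", "cat")

# A doubled letter: one captured letter followed by itself
doubled <- str_detect(candidates, "([a-z])\\1")

# A pair of letters that repeats (abab)
pair_repeats <- str_detect(candidates, "([a-z][a-z])\\1")

# A pair of letters that reverses (abba)
pair_reverses <- str_detect(candidates, "([a-z])([a-z])\\2\\1")

# Anchoring makes the match exact: only four-letter palindromes pass
exact_palindrome <- str_detect(c("abba", "abbas"), "^([a-z])([a-z])\\2\\1$")
```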
## View text containing phone numbers
contact <-
c('Call me at 555-555-0191','123 Main St','(555) 555 0191','Phone: 555.555.0191 Mobile: 555.555.0192')
## Replace digits with "X"
str_replace(contact, DGT, "X")
## [1] "Call me at X55-555-0191"
## [2] "X23 Main St"
## [3] "(X55) 555 0191"
## [4] "Phone: X55.555.0191 Mobile: 555.555.0192"
## [1] "Call me at XXX-XXX-XXXX"
## [2] "XXX Main St"
## [3] "(XXX) XXX XXXX"
## [4] "Phone: XXX.XXX.XXXX Mobile: XXX.XXX.XXXX"
## [1] "Call me at XXX-XXX-XXXX"
## [2] "... Main St"
## [3] "(***) *** ****"
## [4] "Phone: ___.___.____ Mobile: ___.___.____"
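This is a feature of `str_replace_all()`: it is vectorized over the replacement, so each string can be given its own replacement value.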
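The last two outputs above are printed without their calls. A sketch of calls that reproduce them, writing `\\d` where the notes use `DGT`:

```r
library(stringr)

contact <- c("Call me at 555-555-0191", "123 Main St", "(555) 555 0191",
             "Phone: 555.555.0191 Mobile: 555.555.0192")

# str_replace() changes only the first match in each string
first_only <- str_replace(contact, "\\d", "X")

# str_replace_all() changes every match
all_digits <- str_replace_all(contact, "\\d", "X")

# One replacement per input string: the replacement is vectorized too
per_string <- str_replace_all(contact, "\\d", c("X", ".", "*", "_"))
per_string
```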
Removing a repeated word
str_replace_all(
'I am the the aquaman'
,SPC %R%
capture(one_or_more(WRD)) %R%
SPC %R%
REF1
,' ' %R% REF1)
## [1] "I am the aquaman"
The same approach works in RStudio's find-and-replace, where each match can be replaced using its own captured value.
## <regex> \s([\w]+)\s\1
stands for the following, with the second group required to equal the first:
## <regex> \s([\w]+)\s([\w]+)
adverbs <-
c("ABNORMALLY","ABSENTMINDEDLY","ACCIDENTALLY","ACIDLY","ACTUALLY","ADVENTUROUSLY","AFTERWARDS","ALMOST","ALWAYS","ANGRILY","ANNUALLY","ANXIOUSLY","ARROGANTLY","AWKWARDLY",
"BADLY","BASHFULLY","BEAUTIFULLY","BITTERLY","BLEAKLY","BLINDLY","BLISSFULLY","BOASTFULLY","BOLDLY","BRAVELY","BRIEFLY","BRIGHTLY","BRISKLY","BROADLY","BUSILY",
"CALMLY","CAREFULLY","CARELESSLY","CAUTIOUSLY","CERTAINLY","CHEERFULLY","CLEARLY","CLEVERLY","CLOSELY","COAXINGLY","COLORFULLY","COMMONLY","CONTINUALLY","COOLLY","CORRECTLY","COURAGEOUSLY","CROSSLY","CRUELLY","CURIOUSLY",
"DAILY","DAINTILY","DEARLY","DECEIVINGLY","DEEPLY","DEFIANTLY","DELIBERATELY","DELIGHTFULLY","DILIGENTLY","DIMLY","DOUBTFULLY","DREAMILY",
"EASILY","ELEGANTLY","ENERGETICALLY","ENORMOUSLY","ENTHUSIASTICALLY","EQUALLY","ESPECIALLY","EVEN","EVENLY","EVENTUALLY","EXACTLY","EXCITEDLY","EXTREMELY",
"FAIRLY","FAITHFULLY","FAMOUSLY","FAR","FAST","FATALLY","FEROCIOUSLY","FERVENTLY","FIERCELY","FONDLY","FOOLISHLY","FORTUNATELY","FRANKLY","FRANTICALLY","FREELY","FRENETICALLY","FRIGHTFULLY","FULLY","FURIOUSLY",
"GENERALLY","GENEROUSLY","GENTLY","GLADLY","GLEEFULLY","GRACEFULLY","GRATEFULLY","GREATLY","GREEDILY",
"HAPPILY","HASTILY","HEALTHILY","HEAVILY","HELPFULLY","HELPLESSLY","HIGHLY","HONESTLY","HOPELESSLY","HOURLY","HUNGRILY",
"IMMEDIATELY","INNOCENTLY","INQUISITIVELY","INSTANTLY","INTENSELY","INTENTLY","INTERESTINGLY","INWARDLY","IRRITABLY",
"JAGGEDLY","JEALOUSLY","JOSHINGLY","JOVIALLY","JOYFULLY","JOYOUSLY","JUBILANTLY","JUDGEMENTALLY","JUSTLY",
"KEENLY","KIDDINGLY","KINDHEARTEDLY","KINDLY","KISSINGLY","KNAVISHLY","KNOTTILY","KNOWINGLY","KNOWLEDGEABLY","KOOKILY",
"LAZILY","LESS","LIGHTLY","LIKELY","LIMPLY","LIVELY","LOFTILY","LONGINGLY","LOOSELY","LOUDLY","LOVINGLY","LOYALLY",
"MADLY","MAJESTICALLY","MEANINGFULLY","MECHANICALLY","MERRILY","MISERABLY","MOCKINGLY","MONTHLY","MORE","MORTALLY","MOSTLY","MYSTERIOUSLY",
"NATURALLY","NEARLY","NEATLY","NEEDILY","NERVOUSLY","NEVER","NICELY","NOISILY","NOT",
"OBEDIENTLY","OBNOXIOUSLY","ODDLY","OFFENSIVELY","OFFICIALLY","OFTEN","ONLY","OPENLY","OPTIMISTICALLY","OVERCONFIDENTLY","OWLISHLY",
"PAINFULLY","PARTIALLY","PATIENTLY","PERFECTLY","PHYSICALLY","PLAYFULLY","POLITELY","POORLY","POSITIVELY","POTENTIALLY","POWERFULLY","PROMPTLY","PROPERLY","PUNCTUALLY",
"QUAINTLY","QUARRELSOMELY","QUEASILY","QUEERLY","QUESTIONABLY","QUESTIONINGLY","QUICKER","QUICKLY","QUIETLY","QUIRKILY","QUIZZICALLY",
"RAPIDLY","RARELY","READILY","REALLY","REASSURINGLY","RECKLESSLY","REGULARLY","RELUCTANTLY","REPEATEDLY","REPROACHFULLY","RESTFULLY","RIGHTEOUSLY","RIGHTFULLY","RIGIDLY","ROUGHLY","RUDELY",
"SADLY","SAFELY","SCARCELY","SCARILY","SEARCHINGLY","SEDATELY","SEEMINGLY","SELDOM","SELFISHLY","SEPARATELY","SERIOUSLY","SHAKILY","SHARPLY","SHEEPISHLY","SHRILLY","SHYLY","SILENTLY","SLEEPILY","SLOWLY","SMOOTHLY","SOFTLY","SOLEMNLY","SOLIDLY","SOMETIMES","SOON","SPEEDILY","STEALTHILY","STERNLY","STRICTLY","SUCCESSFULLY","SUDDENLY","SURPRISINGLY","SUSPICIOUSLY","SWEETLY","SWIFTLY","SYMPATHETICALLY",
"TENDERLY","TENSELY","TERRIBLY","THANKFULLY","THOROUGHLY","THOUGHTFULLY","TIGHTLY","TOMORROW","TOO","TREMENDOUSLY","TRIUMPHANTLY","TRULY","TRUTHFULLY",
"ULTIMATELY","UNABASHEDLY","UNACCOUNTABLY","UNBEARABLY","UNETHICALLY","UNEXPECTEDLY","UNFORTUNATELY","UNIMPRESSIVELY","UNNATURALLY","UNNECESSARILY","UPBEAT","UPLIFTINGLY","UPRIGHT","UPSIDE-DOWN","UPWARD","UPWARDLY","URGENTLY","USEFULLY","USELESSLY","USUALLY","UTTERLY",
"VACANTLY","VAGUELY","VAINLY","VALIANTLY","VASTLY","VERBALLY","VERY","VICIOUSLY","VICTORIOUSLY","VIOLENTLY","VIVACIOUSLY","VOLUNTARILY",
"WARMLY","WEAKLY","WEARILY","WELL","WETLY","WHOLLY","WILDLY","WILLFULLY","WISELY","WOEFULLY","WONDERFULLY","WORRIEDLY","WRONGLY",
"YAWNINGLY","YEARLY","YEARNINGLY","YESTERDAY","YIELDINGLY","YOUTHFULLY")
Randomly add an adverb in front of each verb
## Build pattern to match words ending in "ING"
pattern <- one_or_more(WRD) %R% "ING"
str_view(narratives, pattern)
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE CARELESSLY PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD CARELESSLY SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE CARELESSLY GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF CARELESSLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M CARELESSLY PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM CARELESSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
## One adverb per narrative
adverbs_10 <- sample(adverbs, 10)
## Replace "***ing" with "adverb ***ing"
str_replace(narratives,
capture(pattern),
str_c(adverbs_10, REF1, sep = " "))
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE KINDHEARTEDLY PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD UNIMPRESSIVELY SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE READILY GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF LAZILY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M MONTHLY PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM GRACEFULLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
Replacement combined with backreferences can be really useful for reformatting text data.
Appendix
The Unicode material could be moved to an appendix.
Unicode properties are matched with `\p{name}`.
Impala probably does not support this:
https://grokbase.com/t/cloudera/impala-user/137tdg0thq/unicode-string-issue-with-impala
Vietnamese
## Names with builtin accents
(tay_son_builtin <- c(
"Nguy\u1ec5n Nh\u1ea1c",
"Nguy\u1ec5n Hu\u1ec7",
"Nguy\u1ec5n Quang To\u1ea3n"
))
## [1] "Nguy<U+1EC5>n Nh<U+1EA1>c" "Nguy<U+1EC5>n Hu<U+1EC7>" "Nguy<U+1EC5>n Quang To<U+1EA3>n"
## Convert to separate accents
library(stringi)
tay_son_separate <- stri_trans_nfd(tay_son_builtin)
## Verify that the string prints the same
tay_son_separate
## [1] "Nguye<U+0302><U+0303>n Nha<U+0323>c" "Nguye<U+0302><U+0303>n Hue<U+0323><U+0302>" "Nguye<U+0302><U+0303>n Quang Toa<U+0309>n"
## Adele
## Adèle
## Ade<U+0300>le
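When one visible character is made up of two Unicode code points (a letter plus a combining accent), `.` matches only a single code point and so cannot match the whole character; use `\X` instead.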
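A minimal sketch of the difference, using `è` written both ways. The variable names are illustrative.

```r
library(stringr)

e_builtin <- "\u00e8"    # è as a single code point
e_separate <- "e\u0300"  # e followed by a combining grave accent

# str_length() counts code points, so the two spellings differ
str_length(e_builtin)
str_length(e_separate)

# "." consumes one code point, so it cannot span the separated form
dot_matches_all <- str_detect(e_separate, "^.$")

# "\\X" consumes a whole grapheme: base letter plus combining marks
grapheme_matches_all <- str_detect(e_separate, "^\\X$")
```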
5 Chapter 5
See Wickham (2019).
library(stringi)
suppressMessages(library(tidyverse))
library(rebus)
earnest_file <- '../data/importance-of-being-earnest.txt'
This instructor's text processing is excellent and worth following.
She has indeed done analysis of novels.
## Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
## Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))
## Get rid of gutenberg intro text
earnest_sub <- earnest[(start + 1):(end - 1)]
## Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))
## Set up index
intro_line_index <- 1:(lines_start - 1)
## Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]
## Take a look at the first 20 lines
writeLines(play_text[1:20])
## FIRST ACT
##
##
## SCENE
##
##
## Morning-room in Algernon's flat in Half-Moon Street. The room is
## luxuriously and artistically furnished. The sound of a piano is heard in
## the adjoining room.
##
## [Lane is arranging afternoon tea on the table, and after the music has
## ceased, Algernon enters.]
##
## Algernon. Did you hear what I was playing, Lane?
##
## Lane. I didn't think it polite to listen, sir.
##
## Algernon. I'm sorry for that, for your sake. I don't play
## accurately--any one can play accurately--but I play with wonderful
## expression. As far as the piano is concerned, sentiment is my forte. I
Why does Charlotte Wickham say `stri_read_lines()` is better than `readLines()`?
It provides functions for processing the text directly, without having to do the work in a `data.frame`.
## [1] "Algernon. I'm sorry for that, for your sake. I don't play"
## [2] "accurately--any one can play accurately--but I play with wonderful"
## [3] "expression. As far as the piano is concerned, sentiment is my forte. I"
## [4] "keep science for Life."
## [5] "Lane. Yes, sir."
## [6] "Algernon. And, speaking of the science of Life, have you got the"
## Pattern from last step
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT
## Get subset of lines that match
lines <- str_subset(play_lines,pattern_2)
## Extract match from lines
who <- str_extract(lines,pattern_2)
## Let's see what we have
unique(who)
## [1] "Algernon." "Lane." "Jack." "Cecily." "Ernest."
## [6] "University." "Gwendolen." "July." "Chasuble." "Merriman."
## [11] "Sunday." "Mr." "London." "Cardew." "Opera."
## [16] "Markby." "Oxonian."
- `or` takes multiple character vector inputs and returns a character vector of the inputs separated by pipes.
- `%|%` is an operator interface to this function.
- `or1` takes a single character vector and returns a string collapsed by pipes.
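A base R sketch of what collapsing by pipes amounts to. It imitates the idea with `paste()` and is not the rebus implementation; the second and third sample lines are made up.

```r
characters <- c("Algernon", "Jack", "Lady Bracknell")

# Collapse the names into one non-capturing alternation
alternatives <- paste0("(?:", paste(characters, collapse = "|"), ")")

# A speaker line starts with one of the names followed by a literal dot
speaker_pattern <- paste0("^", alternatives, "\\.")

sample_lines <- c(
  "Algernon. Did you hear what I was playing, Lane?",
  "July. A made-up line that starts with a non-speaker word",
  "Lady Bracknell. A made-up line for a two-word name"
)

is_speaker_line <- grepl(speaker_pattern, sample_lines, perl = TRUE)
speakers <- regmatches(sample_lines, regexpr(speaker_pattern, sample_lines, perl = TRUE))
```

Listing the known characters avoids false positives such as "July." that a generic capitalized-word pattern picks up.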
## Variables from previous step
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble",
"Merriman", "Lady Bracknell", "Miss Prism")
pattern_3 <- START %R% or1(characters) %R% DOT
## Pull out matches
lines <- str_subset(play_lines, pattern_3)
## Extract match from lines
who <- str_extract(lines, pattern_3)
## Let's see what we have
unique(who)
## [1] "Algernon." "Lane." "Jack." "Cecily."
## [5] "Gwendolen." "Lady Bracknell." "Miss Prism." "Chasuble."
## [9] "Merriman."
## who
## Algernon. Cecily. Chasuble. Gwendolen.
## 201 154 42 102
## Jack. Lady Bracknell. Lane. Merriman.
## 219 84 21 17
## Miss Prism.
## 41
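Lines such as
Jack and Algernon [Speaking together.]
are not captured by this pattern; this still needs to be solved.
The processing here is not robust enough. I need to understand the logic of each step in order to improve on it later.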
5.1 ignore case
## [1] "79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*"
## [2] "21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%"
## [3] "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION "
## [4] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"
## [5] "42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA"
## [6] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"
## Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")
## See matches to word DOG
str_view(catcidents, whole_dog_pattern, match = TRUE)
## character(0)
## [1] "Cat" "CAT" "cAt"
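A sketch of case-insensitive matching with stringr's `regex(ignore_case = TRUE)`, written with plain `\\b` word boundaries. The sample strings are made up and only loosely modeled on the data above.

```r
library(stringr)

# Made-up notes in mixed case (not the course data)
notes <- c("tRiPPED ovER cAT", "walKiNG DOG, dog took OfF", "CAT FOOD", "concatenate")

# Case-sensitive whole-word match: only the upper-case DOG is found
dog_upper <- str_detect(notes, "\\bDOG\\b")

# Case-insensitive whole-word match; "concatenate" is skipped because
# its "cat" is not a whole word
cat_any_case <- str_extract(notes, stringr::regex("\\bcat\\b", ignore_case = TRUE))

dog_upper
cat_any_case
```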
5.2 stri_trans_totitle()
library(stringi)
## Get first five catcidents
cat5 <- catcidents[1:5]
## Take a look at original
writeLines(cat5)
## 79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*
## 21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%
## 87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION
## bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS
## 42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA
## [1] "79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*"
## [2] "21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%"
## [3] "87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion "
## [4] "Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs"
## [5] "42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica"
## [1] "79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*"
## [2] "21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%"
## [3] "87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion "
## [4] "Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs"
## [5] "42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica"
## [1] "79yof fractured finger tripped over cat and fell to floor last night at home*"
## [2] "21 yof reports sus laceration of her left hand when she was opening a can of cat food just pta. Dx hand laceration%"
## [3] "87yof tripped over cat, hit leg on step. Dx lower leg contusion "
## [4] "Blunt chest trauma, r/o rib fx, r/o cartilage inj to rib cage; 32yom walking dog, dog took off after cat,fell,struck chest on steps,hit ribs"
## [5] "42yof to er for back pain after putting down some cat litter dx: back pain, sciatica"
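The title-case and sentence-case outputs above are shown without their calls. A sketch of the two `stri_trans_totitle()` variants on one of the strings, assuming these are the calls behind the last two outputs:

```r
library(stringi)

note <- "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION"

# Default: capitalize the first letter of every word
title_case <- stri_trans_totitle(note)

# type = "sentence": capitalize only the first letter of each sentence
sentence_case <- stri_trans_totitle(note, type = "sentence")

writeLines(c(title_case, sentence_case))
```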
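Tidying up these notes was a genuinely useful way to learn.
The books recommended in the slides are worth a look.
5.3 Certificate of completion
See DataCamp.
Appendix
References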
Wickham, Charlotte. 2019. “String Manipulation in R with Stringr.” DataCamp. 2019. https://www.datacamp.com/courses/string-manipulation-in-r-with-stringr.