Functional Programming with the Tidyverse
李家翔
2019-07-26
Preface
Writing code in the Tidyverse style is geared mainly toward data analysis rather than general programming.
For a quicker introduction, see the DataCamp course Add to Tidyverse (FREE) (Li 2018).
1 Using the magrittr package
Tidyverse-style code makes heavy use of the pipe %>%, which is only the most common of several pipe operators.
For more on pipes, see GitHub.
The magrittr package has two main goals (张丹 2018):
- first, to cut development time and improve code readability and maintainability;
- second, to make your code shorter.
There are four main pipe operators:
- %>%
- %T>%: the tee operator
- %$%
- %<>%
I find the pipes are easiest to grasp through examples, so each one below is illustrated with one.
1.1 %>%
See GitHub.
x %>% f %>% g %>% h = h(g(f(x)))
x %>% f(y, z = .) = f(y, z = x)
x %>% {f(y = nrow(.), z = ncol(.))} = f(y = nrow(x), z = ncol(x))
The following example:
- draws 10000 random numbers from a normal distribution;
- takes the absolute value of each and multiplies by 50;
- arranges the results into a 100 x 100 matrix;
- computes each row mean and rounds it to the nearest integer;
- takes the remainder modulo 7 and plots a histogram of the remainders.
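A minimal sketch of this chain with the plain pipe (set.seed makes it reproducible; the %T>% variant in the next section extends it):
set.seed(1)
rnorm(10000) %>%
  abs %>%
  `*` (50) %>%
  matrix(ncol = 100) %>%
  rowMeans %>%
  round %>%
  `%%` (7) %>%
  hist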
1.2 %T>%
- Draw 10000 random numbers from a normal distribution.
- Take the absolute value of each and multiply by 50.
- Arrange the results into a 100 x 100 matrix.
- Compute each row mean and round it to the nearest integer.
- Take the remainder modulo 7 and plot a histogram of the remainders.
- Sum the remainders.
set.seed(1)
rnorm(10000) %>%
abs %>%
`*` (50) %>%
matrix(ncol = 100) %>%
rowMeans %>%
round %>%
`%%` (7) %T>%
hist %>%
sum
## [1] 328
Note how %T>% guarantees that both of the final outputs (hist and sum) are produced from the result of `%%` (7): the tee operator runs hist for its side effect and then passes its left-hand side on unchanged.
1.3 %$%
Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function. There are also functions that do not have a data argument, for which it is useful to expose the variables in the data. This is done with the
%$%
operator. https://github.com/tidyverse/magrittr
In other words, %$% exposes the columns of the left-hand side the way $ does in obj$x$y.
iris %>%
subset(Sepal.Length > mean(Sepal.Length)) %$%
cor(Sepal.Length, Sepal.Width)
#> [1] 0.3361992
This lets you skip writing .$a for each column.
1.4 %<>%
It works like the inplace = True argument of many pandas methods: the result is assigned back to the left-hand side.
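A minimal sketch of the compound-assignment pipe:
x <- c(-3, 1, -2)
x %<>% abs %>% sort
# equivalent to: x <- x %>% abs %>% sort
x
## [1] 1 2 3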
1.5 Passing functions through the pipe
Using if ... else here avoids the pitfalls of wrapping the logic in ifelse, which always returns a vectorized result.
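A minimal sketch, piping into an if ... else wrapped in braces so the branch sees the whole object rather than a vectorized element:
mtcars %>%
  {
    if (is.data.frame(.)) head(., 2) else .
  }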
2 Reading data with read_*
See jiaxiangbu and jiaxiangbu.
2.1 read_* documentation
In the blogdown package, Yihui uses read_utf8 {xfun} rather than read_file, guaranteeing that source files are read as UTF-8.
Although read_utf8 is not part of the Tidyverse, it handles encoding problems well and is recommended.
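A minimal sketch; "example.R" is a hypothetical path:
# Read a source file as UTF-8 regardless of the local encoding.
xfun::read_utf8("example.R")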
2.2 Professional data-description headers
See Östblom and Niehus (2018).
Here we use read_delim to read a dataset that contains comment lines.
car_acc <- read_delim("datasets/road-accidents.csv",delim = '|',comment = "#")
In this file, # marks a comment line and ##### marks a section title; the layout below is a good model for a professional data-description header.
##### LICENSE #####
# This data set is modified from the original at fivethirtyeight (https://github.com/fivethirtyeight/data/tree/master/bad-drivers)
# and it is released under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
##### COLUMN ABBREVIATIONS #####
# drvr_fatl_col_bmiles = Number of drivers involved in fatal collisions per billion miles (2011)
# perc_fatl_speed = Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding (2009)
# perc_fatl_alcohol = Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired (2011)
# perc_fatl_1st_time = Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents (2011)
##### DATA BEGIN #####
state|drvr_fatl_col_bmiles|perc_fatl_speed|perc_fatl_alcohol|perc_fatl_1st_time
Alabama|18.8|39|30|80
3 Tips for using reprex
The reprex package makes sharing code easy; this chapter introduces it so that asking questions online goes more smoothly.
3.1 Common errors
3.1.1 Install the package in question
Putting the code you want to test into ... is the most reliable approach.
> install.packages("shinyjs")
Error in install.packages : error reading from connection
> devtools::install_github("daattali/shinyjs")
The shinyjs package errors because it is not installed; install it from GitHub instead.
3.2 Bringing in data
See McBain and Dervieux (2018).
## [1] "structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1), cyl = c(6, "
## [2] "6, 4, 6, 8, 6), disp = c(160, 160, 108, 258, 360, 225), hp = c(110, "
## [3] "110, 93, 110, 175, 105), drat = c(3.9, 3.9, 3.85, 3.08, 3.15, "
## [4] "2.76), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46), qsec = c(16.46, "
## [5] "17.02, 18.61, 19.44, 17.02, 20.22), vs = c(0, 0, 1, 1, 0, 1), "
## [6] " am = c(1, 1, 1, 0, 0, 0), gear = c(4, 4, 4, 3, 3, 3), carb = c(4, "
## [7] " 4, 1, 1, 2, 1)), row.names = c(\"Mazda RX4\", \"Mazda RX4 Wag\", "
## [8] "\"Datsun 710\", \"Hornet 4 Drive\", \"Hornet Sportabout\", \"Valiant\""
## [9] "), class = \"data.frame\")"
## tibble::tribble(
## ~mpg, ~cyl, ~disp, ~hp, ~drat, ~wt, ~qsec, ~vs, ~am, ~gear, ~carb,
## 21, 6, 160, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4,
## 21, 6, 160, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4,
## 22.8, 4, 108, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1,
## 21.4, 6, 258, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1,
## 18.7, 8, 360, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2,
## 18.1, 6, 225, 105, 2.76, 3.46, 20.22, 1, 0, 3, 1
## )
deparse() serializes the structure of a table, after which you can copy and paste it, or run clipr::write_clip() to put it on the clipboard. datapasta::tribble_paste() goes further and pastes a tibble-formatted table directly in place.
3.3 More features
3.4 Choosing the publication venue
- “gh” for GitHub-Flavored Markdown, the default
- “so” for Stack Overflow Markdown
- “ds” for Discourse, e.g., community.rstudio.com.
3.5 A reproducible example
As discussed on our call, your report currently omits library(plm), which is why it errors.
Suppose we want to report the dimensions and summary of mtcars, and also the local configuration of whoever runs the code.
First, write the code you need.
Then copy it (Ctrl + C) and run the rendering code.
This captures your current system information and the installed versions of the packages involved.
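A minimal sketch of the rendering step, assuming the code to share has just been copied:
# Render the clipboard into a shareable snippet;
# si = TRUE appends the session info shown below.
reprex::reprex(si = TRUE)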
At this point, if your clipboard has not been overwritten, pressing Ctrl + V in a GitHub comment box pastes Markdown that renders exactly like the output shown here.
library(dplyr)
#>
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>% dim()
#> [1] 32 11
mtcars %>% summary()
#> mpg cyl disp hp
#> Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
#> 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
#> Median :19.20 Median :6.000 Median :196.3 Median :123.0
#> Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
#> 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
#> Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
#> drat wt qsec vs
#> Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
#> 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
#> Median :3.695 Median :3.325 Median :17.71 Median :0.0000
#> Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
#> 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
#> Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
#> am gear carb
#> Min. :0.0000 Min. :3.000 Min. :1.000
#> 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
#> Median :0.0000 Median :4.000 Median :2.000
#> Mean :0.4062 Mean :3.688 Mean :2.812
#> 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :1.0000 Max. :5.000 Max. :8.000
Created on 2019-07-25 by the reprex package (v0.2.1)
Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#> setting value
#> version R version 3.5.3 (2019-03-11)
#> os macOS Mojave 10.14.5
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate zh_CN.UTF-8
#> ctype zh_CN.UTF-8
#> tz Asia/Shanghai
#> date 2019-07-25
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
#> backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0)
#> callr 3.3.0 2019-07-04 [1] CRAN (R 3.5.2)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
#> devtools 2.1.0 2019-07-06 [1] CRAN (R 3.5.2)
#> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0)
#> dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#> evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.2)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.2)
#> highr 0.7 2018-06-09 [1] CRAN (R 3.5.0)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
#> knitr 1.22.8 2019-05-14 [1] Github (yihui/knitr@00ffce2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
#> pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.0)
#> pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.5.2)
#> pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
#> processx 3.4.0 2019-07-03 [1] CRAN (R 3.5.2)
#> ps 1.2.1 2018-11-06 [1] CRAN (R 3.5.0)
#> purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.0)
#> R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.0)
#> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.5.2)
#> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
#> rmarkdown 1.10 2018-06-11 [1] CRAN (R 3.5.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
#> testthat 2.1.1 2019-04-23 [1] CRAN (R 3.5.2)
#> tibble 2.1.1 2019-03-16 [1] CRAN (R 3.5.2)
#> tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0)
#> usethis 1.5.0 2019-04-07 [1] CRAN (R 3.5.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
#> xfun 0.6 2019-04-02 [1] CRAN (R 3.5.2)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
4 dplyr
See Baert (2018a), Baert (2018b), Baert (2018c), and Baert (2018d).
4.1 not: ~ and !
~ stands in for function(), and ! means "not"; these come from the magrittr package's definitions (张丹 2018).
4.2 select_*
4.2.1 select_all = rename_all
Select (or rename) columns in bulk, e.g. by data type.
4.2.2 select_if
Combined with an aggregate function such as sum, it filters variables by a condition, e.g. mean(., na.rm = TRUE) > 10 or n_distinct(.) < 10.
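A minimal sketch of condition-based selection on mtcars (the thresholds are arbitrary):
# Keep numeric columns whose mean exceeds 10.
mtcars %>%
  select_if(~ is.numeric(.) && mean(., na.rm = TRUE) > 10)
# Keep columns with fewer than 10 distinct values.
mtcars %>%
  select_if(~ n_distinct(.) < 10)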
4.3 head and glimpse work without ()
In a pipe, head and glimpse can be called without parentheses.
4.4 rowwise()
rowwise() changes the direction in which mean operates: by row instead of by column.
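A minimal sketch of the row-wise direction:
tibble(x = 1:3, y = 4:6) %>%
  rowwise() %>%
  mutate(avg = mean(c(x, y))) %>%  # per-row mean: 2.5, 3.5, 4.5
  ungroup()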
4.6 na_if
Converts the specified values to NA.
str_detect is simpler than str_subset here.
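A minimal sketch of na_if:
x <- c("a", "", "b")
na_if(x, "")  # the placeholder "" becomes NA
## [1] "a" NA  "b"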
4.7 filter w/ conditions
- filter(condition1, condition2) will return rows where both conditions are met.
- filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
- filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.
- filter(xor(condition1, condition2)) will return all rows where only one of the conditions is met, and not when both conditions are met.
4.7.1 xor
xor selects the complement of an open interval:
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
xor(a > 10, a < 90) = (a > 10 & a >= 90) | (a <= 10 & a < 90)
i.e. it keeps exactly the rows outside the open interval (10, 90).
4.7.2 any_vars & all_vars
msleep %>%
select(name:order, sleep_total, -vore) %>%
filter_all(any_vars(str_detect(., pattern = "Ca")))
any_vars and all_vars filter across multiple columns at once: any_vars keeps a row if any of the selected columns meets the condition, all_vars only if all of them do.
4.8 add_count / add_tally
These append the count (or sum) within each level as a new column.
msleep %>%
select(name:vore) %>%
add_count(vore)
msleep %>%
select(name:vore) %>%
left_join(
msleep %>%
group_by(vore) %>%
count()
,by='vore'
)
This saves the combination of grouping, mutating and ungrouping again (Baert 2018d).
Obtain count of unique combination of columns in R dataframe without eliminating the duplicate columns from the data. (Thomas 2017)
5 rlang
We have come to realize that this pattern is difficult to teach and to learn because it involves a new, unfamiliar syntax and because it introduces two new programming concepts (quote and unquote) that are hard to understand intuitively. This complexity is not really justified because this pattern is overly flexible for basic programming needs. https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
A fair point.
## [1] '0.4.0'
get_n_unique <- function(df, by, columns) {
df %>%
group_by({{by}}) %>%
summarise(
n_unique = n_distinct({{columns}})
)
}
get_n_unique(mtcars, cyl, mpg)
get_n_unique <- function(df, by, ...) {
df %>%
group_by({{by}}) %>%
summarise(
...
)
}
get_n_unique(
df = mtcars,
by = cyl,
n_unique1 = n_distinct(hp),
n_unique2 = n_distinct(disp)
)
6 Practical, composite tasks
6.1 Automatically picking up the latest file
For reproducible research you need to keep each day's run, so files are named date_*:
- date is the date: 181001, for example, marks a file saved on 2018-10-01; the file may be a .csv (for transferring data) or a .png (for displaying a figure);
- _* gives the file's name and type, e.g. _imptable.csv for a table of importance data.
The presentation .Rmd then references these files directly.
The date tag can be set up like this:
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## [1] "190726"
Some reports, though, do not need refreshing every day; sometimes you just refer to yesterday's data, so all you need is the most recent file.
list.files('data') %>% str_subset('confusing_matrix.csv') %>% max
So in the data folder, match the desired name and format, e.g. confusing_matrix.csv, and take the latest file; max works because the date prefix sorts lexicographically. No daily refresh is required.
6.2 Frequency table
# df1 <-
mtcars %>%
dplyr::count(cyl, am) %>%
mutate(prop = prop.table(n)) %T>%
print %>%
select(-n) %>%
# margin table
group_by(cyl) %>%
mutate(prop = prop/sum(prop)) %>%
spread(am,prop)
## # A tibble: 6 x 4
## cyl am n prop
## <dbl> <dbl> <int> <dbl>
## 1 4 0 3 0.0938
## 2 4 1 8 0.25
## 3 6 0 4 0.125
## 4 6 1 3 0.0938
## 5 8 0 12 0.375
## 6 8 1 2 0.0625
See Aizkalns (2018).
Note that group_by(cyl) is there because the probabilities are normalized within each level of cyl.
6.3 Removing outliers by the boxplot rule
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
Blagotić (2011) builds this function from the boxplot definition.
Win. (2011), however, shows a simpler route via boxplot.stats, which I will use from now on.
6.4 Binning on unique values
quantile(unique(...)) (Elferts 2013) does it.
set.seed(123)
a <- c(rep(1,10),rep(2,10),runif(10,3,10))
breaks <-
a %>%
unique %>%
c(-Inf,.,Inf) %>%
quantile(c(0, 0.25,0.5,0.75,1))
cut(
a
,breaks
)
## [1] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [5] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [9] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [13] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [17] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [21] (3.74,6.45] (6.45,9.02] (3.74,6.45] (9.02, Inf]
## [25] (9.02, Inf] (-Inf,3.74] (6.45,9.02] (9.02, Inf]
## [29] (6.45,9.02] (3.74,6.45]
## 4 Levels: (-Inf,3.74] (3.74,6.45] ... (9.02, Inf]
## [1] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [5] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [9] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [13] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [17] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74] (-Inf,3.74]
## [21] (3.74,6.45] (6.45,9.02] (3.74,6.45] (9.02, Inf]
## [25] (9.02, Inf] (-Inf,3.74] (6.45,9.02] (9.02, Inf]
## [29] (6.45,9.02] (3.74,6.45]
## 4 Levels: (-Inf,3.74] (3.74,6.45] ... (9.02, Inf]
The function(x) pattern here follows section 1.5 and yields the identical result shown above.
6.5 Cutting a variable into bins
Use cut rather than case_when.
## [1] (0.5,2.5] (0.5,2.5] (0.5,2.5] (0.5,2.5]
## [5] (0.5,2.5] (0.5,2.5] (0.5,2.5] (0.5,2.5]
## [9] (0.5,2.5] (0.5,2.5] (0.5,2.5] (0.5,2.5]
## [13] (0.5,2.5] (0.5,2.5] (0.5,2.5] (0.5,2.5]
## [17] (0.5,2.5] (0.5,2.5] (0.5,2.5] (0.5,2.5]
## [21] (2.5, Inf] (2.5, Inf] (2.5, Inf] (2.5, Inf]
## [25] (2.5, Inf] (2.5, Inf] (2.5, Inf] (2.5, Inf]
## [29] (2.5, Inf] (2.5, Inf]
## Levels: (-Inf,0.5] (0.5,2.5] (2.5, Inf]
Here we only want cut points at 0.5 and 2.5.
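A minimal sketch of the call that likely produced the output above, reusing the vector a from section 6.4:
cut(a, breaks = c(-Inf, 0.5, 2.5, Inf))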
6.6 Batch-handling continuous and categorical variables
We want to:
- summarize a variable by its mean when it is continuous;
- summarize it by its mode when it is categorical.
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
summarise_all(
function(x){ifelse(is.numeric(x),mean(x),names(sort(-table(x)))[1])}
)
6.7 Computing the mode
library(formattable)
mtcars %>%
.$cyl %>%
table %>%
sort(.,decreasing = T) %>%
print %>%
`/` (sum(.)) %>%
percent %>%
print %>%
head(1) %>%
names()
## .
## 8 4 6
## 14 11 7
## .
## 8 4 6
## 44% 34% 22%
## [1] "8"
6.8 Counting meal-subsidy days
INPUT
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter,
## second, wday, week, yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(tidyverse)
library(lubridate)
fread("id X1 X2
1 09:12 17:16 09:10 14:14 17:21
2 09:54 20:24 09:54
")
OUTPUT
STEP
1. Split out each person's daily punch-in/punch-out times (X1 and X2 are workdays; id is the person).
2. For each person and day, compute the punch span (the maximum minus the minimum within each cell).
3. Reshape into a subsidy count and a subsidy amount. The count rule (per person, over all workdays): the earliest punch must be at or before 10:10, the latest punch at or after 20:00, and the working span at least 10 hours; the amount is the count times 20.
Note: the normal subsidy window is 10:00-20:00, so someone arriving at 11:00 and punching out at 24:00 gets no subsidy.
fread("id X1 X2
1 09:12 17:16 09:10 14:14 17:21
2 09:54 20:24 09:54
") %>%
mutate_if(is.character,str_squish) %>%
# strip extra whitespace
mutate_if(is.character,~str_split(.,' ')) %>%
gather(key,value,-id) %>%
unnest() %>%
group_by(id,key) %>%
summarise(max=max(value),
min=min(value)) %>%
# earliest and latest punch per person per day
mutate(
cnt =
ifelse(
(max>'20:00'
& min<'10:10'
& as.difftime(str_c(max,':00'))-as.difftime(str_c(min,':00')) > 10
)
,1,0
)
) %>%
# count the subsidy days
select(id,cnt) %>%
group_by(id) %>%
summarise(cnt=sum(cnt)) %>%
mutate(amt = cnt*20) %>%
# compute the subsidy amount
set_names(c('id','餐补次数','餐补金额'))
6.9 Exploring Alipay transaction records
library(data.table)
library(lubridate)
'alipay_record_20181022_1240_1.csv' %>%
fread %>%
mutate(`付款时间` = ymd_hms(`付款时间`)) %>%
# distinct(`类型`)
# filter(`类型` != '支付宝担保交易'
# ,`收/支`=='支出') %>%
# filter(
# `金额(元)` < 300
# ,`金额(元)` > 100
# ) %>%
# distinct(`交易对方`)
# names
filter(`交易对方` %in% str_subset(`交易对方`,'寿司|拉面'))
- For Alipay transaction records, fread reads the table directly, no preprocessing needed.
7 Uncategorized notes
7.1 Extracting categorical variable names
7.2 Computing colMeans with dplyr
library(tidyverse)
rm(list = ls())
a1<-letters[1:10]
b1<-rnorm(10)
b2<-b3<-rnorm(10)
d1<-data.frame(a1,b1,b2,b3)
d1[9,3]<-NA
d1$b4<-NA
d1 %>%
## mutate(new = rowMeans(. %>% select(b1,b2),
## na.rm = T))
## does not work: `.` cannot start a nested %>% chain inside mutate
mutate(new = rowMeans(.[,c("b1","b2")],
na.rm = T))
rowwise is an alternative, though its behavior here is not entirely clear to me.
7.3 ->
After writing a long %>% chain, -> assigns the finished result to a new object without your having to jump back to the start of the expression. Very convenient.
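A minimal sketch of right assignment:
mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg)) -> cyl4_summary
cyl4_summary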
7.4 multiple left joins
## -----------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -----------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
##
## here
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate,
## rename, summarise, summarize
## The following object is masked from 'package:purrr':
##
## compact
a <- 1:100
x <- data_frame(a = a, b = rnorm(100))
y <- data_frame(a = a, b = rnorm(100))
z <- data_frame(a = a, b = rnorm(100))
join_all(list(x,y,z), by='a', type='left')
This can also be done with the reduce function; see the reference on merging multiple tables with reduce.
7.5 Why does select stop working?
detach(package:MASS)
MASS keeps grabbing select. Tab-complete the name to see which package the function is being taken from.
7.6 Sampling with replacement
sample(x, size, replace = FALSE, prob = NULL)
replace = FALSE means sampling without replacement; replace = TRUE allows repeated draws.
7.7 xlsx::write.xlsx
Exporting .xlsx files:
xlsx::write.xlsx(mtcars, "mtcars.xlsx")
This works, and the first column (column A in Excel) holds the data.frame's row names.
openxlsx::write.xlsx, by contrast, keeps failing for me.
7.7.1 Diagnosing the openxlsx::write.xlsx failure
Error: zipping up workbook failed. Please make sure Rtools is installed or a zip application is available to R. Try installr::install.rtools() on Windows. If the "Rtools\bin" directory does not appear in Sys.getenv("PATH") please add it to the system PATH or set this within the R session with Sys.setenv("R_ZIPCMD" = "path/to/zip.exe")
7.7.1.1 1st installr::install.rtools()
installr::install.rtools()
7.7.1.2 Sys.getenv(“PATH”)
Sys.getenv("PATH")
7.7.1.3 Sys.setenv(“R_ZIPCMD” = “path/to/zip.exe”)
Sys.setenv("R_ZIPCMD" = "path/to/zip.exe")
7.7.2 Other approaches
dataframes2xls::write.xls(mtcars,"mtcars.xls")
This one requires Python support.
XLConnect::createSheet(mtcars, name = "CO2")
Error: loadNamespace()里算'rJava'时.onLoad失败了,详细内容:
调用: fun(libname, pkgname)
错误: JAVA_HOME cannot be determined from the Registry
7.8 Batch-reading files, then bind_rows and a full join
library(tidyverse)
library(readxl)
path_tree <- file.path(dirname(getwd()),"lijiaxiang","170922 team","xiaosong2","prep")
paths <- list.files(path = path_tree, recursive = T, full.names = T)
file.path builds paths portably, easing interchange between macOS and Windows 7.
list.files produces the list of paths, convenient for looping.
xiaosong <- plyr::join_all(
list(
paths %>% str_subset("#1/") %>% as_tibble() %>%
mutate(table = map(value,read_excel)) %>%
.$table %>%
bind_rows()
,
paths %>% str_subset("#10/") %>% as_tibble() %>%
mutate(table = map(value,read_excel)) %>%
.$table %>%
bind_rows()
),
by = "采样时间",
type = "full"
)
This is the key step. paths contains two kinds of files, #1 and #10:
- first row-bind each group with bind_rows;
- then full-join the two results on the key "采样时间".
Take the #1 file paths first:
- paths %>% str_subset("#1/") keeps only the #1 paths; as_tibble() turns them into a table with a column named value;
- mutate(table = map(value, read_excel)) builds a new column, table, by applying read_excel to every element of value, effectively a for loop;
- .$table extracts that column as a list, and bind_rows() stacks it into one table.
Repeat the same steps for #10.
Finally, plyr::join_all full-joins the two tables.
Finally, check for missing values:
anyNA(xiaosong)
xiaosong %>%
gather() %>%
group_by(key) %>%
summarise(sum(is.na(value)))
7.9 Selecting the top rows of each group
group_by() plus top_n(n, variable) selects the top rows within each group; don't forget ungroup() at the end. A sketch follows.
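mtcars %>%
  group_by(cyl) %>%
  top_n(2, mpg) %>%  # top 2 rows by mpg within each cyl
  ungroup()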
7.10 Filling in bulk
fill_(df, names(df))
and
fill(df, everything())
both work.
Checking missingness for every variable in a table:
- r - Using dplyr summarise_each() with is.na() - Stack Overflow
summarise_each(funs(sum(is.na(.)) / length(.))) does it; note summarise_each has since been superseded by summarise_all.
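An equivalent sketch with the newer summarise_all, using the built-in airquality data:
# Proportion of NAs per column.
airquality %>%
  summarise_all(~ sum(is.na(.)) / length(.))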
7.11 read_excel: merging sheets in bulk
References:
- readxl Workflows • readxl
- Add function to read all sheets into a list · Issue #407 · tidyverse/readxl
path <- "40ctdata_resave.xlsx"
cb_data <- path %>%
excel_sheets() %>%
set_names() %>%
map_df(~ read_excel(path = path, sheet = .x, range = "A4:T33",
col_types = rep("text",20)), .id = "sheet") %>%
write_csv("cb_data.csv")
print(cb_data, n = Inf)
I checked: each sheet has 29 valid rows and 20 columns. You can then use a pivot table, starting with the means, to sanity-check each variable. The merged result has 1160 rows (29 rows per sheet across the 40 sheets).
excel_sheets returns the list of sheets in an xlsx file.
7.12 mutate: adding multiple lag variables
library(data.table)
full_data_ipt_miss_wo_scale_lag_y_15 <-
full_data_ipt_miss_wo_scale %>%
ungroup() %>%
group_by(code) %>%
do(data.frame(., setNames(shift(.$return, 1:15), paste("return_lag_",seq(1:15)))))
setNames lets you name multiple new variables at once.
full_data_ipt_miss_wo_scale_lag_y_15 %>%
select(contains("return")) %>%
summarise_all(funs(sum(is.na(.))))
7.13 select + matches for regular expressions
Regex within dplyr/select helpers - tidyverse - RStudio Community. Note that the helper is matches.
Set operations in dplyr additionally include setequal, and they operate on whole tables.
union is the set union; union_all concatenates and keeps duplicates.
mtcars$model <- rownames(mtcars)
first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
intersect(first, second)
## [[1]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
##
## [[2]]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4
##
## [[3]]
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7
## [9] 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0
## [17] 440.0 78.7 75.7 71.1
##
## [[4]]
## [1] 110 110 93 110 175 105 245 62 95 123 123 180
## [13] 180 180 205 215 230 66 52 65
##
## [[5]]
## [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92
## [11] 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22
##
## [[6]]
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190
## [9] 3.150 3.440 3.440 4.070 3.730 3.780 5.250 5.424
## [17] 5.345 2.200 1.615 1.835
##
## [[7]]
## [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00
## [9] 22.90 18.30 18.90 17.40 17.60 18.00 17.98 17.82
## [17] 17.42 19.47 18.52 19.90
##
## [[8]]
## [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1
##
## [[9]]
## [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
##
## [[10]]
## [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4
##
## [[11]]
## [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1
##
## [[12]]
## [1] "Mazda RX4" "Mazda RX4 Wag"
## [3] "Datsun 710" "Hornet 4 Drive"
## [5] "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D"
## [9] "Merc 230" "Merc 280"
## [11] "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC"
## [15] "Cadillac Fleetwood" "Lincoln Continental"
## [17] "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla"
##
## [[13]]
## [1] 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4
## [11] 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8
## [21] 19.7 15.0 21.4
##
## [[14]]
## [1] 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
##
## [[15]]
## [1] 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0
## [9] 78.7 75.7 71.1 120.1 318.0 304.0 350.0 400.0
## [17] 79.0 120.3 95.1 351.0 145.0 301.0 121.0
##
## [[16]]
## [1] 123 123 180 180 180 205 215 230 66 52 65 97
## [13] 150 150 245 175 66 91 113 264 175 335 109
##
## [[17]]
## [1] 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93
## [11] 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22
## [21] 3.62 3.54 4.11
##
## [[18]]
## [1] 3.440 3.440 4.070 3.730 3.780 5.250 5.424 5.345
## [9] 2.200 1.615 1.835 2.465 3.520 3.435 3.840 3.845
## [17] 1.935 2.140 1.513 3.170 2.770 3.570 2.780
##
## [[19]]
## [1] 18.30 18.90 17.40 17.60 18.00 17.98 17.82 17.42
## [9] 19.47 18.52 19.90 20.01 16.87 17.30 15.41 17.05
## [17] 18.90 16.70 16.90 14.50 15.50 14.60 18.60
##
## [[20]]
## [1] 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
##
## [[21]]
## [1] 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
##
## [[22]]
## [1] 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
##
## [[23]]
## [1] 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
##
## [[24]]
## [1] "Merc 280" "Merc 280C"
## [3] "Merc 450SE" "Merc 450SL"
## [5] "Merc 450SLC" "Cadillac Fleetwood"
## [7] "Lincoln Continental" "Chrysler Imperial"
## [9] "Fiat 128" "Honda Civic"
## [11] "Toyota Corolla" "Toyota Corona"
## [13] "Dodge Challenger" "AMC Javelin"
## [15] "Camaro Z28" "Pontiac Firebird"
## [17] "Fiat X1-9" "Porsche 914-2"
## [19] "Lotus Europa" "Ford Pantera L"
## [21] "Ferrari Dino" "Maserati Bora"
## [23] "Volvo 142E"
## [1] TRUE
7.14 dplyr::select_vars
suppressMessages(library(tidyverse))
# select_vars(names(mtcars),contains("p"))
names(mtcars) %>% str_subset("p")
## [1] "mpg" "disp" "hp"
Reference:
gather_ no longer supports dropping a column? · Issue #109 · tidyverse/tidyr
7.15 DataProfile
library(tidyverse)
mtcars[2,2] <- NA ## Add a null to test completeness
DataProfile <- mtcars %>%
summarise_all(funs("Total" = n(),
"Nulls" = sum(is.na(.)),
"Filled" = sum(!is.na(.)),
"Cardinality" = length(unique(.)))) %>%
reshape2::melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
## No id variables; using all as measure variables
7.16 column_to_rownames
Previously you had to call as.data.frame() first.
has_rownames(df)
remove_rownames(df)
rownames_to_column(df, var = "rowname")
rowid_to_column(df, var = "rowid")
column_to_rownames(df, var = "rowname")
7.18 split for batch display
7.19 The most standard way to write gather, which never errors
gather_(key_col = "key",value_col = "value", gather_cols = x_list)
7.20 Regex-renaming variables inside a pipe
df <- df %>%
rename_all(
funs(
stringr::str_to_lower(.) %>%
stringr::str_replace_all(., '\\.', '_')
)
)
Note that the stringr functions are chained with %>% (Marthe 2015).
7.21 Choosing between lag and diff
You may hit the error:
Error in mutate_impl(.data, dots) : Column daily_return must be length 488 (the group size) or one, not 487
The simplest fix is c(NA, diff(close)) (sasikala appukuttan 2016): diff returns one fewer element because the first difference cannot be computed, so prepend the NA yourself.
Alternatively use lag, which produces the NA directly (Lemm 2015) and never errors.
Both work; just mind their respective conditions.
7.22 rename in pipeline
iris2 <- data.frame(iris)
iris2 %>%
data.table::setnames(
old = "Petal.Length",
new = "petal_length") %>%
head
iris2 %>%
head
iris %>%
head
I recommend purrr::set_names (Rodrigues 2017), because data.table::setnames modifies in place, which is unsuitable for reproducible research.
In the example above, iris2 is a data.frame copied from iris (see 7.23), yet once data.table::setnames runs, the names of iris2 are permanently changed.
Alternatively, use
Note there is no space in the middle.
7.22.1 The effect of make.names
## [1] "a.and.b" "a.and.b.1"
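A minimal sketch that reproduces the output above; the input strings are assumptions:
make.names(c("a and b", "a and b"), unique = TRUE)
## [1] "a.and.b"   "a.and.b.1"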
7.23 Copying a data.frame
Plain <- can still propagate later in-place modifications; <- data.frame() makes a genuine copy of a data.frame (Stack Overflow 2017). A sketch follows.
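A minimal sketch of the copy idiom:
iris2 <- data.frame(iris)  # a genuine copy, not a reference to iris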
7.24 Using case_when with str_detect
library(tidyverse)
files_name <- list.files(file.path(getwd(),"wg"), full.names = T)
files_name_a <- files_name %>% str_subset("a_")
files_name_b <- files_name %>% str_subset("b_")
all_data_a <-
files_name_a %>%
as_tibble() %>%
mutate(table = map(files_name_a, read_csv)) %>%
.[-1] %>%
unnest()
all_data_b <-
files_name_b %>%
as_tibble() %>%
mutate(table = map(files_name_b, read_csv)) %>%
.[-1] %>%
unnest()
Avoid the setwd function; it muddles the reader's sense of the paths.
This is much discussed on RStudio Community; Bryan (2017) built the here package for exactly this problem, though I did not find it especially helpful, so I don't use it.
Just use file.path to construct paths.
files_name %>%
as_tibble() %>%
mutate(
table = map(value, read_csv),
tag =
case_when(
str_detect(value,"a_") ~ "a",
str_detect(value,"b_") ~ "b"
)
) %>%
unnest()
This simplifies the two-step version above using case_when and str_detect; split into separate tables afterwards with filter as needed.
(Wade 2018)
7.25 Set operations in dplyr
dplyr::setequal(x, y, ...) reports how two tables differ, for example
FALSE: Cols in y but not x: X1.
meaning table x lacks the X1 column that table y has.
Then run setdiff in both directions:
setdiff(x,y)
setdiff(y,x)
7.26 The cumany and cumall functions
- cumany() keeps a record as long as the condition held at some earlier point, so of the variants it retains the most records (Wickham et al. 2018).
- Conversely, cumall() keeps a record only while the condition has held at every earlier point, so it retains the fewest records.
A sketch follows.
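df <- tibble(x = c(1, 3, 2, 5, 1))
df %>% filter(cumany(x > 2))  # keeps rows from the first x > 2 onward: 3, 2, 5, 1
df %>% filter(cumall(x < 4))  # keeps leading rows while x < 4 holds: 1, 3, 2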
7.27 Window functions
See Wickham et al. (2018).
7.27.1 Ranking functions
7.27.1.1 dplyr
tibble(
x = c(1, 1, 2, 2, 2, NA)
) %>%
mutate(
row_number = row_number(x),
min_rank = min_rank(x),
dense_rank = dense_rank(x),
dense_rank_desc = dense_rank(desc(x))
)
7.27.1.2 Behavior in Impala
with tbl_1 as
(
select 1 as x union all
select 1 as x union all
select 2 as x union all
select null as x union all
select null as x union all
select null as x union all
select 2 as x union all
select 2 as x
)
select x,
row_number() over(order by x) as x_r_no
from tbl_1
with tbl_1 as
(
select 1 as x union all
select 1 as x union all
select 2 as x union all
select null as x union all
select null as x union all
select null as x union all
select 2 as x union all
select 2 as x
)
select x,
row_number() over(order by x desc) as x_r_no
from tbl_1
In Impala, NULL is treated as the largest value by default, matching the design of Python's pandas but differing from R's dplyr.
7.27.1.3 Pandas
See the two methods of the rank function in The pandas project (2017): first and min.
print pd.concat([z,pd.DataFrame({'x': [np.nan],'y': 7})]).x.rank(method='first', ascending=True)
print pd.concat([z,pd.DataFrame({'x': [np.nan],'y': 7})]).x.rank(method='first', ascending=False)
print pd.concat([z,pd.DataFrame({'x': [np.nan],'y': 7})]).x.rank(method='min')
0 1.0
1 2.0
2 3.0
0 NaN
Name: x, dtype: float64
0 3.0
1 2.0
2 1.0
0 NaN
Name: x, dtype: float64
0 1.0
1 2.0
2 3.0
0 NaN
Name: x, dtype: float64
The dplyr documentation of its ranking functions explains:
- row_number(): equivalent to rank(ties.method = "first")
- min_rank(): equivalent to rank(ties.method = "min")
7.27.1.4 Summary
pandas and dplyr neither rank nor score missing values, whereas Impala defaults to treating NULL as the largest.
7.27.1.5 Cumulative percentages
## [1] 1 1 2 2 2 NA
## [1] 40.00% 40.00% 100.00% 100.00% 100.00% NA
## [1] 0.00% 0.00% 50.00% 50.00% 50.00% NA
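A minimal sketch of the calls that likely produced the output above (cume_dist and percent_rank from dplyr, percent from formattable):
x <- c(1, 1, 2, 2, 2, NA)
x
formattable::percent(cume_dist(x))     # proportion of values <= x
formattable::percent(percent_rank(x))  # rank rescaled to [0, 1]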
7.27.1.5.2 ntile()
It resembles the ggplot2::cut_number function; a brief comparison follows.
diamonds %>%
select(carat) %>%
mutate(
grp_1 = cut_number(carat, 10),
grp_2 = ntile(carat, 10),
grp_3 = as.integer(grp_1)
)
A few advantages of ntile, then:
- ntile skips renaming the groups, i.e. the as.integer(grp_1) step;
- you avoid the ugly open/closed interval labels, and if you need the cut points, isn't the quantile function (inside summarise) the better tool anyway?
7.27.1.5.3 The floor function
In short: find the maximum of the fourth column, then build intervals from 0 up to that maximum with a width of 5000000, except that the final interval should end at the maximum itself.
floor together with the ggplot2::cut_* functions implements this; cut_width is used here.
- floor rounds down; floor(a * 5) / 5, for instance, snaps values down to multiples of 1/5.
- cut_width controls the interval width (Brandl 2017).
- as.character, str_remove_all, and separate turn the resulting intervals into text and extract the lower and upper bounds.
library(tidyverse)
# library(kable)
# library(kableExtra)
data_frame(
x = c(rep(44,10),rep(67,10),rep(101,10))
) %>%
mutate(x_cut = cut_width(floor(x),50)) %>%
mutate(x_cut = as.character(x_cut)) %>%
mutate(x_cut = str_remove_all(x_cut,"\\[|\\]|\\(|\\)")) %>%
separate(x_cut,into = c('left','right'),sep=',')
7.27.2 Lead and lag
## [1] 2 3 4 5 NA
## [1] NA 1 2 3 4
## [1] NA 1 1 1 1
## [1] 1 1 1 1
## [1] NA 1 1 1 1
x - lag(x) and c(NA, diff(x)) are the equivalent pair.
7.28 arrange with compound conditions
file.path(
getwd() %>% dirname(),
"fcontest",
"xiaosong_sent_files",
"cor_end.csv"
) %>%
read_csv() %>%
mutate_if(is.double, percent) %>%
gather(key,value,-colnames) %>%
arrange(desc(abs(value)))
desc(abs(value)) is the compact way to write it.
7.29 Custom discretization
data.table::data.table(
a = c(
rep(0,100),
rnorm(1000, mean = 10, sd = 10)
)
) %>%
# filter(a >= 0) %>%
# ggplot(aes(x = a)) +
# geom_freqpoly()
mutate(b = case_when(
a == 0 ~ "0",
a > 0 ~ as.character(ntile(a,4))
)) %>%
.$b %>%
table()
## .
## 0 1 2 3 4
## 100 13 275 275 275
7.30 Cleaning a JSON file into a data frame
library(tidyverse)
library(rjson)
json_file_path <- file.path(getwd(),"deployment","20180525175742.txt")
json_file_tbl <- readLines(json_file_path) %>% as.data.frame()
names(json_file_tbl) <- "raw"
json_file_path: the path to the JSON file, which can be a URL or a local path.
readLines and as.data.frame() convert the JSON document into data-frame form.
json_file_tbl %>%
mutate(raw = paste("[",raw,"]", sep="")) %>%
mutate(raw = map(.x = raw,.f = jsonlite::fromJSON)) %>%
unnest() %>%
gather(key = "userid") %>%
filter(!is.na(value)) %>%
mutate(raw = map(.x = value,.f = jsonlite::fromJSON)) %>%
select(-value) %>%
unnest() %>%
group_by(userid) %>%
mutate(No = 1:n()) %>%
write_csv(file.path(getwd(),"deployment","data","json_file_to_df.csv")) %>%
select(everything())
No: a record label within each userid, unique inside that userid. It is added because a user can have multiple records but no timestamp.
7.31 multiple quantile in summarise
p = c(0.25,0.5,0.75)
mtcars %>%
group_by(cyl) %>%
summarise(quantiles = list(p),
mpg = list(quantile(mpg, p))) %>%
unnest() %>%
spread(quantiles,mpg)
Stack Overflow (2015) solves this with the list-column idea.
7.32 Reading tables with malformed variable names
When read_* reads a file with malformed variable names, it errors, e.g.
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<c4><ea>'
In that case add skip = 1 to drop the name row and rename the variables yourself; see the sketch below.
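A minimal sketch; "data.csv" and the column names are assumptions:
# Skip the unreadable header row and supply clean names.
read_csv("data.csv", skip = 1, col_names = c("id", "value"))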
7.33 Absolute and relative paths
../ is the current parent directory.
../input is another folder, input, under the parent directory.
7.34 accounting in pipeline
"39990" %>% as.integer() %>% ./100 %>% accounting()
Error in .(.) : could not find function "."
Dervieux (2018) offers two fixes:
library(tidyverse)
library(formattable)
"39990" %>% as.integer() %>% magrittr::divide_by(100) %>% accounting()
## [1] 399.90
## [1] 399.90
7.35 Don't cross-post (Kephart 2018)
Otherwise, delete the redundant posts.
7.37 Standard field names joined with _
## [1] "spider_man_homecoming"
- make.names can leave runs of consecutive dots, which makes replacing . with _ via str_replace_all awkward;
- keep everything lowercase.
A sketch follows.
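A minimal sketch that yields the output above; the input title is an assumption:
"Spider Man: Homecoming" %>%
  str_to_lower() %>%
  str_replace_all("[^a-z0-9]+", "_")
## [1] "spider_man_homecoming"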
7.38 mutate_at with a compound function
mutate_at(vars(m0_deal:m5_deal), function(x){x /reg_cnt})
One fix is
reg_cnt <- fst_data$reg_cnt
that is, define reg_cnt in the global environment; the error occurs because the function cannot resolve reg_cnt.
mutate_at(vars(m0_deal:m5_deal), ~ . / data$reg_cnt)
7.39 Encoding problems with csv files
Copying into Excel is one workaround.
7.40 write_excel_csv avoids encoding problems
One advantage of write_excel_csv over write_csv: when the csv is opened in Excel, Chinese text is not mangled.
7.41 The mutate_impl problem
Error in mutate_impl(.data, dots) : Evaluation error: 0 (non-NA) cases.
The cause is missing values.
na.omit is the fix; the error can be reproduced with the all and any functions.
7.42 The rolling_origin function
rsample::rolling_origin does resampling for time series; for the code, see Rolling Origin Forecast Resampling.
Judging from the three examples below, my understanding is that the first time point of each Split advances incrementally rather than randomly.
set.seed(1131)
ex_data <- data.frame(row = 1:20, some_var = 1:20)
dim(rolling_origin(ex_data))
dim(rolling_origin(ex_data, skip = 2))
dim(rolling_origin(ex_data, skip = 2, cumulative = FALSE))
To see the underlying data, use the analysis function.
ex_data %>%
rolling_origin(initial = 5,assess = 1,cumulative = T,skip = 0) %>%
mutate(data = map(.x = splits, .f = analysis)) %>%
select(id,data) %>%
unnest()
initial = 5, assess = 1, cumulative = T, skip = 0 are all the defaults. cumulative = T makes each Slice* grow in sample size as * increases.
ex_data %>%
rolling_origin(initial = 5,assess = 1,cumulative = T,skip = 2) %>%
mutate(data = map(.x = splits, .f = analysis)) %>%
select(id,data) %>%
unnest()
initial = 5, assess = 1, cumulative = T, skip = 2: all defaults except skip = 2. skip = 2 together with cumulative = T makes each Slice* grow faster as * increases, so there are fewer slices.
ex_data %>%
rolling_origin(initial = 5,assess = 1,cumulative = F,skip = 2) %>%
mutate(data = map(.x = splits, .f = analysis)) %>%
select(id,data) %>%
unnest()
initial = 5, assess = 1, cumulative = F, skip = 2: here cumulative = F and skip = 2 are non-default. cumulative = F keeps the sample size of each Slice* constant as * increases, and skip = 2 spaces the samples 2 apart.
7.43 MASS::select masks dplyr::select
载入程辑包:'MASS'
The following object is masked from 'package:dplyr':
select
This is the warning printed after library(MASS).
7.44 cross join
The full_join function cannot perform a cross join.
The current bug appears when estimating bin values:
Error in if (details$repeats > 1) res <- paste(res, "repeated", details$repeats, : argument is of length zero
Use tidyr::crossing instead; a test follows.
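A minimal sketch of the cross join with tidyr::crossing (the data are made up):
crossing(x = 1:2, y = c("a", "b", "c"))  # all 2 x 3 = 6 combinations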
7.45 Prefer fread() over read_* for garbled Chinese
# read_lines(a) %>% head()
# read_file_raw(a) %>% head()
# read_csv(a) %>% head()
data.table::fread('a.txt',header=F) %>%
filter(V1 %in% str_subset(V1,'说'))
With Chinese text on Windows 7 in RStudio, the characters came out garbled.
I ruled out three settings: Open with Encoding, Resave with Encoding, and Default text encoding were all set to UTF-8.
- data.table::fread() reads the file correctly;
- the read_* functions do not.
7.46 dummyVars for one-hot encoding (Kaushik 2016)
library(caret)
# Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = train_processed, fullRank = T)
train_transformed <- data.frame(predict(dmy, newdata = train_processed))
7.47 A double rolling-sum solution
library(tidyverse)
full_join(
data_frame(x = 1:5,z = rep(1,5))
,data_frame(y = 1:5,z = rep(1,5))
,by='z'
) %>%
group_by(x) %>%
mutate(w = cumsum(z)) %>%
ungroup() %>%
group_by(y) %>%
mutate(w = cumsum(w)) %>%
select(-z) %>%
spread(y,w)
7.49 Dropping data.frame columns from a nested table
When using the dplyr verbs, some tables are nested and some columns are themselves in data.frame form; drop them with
select_if(~!is.data.frame(.))
7.50 Aggregated display with top_n()
Displaying top_n() results often calls for packing them into a single cell;
str_flatten(value, collapse = ';') and paste(collapse = ';') both solve it.
library(tidyverse)
data_frame(
x = rep(1,4)
,y= rep(c(1,2),2)
,z= 1:4
) %>%
group_by(x,y) %>%
top_n(n=2) %>%
summarise(paste(z,collapse = ',')
,str_flatten(z,collapse = ','))
7.51 Checking functions
data_frame(
x1 = c(NA,NaN,Inf,NULL),
x2 = c(NA,NaN,Inf,NULL),
x3 = c(NA,NaN,Inf,NULL)
) %>%
summarise_all(funs(
n1 = sum(is.na(.)),
n2 = sum(is.infinite(.)),
n3 = sum(is.nan(.)),
n4 = sum(is.null(.))
))
- NULL is often returned by expressions and functions whose value is undefined.
- NA is a logical constant of length 1 which contains a missing value indicator.
undefined != missing
7.52 read_csv
This is a bug in read_csv;
data.table::fread does not exhibit it when reading csv files.
7.54 Choosing between ifelse and if else
It tries to convert the data frame to a ‘vector’ (list in this case) of length 1. (Bolker 2010)
The ifelse function vectorizes its return value and cannot preserve the structure of a data frame, so prefer if else. An example follows.
library(tidyverse)
c <-
data_frame(
a = 1:2
,b = 3:4
)
d <-
data_frame(
e = 1:2
,f = 3:4
)
ifelse(is.data.frame(c),c,d)
## [[1]]
## [1] 1 2
7.55 Comparing the distributions of all variables
To compare each variable's distribution across two samples:
- for small samples, the gather and facet_* functions suffice;
- for large samples, the approach below generates the plots one variable at a time, so a single variable's error does not block the plots already exported, and the run can be interrupted midway.
library(data.table)
bank66_data <-
fread(file.path('data','180925_mod_data.csv'),encoding = 'UTF-8') %>%
mutate_if(is.character,as.factor)
bank66_data_allsample <-
fread(file.path('data','180925_mod_data_allsample.csv'),encoding = 'UTF-8') %>%
mutate_if(is.character,as.factor)
out_dist <-
function(x){
bind_rows(
bank66_data %>%
mutate(type='black')
,bank66_data_allsample %>%
mutate(type='all')
) %>%
mutate_if(is.character,as.factor) %>%
ggplot(aes(x=log(eval(parse(text = x))+1),y=..density..,col=type)) +
geom_density() +
labs(title = x) +
theme_bw()
ggsave(file.path('archive','dist',paste0(x,'.png')))
}
7.56 Conditional extraction and anti join
- snp_table is the universal set \(\mathrm{U}\).
- There is a particular set \(\mathrm{A}\) with \(R^2>0.8\) and \(\mathrm{SNP}=\min{[\mathrm{SNP_A},\mathrm{SNP_B}]}\).
- We then want the complement \(\mathrm{C_{u}A}\).
library(readxl)
snp_table <- read_excel("../tutoring/pansiyu/snp/snp_table.xlsx")
snp_r2 <- read_excel("../tutoring/pansiyu/snp/snp_r2.xlsx")
snp_r2 %>%
filter(R2>0.8) %>%
left_join(snp_table %>% transmute(SNP,P_A = P),by=c('SNP_A'='SNP')) %>%
# pull the P corresponding to SNP_A
left_join(snp_table %>% transmute(SNP,P_B = P),by=c('SNP_B'='SNP')) %>%
# pull the P corresponding to SNP_B
mutate(SNP_large = ifelse(P_A>=P_B,SNP_A,SNP_B)) %>%
# take SNP_A when P_A is larger, and vice versa
anti_join(snp_table,.,by=c('SNP'='SNP_large'))
# drop the particular set
7.57 Fixing functions that get masked
- detach and then library again;
- check in the Environment whether the package whose function you want sits earlier on the search path. See Tidyverse and RStudio Community.
7.58 tibble > data.table
I wrote an answer on Stack Overflow benchmarking the time cost of these two packages and found tibble faster.
The code follows.
profvis({
for (i in 1:1000){
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")
library(data.table)
library(tidyverse)
ref_tbl <-
data.table(
keywords = keywords
,categories = categories
)
as.data.table(new_labels) %>%
mutate(ref_key = str_extract(new_labels
# ,'keyword[:digit:]'
,(
keywords %>%
str_flatten('|')
# regular expression
)
)) %>%
left_join(
ref_tbl
,by=c('ref_key'='keywords')
)
}
})
profvis({
for (i in 1:1000){
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")
library(data.table)
library(tidyverse)
ref_tbl <-
tibble(
keywords = keywords
,categories = categories
)
as_tibble(new_labels) %>%
mutate(ref_key = str_extract(new_labels
# ,'keyword[:digit:]'
,(
keywords %>%
str_flatten('|')
# regular expression
)
)) %>%
left_join(
ref_tbl
,by=c('ref_key'='keywords')
)
}
})
profvis({
for (i in 1:1000){
keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
lookup_table <- data.frame(keywords, categories)
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")
library(data.table)
library(tidyverse)
ref_tbl <-
data.table(
keywords = keywords
,categories = categories
)
data.table(new_labels=new_labels) %>%
mutate(ref_key = str_extract(new_labels
# ,'keyword[:digit:]'
,(
keywords %>%
str_flatten('|')
# regular expression
)
)) %>%
left_join(
ref_tbl
,by=c('ref_key'='keywords')
)
}
})
A Citing this book
If you find this book useful for your blog, paper, or product, I would appreciate a citation. The format is:
李家翔. 2019. 使用 Tidyverse 完成函数化编程.
Or in bibtex format:
@book{Li2019book,
title = {使用 Tidyverse 完成函数化编程},
author = {李家翔},
note = {\url{https://jiaxiangbu.github.io/book2tidyverse/}},
year = {2019}
}
As Molnar (2019) says, I am curious how R, functional programming, and the tidyverse package are applied in academia and research. If you cite this book as a reference, I would be glad to learn about your needs and your work. This is entirely optional; it only satisfies my curiosity and helps me come up with interesting examples. My email is alex.lijiaxiang@foxmail.com; feel free to get in touch.
References
Aizkalns, Jason. 2018. “How to Use Dplyr to Generate a Frequency Table.” https://stackoverflow.com/questions/34860535/how-to-use-dplyr-to-generate-a-frequency-table.
Baert, Suzan. 2018a. “Data Wrangling Part 1: Basic to Advanced Ways to Select Columns.” https://suzanbaert.netlify.com/2018/01/dplyr-tutorial-1/.
———. 2018b. “Data Wrangling Part 2: Transforming Your Columns into the Right Shape.” https://suzan.rbind.io/2018/02/dplyr-tutorial-2/.
———. 2018c. “Data Wrangling Part 3: Basic and More Advanced Ways to Filter Rows.” https://suzan.rbind.io/2018/02/dplyr-tutorial-3/.
———. 2018d. “Data Wrangling Part 4: Summarizing and Slicing Your Data.” https://suzan.rbind.io/2018/04/dplyr-tutorial-4/.
Blagotić, Aleksandar. 2011. “How to Remove Outliers from a Dataset.” https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset.
Bolker, Ben. 2010. “Why Does Ifelse Convert a Data.frame to a List: Ifelse(TRUE, Data.frame(1), 0)) != Data.frame(1)?” https://stackoverflow.com/questions/6310733/why-does-ifelse-convert-a-data-frame-to-a-list-ifelsetrue-data-frame1-0.
Brandl, Holger. 2017. https://stackoverflow.com/questions/43627679/round-any-equivalent-for-dplyr.
Bryan, Jenny. 2017. “Project-Oriented Workflow; Setwd(), Rm(list = Ls()) and Computer Fires.” https://community.rstudio.com/t/project-oriented-workflow-setwd-rm-list-ls-and-computer-fires/3549.
Dervieux, Christophe. 2018. “Formattable Functions Doesn’t Work in Pipeline.” https://community.rstudio.com/t/formattable-functions-doesnt-work-in-pipeline/9404.
Elferts, Didzis. 2013. “Cut() Error - ’Breaks’ Are Not Unique.” https://stackoverflow.com/questions/16184947/cut-error-breaks-are-not-unique.
Kaushik, Saurav. 2016. “Practical Guide to Implement Machine Learning with Caret Package in R (with Practice Problem).” https://www.analyticsvidhya.com/blog/2016/12/practical-guide-to-implement-machine-learning-with-caret-package-in-r-with-practice-problem/.
Kephart, Curtis. 2018. “FAQ: Is It Ok If I Cross-Post?” https://community.rstudio.com/t/faq-is-it-ok-if-i-cross-post/5218.
Lemm, Alexander. 2015. https://stackoverflow.com/questions/28045910/diff-operation-within-a-group-after-a-dplyrgroup-by.
Li, Jiaxiang. 2018. “Add to Tidyverse (Free).” https://www.datacamp.com/courses/add-to-tidyverse-free.
Marthe, Guilherme. 2015. “R Dplyr: Rename Variables Using String Functions.” https://stackoverflow.com/questions/30382908/r-dplyr-rename-variables-using-string-functions.
McBain, Miles, and Christophe Dervieux. 2018. “Best Practices: How to Prepare Your Own Data for Use in a ‘Reprex‘ If You Can’t, or Don’t Know How to Reproduce a Problem with a Built-in Dataset?” https://community.rstudio.com/t/best-practices-how-to-prepare-your-own-data-for-use-in-a-reprex-if-you-can-t-or-don-t-know-how-to-reproduce-a-problem-with-a-built-in-dataset/5346.
Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.
Östblom, Joel, and Rene Niehus. 2018. “Reducing Traffic Mortality in the Usa.” https://www.datacamp.com/projects/464.
Rodrigues, Bruno. 2017. “Lesser Known Purrr Tricks.” http://www.brodrigues.co/blog/2017-03-24-lesser_known_purrr/.
Rudolph, Konrad. 2016. “How Does Dplyr’s Between Work?” https://stackoverflow.com/questions/39997225/how-does-dplyr-s-between-work/39998444#39998444.
sasikala appukuttan, arun kirshna. 2016. “Error When Using ‘Diff’ Function Inside of Dplyr Mutate.” https://stackoverflow.com/questions/35169423/error-when-using-diff-function-inside-of-dplyr-mutate.
Stack Overflow. 2015. “Using Dplyr Window Functions to Calculate Percentiles.” https://stackoverflow.com/questions/30488389/using-dplyr-window-functions-to-calculate-percentiles.
———. 2017. “How Do I Create a Copy of a Data Frame in R.” https://stackoverflow.com/a/45417109/8625228.
———. 2018. “How to Call the Output of a Function in Another Function?” https://stackoverflow.com/questions/50751566/how-to-call-the-output-of-a-function-in-another-function.
The pandas project. 2017. “Top N Rows Per Group.” http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#top-n-rows-per-group.
Thomas, Gregor. 2017. “Obtain Count of Unique Combination of Columns in R Dataframe Without Eliminating the Duplicate Columns from the Data.” https://stackoverflow.com/a/45311312/8625228.
Wade, Martin. 2018. https://community.rstudio.com/t/help-me-write-this-script-for-case-when-inside-dplyr-mutate-and-ill-acknowledge-by-name-you-in-my-article/6564.
Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2018. “Window Functions.” https://cran.rstudio.com/web/packages/dplyr/vignettes/window-functions.html.
Win., J. 2011. “How to Remove Outliers from a Dataset.” https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset.
周运来. 2018. “R语言入门手册: 循环 for, While.” https://mp.weixin.qq.com/s/d686t311CRyDdRbYITqBEA.
张丹. 2018. “R语言高效的管道操作magrittr.” https://mp.weixin.qq.com/s/wD2tK04qGJ8Qvb-sD0Nhcg.
王垠. 2015. “编程的智慧.” Blog. http://www.yinwang.org/blog-cn/2015/11/21/programming-philosophy.
谢益辉. 2017. “幸存者偏差.” https://yihui.name/cn/2017/04/survivorship-bias.