data.table包提供了一个data.frame的高级版本,是一个超高性能处理包,其在处理上代码简介,功能强大,并且效率极高,强烈建议大家使用此包。官方链接:https://github.com/Rdatatable/data.table/wiki
data.table的语法主要是基于[]
,其一般格式是:DT[i, j, by]。
解释:“Take DT, subset rows using i, then calculate j grouped by by.”
注:以下称data.table为数据表。
此包被收录在CRAN里面,所以下载它特别方便,只需要使用install.packages()
即可。
# install.packages("data.table")
library(data.table)
构建一个数据表可以用data.table()
直接构建,也可以将数据框使用data.table()
去转换为数据表:
set.seed(45L)
DT <- data.table(V1 = c(1L, 2L),
V2 = LETTERS[1:3],
V3 = round(rnorm(4), 4),
V4 = 1:12)
DT
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 B -0.7033 2
## 3: 1 C -0.3795 3
## 4: 2 A -0.7460 4
## 5: 1 B 0.3408 5
## 6: 2 C -0.7033 6
## 7: 1 A -0.3795 7
## 8: 2 B -0.7460 8
## 9: 1 C 0.3408 9
## 10: 2 A -0.7033 10
## 11: 1 B -0.3795 11
## 12: 2 C -0.7460 12
set.seed(45L)
DF <- data.frame(V1 = c(1L, 2L),
V2 = LETTERS[1:3],
V3 = round(rnorm(4), 4),
V4 = 1:12)
DF_T <- data.table(DF); DF_T
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 B -0.7033 2
## 3: 1 C -0.3795 3
## 4: 2 A -0.7460 4
## 5: 1 B 0.3408 5
## 6: 2 C -0.7033 6
## 7: 1 A -0.3795 7
## 8: 2 B -0.7460 8
## 9: 1 C 0.3408 9
## 10: 2 A -0.7033 10
## 11: 1 B -0.3795 11
## 12: 2 C -0.7460 12
使用DT[i, j, by]中的i进行子集的选取。
# DT[3:5]
DT[3:5, ]
## V1 V2 V3 V4
## 1: 1 C -0.3795 3
## 2: 2 A -0.7460 4
## 3: 1 B 0.3408 5
# DT[V2 == "A"]
DT[V2 == "A", ]
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
# DT[V2 %in% c("A", "C")]
DT[V2 %in% c("A", "C"), ]
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 1 C -0.3795 3
## 3: 2 A -0.7460 4
## 4: 2 C -0.7033 6
## 5: 1 A -0.3795 7
## 6: 1 C 0.3408 9
## 7: 2 A -0.7033 10
## 8: 2 C -0.7460 12
对列进行操作的时候,需要注意返回的类型(向量还是数据表)。
如果想返回向量,可以使用DT[, j]
、DT[[idx]]
或DT[["j"]]
。
如果想返回数据表,可以使用DT[, "j"]
、DT[, .(j)]
或DT[, idx]
。
注:上述针对的是某一列,并且均推荐使用第一种操作方式。
对多列进行操作时,必不可少的就是.()
符号,其相当于list()
DT[, V2]
## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
DT[[2]]
## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
DT[["V2"]]
## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
DT[, !c("V1", "V3", "V4")]
## V2
## 1: A
## 2: B
## 3: C
## 4: A
## 5: B
## 6: C
## 7: A
## 8: B
## 9: C
## 10: A
## 11: B
## 12: C
DT[, names(DT) %like% "2", with = F]
## V2
## 1: A
## 2: B
## 3: C
## 4: A
## 5: B
## 6: C
## 7: A
## 8: B
## 9: C
## 10: A
## 11: B
## 12: C
DT[, sum(V1)]
## [1] 18
DT[, "V2"]
## V2
## 1: A
## 2: B
## 3: C
## 4: A
## 5: B
## 6: C
## 7: A
## 8: B
## 9: C
## 10: A
## 11: B
## 12: C
DT[, 2]
## V2
## 1: A
## 2: B
## 3: C
## 4: A
## 5: B
## 6: C
## 7: A
## 8: B
## 9: C
## 10: A
## 11: B
## 12: C
DT[, .(V2, V3)]
## V2 V3
## 1: A 0.3408
## 2: B -0.7033
## 3: C -0.3795
## 4: A -0.7460
## 5: B 0.3408
## 6: C -0.7033
## 7: A -0.3795
## 8: B -0.7460
## 9: C 0.3408
## 10: A -0.7033
## 11: B -0.3795
## 12: C -0.7460
DT[, list(V2, V3)]
## V2 V3
## 1: A 0.3408
## 2: B -0.7033
## 3: C -0.3795
## 4: A -0.7460
## 5: B 0.3408
## 6: C -0.7033
## 7: A -0.3795
## 8: B -0.7460
## 9: C 0.3408
## 10: A -0.7033
## 11: B -0.3795
## 12: C -0.7460
DT[, .(sum(V1), sd(V3))]
## V1 V2
## 1: 18 0.4546055
DT[, .(Aggregate = sum(V1), Sd.V3 = sd(V3))]
## Aggregate Sd.V3
## 1: 18 0.4546055
DT[, .(V1, Sd.V3 = sd(V3))]
## V1 Sd.V3
## 1: 1 0.4546055
## 2: 2 0.4546055
## 3: 1 0.4546055
## 4: 2 0.4546055
## 5: 1 0.4546055
## 6: 2 0.4546055
## 7: 1 0.4546055
## 8: 2 0.4546055
## 9: 1 0.4546055
## 10: 2 0.4546055
## 11: 1 0.4546055
## 12: 2 0.4546055
DT[, .(print(V2), plot(V3))]
## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
## V1
## 1: A
## 2: B
## 3: C
## 4: A
## 5: B
## 6: C
## 7: A
## 8: B
## 9: C
## 10: A
## 11: B
## 12: C
使用DT[i, j, by]中的by对j进行操作。
DT[, .(V4.Sum = sum(V4)), by = V1]
## V1 V4.Sum
## 1: 1 36
## 2: 2 42
DT[, .(V4.Sum = sum(V4)), by = .(V1, V2)]
## V1 V2 V4.Sum
## 1: 1 A 8
## 2: 2 B 10
## 3: 1 C 12
## 4: 2 A 14
## 5: 1 B 16
## 6: 2 C 18
DT[, .(V4.Sum = sum(V4)), by = sign(V1-1)] #列名会被sign填补
## sign V4.Sum
## 1: 0 36
## 2: 1 42
DT[, .(V4.Sum = sum(V4)), by = .(V1.01 = sign(V1-1))] #自定义列名
## V1.01 V4.Sum
## 1: 0 36
## 2: 1 42
DT[, .N, by = V1] #统计V1列特征值的个数
## V1 N
## 1: 1 6
## 2: 2 6
setorder()
可以按照一个变量或者多个变量排序,默认的排列方式是升序,如果想要降序,只需要在变量名前面加一个-
即可。
setorder(DT, V3)
setorder(DT, V3, -V4)
使用:=
可以对数据的列进行重塑:Updat、Add、Delete。
注:额外的指定DT <- DT[…]是多余的,以下操作会直接在DT上进行修改。并且后面使用[]
会代替DT
使修改后的数据直接返回。
DT[, V1 := round(exp(V1), 2)] []
## V1 V2 V3 V4
## 1: 7.39 C -0.7460 12
## 2: 7.39 B -0.7460 8
## 3: 7.39 A -0.7460 4
## 4: 7.39 A -0.7033 10
## 5: 7.39 C -0.7033 6
## 6: 7.39 B -0.7033 2
## 7: 2.72 B -0.3795 11
## 8: 2.72 A -0.3795 7
## 9: 2.72 C -0.3795 3
## 10: 2.72 C 0.3408 9
## 11: 2.72 B 0.3408 5
## 12: 2.72 A 0.3408 1
DT[, c("V1", "V2") := .(round(exp(V1), 2), LETTERS[4:6])] []
## V1 V2 V3 V4
## 1: 1619.71 D -0.7460 12
## 2: 1619.71 E -0.7460 8
## 3: 1619.71 F -0.7460 4
## 4: 1619.71 D -0.7033 10
## 5: 1619.71 E -0.7033 6
## 6: 1619.71 F -0.7033 2
## 7: 15.18 D -0.3795 11
## 8: 15.18 E -0.3795 7
## 9: 15.18 F -0.3795 3
## 10: 15.18 D 0.3408 9
## 11: 15.18 E 0.3408 5
## 12: 15.18 F 0.3408 1
DT[, V5 := letters[4:6]] []
## V1 V2 V3 V4 V5
## 1: 1619.71 D -0.7460 12 d
## 2: 1619.71 E -0.7460 8 e
## 3: 1619.71 F -0.7460 4 f
## 4: 1619.71 D -0.7033 10 d
## 5: 1619.71 E -0.7033 6 e
## 6: 1619.71 F -0.7033 2 f
## 7: 15.18 D -0.3795 11 d
## 8: 15.18 E -0.3795 7 e
## 9: 15.18 F -0.3795 3 f
## 10: 15.18 D 0.3408 9 d
## 11: 15.18 E 0.3408 5 e
## 12: 15.18 F 0.3408 1 f
DT[, c("V6", "V7") := .(sum(V1), mean(V4))] []
## V1 V2 V3 V4 V5 V6 V7
## 1: 1619.71 D -0.7460 12 d 9809.34 6.5
## 2: 1619.71 E -0.7460 8 e 9809.34 6.5
## 3: 1619.71 F -0.7460 4 f 9809.34 6.5
## 4: 1619.71 D -0.7033 10 d 9809.34 6.5
## 5: 1619.71 E -0.7033 6 e 9809.34 6.5
## 6: 1619.71 F -0.7033 2 f 9809.34 6.5
## 7: 15.18 D -0.3795 11 d 9809.34 6.5
## 8: 15.18 E -0.3795 7 e 9809.34 6.5
## 9: 15.18 F -0.3795 3 f 9809.34 6.5
## 10: 15.18 D 0.3408 9 d 9809.34 6.5
## 11: 15.18 E 0.3408 5 e 9809.34 6.5
## 12: 15.18 F 0.3408 1 f 9809.34 6.5
DT[, V7:= NULL] []
## V1 V2 V3 V4 V5 V6
## 1: 1619.71 D -0.7460 12 d 9809.34
## 2: 1619.71 E -0.7460 8 e 9809.34
## 3: 1619.71 F -0.7460 4 f 9809.34
## 4: 1619.71 D -0.7033 10 d 9809.34
## 5: 1619.71 E -0.7033 6 e 9809.34
## 6: 1619.71 F -0.7033 2 f 9809.34
## 7: 15.18 D -0.3795 11 d 9809.34
## 8: 15.18 E -0.3795 7 e 9809.34
## 9: 15.18 F -0.3795 3 f 9809.34
## 10: 15.18 D 0.3408 9 d 9809.34
## 11: 15.18 E 0.3408 5 e 9809.34
## 12: 15.18 F 0.3408 1 f 9809.34
DT[, c("V3", "V4") := NULL] []
## V1 V2 V5 V6
## 1: 1619.71 D d 9809.34
## 2: 1619.71 E e 9809.34
## 3: 1619.71 F f 9809.34
## 4: 1619.71 D d 9809.34
## 5: 1619.71 E e 9809.34
## 6: 1619.71 F f 9809.34
## 7: 15.18 D d 9809.34
## 8: 15.18 E e 9809.34
## 9: 15.18 F f 9809.34
## 10: 15.18 D d 9809.34
## 11: 15.18 E e 9809.34
## 12: 15.18 F f 9809.34
Col.Chosen = c("V5", "V6")
DT[, Col.Chosen := NULL] [] #“Col.Chosen”在这里是一个变量名称,而不是“V5”&“V6”
## V1 V2 V5 V6
## 1: 1619.71 D d 9809.34
## 2: 1619.71 E e 9809.34
## 3: 1619.71 F f 9809.34
## 4: 1619.71 D d 9809.34
## 5: 1619.71 E e 9809.34
## 6: 1619.71 F f 9809.34
## 7: 15.18 D d 9809.34
## 8: 15.18 E e 9809.34
## 9: 15.18 F f 9809.34
## 10: 15.18 D d 9809.34
## 11: 15.18 E e 9809.34
## 12: 15.18 F f 9809.34
DT[, (Col.Chosen) := NULL] [] #“Col.Chosen”在这里代表“V5”&“V6”
## V1 V2
## 1: 1619.71 D
## 2: 1619.71 E
## 3: 1619.71 F
## 4: 1619.71 D
## 5: 1619.71 E
## 6: 1619.71 F
## 7: 15.18 D
## 8: 15.18 E
## 9: 15.18 F
## 10: 15.18 D
## 11: 15.18 E
## 12: 15.18 F
数据表和数据框不同,其没有行索引,所以有时候得使用setkey
设置一下行索引,并且可以是单索引和多索引。
注:使用索引进行操作效率比较高。
set.seed(45L)
DT <- data.table(V1 = c(1L, 2L),
V2 = LETTERS[1:3],
V3 = round(rnorm(4), 4),
V4 = 1:12)
DT
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 B -0.7033 2
## 3: 1 C -0.3795 3
## 4: 2 A -0.7460 4
## 5: 1 B 0.3408 5
## 6: 2 C -0.7033 6
## 7: 1 A -0.3795 7
## 8: 2 B -0.7460 8
## 9: 1 C 0.3408 9
## 10: 2 A -0.7033 10
## 11: 1 B -0.3795 11
## 12: 2 C -0.7460 12
setkey(DT, V2)
DT["A"]
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
DT[c("A", "C")]
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
## 5: 1 C -0.3795 3
## 6: 2 C -0.7033 6
## 7: 1 C 0.3408 9
## 8: 2 C -0.7460 12
DT["A", mult = "first"] #挑选出索引是“A”的第一条观测
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
DT["A", mult = "last"] #挑选出索引是“A”的最后一条观测
## V1 V2 V3 V4
## 1: 2 A -0.7033 10
DT[c("A", "G")] #不存在的索引将会用NA填补
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
## 5: NA G NA NA
DT[c("A", "G"), nomatch = 0] #不存在的索引不返回
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
DT[c("A", "C"), sum(V1)] #对索引是“D”或“F”的V1求和
## [1] 12
DT[c("A", "C"), sum(V1), by = .EACHI] #分别对索引是“D”或“F”的V1求和
## V2 V1
## 1: A 6
## 2: C 6
setkey(DT, V1, V2)
DT[.(2, "C")]
## V1 V2 V3 V4
## 1: 2 C -0.7033 6
## 2: 2 C -0.7460 12
DT[.(2, c("A", "C"))]
## V1 V2 V3 V4
## 1: 2 A -0.7460 4
## 2: 2 A -0.7033 10
## 3: 2 C -0.7033 6
## 4: 2 C -0.7460 12
·.N
:代表行的数量,用by参数分组时则是每一组的行数量
·.SD
:代表整个数据表,用by参数分组时则是每一组的数据表
·.SDcols
:指定.SD
代表的数据表包括哪些列
·I
:
DT[.N] #返回数据表的最后一行
## V1 V2 V3 V4
## 1: 2 C -0.746 12
DT[.N-1] #返回数据表的倒数第二行
## V1 V2 V3 V4
## 1: 2 C -0.7033 6
DT[, .N] #返回数据表的行数
## [1] 12
# 按照“V2”进行拆分整个数据表,依次返回每个数据表
DT[, print(.SD), by = V2]
## V1 V3 V4
## 1: 1 0.3408 1
## 2: 1 -0.3795 7
## 3: 2 -0.7460 4
## 4: 2 -0.7033 10
## V1 V3 V4
## 1: 1 0.3408 5
## 2: 1 -0.3795 11
## 3: 2 -0.7033 2
## 4: 2 -0.7460 8
## V1 V3 V4
## 1: 1 -0.3795 3
## 2: 1 0.3408 9
## 3: 2 -0.7033 6
## 4: 2 -0.7460 12
## Empty data.table (0 rows) of 1 col: V2
# 按照“V2”进行拆分整个数据表,依次返回每个数据表的第一个和最后一个观测
DT[, .SD[c(1, .N)], by = V2]
## V2 V1 V3 V4
## 1: A 1 0.3408 1
## 2: A 2 -0.7033 10
## 3: B 1 0.3408 5
## 4: B 2 -0.7460 8
## 5: C 1 -0.3795 3
## 6: C 2 -0.7460 12
# 按照“V2”进行拆分整个数据表,依次对每个数据表进行列求和
DT[, lapply(.SD, sum), by = V2]
## V2 V1 V3 V4
## 1: A 6 -1.488 22
## 2: B 6 -1.488 26
## 3: C 6 -1.488 30
# 按照“V2”进行拆分整个数据表,依次对每个数据表的“V3”和“V4”列进行列求和
DT[, lapply(.SD, sum), by = V2, .SDcols = c("V3", "V4")]
## V2 V3 V4
## 1: A -1.488 22
## 2: B -1.488 26
## 3: C -1.488 30
DT[, lapply(.SD, sum), by = V2, .SDcols = paste0("V", 3:4)]
## V2 V3 V4
## 1: A -1.488 22
## 2: B -1.488 26
## 3: C -1.488 30
DT[between(V4, 6, 10)]
## V1 V2 V3 V4
## 1: 1 A -0.3795 7
## 2: 1 C 0.3408 9
## 3: 2 A -0.7033 10
## 4: 2 B -0.7460 8
## 5: 2 C -0.7033 6
data.table还可以进行串联操作,与dplyr中的%>%
有点相似。主要目的就是把多个表达式串联起来,从而避免多于的中间变量。
DT[, .(V4.Sum = sum(V4)), by = V1] [V4.Sum > 40]
## V1 V4.Sum
## 1: 2 42
DT[, .(V4.Sum = sum(V4)), by = V1] [order(-V1)]
## V1 V4.Sum
## 1: 2 42
## 2: 1 36
Syntax: for (i in from:to) set(DT, row, column, new value)
rows <- list(3:4, 5:6)
cols <- 1:2
for (i in seq_along(rows)) {
set(DT,
i = rows[[i]],
j = cols[i],
value = NA)
}
setnames(DT, "V2", "Rating")
setnames(DT, c("V3", "V4"), c("V3.DC", "V4.DC")); DT
## V1 Rating V3.DC V4.DC
## 1: 1 A 0.3408 1
## 2: 1 A -0.3795 7
## 3: NA B 0.3408 5
## 4: NA B -0.3795 11
## 5: 1 NA -0.3795 3
## 6: 1 NA 0.3408 9
## 7: 2 A -0.7460 4
## 8: 2 A -0.7033 10
## 9: 2 B -0.7033 2
## 10: 2 B -0.7460 8
## 11: 2 C -0.7033 6
## 12: 2 C -0.7460 12
setcolorder(DT, c("V3.DC", "V1", "Rating", "V4.DC")); DT
## V3.DC V1 Rating V4.DC
## 1: 0.3408 1 A 1
## 2: -0.3795 1 A 7
## 3: 0.3408 NA B 5
## 4: -0.3795 NA B 11
## 5: -0.3795 1 NA 3
## 6: 0.3408 1 NA 9
## 7: -0.7460 2 A 4
## 8: -0.7033 2 A 10
## 9: -0.7033 2 B 2
## 10: -0.7460 2 B 8
## 11: -0.7033 2 C 6
## 12: -0.7460 2 C 12
data.table包改写了reshape2中的数据融合重铸的功能,使其更加方便和强大。
set.seed(45L)
DT <- data.table(
i_1 = c(1:5, NA),
i_2 = c(NA,6,7,8,9,10),
f_1 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
f_2 = factor(c("z", "a", "x", "c", "x", "x"), ordered=TRUE),
c_1 = sample(c(letters[1:3], NA), 6, TRUE),
d_1 = as.Date(c(1:3,NA,4:5), origin="2013-09-01"),
d_2 = as.Date(6:1, origin="2012-01-01"))
DT[, l_1 := DT[, list(c=list(rep(i_1, sample(5,1)))), by = i_1]$c]
DT[, l_2 := DT[, list(c=list(rep(c_1, sample(5,1)))), by = i_1]$c]
DT
## i_1 i_2 f_1 f_2 c_1 d_1 d_2 l_1 l_2
## 1: 1 NA c z a 2013-09-02 2012-01-07 1,1 a,a,a
## 2: 2 6 b a c 2013-09-03 2012-01-06 2,2,2 c,c
## 3: 3 7 a x a 2013-09-04 2012-01-05 3,3,3 a
## 4: 4 8 b c a <NA> 2012-01-04 4,4,4 a,a
## 5: 5 9 b x b 2013-09-05 2012-01-03 5,5,5,5,5 b,b,b
## 6: NA 10 b x NA 2013-09-06 2012-01-02 NA NA,NA,NA,NA
## 如果对不同类型的数据整合在同一列,则会出现警告(会强制转换数据类型)
# melt(DT, id.vars = c("i_1", "i_2")) #同下
melt(DT, id.vars = 1:2)
## i_1 i_2 variable value
## 1: 1 NA f_1 3
## 2: 2 6 f_1 2
## 3: 3 7 f_1 1
## 4: 4 8 f_1 2
## 5: 5 9 f_1 2
## 6: NA 10 f_1 2
## 7: 1 NA f_2 4
## 8: 2 6 f_2 1
## 9: 3 7 f_2 3
## 10: 4 8 f_2 2
## 11: 5 9 f_2 3
## 12: NA 10 f_2 3
## 13: 1 NA c_1 a
## 14: 2 6 c_1 c
## 15: 3 7 c_1 a
## 16: 4 8 c_1 a
## 17: 5 9 c_1 b
## 18: NA 10 c_1 NA
## 19: 1 NA d_1 15950
## 20: 2 6 d_1 15951
## 21: 3 7 d_1 15952
## 22: 4 8 d_1 NA
## 23: 5 9 d_1 15953
## 24: NA 10 d_1 15954
## 25: 1 NA d_2 15346
## 26: 2 6 d_2 15345
## 27: 3 7 d_2 15344
## 28: 4 8 d_2 15343
## 29: 5 9 d_2 15342
## 30: NA 10 d_2 15341
## 31: 1 NA l_1 1,1
## 32: 2 6 l_1 2,2,2
## 33: 3 7 l_1 3,3,3
## 34: 4 8 l_1 4,4,4
## 35: 5 9 l_1 5,5,5,5,5
## 36: NA 10 l_1 NA
## 37: 1 NA l_2 a,a,a
## 38: 2 6 l_2 c,c
## 39: 3 7 l_2 a
## 40: 4 8 l_2 a,a
## 41: 5 9 l_2 b,b,b
## 42: NA 10 l_2 NA,NA,NA,NA
## i_1 i_2 variable value
## 对某一列的数值进行整合
# melt(DT, id.vars = 1:2, measure.vars = "f_1") #同下
melt(DT, id.vars = 1:2, measure.vars = 3)
## i_1 i_2 variable value
## 1: 1 NA f_1 c
## 2: 2 6 f_1 b
## 3: 3 7 f_1 a
## 4: 4 8 f_1 b
## 5: 5 9 f_1 b
## 6: NA 10 f_1 b
## 对相同数据类型的多列进行整合到一列
# melt(DT, id.vars = 1:2, measure.vars = c("d_1", "d_2")) # 同下
melt(DT, id.vars = 1:2, measure.vars = 6:7)
## i_1 i_2 variable value
## 1: 1 NA d_1 2013-09-02
## 2: 2 6 d_1 2013-09-03
## 3: 3 7 d_1 2013-09-04
## 4: 4 8 d_1 <NA>
## 5: 5 9 d_1 2013-09-05
## 6: NA 10 d_1 2013-09-06
## 7: 1 NA d_2 2012-01-07
## 8: 2 6 d_2 2012-01-06
## 9: 3 7 d_2 2012-01-05
## 10: 4 8 d_2 2012-01-04
## 11: 5 9 d_2 2012-01-03
## 12: NA 10 d_2 2012-01-02
## 对不同类型的多列进行整合的时候,使用measure.vars参数时,默认把相同类型的数值整合到一列,即几种数据类型的数据则被整合为几列
melt(DT, id.vars = 1:2, measure.vars = list(3:4, c("d_1", "d_2")))
## i_1 i_2 variable value1 value2
## 1: 1 NA 1 c 2013-09-02
## 2: 2 6 1 b 2013-09-03
## 3: 3 7 1 a 2013-09-04
## 4: 4 8 1 b <NA>
## 5: 5 9 1 b 2013-09-05
## 6: NA 10 1 b 2013-09-06
## 7: 1 NA 2 z 2012-01-07
## 8: 2 6 2 a 2012-01-06
## 9: 3 7 2 x 2012-01-05
## 10: 4 8 2 c 2012-01-04
## 11: 5 9 2 x 2012-01-03
## 12: NA 10 2 x 2012-01-02
## 模糊匹配整合
melt(DT, id.vars = 1:2, measure.vars = patterns("^f_", "^d_"), value.factor = T)
## i_1 i_2 variable value1 value2
## 1: 1 NA 1 c 2013-09-02
## 2: 2 6 1 b 2013-09-03
## 3: 3 7 1 a 2013-09-04
## 4: 4 8 1 b <NA>
## 5: 5 9 1 b 2013-09-05
## 6: NA 10 1 b 2013-09-06
## 7: 1 NA 2 z 2012-01-07
## 8: 2 6 2 a 2012-01-06
## 9: 3 7 2 x 2012-01-05
## 10: 4 8 2 c 2012-01-04
## 11: 5 9 2 x 2012-01-03
## 12: NA 10 2 x 2012-01-02
melt(DT, id.vars = 1:2, measure.vars = patterns("l_", "c_"), na.rm = T)
## i_1 i_2 variable value1 value2
## 1: 1 NA 1 1,1 a
## 2: 2 6 1 2,2,2 c
## 3: 3 7 1 3,3,3 a
## 4: 4 8 1 4,4,4 a
## 5: 5 9 1 5,5,5,5,5 b
## 6: NA 10 1 NA NA
## 7: 1 NA 2 a,a,a NA
## 8: 2 6 2 c,c NA
## 9: 3 7 2 a NA
## 10: 4 8 2 a,a NA
## 11: 5 9 2 b,b,b NA
## 12: NA 10 2 NA,NA,NA,NA NA
## 整合之后的新列会生成默认的列名,我们也可以自定义列名
melt(DT, id = 3:4, measure = c("d_1", "d_2"), variable.name = "newvar", value.name = "newval")
## f_1 f_2 newvar newval
## 1: c z d_1 2013-09-02
## 2: b a d_1 2013-09-03
## 3: a x d_1 2013-09-04
## 4: b c d_1 <NA>
## 5: b x d_1 2013-09-05
## 6: b x d_1 2013-09-06
## 7: c z d_2 2012-01-07
## 8: b a d_2 2012-01-06
## 9: a x d_2 2012-01-05
## 10: b c d_2 2012-01-04
## 11: b x d_2 2012-01-03
## 12: b x d_2 2012-01-02
## 不同类型的数据整合之后也可以自定义列名
melt(DT, id.vars = 1:2, measure.vars = patterns("l_", "c_"), variable.name = "newvar", value.name = "newval")
## i_1 i_2 newvar newval1 newval2
## 1: 1 NA 1 1,1 a
## 2: 2 6 1 2,2,2 c
## 3: 3 7 1 3,3,3 a
## 4: 4 8 1 4,4,4 a
## 5: 5 9 1 5,5,5,5,5 b
## 6: NA 10 1 NA NA
## 7: 1 NA 2 a,a,a NA
## 8: 2 6 2 c,c NA
## 9: 3 7 2 a NA
## 10: 4 8 2 a,a NA
## 11: 5 9 2 b,b,b NA
## 12: NA 10 2 NA,NA,NA,NA NA
melt(DT, id.vars = 1:2, measure.vars = patterns("l_", "c_"), variable.name = "newvar", value.name = c("nv1", "nv2"))
## i_1 i_2 newvar nv1 nv2
## 1: 1 NA 1 1,1 a
## 2: 2 6 1 2,2,2 c
## 3: 3 7 1 3,3,3 a
## 4: 4 8 1 4,4,4 a
## 5: 5 9 1 5,5,5,5,5 b
## 6: NA 10 1 NA NA
## 7: 1 NA 2 a,a,a NA
## 8: 2 6 2 c,c NA
## 9: 3 7 2 a NA
## 10: 4 8 2 a,a NA
## 11: 5 9 2 b,b,b NA
## 12: NA 10 2 NA,NA,NA,NA NA
set.seed(45L)
DT <- data.table(V1 = rep(1:2, each = 4),
V2 = 1:4,
V3 = letters[1:2],
V4 = rnorm(8))
DT
## V1 V2 V3 V4
## 1: 1 1 a 0.3407997
## 2: 1 2 b -0.7033403
## 3: 1 3 a -0.3795377
## 4: 1 4 b -0.7460474
## 5: 2 1 a -0.8981073
## 6: 2 2 b -0.3347941
## 7: 2 3 a -0.5013782
## 8: 2 4 b -0.1745357
## 指定行列索引和填充值
dcast(DT, V1~V2, value.var = "V3")
## V1 1 2 3 4
## 1: 1 a b a b
## 2: 2 a b a b
## 指定行列索引,不指定填充值的时候,会自动选择数值型数据作为填充值
# dcast(DT, V1~V2) #同下
dcast(DT, V1~V2, value.var = "V4")
## V1 1 2 3 4
## 1: 1 0.3407997 -0.7033403 -0.3795377 -0.7460474
## 2: 2 -0.8981073 -0.3347941 -0.5013782 -0.1745357
## 指定行列索引和填充值
# dcast(DT, V1~V3, mean, value.var = "V4") #同下
dcast(DT, V1~V3, mean)
## V1 a b
## 1: 1 -0.01936902 -0.7246939
## 2: 2 -0.69974274 -0.2546649
## 对多列进行求平均
dcast(DT, V1~V3, mean, value.var = c("V2", "V4"))
## V1 V2_a V2_b V4_a V4_b
## 1: 1 2 3 -0.01936902 -0.7246939
## 2: 2 2 3 -0.69974274 -0.2546649
## 对多列进行求平均、求和
dcast(DT, V1~V3, list(mean, sum), value.var = c("V2", "V4"))
## V1 V2_mean_a V2_mean_b V4_mean_a V4_mean_b V2_sum_a V2_sum_b
## 1: 1 2 3 -0.01936902 -0.7246939 4 6
## 2: 2 2 3 -0.69974274 -0.2546649 4 6
## V4_sum_a V4_sum_b
## 1: -0.03873804 -1.4493877
## 2: -1.39948548 -0.5093298
set.seed(45L)
DT <- data.table(V1 = rep(1:2, each = 6),
V2 = rep(rep(1:3, 2), each = 2),
V3 = factor(rep(c(1,3),6),levels=1:3),
V4 = rnorm(6))
DT
## V1 V2 V3 V4
## 1: 1 1 1 0.3407997
## 2: 1 1 3 -0.7033403
## 3: 1 2 1 -0.3795377
## 4: 1 2 3 -0.7460474
## 5: 1 3 1 -0.8981073
## 6: 1 3 3 -0.3347941
## 7: 2 1 1 0.3407997
## 8: 2 1 3 -0.7033403
## 9: 2 2 1 -0.3795377
## 10: 2 2 3 -0.7460474
## 11: 2 3 1 -0.8981073
## 12: 2 3 3 -0.3347941
# V3因子型,因子水平2没有出现过,在融合时还会与其他列进行匹配,就会出现一行全是NA的情况
dcast(DT, V1+V3~V2, drop = F)
## V1 V3 1 2 3
## 1: 1 1 0.3407997 -0.3795377 -0.8981073
## 2: 1 2 NA NA NA
## 3: 1 3 -0.7033403 -0.7460474 -0.3347941
## 4: 2 1 0.3407997 -0.3795377 -0.8981073
## 5: 2 2 NA NA NA
## 6: 2 3 -0.7033403 -0.7460474 -0.3347941
dcast(DT, V1+V3~V2, drop = T)
## V1 V3 1 2 3
## 1: 1 1 0.3407997 -0.3795377 -0.8981073
## 2: 1 3 -0.7033403 -0.7460474 -0.3347941
## 3: 2 1 0.3407997 -0.3795377 -0.8981073
## 4: 2 3 -0.7033403 -0.7460474 -0.3347941
dcast(DT, V1+V3~V2, drop = F, fill = 0) #是NA的地方填充0
## V1 V3 1 2 3
## 1: 1 1 0.3407997 -0.3795377 -0.8981073
## 2: 1 2 0.0000000 0.0000000 0.0000000
## 3: 1 3 -0.7033403 -0.7460474 -0.3347941
## 4: 2 1 0.3407997 -0.3795377 -0.8981073
## 5: 2 2 0.0000000 0.0000000 0.0000000
## 6: 2 3 -0.7033403 -0.7460474 -0.3347941
dcast(DT, V1~V2, mean)
## V1 1 2 3
## 1: 1 -0.1812703 -0.5627926 -0.6164507
## 2: 2 -0.1812703 -0.5627926 -0.6164507
# 挑选V3是1的出来计算
dcast(DT, V1~V2, mean, subset = .(V3 == 1))
## V1 1 2 3
## 1: 1 0.3407997 -0.3795377 -0.8981073
## 2: 2 0.3407997 -0.3795377 -0.8981073
# 计算完挑选V1是1的出来
dcast(DT, V1~V2, mean, subset = .(V1 == 1))
## V1 1 2 3
## 1: 1 -0.1812703 -0.5627926 -0.6164507
# 计算完挑选V2是1的出来
dcast(DT, V1~V2, mean, subset = .(V2 == 1))
## V1 1
## 1: 1 -0.1812703
## 2: 2 -0.1812703
# 挑选V3是1的出来计算,之后挑选V1是1的来展示
dcast(DT, V1~V2, mean, subset = .(V1 == 1 & V3 == 1))
## V1 1 2 3
## 1: 1 0.3407997 -0.3795377 -0.8981073
上面涉及到的几乎都是常用的语法和函数,下面就介绍几个改进了的函数:
%chin%替代了%in%
fsort替代了sort
chmatch替代了match,两个参数返回和前者等长的向量,是前者每一个元素在后者中的索引
chorder或者chgroup代替order,返回一个向量,排列顺序为:最小值在向量中的索引,第二小的…
duplicated替代duplicated
unique替代unique,另有uniqueN直接计算去重之后的个数
还有就是常用的集合函数:增加了all参数,控制重复值。基础函数只能返回去重之后的结果。
函数变化:union intersect setdiff setequal 前面都加了一个f
注:以上几个集合函数作用于数据表,并且要求列名相同。
x <- data.table(a = c(1, 2, 2, 2, 3, 4, 4))
y <- data.table(a = c(2, 3, 4, 4, 4, 5))
fintersect(x, y) #返回相交部分并去重
## a
## 1: 2
## 2: 3
## 3: 4
fintersect(x, y, all = TRUE) #相交,保留重复值
## a
## 1: 2
## 2: 3
## 3: 4
## 4: 4
x <- data.table(a = c(1, 2, 2, 2, 3, 4, 4))
y <- data.table(a = c(2, 3, 4, 4, 4, 5))
fintersect(x, y) #交集,去重
## a
## 1: 2
## 2: 3
## 3: 4
fintersect(x, y, all = TRUE) #交集,保留重复值
## a
## 1: 2
## 2: 3
## 3: 4
## 4: 4
fsetdiff(x, y) #差集,去重
## a
## 1: 1
fsetdiff(x, y, all = TRUE) #差集,保留重复值
## a
## 1: 1
## 2: 2
## 3: 2
funion(x, y) #并集,去重
## a
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
funion(x, y, all = TRUE) #并集,保留重复值
## a
## 1: 1
## 2: 2
## 3: 2
## 4: 2
## 5: 3
## 6: 4
## 7: 4
## 8: 2
## 9: 3
## 10: 4
## 11: 4
## 12: 4
## 13: 5
fsetequal(x, y) #返回一个F,二者不完全相等
## [1] FALSE
y <- x <- 1:5
xy <- data.table(x, y)
shift(x, n = 1, fill = NA, type = "lag")
## [1] NA 1 2 3 4
shift(x, n = 1:2, fill = 0, type = "lag")
## [[1]]
## [1] 0 1 2 3 4
##
## [[2]]
## [1] 0 0 1 2 3
xy[, (c("a"," b")) := shift(.SD, 1, 0, "lead")] []
## x y a b
## 1: 1 1 2 2
## 2: 2 2 3 3
## 3: 3 3 4 4
## 4: 4 4 5 5
## 5: 5 5 0 0
xy[, shift(.SD, 1, 0, "lead", give.names = T)] [] #自动生成名字
## x_lead_1 y_lead_1 a_lead_1 b_lead_1
## 1: 2 2 3 3
## 2: 3 3 4 4
## 3: 4 4 5 5
## 4: 5 5 0 0
## 5: 0 0 0 0