CatBoost Study Notes
2019-12-03
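- These notes are stitched together from separate files with the child chunk option in RMarkdown.
- Combining them this way makes the notes easier to review.
- Please report any related problems as an Issue.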
1 R Installation
Locate the R-Package folder in the repository on GitHub, download it, and install the downloaded folder with devtools::install.
For the full walkthrough, see www.analyticsvidhya.com.
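The installation step can be sketched as follows. This is only a minimal sketch: pkg_dir is a hypothetical local path standing in for wherever the downloaded folder lives, and it assumes the devtools package is already installed.

```r
# Hypothetical location of the downloaded R-Package folder; adjust as needed.
pkg_dir <- "path/to/R-package"

if (dir.exists(pkg_dir)) {
  # Build and install the package from the local folder.
  devtools::install(pkg_dir)
} else {
  message("Folder not found: ", pkg_dir)
}
```

The dir.exists() guard just avoids a confusing build error when the path is wrong.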
2 Minimal R Example
Based on tech.yandex.com:
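library(catboost)

# Training data: one numeric and one categorical feature.
# stringsAsFactors = TRUE keeps feature2 a factor on R >= 4.0.
features <- data.frame(
  feature1 = c(1, 2, 3),
  feature2 = c('A', 'B', 'C'),
  stringsAsFactors = TRUE
)
labels <- c(0, 0, 1)
train_pool <- catboost.load_pool(data = features, label = labels)

# Unlabeled data to score; 'D' does not occur in the training data.
test_data <- data.frame(
  feature1 = c(2, 1, 3),
  feature2 = c('D', 'B', 'C'),
  stringsAsFactors = TRUE
)
test_pool <- catboost.load_pool(test_data)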
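The example above stops once the pools are built. A hedged sketch of the next step, training on the pool and scoring the new data with catboost.train and catboost.predict, could look like this. The parameter values are illustrative assumptions, not tuned settings.

```r
library(catboost)

# Same toy data as in the notes, repeated so the block runs on its own.
features <- data.frame(feature1 = c(1, 2, 3),
                       feature2 = c('A', 'B', 'C'),
                       stringsAsFactors = TRUE)
labels <- c(0, 0, 1)
train_pool <- catboost.load_pool(data = features, label = labels)

test_data <- data.frame(feature1 = c(2, 1, 3),
                        feature2 = c('D', 'B', 'C'),
                        stringsAsFactors = TRUE)
test_pool <- catboost.load_pool(test_data)

# Train without a validation pool (second argument NULL).
# The parameter values below are assumptions chosen for illustration.
model <- catboost.train(train_pool, NULL,
                        params = list(loss_function = 'Logloss',
                                      iterations = 100,
                                      metric_period = 10))

# One raw score per row of test_data.
prediction <- catboost.predict(model, test_pool)
print(prediction)
```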
3 Variable Importance
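As a minimal sketch of how variable importance might be obtained, the block below trains a small model on the adult sample files shipped with the package and calls catboost.get_feature_importance. The training parameters are assumptions made for illustration only.

```r
library(catboost)

# Sample data bundled with the catboost package.
train_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
train_pool <- catboost.load_pool(train_path, column_description = cd_path)

# Small model; parameter values are illustrative assumptions.
model <- catboost.train(train_pool, NULL,
                        params = list(loss_function = 'Logloss',
                                      iterations = 50,
                                      logging_level = 'Silent'))

# One importance value per feature.
importance <- catboost.get_feature_importance(model,
                                              pool = train_pool,
                                              type = 'FeatureImportance')
print(importance)
```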
4 Functional-Style Example
Based on Usage examples, although that code is written in an object-oriented style rather than a functional one.
library(catboost)
suppressMessages(library(tidyverse))
library(magrittr)

# Sample files shipped with the catboost package
train_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_path <- system.file("extdata", "adult_test.1000", package = "catboost")
column_description_path <- system.file("extdata", "adult.cd", package = "catboost")

# Option 1: load a pool directly from the file and its column description
pool <- catboost.load_pool(train_path,
                           column_description = column_description_path)
head(pool, 1) %>% as.vector()

# Option 2: read the data into R first.
# Columns are numbered from 1 here, and the target is column 1.
column_description_vector <- rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
column_description_vector[cat_features] <- 'factor'

train <- read.table(
  train_path,
  header = FALSE,
  sep = "\t",
  colClasses = column_description_vector,
  na.strings = 'NAN'
)
test <- read.table(
  test_path,
  header = FALSE,
  sep = "\t",
  colClasses = column_description_vector,
  na.strings = 'NAN'
)

# Replace factors with their integer level codes.
# Caveat: codes are assigned per data frame, so a category can get
# different codes in train and test when the level sets differ.
train %<>% mutate_if(is.factor, as.numeric)
test %<>% mutate_if(is.factor, as.numeric)

target <- 1

# Features and label are passed as matrices; cat_features must be 0-based
# positions within the feature matrix.
train_pool <- catboost.load_pool(as.matrix(train[, -target]),
                                 label = as.matrix(train[, target]),
                                 cat_features = cat_features - 1 - 1)
test_pool <- catboost.load_pool(as.matrix(test[, -target]),
                                label = as.matrix(test[, target]),
                                cat_features = cat_features - 1 - 1)
head(train_pool, 1) %>% as.vector()
head(test_pool, 1) %>% as.vector()
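as.matrix(data[, -target]) and as.matrix(data[, target]) form a neatly symmetric pair that is worth remembering.
cat_features indices start at 0. That is why the code subtracts 1 twice: once because the target column has been dropped, and once to move from R's 1-based numbering to 0-based.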
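The index arithmetic can be checked in base R. The helper to_pool_index below is a hypothetical name introduced only for this illustration; it is not part of catboost.

```r
# 1-based positions of the categorical columns in the full table
cat_cols <- c(3, 5, 7, 8, 9, 10, 11, 15)
target <- 1

# Hypothetical helper: map table columns to 0-based feature indices.
to_pool_index <- function(cols, target) {
  # Each column shifts left by one for every target column located before it.
  shifted <- cols - vapply(cols, function(j) sum(target < j), integer(1))
  # Switch from 1-based to 0-based numbering.
  shifted - 1
}

to_pool_index(cat_cols, target)
# 1 3 5 6 7 8 9 13, the same as cat_cols - 1 - 1
```

Writing the shift as a function makes the two separate reasons for subtracting visible, and it would still work if the target were not the first column.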
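fit_params <- list(
  iterations = 100,             # number of boosting iterations
  metric_period = 10,           # report metrics every 10 iterations
  loss_function = 'Logloss',    # binary classification
  ignored_features = c(4, 9),   # features excluded from training
  border_count = 32,            # number of splits for numerical features
  depth = 5,                    # tree depth
  learning_rate = 0.03,
  l2_leaf_reg = 3.5             # L2 regularization coefficient
)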
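A parameter list like this is meant to be handed to catboost.train. The sketch below assumes the adult sample files bundled with the package and, to stay short, loads the pools directly from the files with the column description rather than through read.table.

```r
library(catboost)

# Sample data bundled with the catboost package.
train_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_path <- system.file("extdata", "adult_test.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")

train_pool <- catboost.load_pool(train_path, column_description = cd_path)
test_pool <- catboost.load_pool(test_path, column_description = cd_path)

# Same parameter values as in the notes.
fit_params <- list(
  iterations = 100,
  metric_period = 10,
  loss_function = 'Logloss',
  ignored_features = c(4, 9),
  border_count = 32,
  depth = 5,
  learning_rate = 0.03,
  l2_leaf_reg = 3.5
)

# The test pool is passed as the validation set during training.
model <- catboost.train(train_pool, test_pool, params = fit_params)

# Predicted probability of the positive class for each test row.
prediction <- catboost.predict(model, test_pool,
                               prediction_type = 'Probability')
head(prediction)
```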
border_count: the number of splits for numerical features (tech). Allowed values are integers from 1 to 255 inclusive (effectiveml).
Likewise, ctr_border_count: the number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive (effectiveml).
These are details of how splits are made; XGBoost mentions the same idea.
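To make "number of splits" concrete, here is a base R sketch that cuts a numeric feature at a fixed number of borders. It is an illustration of the general idea only: it uses plain quantiles, which is an assumption made here and not necessarily how CatBoost chooses its borders, and quantize is a hypothetical helper name.

```r
# Hypothetical helper: bucket x using at most border_count quantile borders.
quantize <- function(x, border_count) {
  probs <- seq_len(border_count) / (border_count + 1)
  # Duplicate quantiles collapse, so there may be fewer borders than requested.
  borders <- unique(quantile(x, probs = probs, names = FALSE))
  # Bin 0 lies below the first border, bin k at or above the k-th border.
  bins <- findInterval(x, borders)
  list(borders = borders, bins = bins)
}

set.seed(1)
x <- rnorm(1000)
q <- quantize(x, border_count = 32)

length(q$borders)   # number of borders actually used
table(q$bins)       # how many values fall in each bucket
```

With k borders a feature falls into at most k + 1 buckets, so a smaller border_count gives coarser candidate splits.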
See tech.
Save the trained model with catboost.save_model.
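A minimal sketch of saving a model and reading it back with catboost.load_model follows. The toy data, the parameter values, and the file name model.cbm are assumptions made for illustration.

```r
library(catboost)

# Toy data, as in the minimal example.
features <- data.frame(feature1 = c(1, 2, 3),
                       feature2 = c('A', 'B', 'C'),
                       stringsAsFactors = TRUE)
labels <- c(0, 0, 1)
train_pool <- catboost.load_pool(data = features, label = labels)

# Parameter values are illustrative assumptions.
model <- catboost.train(train_pool, NULL,
                        params = list(loss_function = 'Logloss',
                                      iterations = 10,
                                      logging_level = 'Silent'))

# Hypothetical file name inside the session's temporary directory.
model_path <- file.path(tempdir(), "model.cbm")
catboost.save_model(model, model_path)

# Load the model back and compare predictions with the original.
restored <- catboost.load_model(model_path)
all.equal(catboost.predict(model, train_pool),
          catboost.predict(restored, train_pool))
```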
5 Python Installation
Follow the Docs to install the Python package.
pip install catboost
Install the visualization tools:
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension