1. Data reference: www.kaggle.com
  2. Code reference: GitHub

scale_pos_weight [default=1] is the main parameter to set here.

Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). (xgboost.readthedocs.io)

In an imbalanced dataset the negative class is usually the majority, so the value to set is the negative-to-positive sample ratio.

Generally, scale_pos_weight is the ratio of the number of negative-class observations to the number of positive-class observations. For example, if a dataset has 90 observations of the negative class and 10 of the positive class, the ideal value of scale_pos_weight is 9. (stats)

For example, with 100 samples of which 90 are negative and 10 are positive, scale_pos_weight = 90 / 10 = 9.
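The negative-to-positive ratio is easy to compute directly from a label vector; here is a minimal Python sketch (the labels are made up to match the 90/10 example above):

```python
# Hypothetical label vector: 90 negative (0) and 10 positive (1) samples
labels = [0] * 90 + [1] * 10

neg = sum(1 for y in labels if y == 0)   # count of negative samples
pos = sum(1 for y in labels if y == 1)   # count of positive samples

# Value to pass as scale_pos_weight
scale_pos_weight = neg / pos
print(scale_pos_weight)  # 9.0
```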

The following R code can be used as a reference.

# install xgboost package, see R-package in root folder
require(xgboost)
## Loading required package: xgboost
require(methods)

testsize <- 550000

dtrain <- read.csv("../refs/higgs-boson/training/training.csv", header=TRUE)
class(dtrain)
## [1] "data.frame"
dim(dtrain)
## [1] 250000     33
table(dtrain[33])
## 
##      b      s 
## 164333  85667
dtrain[33] <- dtrain[33] == "s"
table(dtrain[33])
## 
##  FALSE   TRUE 
## 164333  85667
label <- as.numeric(dtrain[[33]])
data <- as.matrix(dtrain[2:31])
# weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
# sumwpos <- sum(weight * (label==1.0))
# sumwneg <- sum(weight * (label==0.0))
sumwpos <- sum(label==1.0)
sumwneg <- sum(label==0.0)
print(paste("weight statistics: wpos=", sumwpos, "wneg=", sumwneg, "ratio=", sumwneg / sumwpos))
## [1] "weight statistics: wpos= 85667 wneg= 164333 ratio= 1.91827658258139"
xgmat <- xgb.DMatrix(data, label = label, 
                     # weight = weight, 
                     missing = -999.0)
param <- list("objective" = "binary:logitraw",
              "scale_pos_weight" = sumwneg / sumwpos,
              "eta" = 0.1,
              "max_depth" = 6,
              "eval_metric" = "auc",
              "eval_metric" = "ams@0.15",
              "silent" = 1,
              "nthread" = 16)
watchlist <- list("train" = xgmat)
nrounds <- 120
print("loading data end, start to boost trees")
bst <- xgb.train(param, xgmat, nrounds, watchlist)
# save out model
xgb.save(bst, "higgs.model")
print("finish training")
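The commented-out lines in the R code compute per-event weights and weighted class sums instead of plain counts. The same idea can be sketched in Python; the event weights and test-set size below are hypothetical stand-ins for the Higgs dataset's Weight column:

```python
# Hypothetical stand-ins for the dataset's Weight column and test size
testsize = 550000
labels = [1, 0, 0, 1, 0]                 # 1 = signal (positive class)
raw_weights = [0.2, 1.5, 0.9, 0.4, 1.1]  # hypothetical event weights

# Rescale weights to the test-set size, as in the commented R lines
weights = [w * testsize / len(labels) for w in raw_weights]

# Weighted sums per class
sumwpos = sum(w for w, y in zip(weights, labels) if y == 1)
sumwneg = sum(w for w, y in zip(weights, labels) if y == 0)

# Weighted ratio to pass as scale_pos_weight; note the rescaling
# factor cancels out in the ratio
print(sumwneg / sumwpos)
```

Because the rescaling factor is the same for every event, it cancels in the ratio, so the weighted `scale_pos_weight` depends only on the raw weights.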