
1 Wager and Athey (2018)

Many scientific and engineering challenges, ranging from personalized medicine to customized marketing recommendations, require an understanding of treatment effect heterogeneity. In this article, we develop a nonparametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm.

The causal forest allows for nonlinearity in the treatment effect.

In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

Besides linear regression, PSM can also be implemented with kNN, as in Agarwal et al. (2020).

1.1 Introduction

Definition 1.1: (treatment effect heterogeneity)

Why is the interaction-term approach (Lo 2002) not good enough? Because it is essentially a homogeneous-effect specification.

Historically, most datasets have been too small to meaningfully explore heterogeneity of treatment effects beyond dividing the sample into a few subgroups.

Classical approaches to nonparametric estimation of heterogeneous treatment effects include nearest-neighbor matching, kernel methods, and series estimation; see, for example, Crump et al. (2008), Lee (2009), and Willke et al. (2012).

Following Athey and Imbens (2016), our proposed forest is composed of causal trees that estimate the effect of the treatment at the leaves of the trees; we thus refer to our algorithm as a causal forest.

  • Athey and Imbens (2016) causal tree

as well as the adaptive nearest neighbors interpretation of random forests of Lin and Jeon (2006).

Finally, we compare the performance of the causal forest algorithm against classical k-nearest neighbor matching using simulations, finding that the causal forest dominates in terms of both bias and variance in a variety of settings, and that its advantage increases with the number of covariates.

The authors' main comparison is against k-NN.

1.2 Causal Forests

1.2.1 Treatment Estimation with Unconfoundedness

Suppose we have access to \(n\) independent and identically distributed training examples labeled \(i = 1,\ldots,n,\) each of which consists of a feature vector \(X_{i} \in \lbrack 0,1\rbrack^{d},\) a response \(Y_{i} \in \mathbb{R}\) and a treatment indicator \(W_{i} \in \{ 0,1\}.\) Following the potential outcomes framework of Neyman (1923) and Rubin (1974) (see Imbens and Rubin 2015 for a review), we then posit the existence of potential outcomes \(Y_{i}^{(1)}\) and \(Y_{i}^{(0)}\) corresponding respectively to the response the \(i\) th subject would have experienced with and without the treatment, and define the treatment effect at \(x\) as

\[\tau(x) = \mathbb{E}\left\lbrack Y_{i}^{(1)} - Y_{i}^{(0)} \mid X_{i} = x \right\rbrack\]

Our goal is to estimate this function \(\tau(x)\). The main difficulty is that we can only ever observe one of the two potential outcomes \(Y_{i}^{(0)},Y_{i}^{(1)}\) for a given training example, and so cannot directly train machine learning methods on differences of the form \(Y_{i}^{(1)} - Y_{i}^{(0)}\)

The main problem is that each sample realizes only one of \(Y_{i}^{(1)}\) and \(Y_{i}^{(0)}\), never both, so we cannot directly model their difference.

Definition 1.2: (unconfoundedness) The assumption of a randomized experiment: \(Y_{i}^{(1)}\) and \(Y_{i}^{(0)}\) are split at random. This is fundamentally different from reject inference: in reject inference the Y of rejected samples is missing, although one may of course set Y = 0.

In general, we cannot estimate \(\tau(x)\) simply from the observed data \(\left( X_{i},Y_{i},W_{i} \right)\) without further restrictions on the data-generating distribution. A standard way to make progress is to assume unconfoundedness (Rosenbaum and Rubin 1983 ), that is, that the treatment assignment \(W_{i}\) is independent of the potential outcomes for \(Y_{i}\) conditional on \(X_{i}\)

\[\left\{ Y_{i}^{(0)},Y_{i}^{(1)} \right\}\bot W_{i} \mid X_{i}\]

Definition 1.3: (NN matching) Find the most similar samples, so that for a matched pair we effectively observe both the \(Y_{i}^{(1)}\) and \(Y_{i}^{(0)}\) states.

The motivation behind this unconfoundedness is that, given continuity assumptions, it effectively implies that we can treat nearby observations in \(x\) -space as having come from a randomized experiment; thus, nearest-neighbor matching and other local methods will in general be consistent for \(\tau(x)\). An immediate consequence of unconfoundedness is that

\[\mathbb{E}\left\lbrack Y_{i}\left( \frac{W_{i}}{e(x)} - \frac{1 - W_{i}}{1 - e(x)} \right) \mid X_{i} = x \right\rbrack = \tau(x), \quad \text{where } e(x) = \mathbb{E}\left\lbrack W_{i} \mid X_{i} = x \right\rbrack\]

is the propensity of receiving treatment at \(x.\) Thus, if we knew \(e(x),\) we would have access to a simple unbiased estimator for \(\tau(x);\) this observation lies at the heart of methods based on propensity weighting (e.g., Hirano, Imbens, and Ridder 2003 ).
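This identity is easy to check numerically. Below is a minimal numpy sketch (not from the paper) that simulates data with a known propensity \(e(x)\) and verifies that a local average of the transformed outcome recovers \(\tau(x)\); all variable names and the simulated design are illustrative assumptions.

```python
# Transformed-outcome check: with a known propensity e(x), the weighted outcome
# Y_i * (W_i/e - (1-W_i)/(1-e)) is unbiased for tau(x) under unconfoundedness.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(size=n)                 # a single covariate on [0, 1]
e = 0.3 + 0.4 * X                       # true propensity e(x), bounded away from 0 and 1
W = rng.binomial(1, e)                  # treatment assignment
tau = 1.0 + X                           # true heterogeneous effect tau(x)
Y = X + W * tau + rng.normal(size=n)    # observed outcome

# Transformed outcome; averaging it over a narrow bin around x = 0.5 recovers tau(0.5) = 1.5.
Y_star = Y * (W / e - (1 - W) / (1 - e))
in_bin = np.abs(X - 0.5) < 0.05
print(Y_star[in_bin].mean())
```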

  • The approach of learning the propensity score: Hirano, Imbens, and Ridder (2003).

Many early applications of machine learning to causal inference effectively reduce to estimating \(e(x)\) using, for example, boosting, a neural network, or even random forests, and then transforming this into an estimate for \(\tau(x)\) using (3) (e.g., McCaffrey, Ridgeway, and Morral 2004; Westreich, Lessler, and Funk 2010 ). In this article, we take a more indirect approach: we show that, under regularity assumptions, causal forests can use the unconfoundedness assumption (2) to achieve consistency without needing to explicitly estimate the propensity \(e(x).^{2}\)

Definition 1.4: (the authors' contribution) Unlike 1.3, the propensity score does not have to be constructed explicitly.

Sure enough, this is similar in spirit to PSM.

1.2.2 From Regression Trees to Causal Trees and Forests

At a high level, trees and forests can be thought of as nearest neighbor methods with an adaptive neighborhood metric.

A decision tree is itself subgrouping similar samples, so it is essentially a nearest-neighbor method. The samples within a node are therefore similar, which directly yields the matched states of 1.3, and the uplift can be computed directly. In that sense the uplift estimate within each node is unbiased.

Moreover, the way a tree grows means that finding a similar subgroup uses only a few variables, keeping the complexity low.

This matches the authors' stated goal in 1.4: the same end is reached without explicit propensity estimation.

Given a test point \(x,\) classical methods such as \(k\) -nearest neighbors seek the \(k\) closest points to \(x\) according to some prespecified distance measure, for example, Euclidean distance. In contrast, tree-based methods also seek to find training examples that are close to \(x\), but now closeness is defined with respect to a decision tree, and the closest points to \(x\) are those that fall in the same leaf as it. The advantage of trees is that their leaves can be narrower along the directions where the signal is changing fast and wider along the other directions, potentially leading to a substantial increase in power when the dimension of the feature space is even moderately large.

Definition 1.5: (inference with a classical decision tree)

In this section, we seek to build causal trees that resemble their regression analogues as closely as possible. Suppose first that we only observe independent samples \(\left( X_{i},Y_{i} \right)\), and want to build a CART regression tree. We start by recursively splitting the feature space until we have partitioned it into a set of leaves \(L\) each of which only contains a few training samples. Then, given a test point \(x\), we evaluate the prediction \(\widehat{\mu}(x)\) by identifying the leaf \(L(x)\) containing \(x\) and setting

\[\widehat{\mu}(x) = \frac{1}{\left| \left\{ i:X_{i} \in L(x) \right\} \right|}\sum_{\left\{ i:X_{i} \in L(x) \right\}}^{}Y_{i}\]

Heuristically, this strategy is well-motivated if we believe the leaf \(L(x)\) to be small enough that the responses \(Y_{i}\) inside the leaf are roughly identically distributed. There are several procedures for how to place the splits in the decision tree; see, for example, Hastie, Tibshirani, and Friedman (2009).
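As a concrete illustration, the leaf-mean prediction \(\widehat{\mu}(x)\) can be written out directly. The sketch below uses sklearn's DecisionTreeRegressor only to obtain a partition; the simulated data and the helper leaf_mean are hypothetical, not from the paper.

```python
# CART-style leaf-mean prediction mu_hat(x): find the training points that fall
# in the same leaf as x and average their responses.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 2))                              # hypothetical training features
Y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=1000)   # hypothetical responses

tree = DecisionTreeRegressor(min_samples_leaf=10).fit(X, Y)

def leaf_mean(x, tree, X_train, Y_train):
    """mu_hat(x): average of Y_i over the training points in the leaf L(x)."""
    leaf_of_x = tree.apply(x.reshape(1, -1))[0]
    in_leaf = tree.apply(X_train) == leaf_of_x
    return Y_train[in_leaf].mean()

x_test = np.array([0.3, 0.7])
print(leaf_mean(x_test, tree, X, Y))   # coincides with tree.predict(x_test.reshape(1, -1)) for a fitted CART
```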

An important corollary, though: when the leaves of the decision tree contain few enough samples, the samples within a leaf can indeed be regarded as similar.

In follow-up work, Athey et al. (2019) adapted the causal forest algorithm, enabling it to make use of propensity score estimates \(\widehat{e}(x)\) for improved robustness.

  • Athey et al. (2019) incorporate the propensity score into the causal tree; worth reading, though probably not directly relevant here.

In the context of causal trees, we analogously want to think of the leaves as small enough that the \(\left( Y_{i},W_{i} \right)\) pairs corresponding to the indices \(i\) for which \(i \in L(x)\) act as though they had come from a randomized experiment. Then, it is natural to estimate the treatment effect for any \(x \in L\) as

Here the value recorded at a leaf is no longer the mean of y as in a standard decision tree, but the difference between the mean y of the treated units and the mean y of the control units.

\[\widehat{\tau}(x) = \frac{1}{\left| \left\{ i:W_{i} = 1,X_{i} \in L \right\} \right|}\sum_{\left\{ i:W_{i} = 1,X_{i} \in L \right\}}Y_{i} - \frac{1}{\left| \left\{ i:W_{i} = 0,X_{i} \in L \right\} \right|}\sum_{\left\{ i:W_{i} = 0,X_{i} \in L \right\}}Y_{i} \tag{1.1}\]

In the following sections, we will establish that such trees can be used to grow causal forests that are consistent for \(\tau(x)\).
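A minimal sketch of the within-leaf estimate (1.1), assuming a fitted tree object with an apply method that maps points to leaf indices (as in the previous sketch); leaf_treatment_effect is a hypothetical helper.

```python
# Equation (1.1): treated-minus-control mean within the leaf L(x).
import numpy as np

def leaf_treatment_effect(x, tree, X_train, Y_train, W_train):
    leaf_of_x = tree.apply(x.reshape(1, -1))[0]
    in_leaf = tree.apply(X_train) == leaf_of_x
    treated = in_leaf & (W_train == 1)
    control = in_leaf & (W_train == 0)
    return Y_train[treated].mean() - Y_train[control].mean()
```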

The causal tree algorithm presented above is a simplification of the method of Athey and Imbens (2016). The main difference between our approach and that of Athey and Imbens (2016) is that they seek to build a single well-tuned tree; to this end, they use fairly large leaves and apply a form of propensity weighting based on (3) within each leaf to correct for variations in \(e(x)\) inside the leaf. In contrast, we follow Breiman (2001a) and build our causal forest using deep trees. Since our leaves are small, we do not need to apply any additional corrections inside them.

Athey and Imbens (2016) build a single well-tuned causal tree, so regularization and related steps are unavoidable. The authors' causal forest, by contrast, is built on top of a random forest, which is regularized by the ensemble itself, so the trees can be grown deep. The benefit of depth is that each leaf contains few samples, the within-leaf differences become small enough, and no propensity score correction is needed.

Finally, given a procedure for generating a single causal tree, a causal forest generates an ensemble of \(B\) such trees, each of which outputs an estimate \({\widehat{\tau}}_{b}(x).\) The forest then aggregates their predictions by averaging them: \(\widehat{\tau}(x) = B^{- 1}\sum_{b = 1}^{B}{\widehat{\tau}}_{b}(x).\) We always assume that the individual causal trees in the forest are built using random subsamples of \(s\) training examples, where \(s/n \ll 1;\) for our theoretical results, we will assume that \(s \asymp n^{\beta}\) for some \(\beta < 1.\) The advantage of a forest over a single tree is that it is not always clear what the “best” causal tree is. In this case, as shown by Breiman (2001a), it is often better to generate many different decent-looking trees and average their predictions, instead of seeking a single highly-optimized tree. In practice, this aggregation scheme helps reduce variance and smooths sharp decision boundaries (Bühlmann and Yu 2002 ).

The forest does not vote; it estimates by averaging.
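The aggregation step can be sketched as follows, assuming a hypothetical grow_causal_tree routine standing in for Procedure 1 or 2; only the subsampling-without-replacement and averaging logic is shown.

```python
# Average the per-tree estimates tau_hat_b(x) over B trees grown on subsamples of size s.
import numpy as np

def causal_forest_predict(x, X, Y, W, grow_causal_tree, B=2000, s=None, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    s = s if s is not None else int(n ** 0.8)        # subsample size s ~ n^beta with beta < 1
    estimates = []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)   # subsample without replacement
        tree = grow_causal_tree(X[idx], Y[idx], W[idx])
        estimates.append(tree.predict_effect(x))     # hypothetical per-tree estimate tau_hat_b(x)
    return np.mean(estimates)                        # tau_hat(x) = B^{-1} sum_b tau_hat_b(x)
```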

1.2.3 Asymptotic Inference with Causal Forests

  • What does Lipschitz continuous mean here? Could this be what makes the differencing break down?

Our first result is that causal forests are consistent for the true treatment effect \(\tau(x)\). To achieve pointwise consistency, we need to assume that the conditional mean functions \(\mathbb{E}\left\lbrack Y^{(0)} \mid X = x \right\rbrack\) and \(\mathbb{E}\left\lbrack Y^{(1)} \mid X = x \right\rbrack\) are both Lipschitz continuous. To our knowledge, all existing results on pointwise consistency of regression forests (e.g., Biau 2012; Meinshausen 2006 ) require an analogous condition on \(\mathbb{E}\lbrack Y \mid X = x\rbrack.\) This is not particularly surprising, as forests generally have smooth response surfaces (Bühlmann and Yu 2002). In addition to continuity assumptions, we also need to assume that we have overlap, that is, for some \(\varepsilon > 0\) and all \(x \in \lbrack 0,1\rbrack^{d}\),

\[\varepsilon < \mathbb{P}\lbrack W = 1 \mid X = x\rbrack < 1 - \varepsilon\]

This condition effectively guarantees that, for large enough \(n\), there will be enough treatment and control units near any test point \(x\) for local methods to work.
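In practice one might sanity-check overlap by estimating \(e(x)\) with any probabilistic classifier and verifying that the fitted propensities stay inside \((\varepsilon, 1-\varepsilon)\); the logistic regression below is just one convenient, hypothetical choice, not something prescribed by the paper.

```python
# Simple overlap diagnostic: check that estimated propensities stay away from 0 and 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def check_overlap(X, W, eps=0.05):
    """Return whether all fitted propensities lie in (eps, 1 - eps), plus their range."""
    e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    ok = bool(np.all((e_hat > eps) & (e_hat < 1 - eps)))
    return ok, float(e_hat.min()), float(e_hat.max())
```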

Beyond consistency, to do statistical inference on the basis of the estimated treatment effects \(\widehat{\tau}(x),\) we need to understand their asymptotic sampling distribution. Using the potential nearest neighbors construction of Lin and Jeon (2006) and classical analysis tools going back to Hoeffding (1948) and Hájek (1968), we show that, provided the subsample size \(s\) scales appropriately with \(n\), the predictions made by a causal forest are asymptotically Gaussian and unbiased. Specifically, we show that

\[(\widehat{\tau}(x) - \tau(x))/\sqrt{\text{Var}\lbrack\widehat{\tau}(x)\rbrack} \Rightarrow \mathcal{N}(0,1)\]

under the conditions required for consistency, provided the subsample size \(s\) scales as \(s \asymp n^{\beta}\) for some \(\beta_{\text{min}} < \beta < 1\)

  • [x] The authors prove that the estimate \(\widehat{\tau}(x)\) is asymptotically normally distributed; how should this be understood? See 1.7.

To define the variance estimates, let \({\widehat{\tau}}_{b}^{*}(x)\) be the treatment effect estimate given by the \(b\) th tree, and let \(N_{ib}^{*} \in \{ 0,1\}\) indicate whether or not the \(i\) th training example was used for the \(b\) th tree. \(^{4}\) Then, we set

\[{\widehat{V}}_{IJ}(x) = \frac{n - 1}{n}\left( \frac{n}{n - s} \right)^{2}\sum_{i = 1}^{n}\text{Cov}_{*}\left\lbrack {\widehat{\tau}}_{b}^{*}(x),N_{ib}^{*} \right\rbrack^{2}\]

where the covariance is taken with respect to the set of all the trees \(b = 1,\ldots,B\) used in the forest. The term \(n(n -\) 1) \(/(n - s)^{2}\) is a finite-sample correction for forests grown by subsampling without replacement; see Proposition 5. We show that this variance estimate is consistent, in the sense that \({\widehat{V}}_{IJ}(x)/\text{Var}\lbrack\widehat{\tau}(x)\rbrack \rightarrow_{p}1\)

The authors also estimate the variance of \(\widehat{\tau}(x)\), which is what makes the normal approximation for \(\widehat{\tau}(x)\) usable in practice.
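A sketch of \({\widehat{V}}_{IJ}(x)\) and the resulting Gaussian confidence interval, assuming we have stored the per-tree estimates \({\widehat{\tau}}_{b}^{*}(x)\) and the inclusion indicators \(N_{ib}^{*}\) from the forest; forest_ci is a hypothetical helper and the 95% level is just an example.

```python
# Infinitesimal-jackknife variance estimate V_IJ(x) and a Gaussian confidence interval.
import numpy as np

def forest_ci(tau_b, N, n, s):
    """tau_b: per-tree estimates, shape (B,); N: 0/1 inclusion matrix, shape (n, B)."""
    B = len(tau_b)
    tau_hat = tau_b.mean()
    # Cov_* over trees b = 1..B between tau_hat_b(x) and the inclusion indicator N_ib
    cov = ((N - N.mean(axis=1, keepdims=True)) * (tau_b - tau_hat)).sum(axis=1) / (B - 1)
    V_IJ = (n - 1) / n * (n / (n - s)) ** 2 * np.sum(cov ** 2)
    half = 1.96 * np.sqrt(V_IJ)          # ~95% Gaussian interval half-width
    return tau_hat, (tau_hat - half, tau_hat + half)
```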

1.2.4 Honest Trees and Forests

Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example \(i\), it only uses the response \(Y_{i}\) to estimate the within-leaf treatment effect \(\tau\) using (1.1) or to decide where to place the splits, but not both. We discuss two causal forest algorithms that satisfy this condition.

  • By design, the response Y of a training example is used either for placing splits or for estimating (1.1), never both. This is also the usual situation in uplift modeling; if both were observed, uplift could be computed directly.

Our first algorithm, which we call a double-sample tree, achieves honesty by dividing its training subsample into two halves \(\mathcal{I}\) and \(\mathcal{J}\). Then, it uses the \(\mathcal{J}\) -sample to place the splits, while holding out the \(\mathcal{I}\) -sample to do within-leaf estimation; see Procedure 1 for details. In our experiments, we set the minimum leaf size to \(k = 1\). A similar family of algorithms was discussed in detail by Denil, Matheson, and De Freitas \((2014),\) who showed that such forests could achieve competitive performance relative to standard tree algorithms that do not divide their training samples. In the semiparametric inference literature, related ideas go back at least to the work of Schick (1986) .

Figure 1.1: Procedure 1. DOUBLE-SAMPLE TREES

In short, splitting and estimation are separated: one half of the subsample places the splits, the other half evaluates Equation (1.1).
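A minimal sketch of the \(\mathcal{I}/\mathcal{J}\) split in Procedure 1; fit_splits_on is a hypothetical stand-in for the actual splitting rule, and the within-leaf estimates would then be computed on the held-out \(\mathcal{I}\)-sample.

```python
# Double-sample idea: J places the splits, I is held out for within-leaf estimation.
import numpy as np

def double_sample_tree(X, Y, W, fit_splits_on, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    I_idx, J_idx = idx[: len(idx) // 2], idx[len(idx) // 2:]
    tree = fit_splits_on(X[J_idx], Y[J_idx], W[J_idx])   # splits never see the I-sample
    # Within-leaf effects (1.1) are then evaluated on X[I_idx], Y[I_idx], W[I_idx],
    # e.g. with leaf_treatment_effect from the earlier sketch.
    return tree, I_idx
```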

We note that sample splitting procedures are sometimes criticized as inefficient because they “waste” half of the training data at each step of the estimation procedure. However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the \(\mathcal{I}/\mathcal{J}\) -data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both \(\mathcal{I}\) and \(\mathcal{J}\) samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.

The experimental design is quite clever.

Another way to build honest trees is to ignore the outcome data \(Y_{i}\) when placing splits, and instead first train a classification tree for the treatment assignments \(W_{i}\) (Procedure 2). Such propensity trees can be particularly useful in observational studies, where we want to minimize bias due to variation in \(e(x)\). Seeking estimators that match training examples based on estimated propensity is a longstanding idea in causal inference, going back to Rosenbaum and Rubin (1983).

  • Did Rosenbaum and Rubin (1983) train decision trees on the propensity score? Unlikely, since trees were not available then; the idea is simply to regress the treatment indicator on covariates with some supervised learner, which is clear enough.

Since \(W_{i}\) is being scored, this is essentially the propensity score approach; Equation (1.1) is then computed for the samples in each leaf.
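A sketch of the propensity-tree idea (Procedure 2) using sklearn's DecisionTreeClassifier as a stand-in: splits are chosen from \(W_i\) alone, and (1.1) is then evaluated within the leaf containing \(x\). This is an illustration under assumed names, not the paper's implementation.

```python
# Propensity tree: split on the treatment indicator W only, then evaluate (1.1) in the leaf of x.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def propensity_tree_effect(x, X, Y, W, min_leaf=25, seed=0):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=seed).fit(X, W)
    leaf_of_x = tree.apply(x.reshape(1, -1))[0]
    in_leaf = tree.apply(X) == leaf_of_x
    treated, control = in_leaf & (W == 1), in_leaf & (W == 0)
    return Y[treated].mean() - Y[control].mean()
```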

  • [x] A new idea worth thinking about more. It is another way to obtain similar samples, but it leverages the decision tree instead of a nearest-neighbor search, avoiding the NN-level complexity of finding subgroups; see Lin and Jeon (2006).

Definition 1.6: (an xgb-style implementation)

  • The propensity-matching treatment is therefore quite convenient and worth rereading. Moreover, after an xgb model is trained, one can transfer samples and simply swap in a different y metric, much like retraining: use two datasets with the same tree structure, compute the leaf means, and the difference of the means in matching leaves is the delta.

The subgroup then also represents the conditional probability, under the consistency assumption. One can further consider using max tau as the splitting criterion.

Figure 1.2: Procedure 2. PROPENSITY TREES

  • Read Remark 1 to see what is different about it.

Remark 1 . For completeness, we briefly outline the motivation for the splitting rule of Athey and Imbens (2016) we use for our double-sample trees. This method is motivated by an algorithm for minimizing the squared-error loss in regression trees. Because regression trees compute predictions \(\widehat{\mu}\) by averaging training responses over leaves, we can verify that

\[\sum_{i \in \mathcal{J}}^{}\left( \widehat{\mu}\left( X_{i} \right) - Y_{i} \right)^{2} = \sum_{i \in \mathcal{J}}^{}Y_{i}^{2} - \sum_{i \in \mathcal{J}}^{}\widehat{\mu}\left( X_{i} \right)^{2}\]

Thus, finding the squared-error minimizing split is equivalent to maximizing the variance of \(\widehat{\mu}\left( X_{i} \right)\) for \(i \in \mathcal{J};\) note that \(\sum_{i \in \mathcal{J}}^{}\widehat{\mu}\left( X_{i} \right) = \sum_{i \in \mathcal{J}}^{}Y_{i}\) for all trees, and so maximizing variance is equivalent to maximizing the sum of the \(\widehat{\mu}\left( X_{i} \right)^{2}.\) In Procedure \(1,\) we emulate this algorithm by picking splits that maximize the variance of \(\widehat{\tau}\left( X_{i} \right)\) for \(i \in \mathcal{J}\). \(^{6}\)

  • Note that the split here maximizes the variance of the estimates from (1.1).
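A sketch of this splitting heuristic for a single axis-aligned candidate split: compute the within-child estimates, assign them to every \(\mathcal{J}\)-sample point, and score the split by the variance of those fitted values. The helper split_score and its guard conditions are assumptions for illustration.

```python
# Score a candidate split by the variance of the fitted tau_hat(X_i) it induces.
import numpy as np

def split_score(X_j, Y, W, threshold):
    """Variance of tau_hat(X_i) for a split of feature X_j at `threshold`."""
    left = X_j <= threshold
    taus = np.empty(len(Y))
    for side in (left, ~left):
        treated, control = side & (W == 1), side & (W == 0)
        if treated.sum() == 0 or control.sum() == 0:
            return -np.inf                       # a child without both arms cannot be scored
        taus[side] = Y[treated].mean() - Y[control].mean()
    return taus.var()                            # pick the threshold maximizing this score
```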

While this article was in press, we became aware of work by Wang et al. (2015), who use what we call propensity forests for average treatment effect estimation.

Remark \(2.\) In Appendix \(B,\) we present evidence that adaptive forests with small leaves can overfit to outliers in ways that make them inconsistent near the edges of sample space. Thus, the forests of Breiman (2001a) need to be modified in some way to get pointwise consistency results; here, we use honesty following, for example, Wasserman and Roeder (2009). We note that there have been some recent theoretical investigations of non-honest forests, including Scornet, Biau, and Vert (2015) and Wager and Walther (2015). However, Scornet, Biau, and Vert (2015) do not consider pointwise properties of forests; whereas Wager and Walther (2015) showed consistency of adaptive forests with larger leaves, but their bias bounds decay slower than the sampling variance of the forests and so cannot be used to establish centered asymptotic normality.

Wager and Walther (2015) argue that with many samples per leaf the estimates can be consistent (law of large numbers), but they remain biased, so the leaf sample size has to be large enough.

  • [ ] However, the final splits may still end up grouping outliers together; this is a drawback the authors do not address.

1.3 Asymptotic Theory for Random Forests

For a model that treats \(W_i\) as a binary classification target, see Wang et al. (2015).

To use random forests to provide formally valid statistical inference, we need an asymptotic normality theory for random forests. In the interest of generality, we first develop such a theory in the context of classical regression forests, as originally introduced by Breiman (2001a). In this section, we assume that we have training examples \(Z_{i} = \left( X_{i},Y_{i} \right)\) for \(i = 1,\ldots,n,\) a test point \(x\), and we want to estimate the true conditional mean function

\[\mu(x) = \mathbb{E}\lbrack Y \mid X = x\rbrack\]

Thinking in terms of the partial effect in a linear equation, a random forest itself already delivers a conditional mean; here, of course, the authors want to extend this to a conditional mean delta.

We also have access to a regression tree \(T\) which can be used to get estimates of the conditional mean function at \(x\) of the form \(T\left( x;\xi,Z_{1},\ldots,Z_{n} \right),\) where \(\xi \sim \Xi\) is a source of auxiliary randomness. Our goal is to use this tree-growing scheme to build a random forest that can be used for valid statistical inference about \(\mu(x)\).

Let us see how the authors argue that random forests satisfy approximate normality.

Definition 1.7: (Monte Carlo averaging)

We begin by precisely describing how we aggregate individual trees into a forest. For us, a random forest is an average of trees trained over all possible size-s subsamples of the training data, marginalizing over the auxiliary noise \(\xi.\) In practice, we compute such a random forest by Monte Carlo averaging, and set

\[\text{RF}\left( x;Z_{1},\ldots,Z_{n} \right) \approx \frac{1}{B}\sum_{b = 1}^{B}T\left( x;\xi_{b}^{*},Z_{b1}^{*},\ldots,Z_{bs}^{*} \right)\]

where \(\left\{ Z_{b1}^{*},\ldots,Z_{bs}^{*} \right\}\) is drawn without replacement from \(\left\{ Z_{1},\ldots,Z_{n} \right\},\xi_{b}^{*}\) is a random draw from \(\Xi\), and \(B\) is the number of Monte Carlo replicates we can afford to perform. The formulation (12) arises as the \(B \rightarrow \infty\) limit of (11); thus, our theory effectively assumes that \(B\) is large enough for Monte Carlo effects not to matter. The effects of using a finite \(B\) are studied in detail by Mentch and Hooker ( 2016 ); see also Wager, Hastie, and Efron (2014), who recommend taking \(B\) on the order of \(n\).

So the random forest grows each tree on a subsample and then averages to obtain the estimate; this is the Monte Carlo simulation idea, so the estimate obeys an approximate law of large numbers and is unbiased, unlike boosting.

  • But why the size of B should be on the order of n (the sample size) is not clear to me.

Definition 1. The random forest with base learner \(T\) and subsample size \(s\) is

\[\text{RF}\left( x;Z_{1},\ldots,Z_{n} \right) = \binom{n}{s}^{-1}\sum_{1 \leq i_{1} < i_{2} < \cdots < i_{s} \leq n}\mathbb{E}_{\xi \sim \Xi}\left\lbrack T\left( x;\xi,Z_{i_{1}},\ldots,Z_{i_{s}} \right) \right\rbrack\]

Next, as described in Section 2 , we require that the trees \(T\) in our forest be honest. Double-sample trees, as defined in Procedure 1, can always be used to build honest trees with respect to the \(\mathcal{I}\) -sample. In the context of causal trees for observational studies, propensity trees (Procedure 2) provide a simple recipe for building honest trees without sample splitting.

Of the two splitting schemes, the first uses half of the sample to place splits and the other half to compute the delta; the second builds the tree on the treatment indicator and uses the outcomes to compute the delta. In neither case are splitting and delta computation entangled, so both are honest.

  • If splitting and delta estimation used the same data, the chosen splits might not be the ones that truly maximize the delta.

Definition 2. A tree grown on a training sample \(\left( Z_{1} = \left( X_{1},Y_{1} \right),\ldots,Z_{s} = \left( X_{s},Y_{s} \right) \right)\) is honest if (a) (standard case) the tree does not use the responses \(Y_{1},\ldots,Y_{s}\) in choosing where to place its splits; or (b) (double-sample case) the tree does not use the \(\mathcal{I}\) -sample responses for placing splits.

Put plainly, no sample is used both to place splits and to estimate the delta.

To guarantee consistency, we also need to enforce that the leaves of the trees become small in all dimensions of the feature space as \(n\) gets large.\(^{7}\) Here, we follow Meinshausen (2006), and achieve this effect by enforcing some randomness in the way trees choose the variables they split on: at each step, each variable is selected with probability at least \(\pi/d\) for some \(0 < \pi \leq 1\) (e.g., we could satisfy this condition by completely randomizing the splitting variable with probability \(\pi\)). Formally, the randomness in how to pick the splitting features is contained in the auxiliary random variable \(\xi\).

  • What is d? The number of feature variables.

Continuing the random forest logic for choosing variables, each variable is sampled with probability at least \(\pi/d\) (the probability is shared among the d variables), or the splitting variable is fully randomized with probability \(\pi\).

Definition 3. A tree is a random-split tree if at every step of the tree-growing procedure, marginalizing over \(\xi,\) the probability that the next split occurs along the \(j\) th feature is bounded below by \(\pi/d\) for some \(0 < \pi \leq 1,\) for all \(j = 1,\ldots,d\).
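A toy sketch of the random-split device of Definition 3: with probability \(\pi\) the splitting variable is drawn uniformly at random, otherwise the greedy choice is kept, so each of the \(d\) features is selected with probability at least \(\pi/d\). The function name is hypothetical.

```python
# Randomized choice of the splitting variable, giving every feature probability >= pi/d.
import numpy as np

def choose_split_variable(greedy_best_j, d, pi=0.2, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    if rng.uniform() < pi:
        return int(rng.integers(d))   # fully randomized split variable
    return greedy_best_j              # otherwise keep the greedy choice
```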

The remaining definitions are more technical. We use regularity to control the shape of the tree leaves, while symmetry is used to apply classical tools in establishing asymptotic normality.

  • [x] The main point here is that the splitting variables must involve some randomness, but I do not see what this contributes to the asymptotic normality theory. Right: it only says that at each split, every variable is selected with probability at least \(\pi/d\).
  • Definition 4 through Remark 3 (binary classification): not yet read.

1.3.1 Theoretical Background

There has been considerable work in understanding the theoretical properties of random forests.

The convergence and consistency properties of trees and random forests have been studied by, among others, Biau (2012), Biau, Devroye, and Lugosi (2008), Breiman (2004), Breiman et al. (1984), Meinshausen (2006), Scornet, Biau, and Vert (2015), Wager and Walther (2015), and Zhu, Zeng, and Kosorok (2015). Meanwhile, their sampling variability has been analyzed by Duan (2011), Lin and Jeon (2006), Mentch and Hooker (2016), Sexton and Laake (2009), and Wager, Hastie, and Efron (2014).

However, to our knowledge, our Theorem 3.1 is the first result establishing conditions under which predictions made by random forests are asymptotically unbiased and normal.

  • Revisit Theorem 3.1 later; I do not fully understand it yet.
  • Not finished reading this part.

1.3.2 Bias and Honesty

  • Not finished reading.

1.3.3 Asymptotic Normality of Random Forests

  • Not finished reading.

1.3.3.1 Regression Trees and Incremental Predictors

Definition 1.8: (trees and kNN are not fundamentally different)

Analyzing specific greedy tree models such as CART trees can be challenging. We thus follow the lead of Lin and Jeon (2006), and analyze a more general class of predictors—potential nearest neighbors predictors—that operate by doing a nearest-neighbor search over rectangles; see also Biau and Devroye (2010). The study of potential (or layered) nearest neighbors goes back at least to Barndorff-Nielsen and Sobel (1966).

A nice idea, I think: trees and kNN are not fundamentally different; a tree simply carves out the nearest neighbors with axis-aligned cuts.

  • I did not fully follow Definition 7 and Theorem 3.3, but they roughly say the same as 1.8.

1.3.3.2 Subsampling Incremental Base Learners

  • Not read carefully.

1.4 Inferring Heterogeneous Treatment Effects

1.5 Simulation Experiments

Since causal forests are adaptive nearest neighbor estimators, it is natural to use a nonadaptive nearest neighborhood method as our baseline. We compare our method to the standard \(k\) nearest neighbors ( \(k\) -NN) matching procedure, which estimates the treatment effect as

\[{\widehat{\tau}}_{KNN}(x) = \frac{1}{k}\sum_{i \in \mathcal{S}_{1}(x)}^{}Y_{i} - \frac{1}{k}\sum_{i \in \mathcal{S}_{0}(x)}^{}Y_{i}\]

  • Wouldn't \(Y_{T}^{NN} - Y_{C}^{NN}\) be even more direct? But it does not perform well; see 1.3.
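A sketch of this baseline using sklearn's NearestNeighbors with Euclidean distance: average the outcomes of the \(k\) nearest treated and the \(k\) nearest control neighbors of \(x\) and take the difference; tau_knn is a hypothetical helper.

```python
# k-NN matching estimate: mean Y over S_1(x) minus mean Y over S_0(x).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tau_knn(x, X, Y, W, k=10):
    means = []
    for w in (1, 0):
        nn = NearestNeighbors(n_neighbors=k).fit(X[W == w])
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        means.append(Y[W == w][idx].mean())
    return means[0] - means[1]
```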

1.5.1 Experimental Setup

1.5.2 Results

So k-NN only performs well once k is increased.

Figure 2 illustrates this phenomenon: although the causal forest faithfully captures the qualitative aspects of the true \(\tau\) -surface, it does not exactly match its shape, especially in the upper-right corner where \(\tau\) is largest. Our theoretical results guarantee that this effect will go away as \(n \rightarrow \infty\). Figure 2 also helps us understand why \(k\) -NN performs so poorly in terms of mean-squared error: its predictive surface is both badly biased and noticeably “grainy,” especially for \(d = 20.\) It suffers from bias not only at the boundary where the treatment effect is the largest, but also where the slope of the treatment effect is high in the interior.

Figure 1.3: From the plots, the causal forest clearly fits the true effect at the boundary much better than k-NN.

1.6 Discussion

Definition 1.9: (ANN)

Our causal forest estimator can be thought of as an adaptive nearest neighbor method, where the data determine which dimensions are most important to consider in selecting nearest neighbors. Such adaptivity seems essential for modern large-scale applications with many features.

In general, the challenge in using adaptive methods as the basis for valid statistical inference is that selection bias can be difficult to quantify; see Berk et al. (2013), Chernozhukov, Hansen, and Spindler (2015), Taylor and Tibshirani (2015), and references therein for recent advances. In this article, pairing “honest” trees with the subsampling mechanism of random forests enabled us to accomplish this goal in a simple yet principled way.

  • Adaptive methods easily introduce selection bias that is hard to quantify; see Berk et al. (2013), Chernozhukov, Hansen, and Spindler (2015), and Taylor and Tibshirani (2015).

That is what makes the authors' Procedures 1 and 2 so valuable.

Finished reading.

2 Wang et al. (2015)

Similar to Wager and Athey (2018), this uses a regression on the treatment indicator.

Agarwal, Sumit, Wenlan Qian, Amit Seru, and Jian Zhang. 2020. “Disguised Corruption: Evidence from Consumer Credit in China.” Journal of Financial Economics.

Athey, Susan, Julie Tibshirani, Stefan Wager, and others. 2019. “Generalized Random Forests.” Annals of Statistics 47 (2): 1148–78.

Guelman, Leo, Montserrat Guillén, and Ana M. Pérez-Marín. 2015. “Uplift Random Forests.” Cybernetics and Systems 46 (3-4): 230–48.

Lin, Yi, and Yongho Jeon. 2006. “Random Forests and Adaptive Nearest Neighbors.” Journal of the American Statistical Association 101 (474): 578–90.

Lo, Victor SY. 2002. “The True Lift Model: A Novel Data Mining Approach to Response Modeling in Database Marketing.” ACM SIGKDD Explorations Newsletter 4 (2): 78–86.

Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.

Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.

Wang, Pengyuan, Wei Sun, Dawei Yin, Jian Yang, and Yi Chang. 2015. “Robust Tree-Based Causal Inference for Complex Ad Effectiveness Analysis.” In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 67–76.