spatialAnalysis

空间滞后一期的特征变量提取

李家翔 2019-01-26

knitr::opts_chunk$set(warning = FALSE, message = FALSE)

How to create a spatial effect variable and then lag it?

In matrix notation, a spatial lag is just X*y where big X is your spatial weights matrix, and little y is your variable of interest. Stack Overflow

注意 X 是一个很大的矩阵,因此当样本量很大时,这种矩阵需要在 Impala 这种批量处理数据的环境进行。

Assuming you have spherical coordinates (i.e. lat & lon)

  1. see the spDists function in the sp package.
  2. If you have projected coordinates you can just use dist().

If you have made your spatial weights matrix that has n*n dimensions then yes. Other notes - you may want to normalize by the row sums of the matrix - so the resultant variable will simply be the inverse distance weight. Also if you have repeated measures over time you probably want your spatial weights matrix to account for that (e.g. only calculate spatial lags within the same year).

Let’s produce a minimal exmaple. We assume three locations and correspond latitude and longtitude.

参考在Stack Overflow写的一个答案

  1. Build a weight matrix.
  2. Double check the diag is all zero.
  3. Normalize …
  4. One feature (X_1)
  5. The variance of w %*% x is smaller than x.
library(data.table)
sp <- data.table(
  location = letters[1:3]
  ,lat = c(20,40,60)
  ,lon = c(50,70,90)
)
sp
##    location lat lon
## 1:        a  20  50
## 2:        b  40  70
## 3:        c  60  90
library(tidyverse)
w_raw <- 
  sp %>% 
  dist(.$lat,.$lon,method = "euclidean", upper=TRUE) %>% 
  as.matrix()
w_raw
##          1        2        3
## 1  0.00000 34.64102 69.28203
## 2 34.64102  0.00000 34.64102
## 3 69.28203 34.64102  0.00000
w <- w_raw * (1/rowSums(w_raw))
x <- c(5,10,5)
w %*% x
##       [,1]
## 1 6.666667
## 2 5.000000
## 3 6.666667
library(broom)
lm(x ~ (w %*% x)) %>% 
    tidy
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    25.0   3.02e-14   8.29e14 7.68e-16
## 2 w %*% x        -3.00  4.90e-15  -6.13e14 1.04e-15

尝试检验 spatial lag term 是显著的,也可以通过 Moran I 检验

w
##           1         2         3
## 1 0.0000000 0.3333333 0.6666667
## 2 0.5000000 0.0000000 0.5000000
## 3 0.6666667 0.3333333 0.0000000
x
## [1]  5 10  5
w %*% x
##       [,1]
## 1 6.666667
## 2 5.000000
## 3 6.666667

分步解答

c(w[1,] %*% x,w[2,] %*% x,w[3,] %*% x)
## [1] 6.666667 5.000000 6.666667

矩阵相乘参考Blog

分析

w[1,] %*% x == 
w[1,1] * x[1] + 
w[1,2] * x[2] + 
w[1,3] * x[3]
##      [,1]
## [1,] TRUE
  1. w[1,i]为样本1的和各个其他样本的距离,且进行了标准化。
  2. x为变量的值。

因此这个计算方式可以简化为

x_data <- data_frame(
    id = 1:3
    ,x = x
)
w_long <- 
w %>% 
    as.data.frame() %>% 
    rownames_to_column('id_left') %>% 
    gather(id_right,weight,-id_left)
w_long %>% 
    left_join(
        x_data %>% 
            mutate(id = as.character(id))
        ,by = c('id_right' = 'id')
    ) %>% 
    group_by(id_left) %>% 
    summarise(
        x_multipleby_w = sum(weight * x)
    )
## # A tibble: 3 x 2
##   id_left x_multipleby_w
##   <chr>            <dbl>
## 1 1                 6.67
## 2 2                 5.00
## 3 3                 6.67

因此按照这个逻辑,翻译成 SQL,可以在 Impala 和 Hive 中进行 spatial lag term 的构造。