Probability and Statistics - Linear Regression Diagnostics: Heteroscedasticity and Autocorrelation

I. Linear Regression Model Diagnostics

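Linear regression is derived under a set of assumptions. These assumptions simplify the mathematics, but they also carry risk. Regression diagnostics check whether the model is oversimplified and whether any of the assumptions needs to be revised.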

The conditions of the Gauss-Markov theorem are:

The error term has (conditional) expectation zero: ${\rm E}\left(\varepsilon_i\right)=0,$
The error terms all have the same variance (homoscedasticity): ${\rm var}\left(\varepsilon_i\right)=\sigma^2<\infty,$
The error terms are not autocorrelated (no autocorrelation): ${\rm cov}\left(\varepsilon_i,\varepsilon_j\right)=0,\ i\not=j$

[Note] On the zero-mean condition

The mean of the residuals is always zero provided that the regression includes a constant term. Without a constant term:

$R^2$ (ESS/TSS, with ESS taken as TSS minus RSS) can be negative.
The slope coefficient estimate can be biased.

The sections below therefore discuss only heteroscedasticity and autocorrelation.
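
The note on the zero-mean condition can be checked numerically. The following is a minimal sketch on simulated data (the data-generating process, the seed, and the helper names `ols` and `r_squared` are assumptions made for illustration, not part of the original post). It fits the same data with and without a constant term and compares the residual mean, the slope estimate, and $R^2$ computed as $1 - RSS/TSS$.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit of y on the columns of X, used exactly as given."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def r_squared(resid, y):
    """R^2 computed as 1 - RSS/TSS, with TSS measured around the mean of y."""
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Hypothetical simulated data: true intercept 5, true slope 1.5.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)
y = 5.0 + 1.5 * x + rng.normal(size=200)

# Fit with a constant column, then without one.
beta_c, resid_c = ols(np.column_stack([np.ones_like(x), x]), y)
beta_n, resid_n = ols(x[:, None], y)

print("with constant:    mean resid =", resid_c.mean(),
      " slope =", beta_c[1], " R2 =", r_squared(resid_c, y))
print("without constant: mean resid =", resid_n.mean(),
      " slope =", beta_n[0], " R2 =", r_squared(resid_n, y))
```

In this setup the true intercept is far from zero, so dropping the constant forces the slope to absorb it, which is why the no-constant fit is expected to look much worse on all three measures.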

II. Detection of Heteroscedasticity

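There are three tests for heteroscedasticity:

White test (the most commonly used): assumes the error variance is related to the regressors or to quadratic combinations of them (squares and cross products).
Breusch-Pagan (BP) / Godfrey test (the most general): assumes the error variance may be related to other variables.
Goldfeld-Quandt test (more cumbersome, and applicable only to cross-sectional data): assumes the error variance is related to one particular variable.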

1. Breusch-Pagan (BP) / Godfrey heteroscedasticity test procedure

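Estimate the regression $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k + u$ by OLS and obtain the squared OLS residuals $\hat u^2$.
Estimate the auxiliary regression $\hat u^2 = \delta_0 + \delta_1x_1 + \delta_2x_2 + ... + \delta_kx_k + error$ and obtain its R-squared, $R_{\hat u^2}^2$.
Compute the F or LM statistic and the corresponding p-value (using the $F_{k,n-k-1}$ distribution for the former and the $\chi_k^2$ distribution for the latter). If the p-value is small enough, that is, below the chosen significance level, reject the null hypothesis of homoscedasticity.

The F statistic can be written as: $F = \frac{R_{\hat u^2}^2 /k}{ (1-R_{\hat u^2}^2) / (n - k - 1)}$
The LM statistic can be written as: $LM = n \times R_{\hat u^2}^2$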
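
The BP procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a reference implementation: the simulated data, the seed, and the helper names `ols_fit` and `breusch_pagan` are all hypothetical. It returns only the F and LM statistics. P-values would then come from the $F_{k,n-k-1}$ and $\chi_k^2$ distributions.

```python
import numpy as np

def ols_fit(X, y):
    """OLS of y on X (X already contains a constant column).
    Returns the residuals and R^2 = 1 - RSS/TSS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return resid, 1.0 - resid @ resid / tss

def breusch_pagan(x, y):
    """x has shape (n, k) and holds the regressors without the constant.
    Returns (F, LM) as defined in the text."""
    n, k = x.shape
    X = np.column_stack([np.ones(n), x])
    u_hat, _ = ols_fit(X, y)            # step 1: OLS residuals
    _, r2 = ols_fit(X, u_hat ** 2)      # step 2: regress u_hat^2 on the x's
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    lm_stat = n * r2
    return f_stat, lm_stat

# Hypothetical data: k = 2 regressors, n = 500 observations.
rng = np.random.default_rng(1)
n = 500
x = np.column_stack([rng.uniform(1.0, 5.0, n), rng.normal(size=n)])
e = rng.normal(size=n)
y_hom = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] + e             # constant variance
y_het = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] + x[:, 0] * e   # sd grows with x1

f_hom, lm_hom = breusch_pagan(x, y_hom)
f_het, lm_het = breusch_pagan(x, y_het)
print("homoscedastic   F, LM:", f_hom, lm_hom)
print("heteroscedastic F, LM:", f_het, lm_het)
```

With $k = 2$, the 5% critical value of $\chi_2^2$ is about 5.99, so an LM statistic well above that would lead to rejecting homoscedasticity.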

2. White heteroscedasticity test procedure

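Estimate the model $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k + u$ by OLS. Obtain the OLS residuals and fitted values, and compute the squared residuals $\hat u^2$ and the squared fitted values $\hat y^2$.
Run the regression $\hat u^2 = \delta_0 + \delta_1\hat y + \delta_2\hat y^2 + error$ and obtain its R-squared, $R_{\hat u^2}^2$.
Compute the F or LM statistic and its p-value. This auxiliary regression has two regressors, so use the $F_{2,n-3}$ distribution for the former and the $\chi_2^2$ distribution for the latter.

[Intuition] The White test regression with three independent variables

The large number of regressors is a weakness of the pure White test: it uses up many degrees of freedom even in a model with only a few independent variables.
Heteroscedasticity can instead be tested by estimating $\hat u^2 = \delta_0 + \delta_1\hat y + \delta_2\hat y^2 + error$, because $\hat y$ and $\hat y^2$, once expanded, are a particular combination of the levels, squares, and cross products that appear in the pure White regression below.

$\hat u^2 = \delta_0 + \delta_1x_1 + \delta_2x_2 + \delta_3x_3 + \delta_4x_1^2 + \delta_5x_2^2 + \delta_6x_3^2 + \delta_7x_1x_2 + \delta_8x_1x_3 + \delta_9x_2x_3 + error $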
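
The fitted-value form of the White test, which regresses $\hat u^2$ on $\hat y$ and $\hat y^2$, might be sketched as follows. The simulated three-regressor data, the seed, and the function names `ols` and `white_special` are assumptions for illustration. The degrees of freedom (2 and $n-3$) follow from the two regressors in the auxiliary regression.

```python
import numpy as np

def ols(X, y):
    """Return fitted values, residuals and R^2 = 1 - RSS/TSS.
    X must already contain a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return fitted, resid, r2

def white_special(x, y):
    """White test using fitted values: regress u_hat^2 on 1, y_hat, y_hat^2.
    Returns (F, LM); compare with F(2, n-3) and chi-square(2)."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    y_hat, u_hat, _ = ols(X, y)
    Z = np.column_stack([np.ones(n), y_hat, y_hat ** 2])
    _, _, r2 = ols(Z, u_hat ** 2)
    f_stat = (r2 / 2) / ((1.0 - r2) / (n - 3))
    return f_stat, n * r2

# Hypothetical data with three regressors; the error sd grows with x1.
rng = np.random.default_rng(2)
n = 500
x = np.column_stack([rng.uniform(1.0, 5.0, n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] - 0.3 * x[:, 2] + x[:, 0] * rng.normal(size=n)

f_white, lm_white = white_special(x, y)
print("White (fitted-value form) F, LM:", f_white, lm_white)
```

However many regressors the original model has, this auxiliary regression always uses two degrees of freedom, compared with nine for the pure White regression with three regressors.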

3. Goldfeld–Quandt (GQ) heteroscedasticity test procedure

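Split the total sample of length $T$ into two sub-samples of lengths $T_1$ and $T_2$, and estimate the regression separately on each to obtain the residual variances $s_1^2$ and $s_2^2$.
The null hypothesis: $H_0 : \sigma_1^2 = \sigma_2^2$
The GQ test statistic: $GQ = \frac{s_1^2}{s_2^2}$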
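
As a rough sketch of the GQ steps, the code below splits a sample at a chosen row, fits each part by OLS, and forms the ratio of residual variances. The data, the seed, the function names, and the choice to order the rows so that the noisier observations come first are all assumptions for illustration. Each $s_i^2$ is computed here as $RSS_i/(T_i - k)$, with $k$ the number of estimated coefficients.

```python
import numpy as np

def residual_variance(X, y):
    """s^2 = RSS / (T_i - k) from an OLS fit on one sub-sample."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid / (len(y) - X.shape[1])

def goldfeld_quandt(X, y, split):
    """Split at row `split` and return GQ = s_1^2 / s_2^2
    (first sub-sample over second)."""
    s1 = residual_variance(X[:split], y[:split])
    s2 = residual_variance(X[split:], y[split:])
    return s1 / s2

# Hypothetical data, sorted so that x (and hence the error sd) is decreasing.
rng = np.random.default_rng(3)
T = 200
x = np.sort(rng.uniform(1.0, 5.0, size=T))[::-1]
y = 1.0 + 2.0 * x + x * rng.normal(size=T)
X = np.column_stack([np.ones(T), x])

gq_mid = goldfeld_quandt(X, y, T // 2)     # split in the middle
gq_early = goldfeld_quandt(X, y, T // 4)   # split after the first quarter
print("GQ, split at T/2:", gq_mid)
print("GQ, split at T/4:", gq_early)
```

Running it with two different split points gives two different values of the statistic on the same data.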

[Intuition] Drawback of the GQ test

The choice of where to split the sample is usually arbitrary and may crucially affect the outcome of the test.

III. Detection of Autocorrelation

1. Durbin-Watson autocorrelation test

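The test can only detect first-order autocorrelation in the errors, and it cannot be used when the regressors include lags of the dependent variable (such as $Y_{t-1}$). Recommendation: treat the DW statistic as a reference only, and rely on a more general procedure, such as the Breusch-Godfrey LM test below, for formal testing.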

If $e_t$ is the residual associated with the observation at time $t$, then the test statistic is

$d = {\sum_{t=2}^T (e_t - e_{t-1})^2 \over {\sum_{t=1}^T e_t^2}}$, where T is the number of observations.

Since $d$ is approximately equal to $2(1 - r)$, where $r$ is the sample autocorrelation of the residuals, $d = 2$ indicates no autocorrelation. Values of $d$ toward 0 point to positive autocorrelation, and values toward 4 point to negative autocorrelation.

The DW test has two critical values, an upper one ($d_U$) and a lower one ($d_L$). Between them lies an intermediate region where the test is inconclusive about $H_0$.
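
The statistic $d$ and the approximation $d \approx 2(1 - r)$ can be illustrated with a short sketch. The simulated AR(1) errors, the seed, and the helper names are assumptions for illustration. The critical values $d_L$ and $d_U$ come from published tables and are not computed here.

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def lag1_autocorr(e):
    """Sample first-order autocorrelation of the residuals."""
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

def ols_residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def ar1_errors(rho, T, rng):
    """Simulate u_t = rho * u_{t-1} + eps_t with standard normal eps."""
    eps = rng.normal(size=T)
    u = np.empty(T)
    u[0] = eps[0]
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]
    return u

# Hypothetical data: one regressor, T = 500 observations.
rng = np.random.default_rng(4)
T = 500
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])

e_ar = ols_residuals(X, 1.0 + 2.0 * x + ar1_errors(0.7, T, rng))   # autocorrelated
e_wn = ols_residuals(X, 1.0 + 2.0 * x + rng.normal(size=T))        # white noise

d_ar, r_ar = durbin_watson(e_ar), lag1_autocorr(e_ar)
d_wn = durbin_watson(e_wn)
print("AR(1) errors: d =", d_ar, " 2(1 - r) =", 2 * (1 - r_ar))
print("white noise:  d =", d_wn)
```

The gap between $d$ and $2(1 - r)$ equals $(e_1^2 + e_T^2)/\sum_t e_t^2$, so the two become nearly indistinguishable as $T$ grows.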

2. Breusch-Godfrey LM test

The Breusch-Godfrey LM test (serial correlation LM test) uses the residuals to test for autocorrelation at any lag up to $p$.

Consider a linear regression of any form, for example

$Y_t = \alpha_0+ \alpha_1 X_{t,1} + \alpha_2 X_{t,2} + u_t \,$

where the residuals might follow an $AR(p)$ autoregressive scheme, as follows:

$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + \varepsilon_t. \, $

The simple regression model is first fitted by ordinary least squares to obtain a set of sample residuals $\hat{u}_t$.

Breusch and Godfrey proved that, if the following auxiliary regression model is fitted

$\hat{u}_t = \alpha_0 + \alpha_1 X_{t,1} + \alpha_2 X_{t,2} + \rho_1 \hat{u}_{t-1} + \rho_2 \hat{u}_{t-2} + \cdots + \rho_p \hat{u}_{t-p} + \varepsilon_t \,$

and if the usual $R^2$ statistic is calculated for this model, then the following asymptotic approximation can be used for the distribution of the test statistic

$n R^2\,\sim\,\chi^2_p, \,$

when the null hypothesis ${H_0: \lbrace \rho_i = 0 \text{ for all } i \rbrace }$ holds (that is, there is no serial correlation of any order up to p). Here n is the number of data-points available for the second regression, that for $\hat{u}_t$,

$n=T-p, \, $

where $T$ is the number of observations in the basic series. Note that the value of n depends on the number of lags of the error term (p).
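
The auxiliary regression described above might be coded as follows. This is a sketch under assumptions (simulated AR(1) errors, a hypothetical function name `breusch_godfrey`, and an arbitrary choice of $p = 2$) that follows the text in dropping the first $p$ observations, so that $n = T - p$.

```python
import numpy as np

def breusch_godfrey(X, y, p):
    """Return the LM statistic n * R^2 with n = T - p.
    X must already contain a constant column."""
    T = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta                                   # first-stage residuals
    # Column j holds u_{t-j} for t = p, ..., T-1 (zero-based indexing).
    lags = np.column_stack([u[p - j:T - j] for j in range(1, p + 1)])
    Z = np.column_stack([X[p:], lags])                 # original regressors + lagged residuals
    dep = u[p:]
    gamma, *_ = np.linalg.lstsq(Z, dep, rcond=None)
    resid = dep - Z @ gamma
    r2 = 1.0 - resid @ resid / np.sum((dep - dep.mean()) ** 2)
    return (T - p) * r2

# Hypothetical data: two regressors, AR(1) errors with rho = 0.6.
rng = np.random.default_rng(5)
T = 400
x = rng.normal(size=(T, 2))
X = np.column_stack([np.ones(T), x])
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0]
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + eps[t]
y = 1.0 + 0.5 * x[:, 0] - 0.2 * x[:, 1] + u

lm_bg = breusch_godfrey(X, y, p=2)
print("Breusch-Godfrey LM (p = 2):", lm_bg)
```

Under $H_0$ the statistic is compared with $\chi^2_p$. For $p = 2$ the 5% critical value is about 5.99.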

References

wiki - Goldfeld–Quandt test
https://en.wikipedia.org/wiki/Goldfeld%E2%80%93Quandt_test

Tuesday, November 10, 2015