Day 02 · 手写教程

相关性矩阵

Pearson 不等于相关性的全部

返回方法库 · 共用案例 · 变量百科

相关性矩阵是什么?

相关性矩阵是在回归前看变量之间的两两关系。它告诉你两个变量是不是一起升降,但它不控制其他变量,也不回答因果关系。本案例里,patent_countdfi_index 的 Pearson 相关系数是 0.4395,只能说明二者在样本中同向变化。

相关矩阵还有另一个用途:提前观察控制变量之间有没有非常高的相关。如果两个控制变量高度重合,后面的 VIF 可能会提醒你存在多重共线性。

这一步怎么做?

把因变量、核心解释变量和控制变量一起放进矩阵。不要只算 patent_countdfi_index,那样看不到控制变量之间的关系。Stata 里常用 pwcorr,可以加 sig 输出显著性,再用 star(0.05) 标出 5% 水平显著的相关。

结果应该怎么读?

第一眼看核心变量:Patent 与 DFI 的相关系数为 0.4395,方向与“数字金融发展和企业创新同向变化”的直觉一致。第二眼看控制变量之间:如果有相关系数接近 0.7 或更高,就要在下一步 VIF 里重点检查。第三眼看 Pearson 和 Spearman 是否差很多;差很多通常说明变量右偏、极端值或非线性关系明显。

论文里怎么写?

可以写:相关性分析显示,企业创新与数字普惠金融指数呈正相关关系,相关系数为 0.4395。该结果为后续回归提供了初步描述性证据,但尚不能解释为因果关系。控制变量之间未出现异常高相关,初步看不存在严重线性重合。

常见错误

最常见的错误是把“相关显著”写成“影响显著”。相关矩阵没有固定效应,没有控制变量,也没有处理内生性,所以它只能放在数据描述部分,不能当成研究假设的正式证据。

本页案例代码和输出

下面这部分是本教程对应的实际案例材料,方便你把前面的解释和真实输出对上。

Stata 代码

log using "/root/workspace/empirical-wizard/workspace/5b186599/analysis.log", replace text
global JOB_DIR "/root/workspace/empirical-wizard/workspace/5b186599"
set more off
adopath + "/root/ado/plus"
global DATA_PATH "/root/workspace/empirical-wizard/workspace/test_e2e/csmar_innovation.csv"
import delimited "/root/workspace/empirical-wizard/workspace/test_e2e/csmar_innovation.csv", clear case(preserve)
capture confirm global JOB_DIR
if _rc global JOB_DIR "."
quietly duplicates drop

local statvars "patent_count dfi_index roa lev size growth cashflow tobinq top1 dual"
local timevar "year"
local idvar "stkcd"
local dvvar "patent_count"
local ivvar "dfi_index"

tempfile ewiz_desc
tempname ewiz_post
postfile `ewiz_post' str64 variable double count mean sd min p25 p50 p75 max using `ewiz_desc', replace
foreach v of local statvars {
    capture confirm numeric variable `v'
    if !_rc {
        quietly summarize `v', detail
        post `ewiz_post' ("`v'") (r(N)) (r(mean)) (r(sd)) (r(min)) (r(p25)) (r(p50)) (r(p75)) (r(max))
    }
}
postclose `ewiz_post'

preserve
use `ewiz_desc', clear
rename count N
export delimited using "$JOB_DIR/描述性统计.csv", replace
restore

quietly pwcorr `statvars', sig
matrix EWIZ_C = r(C)
matrix EWIZ_P = r(sig)
local nvars : word count `statvars'
preserve
clear
set obs `nvars'
gen str64 variable = ""
forvalues i = 1/`nvars' {
    local vname : word `i' of `statvars'
    replace variable = "`vname'" in `i'
}
forvalues j = 1/`nvars' {
    gen c`j' = .
}
forvalues i = 1/`nvars' {
    forvalues j = 1/`nvars' {
        replace c`j' = EWIZ_C[`i', `j'] in `i'
    }
}
forvalues j = 1/`nvars' {
    local vname : word `j' of `statvars'
    rename c`j' `vname'
}
export delimited using "$JOB_DIR/相关系数矩阵.csv", replace
restore

* 同时导出 p 值矩阵(上三角 p、下三角系数约定:这里直接把 p 矩阵单独导出
* 供 reporter 读取;渲染时组合成『下三角系数+星号 / 上三角 p』学术格式)
preserve
clear
set obs `nvars'
gen str64 variable = ""
forvalues i = 1/`nvars' {
    local vname : word `i' of `statvars'
    replace variable = "`vname'" in `i'
}
forvalues j = 1/`nvars' {
    gen p`j' = .
}
forvalues i = 1/`nvars' {
    forvalues j = 1/`nvars' {
        capture replace p`j' = EWIZ_P[`i', `j'] in `i'
    }
}
forvalues j = 1/`nvars' {
    local vname : word `j' of `statvars'
    rename p`j' `vname'
}
export delimited using "$JOB_DIR/相关系数矩阵_pvalue.csv", replace
restore

foreach v in `dvvar' `ivvar' {
    if "`v'" != "" {
        capture confirm numeric variable `v'
        if !_rc {
            quietly inspect `v'
            local _nunique = r(N_unique)
            quietly summarize `v', meanonly
            local _vmin = r(min)
            local _vmax = r(max)
            // Treat as dummy ONLY when there are exactly 1-2 unique values AND they are 0/1.
            if `_nunique' <= 2 & `_vmin' >= 0 & `_vmax' <= 1 & `_vmax' - `_vmin' <= 1 & `_vmax' == int(`_vmax') & `_vmin' == int(`_vmin') {
                graph bar (mean) `v', title("`v' Distribution") ytitle("Share") blabel(bar, format(%9.3f))
            }
            else {
                histogram `v', normal title("`v' Distribution") xtitle("`v'") ytitle("Frequency")
            }
            capture graph export "$JOB_DIR/distribution_`v'.png", replace width(1600)
        }
    }
}

if "`timevar'" != "" & "`dvvar'" != "" {
    capture confirm variable `timevar'
    if !_rc {
        preserve
        keep `timevar' patent_count dfi_index
        collapse (mean) patent_count dfi_index, by(`timevar')
        sort `timevar'
        if "`ivvar'" != "" {
            twoway (line `dvvar' `timevar', lcolor(navy) lwidth(medthick) msymbol(o)) (line `ivvar' `timevar', yaxis(2) lcolor(maroon) lpattern(dash) msymbol(diamond)), title("Core Variable Trends") xtitle("`timevar'") ytitle("`dvvar' mean", axis(1)) ytitle("`ivvar' mean", axis(2)) legend(order(1 "`dvvar'" 2 "`ivvar'"))
        }
        else {
            twoway line `dvvar' `timevar', lcolor(navy) lwidth(medthick) msymbol(o) title("Outcome Trend") xtitle("`timevar'") ytitle("`dvvar' mean")
        }
        capture graph export "$JOB_DIR/descriptive_trend.png", replace width(1800)
        restore
    }
}

if "`idvar'" != "" & "`timevar'" != "" {
    capture confirm variable `idvar'
    capture confirm variable `timevar'
    if !_rc {
        preserve
        collapse (count) obs_count=`timevar' (min) start_period=`timevar' (max) end_period=`timevar', by(`idvar')
        export delimited using "$JOB_DIR/panel_structure.csv", replace
        restore
    }
}

di "描述性统计与相关分析输出完成"
log close

输出表

variablepatent_countdfi_indexroalevsizegrowthcashflowtobinqtop1dual
patent_count1.43950948.24073994.013024604.17620607-.027265593.033593308.019011619-.032499224.0060554333
dfi_index.439509481.010009783.0087096244.03340555-.021365453.012453728-.0013975492-.0043962095-.0011652668
roa.24073994.0100097831.026359176-.039585255.044941634.02670249.015529391-.0061969026.025583275
lev.013024604.0087096244.0263591761.070986755-.027704539.020903539.041491505.0043391068.00078530196
size.17620607.03340555-.039585255.0709867551.0048936065-.062741034-.030523267-.033124272.010508359
growth-.027265593-.021365453.044941634-.027704539.00489360651-.012840015.042992894-.01781022.0097842375
cashflow.033593308.012453728.02670249.020903539-.062741034-.0128400151-.016293354.013908384.056830361
tobinq.019011619-.0013975492.015529391.041491505-.030523267.042992894-.0162933541-.0075492994-.0093645761
top1-.032499224-.0043962095-.0061969026.0043391068-.033124272-.01781022.013908384-.00754929941.01563069
dual.0060554333-.0011652668.025583275.00078530196.010508359.0097842375.056830361-.0093645761.015630691

案例图

相关分析 的共用案例输出图。
相关分析 的共用案例输出图。

写作检查

本文在共用企业面板样本上报告相关分析,核心输出见 相关系数矩阵.csv。结果解释时同时关注样本口径、变量构造、系数方向、标准误和适用前提,避免只凭单个 p 值完成方法选择。

  • 确认本页使用的因变量、核心解释变量、控制变量与论文主模型一致。
  • 先看表格里的样本口径,再看系数、p 值或诊断指标。
  • 代码里的输出文件名要能对应网页展示的结果表。

返回方法库 · 打开 empirical-wizard