第 41 章 分析策略

41.1 明确分析目的

作为统计学家,着手分析数据之前,千万记得,必须要制定一个尽可能详尽的分析计划。即使你的分析,可能并不一定受到第三方的监管或者调控,因为同行评审的专家们,喜欢看到你分析的目的明确,假设检验的过程是经过仔细推敲的。同时,也可以避免陷入玩弄数据。

数据分析的目的,可以分成三大类:

  1. 估计一个或者几个暴露变量,对结果变量的影响。以此目的的数据分析过程,需要我们有医学侦探一样的眼光和见解,从数据中判断那些需要被调整和控制的混杂因子,从而提高你的分析效率。最常见的例子是分析随机对照临床实验 (RCT)中,疗效的差异;或者流行病学研究中,分析某种生活习惯,和疾病的发生或者死亡之间的关系。

  2. 在现有的数据库中,寻找并且建立 “最佳”模型。以此目的的数据分析,需要我们对模型中的结果变量有极为深入的了解,把与之相关的所有要因,尽可能多的纳入你的分析模型。常见的例子如,在某个特定人群的数据库中寻找并确定能够决定自杀这一结果变量的决定性因素,之所以有这样的目的,背后可能有决策者希望寻找这些决定性因素后采取一些对策从而达到改善现状的最终目的。所以找到和结果变量相关的因素,是此类研究的重中之重。

  3. 建立预测模型。例如,某项研究的目的是为了能够建立一个能够预测孕期胎儿患有唐氏综合症的预测模型,用能够测量的一些指标(如血液指标,或者母亲的一些健康指标),通过模型的算法,去计算胎儿患病的概率是多大。这样的模型,对与诊断医学有重大意义。所以,此类研究的目的,不是为了寻找确定和胎儿患病相关的全部要因,而是怎样才能提高模型预测的准确度,提高诊断的效率,减少错误诊断,拯救生命。

当然,上述目的中的2 和3 有时候易让人混淆,因为我们可能建立最佳模型,除了想要找到和“自杀”这一结果相关的所有要因,还可能希望通过该模型做出预测,寻找可能自杀的高危人群,进行干预。这并不矛盾。

41.2 分析目的 1.1 – 估计 RCT 中治疗效果 (treatment effect)

先拣最软的柿子捏,RCT 的疗效比较作为数据分析的目的时,情况要比其他的目的相对简单些。 RCT 的随机过程,确保了临床试验不会受到混杂因素的影响。但是我们还会出于为了提高统计分析效能改善估计的精确度的目的,对参与临床试验的受试者最初测量的一些特征进行调整。当然,不是所有的数据专家,也不是所有的 RCT 实施者都同意进行这一调整的。如果确定要调整,放入模型中的变量,可能常常是一开始随机分配时用到的那些用于将受试者分层归类或者最小化 (minimisation) 的那些变量。

基线值调整 (baseline adjustment),在结果变量为连续型,同时模型是线性回归模型时,能够显著提高统计效能 (statistical efficiency),降低估计值的标准误。理论上,一个基线测量时的连续型变量,如果它和实验后测量的连续型结果变量之间的皮尔森相关系数Pearson correlation coefficient\(r\),那么如果你用ANCOVA 模型调整了这个基线值的话,疗效差异估计值的标准误会是没有调整时的\(\sqrt{1-r^2}\) 倍(也就是永远比不调整时要小,大大提高精确度,缩小疗效差异估计值的95% 置信区间)。

但是,但是,但是!如果一个RCT 测量的结果变量是一个二进制变量(死亡/存活),线性回归模型不适用,只能使用逻辑回归时,模型中加入和结果变量相关(和暴露变量无关) 的基线值的做法对分析效能的提高显得十分有限,相反还会受到逻辑回归的不可压缩性较大的影响(Section 39.3.2)。

再把之前讲逻辑回归不可压缩性时用过的例子拿过来这里解释这个现象:

Non-collapsibility of logit link in GLM (stratified data)
Strata 1 </ th> Strata 2 </ th>
Outcome Drug Placebo Drug Placebo
Success 90 50 50 10
Failure 10 50 50 90
Total 100 100 100 100
Odds Ratios 9 9

上面的数据表示,分层变量 (Strata 1-2) 本身和使用药物和安慰剂无交互作用,也和药物使用与临床试验结果之间的关系无关。但是,即使这个分类变量无关,压缩后的数据计算获得的比值比和分层时的比值比差异巨大:

Non-collapsibility of logit link in GLM (collapsed data)
Outcome Drug Placebo
Success 140 60
Failure 60 140
Total 200 200
Odds ratio 5.4

实际在 R 里拟合逻辑回归模型的结果如下:


Call:
glm(formula = Result ~ Treatment, family = binomial(link = logit), 
    data = RCT)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)        0.8473     0.1543   5.491 3.99e-08 ***
TreatmentPlacebo  -1.6946     0.2182  -7.766 8.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 554.52  on 399  degrees of freedom
Residual deviance: 488.69  on 398  degrees of freedom
AIC: 492.69

Number of Fisher Scoring iterations: 4

Call:
glm(formula = Result ~ Treatment + Strata, family = binomial(link = logit), 
    data = RCT)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.1972     0.2651   8.290  < 2e-16 ***
TreatmentPlacebo  -2.1972     0.2749  -7.994 1.31e-15 ***
Strata2           -2.1972     0.2749  -7.994 1.31e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 554.52  on 399  degrees of freedom
Residual deviance: 407.29  on 397  degrees of freedom
AIC: 413.29

Number of Fisher Scoring iterations: 4

从结果的回归系数估计和计算的标准误来看,调整了其他的变量会引起:

  1. 使对数比值比的估计量升高 (这是由于模型的不可压缩性) \(1.69 \rightarrow 2.19\)

  2. 对数比值比的标准误估计升高 (非但不能增加估计精确度,反而起到了反作用) \(0.22\rightarrow0.27\)

  3. 对数比值比的统计检验量升高 (由于对数比值比的升高比标准误升高的更多一些) \(7.77\rightarrow7.99\)

事实上,上面的现象在使用逻辑回归的时候基本上都会呈现。在经典论文 (L. D. Robinson and Jewell 1991) 中给出了详细的论证。所以其实使用逻辑回归拟合数据的RCT 临床试验,我们可以推论,当模型中加入第三个仅和结果变量有关的基线共变量 (baseline covariates),如果模型估计的对数比值比在调整前后变化不大(即,不可压缩性造成的影响很小),那这样的调整对于改善分析的统计效能上几乎也没有贡献。 (跟使用线性回归的 RCT 完全不同!)

由于逻辑回归受使用\(\text{logit}\) 链接方程时不可压缩性的局限,同时还由于使用\(\text{log}\) 链接方程时获得的危险度比(risk ratios) 比比值比(odds ratios ) 更加容易让人理解,结果变量为二进制的RCT 临床试验常常会选用\(\text{log}\) 链接方程的广义线性回归模型(见Section ?? 第5 条讨论) 。选用\(\text{log}\) 链接方程的GLM 最大的问题在于,当模型中加入过多的预测变量时,会导致模型**无法收敛(converge)–无法找到极大似然估计* *。

至于使用泊松回归模型的时候,预测变量如果放入不合理,那么很容易违反泊松分布的前提 (方差和均值相同)。对于违反了泊松分布前提,模型变得过度离散(over-dispersed) 的GLM,加入适当的基线共变量(baseline covariates) 则有助于减少模型的过度离散,减小参数估计的标准误(使之变得更精确些)。和线性回归相同的是,泊松回归模型不受不可压缩性 (non-collapsibility) 的影响。

41.2.1 RCT 数据分析的一些不成熟的小建议

  1. RCT 临床试验通常都有严格的数据管理和监控,且统计分析计划 (statistical analysis plan, SAP) 在任何一个 RCT 都已经是必须条件。除此之外,还要在试验进行前就制订所有详细的计划,并写成实验实施计划文件,以供参与的所有人及伦理审查委员会等各种第三方机构的监督。所以,RCT的统计分析计划必须尽量考虑到所有的可能情况,因为一旦开始了试验,分析计划是很难改动的。

  2. SAP 必须详细记录哪些共变量需要被调整,常见的是实验设计阶段用于实施随机化过程的那些特征变量。对于连续型结果变量,(还有过度离散的计数型变量),基线共变量的调整许多时候会有助于改善参数估计的精确度,提高统计效能。对于使用逻辑回归模型的试验,调整基线共变量则没有太多的好处,且调整后的比值比的含义会发生较大的改变,需慎重。

  3. 有些统计学家支持调整基线共变量,认为这样做有助于减少万一随机化不彻底造成的治疗组和对照组之间随机产生的残差偏倚(residual bias),但是你无法提前欲知那些变量可能会产生随机的残差偏倚,这样便无法在事先需要准备的SAP计划文件中明确到底哪些基线变量需要被调整。

  4. 另有许多研究者喜欢在RCT 中寻找交互作用的存在,但是他们常常忽略掉的一点是,一个RCT本身的检验效能是80%-90%,其用于检验交互作用的效能会更低。建议在 RCT 中尽量少 (甚至不建议) 进行任何交互作用的统计检验。

41.3 分析目的 1.2 – 估计流行病学研究中暴露变量和结果变量的关系 (exposure effect)

前文讨论的关于调整仅仅和结果变量相关 (与暴露变量无关) 的基线共变量的内容,同样适用与一般的流行病学研究。流行病学研究中另一个 (应该是更加) 重要的点是,混杂因子的排查和调整。

实例:

  • \(Y\) 标记结果变量,如婴儿的出生体重;
  • \(X_1\) 标记最主要的 (想要分析其与结果变量之间的关系的) 预测变量,如母亲孕期高血压 (是/否);
  • \(X_2, X_3, \cdots, X_Q\) 标记其他非主要预测变量,但是可能是\(X_1, Y\) 之间关系中重要的潜在混杂因子,如婴儿的性别/母亲孕前体重/婴儿胎龄等等。

在这个简单流行病学研究实例中,我们关心的问题包括:

  1. 主要暴露变量–孕期高血压,和结果变量–婴儿出生体重二者的未调整前 (粗) 关系 (crude/before adjustment association) 是什么样的?

  2. 主要暴露变量和结果变量之间的关系是否被其他因素影响 (例如胎龄)?如果有,那么调整后的关系会发生怎样的变化?

  3. 有没有其他的变量会改变 (modify)主要暴露变量和结果变量之间的关系?也就是,有没有那个变量和主要暴露变量有交互作用?

  4. 有没有其他的变量和主要暴露变量无关,却可能和结果变量有关系呢?如果存在这样的变量,模型中调整它在一些情况下可能会改善拟合的结果提高模型的统计效能 (statistical power)。

  5. 收集的变量中,有没有哪个变量可能是在主要暴露变量和结果变量之间因果关系 (如果存在因果关系的话) 的通路上 (on the causal pathway) 的呢?如果有,这样的变量应该被认为是媒介因子 (mediator)。

41.3.1 不成熟的小策略

这是很常见的简单流行病学数据分析。可以按照 (但不一定非要按照) 下面建议的步骤实施统计分析:

  1. 第一步,分析主要暴露变量和结果变量之间的未调整前 (粗) 关系:
    \[g\{ E(Y|X_1) \} = \alpha + \beta_1 X_1\]
  2. 第二步,逐个分析其余的变量和主要暴露变量之间的关系,以及这些潜在的混杂因子和结果变量之间的关系。注意,这一步可能耗时较长,但是它并不是决定模型中是否要加入某个或某些非主要暴露变量的步骤,通过这一步过程有助于我们分析和理解,进一步分析中调整前后的参数估计变化
  3. 第三步,建立主要暴露变量和这些潜在混杂因子同时放入模型中的GLM,逐步放入,一次放入一个(one at a time) 潜在混杂因子,和上一步分析的三者之间的关系相结合,分析调整该潜在混杂因子前后,主要暴露变量的回归系数的参数估计变化的原因。
    \[g\{ E(Y|X_1, X_k) \} = \alpha^* + \beta_1^*X_1 + \beta_kX_k,\; k= 1,\cdots,Q\]

我们来分析这个可以从 Stata 网站上下载的数据

  • 第一步,先看看暴露变量和结果变量之间的关系
lbw <- haven::read_dta("../Datas/lbw.dta")
lbw$race <- factor(lbw$race)
lbw$smoke <- factor(lbw$smoke)
lbw$ht <- factor(lbw$ht)
a <- Epi::stat.table(list("Birthweight <2500g" = low, "History of hypertension"=ht), list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
print(a, digits = c(percent = 2))
 -------------------------------------- 
              -History of hypertension- 
 Birthweight         0       1   Total  
 <2500g                                 
 -------------------------------------- 
 0                 125       5     130  
                 70.62   41.67   68.78  
                                        
 1                  52       7      59  
                 29.38   58.33   31.22  
                                        
                                        
 Total             177      12     189  
                100.00  100.00  100.00  
 -------------------------------------- 
  • 第二步,分析母亲高血压病史和婴儿低出生体重之间的调整前 (粗) 关系。
Model0 <- glm(low~ht, data = lbw, family = binomial(link = "logit"))
summary(Model0); epiDisplay::logistic.display(Model0)

Call:
glm(formula = low ~ ht, family = binomial(link = "logit"), data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.8771     0.1650  -5.315 1.07e-07 ***
ht1           1.2135     0.6083   1.995   0.0461 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 230.65  on 187  degrees of freedom
AIC: 234.65

Number of Fisher Scoring iterations: 4

Logistic regression predicting low 
 
           OR(95%CI)          P(Wald's test) P(LR-test)
ht: 1 vs 0 3.37 (1.02,11.09)  0.046          0.045     
                                                       
Log-likelihood = -115.3249
No. of observations = 189
AIC value = 234.6499

所以,数据提供了一些证据证明母亲的高血压病史和婴儿低出生体重之间可能存在正关系,这个调整前的关系是,粗比值比 (crude odds ratio) 为 3.37 (1.02, 11.09)。

  • 接下来,分析潜在的混杂因子是否和主要暴露变量相关:
# lwt is the last weight of mothers before pregnancy
Model1 <- lm(lwt ~ ht, data = lbw)
summary(Model1); epiDisplay::regress.display(Model1)

Call:
lm(formula = lwt ~ ht, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.500 -17.944  -7.944  10.056 122.056 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  127.944      2.239  57.143  < 2e-16 ***
ht1           29.556      8.886   3.326  0.00106 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.79 on 187 degrees of freedom
Multiple R-squared:  0.05586,   Adjusted R-squared:  0.05081 
F-statistic: 11.06 on 1 and 187 DF,  p-value: 0.00106
Linear regression predicting lwt
 
           Coeff.(95%CI)        P(t-test) P(F-test)
ht: 1 vs 0 29.56 (12.03,47.09)  0.001     0.001    
                                                   
No. of observations = 189

可见,有高血压病史的母亲,孕前体重较高。再看其与结果变量是否有关系:

Model2 <- glm(low ~ lwt, data = lbw, family = binomial(link = "logit"))
summary(Model2); epiDisplay::logistic.display(Model2)

Call:
glm(formula = low ~ lwt, family = binomial(link = "logit"), data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.995763   0.785243   1.268   0.2048  
lwt         -0.014037   0.006169  -2.276   0.0229 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 228.71  on 187  degrees of freedom
AIC: 232.71

Number of Fisher Scoring iterations: 4

Logistic regression predicting low 
 
                 OR(95%CI)      P(Wald's test) P(LR-test)
lwt (cont. var.) 0.99 (0.97,1)  0.023          0.015     
                                                         
Log-likelihood = -114.354
No. of observations = 189
AIC value = 232.7081

由此知,母亲孕前体重较高的人,有较低的可能生下低出生体重的婴儿。这两个单独的关系,各自看都具有5% 的统计学意义,但是这(或者其他变量分析的结果没有统计学意义时) 并不是决定模型中是否加入母亲孕前体重这一潜在的混杂因子的理由。接下来,我们通过模型中加入母亲孕前体重这一变量前后模型的参数估计变化来分析:

Model3 <- glm(low ~ ht + lwt, data = lbw, family = binomial(link = "logit"))
summary(Model3);epiDisplay::logistic.display(Model3)

Call:
glm(formula = low ~ ht + lwt, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.447862   0.820898   1.764  0.07777 . 
ht1          1.854477   0.700824   2.646  0.00814 **
lwt         -0.018629   0.006593  -2.826  0.00472 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 221.17  on 186  degrees of freedom
AIC: 227.17

Number of Fisher Scoring iterations: 4

Logistic regression predicting low 
 
                 crude OR(95%CI)    adj. OR(95%CI)     P(Wald's test)
ht: 1 vs 0       3.37 (1.02,11.09)  6.39 (1.62,25.23)  0.008         
                                                                     
lwt (cont. var.) 0.99 (0.97,1)      0.98 (0.97,0.99)   0.005         
                                                                     
                 P(LR-test)
ht: 1 vs 0       0.006     
                           
lwt (cont. var.) 0.002     
                           
Log-likelihood = -110.5827
No. of observations = 189
AIC value = 227.1654

加入了孕前体重的模型给出的母亲是否有高血压病史对婴儿的低出生体重关系的比值比估计为\(6.39\),这很明显比调整孕前体重前的粗比值比\((3.37)\) 大了很多。这个比值比估计的变化有两个原因:

  1. (常被忽略的) 逻辑回归模型的不可压缩性导致的;
  2. 母亲孕前体重对高血压病史和婴儿的低出生体重之间的关系造成了混杂效应。

上面的分析结果告诉我们,该数据提供了足够的证据证明母亲孕前体重和是否有高血压病史,在调整了彼此之后,仍然独立地和婴儿低出生体重的发生有相关性。这里,我们可以下结论认为,模型中加入母亲孕前体重作为混杂因子,是合情合理的。

完成了目前为止的初步分析和混杂因子的判断以后,下一阶段的分析侧重于寻找有没有任何第三方的预测变量,会对主要暴露变量 \(X_1\) (孕期高血压) 与结果变量 \(Y\) (婴儿出生体重过低) 之间的关系产生交互作用。如果数据中的预测变量有多个,那可能导致需要分析潜在的交互作用有许多对,通常建议在遇到多个预测变量之间的复杂关系需要讨论的时候,建议不要一股脑全部作交互作用的分析,而是限定一个或者几个最有可能有交互作用的变量就可以了。否则模型过于复杂,反而不利于理解。一般生物医学的统计分析中考虑的重要交互作用分析,需要有重要的生物学意义,常见的例子是年龄,性别等。

本节使用的例子中,令人感兴趣的是,母亲的孕前体重,会不会对妊娠高血压的有无与婴儿出生体重过低之间的关系造成交互作用:

Model4 <- glm(low ~ ht*lwt, family = binomial(link = "logit"), data = lbw)
summary(Model4); epiDisplay::logistic.display(Model4)

Call:
glm(formula = low ~ ht * lwt, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.539312   0.917476   1.678  0.09339 . 
ht1          1.286989   2.549353   0.505  0.61368   
lwt         -0.019380   0.007413  -2.614  0.00894 **
ht1:lwt      0.003732   0.016173   0.231  0.81749   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 221.11  on 185  degrees of freedom
AIC: 229.11

Number of Fisher Scoring iterations: 4

Logistic regression predicting low 
 
                 crude OR(95%CI)    adj. OR(95%CI)          P(Wald's test)
ht: 1 vs 0       3.37 (1.02,11.09)  3.62 (0.02,535.73)      0.614         
                                                                          
lwt (cont. var.) 0.99 (0.97,1)      0.98 (0.97,1)           0.009         
                                                                          
ht1:lwt          -                  1.0037 (0.9724,1.0361)  0.817         
                                                                          
                 P(LR-test)
ht: 1 vs 0       0.605     
                           
lwt (cont. var.) 1         
                           
ht1:lwt          0.819     
                           
Log-likelihood = -110.5567
No. of observations = 189
AIC value = 229.1133

由于交互作用项结果为 ht1:lwt 0.003732 0.016173 0.231 0.81749,无足够的证据证明孕前体重会对妊娠高血压和婴儿出生体重过低之间的关系造成交互作用。

如果确认没有交互作用,建立本例最终模型前的几个建议:

  1. 最终分析 \(X_1, Y\) 之间关系的模型,需要加入我们逐一甄别之后确认过的混淆因子,此时称为模型 1
  2. 对于确认不是\(X_1, Y\) 之间关系的混淆因子的那些剩余变量,逐一加入模型1,比较前后是否模型中各个混淆因子的参数估计是否发生了变化(有没有混淆因子的混淆因子?);
  3. 最终模型中的变量,需要包含前两步确认过的全部混淆因子;
  4. 在报告中把调整前后的参数估计整理成表格。

如果在分析过程中发现了有重要意义的交互作用,那么除了包含全部的混淆因子之外,你的最终模型中还需加入重要的交互作用项。此时需要报告的参数估计是有交互作用项部分的分层比值比/其他指标。

41.3.2 补充

除了使用二项分布的逻辑回归之外,当结果变量是连续型或者计数型,也就是分析模型使用线性回归(ANCOVA),或者(可能过度离散的) 泊松回归时,为了提高模型的统计效能,减小参数估计的标准误,模型可以选择进一步调整一个或几个只和结果变量有关的基线变量。此时,在你写论文或者报告时,必须把这些变量和确认是混杂因子的变量加以区分,因为加它们进入模型的目的不同。

41.4 分析目的 2 和 3 – 建立预测模型 (predictive models)

建立预测模型的过程,其实就是选择哪个或者那些变量进入模型的过程。方法有很多,可惜的是,没有哪种是公认完美的。这里只介绍两种最常见,也最常被批评的方法 – 前/后 逐步选择法 (forward stepwise selection/backward elimination)。强调一下,逐步法本身并不是神奇法术,不同的算法选择的变量自然会有不同,如果你用了逐步选择法,选出来的模型变量仅仅只能作为参考,而不能作为最终结论。

vitc <- haven::read_dta("../Datas/vitC.dta")
vitc$ctakers <- factor(vitc$ctakers)
vitc$sex <- factor(vitc$sex)

stats::step(lm(seruvitc~1,data=vitc[complete.cases(vitc),]),direction="forward",scope=~age+height+weight+sex+cigs+ctakers)
Start:  AIC=575.43
seruvitc ~ 1

          Df Sum of Sq   RSS    AIC
+ ctakers  1    6967.4 42660 563.66
+ sex      1    3688.5 45938 570.40
+ cigs     1    2470.9 47156 572.78
+ height   1    1243.7 48383 575.12
<none>                 49627 575.43
+ age      1     273.6 49353 576.93
+ weight   1       1.5 49625 577.43

Step:  AIC=563.66
seruvitc ~ ctakers

         Df Sum of Sq   RSS    AIC
+ sex     1   2713.53 39946 559.68
+ cigs    1   2150.58 40509 560.96
+ height  1   1049.00 41611 563.40
<none>                42660 563.66
+ age     1    247.94 42412 565.13
+ weight  1     18.23 42641 565.62

Step:  AIC=559.68
seruvitc ~ ctakers + sex

         Df Sum of Sq   RSS    AIC
+ cigs    1   1527.70 38418 558.13
<none>                39946 559.68
+ weight  1    445.30 39501 560.66
+ age     1    183.28 39763 561.26
+ height  1    131.72 39814 561.38

Step:  AIC=558.13
seruvitc ~ ctakers + sex + cigs

         Df Sum of Sq   RSS    AIC
<none>                38418 558.13
+ weight  1   257.152 38161 559.52
+ age     1   205.289 38213 559.65
+ height  1    58.275 38360 560.00

Call:
lm(formula = seruvitc ~ ctakers + sex + cigs, data = vitc[complete.cases(vitc), 
    ])

Coefficients:
(Intercept)     ctakers1         sex1         cigs  
     46.028       19.832        9.764      -12.255  
stats::step(lm(seruvitc~.,data=vitc[complete.cases(vitc),]),direction="backward")
Start:  AIC=563.38
seruvitc ~ serial + age + height + cigs + weight + sex + ctakers

          Df Sum of Sq   RSS    AIC
- height   1       1.5 37273 561.38
- age      1      64.2 37335 561.53
- weight   1     174.6 37446 561.80
- serial   1     808.4 38079 563.33
<none>                 37271 563.38
- cigs     1     979.3 38250 563.74
- sex      1    1123.5 38395 564.08
- ctakers  1    6407.2 43678 575.81

Step:  AIC=561.38
seruvitc ~ serial + age + cigs + weight + sex + ctakers

          Df Sum of Sq   RSS    AIC
- age      1      65.3 37338 559.54
- weight   1     217.5 37490 559.91
- serial   1     807.5 38080 561.33
<none>                 37273 561.38
- cigs     1     978.0 38251 561.74
- sex      1    2584.4 39857 565.48
- ctakers  1    6442.4 43715 573.89

Step:  AIC=559.54
seruvitc ~ serial + cigs + weight + sex + ctakers

          Df Sum of Sq   RSS    AIC
- weight   1     366.6 37704 558.43
- serial   1     823.4 38161 559.52
<none>                 37338 559.54
- cigs     1     944.2 38282 559.81
- sex      1    2816.3 40154 564.16
- ctakers  1    6462.2 43800 572.06

Step:  AIC=558.43
seruvitc ~ serial + cigs + sex + ctakers

          Df Sum of Sq   RSS    AIC
- serial   1     713.9 38418 558.13
<none>                 37704 558.43
- cigs     1    1156.1 38861 559.18
- sex      1    2451.9 40156 562.16
- ctakers  1    6385.1 44090 570.66

Step:  AIC=558.13
seruvitc ~ cigs + sex + ctakers

          Df Sum of Sq   RSS    AIC
<none>                 38418 558.13
- cigs     1    1527.7 39946 559.68
- sex      1    2090.6 40509 560.96
- ctakers  1    5841.0 44259 569.01

Call:
lm(formula = seruvitc ~ cigs + sex + ctakers, data = vitc[complete.cases(vitc), 
    ])

Coefficients:
(Intercept)         cigs         sex1     ctakers1  
     46.028      -12.255        9.764       19.832  

41.5 GLM例9

在本次练习中,我们将尝试使用多个不同的分析策略分析前文中使用过的低体重数据。

41.5.1 Part 1

第一部分,假设我们的主要研究目的是想要知道,在尽可能多的考虑所有潜在的混杂因子之后,患有孕期高血压的孕妇是否和低出生体重婴儿的诞生有关。

41.5.1.1 在你熟悉的统计分析软件中载入该数据,重复前文中做过的分析,即调查母亲的体重是否对主要分析的高血压和低出生体重婴儿之间关系造成混杂。在模型中加入他们的交互作用项试以分析。

lbw <- haven::read_dta("../backupfiles/lbw.dta")

a <- Epi::stat.table(list("Birthweight <2500g" = low, 
                          "History of hypertension"=ht), 
                     list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
print(a, digits = c(percent = 2))
 -------------------------------------- 
              -History of hypertension- 
 Birthweight         0       1   Total  
 <2500g                                 
 -------------------------------------- 
 0                 125       5     130  
                 70.62   41.67   68.78  
                                        
 1                  52       7      59  
                 29.38   58.33   31.22  
                                        
                                        
 Total             177      12     189  
                100.00  100.00  100.00  
 -------------------------------------- 
# get the un-adjusted odds ratios for the association
# between maternal hypertension and low birth weight

M0 <- glm(low ~ ht, data = lbw, family = binomial(link = "logit"))
summary(M0); jtools::summ(M0, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht, family = binomial(link = "logit"), data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.8771     0.1650  -5.315 1.07e-07 ***
ht            1.2135     0.6083   1.995   0.0461 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 230.65  on 187  degrees of freedom
AIC: 234.65

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(1) 4.022134
Pseudo-R² (Cragg-Uhler) 0.029611
Pseudo-R² (McFadden) 0.017139
AIC 234.649863
BIC 241.133357
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.416000 0.301044 0.574852 -5.315016 0.000000
ht 3.365385 1.021427 11.088221 1.994814 0.046063
Standard errors: MLE
# get the association between maternal weight and hypertension

M.linear <- lm(lwt ~ ht, data = lbw)
summary(M.linear); jtools::summ(M.linear, confint = TRUE, digits = 6)

Call:
lm(formula = lwt ~ ht, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.500 -17.944  -7.944  10.056 122.056 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  127.944      2.239  57.143  < 2e-16 ***
ht            29.556      8.886   3.326  0.00106 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.79 on 187 degrees of freedom
Multiple R-squared:  0.05586,   Adjusted R-squared:  0.05081 
F-statistic: 11.06 on 1 and 187 DF,  p-value: 0.00106
Observations 189
Dependent variable lwt
Type OLS linear regression
F(1,187) 11.063918
0.055860
Adj. R² 0.050811
Est. 2.5% 97.5% t val. p
(Intercept) 127.943503 123.526516 132.360489 57.142604 0.000000
ht 29.556497 12.027125 47.085869 3.326247 0.001060
Standard errors: OLS
# get the association between maternal weight and low birth weight

M1 <- glm(low ~ lwt, data = lbw, family = binomial(link = "logit"))
summary(M1); jtools::summ(M1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ lwt, family = binomial(link = "logit"), data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.995763   0.785243   1.268   0.2048  
lwt         -0.014037   0.006169  -2.276   0.0229 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 228.71  on 187  degrees of freedom
AIC: 232.71

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(1) 5.963938
Pseudo-R² (Cragg-Uhler) 0.043683
Pseudo-R² (McFadden) 0.025414
AIC 232.708058
BIC 239.191552
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.706790 0.580837 12.614062 1.268095 0.204764
lwt 0.986061 0.974211 0.998055 -2.275604 0.022870
Standard errors: MLE
# get the association between maternal hypertension and 
# low birth weight adjusted for maternal weight

M2 <- glm(low ~ lwt + ht, data = lbw, family = binomial(link = "logit"))
summary(M2); jtools::summ(M2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ lwt + ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.447862   0.820898   1.764  0.07777 . 
lwt         -0.018629   0.006593  -2.826  0.00472 **
ht           1.854477   0.700824   2.646  0.00814 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 221.17  on 186  degrees of freedom
AIC: 227.17

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(2) 13.506547
Pseudo-R² (Cragg-Uhler) 0.096991
Pseudo-R² (McFadden) 0.057555
AIC 227.165449
BIC 236.890690
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 4.254010 0.851235 21.259240 1.763755 0.077773
lwt 0.981544 0.968942 0.994309 -2.825577 0.004720
ht 6.388358 1.617508 25.230866 2.646137 0.008142
Standard errors: MLE
lmtest::lrtest(M2, M0)
Likelihood ratio test

Model 1: low ~ lwt + ht
Model 2: low ~ ht
  #Df  LogLik Df  Chisq Pr(>Chisq)   
1   3 -110.58                        
2   2 -115.33 -1 9.4844   0.002072 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## we see that odds ratio for the association between
    ## maternal hypertension and low birth weight after 
    ## adjustment for mother's weight. Clear evidence of 
    ## confounding. 

# fit the model with interaction term

M3 <- glm(low ~ lwt*ht, data = lbw, family = binomial(link = "logit"))
summary(M3); jtools::summ(M3, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ lwt * ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.539312   0.917476   1.678  0.09339 . 
lwt         -0.019380   0.007413  -2.614  0.00894 **
ht           1.286989   2.549353   0.505  0.61368   
lwt:ht       0.003732   0.016173   0.231  0.81749   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 221.11  on 185  degrees of freedom
AIC: 229.11

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(3) 13.558683
Pseudo-R² (Cragg-Uhler) 0.097352
Pseudo-R² (McFadden) 0.057777
AIC 229.113313
BIC 242.080301
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 4.661380 0.771894 28.149527 1.677768 0.093392
lwt 0.980807 0.966660 0.995161 -2.614315 0.008941
ht 3.621866 0.024486 535.729382 0.504830 0.613678
lwt:ht 1.003739 0.972420 1.036067 0.230774 0.817490
Standard errors: MLE
    ## No evidence of an interaction. 

41.5.1.2 把前一步中拟合过的模型的母亲体重变量替换为母亲的年龄。你有怎样的结论?

# get the association between age and hypertension

M.linear <- lm(age ~ ht, data = lbw)
summary(M.linear); jtools::summ(M.linear, confint = TRUE, digits = 6)

Call:
lm(formula = age ~ ht, data = lbw)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2599 -4.2599 -0.2599  2.7401 21.7401 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.2599     0.3993  58.254   <2e-16 ***
ht           -0.3432     1.5846  -0.217    0.829    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.312 on 187 degrees of freedom
Multiple R-squared:  0.0002508, Adjusted R-squared:  -0.005095 
F-statistic: 0.04691 on 1 and 187 DF,  p-value: 0.8288
Observations 189
Dependent variable age
Type OLS linear regression
F(1,187) 0.046913
0.000251
Adj. R² -0.005095
Est. 2.5% 97.5% t val. p
(Intercept) 23.259887 22.472202 24.047572 58.253639 0.000000
ht -0.343220 -3.469247 2.782806 -0.216595 0.828760
Standard errors: OLS
M1.1 <- glm(low ~ age, data = lbw, family = binomial(link = "logit"))
summary(M1.1); jtools::summ(M1.1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ age, family = binomial(link = "logit"), data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(1) 2.760038
Pseudo-R² (Cragg-Uhler) 0.020387
Pseudo-R² (McFadden) 0.011761
AIC 235.911958
BIC 242.395452
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.469000 0.349813 6.168898 0.525296 0.599378
age 0.950133 0.893223 1.010669 -1.623194 0.104548
Standard errors: MLE
M1.2 <- glm(low ~ age + ht, data = lbw, family = binomial(link = "logit"))
summary(M1.2); jtools::summ(M1.2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ age + ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.30498    0.74199   0.411   0.6811  
age         -0.05150    0.03194  -1.613   0.1068  
ht           1.21479    0.61213   1.985   0.0472 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 227.93  on 186  degrees of freedom
AIC: 233.93

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(2) 6.745090
Pseudo-R² (Cragg-Uhler) 0.049303
Pseudo-R² (McFadden) 0.028743
AIC 233.926906
BIC 243.652147
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.356594 0.316862 5.808050 0.411028 0.681052
age 0.949800 0.892168 1.011156 -1.612604 0.106831
ht 3.369591 1.015158 11.184607 1.984541 0.047196
Standard errors: MLE
lmtest::lrtest(M0, M1.2)
Likelihood ratio test

Model 1: low ~ ht
Model 2: low ~ age + ht
  #Df  LogLik Df Chisq Pr(>Chisq)  
1   2 -115.33                      
2   3 -113.96  1 2.723    0.09891 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M1.3 <- glm(low ~ age*ht, data = lbw, family = binomial(link = "logit"))
summary(M1.3); jtools::summ(M1.3, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ age * ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.56022    0.77047   0.727   0.4672  
age         -0.06281    0.03342  -1.880   0.0602 .
ht          -4.26977    4.19512  -1.018   0.3088  
age:ht       0.24237    0.18715   1.295   0.1953  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 225.66  on 185  degrees of freedom
AIC: 233.66

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(3) 9.010600
Pseudo-R² (Cragg-Uhler) 0.065472
Pseudo-R² (McFadden) 0.038397
AIC 233.661396
BIC 246.628384
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.751064 0.386795 7.927273 0.727123 0.467150
age 0.939122 0.879583 1.002690 -1.879560 0.060168
ht 0.013985 0.000004 52.065896 -1.017795 0.308776
age:ht 1.274263 0.882989 1.838919 1.295037 0.195308
Standard errors: MLE

并无证据提示年龄是孕期高血压和低出生体重之间的关系的混杂因子 (confounder),也没有证据表明年龄对该关系有交互作用。

41.5.1.3 分析吸烟史是否对孕期高血压和低出生体重之间的关系有混杂作用 (confounding effect)。注意这里吸烟史是一个二进制变量。

M2.1 <- glm(ht ~ smoke, data = lbw, family = binomial(link = "logit"))
summary(M2.1); jtools::summ(M2.1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = ht ~ smoke, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.7362     0.3900  -7.016 2.29e-12 ***
smoke         0.1116     0.6055   0.184    0.854    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 89.386  on 188  degrees of freedom
Residual deviance: 89.352  on 187  degrees of freedom
AIC: 93.352

Number of Fisher Scoring iterations: 5
Observations 189
Dependent variable ht
Type Generalized linear model
Family binomial
Link logit
χ²(1) 0.033747
Pseudo-R² (Cragg-Uhler) 0.000474
Pseudo-R² (McFadden) 0.000378
AIC 93.351859
BIC 99.835353
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.064815 0.030178 0.139206 -7.015634 0.000000
smoke 1.118012 0.341241 3.662956 0.184238 0.853827
Standard errors: MLE
    ## here both the exposure of interest and the potential confounder
    ## are confounder are binary, so we need to use logistic regression
    ## (not linear regression) to explore the relationship between them.
    ## In the logistic regression model either of the two variables 
    ## could be the dependent variable.

M2.2 <- glm(low ~ smoke, data = lbw, family = binomial(link = "logit"))
summary(M2.2); jtools::summ(M2.2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ smoke, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
smoke         0.7041     0.3196   2.203   0.0276 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.80  on 187  degrees of freedom
AIC: 233.8

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(1) 4.867397
Pseudo-R² (Cragg-Uhler) 0.035754
Pseudo-R² (McFadden) 0.020741
AIC 233.804600
BIC 240.288094
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.337209 0.221370 0.513667 -5.062322 0.000000
smoke 2.021944 1.080660 3.783111 2.202647 0.027620
Standard errors: MLE
M2.3 <- glm(low ~ smoke + ht, data = lbw, family = binomial(link = "logit"))
summary(M2.3); jtools::summ(M2.3, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ smoke + ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1787     0.2232  -5.281 1.28e-07 ***
smoke         0.7119     0.3237   2.199   0.0278 *  
ht            1.2300     0.6177   1.991   0.0464 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 225.79  on 186  degrees of freedom
AIC: 231.79

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(2) 8.880061
Pseudo-R² (Cragg-Uhler) 0.064545
Pseudo-R² (McFadden) 0.037840
AIC 231.791935
BIC 241.517177
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.307668 0.198656 0.476499 -5.281299 0.000000
smoke 2.037802 1.080606 3.842877 2.199486 0.027843
ht 3.421389 1.019665 11.480150 1.991493 0.046427
Standard errors: MLE
lmtest::lrtest(M2.3, M0)
Likelihood ratio test

Model 1: low ~ smoke + ht
Model 2: low ~ ht
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   3 -112.90                       
2   2 -115.33 -1 4.8579    0.02752 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M2.4 <- glm(low ~ smoke*ht, data = lbw, family = binomial(link = "logit"))
summary(M2.4); jtools::summ(M2.4, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ smoke * ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.2000     0.2281  -5.260 1.44e-07 ***
smoke         0.7581     0.3360   2.256    0.024 *  
ht            1.4876     0.7971   1.866    0.062 .  
smoke:ht     -0.6403     1.2368  -0.518    0.605    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 225.53  on 185  degrees of freedom
AIC: 233.53

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(3) 9.144894
Pseudo-R² (Cragg-Uhler) 0.066424
Pseudo-R² (McFadden) 0.038969
AIC 233.527102
BIC 246.494090
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.301205 0.192605 0.471038 -5.259761 0.000000
smoke 2.134286 1.104716 4.123392 2.256359 0.024048
ht 4.426667 0.928062 21.114292 1.866305 0.061999
smoke:ht 0.527108 0.046685 5.951510 -0.517766 0.604622
Standard errors: MLE

模型结果和似然比检验的结果提示孕妇的吸烟史与低出生体重有关。但是调整了吸烟史并不太改变孕期高血压和低出生体重之间的比值比 OR。这主要是因为吸烟史和孕期高血压之间没有太多关系。所以,孕妇的吸烟史似乎不太需要被认定为混杂因子,也没有明显的交互作用。

41.5.1.4 分析人种 (race) 是否有混杂效应。注意这里人种变量的分类多于两个。

M3.1 <- glm(ht ~ as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M3.1); jtools::summ(M3.1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = ht ~ as.factor(race), family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.9014     0.4593  -6.317 2.67e-10 ***
as.factor(race)2   0.8645     0.7667   1.128    0.259    
as.factor(race)3   0.1446     0.6905   0.209    0.834    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 89.386  on 188  degrees of freedom
Residual deviance: 88.184  on 186  degrees of freedom
AIC: 94.184

Number of Fisher Scoring iterations: 5
Observations 189
Dependent variable ht
Type Generalized linear model
Family binomial
Link logit
χ²(2) 1.201470
Pseudo-R² (Cragg-Uhler) 0.016816
Pseudo-R² (McFadden) 0.013441
AIC 94.184136
BIC 103.909377
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.054945 0.022333 0.135177 -6.316796 0.000000
as.factor(race)2 2.373913 0.528291 10.667355 1.127653 0.259466
as.factor(race)3 1.155556 0.298542 4.472772 0.209375 0.834155
Standard errors: MLE
    ## Here the potential confounder is a 3-level categorical
    ## variable so we need to treat this as the predictor 
    ## variable when exploring its relationship with the 
    ## exposure of interest. The exposure of interest is 
    ## binary, so we need to use logistic regression. 

M3.2 <- glm(low ~ as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M3.2); jtools::summ(M3.2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ as.factor(race), family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.1550     0.2391  -4.830 1.36e-06 ***
as.factor(race)2   0.8448     0.4634   1.823   0.0683 .  
as.factor(race)3   0.6362     0.3478   1.829   0.0674 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.66  on 186  degrees of freedom
AIC: 235.66

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(2) 5.010366
Pseudo-R² (Cragg-Uhler) 0.036791
Pseudo-R² (McFadden) 0.021351
AIC 235.661630
BIC 245.386871
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.315068 0.197183 0.503433 -4.830132 0.000001
as.factor(race)2 2.327536 0.938507 5.772384 1.823014 0.068301
as.factor(race)3 1.889234 0.955458 3.735596 1.828968 0.067404
Standard errors: MLE
M3.3 <- glm(low ~ as.factor(race) + ht, data = lbw, 
            family = binomial(link = "logit"))
summary(M3.3); jtools::summ(M3.3, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ as.factor(race) + ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.2302     0.2455  -5.012 5.39e-07 ***
as.factor(race)2   0.7849     0.4709   1.667   0.0955 .  
as.factor(race)3   0.6383     0.3512   1.817   0.0692 .  
ht                 1.1671     0.6186   1.887   0.0592 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 226.06  on 185  degrees of freedom
AIC: 234.06

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(3) 8.610811
Pseudo-R² (Cragg-Uhler) 0.062633
Pseudo-R² (McFadden) 0.036693
AIC 234.061185
BIC 247.028173
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.292226 0.180627 0.472775 -5.011923 0.000001
as.factor(race)2 2.192239 0.871142 5.516795 1.666990 0.095516
as.factor(race)3 1.893265 0.951153 3.768532 1.817376 0.069160
ht 3.212725 0.955750 10.799480 1.886797 0.059188
Standard errors: MLE
lmtest::lrtest(M3.3, M0)
Likelihood ratio test

Model 1: low ~ as.factor(race) + ht
Model 2: low ~ ht
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -113.03                     
2   2 -115.33 -2 4.5887     0.1008
M3.4 <- glm(low ~ as.factor(race)*ht, data = lbw, 
            family = binomial(link = "logit"))
summary(M3.4); jtools::summ(M3.4, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ as.factor(race) * ht, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -1.2040     0.2488  -4.839  1.3e-06 ***
as.factor(race)2      0.7621     0.4944   1.542    0.123    
as.factor(race)3      0.5814     0.3630   1.602    0.109    
ht                    0.7985     0.9462   0.844    0.399    
as.factor(race)2:ht   0.3365     1.6055   0.210    0.834    
as.factor(race)3:ht   0.9226     1.5161   0.609    0.543    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 225.67  on 183  degrees of freedom
AIC: 237.67

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(5) 9.001675
Pseudo-R² (Cragg-Uhler) 0.065408
Pseudo-R² (McFadden) 0.038359
AIC 237.670321
BIC 257.120803
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.300000 0.184220 0.488546 -4.838993 0.000001
as.factor(race)2 2.142857 0.813108 5.647269 1.541504 0.123194
as.factor(race)3 1.788618 0.878121 3.643181 1.601891 0.109180
ht 2.222222 0.347861 14.196098 0.843937 0.398705
as.factor(race)2:ht 1.400000 0.060185 32.566497 0.209569 0.834004
as.factor(race)3:ht 2.515909 0.128893 49.108965 0.608577 0.542805
Standard errors: MLE

我们发现一些比较弱的证据似乎认为不同人种可能和低出生体重有关,尽管这个关系并无统计学意义。调整了人种之后,孕期高血压的比值比 OR 变小了一点点 (从3.37 降到 3.21)。所以,调整人种恐怕并没有太多的意义。也应该没有交互作用。

41.5.1.5 用相似的方法试分析子宫痉挛 (uterine irritability) 是否是混杂因子,有没有可疑的交互作用?

a <- Epi::stat.table(list("History of Hypertension" = ht, "Uterine Irritability" = ui), 
                     list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
# There is a strong observed association here. 
# The Zero in the contingency table means 
# that fitting a logistic regression model 
# will be uninformative, with subjects in one 
# predictor variable category being dropped
print(a, digits = c(percent = 2))
 --------------------------------------- 
               --Uterine Irritability--- 
 History of           0       1   Total  
 Hypertension                            
 --------------------------------------- 
 0                  149      28     177  
                 100.00  100.00  100.00  
                                         
 1                   12       0      12  
                 100.00      NA  100.00  
                                         
                                         
 Total              161      28     189  
                 100.00  100.00  100.00  
 --------------------------------------- 
M4.1 <- glm(low ~ ui, data = lbw, 
            family = binomial(link = "logit"))
summary(M4.1); jtools::summ(M4.1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ui, family = binomial(link = "logit"), data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9469     0.1756  -5.392 6.97e-08 ***
ui            0.9469     0.4168   2.272   0.0231 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.60  on 187  degrees of freedom
AIC: 233.6

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(1) 5.076097
Pseudo-R² (Cragg-Uhler) 0.037267
Pseudo-R² (McFadden) 0.021631
AIC 233.595899
BIC 240.079393
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.387931 0.274957 0.547323 -5.391870 0.000000
ui 2.577778 1.138905 5.834499 2.272045 0.023084
Standard errors: MLE
M4.2 <- glm(low ~ ht + ui, data = lbw, 
            family = binomial(link = "logit"))
summary(M4.2); jtools::summ(M4.2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0719     0.1879  -5.703 1.17e-08 ***
ht            1.4084     0.6150   2.290   0.0220 *  
ui            1.0719     0.4221   2.539   0.0111 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 224.32  on 186  degrees of freedom
AIC: 230.32

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(2) 10.351369
Pseudo-R² (Cragg-Uhler) 0.074950
Pseudo-R² (McFadden) 0.044110
AIC 230.320627
BIC 240.045868
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.342342 0.236853 0.494815 -5.703384 0.000000
ht 4.089474 1.225204 13.649807 2.290238 0.022008
ui 2.921053 1.277126 6.681056 2.539454 0.011103
Standard errors: MLE
lmtest::lrtest(M4.2, M0)
Likelihood ratio test

Model 1: low ~ ht + ui
Model 2: low ~ ht
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   3 -112.16                       
2   2 -115.33 -1 6.3292    0.01188 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

有一些证据提示子宫痉挛应该对孕期高血压和低出生体重之间有一定的混杂因子效应。表示子宫痉挛和孕期高血压,及其与低出生体重之间关系的比值比 OR 发生了显著的变化 (OR changed materially from 3.37 to 4.09)。

但是由于没有一位母亲同时有孕期高血压且有子宫痉挛,我们并没有办法分析他们之间是否存在交互作用。

41.5.1.6 根据目前为止建立过的模型和他们之间体现的可能存在的关系,你会怎样建立关于有效评价孕期高血压和低出生体重之间关系的逻辑回归模型?

我们认为目前为止对数据中的变量之间关系的评价,我们应该考虑在最终模型中加入母亲的体重,以及子宫痉挛作为混杂因子 (共变量, covariates)。

M5 <- glm(low ~ ht + ui + lwt, data = lbw, 
            family = binomial(link = "logit"))
summary(M5); jtools::summ(M5, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.064868   0.832875   1.279  0.20106   
ht           1.960496   0.695543   2.819  0.00482 **
ui           0.930029   0.433603   2.145  0.03196 * 
lwt         -0.016893   0.006563  -2.574  0.01005 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 216.64  on 185  degrees of freedom
AIC: 224.64

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(3) 18.036428
Pseudo-R² (Cragg-Uhler) 0.127998
Pseudo-R² (McFadden) 0.076858
AIC 224.635568
BIC 237.602556
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.900455 0.566919 14.839219 1.278544 0.201058
ht 7.102846 1.817125 27.763864 2.818654 0.004823
ui 2.534583 1.083484 5.929122 2.144884 0.031962
lwt 0.983248 0.970682 0.995978 -2.574023 0.010052
Standard errors: MLE

此时,模型估计的孕期高血压和低出生体重之间关系的比值比为 7.10。接下来,重新把另外几个数据中的变量一一逐个加入上面的 M5 模型中去,观察孕期高血压和低出生体重之间关系的比值比是否有本质变化。

M5.1 <- glm(low ~ ht + ui + lwt + age, data = lbw, 
            family = binomial(link = "logit"))
summary(M5.1); jtools::summ(M5.1, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + age, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.746870   1.056801   1.653  0.09834 . 
ht           1.926674   0.695579   2.770  0.00561 **
ui           0.916566   0.433466   2.115  0.03447 * 
lwt         -0.015907   0.006602  -2.410  0.01597 * 
age         -0.035073   0.033364  -1.051  0.29316   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 215.50  on 184  degrees of freedom
AIC: 225.5

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(4) 19.170815
Pseudo-R² (Cragg-Uhler) 0.135648
Pseudo-R² (McFadden) 0.081692
AIC 225.501181
BIC 241.709917
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 5.736617 0.722945 45.520417 1.652979 0.098335
ht 6.866635 1.756572 26.842441 2.769885 0.005608
ui 2.500688 1.069282 5.848257 2.114504 0.034472
lwt 0.984219 0.971566 0.997036 -2.409568 0.015971
age 0.965535 0.904417 1.030784 -1.051211 0.293162
Standard errors: MLE
M5.2 <- glm(low ~ ht + ui + lwt + as.factor(smoke), data = lbw, 
            family = binomial(link = "logit"))
summary(M5.2); jtools::summ(M5.2, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + as.factor(smoke), family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        0.718477   0.849013   0.846  0.39741   
ht                 1.921019   0.682522   2.815  0.00488 **
ui                 0.896352   0.442908   2.024  0.04299 * 
lwt               -0.016277   0.006545  -2.487  0.01288 * 
as.factor(smoke)1  0.652869   0.335657   1.945  0.05177 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 212.85  on 184  degrees of freedom
AIC: 222.85

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(4) 21.822445
Pseudo-R² (Cragg-Uhler) 0.153350
Pseudo-R² (McFadden) 0.092991
AIC 222.849551
BIC 239.058286
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.051307 0.388463 10.832079 0.846250 0.397413
ht 6.827915 1.791943 26.016698 2.814589 0.004884
ui 2.450646 1.028672 5.838275 2.023788 0.042992
lwt 0.983855 0.971315 0.996556 -2.487140 0.012877
as.factor(smoke)1 1.921045 0.995006 3.708937 1.945047 0.051769
Standard errors: MLE
M5.3 <- glm(low ~ ht + ui + lwt + as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M5.3); jtools::summ(M5.3, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + as.factor(race), family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)       0.948747   0.893931   1.061  0.28854   
ht                1.933814   0.706360   2.738  0.00619 **
ui                0.956122   0.433822   2.204  0.02753 * 
lwt              -0.018608   0.006896  -2.698  0.00697 **
as.factor(race)2  1.099131   0.499377   2.201  0.02774 * 
as.factor(race)3  0.430545   0.370301   1.163  0.24496   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 211.53  on 183  degrees of freedom
AIC: 223.53

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(5) 23.138325
Pseudo-R² (Cragg-Uhler) 0.162043
Pseudo-R² (McFadden) 0.098599
AIC 223.533671
BIC 242.984153
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.582471 0.447837 14.891912 1.061320 0.288544
ht 6.915840 1.732171 27.612088 2.737720 0.006187
ui 2.601589 1.111651 6.088480 2.203950 0.027528
lwt 0.981564 0.968386 0.994922 -2.698133 0.006973
as.factor(race)2 3.001558 1.127915 7.987613 2.201006 0.027736
as.factor(race)3 1.538095 0.744359 3.178219 1.162690 0.244955
Standard errors: MLE

可见,除了母亲体重,和子宫痉挛两个变量之外,增加任何一个变量并不再实质改变孕期高血压和低出生体重之间关系的比值比。

41.5.2 Part 2

41.5.2.1 想象另外一个研究者的分析目的是要使用相同的数据来建立一个预测式的模型。预测一名孕妇会产下一位低出生体重胎儿的危险度。

41.5.2.2 使用下面的Stata代码,试用逐步选择法寻找模型:

(Hosmer & Lemeshow data)

i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
                      begin with empty model
p = 0.0229 <  0.0500  adding  lwt
p = 0.0081 <  0.0500  adding  _Iht_1
p = 0.0320 <  0.0500  adding  _Iui_1

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      18.04
                                                  Prob > chi2     =     0.0004
Log likelihood = -108.31778                       Pseudo R2       =     0.0769

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9832484   .0064531    -2.57   0.010     .9706816     .995978
      _Iht_1 |   7.102846   4.940338     2.82   0.005     1.817125    27.76387
      _Iui_1 |   2.534583   1.099004     2.14   0.032     1.083484    5.929122
       _cons |   2.900455   2.415719     1.28   0.201     .5669188    14.83923
------------------------------------------------------------------------------

逐步筛选的过程是先加入和母亲的体重,其次是孕期高血压,最后保留了子宫痉挛作为了低出生体重的预测变量。这和我们在第一部分最终保留的,模型完全一样。

(Hosmer & Lemeshow data)

i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
                      begin with empty model
p = 0.0229 <  0.1000  adding  lwt
p = 0.0081 <  0.1000  adding  _Iht_1
p = 0.0320 <  0.1000  adding  _Iui_1
p = 0.0518 <  0.1000  adding  _Ismoke_1
p = 0.0173 <  0.1000  adding  _Irace_2 _Irace_3

Logistic regression                               Number of obs   =        189
                                                  LR chi2(6)      =      30.43
                                                  Prob > chi2     =     0.0000
Log likelihood = -102.11978                       Pseudo R2       =     0.1297

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9834361   .0066887    -2.46   0.014     .9704134    .9966336
      _Iht_1 |   6.490237   4.483259     2.71   0.007     1.676009    25.13302
      _Iui_1 |   2.471801   1.106213     2.02   0.043     1.028189    5.942297
   _Ismoke_1 |   2.817403   1.105908     2.64   0.008     1.305356    6.080917
    _Irace_2 |   3.758631   1.959795     2.54   0.011     1.352705    10.44375
    _Irace_3 |   2.526023   1.087054     2.15   0.031      1.08675    5.871446
       _cons |   1.054066   .9884219     0.06   0.955     .1677556    6.623063
------------------------------------------------------------------------------

这里设定了 pe(0.1) 之后,我们发现又增加了吸烟习惯,和种族两个变量。

(Hosmer & Lemeshow data)

i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
                      begin with empty model
p = 0.0229 <  0.1000  adding  lwt
p = 0.0081 <  0.1000  adding  _Iht_1
p = 0.0320 <  0.1000  adding  _Iui_1
p = 0.0518 <  0.1000  adding  _Ismoke_1
p = 0.0173 <  0.1000  adding  _Irace_2 _Irace_3

Logistic regression                               Number of obs   =        189
                                                  LR chi2(6)      =      30.43
                                                  Prob > chi2     =     0.0000
Log likelihood = -102.11978                       Pseudo R2       =     0.1297

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9834361   .0066887    -2.46   0.014     .9704134    .9966336
      _Iht_1 |   6.490237   4.483259     2.71   0.007     1.676009    25.13302
      _Iui_1 |   2.471801   1.106213     2.02   0.043     1.028189    5.942297
   _Ismoke_1 |   2.817403   1.105908     2.64   0.008     1.305356    6.080917
    _Irace_2 |   3.758631   1.959795     2.54   0.011     1.352705    10.44375
    _Irace_3 |   2.526023   1.087054     2.15   0.031      1.08675    5.871446
       _cons |   1.054066   .9884219     0.06   0.955     .1677556    6.623063
------------------------------------------------------------------------------

这时,只有年量被移出了最终模型,获得了和逐步加入变量法 (forward) 相同的模型。但是实际上在很多现实数据中,这种情况很少见。

(Hosmer & Lemeshow data)

i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
                      begin with empty model
p = 0.0229 <  0.0500  adding  lwt
p = 0.0081 <  0.0500  adding  _Iht_1
p = 0.0320 <  0.0500  adding  _Iui_1

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      18.04
                                                  Prob > chi2     =     0.0004
Log likelihood = -108.31778                       Pseudo R2       =     0.0769

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9832484   .0064531    -2.57   0.010     .9706816     .995978
      _Iht_1 |   7.102846   4.940338     2.82   0.005     1.817125    27.76387
      _Iui_1 |   2.534583   1.099004     2.14   0.032     1.083484    5.929122
       _cons |   2.900455   2.415719     1.28   0.201     .5669188    14.83923
------------------------------------------------------------------------------

这时,逐步删除法即使使用了更加严格的p值标准,留下的变量和逐步增加法的模型也不同。所以,分析者采用逐步删除或者逐步增加变量的方法,可能获得截然不同的最终保留模型。

41.5.2.3 建立一个包含全部变量的模型。解释其提示的孕期高血压和低出生体重之间的关系。

M6 <- glm(low ~ ht + ui + lwt + as.factor(race) + 
              age + as.factor(smoke), data = lbw, 
            family = binomial(link = "logit"))
summary(M6); jtools::summ(M6, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + as.factor(race) + age + as.factor(smoke), 
    family = binomial(link = "logit"), data = lbw)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        0.434255   1.191830   0.364  0.71559   
ht                 1.856482   0.688711   2.696  0.00703 **
ui                 0.895338   0.448475   1.996  0.04589 * 
lwt               -0.016255   0.006857  -2.371  0.01775 * 
as.factor(race)2   1.280069   0.526638   2.431  0.01507 * 
as.factor(race)3   0.902295   0.434317   2.078  0.03776 * 
age               -0.018285   0.035352  -0.517  0.60500   
as.factor(smoke)1  1.027543   0.393899   2.609  0.00909 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 203.97  on 181  degrees of freedom
AIC: 219.97

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(7) 30.701874
Pseudo-R² (Cragg-Uhler) 0.210853
Pseudo-R² (McFadden) 0.130829
AIC 219.970122
BIC 245.904098
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.543812 0.149316 15.961799 0.364359 0.715590
ht 6.401179 1.659695 24.688325 2.695590 0.007026
ui 2.448163 1.016478 5.896342 1.996407 0.045890
lwt 0.983877 0.970743 0.997188 -2.370697 0.017755
as.factor(race)2 3.596888 1.281304 10.097220 2.430645 0.015072
as.factor(race)3 2.465253 1.052375 5.775006 2.077505 0.037755
age 0.981881 0.916152 1.052327 -0.517224 0.605000
as.factor(smoke)1 2.794192 1.291126 6.047055 2.608644 0.009090
Standard errors: MLE

放入了全部的变量之后,调整后的孕期高血压和低出生体重之间关系的比值比变成了 6.40 ,它的95%置信区间是 (1.66, 24.69)。这个结果和逐步变量筛选法,以及第一部分中逐个分析变量之间关系之后建立起来的模型所估算的比值比结果其实十分接近。在汇报这些结果的时候,无论分析者选用了哪种方法,最终获得接近的答案,会让读者相信结果是稳定可靠的。分析者也应该在论文中提示使用不同的策略分析选择变量和模型之后获得的比值比都很接近。当然,有些人会乐意展示所有分析策略的过程和结果。另外值得注意的是,保留不同变量的模型所估计的各自的比值比之间严格来说是不同的条件比值比,由于逻辑回归的不可压缩型,他们不适宜直接拿来比较。

41.5.3 Part 3

已知这个数据中收集到的数据还包含 1)怀孕初期(最初三个月),孕妇访问医生的次数;2)该母亲是否曾经有过流产史。根据这一实情,回答下面的问题:

41.5.3.1 你是否同一把上述这两个中的一个或者两者全部都放入前面两个部分建立的回归模型中做预测变量?在怎样的条件下,你会认为可能增加这两个变量是可以考虑的?

关于怀孕初期访问医生次数这一变量,它很可能其实是处在孕期高血压和低出生体重两者因果关系通路上的变量。所以,增加它作为预测变量可能引起不小的争议。

调整母亲的流产史同样也是一件容易引起争议的事。如果说,研究的目的是要预测某母亲会产下低出生体重胎儿的概率,那么增加这个变量可能会被认可。但是如果说研究目的是为了分析孕期高血压和低出生体重胎儿的出生之间可能存在的病因学关系,那你应该拒绝增加调整这个变量。这是由于母亲的流产史可能和某些基因上的或者环境上的因素有关,间接或者直接的成为母亲的孕期高血压(有或无),及低出生体重胎儿(有或无)的原因。

41.5.3.2 如果你认为应该增加这两个变量中的一个或者全部,请建立包含你认为应该增加的变量的模型,观察结果的变化。

尽管增加这两个变量可能引起一些争议,但是实际上增加了这两个变量中的任何一个对孕期高血压和低出生体重之间的关系影响都几乎可以忽略不计。

M7 <- glm(low ~ ht + ui + lwt + ftv, data = lbw, 
            family = binomial(link = "logit"))
summary(M7); jtools::summ(M7, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + ftv, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  1.08789    0.83877   1.297  0.19463   
ht           1.94292    0.69698   2.788  0.00531 **
ui           0.92319    0.43443   2.125  0.03358 * 
lwt         -0.01677    0.00659  -2.545  0.01093 * 
ftv         -0.04820    0.16978  -0.284  0.77647   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 216.55  on 184  degrees of freedom
AIC: 226.55

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(4) 18.117740
Pseudo-R² (Cragg-Uhler) 0.128548
Pseudo-R² (McFadden) 0.077205
AIC 226.554256
BIC 242.762991
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.967996 0.573456 15.361253 1.297001 0.194631
ht 6.979073 1.780430 27.357134 2.787610 0.005310
ui 2.517295 1.074361 5.898176 2.125070 0.033581
lwt 0.983369 0.970750 0.996153 -2.544936 0.010930
ftv 0.952940 0.683207 1.329165 -0.283921 0.776471
Standard errors: MLE
M8 <- glm(low ~ ht + ui + lwt + ptl, data = lbw, 
            family = binomial(link = "logit"))
summary(M8); jtools::summ(M8, digits = 6, confint = TRUE, exp = TRUE)

Call:
glm(formula = low ~ ht + ui + lwt + ptl, family = binomial(link = "logit"), 
    data = lbw)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.829114   0.848356   0.977  0.32841   
ht           1.943581   0.700545   2.774  0.00553 **
ui           0.773297   0.445804   1.735  0.08281 . 
lwt         -0.015886   0.006632  -2.395  0.01660 * 
ptl          0.627385   0.337262   1.860  0.06285 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 213.04  on 184  degrees of freedom
AIC: 223.04

Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
χ²(4) 21.636480
Pseudo-R² (Cragg-Uhler) 0.152117
Pseudo-R² (McFadden) 0.092199
AIC 223.035516
BIC 239.244251
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.291289 0.434468 12.083750 0.977319 0.328411
ht 6.983714 1.769217 27.567142 2.774383 0.005531
ui 2.166898 0.904418 5.191676 1.734610 0.082810
lwt 0.984239 0.971529 0.997116 -2.395465 0.016599
ptl 1.872706 0.966923 3.627001 1.860229 0.062853
Standard errors: MLE

Reference

Robinson, Laurence D, and Nicholas P Jewell. 1991. “Some Surprising Results about Covariate Adjustment in Logistic Regression Models.” International Statistical Review, 227–40.