[R-br] glm.fit: fitted probabilities numerically 0 or 1 occurred
Marcelo Luiz de Laia
marcelolaia em gmail.com
Quarta Outubro 12 19:39:36 BRT 2011
On Wed, 12 Oct 2011, Benilton Carvalho wrote:
> Isso deve-se ao fato de vc ter
> combinacoes dos fatores para as quais vc tem 20 germinacoes (ou falta
> delas)... Dai', vc tem variabilidade zero para aquele grupo e suas
> predicoes podem ser bastante viciadas. b
Então, Benilton, eu encontrei um caso em que há uma "incerteza de
predicao" da probabilidade, se é que eu entendi direito. Por favor, veja
abixo.
No entanto, eu não tenho nenhum caso de germinacao das 20 sementes. O máximo
foram 10. Por outro lado, tenho casos em que nenhuma germinou. Por ser
por isso tambem? Nenhuma germinar?
O que poderia ser feito?
No final eu colo o sumario da analise.
Obrigado
y<- c(0,0,0,0,1,1,1,1)
x1<-c(1,2,3,3,5,6,10,11)
x2<-c(3,2,-1,-1,2,4,1,0)
m1<- glm(y~ x1+x2, family=binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
The only warning message that R gives is right after fitting the
logistic model. It says that "fitted probabilities numerically 0 or 1
occurred". Combining this piece of information with the parameter
estimate for x1 being really large (>15), we suspect that there is a
problem of complete or quasi-complete separation. The standard errors
for the parameter estimates are way too large. This usually indicates
a convergence issue or some degree of data separation.
What is quasi-complete separation and what do some of the most
commonly used software packages do when it happens?
Quasi-complete separation in a logistic/probit regression happens when
the outcome variable separates a predictor variable or a combination
of predictor variables to certain degree. Here is an example.
Y X1 X2
0 1 3
0 2 0
0 3 -1
0 3 4
1 3 1
1 4 0
1 5 2
1 6 7
1 10 3
1 11 4
Notice that the outcome variable Y separates the predictor variable X1
pretty well except for values of X1 equal to 3. In other words, X1
predicts Y perfectly when X1 <3 (Y = 0) or X1 >3 (Y=1), leaving only
when X1 = 3 as cases with uncertainty. In terms of expected
probabilities, we have Prob(Y=1 | X1<3) = 0 and Prob(Y=1 | X1>3) = 1,
nothing to be estimated, except for Prob(Y = 1 | X1 = 3).
What happens when we try to fit a logistic or a probit regression
model of Y on X1 and X2 using the data above? It turns out that the
maximum likelihood estimate for X1 does not exist. With this example,
the larger the parameter for X1, the larger the likelihood. In
practice, a value of 15 or larger does not make much difference and
they all basically correspond to predicted probability of 1. The
behavior of different statistical software packages differ at how they
deal with the issue of quasi-complete separation. Below is what each
package of SAS, SPSS, Stata and R does with our sample data and the
logistic regression model of Y on X1 and X2. We present these results
here in the hope that some level of understanding of the behavior of
logistic/probit regression within our familiar software package might
help us identify the problem of separation more efficiently.
summary(gm1)
Generalized linear mixed model fit by the Laplace approximation
Formula: cbind(Germinacao, 20 - Germinacao) ~ Dia + Especie + Matriz +
(Dia | Repeticao)
Data: ka.dat
AIC BIC logLik deviance
192.6 323.8 -54.3 108.6
Random effects:
Groups Name Variance Std.Dev. Corr
Repeticao (Intercept) 1.0810e-09 3.2878e-05
Dia2 3.7027e-01 6.0850e-01 0.000
Dia3 4.7946e-01 6.9243e-01 0.000 1.000
Dia4 4.9575e-02 2.2265e-01 0.000 1.000 1.000
Dia5 9.3664e-03 9.6780e-02 0.000 1.000 1.000 1.000
Dia6 3.1308e-02 1.7694e-01 0.000 1.000 1.000 1.000
1.000
Dia7 1.4786e-02 1.2160e-01 0.000 1.000 1.000 1.000
1.000
1.000
Number of obs: 168, groups: Repeticao, 3
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.6318 2854.0451 -0.008 0.9937
Dia2 16.4579 2854.0452 0.006 0.9954
Dia3 18.2293 2854.0451 0.006 0.9949
Dia4 19.1952 2854.0451 0.007 0.9946
Dia5 19.4628 2854.0451 0.007 0.9946
Dia6 19.7174 2854.0451 0.007 0.9945
Dia7 19.8392 2854.0451 0.007 0.9945
EspecieVinhatico 17.5891 3788.0180 0.005 0.9963
MatrizM04 -1.4221 0.6517 -2.182 0.0291 *
MatrizM05 -19.0112 3788.0181 -0.005 0.9960
MatrizM06 -1.8314 0.7690 -2.382 0.0172 *
MatrizM1 -16.8933 3788.0181 -0.004 0.9964
MatrizM12 -19.0114 3788.0180 -0.005 0.9960
MatrizM15CV15 -15.2958 3788.0181 -0.004 0.9968
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Dia2 Dia3 Dia4 Dia5 Dia6 Dia7 EspcVn
MtrM04
Dia2 -1.000
Dia3 -1.000 1.000
Dia4 -1.000 1.000 1.000
Dia5 -1.000 1.000 1.000 1.000
Dia6 -1.000 1.000 1.000 1.000 1.000
Dia7 -1.000 1.000 1.000 1.000 1.000 1.000
EspeciVnhtc 0.000 0.000 0.000 0.000 0.000 0.000 0.000
MatrizM04 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
MatrizM05 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -1.000
0.000
MatrizM06 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.174
MatrizM1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -1.000
0.000
MatrizM12 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -1.000
0.000
MtrzM15CV15 0.000 0.000 0.000 0.000 0.000 0.000 0.000 -1.000
0.000
MtrM05 MtrM06 MtrzM1 MtrM12
Dia2
Dia3
Dia4
Dia5
Dia6
Dia7
EspeciVnhtc
MatrizM04
MatrizM05
MatrizM06 0.000
MatrizM1 1.000 0.000
MatrizM12 1.000 0.000 1.000
MtrzM15CV15 1.000 0.000 1.000 1.000
>
--
Marcelo
Brazil
Linux user number 487797
Mais detalhes sobre a lista de discussão R-br