**********************************************************; * A Poisson regression as a Generalized Linear Model: * * The Poisson distribution is * * Prob(Y=n) = exp(-mu) mu^n/n! for n=0,1,2,3,4,..., * * This is a good model for counts: The Poisson distribution * can be shown to be the limit of the binomial distribution * * Prob(Y=n) = (N!/((N-n)!n!) p^n(1-p)^{N-n) * * as N->infinity, p->0, and E(Y) = Np -> mu. An example might * be the number of flaws in a piece of paper, since each tiny * square has a small probability of having a flaw, and the * number of tiny squares is huge of the squares are tiny enough. * * Similarly, most counts depending on time might be assumed to be * Poisson distributed by the same argument: The probability of a * count in a particular millisecond is small, and there are many, * many milliseconds in an hour or day. However, if the mean or * intensity of the counts changes randomly over the hour or day, * this could produce bursts of counts that could make the total * count be too heavy-tailed to have a Poisson distribution. * * For a Poisson distribution, E(Y) = mu = Var(Y) * * If Y depends on covariates X (which might be a vector), we can * form a generalized linear model for Poisson variables Y by * * E(Y|X) = (mu|X) = g(a+bX), where 0X_i+1 raises E(Y|X) by a factor of exp(b_i). * * The dataset `allproduct.dat' can be downloaded from the SAS Web site * ftp.sas.com/pub/publications, where it is the dataset POSTDOC * in the file A55770.htm corresponding to Allison's book. * (I added three lines of comments at the top of the file.) **********************************************************; title 'EXAMPLES OF POISSON AND NEGATIVE-BINOMIAL REGRESSION -- YOUR NAME'; options ls=75 ps=60 pageno=1 nocenter; data product; infile "allproduct.dat" firstobs=4; input Pdoc Age Mar Doc Und Ag Arts Cits docid; label Age='Age at PhD' Mar='Married' Doc='Rank of PhD School' Und='Rank of Undergrad School' Ag='Agricultural School'; run; proc chart; title2 "DISTRIBUTION OF COUNTS OF ARTICLES AND CITATIONS:"; title3 "THE DISTRIBUTIONS ARE TOO HEAVY-TAILED TO BE POISSON,"; title4 " BUT MIGHT BE DUE TO HIGHLY VARIABLE COVARIATES"; hbar Arts Cits / discrete; run; proc genmod data=product; title2 "POISSON REGRESSION FOR ARTICLES PUBLISHED:"; title3 " ONLY UND IS SIGNIFICANT, WITH AGE BORDERLINE"; title4 "NOTE: E(Y|X) = exp(A+Sum B_i X_i), SO THAT"; title5 " X_i->X_i+1 MULTIPLIES E(Y|X) BY exp(B_i)"; model Arts = Age Mar Doc Und Ag / dist=poisson; run; proc genmod data=product; title2 "POISSON REGRESSION FOR ARTICLES PUBLISHED"; title3 "FEWER VARIABLES: CORRELATION OF COVARIATES"; title4 " DOESN'T SEEM TO BE THE PROBLEM"; model Arts = Age Und Ag / dist=poisson; run; proc genmod data=product; title2 "POISSON REGRESSION FOR CITATIONS TO THOSE ARTICLES"; title3 "NOW EVERYTHING IS SIGNIFICANT!"; model Cits = Age Mar Doc Und Ag / d=p; run; **********************************************************; * GOODNESS OF FIT AND SCALED DEVIANCES FOR DISCRETE MODELS: * * Let LL(Y,mu(Y),H) be the log likelihood for count data Y for a * model H, where mu(Y) is the fitted value for each observation Y. * Then mu(Y)=Y if we have enough parameters to fit every observation * exactly. * * By definition, the SCALED DEVIANCE of a model H is * * Deviance(H) = 2 (LL(Y,Y) - LL(Y,mu(Y),H)) (*) * * In particular, for nested hypotheses H_0 and H_1, * * Deviance(H_0) > Deviance(H_1) * * since the fitted likelihoods are nested in the opposite direction. * The likelihood-ratio test for H_0 within H_1 can be written as * * Deviance(H_0) - Deviance(H_1) approx chi_square(r) * * where r is the number of extra parameters in H_1 but not in H_0. * * In principle, the largest model H_L has one parameter for each * observation, so that mu(Y)=Y and Deviance(H_L)=0. This means that * approximately * * Deviance(H_0) approx chi_square(N-d) * * if H_0 has d parameters. This cannot be used as a goodness-of-fit * test for H_0, since there are almost as many degrees of freedom as * observations, so that the chiu-square approximation is not valid. * However, we can take * * Deviance(H_0) >> N-d * * as a sign that the model H_0 does not fit the data: * * Intuitively, this would be because the fitted log likelihood * LL(Y,mu(Y),H) in (*) is too small, which says that the model H is * too unlikely for the data Y. * **********************************************************; proc genmod data=product; title2 "NOTE: DEVIANCE/DF > 1 FOR BOTH ARTS and CITS, AND"; title3 " THE COUNT DISTRIBUTION IS VERY HEAVY-TAILED"; title4 "TRY NEGATIVE BINOMIAL REGRESSION FOR ARTICLES PUBLISHED"; title5 "NOTE THAT NOW DEVIANCE/DF < 1 IN BOTH CASES, AND THAT"; title6 " DEVIANCE(Poisson)-DEVIANCE(NegBin), WHICH SHOULD BE"; title7 " CHI-SQUARE WITH DF=1, IS HUGE."; model Arts = Age Mar Doc Und Ag / dist=negbin; run; proc genmod data=product; title2 "NEGATIVE BINOMIAL REGRESSION FOR CITATIONS"; title3 "DEVIANCE/DF IS NOW MUCH BETTER, AND SOME COVARIATES"; title4 " ARE SIGNIFICANT AND SOME ARE NOT."; model Cits = Age Mar Doc Und Ag / d=negbin; run;