Monday, August 27, 2012

Football (Eredivisie) goals

The football season has started in Netherlands, so I went and had a look at last year's scores. I did not find downloadable data, at http://www.eredivisiestats.nl/wedstrijden.php I could copy last season's data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.

Datum Thuisclub (home) Uitclub (away) Thuisscore Uitscore
2011-08-05 Excelsior Feyenoord 0 2
2011-08-06 RKC Waalwijk Heracles Almelo 2 2
2011-08-06 Roda JC FC Groningen 2 1

# read data
old <- read.csv('seizoen1112.csv',stringsAsFactors=FALSE)
# reformat lengthwise
StartData <- with(old,data.frame(
OffenseClub = c(Thuisclub,Uitclub),
DefenseClub = c(Uitclub,Thuisclub),
Goals = c(Thuisscore,Uitscore),
OffThuis = rep(c(1,0),each=length(Datum))
))
#remove space as first character in names
levels(StartData$OffenseClub) <- sub('^ ','',levels(StartData$OffenseClub))
levels(StartData$DefenseClub) <- sub('^ ','',levels(StartData$DefenseClub))
# check clubs have same name and ordening in both factors.
all(levels(StartData$OffenseClub)==levels(StartData$DefenseClub))

Distribution of goals

To get to know the data the distribution of goals is most interesting:
xt <- xtabs(~Goals ,data=StartData)
plot(xt)
The first question is; is this approximately Poisson distributed? MASS has some functions.
library(MASS)
(fd <- fitdistr(StartData$Goals, densfun="Poisson"))
     lambda  
  1.62908497 
 (0.05159364)
round(dpois(0:8,coef(fd))*sum(xt))
plot(xt,ylim=c(0,200),ylab='Frequency of goals')
points(x=0:8,y=dpois(0:8,coef(fd))*sum(xt),col='green')
It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.
library(fitdistrplus)
(fd2 <- fitdist(StartData$Goals, distr='pois',method='mle'))
Fitting of the distribution ' pois ' by maximum likelihood 

Parameters:
       estimate Std. Error
lambda 1.629085 0.05159362

fitdist(StartData$Goals, distr='pois',method='mme')
Fitting of the distribution ' pois ' by matching moments 
Parameters:
       estimate
lambda 1.629085
plot(fd2)
This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra's in fitdistrplus. Goodness of fit and bootstrap confidence interval.
gofstat(fd2,print.test=TRUE)
Chi-squared statistic:  18.46365 
Degree of freedom of the Chi-squared distribution:  4 
Chi-squared p-value:  0.001001433 
bd <- bootdist(fd2)
summary(bd)
Parametric bootstrap medians and 95% percentile CI 
  Median     2.5%    97.5% 
1.627451 1.527803 1.728736 
plot(density(bd$estim$lambda))
Finally, if desired the same can be done the Bayesian way via Jags
library(R2jags)
calcData <- with(StartData,list(N=length(Goals),Goals=Goals))
mijnmodel <- function() {
  for (i in 1:N) {
    Goals[i] ~ dpois(mu)
  }
  mu ~ dunif(0,3)
}
write.model(mijnmodel,'model.file')
jags.inits <- function() {
  list(
      mu = runif(1,0,3)
  )
}
jags.params=c('mu')
jags(data=calcData, inits=jags.inits, jags.params,model.file='model.file')
Inference for Bugs model at "model.file", fit using jags,
 3 chains, each with 2000 iterations (first 1000 discarded)
 n.sims = 3000 iterations saved
         mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
mu         1.633   0.051    1.533    1.598    1.632    1.667    1.732 1.001  2100
devianc 2048.287   1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006  1500

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 0.9 and DIC = 2049.2
DIC is an estimate of expected predictive error (lower deviance is better).
This leads to essentially the same result as before.

Conclusion

Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.

No comments:

Post a Comment