Monday, April 30, 2012

Bayesian ANOVA for sensory panel profiling data

In this post it is examined if it is possible to use Bayesian methods and specifically JAGS to analyze sensory profiling data. The aim is not to obtain different results, but rather to confirm that the results are fairly similar. The data used is the chocolate data from SensoMineR and the script is adapted from various online sources examined over a longer period.

Building the model

JAGS, (but also WinBugs and OpenBugs) are programs which can be used to provide samples from posterior distributions. It is up to the user to provide data and model to JAGS and interpret the samples. Luckily, R provides infrastructure to help both in setting up models and data and in posterior analysis of the samples. Still, there are some steps to be done, before the analysis can be executed; Setting up data, defining model, initializing variables and deciding which parameters of the model are interesting. Only then JAGS can be called.

Setting up data

Data is any data which goes into JAGS. This includes experimental design, measurements, but also number of rows in the data. For this post I have added some extra data, since I want to compare differences between product means. As I want to compare those, I need to have samples from these specific distributions. All the data needs to go into one big list, which will be given to JAGS later on.
# get libraries
library(SensoMineR)
library(coda)
library(R2jags)

# infrastructure for contrasts
FullContrast <- function(n) {
  UseMethod('FullContrast',n)
}
FullContrast.default <- function(n) stop('FullContrast only takes integer and factors')
FullContrast.integer <- function(n) {
  mygrid <- expand.grid(v1=1:n,v2=1:n)
  mygrid <- mygrid[mygrid$v1<mygrid$v2,]
  rownames(mygrid) <- 1:nrow(mygrid)
  as.matrix(mygrid)
}
FullContrast.factor <- function(n) {
  FullContrast(nlevels(n))
}

# reading data
data(chocolates)

# setting up experimental data
data_list <- with(sensochoc,list(Panelist = as.numeric(Panelist),
                              nPanelist = nlevels(Panelist),
                              Product = as.numeric(Product),
                              nProduct = nlevels(Product),
                              y=CocoaA,
                              N=nrow(sensochoc)))
# contrasts
data_list$Panelistcontr <- FullContrast(sensochoc$Panelist)
data_list$nPanelistcontr <- nrow(data_list$Panelistcontr)
data_list$Productcontr <-  FullContrast(sensochoc$Product) 
data_list$nProductcontr <- nrow(data_list$Productcontr)

Model definition

The model can be written in 'plain' R and then given to JAGS. However, JAGS does not have vector operations, hence there are a lot of for loops which would be unacceptable for normal R usage. Besides the additive effects in the first part of the model, there are quite some extras. all the means in the model are coming out of hyperdistributions. In this case these are normal distributed. The precision (and hence variance) of these hyperdistributions are determined on basis of the data. The final part of the model translates the internal parameters into something which is sensible to interpret. The model also needs to be written into a file so JAGS can use it later on, these are the last two lines.



mymodel <- function() {
  # core of the model  
  for (i in 1:N) {
    fit[i] <- grandmean +  mPanelist[Panelist[i]] + mProduct[Product[i]] + mPanelistProduct[Panelist[i],Product[i]]
    y[i] ~ dnorm(fit[i],tau)
  }
  # grand mean and residual 
  tau ~ dgamma(0.001,0.001)
  gsd <-  sqrt(1/tau)
  grandmean ~ dnorm(0,.001)
  # variable Panelist distribution  
  mPanelist[1] <- 0
  for (i in 2:nPanelist) {
    mPanelist[i] ~ dnorm(offsetPanelist,tauPanelist) 
  }
  offsetPanelist ~ dnorm(0,.001)
  tauPanelist ~ dgamma(0.001,0.001)
  sdPanelist <- sqrt(1/tauPanelist)
  # Product distribution 
  mProduct[1] <- 0
  for (i in 2:nProduct) {
    mProduct[i] ~ dnorm(offsetProduct,tauProduct)
  }
  offsetProduct ~ dnorm(0,0.001)
  tauProduct ~ dgamma(0.001,0.001)
  sdProduct <- sqrt( 1/tauProduct)
  # interaction distribution
  for (i in 1:nPanelist) {
    mPanelistProduct[i,1] <- 0
  }
  for (i in 2:nProduct) {
    mPanelistProduct[1,i] <- 0
  }
  for (iPa in 2:nPanelist) {
    for (iPr in 2:nProduct) {
      mPanelistProduct[iPa,iPr] ~dnorm(offsetPP,tauPP)
    }
  }
  offsetPP ~dnorm(0,0.001)
  tauPP ~dgamma(0.001,0.001)
  sdPP <- 1/sqrt(tauPP)
  # getting the interesting data
  # true means for Panelist
  for (i in 1:nPanelist) {
    meanPanelist[i] <- grandmean + mPanelist[i] + mean(mPanelistProduct[i,1:nProduct]) + mean(mProduct[1:nProduct])
  }
  # true means for Product
  for (i in 1:nProduct) {
    meanProduct[i] <- grandmean + mProduct[i] + mean(mPanelistProduct[1:nPanelist,i]) + mean(mPanelist[1:nPanelist])
  }
  for (i in 1:nPanelistcontr) {
    Panelistdiff[i] <- meanPanelist[Panelistcontr[i,1]]-meanPanelist[Panelistcontr[i,2]]
  }
  for (i in 1:nProductcontr) {
    Productdiff[i] <- meanProduct[Productcontr[i,1]]-meanProduct[Productcontr[i,2]]
  }
}

model.file <- file.path(tempdir(),'mijnmodel.txt')
write.model(mymodel,con=model.file)

Initializing variables

Anything values in the model which are not provided by the data, needs to be initialized. It is most convenient to setup a little model which can be used to get these values.

inits <- function() list(
  grandmean = rnorm(1,3,1),
  mPanelist = c(0,rnorm(data_list$nPanelist-1)) ,
  mProduct = c(0,rnorm(data_list$nProduct-1)) ,
  mPanelistProduct = rbind(rep(0,data_list$nProduct),cbind(rep(0,data_list$nPanelist-1),matrix(rnorm((data_list$nPanelist-1)*(data_list$nProduct-1)),nrow=data_list$nPanelist-1,ncol=data_list$nProduct-1)))
,
  tau = runif(1,1,2),
  tauPanelist = runif(1,1,3),
  tauProduct = runif(1,1,3)
  )

parameters of interest and calling JAGS

The parameters of interest is basically anything which we want know anything about. The JAGS call, is just listing all the parts provided before to JAGS. In this case, the model runs fairly quick, so I decided to have some extra iterations (n.iter) and an extra chain. For this moment, I decided not to calculate DIC.
parameters <- c('sdPanelist','sdProduct','gsd','meanPanelist',
  'meanProduct','Productdiff','sdPP')
jagsfit <- jags(data=data_list,inits=inits,model.file=model.file,
   parameters.to.save=parameters,n.chains=4,DIC=FALSE,n.iter=10000)

Result

The first part of the result can be obtained via a simple print of jagsfit
jagsfit

Inference for Bugs model at "/tmp/Rtmp0VFW9g/mijnmodel.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 4000 iterations saved
                 mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
Productdiff[1]     0.562   0.312 -0.037  0.356  0.558  0.773  1.186 1.001  4000
Productdiff[2]     2.303   0.317  1.697  2.087  2.299  2.512  2.932 1.001  4000
Productdiff[3]     1.741   0.314  1.118  1.530  1.748  1.948  2.346 1.001  3700
Productdiff[4]     0.836   0.307  0.224  0.630  0.838  1.044  1.424 1.002  2000
Productdiff[5]     0.274   0.303 -0.322  0.069  0.273  0.478  0.870 1.001  2600
Productdiff[6]    -1.467   0.313 -2.081 -1.671 -1.465 -1.257 -0.865 1.001  2700
Productdiff[7]     0.339   0.308 -0.263  0.130  0.338  0.537  0.951 1.001  4000
Productdiff[8]    -0.223   0.299 -0.807 -0.430 -0.218 -0.021  0.349 1.001  4000
Productdiff[9]    -1.965   0.319 -2.587 -2.183 -1.966 -1.757 -1.322 1.001  4000
Productdiff[10]   -0.497   0.294 -1.068 -0.699 -0.495 -0.300  0.082 1.002  1600
Productdiff[11]    0.740   0.317  0.130  0.525  0.733  0.952  1.374 1.001  4000
Productdiff[12]    0.177   0.300 -0.412 -0.019  0.178  0.382  0.753 1.001  4000
Productdiff[13]   -1.564   0.322 -2.186 -1.776 -1.570 -1.349 -0.913 1.001  3300
Productdiff[14]   -0.097   0.300 -0.686 -0.300 -0.090  0.102  0.505 1.002  2100
Productdiff[15]    0.401   0.300 -0.180  0.197  0.405  0.597  0.997 1.001  4000
gsd                1.689   0.067  1.563  1.644  1.687  1.734  1.821 1.002  1700
meanPanelist[1]    6.667   0.492  5.680  6.334  6.659  7.002  7.601 1.001  3100
meanPanelist[2]    5.513   0.436  4.660  5.219  5.512  5.808  6.369 1.001  4000
meanPanelist[3]    6.325   0.443  5.439  6.031  6.331  6.624  7.171 1.001  4000
meanPanelist[4]    7.394   0.443  6.542  7.094  7.400  7.703  8.262 1.002  1800
meanPanelist[5]    7.104   0.445  6.227  6.807  7.104  7.405  7.982 1.001  4000
meanPanelist[6]    6.201   0.436  5.328  5.912  6.208  6.497  7.034 1.001  4000
meanPanelist[7]    6.386   0.444  5.521  6.078  6.384  6.688  7.248 1.003  1100
meanPanelist[8]    6.451   0.443  5.562  6.160  6.452  6.760  7.301 1.001  4000
meanPanelist[9]    5.184   0.441  4.334  4.882  5.184  5.480  6.050 1.002  2300
meanPanelist[10]   8.110   0.460  7.212  7.815  8.107  8.411  9.020 1.001  3700
meanPanelist[11]   5.122   0.448  4.227  4.811  5.121  5.433  5.989 1.002  2100
meanPanelist[12]   7.191   0.441  6.327  6.891  7.181  7.492  8.061 1.001  2900
meanPanelist[13]   6.719   0.438  5.893  6.421  6.720  7.013  7.563 1.001  4000
meanPanelist[14]   6.266   0.440  5.396  5.978  6.266  6.553  7.138 1.002  1700
meanPanelist[15]   6.859   0.434  6.002  6.571  6.867  7.148  7.716 1.002  2200
meanPanelist[16]   6.079   0.434  5.238  5.783  6.082  6.371  6.930 1.002  2100
meanPanelist[17]   6.054   0.433  5.213  5.769  6.051  6.345  6.893 1.002  1800
meanPanelist[18]   6.122   0.420  5.281  5.844  6.129  6.398  6.939 1.002  2200
meanPanelist[19]   6.185   0.438  5.324  5.888  6.182  6.479  7.064 1.002  1500
meanPanelist[20]   4.988   0.456  4.088  4.680  4.997  5.292  5.873 1.001  4000
meanPanelist[21]   4.994   0.455  4.064  4.689  4.996  5.305  5.884 1.001  3400
meanPanelist[22]   5.062   0.451  4.177  4.763  5.065  5.370  5.944 1.001  2600
meanPanelist[23]   7.171   0.453  6.296  6.855  7.172  7.480  8.057 1.001  4000
meanPanelist[24]   5.663   0.442  4.799  5.368  5.671  5.959  6.532 1.002  2400
meanPanelist[25]   7.514   0.455  6.627  7.210  7.518  7.815  8.425 1.001  4000
meanPanelist[26]   6.320   0.439  5.432  6.023  6.325  6.624  7.156 1.001  2900
meanPanelist[27]   6.853   0.431  6.013  6.562  6.843  7.144  7.705 1.001  4000
meanPanelist[28]   7.045   0.440  6.197  6.741  7.043  7.340  7.926 1.001  4000
meanPanelist[29]   4.789   0.456  3.915  4.472  4.789  5.096  5.687 1.001  4000
meanProduct[1]     7.084   0.223  6.653  6.937  7.081  7.230  7.539 1.001  4000
meanProduct[2]     6.522   0.215  6.097  6.375  6.524  6.669  6.928 1.001  4000
meanProduct[3]     4.781   0.227  4.332  4.626  4.778  4.927  5.227 1.001  4000
meanProduct[4]     6.248   0.213  5.838  6.101  6.248  6.394  6.660 1.002  1500
meanProduct[5]     6.745   0.217  6.317  6.605  6.748  6.885  7.171 1.001  4000
meanProduct[6]     6.344   0.221  5.910  6.199  6.347  6.494  6.772 1.001  4000
sdPP               0.140   0.123  0.024  0.050  0.096  0.193  0.461 1.010   290
sdPanelist         0.996   0.178  0.700  0.869  0.978  1.106  1.393 1.001  4000
sdProduct          0.996   0.536  0.426  0.670  0.864  1.164  2.293 1.001  4000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
It is a big table, and it is needed to extract the required data from it. A good way is to plot the results. There are two plots to start, a quick summary and extensive plots. As the second plot command makes one figure per four variables, it is omitted. In the figure, it is observed that some of the product differences are different from 0, this means that it is believed these differences are present. From meanProducts it seems product 3 is quite lower than the other products. There is also quite some variation in meanPanelist. To be specific, panelist 10 scores high, while 9 and 11 score low.
Variables gsd and sdPanelist might be used to examine panel performance, but to examine this better, they should be compared with results from other descriptors.
plot(jagsfit)

Details about products

A main question if obviously, which products are different? For this we can extract some data from a summary
jagsfit.mc <- as.mcmc(jagsfit)
# plot(jagsfit.mc) # this plot give too many figures for the blog
fitsummary <- summary(jagsfit.mc)
# extract differences
Productdiff <- fitsummary$quantiles[ grep('Productdiff',rownames(fitsummary$quantiles)),]
# extract differences different from 0
data_list$Productcontr[Productdiff[,1]>0 | Productdiff[,5]<0,]
# get the product means
ProductMean <- fitsummary$statistics[ grep('meanProduct',rownames(fitsummary$quantiles)),]
rownames(ProductMean) <- levels(sensochoc$Product)
ProductMean
The result shows us a table of product pairs which are different; most of these are related to product 3, but also product 1 is different from 4 and 6.  

> jagsfit.mc <- as.mcmc(jagsfit)
> # plot(jagsfit.mc)
> fitsummary <- summary(jagsfit.mc)
> # extract differences
> Productdiff <- fitsummary$quantiles[ grep('Productdiff',rownames(fitsummary$quantiles)),]
> # extract differences different from 0
> data_list$Productcontr[Productdiff[,1]>0 | Productdiff[,5]<0,]
   v1 v2
2   1  3
3   2  3
4   1  4
6   3  4
9   3  5
11  1  6
13  3  6
> # get the product means > ProductMean <- fitsummary$statistics[ grep('meanProduct',rownames(fitsummary$quantiles)),] > rownames(ProductMean) <- levels(sensochoc$Product) > ProductMean
          Mean        SD    Naive SE Time-series SE
choc1 7.083966 0.2225106 0.003518201    0.003233152
choc2 6.521746 0.2150476 0.003400201    0.003988800
choc3 4.780643 0.2272578 0.003593261    0.003553964
choc4 6.247723 0.2128971 0.003366198    0.003382249
choc5 6.745146 0.2165525 0.003423996    0.003230432
choc6 6.344260 0.2206372 0.003488580    0.003662731

Comparison with ANOVA

A few lines in R will give the standard analysis. The product means are very close. This ANOVA shows only differences involving product 3. This is probably due to usage of TukeyHSD, which can be a bit conservative in the ANOVA while the comparison in the Bayesian model is unprotected.


> library(car)
> aa <- aov(CocoaA ~ Product * Panelist,data=sensochoc)
> Anova(aa)
Anova Table (Type II tests)

Response: CocoaA
                 Sum Sq  Df F value    Pr(>F)    
Product          207.54   5 12.6045 1.876e-10 ***
Panelist         390.43  28  4.2343 1.670e-09 ***
Product:Panelist 322.29 140  0.6991    0.9861    
Residuals        573.00 174                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> TukeyHSD(aa,'Product')
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = CocoaA ~ Product * Panelist, data = sensochoc)

$Product
                  diff        lwr        upr     p adj
choc2-choc1 -0.5344828 -1.5055716  0.4366061 0.6088175
choc3-choc1 -2.4137931 -3.3848819 -1.4427043 0.0000000
choc4-choc1 -0.8275862 -1.7986750  0.1435026 0.1433035
choc5-choc1 -0.2931034 -1.2641923  0.6779854 0.9531915
choc6-choc1 -0.7241379 -1.6952268  0.2469509 0.2674547
choc3-choc2 -1.8793103 -2.8503992 -0.9082215 0.0000014
choc4-choc2 -0.2931034 -1.2641923  0.6779854 0.9531915
choc5-choc2  0.2413793 -0.7297095  1.2124682 0.9797619
choc6-choc2 -0.1896552 -1.1607440  0.7814337 0.9932446
choc4-choc3  1.5862069  0.6151181  2.5572957 0.0000742
choc5-choc3  2.1206897  1.1496008  3.0917785 0.0000000
choc6-choc3  1.6896552  0.7185663  2.6607440 0.0000192
choc5-choc4  0.5344828 -0.4366061  1.5055716 0.6088175
choc6-choc4  0.1034483 -0.8676406  1.0745371 0.9996304
choc6-choc5 -0.4310345 -1.4021233  0.5400544 0.7960685

> model.tables(aa,type='mean',cterms='Product')
Tables of means
Grand mean
         
6.287356 

 Product 
Product
choc1 choc2 choc3 choc4 choc5 choc6 
7.086 6.552 4.672 6.259 6.793 6.362 

Conclusion

JAGS can be used to analyzed sensory profiling data. However, if a simple model such as two way ANOVA is used, it does not seem to be worth the trouble.

No comments:

Post a Comment