Sunday, January 17, 2016

A simple ANOVA

I was browsing Davies' Design and Analysis of Industrial Experiments (second edition, 1967), published for ICI in the days when industry did that kind of thing. It is quite an applied book. On page 107 there is an example where the variance of a process is estimated.

Data

The data are from nine batches, from each of which three samples were taken (A, B and C), with a duplicate measurement on each sample. I am not sure about the copyright of these data, so I will not reprint them here. The problem is to determine the measurement and sampling error in a chemical process.
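Since I am not reprinting the data, here is a minimal sketch of the structure the data frame r4 is assumed to have, with simulated values (not the book's numbers), plus the packages used below:
library(ggplot2)
library(lme4)
library(magrittr)

# simulated stand-in: 9 batches x 3 samples x 2 duplicate measurements
set.seed(42)
r4 <- expand.grid(dup=1:2,
    Sample=factor(c('A','B','C')),
    batch=factor(paste0('B',1:9)))
batch_eff  <- rnorm(9,sd=4)     # batch-to-batch variation
sample_eff <- rnorm(27,sd=0.6)  # sampling variation within batch
r4$x <- 47 + batch_eff[as.integer(r4$batch)] +
    sample_eff[as.integer(interaction(r4$batch,r4$Sample))] +
    rnorm(nrow(r4),sd=0.9)      # measurement (residual) error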
ggplot(r4,aes(x=Sample,y=x))+
    geom_point()+
    facet_wrap(~  batch )



Analysis

At the time the book was written, the only feasible approach was a classical ANOVA, with the variance components calculated from the mean squares.
aov(x~ batch + batch:Sample,data=r4) %>%
  anova
Analysis of Variance Table

Response: x
             Df Sum Sq Mean Sq  F value  Pr(>F)    
batch         8 792.88  99.110 132.6710 < 2e-16 ***
batch:Sample 18  25.30   1.406   1.8818 0.06675 .  
Residuals    27  20.17   0.747                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case the residual variance is 0.75. The batch:Sample mean square is, due to the design (two measurements per sample), twice the sampling variance plus the residual variance. Hence the sampling variance is estimated as (1.406 - 0.747)/2, about 0.33. How lucky we are to have tools (lme4) which can produce this estimate directly. In this case, as it was a well designed (balanced) experiment, those estimates are the same as the ones from the ANOVA. 
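A quick check of that hand calculation from the mean squares, before letting lme4 do the work:
(1.406 - 0.747) / 2
[1] 0.3295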
l1 <- lmer(x ~1+ (1 | batch) + (1|batch:Sample) ,data=r4 )

summary(l1)
Linear mixed model fit by REML ['lmerMod']
Formula: x ~ 1 + (1 | batch) + (1 | batch:Sample)
   Data: r4

REML criterion at convergence: 189.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.64833 -0.50283 -0.06649  0.55039  1.57801 

Random effects:
 Groups       Name        Variance Std.Dev.
 batch:Sample (Intercept)  0.3294  0.5739  
 batch        (Intercept) 16.2841  4.0354  
 Residual                  0.7470  0.8643  
Number of obs: 54, groups: batch:Sample, 27; batch, 9

Fixed effects:
            Estimate Std. Error t value
(Intercept)   47.148      1.355    34.8
A next step is confidence intervals around the estimates. Davies uses limits from a chi-squared distribution for the residual variance, leading to a 90% interval of 0.505 to 1.25. In contrast, lme4 offers two relevant methods: profile (computing a likelihood profile and finding the appropriate cutoffs based on the likelihood ratio test) and boot (parametric bootstrapping, with confidence intervals computed from the bootstrap distribution according to boot.type). Each of these takes a second or a few on my laptop, which would not have been feasible in the pre-computer age. The estimates are different and, to my surprise, narrower:
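A sketch of the calls (level=0.9 gives the 5% and 95% limits shown below):
confint(l1,level=0.9)                # profile likelihood intervals
confint(l1,method='boot',level=0.9)  # parametric bootstrap intervals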
Computing profile confidence intervals ...
                   5 %       95 %
.sig01       0.0000000  0.9623748
.sig02       2.6742109  5.9597328
.sigma       0.7017849  1.1007261
(Intercept) 44.8789739 49.4173227

Computing bootstrap confidence intervals ...
                                  5 %       95 %
sd_(Intercept)|batch:Sample  0.000000  0.8880414
sd_(Intercept)|batch         2.203608  5.7998348
sigma                        0.664149  1.0430984
(Intercept)                 45.140652 49.4931109
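For comparison, the chi-squared limits Davies uses for the residual variance can be reproduced, up to rounding of the mean square, from its 27 degrees of freedom:
# 90% interval for the residual variance via chi-squared limits
27*0.747/qchisq(c(0.95,0.05),df=27)  # roughly 0.50 to 1.25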
Davies goes on to estimate the ratio of the sampling variance to the residual variance, which was the best that could be done at the time. That I won't repeat.

Sunday, January 3, 2016

A plot of 'Who works at home'

I ran across this post containing displays of who works from home. I must say it looks great and is interactive, but it did not help me understand the data. So I created this post to display the same data in a boring plot which might actually help me. For those really interested in this topic, census.gov has made a .pdf containing a full report with much more information than given here.

Data

The data are from census.gov; I have taken the first spreadsheet. It is one of those spreadsheets with counts, percentages and empty lines to mark categories. Very nice for checking some numbers, horrible to process. So, a bit of code to extract the numbers.
library(gdata)
r1 <- read.xls('2010_Table_1.xls',stringsAsFactors=FALSE)
# throw out percentages
r2 <- r1[,r1[4,]!='Percent']
# put all column names in one row
r2$X.6[2] <- r2$X.6[3]
r2$X.8[2] <- r2$X.8[3]
# select part with data
r3 <- r2[2:61,c(1,3,5,6)]
names(r3)[1] <- r3[1,1]
r4 <-r3[c(-1:-3),]
#eliminate one row with mean income. 
r4 <- r4[-grep('$',r4[,2],fixed=TRUE),]
#reshape in long form
r5 <- reshape(r4,
    varying=list(names(r4)[-1]),
    v.names='count',
    direction='long',
    idvar='Characteristic',
    timevar='class',
    times=r3[1,2:4])
row.names(r5) <- 1:nrow(r5)
# remove ',' from numbers and make numerical values. 
# units are in 1000, so update that too
r5$count <- as.numeric(gsub(',','',r5$count))*1000
# clean up numbers used for footnotes
r5$class <- gsub('(1|2|3)','',r5$class)
#some upfront '.' removed.
r5$Characteristic <- gsub('^\\.+','',r5$Characteristic)
# create a factor
r5$Characteristic <- factor(r5$Characteristic,
   levels=rev(r5$Characteristic[r5$class=='Home Workers']))
# and create a higher level factor
r5$Mchar=r5$Characteristic
# header rows (NA count) and 'Total' keep their own label; other rows inherit the previous one
for (i in 1:nrow(r5)) {
  r5$Mchar[i] <- if (is.na(r5$count[i]) | r5$Mchar[i]=='Total')
      r5$Mchar[i] else r5$Mchar[i-1]
}

Plot

The plot is made using old-style base graphics; I could not get either ggplot2 or lattice to produce the plot I wanted.
# prepare for axis labels
index <- subset(r5,r5$class=='Home Workers',c(Characteristic,Mchar))
index$y=56:1
index2 <- index[index$Characteristic!=index$Mchar | index$Characteristic=='Total',]
index3 <- index[index$Characteristic==index$Mchar & index$Characteristic!='Total',]

r6 <- merge(r5,index)
r6$class <- factor(r6$class)
par(mar=c(5,18,4,2)+.1,cex=.7)
plot(x=r6$count,y=r6$y,axes=FALSE,
    xlab='Count',
    ylab='',
    col=c('red','green','blue')[r6$class],
    frame.plot=TRUE,
 #   log='x',
    ylim=c(2,58))
axis(1)
axis(2,at=index2$y,labels=index2$Characteristic,las=1)
text(y=index3$y-.1,x=30000,labels=index3$Characteristic,adj=0)
legend('topleft',legend=levels(r6$class),
    ncol=3,col=c('red','green','blue'),
    border=NULL,pch=1,
    yjust=0)

Why did I not use ggplot2?

The ideal solution for ggplot2 might look something like this:
library(ggplot2)
r7 <- r5[!is.na(r5$count),]
r7$Mchar <- factor(r7$Mchar,levels=unique(r7$Mchar))
ggplot(data=r7,
        aes(x=Characteristic,y=count,col=class)) +
    geom_point()+
    coord_flip()+
    xlab('')+ylab('')+
    ylim(0,max(r5$count,na.rm=TRUE))+
    facet_wrap(~Mchar,scales='free_x',ncol=2)+
    theme(legend.position="bottom")
However, this throws an error:
Error in facet_render.wrap(plot$facet, panel, plot$coordinates, theme,  : 
  ggplot2 does not currently support free scales with a non-cartesian coord or coord_flip.
I also tried the approach described here: http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html, but I think the meaning of widths has changed since that was written; I could not get it to work satisfactorily.

library(gtable)
library(gridExtra)
library(grid)

# number of rows per higher-level category, used as relative panel heights
tt <- as.data.frame(table(r7$Mchar))
tt$Var1
# give the last panel extra height for its x axis and the legend
tt$Freq[12] <- tt$Freq[12] + 15

la <- lapply(tt$Var1,function(x) {
      r8 <- r5[r5$Mchar==as.character(x) ,]
      r8 <- r8[ !is.na(r8$count),]
      ggplot(data=r8,
              aes(x=Characteristic,y=count,col=class)) +
          geom_point()+
          coord_flip()+
          xlab('')+ylab('')+
          ylim(0,max(r5$count,na.rm=TRUE))
    })

# width alignment (http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html)
# is done further down, after the plots have been converted to grobs


la[[12]] <- la[[12]] + 
    theme(legend.position="bottom",  
        plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))
for (i in 1:11) la[[i]] <- la[[i]] +
      theme(legend.position="none",
          axis.text.x = element_blank(),
          axis.title.x = element_blank(), 
          axis.ticks.x = element_blank(),
         plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))

lag <- lapply(la,ggplotGrob)

# align the label/axis widths across the grobs
# (http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html)
lax <- lapply(lag,function(x) x$widths[2:3])
maxwidths <- do.call(grid::unit.pmax,lax)
for(i in 1:12) lag[[i]]$widths[2:3] <- as.list(maxwidths)

g <- gtable_matrix(name = "demo",
    grobs = matrix(lag, nrow = 12), 
    widths = unit(9, "null"),
    heights = unit(tt$Freq, "null"))

grid.newpage()
grid.draw(g)