Sunday, September 6, 2015

Predicting Titanic deaths on Kaggle V: Ranger

In two previous posts (Predicting Titanic deaths on Kaggle IV: random forest revisited, Predicting Titanic deaths on Kaggle) I was unable to make random forest predict as well as boosting. Hence when I read about an alternative implementation; ranger I took the opportunity to check if with ranger I could improve predictions. The claim ranger makes is that it is faster than RandomForest.
Meanwhile, I have also been reading that RandomForest is not the best implementation. I did not bookmark where, but Google turned up Benchmarking Random Forest Classification on wise.io, and Benchmarking Random Forest Implementations on DataScience.LA.

Data

A slightly adapted version of pre-processing was used. I created a more simple version of the cabin character, thereby moving it from different variables A to F to one multilevel factor with values A, B, C, D, E, X. In addition ticket has been made in a factor with more levels.

Age

Ranger is supposed to be fast, so I took the opportunity to do a replicated cross validation for predicting missing values for age. The cross validation is replicated because in previous experiments I found the difference between two settings might be smaller than the variation within settings. For this errorest() from the ipred package was used and a wrapper around ranger's predict was written to make it function. The plot below shows cross validation error for 10 different runs for each of the settings, so also has an indication of the variation of the cross validation error. From the plot. it was decided to chose mtry=2, nodesize=7. As can be seen from the plot, mtry 1 has the worse predictions, while at nodesize 7 and higher reasonable predictions can be made. In hindsight, bigger nodesizes might even be better, but this was not investigated at the time.

Survival

Again using replicated cross validation, the following prediction errors were found. No mtry=1 in the plot, but that did not perform well. In this plot it does not seem like there are huge differences, but the combination of small nodesize and larger mtry does not seem to pay off. Since these prediction errors are in the same range as previous results it was decided not to make a Kaggle submission on these data.

Number of trees

Since it is unclear to me what the influence of the number of trees was, I did a small experiment with 50, 500 and 5000 trees. Again 10 times a cross validation. In this plot, 50 trees gives a surprising good prediction error for such a simple model, but 5000 is a bit better. Rather than investigating if there is sufficient number of trees, I cut the corner and chose a large number; 200000. It should be noted that this fit into memory and only took a few minutes. However, this did not improve the Kaggle score of my previous random forest attempt.

Conclusion

Ranger is indeed a fast and memory sparse random forest implementation. However, it was not able to improve my prediction error.

Code

Please know that code has been reformatted after pasting in blogger to improve layout. Some intermediate data saving and restoring code has been removed too.
# preparation and data reading section
library(ranger)
#https://www.reddit.com/r/MachineLearning/comments/3hvy7v/ranger_a_fast_implementation_of_random_forests/
library(lattice)
library(latticeExtra)
# has cross validation
library(ipred)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Embarked[tt$Embarked==''] <- 'S'
tt$Embarked <- factor(tt$Embarked)
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title=='Dr' & tt$Sex=='female'] <- 'Miss'
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer',
   'Major','Rev','Dr')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms',
   'theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# changed cabin character
tt$cabchar <- substr(tt$Cabin,1,1)
tt$cabchar[tt$cabchar %in% c('F','G','T')] <- 'X';
tt$cabchar <- factor(tt$cabchar)
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
tt$ticket <- sub('[[:digit:]]+$','',tt$Ticket)
tt$ticket <- toupper(gsub('(\\.)|( )|(/)','',tt$ticket))
tt$ticket[tt$ticket %in% c('A2','A4','AQ3','AQ4','AS')] <- 'An'
tt$ticket[tt$ticket %in% c('SCA3','SCA4','SCAH','SC','SCAHBASLE','SCOW')] <- 'SC'
tt$ticket[tt$ticket %in% c('CASOTON','SOTONO2','SOTONOQ')] <- 'SOTON'
tt$ticket[tt$ticket %in% c('STONO2','STONOQ')] <- 'STON'
tt$ticket[tt$ticket %in% c('C')] <- 'CA'
tt$ticket[tt$ticket %in% c('SOC','SOP','SOPP')] <- 'SOP'
tt$ticket[tt$ticket %in% c('SWPP','WC','WEP')] <- 'W'
tt$ticket[tt$ticket %in% c('FA','FC','FCC')] <- 'F'
tt$ticket[tt$ticket %in% c('PP','PPP','LINE','LP','SP')] <- 'PPPP'
tt$ticket <- factor(tt$ticket)

#end of preparation and data reading

# age section
# get an age without missings
forage <- tt[!is.na(tt$Age) & tt$status=='train',
   names(tt) %in% c('Age','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','cabchar','ncabin','ticket')]
# oe is side of vessel, not relevant for age?

totest <- expand.grid(mtry=1:4,min.node.size=1:11,rep=1:10)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Age ~ .,
          mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forage)
      cc <- c(mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          error=ee$error)
      # print(cc)
      cc
    })
sla <- do.call(rbind,la)
sla <- as.data.frame(sla)

useOuterStrips(
    densityplot(~ error | factor(mtry)+factor(min.node.size), 
        data=sla))

# 2,7?
rfa1 <- ranger(Age ~ .,
    data=forage,
    mtry=2,
    write.forest=TRUE,
    min.node.size=7)

tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(rfa1,tt[is.na(tt$AGE),])$predictions
table(tt$age)
# end of age section

#final data section
train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]
#end of final data section

#model selection 1
forSurf <- train[,names(train) %in% c('Survived','AGE','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','ncabin','ticket','oe')]

totest <- expand.grid(mtry=2:5,
  min.node.size=c(1:4,seq(6,12,2),15,20,25),rep=1:10)

la2 <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forSurf
      )
      cc <- c(mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          error=ee$error)
      cat('.')
      if (totest$mtry[ii]==max(totest$mtry) & 
          totest$min.node.size[ii]==max(totest$min.node.size)) 
        cat('\n')
      cc
    })
sla2 <- do.call(rbind,la2)
sla2 <- as.data.frame(sla2)

useOuterStrips(
    densityplot(~ error | factor(mtry)+factor(min.node.size), 
        data=sla2))

############

totest <- expand.grid(num.trees=c(50,500,5000),rep=1:10)

la3 <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=4,
          min.node.size=12,
          num.trees=totest$num.trees[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forSurf
      )
      cc <- c(num.trees=totest$num.trees[ii],error=ee$error)
      cat('.')
      if (totest$num.trees[ii]==max(totest$num.trees)) cat('\n')
      cc
    })
sla3 <- do.call(rbind,la3)
sla3 <- as.data.frame(sla3)
densityplot(~ error | factor(num.trees), data=sla3)

#########
rang3 <- ranger(Survived ~ .,
    mtry=4,
    min.node.size=12,
    write.forest=TRUE,
    data=forSurf,
    num.trees=200000) 

pp <- predict(rang3,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp$predictions,row.names=NULL)
write.csv(x=out,
    file='rf.1.sep.csv',
    row.names=FALSE,
    quote=FALSE)

# Your submission scored 0.75598, which is not an improvement of your best score.

No comments:

Post a Comment