Saturday, September 19, 2015

Predicting Titanic deaths on Kaggle VI: Stan

It is a bit a contradiction. Kaggle provides competitions on data science, while Stan is clearly part of the (Bayesian) statistics. Yet after using random forests, boosting and bagging, I also think this problem has a suitable size for Stan, which I understand can handle larger problems than older Bayesian software such as JAGS.
What I aim to do is enter a load of variables in the Stan model. Aliasing will be ignored, and I hope the hierarchical model will provide suitable shrinkage for terms which are not relevant. 

Data

The data has been mildly adapted from previously. Biggest change is that I have decided to make age into a factor, based on the value when present, with a special level for missing. The alternative in context of a Bayesian model would be to make fitting age part of the model, but that seems more complex than I am willing to go at this point.

Model

The idea of the model is pretty simple. it will be logistic regression, with only factors in the independent variables. All levels in all factors are used. So when Sex is entered in the model, it will add two parameter values, one for male, one for female (see code below, variable fsex). When passenger class is entered in the model, it has three values (variabe fpclass). Upon the parameter values there is a prior distribution. I have chosen a normal prior distribution, around zero with standard deviation sdsex and sdpclass respectively. These have a common prior (sd1). I have used both half normal with standard deviation 3 as below and uniform (0,3) for the prior.
  parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sd1;
  }
  transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
      expect[i] <- inv_logit(
        intercept+
        fsex[sex[i]]+
        fpclass[pclass[i]]
        );
      }    
  }
  model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sd1 ~ normal(0,3);
    intercept ~ normal(0,1);
    survived ~ bernoulli(expect);
  }

Predictions

In this phase I have decided to make the predictions within the Stan program. The way which seemed to work is to duplicate all independent variables and do the predictions in the generated quantities section of the program. These predictions are actually probabilities of survival for each passenger for each MCMC sample. Hence a bit of post processing is used, the mean probability is calculated and a cut off of 0.5 is used to decide the final classification.
Since the Stan code runs pretty quickly after it has been compiled it is feasible to run this as a two stage process. A first run to examine the model parameters. A second run just to obtain the predictions.

Model 1

This is the parameter output of a model with just age and passenger class. It has been added here mostly to show a simple full coded example what the code looks like. With sdsex bigger than sdpclass it follows that sex had a bigger influence than class on the chance of survival.
Inference for Stan model: 8fb625a6ccf29aab919e1dcd494247aa.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept    -0.07    0.03 0.83   -1.68   -0.65   -0.07    0.49    1.61   670 1.00
fsex[1]       1.40    0.04 0.89   -0.46    0.83    1.40    1.98    3.07   533 1.01
fsex[2]      -1.24    0.04 0.89   -3.11   -1.81   -1.23   -0.65    0.45   529 1.01
fpclass[1]    0.94    0.03 0.70   -0.40    0.50    0.93    1.33    2.34   536 1.01
fpclass[2]    0.12    0.03 0.69   -1.24   -0.31    0.11    0.52    1.55   527 1.01
fpclass[3]   -0.93    0.03 0.69   -2.25   -1.35   -0.94   -0.50    0.47   499 1.01
sdsex         2.06    0.05 1.20    0.76    1.23    1.74    2.49    5.36   648 1.00
sdpclass      1.40    0.04 0.90    0.50    0.84    1.15    1.67    3.86   661 1.00
sd1           2.41    0.04 1.27    0.71    1.45    2.18    3.11    5.48  1071 1.00
lp__       -421.36    0.09 2.12 -426.25 -422.56 -421.00 -419.82 -418.19   616 1.00

Samples were drawn using NUTS(diag_e) at Sun Sep 13 12:28:49 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Model 2

Model 2 has all linear effects in it. Predictions using this model were submitted, it gave a score of 0.78947 which was not an improvement over previous scores. I have one submission using bagging which did better, got same result with boosting and worse results with random forest. 
Note that while this model looks pretty complex, this score was obtained without any interactions. 
Inference for Stan model: 6278b3ebade9802ad9544b1242bada20.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept       0.28    0.05  0.74   -1.20   -0.16    0.28    0.77    1.68   229 1.01
sd1             0.80    0.03  0.23    0.43    0.63    0.77    0.97    1.31    63 1.05
fsex[1]         0.24    0.03  0.56   -0.64   -0.01    0.13    0.39    1.85   353 1.01
fsex[2]        -0.05    0.04  0.50   -1.29   -0.24    0.00    0.22    0.89   131 1.02
fpclass[1]      0.54    0.07  0.57   -0.39    0.15    0.53    0.90    1.77    60 1.03
fpclass[2]      0.15    0.06  0.50   -0.68   -0.27    0.16    0.50    1.15    76 1.01
fpclass[3]     -0.78    0.05  0.48   -1.66   -1.10   -0.79   -0.47    0.18   102 1.01
fembarked[1]    0.26    0.02  0.33   -0.29    0.06    0.22    0.43    1.02   298 1.01
fembarked[2]    0.08    0.02  0.32   -0.55   -0.07    0.06    0.20    0.78   456 1.00
fembarked[3]   -0.18    0.02  0.31   -0.80   -0.33   -0.18   -0.01    0.45   378 1.01
foe[1]         -0.32    0.08  0.53   -1.63   -0.65   -0.28    0.04    0.54    41 1.07
foe[2]          0.04    0.02  0.43   -0.73   -0.11    0.01    0.19    1.02   433 1.01
foe[3]          0.34    0.03  0.48   -0.41    0.04    0.26    0.57    1.54   227 1.03
fcabchar[1]    -0.25    0.18  0.46   -1.16   -0.49   -0.13    0.03    0.45     7 1.17
fcabchar[2]    -0.04    0.03  0.35   -0.86   -0.16   -0.04    0.10    0.62   143 1.03
fcabchar[3]     0.07    0.01  0.27   -0.47   -0.06    0.05    0.18    0.75   325 1.03
fcabchar[4]    -0.09    0.04  0.31   -0.85   -0.23   -0.03    0.08    0.48    55 1.05
fcabchar[5]     0.22    0.02  0.33   -0.32    0.00    0.17    0.40    1.03   219 1.02
fcabchar[6]     0.31    0.04  0.36   -0.25    0.02    0.25    0.59    1.12    76 1.03
fcabchar[7]    -0.21    0.06  0.40   -1.03   -0.43   -0.11    0.04    0.45    44 1.05
fage[1]         0.14    0.06  0.32   -0.38   -0.04    0.04    0.28    0.78    27 1.10
fage[2]         0.45    0.14  0.59   -0.12    0.02    0.17    0.77    1.68    19 1.17
fage[3]         0.10    0.15  0.53   -0.74   -0.15   -0.01    0.12    1.37    13 1.24
fage[4]        -0.10    0.01  0.31   -0.86   -0.25   -0.05    0.03    0.48   558 1.01
fage[5]         0.17    0.10  0.43   -0.52   -0.04    0.04    0.25    1.07    20 1.14
fage[6]         0.00    0.01  0.21   -0.42   -0.09   -0.01    0.09    0.47   250 1.01
fage[7]         0.09    0.03  0.20   -0.32   -0.01    0.05    0.22    0.47    50 1.05
fage[8]        -0.22    0.03  0.31   -1.06   -0.34   -0.15   -0.01    0.17   129 1.03
fage[9]        -0.02    0.07  0.39   -0.98   -0.16   -0.01    0.12    0.60    27 1.09
fage[10]        0.00    0.01  0.19   -0.43   -0.06    0.01    0.06    0.40   813 1.00
fticket[1]      0.07    0.03  0.21   -0.33   -0.06    0.04    0.22    0.51    54 1.05
fticket[2]     -0.04    0.03  0.34   -0.87   -0.17   -0.01    0.17    0.59   108 1.02
fticket[3]     -0.17    0.05  0.43   -1.23   -0.38   -0.09    0.06    0.60    69 1.04
fticket[4]      0.01    0.05  0.31   -0.48   -0.19    0.01    0.19    0.69    40 1.06
fticket[5]      0.02    0.02  0.36   -0.64   -0.13   -0.02    0.15    0.95   210 1.01
fticket[6]      0.03    0.05  0.32   -0.63   -0.15    0.02    0.21    0.55    37 1.07
fticket[7]      0.14    0.02  0.38   -0.40   -0.05    0.05    0.26    1.18   245 1.01
fticket[8]      0.01    0.01  0.35   -0.84   -0.12    0.02    0.20    0.73   929 1.01
fticket[9]      0.01    0.01  0.32   -0.69   -0.12    0.03    0.14    0.65   766 1.00
fticket[10]    -0.23    0.04  0.46   -1.53   -0.39   -0.08    0.03    0.38   149 1.02
fticket[11]    -0.02    0.01  0.32   -0.66   -0.16   -0.04    0.11    0.70  1001 1.01
fticket[12]     0.37    0.05  0.47   -0.16    0.03    0.20    0.57    1.58    82 1.04
fticket[13]    -0.20    0.02  0.34   -1.05   -0.36   -0.14    0.01    0.36   221 1.03
ftitle[1]       1.26    0.08  0.73   -0.03    0.75    1.20    1.74    2.85    79 1.03
ftitle[2]       0.47    0.04  0.70   -1.05    0.13    0.45    0.87    1.79   372 1.01
ftitle[3]      -2.27    0.03  0.66   -3.59   -2.59   -2.34   -1.88   -0.86   486 1.01
ftitle[4]       0.78    0.06  0.73   -0.86    0.35    0.89    1.25    2.10   144 1.01
fsibsp[1]       0.70    0.06  0.51   -0.33    0.36    0.69    1.09    1.69    77 1.04
fsibsp[2]       0.48    0.08  0.53   -0.62    0.14    0.49    0.93    1.46    49 1.06
fsibsp[3]       0.67    0.11  0.65   -0.60    0.23    0.61    1.16    1.85    35 1.07
fsibsp[4]      -1.58    0.04  0.64   -2.89   -2.03   -1.52   -1.16   -0.38   292 1.02
fparch[1]       0.25    0.02  0.33   -0.25    0.03    0.22    0.39    1.02   450 1.01
fparch[2]       0.04    0.01  0.33   -0.53   -0.13    0.00    0.17    0.83   521 1.01
fparch[3]      -0.03    0.06  0.38   -0.64   -0.22   -0.01    0.14    0.78    35 1.07
fparch[4]      -0.47    0.18  0.54   -1.48   -0.81   -0.31   -0.02    0.22     9 1.12
ffare[1]       -0.01    0.01  0.15   -0.39   -0.04    0.00    0.03    0.29   421 1.01
ffare[2]       -0.02    0.01  0.15   -0.41   -0.06    0.00    0.02    0.26   419 1.01
ffare[3]       -0.02    0.01  0.15   -0.35   -0.06    0.00    0.02    0.26   671 1.00
ffare[4]        0.00    0.01  0.16   -0.31   -0.05    0.00    0.04    0.36   586 1.01
ffare[5]        0.08    0.01  0.20   -0.15   -0.01    0.01    0.12    0.66   247 1.02
sdsex           0.51    0.04  0.44    0.03    0.21    0.35    0.68    1.64   123 1.01
sdpclass        0.75    0.04  0.36    0.32    0.48    0.68    0.94    1.59    64 1.04
sdembarked      0.43    0.02  0.32    0.06    0.21    0.39    0.53    1.24   328 1.01
sdoe            0.56    0.05  0.38    0.06    0.32    0.47    0.73    1.51    48 1.05
sdcabchar       0.39    0.03  0.26    0.03    0.18    0.36    0.59    0.97    61 1.04
sdage           0.33    0.06  0.29    0.02    0.09    0.23    0.53    0.89    23 1.15
sdticket        0.35    0.03  0.24    0.04    0.19    0.30    0.45    0.99    68 1.05
sdtitle         1.29    0.02  0.35    0.75    1.07    1.19    1.47    2.15   320 1.01
sdsibsp         1.00    0.02  0.37    0.49    0.74    0.95    1.16    1.95   222 1.01
sdparch         0.47    0.12  0.37    0.01    0.16    0.38    0.73    1.25    10 1.12
sdfare          0.14    0.02  0.17    0.00    0.02    0.09    0.19    0.60    92 1.03
lp__         -342.28    2.74 15.87 -372.33 -353.23 -344.26 -330.95 -310.12    34 1.09

Samples were drawn using NUTS(diag_e) at Sun Aug 30 13:00:19 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Code

Data reading

# preparation and data reading section
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Embarked[tt$Embarked==''] <- 'S'
tt$Embarked <- factor(tt$Embarked)
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 999
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100,1000))

tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title=='Dr' & tt$Sex=='female'] <- 'Miss'
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major','Rev','Dr')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# changed cabin character
tt$cabchar <- substr(tt$Cabin,1,1)
tt$cabchar[tt$cabchar %in% c('F','G','T')] <- 'X';
tt$cabchar <- factor(tt$cabchar)
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
tt$ticket <- sub('[[:digit:]]+$','',tt$Ticket)
tt$ticket <- toupper(gsub('(\\.)|( )|(/)','',tt$ticket))
tt$ticket[tt$ticket %in% c('A2','A4','AQ3','AQ4','AS')] <- 'An'
tt$ticket[tt$ticket %in% c('SCA3','SCA4','SCAH','SC','SCAHBASLE','SCOW')] <- 'SC'
tt$ticket[tt$ticket %in% c('CASOTON','SOTONO2','SOTONOQ')] <- 'SOTON'
tt$ticket[tt$ticket %in% c('STONO2','STONOQ')] <- 'STON'
tt$ticket[tt$ticket %in% c('C')] <- 'CA'
tt$ticket[tt$ticket %in% c('SOC','SOP','SOPP')] <- 'SOP'
tt$ticket[tt$ticket %in% c('SWPP','WC','WEP')] <- 'W'
tt$ticket[tt$ticket %in% c('FA','FC','FCC')] <- 'F'
tt$ticket[tt$ticket %in% c('PP','PPP','LINE','LP','SP')] <- 'PPPP'
tt$ticket <- factor(tt$ticket)
tt$fare <- cut(tt$Fare,breaks=c(min(tt$Fare)-1,quantile(tt$Fare,seq(.2,.8,.2)),max(tt$Fare)+1))

train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]

#end of preparation and data reading

options(width=90)

First model

datain <- list(

    survived = c(0,1)[train$Survived],
    ntrain = nrow(train),
    ntest=nrow(test),
    sex=c(1:2)[train$Sex],
    psex=c(1:2)[test$Sex],
    pclass=c(1:3)[train$Pclass],
    ppclass=c(1:3)[test$Pclass]    
)

parameters=c('intercept','fsex','fpclass','sdsex','sdpclass','sd1')
my_code <- ' 
    data {
    int<lower=0> ntrain;
    int<lower=0> ntest;
    int survived[ntrain];
    int <lower=1,upper=2> sex[ntrain];
    int <lower=1,upper=2> psex[ntest];
    int <lower=1,upper=3> pclass[ntrain];
    int <lower=1,upper=3> ppclass[ntest];

    }
    parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sd1;
    }
    transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
    expect[i] <- inv_logit(
      intercept+
      fsex[sex[i]]+
      fpclass[pclass[i]]
    );
    }
    
    }
    model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sd1 ~ normal(0,3);
    intercept ~ normal(0,1);
    survived ~ bernoulli(expect);
    }
    generated quantities {
    real pred[ntest];
    for (i in 1:ntest) {
    pred[i] <- inv_logit(
    intercept+
        fsex[psex[i]]+
        fpclass[ppclass[i]]

    );
    }
    }
    '

fit1 <- stan(model_code = my_code, 
    data = datain, 
    pars=parameters,
    iter = 1000, 
    chains = 4,
    open_progress=FALSE)

fit1

Second model


datain <- list(
    survived = c(0,1)[train$Survived],
    ntrain = nrow(train),
    ntest=nrow(test),
    sex=c(1:2)[train$Sex],
    psex=c(1:2)[test$Sex],
    pclass=c(1:3)[train$Pclass],
    ppclass=c(1:3)[test$Pclass],
    embarked=c(1:3)[train$Embarked],
    pembarked=c(1:3)[test$Embarked],
    oe=c(1:3)[train$oe],
    poe=c(1:3)[test$oe],
    cabchar=c(1:7)[train$cabchar],
    pcabchar=c(1:7)[test$cabchar],
    age=c(1:10)[train$age],
    page=c(1:10)[test$age],
    ticket=c(1:13)[train$ticket],
    pticket=c(1:13)[test$ticket],
    title=c(1:4)[train$Title],
    ptitle=c(1:4)[test$Title],
    sibsp=c(1:4,rep(4,6))[train$SibSp+1],
    psibsp=c(1:4,rep(4,6))[test$SibSp+1],
    parch=c(1:4,rep(4,6))[train$Parch+1],
    pparch=c(1:4,rep(4,6))[test$Parch+1],
    fare=c(1:5)[train$fare],
    pfare=c(1:5)[test$fare]
)

parameters=c('intercept','sd1',
    'fsex','fpclass','fembarked',
    'foe','fcabchar','fage',
    'fticket','ftitle',
    'fsibsp','fparch',
    'ffare',
    'sdsex','sdpclass','sdembarked',
    'sdoe','sdcabchar','sdage',
    'sdticket','sdtitle',
    'sdsibsp','sdparch',
    'sdfare')
my_code <- ' 
    data {
    int<lower=0> ntrain;
    int<lower=0> ntest;
    int survived[ntrain];
    int <lower=1,upper=2> sex[ntrain];
    int <lower=1,upper=2> psex[ntest];
    int <lower=1,upper=3> pclass[ntrain];
    int <lower=1,upper=3> ppclass[ntest];
    int <lower=1,upper=3> embarked[ntrain];
    int <lower=1,upper=3> pembarked[ntest];
    int <lower=1,upper=3> oe[ntrain];
    int <lower=1,upper=3> poe[ntest];
    int <lower=1,upper=7> cabchar[ntrain];
    int <lower=1,upper=7> pcabchar[ntest];
    int <lower=1,upper=10> age[ntrain];
    int <lower=1,upper=10> page[ntest];
    int <lower=1,upper=13> ticket[ntrain];
    int <lower=1,upper=13> pticket[ntest];
    int <lower=1,upper=4> title[ntrain];
    int <lower=1,upper=4> ptitle[ntest];
    int <lower=1,upper=4> sibsp[ntrain];
    int <lower=1,upper=4> psibsp[ntest];
    int <lower=1,upper=4> parch[ntrain];
    int <lower=1,upper=4> pparch[ntest];
    int <lower=1,upper=5> fare[ntrain];
    int <lower=1,upper=5> pfare[ntest];
    
    }
    parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real fembarked[3];
    real foe[3];
    real fcabchar[7];
    real fage[10];
    real fticket[13];
    real ftitle[4];
    real fparch[4];
    real fsibsp[4];
    real ffare[5];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sdembarked;
    real <lower=0> sdoe;
    real <lower=0> sdcabchar;
    real <lower=0> sdage;
    real <lower=0> sdticket;
    real <lower=0> sdtitle;
    real <lower=0> sdparch;
    real <lower=0> sdsibsp;
    real <lower=0> sdfare;

    real <lower=0> sd1;
    }
    transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
    expect[i] <- inv_logit(
    intercept+
    fsex[sex[i]]+
    fpclass[pclass[i]]+
    fembarked[embarked[i]]+
    foe[oe[i]]+
    fcabchar[cabchar[i]]+
    fage[age[i]]+
    fticket[ticket[i]]+
    ftitle[title[i]]+
    fsibsp[sibsp[i]]+
    fparch[parch[i]]+
    ffare[fare[i]]
    );
    }
    
    }
    model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);
    fembarked ~ normal(0,sdembarked);
    foe ~ normal(0,sdoe);
    fcabchar ~ normal(0,sdcabchar);
    fage ~ normal(0,sdage);
    fticket ~ normal(0,sdticket);
    ftitle ~ normal(0,sdtitle);
    fsibsp ~ normal(0,sdsibsp);
    fparch ~ normal(0,sdparch);
    ffare ~ normal(0,sdfare);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sdembarked ~ normal(0,sd1);
    sdoe ~ normal(0,sd1);
    sdcabchar ~ normal(0,sd1);
    sdage ~ normal(0,sd1);
    sdticket ~ normal(0,sd1);
    sdtitle ~ normal(0,sd1);
    sdsibsp ~ normal(0,sd1);
    sdparch ~ normal(0,sd1);
    sdfare ~ normal(0,sd1);
    sd1 ~ normal(0,1);
    intercept ~ normal(0,1);

    survived ~ bernoulli(expect);
    }
    generated quantities {
    real pred[ntest];
    for (i in 1:ntest) {
    pred[i] <- inv_logit(
    intercept+
    fsex[psex[i]]+
    fpclass[ppclass[i]]+
    fembarked[pembarked[i]]+
    foe[poe[i]]+
    fcabchar[pcabchar[i]]+
    fage[page[i]]+
    fticket[pticket[i]]+
    ftitle[ptitle[i]]+
    fsibsp[psibsp[i]]+
    fparch[pparch[i]]+
    ffare[pfare[i]]

    );
    }
    }
    '

fit1 <- stan(model_code = my_code, 
    data = datain, 
    pars=parameters,
    iter = 1000, 
    chains = 4,
    open_progress=FALSE)

fit1
#plot(fit1,ask=TRUE)
#traceplot(fit1,ask=TRUE)
fit2 <- stan(model_code = my_code, 
    data = datain, 
    fit=fit1,
    pars=c('pred'),
    iter = 2000, 
    warmup =200,
    chains = 4,
    open_progress=FALSE)

fit3 <- as.matrix(fit2)[,-419]
#plots of individual passengers
#plot(density(fit3[,1]))

#plot(density(fit3[,18]))
#plot(density(as.numeric(fit3),adjust=.3))
decide1 <- apply(fit3,2,function(x) mean(x)>.5)
decide2 <- apply(fit3,2,function(x) median(x)>.5)
#table(decide1,decide2)

out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=as.numeric(decide1),
    row.names=NULL)
write.csv(x=out,
    file='stanlin.csv',
    row.names=FALSE,
    quote=FALSE)

3 comments:

  1. I'm trying to get the second model to work but I'm getting errors. Perhaps the Stan format has changed? I put the code on Github. Any ideas?

    https://github.com/melling/Probabilistic_Programming/blob/main/rstan/titanic_kaggle/second_wiek_model.R

    I extracted the RStan code:

    https://github.com/melling/Probabilistic_Programming/blob/main/rstan/titanic_kaggle/second_wiek_model.stan

    ReplyDelete
  2. I had to add this line to get the gender to be translated properly for datain.

    tt$Sex <- factor(tt$Sex)

    Also, needed to change the API call to get it to run.

    model = stan_model("second_wiek_model.stan")

    fit_pred = sampling(model, data_in, warmup=200, iter=2000, chains=4,
    par=c("pred"),
    show_messages=TRUE, verbose=TRUE)

    Created a new file: second_wiek_model_v2.R

    ReplyDelete
    Replies
    1. When placed the Stan code back within the R code (my_code var), both examples worked after I added the line: tt$Sex <- factor(tt$Sex)

      https://github.com/melling/Probabilistic_Programming/blob/main/rstan/titanic_kaggle/first_model.R

      https://github.com/melling/Probabilistic_Programming/blob/main/rstan/titanic_kaggle/second_wiek_model.R

      I leaved these untouched for others.

      Delete