0. Introduction

Titanic: Machine Learning from Disaster is a beginner-friendly Kaggle challenge whose goal is to predict which passengers survived (or were more likely to survive) based on background information such as age, sex, cabin class and ticket. Since many excellent tutorials (in both Python and R) are available for this challenge, it is highly recommended for beginners, which is how I came to it. It is a binary classification problem (survived or not). The CSV training and test datasets can be downloaded from the competition page.

1. Libraries

library(randomForest) # random Forest
library(party)        # conditional inference trees and forests
library(e1071)        # support vector machine
library(mice)         # multiple imputation
library(ggplot2)      # nice plots

2. Check the data

Load the CSV datasets and combine them into one dataset for preprocessing, excluding the response variable Survived.

trainData <- read.csv('train.csv')
testData <- read.csv('test.csv')
allData <- rbind(trainData[,-2],testData)

Check the feature space and the missing values.

str(allData)
## 'data.frame':    1309 obs. of  11 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
nrow(allData[!complete.cases(allData),])
## [1] 264

Check the pattern of missing values.

md.pattern(allData)
##      PassengerId Pclass Name Sex SibSp Parch Ticket Cabin Embarked Fare
## 1045           1      1    1   1     1     1      1     1        1    1
##  263           1      1    1   1     1     1      1     1        1    1
##    1           1      1    1   1     1     1      1     1        1    0
##                0      0    0   0     0     0      0     0        0    1
##      Age    
## 1045   1   0
##  263   0   1
##    1   1   1
##      263 264

There are 263 missing values in Age and 1 missing value in Fare.

3. Handling missing values

Missing values can usually be dealt with in three ways: listwise deletion, multiple imputation, and rational approaches. Since some of the missing values occur in the test dataset (where rows cannot simply be dropped) and the affected features are, intuitively, quite relevant to the prediction, a rational approach is employed to fill them in. For Fare, the single missing value is replaced with the median Fare of the associated Pclass, since the two variables are clearly correlated.

# correlation
cor(allData[!is.na(allData$Fare),]$Pclass,allData[!is.na(allData$Fare),]$Fare)
## [1] -0.5586287
# boxplot
ggplot(data = allData,aes(x=factor(Pclass),y=Fare,fill=factor(Pclass))) + geom_boxplot(notch = FALSE)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

# replace missing
allData[is.na(allData$Fare),]$Fare <-  median(allData[allData$Pclass==3,]$Fare,na.rm = TRUE)

For the feature Age, many entries in the Kaggle forum use multiple imputation from the mice package, e.g. imp <- mice(allData, seed = 123). Alternatively, the missing values are often replaced by the median age within the corresponding Title class, where Title is a feature derived from the Name variable. Since Title and Age are correlated to some extent, it is reasonable to infer the age in this fashion, and it may even be more appropriate here.
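
For reference, here is a minimal sketch of that mice-based alternative (not run in this report), slightly adjusted to drop the high-cardinality identifier columns so that only informative predictors drive the imputation model; the name allDataImp is just illustrative.

# multiple imputation with mice (alternative approach, not used below)
imp <- mice(allData[, !(names(allData) %in% c('PassengerId','Name','Ticket','Cabin'))],
            seed = 123)
allDataImp <- complete(imp)       # first completed dataset
# allData$Age <- allDataImp$Age   # one would keep only the imputed Age values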

Let’s first create the Title feature.

allData$Title <- sub('.*, (\\w+)\\. .*','\\1',allData$Name)
allData[!(allData$Title %in% c('Miss','Mr','Mrs','Master')),]$Title <- 'Respected'
table(allData$Sex,allData$Title)
##         
##          Master Miss  Mr Mrs Respected
##   female      0  260   0 197         9
##   male       61    0 757   0        25
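
As a quick sanity check, the regular expression keeps the word between the comma and the following period; applied to the first passenger of the training set it gives:

sub('.*, (\\w+)\\. .*','\\1','Braund, Mr. Owen Harris')
## [1] "Mr"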

Replace the missing ages with the median age of the corresponding Title class. Using the median instead of the mean reduces the influence of outliers.

for (ttl in levels(factor(allData$Title))){
  allData[(is.na(allData$Age)) & (allData$Title == ttl),]$Age <- 
    median(allData[allData$Title==ttl,]$Age,na.rm = TRUE)
}

Now, let us check that all missing values are gone.

sum(is.na(allData))
## [1] 0

4. Feature Selection

Feature selection is a key but tricky step in the learning process, involving a fair amount of “black art” and domain knowledge. In this exercise, a very simple feature selection strategy is adopted. Besides the Title already extracted from Name, family size is another feature frequently used in the Kaggle forum; it is the sum of SibSp and Parch plus one (the passenger him/herself).

allData$FamilySize <- allData$Parch + allData$SibSp + 1

Not all features are useful for the prediction (e.g., PassengerId and Name), while others are made redundant by FamilySize (SibSp and Parch). In the end, 7 features are retained for this exercise and the corresponding train and test datasets are created.

myfeatures <- c('Pclass','Sex','Age','Fare','Embarked','Title','FamilySize')
allData$Pclass <- factor(allData$Pclass) # as factor
allData$Title <- factor(allData$Title)   # as factor
train <- cbind(allData[1:nrow(trainData),myfeatures],trainData['Survived'])
test <- allData[(nrow(trainData)+1):nrow(allData),myfeatures]
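
As a quick sanity check of the split (the training set should retain all 891 labelled passengers and the test set the remaining 418):

dim(train)   # expected: 891 rows, 7 features plus Survived
dim(test)    # expected: 418 rows, 7 features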

5. Fit the models

Three classifiers are considered: random forest, conditional inference forest, and a support vector machine with a radial kernel. Due to the presence of (multi-)collinearity, linear models such as logistic regression or linear discriminant analysis are not considered in this exercise. Let’s start with the random forest. To find good values for the parameters mtry and ntree, 10-fold cross validation is performed for parameter tuning.

set.seed(66)
fit.tune <- tune.randomForest(factor(Survived)~.,data = train,mtry=c(2:5),ntree = c(500,1000,1500,2000))
summary(fit.tune)
## 
## Parameter tuning of 'randomForest':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  mtry ntree
##     2   500
## 
## - best performance: 0.1649688 
## 
## - Detailed performance results:
##    mtry ntree     error dispersion
## 1     2   500 0.1649688 0.02743717
## 2     3   500 0.1750562 0.02687145
## 3     4   500 0.1817728 0.03264138
## 4     5   500 0.1851311 0.03098648
## 5     2  1000 0.1694507 0.02315271
## 6     3  1000 0.1761923 0.03034424
## 7     4  1000 0.1840075 0.03271581
## 8     5  1000 0.1840075 0.03095327
## 9     2  1500 0.1672035 0.02544027
## 10    3  1500 0.1761673 0.03105381
## 11    4  1500 0.1840200 0.02969082
## 12    5  1500 0.1884894 0.03068226
## 13    2  2000 0.1750562 0.03121808
## 14    3  2000 0.1750437 0.03200101
## 15    4  2000 0.1862547 0.03521310
## 16    5  2000 0.1873783 0.03555929

The best model from the 10-fold cross validation is selected to predict the test dataset. Its accuracy on the training data (the correct classification rate, computed here from the out-of-bag predictions) is shown. The relative importance of the features is also plotted, indicating that Title, Fare and Sex are the most important ones.

fit.rf <- fit.tune$best.model
mean(fit.rf$predicted==train$Survived)
## [1] 0.8338945
varImpPlot(fit.rf)
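
For a text-based view of the same information, the numeric importance scores behind the plot (mean decrease in Gini, the randomForest default) can also be printed:

importance(fit.rf)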

A similar procedure (including cross validation) could be applied to the conditional inference forest and the support vector machine. To save time, fixed parameters are used to fit these models. Build the conditional inference forest classifier.

set.seed(66)
fit.cf <- cforest(factor(Survived)~., data=train,
                   controls = cforest_unbiased(ntree=2000, mtry=3))
fit.cf
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  2000 
## 
## Response:  factor(Survived) 
## Inputs:  Pclass, Sex, Age, Fare, Embarked, Title, FamilySize 
## Number of observations:  891
pred.cf <- predict(fit.cf)
mean(pred.cf==train$Survived)
## [1] 0.8518519
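
The conditional inference forest provides its own permutation-based variable importance via varimp from the party package; a small optional check (it can be slow with 2000 trees):

sort(varimp(fit.cf), decreasing = TRUE)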

Build the support vector machine classifier.

set.seed(66)
#fit.tune <- tune.svm(factor(Survived)~.,data=train, kernel="radial",
#                      gamma=10^(-2:2),cost=10^(-2:4))
#fit.svm <- fit.tune$best.model
fit.svm <- svm(factor(Survived)~.,data=train,
               kernel="radial",gamma=0.1,cost=1)
summary(fit.svm)
## 
## Call:
## svm(formula = factor(Survived) ~ ., data = train, kernel = "radial", 
##     gamma = 0.1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
## 
## Number of Support Vectors:  398
## 
##  ( 200 198 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
mean(fit.svm$fitted==train$Survived)
## [1] 0.8372615

6. Prediction

Perform the prediction using the classifiers above and save the results in the format required by Kaggle. The function kagglePred defined below generates the predictions and writes them to a CSV file.

kagglePred <- function(myfit, test, filename, ...){
  mypred <- predict(myfit, test, ...)
  # Kaggle expects exactly two columns: PassengerId and Survived
  myresult <- data.frame(PassengerId = testData$PassengerId,
                         Survived = mypred)
  write.csv(myresult, file = filename, row.names = FALSE)
}

Use kagglePred to predict.

kagglePred(fit.rf,test,'rf.csv')
kagglePred(fit.cf,test,'cf.csv',OOB=TRUE,type="response")
kagglePred(fit.svm,test,'svm.csv')

The Kaggle submission scores for the above predictions from random forest, conditional inference forest and support vector machine are 0.77512, 0.80861 and 0.78947, respectively. These are all lower than the training accuracies but preserve the accuracy order among the three methods. Note that the win of the conditional inference forest here does not necessarily generalize to other situations.

7. Discussion and Conclusion

Motivated by “learning by doing”, the Titanic Kaggle challenge was taken up here as a small exercise in classification methodology. The problem was tackled in small steps: loading the datasets, exploratory data analysis, missing value handling, feature selection, model fitting and prediction. Three different classifiers, namely random forest, conditional inference forest and support vector machine, were considered, with some parameters tuned by 10-fold cross validation. In this particular case, the conditional inference forest seems to outperform the other two, which nevertheless does not guarantee any generalization.

Feature selection is one of the key elements of the prediction. Several different combinations, though not presented in this report, were tried and gave quite different results. Some tutorials in the Kaggle forum that use different feature spaces improve the leaderboard ranking considerably. Which features are most relevant to the prediction accuracy still feels like a piece of black art; more systematic methods for feature selection are desired and should be on the learning list.
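
As one step towards a more systematic approach, the randomForest package offers rfcv, which estimates the cross-validated prediction error as the number of retained features shrinks; a minimal sketch on the training data built above (the object name cv.features is just illustrative):

set.seed(66)
cv.features <- rfcv(trainx = train[, myfeatures],
                    trainy = factor(train$Survived),
                    cv.fold = 10)
cv.features$error.cv   # error rate for each candidate number of features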

Multicollinearity is mysterious. It seems intuitive to remove redundant features that are highly correlated, such as Title and Sex, or Pclass and Fare. However, including all of them in the classifiers considered here improves the prediction accuracy. Nevertheless, in linear models such as logistic regression, (multi-)collinearity does matter: it inflates the standard errors of the coefficients and hence decreases the power of the hypothesis tests on the coefficients (the probability of rejecting the null hypothesis when it is false).
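
To make this point concrete, a minimal sketch (assuming the car package is installed; it is not loaded in Section 1) fits a logistic regression on the same features and inspects the generalized variance inflation factors; large values for Sex/Title and Pclass/Fare would confirm the redundancy:

# illustration only: generalized VIFs from a logistic regression on the same features
library(car)                       # not among the libraries loaded above
fit.logit <- glm(Survived ~ ., data = train, family = binomial)
vif(fit.logit)                     # GVIF values well above ~5 flag strong collinearity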