0. Introduction

US Baby Names is a project on Kaggle that explores the naming trends of US-born babies. This R markdown is motivated by curiosity and is intended as a small exercise in time series prediction. Among the many interesting questions that can be answered with this database, the key questions in this report are

  1. how many babies will be born in the coming years;

  2. how to select a suitable model.

To reproduce the results in this R markdown, the SQLite database from the Kaggle page above has to be downloaded and placed in the same folder. Also, forecast v7.1 is required, which includes the ggplot2 functions for ts objects and the corresponding fit objects.
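A quick guard at the top of the script makes this requirement explicit; a minimal sketch using base R's packageVersion:

# stop early if the installed forecast release predates v7.1
stopifnot(packageVersion('forecast') >= '7.1')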

1. Load necessary libraries

sqldf is used to query data using the SQL language. RSQLite is also required to connect to and query the relational database. wordcloud and tm are required to make the wordcloud. reshape2 is used to restructure the data frames into a form suitable for ggplot2 plots.

library(sqldf)  
library(forecast)
library(tseries)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tm)

User-defined ggplot2 theme.

mytheme <- theme(plot.title=element_text(size=16),
                 axis.text=element_text(size=10),
                 axis.title=element_text(size=12),
                 legend.position=c(.15,.8)
                 )

2. Connect to, query, and disconnect from the SQLite database

# connect to the relational database
conn <-  dbConnect(SQLite(),'database.sqlite')
# query the number of newborns and names
db <- dbGetQuery(conn, 
                "select Year,Gender,sum(Count) as bCount, count(*) as nCount 
                from NationalNames 
                group by Year, Gender "
                )
# query names ending with 'ena' and count their appearances
nam  <- dbGetQuery(conn,
                   "select Name, count(*) as count 
                   from NationalNames 
                   where Name like '%ena' 
                   group by Name 
                   order by count desc")
# disconnect from the database
dbDisconnect(conn)

The number of newborns bCount and the number of names nCount are obtained from the first query and stored in the data frame db. The names ending with ena are stored in nam for future use. Note that using SQLite to query the database, instead of loading the whole .csv file, saves memory significantly, which is especially efficient for large data sets.
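As an aside, the same aggregation can be routed through sqldf in a single call; a sketch relying on sqldf's dbname argument, which points it at an existing SQLite file so that no explicit connection management is needed:

# the first query above, run through sqldf against the database file
db2 <- sqldf("select Year,Gender,sum(Count) as bCount, count(*) as nCount 
             from NationalNames 
             group by Year, Gender",
             dbname = 'database.sqlite')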

3. Have some fun first

Make a wordcloud of the names ending with ‘ena’. Not all matched names are included in the wordcloud due to limited space.

nam2 <- nam[c(seq(1,101,20),102:600),]
set.seed(123456) # reproducible
wordcloud(nam2$Name,nam2$count,scale=c(3,0.1),rot.per=0.30, colors = brewer.pal(8, 'Dark2'))

Are you Lena, Filomena or Keena? If yes, congratulations!

4. Look at the data

First of all, let us plot the data. The following plot shows the time series of the number of newborns and the number of different names for males and females. melt from package reshape2 transforms the data frame db into long format, which is easier for ggplot2 to handle.

dblong <- melt(db,id.vars = c('Year','Gender'))
head(dblong)
##   Year Gender variable  value
## 1 1880      F   bCount  90993
## 2 1880      M   bCount 110491
## 3 1881      F   bCount  91954
## 4 1881      M   bCount 100745
## 5 1882      F   bCount 107850
## 6 1882      M   bCount 113688
ggplot(data=dblong,aes(x=Year,y=value,color=Gender)) + 
  geom_line(size=1) +
  facet_grid(variable~.,scales = 'free') +
  labs(y='Count')

Several observations can be drawn:

  1. The number of newborns (bCount) has stagnated since around 1950;

  2. The numbers of newborn boys and girls are almost the same, with slightly more boys than girls;

  3. The number of girl names is significantly higher than that of boy names, evidencing the greater diversity among baby girl names;

  4. The numbers of names and of newborns are positively correlated. Both have a small peak around 1920, whereas the number of newborns experienced a second peak around 1950, corresponding to the baby boomer generation after the Second World War.

Next, we reformat the data frame into the time series format required by the forecast package. bts is the total number of newborns in each year; nts is the total number of names in each year.

dbCount <-sqldf("select Year,sum(bCount) as bCount, sum(nCount) as nCount from db group by Year")
bts <- ts(dbCount$bCount,start=dbCount$Year[1],frequency = 1)
nts <- ts(dbCount$nCount,start=dbCount$Year[1],frequency = 1)

5. Short-term forecast with Arima (0,1,1)

In this section, we use Arima with order (0,1,1) to make the prediction. This model is suggested by the AIC and cross-validation results for short-term forecasts, shown later. We will follow the classical procedure step by step.

  1. Check the stationarity of the time series and determine the order of differencing or the transformation. The time series itself tells us it is not stationary, so we need to take extra measures to obtain a stationary series. Transformation and differencing are the most popular ways to render a time series stationary. Transformation aims to achieve equal variance, which we will not need for the time series considered here (see the quick cross-checks after the ADF output below).

  2. Determine the order. The order of differencing can be found with the function ndiffs.

ndiffs(bts)
## [1] 1
adf.test(diff(bts))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(bts)
## Dickey-Fuller = -3.2889, Lag order = 5, p-value = 0.07611
## alternative hypothesis: stationary
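Two quick cross-checks, sketched here with functions from the already-loaded packages: BoxCox.lambda from forecast estimates a variance-stabilizing transformation (a value near 1 supports skipping the transformation mentioned in step 1), while the KPSS test from tseries swaps the hypotheses of the ADF test, so its null is stationarity.

# lambda close to 1 suggests no Box-Cox transformation is needed
BoxCox.lambda(bts)
# KPSS test (null: stationary) complements the ADF test above
kpss.test(diff(bts))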

ndiffs suggests first-order differencing, and the ADF test suggests the differenced time series is marginally stationary. Next, check the ACF and PACF of the differenced bts.

ggtsdisplay(diff(bts))

From the plots, (p=0, q=1) and (p=1, q=1) seem reasonable choices. The automated process from auto.arima suggests (0,1,1) with drift. We will proceed with this order.
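For completeness, the automated search can be reproduced directly; a one-line sketch whose printed model is expected to match the choice reported above:

# let auto.arima search over (p,d,q) and the drift term itself
auto.arima(bts)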

  3. Let’s forecast!
fit <- Arima(bts,order=c(0,1,1),include.drift = TRUE)
  4. Check the residuals: the ACF and PACF and the Ljung-Box test suggest that we are on the safe side: the residuals are approximately white noise.
ggtsdisplay(fit$residuals)

Box.test(fit$residuals,type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  fit$residuals
## X-squared = 0.04586, df = 1, p-value = 0.8304
  5. Now, we plot the data and the 6-step forecast. Exciting?!
autoplot(forecast(fit,6))+mytheme
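To put numbers on the plot, and thereby answer the first question from the introduction, the forecast object can simply be printed; a quick sketch (output omitted, since it just restates the fitted model’s point forecasts and intervals):

# point forecasts with 80% and 95% intervals for the next 6 years
print(forecast(fit,6))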

Is the prediction reasonable? It depends on one’s interpretation. To me, it is not bad.

Next, let’s look at the performance of different models.

6. Time series h-step cross validation (out-of-sample)

First, we define a function to obtain the prediction errors from the “rolling forecast origin” cross-validation procedure. The getError function returns the h-step forecast relative errors for a specific model. Since the training set rolls one step forward in each iteration, the function returns a matrix of (n-k-h+1) rows and h columns, where k is the minimal number of observations in the training set.

getError <- function(tserie,fun='ets', model='ZZZ',order=c(1,1,0),k=50,h=1,fixTrainLen = TRUE){
  n <- length(tserie)
  e <- matrix(NA,n-k-h+1,h)  # one row per rolling origin
  for (i in k:(n-h))
  {
    # fixed-length training window rolls forward; otherwise it expands from the series start
    tstart <- ifelse(fixTrainLen,start(tserie)[1]+i-k,start(tserie)[1])
    train <- window(tserie,start=tstart,end=start(tserie)[1]+i-1)
    test <- window(tserie,start=end(train)[1]+1,end=end(train)[1]+h)
    if (fun == 'ets'){
      fit <- match.fun(fun)(train,model= model) 
    } 
    else if (fun == 'Arima'){
      fit <- match.fun(fun)(train,order=order,include.drift=TRUE)        
    }
    else {
      fit <- match.fun(fun)(train)
    }
    fc <- forecast(fit,h)$mean
    e[i-k+1,] <- (test-fc)/test # relative error
  }
  
  return(e)
}

The training set can have either a variable (expanding) or a fixed (rolling) length.

Let’s have a look at the prediction error. We use ets as an example to evaluate the 10-step forecast performance. AAN is chosen because the automated ets fit on the whole data set suggests this model, and we can see from the data that no seasonal pattern exists.

# 10-step prediction
e <- getError(bts,fun = 'ets',model = 'AAN',k=50,h=10)

# define the error plotting function
eplot <- function(e){
  e <- data.frame(e)
  names(e) <- 1:length(e)
  elong <- melt(e)
  ggplot(elong,aes(x=variable,y=value)) + 
    geom_boxplot(fill='pink',color='black',notch = TRUE) +
    #  geom_point(position = 'jitter',color='darkgreen',alpha=.5) +
    labs(x='Prediction steps', y='Prediction out-of-sample relative error')
}

# plot the absolute relative error
eplot(abs(e))

Plotted is the boxplot of the absolute relative error. The median of the absolute prediction error increases with the number of prediction steps, and so does the variance. Excluding outliers, the relative error for the one-step prediction is within about 5%.

Next, we define a cross-validation (CV) score function. The mean of the absolute relative error is defined here as the CV score; the formula below makes this precise. The following functions compare several models from the ets and Arima families and plot the CV scores.
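In symbols, with rolling origins $t = k, \dots, n-h$ (so $N = n-k-h+1$ of them, matching the rows returned by getError) and $\hat{y}_{t+h|t}$ denoting the $h$-step forecast made at origin $t$, the CV score at horizon $h$ is

$$\mathrm{CV}(h) = \frac{1}{N}\sum_{t=k}^{n-h}\left|\frac{y_{t+h}-\hat{y}_{t+h|t}}{y_{t+h}}\right|,$$

which is exactly what colMeans(abs(e)) computes column by column.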

cvscore <- function(tseri,k=50,h=10,fixLen=TRUE){
  cvscore <- data.frame(steps=1:h)
  e <- getError(tseri,fun='ets',model='AAN',k=k,h=h,fixTrainLen = fixLen)
  cvscore$ets1 <- colMeans(abs(e),na.rm=TRUE)
  cvscore$ets1SD <- apply(abs(e),2,sd)
  e <- getError(tseri,fun='ets',model='ANN',k=k,h=h,fixTrainLen = fixLen)
  cvscore$ets2 <- colMeans(abs(e),na.rm=TRUE)
  cvscore$ets2SD <- apply(abs(e),2,sd)
  e <- getError(tseri,fun='Arima',order = c(0,1,1),k=k,h=h,fixTrainLen = fixLen)
  cvscore$arima1 <- colMeans(abs(e),na.rm=TRUE)
  cvscore$arima1SD <- apply(abs(e),2,sd)
  e <- getError(tseri,fun='Arima',order = c(1,1,0),k=k,h=h,fixTrainLen = fixLen)
  cvscore$arima2 <- colMeans(abs(e),na.rm=TRUE)
  cvscore$arima2SD <- apply(abs(e),2,sd)
  e <- getError(tseri,fun='Arima',order = c(1,1,1),k=k,h=h,fixTrainLen = fixLen)
  cvscore$arima3 <- colMeans(abs(e),na.rm=TRUE)
  cvscore$arima3SD <- apply(abs(e),2,sd)
  
  return(cvscore)
}

# function that plots the comparative CV score data frame
cvplot <- function(cvdf){
  ns <- length(cvdf$steps)
  ccode <- c('ets AAN'='black','ets ANN'='red',
             'Arima (0,1,1)'='blue','Arima (1,1,0)'='brown',
             'Arima (1,1,1)'='darkgreen')
  cvplot <- ggplot(cvdf,aes(steps)) + 
    geom_line(aes(y=ets1,color=names(ccode)[1])) + 
    geom_point(aes(y=ets1)) +
    geom_errorbar(aes(ymin=ets1-ets1SD,ymax=ets1+ets1SD),width=.5) +
    geom_line(aes(y=ets2,color=names(ccode)[2])) +
    geom_line(aes(y=arima1,color=names(ccode)[3])) +
    geom_line(aes(y=arima2,color=names(ccode)[4])) +
    geom_line(aes(y=arima3,color=names(ccode)[5])) +
    scale_x_continuous(breaks=seq(ns)) +
    scale_color_manual(name='Fit Type',values=ccode) +
    labs(y='CV scores')
  
  return(cvplot)
}

Calculate the CV scores and plot the results.

cvs <- cvscore(bts)
cvplot(cvs) + mytheme

Except for the ets AAN model, the rest perform similarly. The error bars correspond to the standard deviation of the ets AAN model predictions. Statistically, it is hard to conclude which one is the best, since all of them lie within the statistical uncertainty.

7. Comparing other measures of model performance (in-sample)

In this block, we compare three other measures (AIC, RMSE and MAPE) on the whole data set bts, in order to see whether the results are consistent with those from the cross validation. All the fits are automated within each family, and hence the AIC score is already minimal within each family. The four families of methods are ets, arima, tbats and nnetar. No AIC score is available for the nnetar method since it is not based on a maximum-likelihood procedure.

fitets <- ets(bts)
fitarima <- auto.arima(bts)
fittbats <- tbats(bts)
fitnnetar <- nnetar(bts)
accuMeasure <- data.frame(
  method=c('ets','arima','tbats','nnetar'),
  aic = c(fitets$aic,fitarima$aic,fittbats$AIC,NA),
  rmse = c(accuracy(fitets)[2],accuracy(fitarima)[2],
           accuracy(fittbats)[2],accuracy(fitnnetar)[2]),
  mape = c(accuracy(fitets)[5],accuracy(fitarima)[5],
           accuracy(fittbats)[5],accuracy(fitnnetar)[5])
  )
measlong <- melt(accuMeasure,id.vars = 'method')
ggplot(measlong,aes(x=method,y=value,fill=method)) + 
  geom_bar(stat = 'identity',alpha=0.5) +
  facet_grid(variable~.,scales = 'free') +
  labs(x='family of methods', y='In-sample error measure')

According to the AIC and RMSE scores, arima is the best, whereas ets has the smallest MAPE. This inconsistency reflects the complexity of model selection: no single accuracy measure determines model performance. The selection has to be based on domain knowledge and many other factors. Currently, cross validation is considered the most reliable way to evaluate model performance.
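As a side note, later releases of forecast (v8.0 and above, so newer than the v7.1 used here; treat this as an assumption about your installed version) ship a tsCV function that automates the same rolling-origin procedure as getError. A one-step sketch:

# rolling-origin one-step errors for the ets AAN model, summarized as an RMSE
fets <- function(y, h) forecast(ets(y, model = 'AAN'), h = h)
e_cv <- tsCV(bts, fets, h = 1)
sqrt(mean(e_cv^2, na.rm = TRUE))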

8. More data is better than less data

One might think it is more accurate to use only the latest data, where the number of newborns has roughly stagnated. The following is an example using the data starting from 1960.

btsshort <- window(bts,start=1960)
ndiffs(btsshort)
## [1] 1
adf.test(diff(btsshort))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(btsshort)
## Dickey-Fuller = -3.752, Lag order = 3, p-value = 0.02867
## alternative hypothesis: stationary
ggtsdisplay(diff(btsshort))

fit <- Arima(btsshort,order=c(0,1,1),include.drift = TRUE)
ggtsdisplay(fit$residuals)

Box.test(fit$residuals,type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  fit$residuals
## X-squared = 0.21928, df = 1, p-value = 0.6396
autoplot(forecast(fit,6))+mytheme

Is it better? Probably not. In fact, we can check the cross-validation score using training sets with fixed and variable lengths.

e <- data.frame(steps=1:10)
e$varLength <- colMeans(abs(getError(bts,fun='ets',model = 'AAN',h=10,fixTrainLen = FALSE)))
e$fixLength <- colMeans(abs(getError(bts,fun='ets',model = 'AAN',h=10,fixTrainLen = TRUE)))
elong <- melt(e,id='steps')
ggplot(elong, aes(x=steps,y=value,color=variable, shape=variable)) + 
  geom_point(size=2) + geom_line() +
  labs(y='Out-of-sample absolute relative error') +
  mytheme

The results show that using more data (the variable-length, expanding training set) achieves a lower CV score.

9. A question regarding Rob’s one-step forecast method

Last, we check the out-of-sample one-step forecast proposed by Rob Hyndman using ets or Arima: http://robjhyndman.com/hyndsight/out-of-sample-one-step-forecasts/

train <- window(bts,end=2000)
h <- 1
test <- window(bts,start=2001,end=2001+h-1)
fit <- ets(train)
forecast(fit,h)$mean
## Time Series:
## Start = 2001 
## End = 2001 
## Frequency = 1 
## [1] 3804610
fitted(ets(test,model=fit))
## Time Series:
## Start = 2001 
## End = 2001 
## Frequency = 1 
## [1] 3740300

It seems the out-of-sample forecast from ets(test,model=fit) is not the same as the prediction from forecast(fit,h). Am I wrong? If you have the answer, I would appreciate it if you let me know.
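One plausible explanation, offered as an assumption rather than a verified answer: when ets is handed only the test window, it re-estimates the initial smoothing states on those few points, whereas forecast(fit,h) starts from the states at the end of the training data. A sketch of a workaround, assuming your forecast version supports the use.initial.values argument of ets, is to refit on the combined series while keeping both the parameters and the initial states of fit:

# keep the trained parameters and initial states, let the states evolve
# through the training portion, then read off the test-period fitted values
fit2 <- ets(bts, model = fit, use.initial.values = TRUE)
window(fitted(fit2), start = 2001)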

10. Take-home message

Thank you for your persistence till the end.

Finally, let’s go back to our original questions.

  1. Prediction. The prediction with Arima(0,1,1) with drift based on the whole data set seems to give a reasonable increasing trend. Most models considered here give predictions within the statistical confidence interval, but do not capture the increasing trend.

  2. Model selection. From the above tests, including cross validation, one cannot say which model is the best. Some have good short-term prediction performance while others have better long-term accuracy. Some are easier to interpret while others are more accurate. What is clear is that more data reduces the prediction error systematically.

Below are some lessons that I have learned from this exercise.

  1. No perfect model. Model performance is data- and question-dependent;

  2. More data is better than less data;

  3. RSQLite is a friend for memory efficiency, while sqldf complements the built-in indexing of data frames fairly well.

  4. Jupyter notebook has a different R version than RStudio. They do not communicate. :( Will they become friends soon? I hope!

  5. When you connect to a relational database and find that the opened connection is surprisingly empty, check the current working directory and the path to the file.

  6. Use a reference handbook and google.com interactively during the learning-by-doing journey.

Before drawing to an end, I’d like to leave you and myself a question, which kept appearing during the course of this project:

How does the prediction from the forecast package make sense in real-world applications, given that 1) the prediction curve is relatively simple and 2) the confidence interval of the forecast is so wide even for the one-step forecast?

Should you have any questions or comments, please do not hesitate to contact me.