# Introduction

The dataset used in this project was obtained from the UCI Machine Learning Repository. It relates to direct telemarketing campaigns of a Portuguese banking institution, collected from 2008 to 2013. The aim is to assess whether the product (a bank term deposit) would be subscribed based on the predictors. There are 16 predictors (9 categorical and 7 numeric) and 1 response (has the client subscribed a term deposit?). Some of the predictors are:

1. Month: last contact month of year
2. Duration: last contact duration, in seconds
3. Day: last contact day of the month

These variables exist because it is common to contact a prospective customer several times before they subscribe to the bank term deposit.

First, the data were loaded to check the number of observations.

```r
library(readr)

# Loading data
myfile <- "../../static/data/bank-full.csv"
full_dat <- read_delim(myfile, ";", escape_double = FALSE, trim_ws = TRUE)

# Convert the categorical columns to factors
cols <- c(2:5, 7:9, 11, 16, 17)
full_dat[, cols] <- data.frame(apply(full_dat[, cols], 2, as.factor))

nrow(full_dat)
```

```
## [1] 45211
```

# Data Manipulation and Descriptive Analysis

Descriptive analyses were run to look at the overall proportion of yes and no responses in the dataset, to check the proportion of missing data, and to examine the proportion of responses conditional on a particular variable of interest.
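The overall response proportions can be obtained with `prop.table()`; a minimal sketch, with a toy vector standing in for `full_dat$y`:

```r
# Toy response vector standing in for full_dat$y
y <- c(rep("no", 8), rep("yes", 2))

# Proportion of each class
prop.table(table(y))  # no: 0.8, yes: 0.2
```

On the real data this would simply be `prop.table(table(full_dat$y))`.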

## Missing data, NAs

```r
# Explore the NAs on a copy of the dataset
dat_NA <- full_dat
# Substitute "unknown" entries with NA
dat_NA[dat_NA == "unknown"] <- NA

library(VIM)
aggr_plot <- aggr(dat_NA, col = c("cornflowerblue", "firebrick1"),
                  numbers = TRUE, sortVars = TRUE, labels = names(full_dat),
                  cex.axis = .7, gap = 3,
                  ylab = c("Proportion of missing data", "Maps"))
```

```
##  Variables sorted by number of missings:
##   Variable       Count
##   poutcome 0.817478047
##    contact 0.287983013
##  education 0.041074075
##        job 0.006370131
##        age 0.000000000
##    marital 0.000000000
##    default 0.000000000
##    balance 0.000000000
##    housing 0.000000000
##       loan 0.000000000
##        day 0.000000000
##      month 0.000000000
##   duration 0.000000000
##   campaign 0.000000000
##      pdays 0.000000000
##   previous 0.000000000
##          y 0.000000000
```

About 80% of the predictor poutcome, 30% of contact, 4% of education, and 0.6% of job have missing data (see Figure 1). 37369 rows would have to be deleted if row-wise NA deletion were executed. The other option is to remove entirely the variables that contain NAs, namely poutcome, contact, education, and job. However, this is not desirable because we don't know whether those variables are important. In short, deletion is not an ideal approach.

This leads to another possibility: imputation. One of the easiest methods is KNN imputation. The algorithm identifies the k closest observations to the one being imputed and takes a weighted average based on distance. Yet imputing ~80% of the data for the poutcome variable is not sensible at all.
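For illustration, the VIM package used above for the missingness plot also provides a `kNN()` function. A minimal sketch on a toy data frame (not run on the real data, for the reason just given):

```r
library(VIM)

# Toy data frame with one missing value
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(10, 20, NA, 40, 50))

# Impute y from the k = 2 nearest observations
imputed <- kNN(d, variable = "y", k = 2)
imputed$y
```

`kNN()` also appends indicator columns (e.g. `y_imp`) flagging which values were imputed.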

Since all of the missing data occur in categorical variables, the strategy used was to treat "unknown" as a category in itself (ESL II, p. 332).
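In base R the same idea can be expressed by turning NA into an explicit factor level; a sketch using the built-in `addNA()`:

```r
# Sketch: treat NA as an explicit "unknown" level of a factor
x <- factor(c("telephone", "cellular", NA, "cellular"))
x <- addNA(x)                            # NA becomes a regular level
levels(x)[is.na(levels(x))] <- "unknown"
levels(x)  # "cellular" "telephone" "unknown"
```

In this project the step is not needed: "unknown" was read in as an ordinary string, so it already became a factor level during the conversion above.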

## Class Imbalance

```r
counts <- table(full_dat$y)
barplot(counts, col = c("darkblue", "red"),
        legend = rownames(counts), main = "Term Deposit")
```

Due to the highly imbalanced proportion of responses (as shown in Figure 2) and the large sample size of 45211, the dataset was subsetted at random while controlling the proportion of responses to be equal. This process is known as undersampling. From there, a dataset of size N = 2130 was drawn at random.

```r
N <- 2130
# Subset the data into "yes" responses and "no" responses
yes_dat <- full_dat[full_dat$y == "yes", ]
```
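The remaining undersampling step might look like this: a sketch with a toy data frame standing in for `full_dat`, sampling an equal number of rows per class (the seed is hypothetical):

```r
set.seed(1001)  # hypothetical seed, for reproducibility

# Toy data frame standing in for full_dat
toy <- data.frame(y = rep(c("no", "yes"), times = c(90, 10)),
                  x = rnorm(100))
n_per_class <- 5  # would be N / 2 = 1065 on the real data

yes_dat <- toy[toy$y == "yes", ]
no_dat  <- toy[toy$y == "no", ]

# Draw equal-sized random samples from each class and combine
dat <- rbind(yes_dat[sample(nrow(yes_dat), n_per_class), ],
             no_dat[sample(nrow(no_dat), n_per_class), ])
table(dat$y)  # 5 of each class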

## Decision Tree fitting

A Decision Tree (DT) was used to fit the data. A decision tree partitions the predictor space and, for classification, predicts the most common class within each partition. The splitting rule is chosen by minimizing the classification error rate while controlling the size of the tree (i.e., the number of terminal nodes). The following image is obtained from ISL.

```r
library(tree)

# Fitting the tree
treemod <- tree(y ~ ., subset = train, data = dat)
sum_tree <- summary(treemod)
err_tree <- sum_tree$misclass[1] / sum_tree$misclass[2]
```

The misclassification error was 0.1802817. However, this is the training error rate, which can be optimistically biased.

## Cross-validation, Tree-pruning and Training Errors

Cross-validation was used to choose the best number of terminal nodes and the decision tree was pruned.

```r
set.seed(1001)
cvtree <- cv.tree(treemod, FUN = prune.misclass)
choose_nodes <- cbind(cvtree$size, cvtree$dev)
colnames(choose_nodes) <- c("Number of terminal nodes", "CV error")
choose_nodes
```

```
##      Number of terminal nodes CV error
## [1,]                       12      248
## [2,]                       10      248
## [3,]                        8      244
## [4,]                        7      247
## [5,]                        4      273
## [6,]                        2      312
## [7,]                        1      818
```

```r
plot(cvtree$size, cvtree$dev, type = "b",
     xlab = "Number of terminal nodes",
     ylab = "cross-validation error", col = "purple4")
```

As shown in Figure 5, the best number of terminal nodes was 8 because it gives the lowest CV error. The tree was pruned accordingly.

```r
prunetree <- prune.misclass(treemod, best = 8)
sum_prune <- summary(prunetree)
err_prune <- sum_prune$misclass[1] / sum_prune$misclass[2]
cbind(err_tree, err_prune)
```

```
##       err_tree err_prune
## [1,] 0.1802817 0.1849765
```

The misclassification error of the pruned tree is higher in this case. However, note that this is still only a training error.

## Testing Error and Prediction Accuracy

The test error was also inspected.

```r
# Test performance
tree.pred <- predict(prunetree, x.test, type = "class")
table(tree.pred, y.test)
```

```
##          y.test
## tree.pred  no yes
##       no  414  85
##       yes 130 436
```

The confusion matrix above shows that the fitted tree tends to predict a yes response to the term subscription slightly more often than a no. The test error rate was 0.202; in other words, the prediction accuracy was 0.798.
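The quoted error rate can be reproduced directly from the confusion matrix (off-diagonal counts over the total):

```r
# Confusion matrix from the output above (columns are the true labels)
conf <- matrix(c(414, 130, 85, 436), nrow = 2,
               dimnames = list(tree.pred = c("no", "yes"),
                               y.test   = c("no", "yes")))

# Error = 1 - proportion of correct (diagonal) predictions
test_err <- 1 - sum(diag(conf)) / sum(conf)
round(test_err, 3)  # 0.202
```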

## Interpretation

The visualization of the pruned decision tree is shown below.

```r
plot(prunetree)
text(prunetree, pretty = 0)
```

The key variables shown in Figure 6 are:

• duration: last contact duration, in seconds
• contact: contact communication type (“unknown”, “telephone”, “cellular”)
• poutcome: outcome of the previous marketing campaign (“unknown”,“other”,“failure”,“success”)
• month: last contact month of year
• housing: has housing loan? (yes or no)

The decision tree predicts that when the representative's last phone call with the client lasted 692 (174.5 + 517.5) seconds or more, the client would subscribe to the term deposit. If the last contact duration was less than 174.5 seconds, the client would reject the offer.

If the last contact duration was between 174.5 and 517.5 seconds and the contact communication type was unknown, the model predicts a no to the subscription. One interpretation is that the bank representatives were reluctant to record the communication type after being rejected.

If the communication type was known and the previous marketing outcome was "failure" or "unknown", the client is predicted to subscribe to the term deposit. This makes sense because it is less likely for a person to subscribe twice.

Interestingly, if the previous marketing outcome was "success" or "other", the client is likely to subscribe if the phone call was made at the end of a quarter (i.e., March, June, September, or December).

The model further predicts that the marketing campaign would succeed if the client had no housing loan. Even with a housing loan, however, it is possible to persuade the client to subscribe to the term deposit if the last contact duration was longer than 370 seconds.