Predicting Credit Card Default Status

A Logistic Regression Model using R.

Playing with a logistic regression model on the Default dataset
R
Stats
ML
Author

Kittipos Sirivongrungson

Published

May 4, 2022

My motivation and the code I use here are adapted from Dr. Julia Silge's blog post on bird baths. I've applied a similar modeling process to the Default dataset from the {ISLR} package [1].

The goal is to build a logistic regression model to predict default status.

Explore Data

Code
library(tidyverse)
library(ISLR)
theme_set(theme_bw())

Let's take a look at the Default data set. It has 2 numeric variables (balance and income) and 2 factor variables (default and student).

Code
summary(Default)
 default    student       balance           income     
 No :9667   No :7056   Min.   :   0.0   Min.   :  772  
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
                       Median : 823.6   Median :34553  
                       Mean   : 835.4   Mean   :33517  
                       3rd Qu.:1166.3   3rd Qu.:43808  
                       Max.   :2654.3   Max.   :73554  

A more detailed summary can be obtained with skimr::skim():

Code
skimr::skim(Default)
Data summary
Name Default
Number of rows 10000
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
default 0 1 FALSE 2 No: 9667, Yes: 333
student 0 1 FALSE 2 No: 7056, Yes: 2944

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
balance 0 1 835.37 483.71 0.00 481.73 823.64 1166.31 2654.32 ▆▇▅▁▁
income 0 1 33516.98 13336.64 771.97 21340.46 34552.64 43807.73 73554.23 ▂▇▇▅▁

Please note that there are no missing values, which is great!

Goal

The goal here is to build a logistic regression model to predict default.

Plot Relationship

I will explore the relationship between default and the other variables (income, balance, student). Let's make some plots!

Code
Default %>% 
  ggplot(aes(balance, income, color = default)) + 
  geom_point(alpha = 0.4) +
  scale_color_brewer(palette = "Set1", direction = -1) +
  labs(title = "Default status by income and balance") 

By visual inspection, balance looks like a better predictor for default than income.
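
A quick numeric companion to this impression, as a small sketch using dplyr (already loaded via {tidyverse}): comparing group means shows that balance differs far more between the two default classes than income does.

Code
# Mean balance and income within each default class
Default %>% 
  group_by(default) %>% 
  summarise(across(c(balance, income), mean))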

Code
p1 <- Default %>% 
  ggplot(aes(balance, default, fill = student)) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Distribution of default",
       subtitle = "by balance and student status",
       caption = "Data from ISLR package") 
  
p1 

This plot shows that balance and student each seem to be decent predictors of default.
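
To back the boxplot up with a number, here is a small sketch of the raw default rate by student status:

Code
# Raw default rate within each student group
Default %>% 
  group_by(student) %>% 
  summarise(n = n(), default_rate = mean(default == "Yes"))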

Simple Model

Note

Goal: Outcome variable = default (factor)

First, I will use the glm() function to build a logistic regression model using all predictors.
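
In symbols, this full model expresses the log-odds of default as a linear function of all three predictors:

$$
\log \frac{p(\text{default})}{1 - p(\text{default})} = \beta_0 + \beta_1\,\text{studentYes} + \beta_2\,\text{balance} + \beta_3\,\text{income}
$$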

Code
glm_fit_all <- glm(default ~ ., data = Default, family = binomial)
summary(glm_fit_all)

Call:
glm(formula = default ~ ., family = binomial, data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

The coefficient of income is not significant (p = 0.71). Let's try removing income.

Code
glm_fit_st_bal <- glm(default ~ student + balance, 
    data = Default, 
    family = binomial)
glm_fit_st_bal

Call:  glm(formula = default ~ student + balance, family = binomial, 
    data = Default)

Coefficients:
(Intercept)   studentYes      balance  
 -10.749496    -0.714878     0.005738  

Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
Null Deviance:      2921 
Residual Deviance: 1572     AIC: 1578

The model that includes only student and balance has a lower estimated test error (i.e., lower AIC and BIC).

Code
list("All Predictors" = glm_fit_all,
     "student + balance" = glm_fit_st_bal) %>% 
  map_dfr( 
    broom::glance, 
    .id = "Model") %>% 
  select(Model, AIC, BIC)
# A tibble: 2 × 3
  Model               AIC   BIC
  <chr>             <dbl> <dbl>
1 All Predictors    1580. 1608.
2 student + balance 1578. 1599.
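
As a quick sanity check (a sketch that relies on the fact that, for ungrouped binomial data, the deviance equals -2 times the log-likelihood), the AIC reported above is simply the residual deviance plus twice the number of estimated coefficients:

Code
# AIC = residual deviance + 2 * (number of estimated coefficients)
deviance(glm_fit_st_bal) + 2 * length(coef(glm_fit_st_bal))
AIC(glm_fit_st_bal)  # should agree with the line above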

Modeling Process

Now, let's apply the full modeling process with tidymodels.

Code
library(tidymodels)

Split Data

First, we need to spend our data budget in 2 portions: training and testing data.

Code
set.seed(123)
# Split to Test and Train
Default_split <- initial_split(Default, strata = default)

Default_train <- training(Default_split) # 75% to training data
Default_test <- testing(Default_split) # 25% to test data

Default_split
<Analysis/Assess/Total>
<7500/2500/10000>

Resampling Method

I will use 10-Fold Cross-Validation to resample the training data.

Code
set.seed(234)

Default_folds <- vfold_cv(Default_train, v = 10, strata = default)
Default_folds 
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [6750/750]> Fold01
 2 <split [6750/750]> Fold02
 3 <split [6750/750]> Fold03
 4 <split [6750/750]> Fold04
 5 <split [6750/750]> Fold05
 6 <split [6750/750]> Fold06
 7 <split [6750/750]> Fold07
 8 <split [6750/750]> Fold08
 9 <split [6750/750]> Fold09
10 <split [6750/750]> Fold10

Note that I use strata = default because the class frequencies are quite different (i.e., the outcome is imbalanced).

Code
Default %>% count(default)
  default    n
1      No 9667
2     Yes  333
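
As a small sketch of what stratification buys us, the Yes/No proportions in the training and testing sets both stay close to the overall 3.3% default rate:

Code
# Class proportions in the stratified training and testing sets
bind_rows(
  train = count(Default_train, default),
  test  = count(Default_test, default),
  .id = "set"
) %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))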

For each fold, 6,750 rows are used to fit the model and 750 rows are used to assess its performance:

Code
Default_folds$splits[[1]] 
<Analysis/Assess/Total>
<6750/750/7500>

Specification

Model Specification

Code
glm_spec <- logistic_reg()
glm_spec
Logistic Regression Model Specification (classification)

Computational engine: glm 

Feature Engineering Specification

Code
rec_basic <- recipe(default ~ ., data = Default) %>% 
  step_dummy(all_nominal_predictors())

summary(rec_basic)
# A tibble: 4 × 4
  variable type    role      source  
  <chr>    <chr>   <chr>     <chr>   
1 student  nominal predictor original
2 balance  numeric predictor original
3 income   numeric predictor original
4 default  nominal outcome   original

Preview of engineered training data

You can see below that student is replaced by student_Yes with 0/1 encoding. What does that encoding mean?

Code
rec_basic %>% 
  prep(log_changes = TRUE) %>% 
  bake(new_data = Default_train)
step_dummy (dummy_72FQt): 
 new (1): student_Yes
 removed (1): student
# A tibble: 7,500 × 4
   balance income default student_Yes
     <dbl>  <dbl> <fct>         <dbl>
 1    730. 44362. No                0
 2    817. 12106. No                1
 3   1074. 31767. No                0
 4    529. 35704. No                0
 5    786. 38463. No                0
 6    826. 24905. No                0
 7    809. 17600. No                1
 8   1161. 37469. No                0
 9      0  29275. No                0
10      0  21871. No                1
# … with 7,490 more rows

From the contrasts() function we can see the dummy encoding of the student factor: No = 0, Yes = 1.

Code
contrasts(Default$student)
    Yes
No    0
Yes   1

Workflow

workflow() bundles the feature engineering blueprint and the model specification together.

Code
wf_basic <- workflow(rec_basic, glm_spec)
wf_basic
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

Fit Model to Resampled Data

Code
doParallel::registerDoParallel()

ctrl_preds <- control_resamples(save_pred = TRUE)
## Fit
rs_basic <- fit_resamples(wf_basic, resamples =  Default_folds, control = ctrl_preds)
Code
head(rs_basic)
# A tibble: 6 × 5
  splits             id     .metrics         .notes           .predictions      
  <list>             <chr>  <list>           <list>           <list>            
1 <split [6750/750]> Fold01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>
2 <split [6750/750]> Fold02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>
3 <split [6750/750]> Fold03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>
4 <split [6750/750]> Fold04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>
5 <split [6750/750]> Fold05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>
6 <split [6750/750]> Fold06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [750 × 6]>

ROC Curve

From the ROC curve we can see that it bulges nicely toward the upper-left corner.

Code
augment(rs_basic) %>% 
  roc_curve(truth = default, .pred_No) %>% 
  autoplot()

A curve like this corresponds to an AUC (area under the ROC curve) and an accuracy close to 1.

Code
collect_metrics(rs_basic)
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.974    10 0.00194 Preprocessor1_Model1
2 roc_auc  binary     0.953    10 0.00461 Preprocessor1_Model1
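
One caveat, shown as a small sketch below: with classes this imbalanced, accuracy is a weak yardstick on its own, because always predicting the majority class "No" already scores about 0.967, so the ROC AUC is the more informative of the two metrics here.

Code
# Accuracy of the no-information classifier that always predicts "No"
Default_train %>% 
  summarise(no_information_accuracy = mean(default == "No"))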

Improve the Model

As we saw at the beginning, removing income from the predictors results in a better estimated test error by AIC and BIC.

Now, I will remove income from the recipe:

Code
rec_simple <- rec_basic %>% 
  remove_role(income, old_role = "predictor")

summary(rec_simple)
# A tibble: 4 × 4
  variable type    role      source  
  <chr>    <chr>   <chr>     <chr>   
1 student  nominal predictor original
2 balance  numeric predictor original
3 income   numeric <NA>      original
4 default  nominal outcome   original

And I will update the workflow:

Code
wf_simple <- workflow(rec_simple, glm_spec)

The rest of the process is the same, so I wrote a simple wrapper function to do it.

Code
fit_resampled <- function(wf) {
  # Fit a workflow to the cross-validation folds, saving predictions
  ctrl_preds <- control_resamples(save_pred = TRUE)
  fit_resamples(wf, resamples = Default_folds, control = ctrl_preds)
}

rs_simple <- fit_resampled(wf_simple)
Code
rs_ls <- list("All Predictors" = rs_basic,
              "student + balance" = rs_simple)
Code
roc_df <- rs_ls %>% 
  map_dfr(~augment(.x) %>% roc_curve(truth = default, .pred_No),
      .id = "Predictors"
      )

ROC curve of the improved model

This plot compares the ROC curves of the 2 logistic regression models.

  • The red line shows the model that uses all predictors: student, balance, and income.
  • The blue line shows the model that uses 2 predictors: student and balance.
Code
roc_df %>% 
  ggplot(aes(1-specificity, sensitivity, color = Predictors)) +
  geom_line(size = 1, alpha = 0.6, linetype = "dashed") +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted")

Code
rs_ls %>% 
  map_dfr(
    collect_metrics, .id = "Features"
  )
# A tibble: 4 × 7
  Features          .metric  .estimator  mean     n std_err .config             
  <chr>             <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 All Predictors    accuracy binary     0.974    10 0.00194 Preprocessor1_Model1
2 All Predictors    roc_auc  binary     0.953    10 0.00461 Preprocessor1_Model1
3 student + balance accuracy binary     0.974    10 0.00194 Preprocessor1_Model1
4 student + balance roc_auc  binary     0.953    10 0.00461 Preprocessor1_Model1

You can see that the simpler model results in a similar (or maybe slightly better) AUC, so it's reasonable to prefer it over the more complicated model.

Evaluate Model

In this section, I will evaluate the simpler model with 2 predictors (student, balance).

Fit to Training data

First, fit the model to the training data.

Code
Default_fit <- fit(wf_simple, Default_train)
Default_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)      balance  student_Yes  
  -10.85340      0.00578     -0.71313  

Degrees of Freedom: 7499 Total (i.e. Null);  7497 Residual
Null Deviance:      2158 
Residual Deviance: 1139     AIC: 1145
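
As a sketch (assuming workflows' extract_fit_parsnip() helper and broom's tidy() method for glm fits), the coefficients of the trained workflow can be pulled out and exponentiated into odds ratios:

Code
# Tidy the underlying glm fit; exponentiate = TRUE turns log-odds into odds ratios
Default_fit %>% 
  extract_fit_parsnip() %>% 
  tidy(exponentiate = TRUE, conf.int = TRUE)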

Use Testing data to Predict

Then, use the test data to predict default.

Code
Default_pred <- 
  augment(Default_fit, Default_test) %>% 
  bind_cols(
    predict(Default_fit, Default_test, type = "conf_int")
  )
Code
library(latex2exp)
p2 <- Default_pred %>% 
  ggplot(aes(balance, .pred_Yes, fill = student)) +
  geom_line(aes(color = student)) +
  geom_ribbon(aes(ymin = .pred_lower_Yes, ymax = .pred_upper_Yes),
              alpha = 0.3) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  labs(y = TeX("$\\hat{p}(default)$"),
       caption = "Shaded area indicates the 95% CI of the estimated probability") +
  labs(title = "Predicted probability of default", 
       subtitle = "by logistic regression with 2 predictors")

p2 

This plot shows the estimated probability of default, given the predictors student and balance, from the logistic regression model fitted on the training data. The predictions were made by plugging the test data into the model.
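
To make the link to the fitted coefficients explicit, here is a small sketch that plugs a hypothetical balance of 1500 (an arbitrary illustrative value) into the inverse-logit by hand, for a non-student and a student, using the coefficients printed above (extract_fit_engine() from workflows returns the underlying glm object):

Code
# Inverse-logit by hand, using the trained coefficients
b <- coef(extract_fit_engine(Default_fit))

plogis(b["(Intercept)"] + b["balance"] * 1500)                      # non-student, roughly 0.10
plogis(b["(Intercept)"] + b["balance"] * 1500 + b["student_Yes"])   # student, roughly 0.05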

Summary

Summary Plots

The left-hand plot shows default status by balance and student status, as observed in the Default data set. After a multivariable logistic regression model (default on balance and student) was fitted to the training data, the predicted probability of default from that model is shown in the right-hand plot.

References

[1]
G. James, D. Witten, T. Hastie, and R. Tibshirani, ISLR: Data for an Introduction to Statistical Learning with Applications in R, 2021. https://www.statlearning.com.