This repository has been archived by the owner on Aug 16, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path04-dataset.Rmd
743 lines (534 loc) · 32.3 KB
/
04-dataset.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
# Dataset
```{r 04-ds-clean-load,echo=FALSE,child="CleanLoad.Rmd"}
```
The data is sourced as a _SQLite_ database that downloaded from teh Kaggle website [@KaggleLC] and imported as a `tibble dataframe` with the `RSQLite` package. The variables were reformatted according to their respective types. The full list of variable is given in Appendix (see Table \@ref(tab:08-variable-description)). This dataset will be reduced as we focused on the core intent of modeling the probability of default.
Note that the dataset was anonymised (all identifying ID numbers are deleted) and we therefore removed the corresponding empty columns from the dataset. Since the identification `ID`s have been removed to anonymise the dataset, we cannot see if a borrower borrowed several times.
## Preamble
The LendingClub dataset, although rich, is difficult to interpret. The only explanation of what the variables mean comes from a spreadsheet attached to the dataset. The explanations are not precise and/or subject to conflicting interpretation. Despite serching the LendingClub website, no further original information was found. We collected a number of reasonable assumptions in Appendix (see subsection \@ref(sec:list-assumptions) in Appendix).
The dataset has been used a number of times in the past by various people. One paper [@kim2019ensemble] mentions they used a dataset that included 110 variables, which is less than ours with 145 variables. It is therefore clear that the dataset has changed over time in ways we do not know. For example, have loans been excluded because the full 145 veriables were not available?
## General presentation
The original dataset is large: it includes `r nrow(lending_club)` loan samples, each containing `r ncol(lending_club)` variables (after the identification variables filled with null values). The loans were issued from `r min(lending_club$issue_d)` to `r max(lending_club$issue_d)`.
### Business volume
The dataset represents a total of ca.$34bln in loan principals, which is a substantial share of the total amount stated to have been intermediated to date by LC (publicly reported to be $50bln+). About 55%/60% of the portfolio is not current anymore (either fully or partially repaid). See Table \@ref(tab:04-loan-per-status).
```{r 04-loan-per-status}
local({
nloans <- nrow(lending_club)
lending_club %>%
group_by(loan_status) %>%
summarise(Number = n()) %>%
mutate(Proportion = round(100 * Number / nloans, 3)) %>%
kable(
"latex",
caption = "Number of loans per status",
booktabs = T,
# longtable = T,
col.names = c("Loan status", "Count", "Proportion (%)")
) %>%
kable_styling(latex_options = c("repeat_header")) %>%
column_spec(1, width = "8.5cm") %>%
column_spec(2, width = "2.5cm") %>%
column_spec(3, width = "3.5cm")
})
```
Figure \@ref(fig:04-business-volume-per-month) plots the number, volume (cumulative principal amount) and average principal per loan. It shows that the business grew exponentially (in the common sense of the word) from inception until 2016. At this point, according to Wikipedia ^[source: https://en.wikipedia.org/wiki/LendingClub - Retrieval date 15 September 2019]:
```{r 04-business-volume-per-month,fig.cap="Business volume written per month",fig.fullwidth=TRUE}
volume1 <- lending_club %>%
mutate(Monthly = ceiling_date(issue_d, unit = "month")) %>%
group_by(Monthly) %>%
summarise(Number_of_written_loans = n()) %>%
ggplot(aes(Monthly, Number_of_written_loans)) +
geom_line(color = "blue") +
ylab("Number of loans") + geom_smooth(color = "red", method = "loess")
volume2 <- lending_club %>%
mutate(Monthly = ceiling_date(issue_d, unit = "month")) %>%
group_by(Monthly) %>%
summarise(Volume = sum(funded_amnt) / 1000000) %>%
ggplot(aes(Monthly, Volume)) +
geom_line(color = "blue") +
ylab("Volume of loans ($mln)") + geom_smooth(color = "red", method = "loess")
volume3 <- lending_club %>%
mutate(Monthly = ceiling_date(issue_d, unit = "month")) %>%
group_by(Monthly) %>%
summarise(Average = mean(funded_amnt)) %>%
ggplot(aes(Monthly, Average)) +
geom_line(color = "blue") +
ylab("Average loan size ($ per loan)") + geom_smooth(color = "red", method = "loess")
grid.arrange(volume1, volume2, volume3, nrow = 1)
```
" _Like other peer-to-peer lenders including Prosper, Sofi and Khutzpa.com, LendingClub experienced increasing difficulty attracting investors during early 2016. This led the firm to increase the interest rate it charges borrowers on three occasions during the first months of the year. The increase in interest rates and concerns over the impact of the slowing United States economy caused a large drop in LendingClub's share price._"
The number and volume of loans plotted have been aggregated by month. The growth is very smooth in the early years, and suddenly very volatile. As far as the first part of the dataset is concerned, a starting business could expect to be volatile and could witness a yearly cycle (expected from economic consumption figures) superimposed on the growth trend. This is not the case.
An interesting metric is that the average principal of loans has increased (see RHS Figure \@ref(fig:04-business-volume-per-month), on a sample of 100,000 loans). Partly, the increase in the early years could be interpreted success in improving marketing, distribution capabilities and confidence building. This metric plateau-ed in 2016 and decreased afterwards, but to a much lesser extent than the gross volume metrics. However, it is more volatile than the two previous metrics in the early years.
By the end of the dataset, those metrics have essentially recovered to their 2016 level.
### Loan lifecyle and status
In the dataset, less loans are still outstanding than matured or "_charged off_" (term that LC use to mean partially or fully written off, i.e. there are no possibilty for LC and/or the investors to receive further payments). The share of outstanding loans is:
```{r 04-share-of-live-portfolio}
# Share of portfolio currently 'live'
local({
n_current <- lending_club %>%
filter(!(
loan_status %in% c(
"Charged Off",
"Does not meet the credit policy. Status:Charged Off",
"Does not meet the credit policy. Status:Fully Paid",
"Fully Paid"
)
)) %>%
nrow()
n_samples <- lending_club %>% nrow()
cat("Share of current loans = ",
round(100 * n_current / n_samples, 3),
"%")
})
```
The dataset describes the life cycle of a loan. In the typical (ideal) case, we understand it to be:
$$
\text{Loan is approved} \rightarrow \text{Full amount funded by investors} \rightarrow \text{Loan marked as Current} \rightarrow \text{Fully Paid}
$$
In the worst case, it is:
$$
\text{Loan is approved} \rightarrow \text{Full amount funded by investors} \rightarrow \text{Loan marked as Current} \rightarrow
$$
$$
\rightarrow \text{Grace period (missed payments under 2 weeks)} \rightarrow \text{Late 15 to 31 days} \rightarrow
$$
$$
\rightarrow \text{Late 31 to 120 days} \rightarrow \text{Default} \rightarrow \text{Charged Off}
$$
Note that _Default_ precedes and is distinct from _Charged Off_ ^[ See LendingClub FAQ at [https://help.lendingclub.com/hc/en-us/articles/215488038] and help page [https://help.lendingclub.com/hc/en-us/articles/216127897-What-happens-when-a-loan-is-charged-off-] ]. A couple of things could happen to a loan in default:
+ LC and the borrower restructure the loan with a new repayment schedule, where the borrower may repay a lesser amount over a longer period; or,
+ the claim could be sold to a debt recovery company that would buy the claim from LC/investors. This would be the final payment (if any) received by LC and the investors.
The dataset also describes situations where a borrower negotiated a restructuring of the repayment schedule in case of unexpected hardship (e.g. disaster, sudden unemployment).
Note that this progression of distinguishing default (event in time) from actual financial loss mirrors what banks and rating agencies do. The former is called the _Probability of Default_ (PD), the latter _Loss Given Default_ (LGD). Ratings change over time (in a process resembling Markov Chains transitions). LGD show some correlations with ratings. The dataset, although detailed, does not include the full life of each loan to conduct this sort of analysis (change of loan quality over time). This is an important reason why we decided to focus on the loan approval and expected return.
### Loan application
Before a loan is approved, the borrower undergoes a review process that assess his/her capacity to repay. This includes:
+ employment situation and income, as well whether this income and possibly its source has been independently verified;
+ whether the application is made jointly (likely with a partner or a spouse, but there are no details);
+ housing situation (owner, owner with current mortgage, rental) and in which county he/she lives (that piece of information is partially anonymised by removing the last 2 digits of the borrower's zipcode);
+ the amount sought, its tenor and the purpose of the loan; and,
+ what seems to be previous credit history (number of previous deliquencies). The dataset is very confusing in that regard: in the case of the __joint__ applicant, it is clear that such information relates to before the loan is approved . In the case of the __principal borrower__ however, the variable descriptions could be read as being pre-approval information, or post-approval gathered during the life of the loan. We have assumed that the information related to the principal borrower is also pre-approval. We also used _Sales Supplements_ from the LC website^[See https://www.lendingclub.com/legal/prospectus] that describe some of the information provided to investors. LendingClub also provides a summary description of its approval process in its regulatory filings with the Securities Exchange Commission [@LC201908S3].
## Rates
### IRR and required credit margins {#sec:feature-engineering}
Figure \@ref(fig:04-credit-margin-over-time) shows the evolution of credit margins over time grouped by ratings. The plots are made with a random sample of 300,000 loans.
```{r 04-credit-margin-over-time,fig.cap="Credit margins per grade over time"}
creditMargin36 <-
LoansMargin %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(creditMargin) & (term == 36)) %>%
ggplot(aes(issue_d, int_rate, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("Credit margin per grade over time \n 3-year loans") +
xlab("Funding date") +
ylab("Interest rate")
creditMargin60 <-
LoansMargin %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(creditMargin) & (term == 60)) %>%
ggplot(aes(issue_d, int_rate, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("Credit margin per grade over time \n 5-year loans") +
xlab("Funding date") +
ylab("Interest rate")
grid.arrange(creditMargin36, creditMargin60, nrow = 1)
```
We notice long periods where certain margins remain fairly stable which indicate that _both_ the initial pricing was constant _and_ that the proportion of default remains very low.
The graphs offer considerations that are relevant to the modeling:
+ The margins clearly change over time. To the extent that they reflect a change in probability of default, the predictions will require to account for time (probably in a non-linear fashion).^[Note that we will add the second and third power of time (measure in months) to create this non-linearity. This will be the only real feature engineering that will be performed on the dataset.]
+ For a given rating, the margins widen and narrow over time. The changes happen in multiples that depends on the ratings:
- For high quality / low margin loans: the changes are multiples of the margin, for example going from roughly 3% to 6/7%.
- Although the range of change is wide, those changes do not happen very often, especially in the later years.
- By comparison, for low quality / high margin loans, the range of change is proportionally smaller, but more frequent and volatile.
+ In other words, the relation between loan quality (its rating) and its pricing (the credit margin) will significantly non-linear.
```{r 04-IRR-over-time,fig.cap="Credit margins per grade over time"}
IRR36 <-
LoansIRR %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(IRR) & (term == 36)) %>%
ggplot(aes(issue_d, IRR, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("IRR per grade over time \n 3-year loans") +
xlab("Funding date") +
ylab("IRR")
IRR60 <-
LoansIRR %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(IRR) & (term == 60)) %>%
ggplot(aes(issue_d, IRR, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("IRR per grade over time \n 5-year loans") +
xlab("Funding date") +
ylab("IRR")
grid.arrange(IRR36, IRR60, nrow = 1)
```
```{r 04-CM-over-time,fig.cap="Credit margins per grade over time"}
CM36 <-
LoansMargin %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(creditMargin) & (term == 36)) %>%
ggplot(aes(issue_d, creditMargin, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("IRR per grade over time \n 3-year loans") +
xlab("Funding date") +
ylab("Credit Margin")
CM60 <-
LoansMargin %>%
sample_n(300000) %>%
left_join(loans, term, by = "loanID") %>%
filter(is.numeric(creditMargin) & (term == 60)) %>%
ggplot(aes(issue_d, creditMargin, col = grade)) +
geom_point(alpha = 0.25, size = 0.4) +
ggtitle("IRR per grade over time \n 5-year loans") +
xlab("Funding date") +
ylab("Credit Margin")
grid.arrange(CM36, CM60, nrow = 1)
```
### Choice of predictors
Because we are interested decisions made prior to invest, we will limit the predictors to those that are realistically available prior to funding. We also remove information that is provided as a result of LC's own credit analysis (e.g. grade and interest rate).
### Interest rates
Based on this information, the loan is approved or not. Approval includes the final amount (which could be lower than the amount requested), tenor (3 or 5 years) and a rating similar to those given to corporate borrowers. Unlike corporate borrowers however, the rating mechanically determines the rate of interest according to a grid known to the borrower in advance^[https://www.lendingclub.com/investing/investor-education/interest-rates-and-fees]. The rates have changed over time. Those changes where not as frequent as market conditions (e.g. changes in Federal Reserve Bank's rates)^[Corporate borrowers would negociate interest margins on a case-by-case basis despite similar risk profiles.].
Figure \@ref(fig:04-interest-rate-table2) ^[source: https://www.lendingclub.com/investing/investor-education/interest-rates-and-fees] shows the predetermined interest rate depending on the initial rating as of July 2019.
```{r 04-interest-rate-table2,fig.cap="Interest rates given rating",out.width="70%",out.height="70%"}
knitr::include_graphics("images/interest-rates-jul2019.png", auto_pdf = TRUE)
```
At the date of this report, the ratings range from `A` (the best) down to `D`, each split in 5 sub-ratings. In the past, LC has also intermediated loans rated F or G (until 6 November 2017) and E (until 30 June 2019) ^[See https://www.lendingclub.com/info/demand-and-credit-profile.action]. This explains that such ratings are in the dataset. We will assume that the ratings in the dataset are the rating at the time of approval and that, even if loans are re-rated by LC, the dataset does not reflect it.
Figures \@ref(fig:04-interest-over-time) shows the change in interest rate over time for different ratings and separated for each tenor. (Each figure is on a sample of 100,000 loans.) For each rating, we can see several parallel lines which correspond to the 5 sub-rating of each rating. We note that the range of interest rates has substantial widened over time. That is, the risk premium necessary to attract potential investors has had to substantially increase. In the most recent years, the highest rates exceed 30% which is higher than many credit cards.3-year loans are naturally considered safer (more A-rated, less G-rated). Identical ratings attract identical rates of interest.
```{r 04-interest-over-time,fig.cap="Interest rate per grade over time"}
interest36 <- loans %>%
sample_n(100000) %>%
filter(term == 36) %>%
ggplot(aes(issue_d, int_rate, col = grade)) +
geom_point(alpha = 0.15) +
ggtitle("Interest rate per grade over time \n 3-year loans") +
xlab("Funding date") +
ylab("Interest rate")
interest60 <- loans %>%
sample_n(100000) %>%
filter(term == 60) %>%
ggplot(aes(issue_d, int_rate, col = grade)) +
geom_point(alpha = 0.15) +
ggtitle("Interest rate per grade over time \n 5-year loans") +
xlab("Funding date") +
ylab("Interest rate")
grid.arrange(interest36, interest60, nrow = 1)
```
By comparison, we plot the 3-year (in red) and 5-year (in blue) bank swap rates in Figure \@ref(fig:04-swap-rates). We see that the swap curve has flattened in recent times (3-year and 5-y rates are almost identical). We also can see that in broad terms the interest rates charged reflect those underlying swap rates. It is more relevant to examine the credit margins excluding swap rates.
```{r 04-swap-rates,fig.cap="Historical Swap Rates"}
RATES %>%
filter(DATE >= ymd("2008-01-01")) %>%
ggplot(aes(DATE, RATE3Y, RATE5Y)) +
geom_line(aes(DATE, RATE3Y), col = "red") +
geom_line(aes(DATE, RATE5Y), col = "blue")
```
Figures \@ref(fig:04-margin-over-time) shows the change in credit margin over time for different ratings and separated for each tenor. (Each figure is on a sample of 100,000 loans.) As above, for each rating, we can see several parallel lines which correspond to the 5 sub-rating of each rating. We note that the range of credit margins has widened over time but less than the interest rates. Identical ratings attract identical credit margins.
```{r 04-margin-over-time,fig.cap="Credit margins per grade over time"}
margin36 <-
loans %>%
filter(term == 36) %>%
sample_n(100000) %>%
select(issue_d, int_rate, grade) %>%
rename(DATE = issue_d) %>%
left_join(RATES, by = "DATE", copy = TRUE) %>%
mutate(Margin = int_rate - RATE3Y) %>%
ggplot(aes(DATE, Margin, col = grade)) +
geom_point(alpha = 0.15) +
ggtitle("Credit margin per grade over time \n 3-year loans") +
xlab("Funding date") +
ylab("Credit margin")
margin60 <-
loans %>%
filter(term == 60) %>%
sample_n(100000) %>%
select(issue_d, int_rate, grade) %>%
rename(DATE = issue_d) %>%
left_join(RATES, by = "DATE", copy = TRUE) %>%
mutate(Margin = int_rate - RATE5Y) %>%
ggplot(aes(DATE, Margin, col = grade)) +
geom_point(alpha = 0.15) +
ggtitle("Credit margin per grade over time \n 5-year loans") +
xlab("Funding date") +
ylab("Credit margin")
grid.arrange(margin36, margin60, nrow = 1)
```
### Purpose
When applying, a potential borrower must state the purpose of the loan. As shown in table \@ref(tab:04-loans-per-purpose), by far the main purpose is the consolidation of existing debts.
```{r 04-loans-per-purpose}
local({
nloans <- nrow(lending_club)
loans %>%
group_by(purpose) %>%
summarise(Number = n()) %>%
arrange(desc(Number)) %>%
kable(
"latex",
caption = "Number of loans per purpose",
booktabs = T,
# longtable = T,
col.names = c("Borrowing purpose", "Count")
) %>%
kable_styling(latex_options = c("repeat_header")) %>%
column_spec(1, width = "8.5cm") %>%
column_spec(3, width = "3.5cm")
})
```
```{r 04-credit-margin-hist,fig.cap="Histograms of credit margins per purpose"}
loans %>%
left_join(LoansMargin, by = "loanID") %>%
select(purpose, creditMargin) %>%
sample_n(300000) %>%
filter(!is.na(creditMargin)) %>%
ggplot(aes(creditMargin)) +
geom_histogram()
```
```{r 04-credit-margin-boxplot,fig.cap="Boxplots of credit margins per purpose"}
loans %>%
left_join(LoansMargin, by = "loanID") %>%
select(purpose, creditMargin) %>%
filter(!is.na(creditMargin)) %>%
ggplot(aes(purpose, creditMargin)) +
geom_boxplot(stat = "boxplot", notch = TRUE, outlier.alpha = 0.1) +
stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
coord_flip()
```
### Payments
The loans are approved for only two tenors, 3 and 5 years, with monthly repayments. Installments are calculated easily with the standard formula:
$$
Installment = Principal \times rate \times \frac{1}{1 - \frac{1}{(1+rate)^N}}
$$
Where $Principal$ is the amount borrowed, $rate = \frac{\text{Quoted Interest Rate}}{12}$ is the monthly interest rate, and $N$ is the number of installments (36 or 60 monthly payments). The following piece of code shows that the average error between this formula and the dataset value is about 2 cents. We therefore precisely understand this variable.
```{r 04-installment-amount-error,echo=TRUE,eval=FALSE}
local({
installmentError <- loans %>%
mutate(
PMT = round(funded_amnt * int_rate / 12 / (1 - 1 / (1 + int_rate / 12) ^
term), 2),
PMT_delta = abs(installment - PMT)
) %>%
select(PMT_delta)
round(mean(100 * installmentError$PMT_delta), digits = 2)
})
```
## Net present value
The behaviour of the NPV of loan losses is informative.
### Average NPV and credit margin by subgrade
Figure \@ref(fig:04-NPV-subrating) shows that as ratings worsen, the average NPV^[The averages are _not_ weighted by loan amount since an investor can invest in $25 parcels. Weighting would have been appropriate if investors were instead forced to invest in the whole amount.] expressed as a portion of the funded amount decreases. For the best quality loans, we see that the NPV exceeds 1.00 = 100%: at a risk-free rate^[Discounting a cash flow at the IRR gives a nil NPV. Discounting at a higher rate, resp. lower, gives a negative, resp. positive, NPV.], investors receive more than what is necessary to compensate for credit loss and can use the excess to cover additional costs mentioned in the Preamble. As ratings worsen, the NPV drops down to about 50%.
If loans were adequately priced, the excess returns (thanks to higher interest) should on average offset credit losses, that is an NPV average should be at least 100%. This seems to be the case down to ratings of about `D4`. Further down, credit losses become too frequent and/or too substantial to be covered on average. We posit that this justified rejecting loans applications rated `E1` and below.
```{r 04-NPV-subrating,fig.cap="Average NPV et credit margin (\\%) depending on sub-rating"}
loansBySubGrade <-
loansWorkingSet %>%
filter(!is.na(creditMargin)) %>%
group_by(sub_grade) %>%
summarise(NumberOfLoans = n(),
AverageNPV = mean(NPVLossPCT),
AverageCreditMargin = mean(creditMargin))
p1 <- loansBySubGrade %>%
ggplot(aes(sub_grade, AverageNPV)) +
geom_point(col = "blue")
p2 <- loansBySubGrade %>%
ggplot(aes(sub_grade, AverageCreditMargin)) +
geom_point(col = "blue")
gridExtra::grid.arrange(p1, p2, ncol = 1)
```
### Distribution of principal losses by rating
We here consider nominal losses, that is not accounting for time effects (discounting).
Figure \@ref(fig:04-loss-distribution-subrating) shows that for a given grade, the losses are very widely spread. The loans are grouped by ratings and loans that have been fully repaid are removed.
Setting aside the loans rated "A" or "B", the distributions seem log-normal. Unsurprisingly, the worse the rating the larger the principal loss.
```{r 04-loss-distribution-subrating,fig.cap="Distribution of the Principal Loss (\\%) depending on rating \n(y-axis square-root scaling)"}
Set <- loansWorkingSet %>%
select(grade, funded_amnt, total_rec_prncp) %>%
filter(total_rec_prncp < funded_amnt - 1) %>%
mutate(PrincipalLoss = (funded_amnt - total_rec_prncp)/funded_amnt)
adjustSpread <- 0.6
rangeMin <- 0.0
rangeMax <- 1.0
Set %>% ggplot(aes(PrincipalLoss, col = grade)) +
geom_density(adjust = adjustSpread) +
scale_x_continuous(limits = c(rangeMin, rangeMax)) +
#scale_y_sqrt() +
xlab("Loss as proportion of princial amount")
```
### NPV distribution by rating
Principal loss does not reflect the timing of that loss: for the same dollar amount, a loss now is worse than a loss later. This subsection looks at the NPVs of actual loan cashflow (principal and interest) discounted the risk-free rate.
Figure \@ref(fig:04-NPV-distribution-subrating) shows that for a given grade, the NPVs are very widely spread. From top to bottom, loans are group by ratings: from quality ratings of `A` and `B`, average ratings of `C` and `D`, to poor ratings of `E` and below. From left to right, we focus on different parts of how NPVs are distributed. Note that each graph is based on a random sample of 100,000 loans (about 1/12th of the original set) and therefore the NPV densities are comparable from graph to graph. This spread is expected. The amount of loss for any loan can be anything a single dollar and 100%.
```{r 04-NPV-distribution-subrating,fig.cap="Distribution of NPV (\\%) depending on rating \n(y-axis square-root scaling)"}
setGood <- loansWorkingSet %>%
filter(grade %in% c("A", "B")) %>%
sample_n(100000) %>%
filter(!is.na(NPVLossPCT))
setMid <- loansWorkingSet %>%
filter(grade %in% c("C", "D")) %>%
sample_n(100000) %>%
filter(!is.na(NPVLossPCT))
setLow <- loansWorkingSet %>%
filter(grade %in% c("E", "F", "G")) %>%
sample_n(100000) %>%
filter(!is.na(NPVLossPCT))
ratingGood <-
setGood %>%
# filter(between(NPVLossPCT, -1, -0.00)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, 1.5)) +
scale_y_sqrt() +
xlab(label = "")
ratingMid <-
setMid %>%
# filter(between(NPVLossPCT, -1, -0.00)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, 1.5)) +
scale_y_sqrt() +
xlab(label = "")
ratingLow <-
setLow %>%
# filter(between(NPVLossPCT, -1, -0.00)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, 1.5)) +
scale_y_sqrt() +
xlab("")
rating2Good <-
setGood %>%
filter(between(NPVLossPCT, -0.2, +0.50)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-0.20, +0.50)) +
scale_y_sqrt() +
xlab(label = "")
rating2Mid <-
setMid %>%
filter(between(NPVLossPCT, -0.20, +0.50)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-0.20, +0.50)) +
scale_y_sqrt() +
xlab(label = "")
rating2Low <-
setLow %>%
filter(between(NPVLossPCT, -0.20, +0.50)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-0.20, +0.50)) +
scale_y_sqrt() +
xlab("")
rating3Good <-
setGood %>%
filter(between(NPVLossPCT, -1, -0.25)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, -0.25)) +
scale_y_sqrt() +
xlab(label = "")
rating3Mid <-
setMid %>%
filter(between(NPVLossPCT, -1, -0.25)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, -0.25)) +
scale_y_sqrt() +
xlab(label = "")
rating3Low <-
setLow %>%
filter(between(NPVLossPCT, -1, -0.25)) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 1/10) +
scale_x_continuous(limits = c(-1, -0.25)) +
scale_y_sqrt() +
xlab("")
gridExtra::grid.arrange(ratingGood, rating2Good, rating3Good,
ratingMid, rating2Mid, rating3Mid,
ratingLow, rating2Low, rating3Low,
ncol = 3)
```
At the outset, column by column (where NPVs are on the same scale), the NPV distribution show several modes on the same location. The modes are made more apparent by zooming on where the modes are present: the leftmost column basically shows the entire range of the NPVs (as portion of the loan). The middle graph zooms on the -20% / 50% range. The rightmost column zooms on the -100% / -25% section. Looking at the left hand scale, we can see that the lower NPVs overall gain in importance as the loan rating worsen.
Zooming without scaling the y-axis and grouping all the ratings available for investment on a single plot gives more details.
+ Figure \@ref(fig:04-NPV-top) shows a mode with a maximum around 1.25 / 1.5 being loans seemingly repaid in full (the mode is above 100% given the repayment of principal _and_ interest);
```{r 04-NPV-top,fig.cap="NPV \\% higher than 120\\% \n(no y-axis scaling)"}
NPVset <-
loansWorkingSet %>%
filter(grade %in% c("A", "B", "C", "D")) %>%
select(grade, NPVLossPCT) %>%
filter(!is.na(NPVLossPCT))
NPVset %>%
filter(NPVLossPCT >= 1.2) %>%
#sample_n(100000) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 3)
```
+ Figure \@ref(fig:04-NPV-mid-high) and figure \@ref(fig:04-NPV-zero) show a second and third mode around 41% and -1%;
```{r 04-NPV-mid-high,fig.cap="NPV \\% around 41\\% \n(no y-axis scaling)"}
NPVset %>%
filter(between(NPVLossPCT, 0.35, 0.50)) %>%
#sample_n(100000) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density(adjust = 3)
```
```{r 04-NPV-zero,fig.cap="NPV \\% around -1\\% \n(no y-axis scaling)"}
NPVset %>%
filter(between(NPVLossPCT, -0.10, 0.01)) %>%
#sample_n(100000) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density()
```
+ Finally, figure \@ref(fig:04-NPV-total-loss) one last very diffuse mode around -100%.
```{r 04-NPV-total-loss,fig.cap="NPV \\% for close to total loss \\% \n(no y-axis scaling)"}
NPVset %>%
filter(between(NPVLossPCT, -1.00, -0.60)) %>%
#sample_n(100000) %>%
ggplot(aes(NPVLossPCT, col = grade)) +
geom_density()
```
The overall trend is what we should expect. What is surprising is the existence of (1) very clearly defined modes which (2) are common to all types of borrowers. They roughly look log-normal, apart from the mode around 41% which look Gaussian.
## Loan decision
As indicated in the introduction, our focus is on loans that have gone through their entire life cycle to consider their respective pricing, risk and profitability. To that effect, we will remove all loans which are still current (either performing or not), and we will only retain loans which currently available (rated `A1` to `D5`). From here on, everything will be based on this reduced dataset.
This reduced dataset contains `r nrow(loans)` samples. Most of the loans (ca.80%) have been repaid in full (in other words __1 in 5 loans defaulted__). See Table \@ref(tab:04-matured-loans).
```{r 04-matured-loans}
# Share of portfolio currently 'live'
# In the working set, only keep A -> D loans
loansWorkingSet <-
loansWorkingSet %>%
filter(grade %in% c("A", "B", "C", "D"))
loansWorkingSet %>%
group_by(loan_status) %>%
summarise(Number = n()) %>%
mutate(Share = round(100 * Number / n(), 3)) %>%
arrange(desc(Number)) %>%
kable(
"latex",
caption = "Matured loans per status",
booktabs = T,
# longtable = T,
col.names = c("Loan status", "Count", "Proportion (%)")
) %>%
kable_styling(latex_options = c("repeat_header")) %>%
column_spec(1, width = "6cm") %>%
column_spec(2, width = "4cm") %>%
column_spec(2, width = "4cm")
```
When grouped by grade (Figure \@ref(fig:04-funded-by-subgrade)), we see a clear correlation between grade and default: the lower the grade the higher the portion defaults (note the limited scale with a minimum at about 50%). In addition, in the reduced set most of the business is written in the B- or C-rating range.
```{r 04-funded-by-subgrade,fig.cap="Funding and Write-offs by Sub-grades"}
loanSizeSummary <-
loansWorkingSet %>%
group_by(sub_grade) %>%
summarise(
Funded = sum(funded_amnt) / 1000000,
Repaid = sum(total_rec_prncp) / 1000000
) %>%
mutate(Proportion = 100 * Repaid / Funded)
loanBySubgrade1 <-
loanSizeSummary %>%
ggplot(aes(x = sub_grade, Funded)) +
geom_point(col = "blue") +
xlab("Sub grade") +
ylab("Funded amount ($mln)")
loanBySubgrade2 <-
loanSizeSummary %>%
ggplot(aes(x = sub_grade, y = Proportion)) +
geom_point(col = "blue") +
xlab("Sub grade") +
ylab("Share repaid (%)")
grid.arrange(loanBySubgrade1, loanBySubgrade2, ncol = 1)
```