The logiBin package enables fast binning of multiple variables using parallel processing. It generates a summary of all the binned variables, which provides the information value (IV), entropy, an indicator of whether the variable follows a monotonic trend, etc. It supports rebinning of variables to force a monotonic trend as well as manual binning based on pre-specified cuts.
The getBins function uses parallel processing to compute bins for continuous and categorical variables. The splits are computed using the partykit package, which uses conditional inference trees. Refer to the package documentation for more details. A separate bin is created for NA values. This bin can be combined with another bin using the naCombine function. Categorical variables with a maximum of 10 distinct values are supported.
Eg: b1 <- getBins(loanData, "bad_flag", c("age", "LTV", "score", "balance"), minCr = 0.8, nCores = 2)
This returns a list containing 3 elements. One is a data frame
called err, which contains details of all the variables that could not be
split and the reason for each.
| | var | error |
|---|---|---|
| 9 | score | No significant splits |
It can be seen that no significant splits were found for the variable
‘score’. The other variables specified were split into bins. The summary
of these splits is in another element of the list, a data frame called
varSummary. For each variable it contains the IV, the entropy, the
p-value from the ctree function in the partykit package, a flag which
indicates whether the bad rate increases or decreases with the variable
value, a flag which indicates whether a monotonic trend is present, the
ratio of bins which flip (i.e. do not follow a monotonic trend), the
number of bins and a flag which indicates whether the variable includes
pure nodes (nodes which do not have any defaults).
| | var | iv | pVal | stat | ent | trend | monTrend | flipRatio | numBins | purNode | varType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | age | 0.8399 | 0.0356891 | 4.411899 | 0.7367 | I | N | 0.5 | 3 | N | integer |
| 8 | LTV | 0.5241 | 0.0067388 | 7.341301 | 0.7567 | D | Y | 0.0 | 3 | Y | numeric |
| 12 | balance | 0.3536 | 0.0360245 | 4.395943 | 0.7900 | D | Y | 0.0 | 2 | N | integer |
The variables LTV & balance have a monotonic decreasing trend, which
indicates that the bad rate decreases as the value of the variable
increases. The variable age has an increasing trend. However, it is not
monotonic, and there is a flip in 50% of the bins. To check this, look
at the second element of the list, a data frame called bin, which
contains details of all the bins of the variables.
| | var | bin | count | bads | goods | propn | bad_rate | iv | ent |
|---|---|---|---|---|---|---|---|---|---|
| 1 | age | age <= 34 | 44 | 19 | 25 | 44 | 43.18 | 0.2602 | 0.9865 |
| 2 | age | age > 34 & age <= 45 | 32 | 2 | 30 | 32 | 6.25 | 0.5772 | 0.3373 |
| 3 | age | age > 45 | 24 | 6 | 18 | 24 | 25.00 | 0.0025 | 0.8113 |
| 4 | age | Total | 100 | 27 | 73 | 1 | 27.00 | 0.8399 | 0.7367 |
| 5 | LTV | LTV <= 0.77 | 24 | 13 | 11 | 24 | 54.17 | 0.3843 | 0.9950 |
| 6 | LTV | LTV > 0.77 | 74 | 14 | 60 | 74 | 18.92 | 0.1398 | 0.6998 |
| 7 | LTV | is.na(LTV) | 2 | 0 | 2 | 2 | 0.00 | Inf | 0.0000 |
| 8 | LTV | Total | 100 | 27 | 73 | 1 | 27.00 | 0.5241 | 0.7567 |
| 10 | balance | balance <= 6359 | 19 | 10 | 9 | 19 | 52.63 | 0.2718 | 0.9980 |
| 11 | balance | balance > 6359 | 81 | 17 | 64 | 81 | 20.99 | 0.0818 | 0.7412 |
| 12 | balance | Total | 100 | 27 | 73 | 1 | 27.00 | 0.3536 | 0.7900 |
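The per-bin figures for age in this table can be reproduced from the bad and good counts alone. The sketch below is plain base R, not logiBin's own code. The formulas (the usual weight-of-evidence form of information value, and binary entropy in bits) are an assumption that happens to match the printed values, and the names `bins`, `good_dist` and `bad_dist` are illustrative only.

```r
# Counts for the three age bins, copied from the bin table.
bins <- data.frame(
  bin   = c("age <= 34", "age > 34 & age <= 45", "age > 45"),
  bads  = c(19, 2, 6),
  goods = c(25, 30, 18)
)
bins$count <- bins$bads + bins$goods

# Bad rate as a percentage of each bin.
bins$bad_rate <- 100 * bins$bads / bins$count

# Share of all goods and of all bads that falls in each bin.
good_dist <- bins$goods / sum(bins$goods)
bad_dist  <- bins$bads / sum(bins$bads)

# Information value contribution of each bin (assumed formula).
bins$iv <- (good_dist - bad_dist) * log(good_dist / bad_dist)

# Binary entropy of each bin in bits (assumed formula).
p <- bins$bads / bins$count
bins$ent <- -(p * log2(p) + (1 - p) * log2(1 - p))

# Variable-level totals: IV is summed, entropy is weighted by bin size.
total_iv  <- sum(bins$iv)
total_ent <- sum(bins$count / sum(bins$count) * bins$ent)

print(round(bins[, c("bad_rate", "iv", "ent")], 4))
print(round(c(iv = total_iv, ent = total_ent), 4))
```

With these formulas, a bin that has no bads (such as is.na(LTV)) gives an infinite IV contribution, which is consistent with the Inf shown for that row. The entropy line would need a guard for such pure bins, because 0 * log2(0) evaluates to NaN in R.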
Looking at the bins of the variable age, it can be seen that the first bin has a high bad rate and contains a large proportion of the population. The bad rate of the middle bin is lower than that of the last bin. However, if the second & third bins are combined, a monotonic decreasing trend can be forced. The function forceDecrTrend can be used for this.
Eg: b1 <- forceDecrTrend(b1, "age")
We can see that once a decreasing trend is forced, the variable age is now monotonically decreasing.
| | var | bin | count | bads | goods | propn | bad_rate | iv | ent |
|---|---|---|---|---|---|---|---|---|---|
| 5 | LTV | LTV <= 0.77 | 24 | 13 | 11 | 24 | 54.17 | 0.3843 | 0.9950 |
| 6 | LTV | LTV > 0.77 | 74 | 14 | 60 | 74 | 18.92 | 0.1398 | 0.6998 |
| 7 | LTV | is.na(LTV) | 2 | 0 | 2 | 2 | 0.00 | Inf | 0.0000 |
| 8 | LTV | Total | 100 | 27 | 73 | 1 | 27.00 | 0.5241 | 0.7567 |
| 10 | balance | balance <= 6359 | 19 | 10 | 9 | 19 | 52.63 | 0.2718 | 0.9980 |
| 11 | balance | balance > 6359 | 81 | 17 | 64 | 81 | 20.99 | 0.0818 | 0.7412 |
| 12 | balance | Total | 100 | 27 | 73 | 1 | 27.00 | 0.3536 | 0.7900 |
| 1 | age | age <= 34 | 44 | 19 | 25 | 44 | 43.18 | 0.2602 | 0.9865 |
| 2 | age | age > 34 | 56 | 8 | 48 | 56 | 14.29 | 0.2880 | 0.5917 |
| 3 | age | Total | 100 | 27 | 73 | 1 | 27.00 | 0.5482 | 0.7654 |
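The effect of merging the second and third age bins can be checked by hand. The following base R sketch is not how forceDecrTrend is implemented (the source does not describe its internals). It only adds up the counts of the two bins and recomputes the bad rate and the per-bin IV with the standard weight-of-evidence formula, which is an assumption. All variable names are illustrative.

```r
# Bad and good counts for the three original age bins.
bads  <- c(19, 2, 6)
goods <- c(25, 30, 18)

# Merge the second and third bins ("age > 34 & age <= 45" and "age > 45").
merged_bads  <- c(bads[1],  sum(bads[2:3]))
merged_goods <- c(goods[1], sum(goods[2:3]))
merged_count <- merged_bads + merged_goods

# Recompute the bad rate (in percent) for the two remaining bins.
merged_bad_rate <- 100 * merged_bads / merged_count

# Recompute the per-bin IV (assumed weight-of-evidence formula).
good_dist <- merged_goods / sum(merged_goods)
bad_dist  <- merged_bads / sum(merged_bads)
merged_iv <- (good_dist - bad_dist) * log(good_dist / bad_dist)

# TRUE when every bin has a lower bad rate than the one before it.
is_decreasing <- all(diff(merged_bad_rate) < 0)

print(merged_count)
print(round(merged_bad_rate, 2))
print(round(merged_iv, 4))
print(is_decreasing)
```

The merged counts (44 and 56), bad rates and per-bin IVs agree with the age rows of the table, and the bad rate now falls from the first bin to the second.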
This function can also take multiple variables as input if a decreasing trend is to be forced on multiple variables.
Eg: forceDecrTrend(b1, c("age", "LTV"))
Similarly, the function forceIncrTrend can be used to force a
monotonically increasing trend if required. The function manualSplit can
be used to manually split a variable based on specified cuts. The
function naCombine can be used to combine the NA bin with either the bin
having the closest bad rate or the average bad rate if the count of
observations in the NA bin is low.
Once this is done, the splits created can be replicated on a test
data frame to check whether the same trend holds there.
Eg: b2 <- binTest(b1, testDf, "BAD_FLG", c("age", "LTV"))
If there are a lot of flips on the test data, the variable can be
discarded. Otherwise, increasing/decreasing trends can be forced on b2
to ensure that there are no flips. This can then be tested on the
original data.
Eg: b1 <- binTest(b2, loanData, "bad_flag", c("age", "LTV"))
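One way to put a number on "a lot of flips" is to count the steps between adjacent bins that move against the expected direction. The helper below, `flip_ratio`, is hypothetical and not part of logiBin. Its definition is an assumption, chosen because it gives 0.5 for the original three age bins and 0 for LTV, in line with the flipRatio values shown earlier.

```r
# Hypothetical helper (not part of logiBin): share of adjacent-bin steps
# that move against the expected direction.
# bad_rate: bad rates of the bins, ordered by increasing variable value.
# trend:    "I" if the bad rate is expected to increase, "D" if decrease.
flip_ratio <- function(bad_rate, trend = c("I", "D")) {
  trend <- match.arg(trend)
  steps <- diff(bad_rate)
  if (length(steps) == 0) return(0)  # a single bin cannot flip
  flips <- if (trend == "I") steps < 0 else steps > 0
  mean(flips)
}

# Bad rates taken from the original bin table (before any trend is forced).
age_rates <- c(43.18, 6.25, 25.00)
ltv_rates <- c(54.17, 18.92)  # NA bin left out

print(flip_ratio(age_rates, "I"))  # one of the two steps goes down
print(flip_ratio(ltv_rates, "D"))  # no step goes up
```

The same helper could be applied to the bad rates that binTest reports for the test data, with a cutoff for discarding a variable chosen by the analyst.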
Once the bins have been finalized, variables can be shortlisted based on
IV and linearity. The bins of these shortlisted variables can be created
in the data using the function createBins.
Eg: loanData1 <- createBins(b1, loanData, c("age", "LTV"))
The data frame loanData1 will have all the variables of data frame
loanData along with binned variables which will be created with the
prefix “b_” before the original name of the variable.
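To illustrate what a binned column looks like, the base R sketch below builds a `b_age` column with `cut()`, using the single cut point (34) that remains for age after the decreasing trend was forced. This is not createBins itself. The data frame `loan_sample` is made up, and the bin labels are an assumption, since the labels createBins writes are not shown here.

```r
# Hypothetical data; only the cut point 34 comes from the age bins.
loan_sample <- data.frame(age = c(22, 34, 35, 51))

# right = TRUE (the default) gives the intervals (-Inf, 34] and (34, Inf),
# which matches the "age <= 34" / "age > 34" bin definitions.
loan_sample$b_age <- cut(
  loan_sample$age,
  breaks = c(-Inf, 34, Inf),
  labels = c("age <= 34", "age > 34")
)

print(loan_sample)
```

The original age column is kept alongside the new b_age column, mirroring the layout described for loanData1.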