Lapras is designed to make model development easy and convenient. It provides the following functions, each as a one-key (single-call) operation: exploratory data analysis, feature selection, feature binning, data visualization, scorecard modeling (a logistic regression model with excellent interpretability), and performance measurement.
Let's get started.
1. Exploratory Data Analysis: lapras.detect(), lapras.quality(), lapras.IV(), lapras.VIF(), lapras.PSI()
2. Feature Selection: lapras.select(), lapras.stepwise()
3. Binning: lapras.Combiner(), lapras.WOETransformer(), lapras.bin_stats(), lapras.bin_plot()
4. Modeling: lapras.ScoreCard()
5. Performance Measure: lapras.perform(), lapras.LIFT(), lapras.score_plot(), lapras.KS_bucket(), lapras.PPSI(), lapras.KS(), lapras.AUC()
6. One-Key Auto Modeling: lapras.auto_model(), which runs all the steps above automatically
via pip

```bash
pip install lapras --upgrade -i https://pypi.org/simple
```

via source code

```bash
python setup.py install
```
install_requires = [ 'numpy >= 1.18.4', 'pandas >= 0.25.1', 'scipy >= 1.3.2', 'scikit-learn == 0.22.2', 'seaborn >= 0.10.1', 'statsmodels >= 0.13.1', 'tensorflow >= 2.2.0', 'hyperopt >= 0.2.7', 'pickle >= 4.0', ]
```python
import lapras

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib as mpl
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 100
import math
%matplotlib inline
```
```python
df = pd.read_csv('data/demo.csv', encoding="utf-8")
```
```python
to_drop = ['id']  # features to exclude from modeling, e.g. id
target = 'bad'  # name of the target (Y label) column
# Split the data into training and test sets (strongly recommended)
train_df, test_df, _, _ = train_test_split(df, df[[target]], test_size=0.3, random_state=42)
```
```python
lapras.detect(train_df).sort_values("missing")
```
| | type | size | missing | unique | mean_or_top1 | std_or_top2 | min_or_top3 | 1%_or_top4 | 10%_or_top5 | 50%_or_bottom5 | 75%_or_bottom4 | 90%_or_bottom3 | 99%_or_bottom2 | max_or_bottom1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | int64 | 5502 | 0.0000 | 5502 | 3947.266630 | 2252.395671 | 2.0 | 87.03 | 820.1 | 3931.5 | 5889.25 | 7077.8 | 7782.99 | 7861.0 |
bad | int64 | 5502 | 0.0000 | 2 | 0.073246 | 0.260564 | 0.0 | 0.00 | 0.0 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
score | int64 | 5502 | 0.0000 | 265 | 295.280625 | 66.243181 | 0.0 | 0.00 | 223.0 | 303.0 | 336.00 | 366.0 | 416.00 | 461.0 |
age | float64 | 5502 | 0.0002 | 34 | 27.659880 | 4.770299 | 19.0 | 21.00 | 23.0 | 27.0 | 30.00 | 34.0 | 43.00 | 53.0 |
wealth | float64 | 5502 | 0.0244 | 18 | 4.529806 | 1.823149 | 1.0 | 1.00 | 3.0 | 4.0 | 5.00 | 7.0 | 10.00 | 22.0 |
education | float64 | 5502 | 0.1427 | 5 | 3.319483 | 1.005660 | 1.0 | 1.00 | 2.0 | 4.0 | 4.00 | 4.0 | 5.00 | 5.0 |
period | float64 | 5502 | 0.1714 | 5 | 7.246326 | 1.982060 | 4.0 | 4.00 | 6.0 | 6.0 | 10.00 | 10.0 | 10.00 | 14.0 |
max_unpay_day | float64 | 5502 | 0.9253 | 11 | 185.476886 | 22.339647 | 28.0 | 86.00 | 171.0 | 188.0 | 201.00 | 208.0 | 208.00 | 208.0 |
```python
lapras.quality(train_df.drop(to_drop, axis=1), target=target)
```
| | iv | unique |
---|---|---|
score | 0.758342 | 265.0 |
age | 0.504588 | 35.0 |
wealth | 0.275775 | 19.0 |
education | 0.230553 | 6.0 |
max_unpay_day | 0.170061 | 12.0 |
period | 0.073716 | 6.0 |
```python
cols = list(lapras.quality(train_df,target = target).reset_index()['index'])
for col in cols:
    if col not in [target]:
        print("%s: %.4f" % (col, lapras.PSI(train_df[col], test_df[col])))
```

```
score: 0.1500
age: 0.0147
wealth: 0.0070
education: 0.0010
max_unpay_day: 0.0042
id: 0.0000
period: 0.0030
```

```python
lapras.VIF(train_df.drop(['id','bad'],axis=1))
```

```
wealth 1.124927
max_unpay_day 2.205619
score 18.266471
age 17.724547
period 1.193605
education 1.090158
dtype: float64
```

```python
lapras.IV(train_df['age'],train_df[target])
```

```
0.5045879202656338
```

```python
train_selected, dropped = lapras.select(train_df.drop(to_drop,axis=1),target = target, empty = 0.95, \
iv = 0.05, corr = 0.9, vif = False, return_drop=True, exclude=[])
print(dropped)
print(train_selected.shape)
train_selected
```

```
{'empty': array([], dtype=float64), 'iv': array([], dtype=object), 'corr': array([], dtype=object)}
(5502, 7)
```
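The PSI values printed earlier in this section compare how a feature is distributed in the training set versus the test set. As a rough illustration of the commonly used formula, PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), here is a minimal pure-Python sketch. The helper name `psi_from_counts`, the `eps` floor, and the bin counts are hypothetical and not part of lapras; how `lapras.PSI` bins its inputs internally is not shown in this document.

```python
import math


def psi_from_counts(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index from per-bin counts (hypothetical helper)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each share at eps so an empty bin does not produce log(0).
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi


# Made-up bin counts, for illustration only.
train_bins = [100, 300, 400, 200]
test_bins = [50, 140, 210, 100]
print(round(psi_from_counts(train_bins, train_bins), 4))  # identical distributions
print(round(psi_from_counts(train_bins, test_bins), 4))   # slightly shifted distribution
```

A widely quoted rule of thumb treats PSI below 0.1 as stable and 0.1 to 0.25 as a moderate shift; by that convention the 0.1500 reported for score would deserve a closer look.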
| | bad | wealth | max_unpay_day | score | age | period | education |
---|---|---|---|---|---|---|---|
4168 | 0 | 4.0 | NaN | 288 | 23.0 | 6.0 | 4.0 |
605 | 0 | 4.0 | NaN | 216 | 32.0 | 6.0 | 4.0 |
3018 | 0 | 5.0 | NaN | 250 | 23.0 | 6.0 | 2.0 |
4586 | 0 | 7.0 | 171.0 | 413 | 31.0 | NaN | 2.0 |
1468 | 0 | 5.0 | NaN | 204 | 29.0 | 6.0 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... |
5226 | 0 | 4.0 | 171.0 | 346 | 23.0 | NaN | 3.0 |
5390 | 0 | 5.0 | NaN | 207 | 32.0 | NaN | 3.0 |
860 | 0 | 6.0 | NaN | 356 | 42.0 | 4.0 | 3.0 |
7603 | 0 | 3.0 | NaN | 323 | 34.0 | NaN | 3.0 |
7270 | 0 | 4.0 | NaN | 378 | 24.0 | 10.0 | 4.0 |
5502 rows × 7 columns
```python
c = lapras.Combiner()
c.fit(train_selected, y=target, method='mono', min_samples=0.05, n_bins=8)  # empty_separate = False
c.export()
```

```
{'age': [23.0, 24.0, 25.0, 26.0, 28.0, 29.0, 37.0],
'education': [3.0, 4.0],
'max_unpay_day': [171.0],
'period': [6.0, 10.0],
'score': [237.0, 272.0, 288.0, 296.0, 330.0, 354.0, 384.0],
'wealth': [3.0, 4.0, 5.0, 7.0]}
```

```python
c.transform(train_selected, labels=True).iloc[0:10,:]
```
| | bad | wealth | max_unpay_day | score | age | period | education |
---|---|---|---|---|---|---|---|
4168 | 0 | 02.[4.0,5.0) | 00.[-inf,171.0) | 03.[288.0,296.0) | 01.[23.0,24.0) | 01.[6.0,10.0) | 02.[4.0,inf) |
605 | 0 | 02.[4.0,5.0) | 00.[-inf,171.0) | 00.[-inf,237.0) | 06.[29.0,37.0) | 01.[6.0,10.0) | 02.[4.0,inf) |
3018 | 0 | 03.[5.0,7.0) | 00.[-inf,171.0) | 01.[237.0,272.0) | 01.[23.0,24.0) | 01.[6.0,10.0) | 00.[-inf,3.0) |
4586 | 0 | 04.[7.0,inf) | 01.[171.0,inf) | 07.[384.0,inf) | 06.[29.0,37.0) | 00.[-inf,6.0) | 00.[-inf,3.0) |
1468 | 0 | 03.[5.0,7.0) | 00.[-inf,171.0) | 00.[-inf,237.0) | 06.[29.0,37.0) | 01.[6.0,10.0) | 00.[-inf,3.0) |
6251 | 0 | 03.[5.0,7.0) | 00.[-inf,171.0) | 01.[237.0,272.0) | 01.[23.0,24.0) | 02.[10.0,inf) | 00.[-inf,3.0) |
3686 | 0 | 00.[-inf,3.0) | 00.[-inf,171.0) | 00.[-inf,237.0) | 01.[23.0,24.0) | 01.[6.0,10.0) | 00.[-inf,3.0) |
3615 | 0 | 02.[4.0,5.0) | 00.[-inf,171.0) | 03.[288.0,296.0) | 06.[29.0,37.0) | 02.[10.0,inf) | 02.[4.0,inf) |
5338 | 0 | 00.[-inf,3.0) | 00.[-inf,171.0) | 04.[296.0,330.0) | 03.[25.0,26.0) | 02.[10.0,inf) | 00.[-inf,3.0) |
3985 | 0 | 03.[5.0,7.0) | 00.[-inf,171.0) | 01.[237.0,272.0) | 01.[23.0,24.0) | 01.[6.0,10.0) | 02.[4.0,inf) |
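The labels in the table above follow directly from the split points returned by `c.export()`: each value falls into a left-closed, right-open interval, prefixed with the bin index. A minimal sketch of that mapping, using only the standard library, might look like this. The helper `bin_label` is hypothetical and not part of lapras, and it does not handle missing values (the rows with a NaN `max_unpay_day` above are shown in bin 00, which this sketch does not reproduce).

```python
import bisect


def bin_label(value, splits):
    """Return a label such as '02.[4.0,5.0)' for a value, given sorted split points."""
    # bisect_right counts the split points <= value, which is the bin index
    # for left-closed, right-open intervals.
    idx = bisect.bisect_right(splits, value)
    edges = [float("-inf")] + list(splits) + [float("inf")]
    return "%02d.[%s,%s)" % (idx, edges[idx], edges[idx + 1])


wealth_splits = [3.0, 4.0, 5.0, 7.0]  # split points for 'wealth' from c.export()
print(bin_label(4.0, wealth_splits))  # a value equal to a split point opens the next bin
print(bin_label(2.0, wealth_splits))
print(bin_label(9.0, wealth_splits))
```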
```python
cols = list(lapras.quality(train_selected,target = target).reset_index()['index'])
for col in cols:
    if col != target:
        print(lapras.bin_stats(c.transform(train_selected[[col, target]], labels=True), col=col, target=target))
        lapras.bin_plot(c.transform(train_selected[[col, target]], labels=True), col=col, target=target)
```

```
score bad_count total_count bad_rate ratio woe \
0 00.[-inf,237.0) 136 805 0.168944 0.146310 0.944734
1 01.[237.0,272.0) 101 832 0.121394 0.151218 0.558570
2 02.[272.0,288.0) 46 533 0.086304 0.096874 0.178240
3 03.[288.0,296.0) 20 295 0.067797 0.053617 -0.083176
4 04.[296.0,330.0) 73 1385 0.052708 0.251727 -0.350985
5 05.[330.0,354.0) 18 812 0.022167 0.147583 -1.248849
6 06.[354.0,384.0) 8 561 0.014260 0.101963 -1.698053
7 07.[384.0,inf) 1 279 0.003584 0.050709 -3.089758
iv total_iv
0 0.194867 0.735116
1 0.059912 0.735116
2 0.003322 0.735116
3 0.000358 0.735116
4 0.026732 0.735116
5 0.138687 0.735116
6 0.150450 0.735116
7 0.160788 0.735116

```

```
age bad_count total_count bad_rate ratio woe \
0 00.[-inf,23.0) 90 497 0.181087 0.090331 1.028860
1 01.[23.0,24.0) 77 521 0.147793 0.094693 0.785844
2 02.[24.0,25.0) 57 602 0.094684 0.109415 0.280129
3 03.[25.0,26.0) 38 539 0.070501 0.097964 -0.041157
4 04.[26.0,28.0) 58 997 0.058175 0.181207 -0.246509
5 05.[28.0,29.0) 20 379 0.052770 0.068884 -0.349727
6 06.[29.0,37.0) 57 1657 0.034400 0.301163 -0.796844
7 07.[37.0,inf) 6 310 0.019355 0.056343 -1.387405
iv total_iv
0 0.147647 0.45579
1 0.081721 0.45579
2 0.009680 0.45579
3 0.000163 0.45579
4 0.009918 0.45579
5 0.007267 0.45579
6 0.137334 0.45579
7 0.062060 0.45579

```

```
wealth bad_count total_count bad_rate ratio woe \
0 00.[-inf,3.0) 106 593 0.178752 0.107779 1.013038
1 01.[3.0,4.0) 84 1067 0.078725 0.193929 0.078071
2 02.[4.0,5.0) 88 1475 0.059661 0.268084 -0.219698
3 03.[5.0,7.0) 99 1733 0.057126 0.314976 -0.265803
4 04.[7.0,inf) 26 634 0.041009 0.115231 -0.614215
iv total_iv
0 0.169702 0.236205
1 0.001222 0.236205
2 0.011787 0.236205
3 0.019881 0.236205
4 0.033612 0.236205
```

```
education bad_count total_count bad_rate ratio woe \
0 00.[-inf,3.0) 225 2123 0.105982 0.385860 0.405408
1 01.[3.0,4.0) 61 648 0.094136 0.117775 0.273712
2 02.[4.0,inf) 117 2731 0.042841 0.496365 -0.568600
iv total_iv
0 0.075439 0.211775
1 0.009920 0.211775
2 0.126415 0.211775

```

```
max_unpay_day bad_count total_count bad_rate ratio woe \
0 00.[-inf,171.0) 330 5098 0.064731 0.926572 -0.132726
1 01.[171.0,inf) 73 404 0.180693 0.073428 1.026204
iv total_iv
0 0.015426 0.134699
1 0.119272 0.134699

```

```
period bad_count total_count bad_rate ratio woe \
0 00.[-inf,6.0) 52 1158 0.044905 0.210469 -0.519398
1 01.[6.0,10.0) 218 2871 0.075932 0.521810 0.038912
2 02.[10.0,inf) 133 1473 0.090292 0.267721 0.227787
iv total_iv
0 0.045641 0.061758
1 0.000803 0.061758
2 0.015314 0.061758
```
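The `woe` and `iv` columns in the bin statistics above can be reproduced from `bad_count` and `total_count` alone. The printed numbers are consistent with WOE = ln(bad share / good share) and IV = (bad share - good share) * WOE per bin. The sketch below checks this for the two `max_unpay_day` bins; `woe_iv` is a hypothetical helper, not a lapras function, and it assumes every bin contains at least one good and one bad sample (otherwise the logarithm is undefined and some smoothing would be needed).

```python
import math


def woe_iv(bad_counts, total_counts):
    """Per-bin (WOE, IV) pairs from bad and total counts (hypothetical helper)."""
    good_counts = [t - b for b, t in zip(bad_counts, total_counts)]
    bad_total = sum(bad_counts)
    good_total = sum(good_counts)
    rows = []
    for bad, good in zip(bad_counts, good_counts):
        bad_share = bad / bad_total      # share of all bads that fall in this bin
        good_share = good / good_total   # share of all goods that fall in this bin
        woe = math.log(bad_share / good_share)
        iv = (bad_share - good_share) * woe
        rows.append((woe, iv))
    return rows


# max_unpay_day bins taken from the bin_stats output
rows = woe_iv(bad_counts=[330, 73], total_counts=[5098, 404])
for woe, iv in rows:
    print("woe=%.6f iv=%.6f" % (woe, iv))
print("total_iv=%.6f" % sum(iv for _, iv in rows))
```

With this sign convention a positive WOE marks a bin with a higher than average bad rate, which matches the table: the `[171.0,inf)` bin has a bad rate of about 18% and a WOE near 1.03.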
```python
transfer = lapras.WOETransformer()
transfer.fit(c.transform(train_selected), train_selected[target], exclude=[target])
train_woe = transfer.transform(c.transform(train_selected))
transfer.export()
```

```
{'age': {0: 1.0288596439961428,
1: 0.7858440185299318,
2: 0.2801286322797789,
3: -0.041156782250006324,
4: -0.24650930955337075,
5: -0.34972695582581514,
6: -0.7968444812848496,
7: -1.387405073069694},
'education': {0: 0.4054075821430197,
1: 0.27371220345368763,
2: -0.5685998002779383},
'max_unpay_day': {0: -0.13272639517618706, 1: 1.026204224879801},
'period': {0: -0.51939830439238,
1: 0.0389118677598222,
2: 0.22778739438526965},
'score': {0: 0.9447339847162963,
1: 0.5585702161999536,
2: 0.17824043251497793,
3: -0.08317566500410743,
4: -0.3509853692471706,
5: -1.2488485442424984,
6: -1.6980533007340262,
7: -3.089757954582164},
'wealth': {0: 1.01303813013795,
1: 0.0780708378046198,
2: -0.21969844672815222,
3: -0.2658032661768855,
4: -0.6142151848362123}}
```

```python
train_woe, dropped = lapras.select(train_woe,target = target, empty = 0.9, \
iv = 0.02, corr = 0.9, vif = False, return_drop=True, exclude=[])
print(dropped)
print(train_woe.shape)
train_woe.head(10)
```

```
{'empty': array([], dtype=float64), 'iv': array([], dtype=object), 'corr': array([], dtype=object)}
(5502, 7)
```
| | bad | wealth | max_unpay_day | score | age | period | education |
---|---|---|---|---|---|---|---|
4168 | 0 | -0.219698 | -0.132726 | -0.083176 | 0.785844 | 0.038912 | -0.568600 |
605 | 0 | -0.219698 | -0.132726 | 0.944734 | -0.796844 | 0.038912 | -0.568600 |
3018 | 0 | -0.265803 | -0.132726 | 0.558570 | 0.785844 | 0.038912 | 0.405408 |
4586 | 0 | -0.614215 | 1.026204 | -3.089758 | -0.796844 | -0.519398 | 0.405408 |
1468 | 0 | -0.265803 | -0.132726 | 0.944734 | -0.796844 | 0.038912 | 0.405408 |
6251 | 0 | -0.265803 | -0.132726 | 0.558570 | 0.785844 | 0.227787 | 0.405408 |
3686 | 0 | 1.013038 | -0.132726 | 0.944734 | 0.785844 | 0.038912 | 0.405408 |
3615 | 0 | -0.219698 | -0.132726 | -0.083176 | -0.796844 | 0.227787 | -0.568600 |
5338 | 0 | 1.013038 | -0.132726 | -0.350985 | -0.041157 | 0.227787 | 0.405408 |
3985 | 0 | -0.265803 | -0.132726 | 0.558570 | 0.785844 | 0.038912 | -0.568600 |
```python
final_data = lapras.stepwise(train_woe, target=target, estimator='ols', direction='both', criterion='aic', exclude=[])
final_data
```
| | bad | wealth | max_unpay_day | score | age |
---|---|---|---|---|---|
4168 | 0 | -0.219698 | -0.132726 | -0.083176 | 0.785844 |
605 | 0 | -0.219698 | -0.132726 | 0.944734 | -0.796844 |
3018 | 0 | -0.265803 | -0.132726 | 0.558570 | 0.785844 |
4586 | 0 | -0.614215 | 1.026204 | -3.089758 | -0.796844 |
1468 | 0 | -0.265803 | -0.132726 | 0.944734 | -0.796844 |
... | ... | ... | ... | ... | ... |
5226 | 0 | -0.219698 | 1.026204 | -1.248849 | 0.785844 |
5390 | 0 | -0.265803 | -0.132726 | 0.944734 | -0.796844 |
860 | 0 | -0.265803 | -0.132726 | -1.698053 | -1.387405 |
7603 | 0 | 0.078071 | -0.132726 | -0.350985 | -0.796844 |
7270 | 0 | -0.219698 | -0.132726 | -1.698053 | 0.280129 |
5502 rows × 5 columns
```python
card = lapras.ScoreCard(
    combiner=c,
    transfer=transfer
)
col = list(final_data.drop([target], axis=1).columns)
card.fit(final_data[col], final_data[target])
```

```
ScoreCard(base_odds=0.016666666666666666, base_score=600, card=None,
combiner=
```

```python
final_result = final_data[[target]].copy()
score = card.predict(final_data[col])
prob = card.predict_prob(final_data[col])
final_result['score'] = score
final_result['prob'] = prob
print("card.intercept_:%s" % (card.intercept_))
print("card.coef_:%s" % (card.coef_))
card.get_params()['combiner']
card.get_params()['transfer']
card.export()
```

```
card.intercept_:-2.5207582925622476
card.coef_:[0.32080944 0.3452988 0.68294643 0.66842902]
{'age': {'[-inf,23.0)': -39.69,
'[23.0,24.0)': -30.31,
'[24.0,25.0)': -10.81,
'[25.0,26.0)': 1.59,
'[26.0,28.0)': 9.51,
'[28.0,29.0)': 13.49,
'[29.0,37.0)': 30.74,
'[37.0,inf)': 53.52},
'intercept': {'[-inf,inf)': 509.19},
'max_unpay_day': {'[-inf,171.0)': 2.64, '[171.0,inf)': -20.45},
'score': {'[-inf,237.0)': -37.23,
'[237.0,272.0)': -22.01,
'[272.0,288.0)': -7.02,
'[288.0,296.0)': 3.28,
'[296.0,330.0)': 13.83,
'[330.0,354.0)': 49.22,
'[354.0,384.0)': 66.92,
'[384.0,inf)': 121.77},
'wealth': {'[-inf,3.0)': -18.75,
'[3.0,4.0)': -1.45,
'[4.0,5.0)': 4.07,
'[5.0,7.0)': 4.92,
'[7.0,inf)': 11.37}}
```

```python
lapras.perform(prob,final_result[target])
```

```
KS: 0.4160
AUC: 0.7602
```
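The points in the `card.export()` output above are consistent with a standard scorecard scaling: factor = pdo / ln(rate), offset = base_score + factor * ln(base_odds), intercept points = offset - factor * intercept, and bin points = -factor * coef * WOE. This relationship is inferred from the printed numbers, not taken from the lapras source, so treat the sketch below as an assumption. It also assumes `pdo=40` and `rate=2`, the values passed to `lapras.auto_model()` elsewhere in this document; the helper names are hypothetical.

```python
import math


def scale_params(pdo=40, rate=2, base_odds=1 / 60, base_score=600):
    """Return (factor, offset) for the assumed score scaling."""
    factor = pdo / math.log(rate)
    offset = base_score + factor * math.log(base_odds)
    return factor, offset


def bin_points(coef, woe, factor):
    """Points contributed by one bin of one feature."""
    return -factor * coef * woe


factor, offset = scale_params()
intercept = -2.5207582925622476         # card.intercept_ as printed
age_coef = 0.66842902                   # last entry of card.coef_ (age)
age_woe_first_bin = 1.0288596439961428  # WOE of the age bin [-inf,23.0)

intercept_points = offset - factor * intercept
age_points = bin_points(age_coef, age_woe_first_bin, factor)
print(round(intercept_points, 2))  # 509.19, the 'intercept' entry in card.export()
print(round(age_points, 2))        # -39.69, the '[-inf,23.0)' entry for age
```

A sample's total score under this scheme is the intercept points plus the points of the bin it falls into for each selected feature.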
```python
lapras.score_plot(final_result,score='score', target=target)
```

```
bad: [42, 78, 70, 104, 61, 28, 18, 1, 1, 0]
good: [129, 249, 494, 795, 1075, 972, 825, 282, 164, 114]
all: [171, 327, 564, 899, 1136, 1000, 843, 283, 165, 114]
all_rate: ['3.11%', '5.94%', '10.25%', '16.34%', '20.65%', '18.18%', '15.32%', '5.14%', '3.00%', '2.07%']
bad_rate: ['24.56%', '23.85%', '12.41%', '11.57%', '5.37%', '2.80%', '2.14%', '0.35%', '0.61%', '0.00%']
```
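The bucket counts printed by `lapras.score_plot` are enough to approximate the KS statistic by hand: accumulate the share of bad and good samples bucket by bucket and take the largest gap between the two cumulative curves. The sketch below does this with the counts shown above. `ks_from_buckets` is a hypothetical helper, and the buckets are assumed to be ordered by score. Because it uses only ten buckets, the result is coarser than the 0.4160 reported by `lapras.perform`.

```python
def ks_from_buckets(bad, good):
    """Approximate KS as the largest gap between cumulative bad and good shares."""
    bad_total, good_total = sum(bad), sum(good)
    cum_bad = cum_good = 0
    best = 0.0
    for b, g in zip(bad, good):
        cum_bad += b
        cum_good += g
        gap = abs(cum_bad / bad_total - cum_good / good_total)
        best = max(best, gap)
    return best


# Per-bucket counts copied from the score_plot output
bad = [42, 78, 70, 104, 61, 28, 18, 1, 1, 0]
good = [129, 249, 494, 795, 1075, 972, 825, 282, 164, 114]
ks = ks_from_buckets(bad, good)
print(round(ks, 4))  # 0.4026
```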
```python
lapras.LIFT(prob, final_data[target])
```
| | recall | precision | improve |
---|---|---|---|
0 | 0.1 | 0.240000 | 3.202779 |
1 | 0.2 | 0.261290 | 3.486897 |
2 | 0.3 | 0.240964 | 3.215642 |
3 | 0.4 | 0.189535 | 2.529327 |
4 | 0.5 | 0.179170 | 2.391013 |
5 | 0.6 | 0.174352 | 2.326707 |
6 | 0.7 | 0.161622 | 2.156831 |
7 | 0.8 | 0.126972 | 1.694425 |
8 | 0.9 | 0.113936 | 1.520466 |
9 | 1.0 | 0.074935 | 1.000000 |
```python
auto_card = lapras.auto_model(df=train_df,target=target,to_drop=to_drop,bins_show=False,iv_rank=False,perform_show=False,
coef_negative = False, empty = 0.95, iv = 0.02, corr = 0.9, vif = False, method = 'mono',
n_bins=8, min_samples=0.05, pdo=40, rate=2, base_odds=1 / 60, base_score=600)
```

```
——data filtering——
original feature:6 filtered features:6
——feature binning——
——WOE value transformation——
——feature filtering once more——
original feature:6 filtered features:6
——scorecard modeling——
intercept: -2.520670026708529
coef: [0.66928671 0.59743968 0.31723278 0.22972838 0.28750881 0.26435224]
——model performance metrics——
KS: 0.4208
AUC: 0.7626
   recall  precision   improve
0     0.1   0.238095  3.188586
1     0.2   0.254777  3.411990
2     0.3   0.239521  3.207679
3     0.4   0.193742  2.594611
4     0.5   0.182805  2.448141
5     0.6   0.171510  2.296866
6     0.7   0.160501  2.149437
7     0.8   0.130259  1.744435
8     0.9   0.110603  1.481206
9     1.0   0.074671  1.000000
Automatic modeling finished, time costing: 0 second
```