BorutaShap is a wrapper feature selection method that combines the Boruta feature selection algorithm with Shapley values. This combination has proven to outperform the original permutation importance method in both speed and the quality of the feature subset produced. Not only does the algorithm provide a better subset of features, it also simultaneously provides the most accurate and consistent global feature rankings, which can be used for model inference too. Unlike the original R package, which limits the user to a Random Forest model, BorutaShap allows the user to choose any tree-based learner as the base model in the feature selection process.
Despite BorutaShap's runtime improvements, the SHAP TreeExplainer scales linearly with the number of observations, making its use cumbersome for large datasets. To combat this, BorutaShap includes a sampling procedure that uses the smallest possible subsample of the available data at each iteration of the algorithm. It finds this sample by comparing the distributions produced by an isolation forest on the sample and on the full data using a Kolmogorov-Smirnov test. In experiments, this procedure reduced the run time by up to 80% while still producing a valid approximation of the entire data set. Even with these improvements the user may still want a faster solution, so BorutaShap also includes an option to use the mean decrease in Gini impurity. This importance measure is independent of the dataset's size, as it uses the tree's structure to compute a global feature ranking, making it much faster than SHAP on larger datasets. Although this metric returns somewhat comparable feature subsets, it is not a reliable measure of global feature importance despite its widespread use. Thus, I would recommend using the SHAP metric whenever possible.
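As a rough illustration of the sampling idea described above (not the package's actual implementation), the sketch below grows a random subsample until an isolation forest scores it as indistinguishable from the full data under a two-sample KS test; the function name find_representative_sample and its defaults are hypothetical.

```python
# Hypothetical sketch of the subsampling idea, not BorutaShap's actual code:
# grow a random subsample until its isolation-forest anomaly scores are
# statistically indistinguishable from those of the full data (KS test).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import IsolationForest

def find_representative_sample(X, start_frac=0.05, step=0.05, p_threshold=0.05):
    forest = IsolationForest(random_state=0).fit(X)
    full_scores = forest.score_samples(X)  # anomaly score per observation
    rng = np.random.default_rng(0)
    frac = start_frac
    while frac < 1.0:
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        # Failing to reject "same distribution" means the subsample is
        # treated as a valid approximation of the full data set.
        if ks_2samp(full_scores, full_scores[idx]).pvalue > p_threshold:
            return idx
        frac += step
    return np.arange(len(X))  # fall back to the full data set
```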
Start by creating new copies of all the features in the data set, naming them shadow + feature_name, then shuffle these newly added features to remove their correlations with the response variable.
Run a classifier on the extended data with the random shadow features included. Then rank the features using a feature importance metric; the original algorithm used permutation importance as its metric of choice.
Create a threshold using the maximum importance score from the shadow features. Then assign a hit to any feature that exceeds this threshold (these first steps are sketched in code after the final step below).
For every unassigned feature, perform a two-sided t-test of equality.
Attributes whose importance is significantly lower than the threshold are deemed 'unimportant' and are removed from the process. Attributes whose importance is significantly higher than the threshold are deemed 'important'.
Remove all shadow attributes and repeat the procedure until an importance has been assigned to each feature, or the algorithm has reached the previously set limit of runs.
If the algorithm has reached its set limit of runs and an importance has not been assigned to each feature, the user has two choices: either increase the number of runs, or use the tentative rough fix function, which compares the median importance values of the unassigned features against the maximum shadow feature to make the decision.
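The following is a rough sketch of the first three steps (shadow features, ranking, threshold and hits). It assumes a scikit-learn style tree model exposing feature_importances_; the helper name single_boruta_iteration is illustrative and not part of the package.

```python
# Illustrative sketch of one Boruta iteration (shadow features, threshold,
# hits); not the BorutaShap source code.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def single_boruta_iteration(X: pd.DataFrame, y, model=None):
    model = model or RandomForestClassifier()
    # Step 1: copy every feature and shuffle the copies to break any
    # relationship with the response variable.
    shadows = X.apply(np.random.permutation)
    shadows.columns = ['shadow_' + name for name in X.columns]
    extended = pd.concat([X, shadows], axis=1)
    # Step 2: fit on the extended data and rank every real and shadow feature.
    model.fit(extended, y)
    importances = pd.Series(model.feature_importances_, index=extended.columns)
    # Step 3: the best shadow feature sets the threshold; real features
    # above it score a "hit" for this iteration.
    threshold = importances[shadows.columns].max()
    return importances[X.columns] > threshold  # boolean hit per feature
```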
Use the package manager pip to install BorutaShap.
```bash
pip install BorutaShap
```
For more use cases, such as alternative models, sampling, or changing the importance metric, please view the notebooks here.
```python
from BorutaShap import BorutaShap, load_data
X, y = load_data(data_type='regression')
X.head()
```
```python
Feature_Selector = BorutaShap(importance_measure='shap', classification=False)

'''
sample: Boolean
    if true then a row-wise sample of the data will be used to calculate the feature importance values
sample_fraction: float
    the fraction of the original data used to calculate the feature importance values; only used if sample=True
train_or_test: string
    decides whether the feature importance should be calculated on out-of-sample data; see the discussion here: https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html#introduction-to-test-vs.training-data
normalize: boolean
    if true the importance values will be normalized using the z-score formula
verbose: Boolean
    a flag indicator to print out all the rejected or accepted features
'''

Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True,
                     verbose=True)
```
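For large datasets, the sample flag documented above can be combined with the faster Gini importance described earlier. A sketch follows; the exact string 'gini' for importance_measure is an assumption based on the description above rather than a verified API value.

```python
# Assumed usage for large datasets, based on the options described above.
# The 'gini' value for importance_measure is an assumption, not verified
# against the package source.
Feature_Selector = BorutaShap(importance_measure='gini', classification=False)
Feature_Selector.fit(X=X, y=y, n_trials=100, sample=True,
                     train_or_test='test', normalize=True,
                     verbose=True)
```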
```python
Feature_Selector.plot(which_features='all')
```
```python
subset = Feature_Selector.Subset()
```
```python
from BorutaShap import BorutaShap, load_data
from xgboost import XGBClassifier

X, y = load_data(data_type='classification')
X.head()
```
```python
model = XGBClassifier()
Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=True)
Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
train_or_test = 'test', normalize=True,
verbose=True)
```
```python
Feature_Selector.plot(which_features='all')
```
```python
subset = Feature_Selector.Subset()
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
If you wish to cite this work, please click on the Zenodo badge at the top of this README file.

This project is licensed under the MIT license.
Scikit-learn >1.2 does not support the use of the Boston dataset from sklearn.datasets. This problem was raised back in December 2022 (https://github.com/Ekeany/Boruta-Shap/issues/111), and while one could work around it by importing the Boston dataset from another source, I am not an official maintainer. I merely suggest replacing the sklearn.datasets toy dataset load_boston() with load_diabetes().
This PR replaces the use of the load_boston() dataset in BorutaShap with load_diabetes() to avoid compatibility issues with scikit-learn >1.2.
- Issue raised in December 2022: https://github.com/Ekeany/Boruta-Shap/issues/111
- Documentation for load_diabetes(): https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes
I have tested the loading functionality, replacing load_boston() with load_diabetes(), on my local machine, and the code runs without errors.
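For reference, the substitution at the call site amounts to a one-line change, since load_diabetes() follows the same return_X_y convention:

```python
# Before (removed in scikit-learn >= 1.2):
# from sklearn.datasets import load_boston
# X, y = load_boston(return_X_y=True)

# After:
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
```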
Add get_params and set_params for sklearn compatibility. This enables the following:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from BorutaShap import BorutaShap
from sklearn.model_selection import cross_val_predict

pipe = Pipeline(steps=[("selector", BorutaShap()), ("Regressor", RandomForestRegressor())])
pipe.fit(X, y)
```
cross_val_predict is not supported yet:

```
~\Anaconda3\envs\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891             res = transformer.fit_transform(X, y, **fit_params)
    892         else:
--> 893             res = transformer.fit(X, y, **fit_params).transform(X)
    894
    895     if weight is None:

AttributeError: 'NoneType' object has no attribute 'transform'
```
The reason for this is that BorutaSHAP does not implement get_params and set_params, so clone(BorutaSHAP()) does not work.
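For context, the standard way to obtain get_params and set_params (and therefore clone() support) in scikit-learn is to subclass BaseEstimator. The sketch below shows that general pattern with a hypothetical SelectorSketch class; it is not the actual BorutaShap fix.

```python
# General sklearn pattern for clone()-compatible transformers, shown for
# illustration only; SelectorSketch is hypothetical, not BorutaShap's code.
from sklearn.base import BaseEstimator, TransformerMixin

class SelectorSketch(BaseEstimator, TransformerMixin):
    def __init__(self, n_trials=100):
        # __init__ must only store its arguments unchanged so that
        # get_params()/set_params(), inherited from BaseEstimator, work.
        self.n_trials = n_trials

    def fit(self, X, y=None):
        self.selected_columns_ = list(X.columns)  # placeholder "selection"
        return self  # returning self is exactly what the traceback above needs

    def transform(self, X):
        return X[self.selected_columns_]
```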
load_boston in Boruta.py leads to an import error. This is due to scikit-learn version 1.2 and above.
Steps to reproduce the behavior:

```python
from BorutaShap import BorutaShap
```

Expected behavior: the package would import normally. Instead, the import fails:

```
ImportError                               Traceback (most recent call last)
1 frames
/usr/local/lib/python3.8/dist-packages/sklearn/datasets/__init__.py in __getattr__(name)
    154             """
    155         )
--> 156         raise ImportError(msg)
    157     try:
    158         return globals()[name]

ImportError: `load_boston` has been removed from scikit-learn since version 1.2.
```
The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give adequate demonstration of the validity of this assumption. The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:
```python
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
```
Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows:
```python
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
```
for the California housing dataset, and:
```python
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
```
for the Ames housing dataset.
[1] M Carlisle. "Racist data destruction?" https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[2] Harrison Jr, David, and Daniel L. Rubinfeld. "Hedonic housing prices and the demand for clean air." Journal of environmental economics and management 5.1 (1978): 81-102. https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
Bumps certifi from 2019.11.28 to 2022.12.7.
Commits:
- 9e9e840 2022.12.07
- b81bdb2 2022.09.24
- 939a28f 2022.09.14
- aca828a 2022.06.15.2
- de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...
- b8eb5e9 2022.06.15.1
- 47fb7ab Fix deprecation warning on Python 3.11 (#199)
- b0b48e0 fixes #198 -- update link in license
- 9d514b4 2022.06.15
- 4151e88 Add py.typed to MANIFEST.in to package in sdist (#196)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps pillow from 7.0.0 to 9.3.0.
Sourced from pillow's releases.
9.3.0
https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html
Changes
- Initialize libtiff buffer when saving #6699 [@radarhere]
- Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [@wiredfool]
- Inline fname2char to fix memory leak #6329 [@nulano]
- Fix memory leaks related to text features #6330 [@nulano]
- Use double quotes for version check on old CPython on Windows #6695 [@hugovk]
- GHA: replace deprecated set-output command with GITHUB_OUTPUT file #6697 [@nulano]
- Remove backup implementation of Round for Windows platforms #6693 [@cgohlke]
- Upload fribidi.dll to GitHub Actions #6532 [@nulano]
- Fixed set_variation_by_name offset #6445 [@radarhere]
- Windows build improvements #6562 [@nulano]
- Fix malloc in _imagingft.c:font_setvaraxes #6690 [@cgohlke]
- Only use ASCII characters in C source file #6691 [@cgohlke]
- Release Python GIL when converting images using matrix operations #6418 [@hmaarrfk]
- Added ExifTags enums #6630 [@radarhere]
- Do not modify previous frame when calculating delta in PNG #6683 [@radarhere]
- Added support for reading BMP images with RLE4 compression #6674 [@npjg]
- Decode JPEG compressed BLP1 data in original mode #6678 [@radarhere]
- pylint warnings #6659 [@marksmayo]
- Added GPS TIFF tag info #6661 [@radarhere]
- Added conversion between RGB/RGBA/RGBX and LAB #6647 [@radarhere]
- Do not attempt normalization if mode is already normal #6644 [@radarhere]
- Fixed seeking to an L frame in a GIF #6576 [@radarhere]
- Consider all frames when selecting mode for PNG save_all #6610 [@radarhere]
- Don't reassign crc on ChunkStream close #6627 [@radarhere]
- Raise a warning if NumPy failed to raise an error during conversion #6594 [@radarhere]
- Only read a maximum of 100 bytes at a time in IMT header #6623 [@radarhere]
- Show all frames in ImageShow #6611 [@radarhere]
- Allow FLI palette chunk to not be first #6626 [@radarhere]
- If first GIF frame has transparency for RGB_ALWAYS loading strategy, use RGBA mode #6592 [@radarhere]
- Round box position to integer when pasting embedded color #6517 [@radarhere]
- Removed EXIF prefix when saving WebP #6582 [@radarhere]
- Pad IM palette to 768 bytes when saving #6579 [@radarhere]
- Added DDS BC6H reading #6449 [@ShadelessFox]
- Added support for opening WhiteIsZero 16-bit integer TIFF images #6642 [@JayWiz]
- Raise an error when allocating translucent color to RGB palette #6654 [@jsbueno]
- Moved mode check outside of loops #6650 [@radarhere]
- Added reading of TIFF child images #6569 [@radarhere]
- Improved ImageOps palette handling #6596 [@PososikTeam]
- Defer parsing of palette into colors #6567 [@radarhere]
- Apply transparency to P images in ImageTk.PhotoImage #6559 [@radarhere]
- Use rounding in ImageOps contain() and pad() #6522 [@bibinhashley]
- Fixed GIF remapping to palette with duplicate entries #6548 [@radarhere]
- Allow remap_palette() to return an image with less than 256 palette entries #6543 [@radarhere]
- Corrected BMP and TGA palette size when saving #6500 [@radarhere]
... (truncated)
Sourced from pillow's changelog.
9.3.0 (2022-10-29)
- Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]
- Initialize libtiff buffer when saving #6699 [radarhere]
- Inline fname2char to fix memory leak #6329 [nulano]
- Fix memory leaks related to text features #6330 [nulano]
- Use double quotes for version check on old CPython on Windows #6695 [hugovk]
- Remove backup implementation of Round for Windows platforms #6693 [cgohlke]
- Fixed set_variation_by_name offset #6445 [radarhere]
- Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]
- Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]
- Added ExifTags enums #6630 [radarhere]
- Do not modify previous frame when calculating delta in PNG #6683 [radarhere]
- Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]
- Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]
- Added GPS TIFF tag info #6661 [radarhere]
- Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]
- Do not attempt normalization if mode is already normal #6644 [radarhere]
... (truncated)
Commits:
- d594f4c Update CHANGES.rst [ci skip]
- 909dc64 9.3.0 version bump
- 1a51ce7 Merge pull request #6699 from hugovk/security-libtiff_buffer
- 2444cdd Merge pull request #6700 from hugovk/security-samples_per_pixel-sec
- 744f455 Added release notes
- 0846bfa Add to release notes
- 799a6a0 Fix linting
- 00b25fd Hide UserWarning in logs
- 05b175e Tighter test case
- 13f2c5a Prevent DOS with large SAMPLESPERPIXEL in Tiff IFD

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.