A tree-based feature selection tool which combines the Boruta feature selection algorithm with Shapley values.

Ekeany, updated 🕥 2023-03-21 10:03:00

DOI PyPI version

Boruta-Shap

BorutaShap is a wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values. This combination has proven to outperform the original Permutation Importance method in both speed and the quality of the feature subset produced. Not only does this algorithm provide a better subset of features, it can also simultaneously provide the most accurate and consistent global feature rankings, which can be used for model inference too. Unlike the original R package, which limits the user to a Random Forest model, BorutaShap allows the user to choose any tree-based learner as the base model in the feature selection process.

Despite BorutaShap's runtime improvements, the SHAP TreeExplainer scales linearly with the number of observations, making its use cumbersome for large datasets. To combat this, BorutaShap includes a sampling procedure which uses the smallest possible subsample of the data available at each iteration of the algorithm. It finds this sample by comparing the distributions produced by an isolation forest of the sample and the data using a KS test. In experiments, this procedure reduced the run time by up to 80% while still creating a valid approximation of the entire data set. Even with these improvements the user might still want a faster solution, so BorutaShap includes an option to use the mean decrease in Gini impurity. This importance measure is independent of the size of the dataset, as it uses the tree's structure to compute a global feature ranking, making it much faster than SHAP on larger datasets. Although this metric returns somewhat comparable feature subsets, it is not a reliable measure of global feature importance in spite of its widespread use. Thus, I recommend using the SHAP metric whenever possible.
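The sampling idea can be sketched roughly as follows. This is a minimal illustration and not BorutaShap's actual implementation: the function names, the candidate fractions, and the `alpha` threshold are all assumptions. It scores the full data and each candidate sample with an isolation forest, then accepts the smallest sample whose score distribution a two-sample KS test cannot distinguish from that of the full data.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import IsolationForest


def anomaly_scores(data, seed=0):
    """Fit an isolation forest on `data` and return its anomaly scores."""
    forest = IsolationForest(n_estimators=50, random_state=seed)
    return forest.fit(data).score_samples(data)


def find_smallest_sample(data, fractions=(0.1, 0.2, 0.4, 0.8), alpha=0.05, seed=0):
    """Return (fraction, sample) for the first candidate fraction whose
    anomaly-score distribution passes the KS comparison with the full data.
    The fractions and alpha are illustrative choices, not library defaults."""
    rng = np.random.default_rng(seed)
    full_scores = anomaly_scores(data, seed)
    for fraction in fractions:
        size = max(1, int(len(data) * fraction))
        rows = rng.choice(len(data), size=size, replace=False)
        sample = data[rows]
        p_value = ks_2samp(full_scores, anomaly_scores(sample, seed)).pvalue
        if p_value > alpha:  # no detectable difference: accept this sample
            return fraction, sample
    return 1.0, data  # no candidate passed, so fall back to the full data


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))  # synthetic data for demonstration only
fraction, sample = find_smallest_sample(X)
print(fraction, sample.shape)
```

Trying fractions from smallest to largest means the first accepted sample is also the cheapest one to explain with SHAP.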

Algorithm

  1. Start by creating new copies of all the features in the data set and name them shadow + feature_name. Shuffle these newly added features to remove their correlations with the response variable.

  2. Run a classifier on the extended data with the random shadow features included. Then rank the features using a feature importance metric; the original algorithm used permutation importance as its metric of choice.

  3. Create a threshold using the maximum importance score from the shadow features. Then assign a hit to any feature that exceeded this threshold.

  4. For every unassigned feature, perform a two-sided T-test of equality.

  5. Attributes which have an importance significantly lower than the threshold are deemed 'unimportant' and are removed from the process. Attributes which have an importance significantly higher than the threshold are deemed 'important'.

  6. Remove all shadow attributes and repeat the procedure until an importance has been assigned for each feature, or the algorithm has reached the previously set limit of runs.
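Steps 1 to 3 can be sketched as below. This is a simplified illustration under stated assumptions, not the package's code: the function name `count_hits` and the synthetic data are hypothetical, a regressor is used instead of a classifier, and scikit-learn's impurity-based `feature_importances_` stands in for SHAP or permutation importance to keep the example dependency-light. The statistical test of steps 4 and 5 is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def count_hits(X, y, feature_names, n_trials=10, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    hits = {name: 0 for name in feature_names}
    n_features = X.shape[1]
    for trial in range(n_trials):
        # Step 1: shadow copies, each column shuffled independently.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        extended = np.hstack([X, shadows])
        # Step 2: fit a model on the extended data and rank the features.
        model = RandomForestRegressor(n_estimators=30, random_state=trial)
        model.fit(extended, y)
        importances = model.feature_importances_
        # Step 3: the threshold is the maximum shadow importance.
        threshold = importances[n_features:].max()
        for j, name in enumerate(feature_names):
            if importances[j] > threshold:
                hits[name] += 1
    return hits


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))  # synthetic data: two signals, two noise columns
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
names = ["signal_0", "signal_1", "noise_0", "noise_1"]
hits = count_hits(X, y, names)
print(hits)
```

Features that truly drive the response should collect a hit in nearly every trial, while noise features only occasionally exceed the best shadow by chance.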

If the algorithm has reached its set limit of runs and an importance has not been assigned to each feature, the user has two choices: either increase the number of runs, or use the tentative rough fix function, which compares the median importance values of the unassigned features and the maximum shadow feature to make the decision.
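The median comparison behind the rough fix can be illustrated with a small sketch. The function below is hypothetical (it is not the package's method), and the importance histories are made-up numbers; it only shows the decision rule described above.

```python
from statistics import median


def tentative_rough_fix(importance_history, max_shadow_history):
    """Accept a tentative feature when its median importance exceeds the
    median of the per-run maximum shadow importance; reject it otherwise."""
    shadow_median = median(max_shadow_history)
    decisions = {}
    for feature, history in importance_history.items():
        decisions[feature] = (
            "accepted" if median(history) > shadow_median else "rejected"
        )
    return decisions


# Made-up importance values recorded over four runs.
history = {
    "feature_a": [0.30, 0.28, 0.35, 0.31],
    "feature_b": [0.05, 0.12, 0.04, 0.06],
}
max_shadow = [0.10, 0.09, 0.11, 0.10]
decisions = tentative_rough_fix(history, max_shadow)
print(decisions)
# {'feature_a': 'accepted', 'feature_b': 'rejected'}
```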

Installation

Use the package manager pip to install BorutaShap.

    pip install BorutaShap

Usage

For more use cases such as alternative models, sampling or changing the importance metric please view the notebooks here.

Using Shap and Basic Random Forest

```python
from BorutaShap import BorutaShap, load_data

X, y = load_data(data_type='regression')
X.head()
```

```python
# No model selected, so the default is a Random Forest.
# If classification is True it is a classification problem.
Feature_Selector = BorutaShap(importance_measure='shap',
                              classification=False)

'''
sample: Boolean
    If true, a row-wise sample of the data will be used to calculate the feature importance values.

sample_fraction: float
    The fraction of the original data used in calculating the feature importance values; only used if sample == True.

train_or_test: string
    Decides whether the feature importance should be calculated on out-of-sample data. See the discussion here:
    https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html#introduction-to-test-vs.training-data

normalize: Boolean
    If true, the importance values will be normalized using the z-score formula.

verbose: Boolean
    A flag indicating whether to print out all the rejected or accepted features.
'''
Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True, verbose=True)
```
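For reference, the z-score formula mentioned for `normalize` is z = (x - mean) / standard deviation. A small worked example follows; the helper name and the input values are made up, and the use of the population standard deviation is an assumption, since the source does not say which variant the package applies.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


importances = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
normalized = z_scores(importances)
print(normalized)
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```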

```python
# Returns a boxplot of the features
Feature_Selector.plot(which_features='all')
```

```python
# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()
```

Using BorutaShap with another model XGBoost

```python
from BorutaShap import BorutaShap, load_data
from xgboost import XGBClassifier

X, y = load_data(data_type='classification')
X.head()
```

```python
model = XGBClassifier()

# If classification is False it is a regression problem
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=True)

Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True, verbose=True)
```

```python
# Returns a boxplot of the features
Feature_Selector.plot(which_features='all')
```

```python
# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()
```

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

If you wish to cite this work, please click on the Zenodo badge at the top of this README file.

MIT

Issues

in response to [BUG] BorutaSHAP.py load Boston Import Error #111

opened on 2023-03-21 10:02:59 by IanWord

Scikit-learn versions 1.2 and above do not support the Boston dataset from sklearn.datasets. This problem was raised back in December: https://github.com/Ekeany/Boruta-Shap/issues/111. While one could work around it by importing the Boston dataset from other sources, I am not an official maintainer, so I merely suggest replacing the sklearn.datasets toy dataset load_boston() with load_diabetes().

What does this PR do?

This PR replaces the use of the load_boston() dataset in BorutaShap with load_diabetes() to avoid compatibility issues with scikit-learn >=1.2.
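A replacement loader along these lines might look like the sketch below. The function name `load_regression_data` is hypothetical and this is not the PR's actual diff; it only shows that `load_diabetes` can supply a feature frame and target in the same shape of call.

```python
from sklearn.datasets import load_diabetes


def load_regression_data():
    """Return the diabetes toy dataset as (features DataFrame, target Series)."""
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    return X, y


X, y = load_regression_data()
print(X.shape, y.shape)
```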

References

- Issue raised in December 2022: https://github.com/Ekeany/Boruta-Shap/issues/111
- Documentation for load_diabetes(): https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes

Testing performed

I have tested the loading functionality, replacing load_boston() with load_diabetes(), on my local machine and the code runs without errors.

Known issues

transform function for sklearn compatibility

opened on 2023-02-06 01:05:21 by jckkvs

What does this PR do?

Adds a transform function for sklearn compatibility. It enables the following:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from BorutaShap import BorutaShap

# X and y are assumed to be a feature matrix and target loaded beforehand.
pipe = Pipeline(steps=[("selector", BorutaShap()),
                       ("Regressor", RandomForestRegressor())])
pipe.fit(X, y)
```

Problem

cross_val_predict is not supported yet:

```
~\Anaconda3\envs\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891         res = transformer.fit_transform(X, y, **fit_params)
    892     else:
--> 893         res = transformer.fit(X, y, **fit_params).transform(X)
    894
    895     if weight is None:

AttributeError: 'NoneType' object has no attribute 'transform'
```

The reason for this is that BorutaSHAP does not implement get_params and set_params, so clone(BorutaSHAP()) does not work.
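As a rough illustration of what scikit-learn expects, the sketch below defines a hypothetical `VarianceSelector` (it is not BorutaShap, and not part of this PR). Inheriting from `BaseEstimator` supplies `get_params` and `set_params` so that `clone` works, and returning `self` from `fit` allows the `fit(...).transform(...)` chain seen in the traceback above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class VarianceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector that keeps the k highest-variance columns."""

    def __init__(self, k=2):
        self.k = k  # stored under the same name so get_params() can find it

    def fit(self, X, y=None):
        variances = np.var(np.asarray(X), axis=0)
        self.selected_ = np.sort(np.argsort(variances)[-self.k:])
        return self  # returning self lets fit(...).transform(...) chain

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]


selector = VarianceSelector(k=1)
cloned = clone(selector)  # relies on get_params/set_params from BaseEstimator
X = np.array([[0.0, 1.0], [0.0, 5.0], [0.1, 9.0]])
kept = cloned.fit(X).transform(X).ravel()
print(cloned.get_params(), kept)
```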

Hi, I am having trouble with a task in BorutaSHAP which is stuck at 0% progression apparently.[BUG]

opened on 2023-01-24 11:33:11 by federiconuta


[BUG] BorutaSHAP.py load Boston Import Error

opened on 2022-12-09 08:53:45 by HishamSalem

Describe the bug

Importing load_boston in BorutaShap.py leads to an import error. This is due to scikit-learn version 1.2 and above.

To Reproduce

Steps to reproduce the behavior: run `from BorutaShap import BorutaShap`

Expected behavior

Package would import normally

Output

    ImportError                               Traceback (most recent call last)
    in
    ----> 1 from BorutaShap import BorutaShap

    1 frames
    /usr/local/lib/python3.8/dist-packages/sklearn/datasets/__init__.py in __getattr__(name)
        154             """
        155         )
    --> 156         raise ImportError(msg)
        157     try:
        158         return globals()[name]

ImportError: load_boston has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original source::

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows::

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and::

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle. "Racist data destruction?" https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

[2] Harrison Jr, David, and Daniel L. Rubinfeld. "Hedonic housing prices and the demand for clean air." Journal of environmental economics and management 5.1 (1978): 81-102. https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air


Bump certifi from 2019.11.28 to 2022.12.7

opened on 2022-12-08 04:20:59 by dependabot[bot]

Bumps certifi from 2019.11.28 to 2022.12.7.


Bump pillow from 7.0.0 to 9.3.0

opened on 2022-11-22 05:32:53 by dependabot[bot]

Bumps pillow from 7.0.0 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

  • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

  • Initialize libtiff buffer when saving #6699 [radarhere]

  • Inline fname2char to fix memory leak #6329 [nulano]

  • Fix memory leaks related to text features #6330 [nulano]

  • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

  • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

  • Fixed set_variation_by_name offset #6445 [radarhere]

  • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

  • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

  • Added ExifTags enums #6630 [radarhere]

  • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

  • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

  • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

  • Added GPS TIFF tag info #6661 [radarhere]

  • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

  • Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)


Releases

BorutaShap 2020-11-05 17:23:54

BorutaShap is a wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values.

Boruta Shap 2020-09-27 12:31:29

BorutaShap is a wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values.