.. -*- mode: rst -*-
|Tests|_ |Codecov|_ |PythonVersion|_ |PyPi|_ |Docs|_

.. |Tests| image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/actions/workflows/ci-tests.yml/badge.svg?branch=master
.. _Tests: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/actions/workflows/ci-tests.yml

.. |Codecov| image:: https://codecov.io/gh/rodrigo-arenas/Sklearn-genetic-opt/branch/master/graphs/badge.svg?branch=master&service=github
.. _Codecov: https://codecov.io/github/rodrigo-arenas/Sklearn-genetic-opt?branch=master

.. |PythonVersion| image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue
.. _PythonVersion: https://www.python.org/downloads/

.. |PyPi| image:: https://badge.fury.io/py/sklearn-genetic-opt.svg
.. _PyPi: https://badge.fury.io/py/sklearn-genetic-opt

.. |Docs| image:: https://readthedocs.org/projects/sklearn-genetic-opt/badge/?version=latest
.. _Docs: https://sklearn-genetic-opt.readthedocs.io/en/latest/?badge=latest

.. |Contributors| image:: https://contributors-img.web.app/image?repo=rodrigo-arenas/sklearn-genetic-opt
.. _Contributors: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/graphs/contributors
.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/logo.png?raw=true
Sklearn-genetic-opt
===================
Hyperparameters tuning and feature selection for scikit-learn models, using evolutionary algorithms.

It is meant as an alternative to scikit-learn methods such as Grid Search and Randomized Grid Search for hyperparameters tuning, and to RFE (Recursive Feature Elimination) and Select From Model for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that optimizes (maximizes or minimizes) the cross-validation scores; it can be used for both regression and classification problems.
Documentation is available `here <https://sklearn-genetic-opt.readthedocs.io/>`_.
Main Features:
--------------

* **GASearchCV**: hyperparameters tuning with cross-validation for any scikit-learn classifier or regressor.
* **GAFeatureSelectionCV**: feature selection with cross-validation.
* **Schedulers**: adaptive methods to control the mutation and crossover probabilities during the optimization.
* **Plots** and real-time metrics visualization to understand the optimization process.
* Built-in **MLflow** and **TensorBoard** logging integrations.
Demos on Features:
------------------
Visualize the progress of your training:
.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/progress_bar.gif?raw=true
Real-time metrics visualization and comparison across runs:
.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/tensorboard_log.png?raw=true
Sampled distribution of hyperparameters:
.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/density.png?raw=true
Artifacts logging:
.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/mlflow_artifacts_4.png?raw=true
Usage:
------

Install sklearn-genetic-opt
~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's advised to install sklearn-genetic-opt inside a virtual environment; inside the env use::

    pip install sklearn-genetic-opt
If you want all the features, including plotting and the TensorBoard and MLflow logging capabilities, install all the extra packages::

    pip install sklearn-genetic-opt[all]
Example: Hyperparameters Tuning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Continuous, Categorical, Integer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.datasets import load_digits
    from sklearn.metrics import accuracy_score

    data = load_digits()
    n_samples = len(data.images)
    X = data.images.reshape((n_samples, -1))
    y = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    clf = RandomForestClassifier()

    param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
                  'bootstrap': Categorical([True, False]),
                  'max_depth': Integer(2, 30),
                  'max_leaf_nodes': Integer(2, 35),
                  'n_estimators': Integer(100, 300)}

    cv = StratifiedKFold(n_splits=3, shuffle=True)

    evolved_estimator = GASearchCV(estimator=clf,
                                   cv=cv,
                                   scoring='accuracy',
                                   population_size=20,
                                   generations=35,
                                   param_grid=param_grid,
                                   n_jobs=-1,
                                   verbose=True,
                                   keep_top_k=4)

    # Train and optimize the estimator
    evolved_estimator.fit(X_train, y_train)
    # Best parameters found
    print(evolved_estimator.best_params_)
    # Use the model fitted with the best parameters
    y_predict_ga = evolved_estimator.predict(X_test)
    print(accuracy_score(y_test, y_predict_ga))

    # Saved metadata for further analysis
    print("Stats achieved in each generation: ", evolved_estimator.history)
    print("Best k solutions: ", evolved_estimator.hof)
Example: Feature Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

    from sklearn_genetic import GAFeatureSelectionCV, ExponentialAdapter
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(5, 10, size=(X.shape[0], 5))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = SVC(gamma='auto')

    mutation_scheduler = ExponentialAdapter(0.8, 0.2, 0.01)
    crossover_scheduler = ExponentialAdapter(0.2, 0.8, 0.01)

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        scoring="accuracy",
        population_size=30,
        generations=20,
        mutation_probability=mutation_scheduler,
        crossover_probability=crossover_scheduler,
        n_jobs=-1)

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

    # Features selected by the algorithm
    features = evolved_estimator.support_
    print(features)

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test)
    print(accuracy_score(y_test, y_predict_ga))

    # Transform the original data to the selected features
    X_reduced = evolved_estimator.transform(X_test)
Changelog
---------

See the `changelog <https://sklearn-genetic-opt.readthedocs.io/en/latest/release_notes.html>`__
for notes on the changes of Sklearn-genetic-opt.
Important links
---------------

Source code
~~~~~~~~~~~
You can check the latest development version with the command::

    git clone https://github.com/rodrigo-arenas/Sklearn-genetic-opt.git

Install the development dependencies::

    pip install -r dev-requirements.txt
Check the latest in-development documentation: https://sklearn-genetic-opt.readthedocs.io/en/latest/
Contributing
------------
Contributions are more than welcome!
There are several opportunities on the ongoing project, so please get in touch if you would like to help out.
Make sure to check the current issues and also the
`Contribution guide <https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/CONTRIBUTING.md>`_.
Big thanks to the people who are helping with this project!
|Contributors|_
Testing
-------
After installation, you can launch the test suite from outside the source directory::

    pytest sklearn_genetic
Hello,

I have been trying to understand the selection and crossover methods for GASearchCV, but I still have some doubts. I am using the default algorithm (``eaMuPlusLambda``), but in the implementation it appears that both ``mu`` and ``lambda`` are set to ``None``.

If I am not wrong, these parameters establish the following:

- ``mu``: the number of individuals chosen from the previous generation without undergoing mutation or crossover.
- ``lambda``: the number of individuals in the next generation obtained from crossing and mutating the parents of the previous generation.

If both of them are set to ``None``, then I don't understand which percentage of a new generation consists of parents from the previous one and which percentage consists of mutated children of crossed parents.

I believe that the reproduction process is the following one:

My question is: is the next generation composed only of (possibly mutated) children of crossed parents? Or are there also parents that are not crossed? In this second case, what is the percentage of children and parents?

Thanks a lot in advance!

Mario
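For context, here is a simplified sketch of how DEAP builds one generation under the (``mu`` + ``lambda``) strategy, adapted from the logic of ``algorithms.eaMuPlusLambda`` and its ``varOr`` helper. How sklearn-genetic-opt derives ``mu`` and ``lambda`` from ``population_size`` when they are ``None`` is an internal detail not asserted here:

.. code-block:: python

    import random

    def mu_plus_lambda_generation(population, toolbox, mu, lambda_, cxpb, mutpb):
        # Produce lambda_ offspring; each one comes from exactly one of three
        # branches: crossover (prob cxpb), mutation (prob mutpb), or plain
        # reproduction (prob 1 - cxpb - mutpb, an unchanged cloned parent)
        offspring = []
        for _ in range(lambda_):
            choice = random.random()
            if choice < cxpb:
                ind1, ind2 = map(toolbox.clone, random.sample(population, 2))
                ind1, ind2 = toolbox.mate(ind1, ind2)
                del ind1.fitness.values  # force re-evaluation
                offspring.append(ind1)
            elif choice < cxpb + mutpb:
                ind = toolbox.clone(random.choice(population))
                ind, = toolbox.mutate(ind)
                del ind.fitness.values
                offspring.append(ind)
            else:
                offspring.append(toolbox.clone(random.choice(population)))

        # Evaluate only the individuals whose fitness was invalidated
        invalid = [ind for ind in offspring if not ind.fitness.valid]
        for ind, fit in zip(invalid, map(toolbox.evaluate, invalid)):
            ind.fitness.values = fit

        # "+" strategy: the mu survivors are selected from parents AND
        # offspring, so unmodified parents can reach the next generation
        return toolbox.select(population + offspring, mu)

So, roughly: the next generation is not only mutated children of crossed parents. Unchanged parents can enter it both through the reproduction branch and through the "+" selection over ``population + offspring``; the exact proportions are stochastic, governed by ``cxpb``, ``mutpb`` and the selection pressure rather than by a fixed parent/child split.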
Is your feature request related to a problem? Please describe.

May I ask if there are plans to release a ``conda`` package in the near future?

I want to use this package within a project whose virtual environment is created with ``conda``, and all installed packages are also from conda/conda-forge. I have ``pip`` installed in the environment and tried to install ``sklearn-genetic-opt`` via pip as stated in the docs (``pip install sklearn-genetic-opt``). ``pip`` identified the dependencies and installed them (``deap``, ``numpy``, etc.). The problem, though, is that it doesn't integrate well with the environment. For instance, I have ``pandas 1.5.0`` installed in the ``conda`` environment, but when I open a Python session and run ``import sklearn_genetic``, the interpreter returns an error claiming that ``pandas`` is not installed.

Describe the solution you'd expect

The package would be easier to use if it were possible to install it within ``conda``.

Additional context
Everything I reported refers to a Windows 10 21H2 machine.
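Until a conda-forge recipe exists, a common workaround (not specific to this package; the environment name and version pins below are illustrative) is to create the environment and its compiled dependencies with ``conda`` first, then pip-install only ``sklearn-genetic-opt`` inside that same environment::

    conda create -n genetic-env -c conda-forge python=3.10 pandas=1.5.0 scikit-learn pip
    conda activate genetic-env
    pip install sklearn-genetic-opt[all]

Whether this resolves the ``pandas`` resolution error reported above depends on the environment setup.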
I open this issue for newcomers who would like to contribute to an open-source project.
The idea is to improve the current docs and add more examples using the library; you can see the current docs files here.
You could also add external articles showcasing some applications of the package; see these for example.
Here are the stable docs.
I think I might have a solution for MLflow, finally. I've been working with Docker quite a bit lately, so I think this might work. Tell me what you think.

Is your feature request related to a problem? Please describe.
Currently there are no unit tests for the integration with MLflow.

Describe the solution you'd expect
Create the file ``sklearn_genetic/tests/test_mlflow.py`` and put in it the set of tests that covers the use cases of MLflow from ``sklearn_genetic.mlflow``. It should test whether the config creates (or not) a new topic and the use of each parameter, and also that, at the end of the runs, the logged artifacts/metrics/hyperparameters exist in the MLflow server, cleaning up the resources after the test has ended.
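A minimal sketch of what such a test could look like; the ``MLflowConfig`` arguments and the ``log_config`` parameter name are assumptions about the integration API, not a confirmed signature:

.. code-block:: python

    # sklearn_genetic/tests/test_mlflow.py (illustrative sketch)
    import mlflow
    import pytest
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Integer
    from sklearn_genetic.mlflow_log import MLflowConfig  # assumed import path


    @pytest.fixture
    def tracking_uri(tmp_path):
        # Point MLflow at a throw-away directory so the test cleans up after itself
        uri = f"file://{tmp_path}/mlruns"
        mlflow.set_tracking_uri(uri)
        yield uri
        mlflow.end_run()


    def test_runs_are_logged(tracking_uri):
        X, y = load_iris(return_X_y=True)
        config = MLflowConfig(
            tracking_uri=tracking_uri,
            experiment="test-experiment",
            run_name="test-run",
            save_models=False,
        )
        evolved_estimator = GASearchCV(
            estimator=DecisionTreeClassifier(),
            param_grid={"max_depth": Integer(2, 10)},
            scoring="accuracy",
            population_size=4,
            generations=2,
            log_config=config,  # assumed parameter name
        )
        evolved_estimator.fit(X, y)

        # The hyperparameters/metrics logged during fit should now be
        # retrievable from the tracking server
        runs = mlflow.search_runs(experiment_names=["test-experiment"])
        assert not runs.empty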
This is a small release for a minor bug fix.

To install it with all the extra packages, use::

    pip install sklearn-genetic-opt[all]
This release brings support for Python 3.10; it also comes with different API updates and algorithm optimizations.

- ``GAFeatureSelectionCV`` now mimics the scikit-learn feature-selection algorithms API instead of Grid Search; this enables an easier implementation as a selection method that is closer to the scikit-learn API.
- Improved ``GAFeatureSelectionCV`` candidate generation when ``max_features`` is set; it also ensures there is at least one feature selected.
- ``crossover_probability`` and ``mutation_probability`` are now correctly passed to the mate and mutation functions inside ``GAFeatureSelectionCV``.

Thanks to the people who contributed with their ideas and suggestions.
This release comes with new features and general performance improvements.

- Introducing Adaptive Schedulers to enable adaptive mutation and crossover probabilities; the currently supported schedulers are ``ConstantAdapter``, ``ExponentialAdapter``, ``InverseAdapter``, and ``PotentialAdapter``.
- Added a ``random_state`` parameter (default ``None``) to the ``Continuous``, ``Categorical`` and ``Integer`` classes from ``space``, to fix the random seed during hyperparameters sampling (see the sketch after this list).
- Changed the default values of ``mutation_probability`` and ``crossover_probability`` to 0.8 and 0.2, respectively.
- The ``weighted_choice`` function used in ``GAFeatureSelectionCV`` was re-written to give more probability to a number of features closer to the ``max_features`` parameter.
- Removed the unused and broken function ``plot_parallel_coordinates()``.
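A minimal sketch of the ``random_state`` usage mentioned above, mirroring the ``param_grid`` from the tuning example (the values are illustrative):

.. code-block:: python

    from sklearn_genetic.space import Continuous, Categorical, Integer

    # Fix the seed used while sampling each hyperparameter, so repeated
    # runs draw the same candidate values
    param_grid = {
        'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform', random_state=42),
        'bootstrap': Categorical([True, False], random_state=42),
        'max_depth': Integer(2, 30, random_state=42),
    }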
This release implements a change for when the ``max_features`` parameter of class ``GAFeatureSelectionCV`` is set: the initial population is now sampled giving more probability to solutions with fewer than ``max_features`` features.
This release comes with some requested features and enhancements.

Class ``GAFeatureSelectionCV`` now has a parameter called ``max_features`` (int, default ``None``). If it is not ``None``, it will penalize individuals that select more features than ``max_features``, putting a "soft" upper bound on the number of features to be selected; a minimal sketch follows.
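A minimal sketch (the surrounding parameters mirror the feature-selection example above):

.. code-block:: python

    from sklearn.svm import SVC
    from sklearn_genetic import GAFeatureSelectionCV

    # Penalize individuals that select more than 4 features, putting a
    # soft upper bound on the size of the selected subset
    evolved_estimator = GAFeatureSelectionCV(
        estimator=SVC(gamma="auto"),
        scoring="accuracy",
        max_features=4,
        population_size=30,
        generations=20,
        n_jobs=-1,
    )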
Classes ``GASearchCV`` and ``GAFeatureSelectionCV`` now support multi-metric evaluation the same way scikit-learn does; you will see this reflected in the ``logbook`` and ``cv_results_`` objects, where you now get results for each metric. As in scikit-learn, if multi-metric is used, the ``refit`` parameter must be a str specifying the metric used to evaluate the cv-scores; a minimal sketch follows.
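A minimal sketch of multi-metric usage (the scorer names are scikit-learn's built-in scoring strings):

.. code-block:: python

    from sklearn.tree import DecisionTreeClassifier
    from sklearn_genetic import GASearchCV
    from sklearn_genetic.space import Integer

    # Several scorers, scikit-learn style; `refit` names the metric that
    # picks the best estimator and hyperparameters
    evolved_estimator = GASearchCV(
        estimator=DecisionTreeClassifier(),
        param_grid={'max_depth': Integer(2, 30)},
        scoring={'accuracy': 'accuracy', 'balanced_accuracy': 'balanced_accuracy'},
        refit='accuracy',
        cv=3,
    )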
Training gracefully stops if interrupted by one of these exceptions: ``KeyboardInterrupt``, ``SystemExit``, ``StopIteration``. When one of them is raised, the model finishes the current generation and saves the current best model; this only works if at least one generation has been completed.
The following parameters changed their default values to create more extensive and different models with better results:

- ``population_size`` from 10 to 50
- ``generations`` from 40 to 80
- ``mutation_probability`` from 0.1 to 0.2
This is an exciting release! It introduces feature selection capabilities to the package.

- New ``GAFeatureSelectionCV`` class for feature selection along with any scikit-learn classifier or regressor. It optimizes the cv-score while minimizing the number of features to select. This class is compatible with the mlflow and tensorboard integrations, the Callbacks, and the ``plot_fitness_evolution`` function.
- The module ``mlflow`` was renamed to ``mlflow_log`` to avoid unexpected errors on name resolution.