ML hyperparameters tuning and feature selection, using evolutionary algorithms.

rodrigo-arenas, updated 🕥 2023-03-15 01:37:40

.. -*- mode: rst -*-

|Tests|_ |Codecov|_ |PythonVersion|_ |PyPi|_ |Docs|_

.. |Tests| image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/actions/workflows/ci-tests.yml/badge.svg?branch=master
.. _Tests: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/actions/workflows/ci-tests.yml

.. |Codecov| image:: https://codecov.io/gh/rodrigo-arenas/Sklearn-genetic-opt/branch/master/graphs/badge.svg?branch=master&service=github
.. _Codecov: https://codecov.io/github/rodrigo-arenas/Sklearn-genetic-opt?branch=master

.. |PythonVersion| image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue
.. _PythonVersion: https://www.python.org/downloads/

.. |PyPi| image:: https://badge.fury.io/py/sklearn-genetic-opt.svg
.. _PyPi: https://badge.fury.io/py/sklearn-genetic-opt

.. |Docs| image:: https://readthedocs.org/projects/sklearn-genetic-opt/badge/?version=latest
.. _Docs: https://sklearn-genetic-opt.readthedocs.io/en/latest/?badge=latest

.. |Contributors| image:: https://contributors-img.web.app/image?repo=rodrigo-arenas/sklearn-genetic-opt
.. _Contributors: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/graphs/contributors

.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/logo.png?raw=true

Sklearn-genetic-opt

scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms.

This is meant to be an alternative to popular scikit-learn methods such as Grid Search and Randomized Grid Search for hyperparameter tuning, and RFE and Select From Model for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that optimizes (max or min) the cross-validation scores. It can be used for both regression and classification problems.

Documentation is available `here <https://sklearn-genetic-opt.readthedocs.io/>`_

Main Features:

  • GASearchCV: Main class of the package for hyperparameters tuning, holds the evolutionary cross-validation optimization routine.
  • GAFeatureSelectionCV: Main class of the package for feature selection.
  • Algorithms: Set of different evolutionary algorithms to use as an optimization procedure.
  • Callbacks: Custom evaluation strategies to generate early stopping rules, logging (into TensorBoard, .pkl files, etc) or your custom logic.
  • Schedulers: Adaptive methods to control learning parameters.
  • Plots: Generate pre-defined plots to understand the optimization process.
  • MLflow: Built-in integration with mlflow to log all the hyperparameters, cv-scores and the fitted models.

Demos on Features:

Visualize the progress of your training:

.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/progress_bar.gif?raw=true

Real-time metrics visualization and comparison across runs:

.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/tensorboard_log.png?raw=true

Sampled distribution of hyperparameters:

.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/density.png?raw=true

Artifacts logging:

.. image:: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/docs/images/mlflow_artifacts_4.png?raw=true

Usage:

Install sklearn-genetic-opt

It's advised to install sklearn-genetic-opt in a virtual environment. Inside the environment, run::

   pip install sklearn-genetic-opt

If you want to get all the features, including plotting, tensorboard and mlflow logging capabilities, install all the extra packages::

   pip install sklearn-genetic-opt[all]

Example: Hyperparameters Tuning

.. code-block:: python

   from sklearn_genetic import GASearchCV
   from sklearn_genetic.space import Continuous, Categorical, Integer
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split, StratifiedKFold
   from sklearn.datasets import load_digits
   from sklearn.metrics import accuracy_score

   data = load_digits()
   n_samples = len(data.images)
   X = data.images.reshape((n_samples, -1))
   y = data['target']
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

   clf = RandomForestClassifier()

   param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
                 'bootstrap': Categorical([True, False]),
                 'max_depth': Integer(2, 30),
                 'max_leaf_nodes': Integer(2, 35),
                 'n_estimators': Integer(100, 300)}

   cv = StratifiedKFold(n_splits=3, shuffle=True)

   evolved_estimator = GASearchCV(estimator=clf,
                                  cv=cv,
                                  scoring='accuracy',
                                  population_size=20,
                                  generations=35,
                                  param_grid=param_grid,
                                  n_jobs=-1,
                                  verbose=True,
                                  keep_top_k=4)

   # Train and optimize the estimator
   evolved_estimator.fit(X_train, y_train)
   # Best parameters found
   print(evolved_estimator.best_params_)
   # Use the model fitted with the best parameters
   y_predict_ga = evolved_estimator.predict(X_test)
   print(accuracy_score(y_test, y_predict_ga))

   # Saved metadata for further analysis
   print("Stats achieved in each generation: ", evolved_estimator.history)
   print("Best k solutions: ", evolved_estimator.hof)

Example: Feature Selection

.. code:: python3

   from sklearn_genetic import GAFeatureSelectionCV, ExponentialAdapter
   from sklearn.model_selection import train_test_split
   from sklearn.svm import SVC
   from sklearn.datasets import load_iris
   from sklearn.metrics import accuracy_score
   import numpy as np

   data = load_iris()
   X, y = data["data"], data["target"]

   # Add random non-important features
   noise = np.random.uniform(5, 10, size=(X.shape[0], 5))
   X = np.hstack((X, noise))

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

   clf = SVC(gamma='auto')
   mutation_scheduler = ExponentialAdapter(0.8, 0.2, 0.01)
   crossover_scheduler = ExponentialAdapter(0.2, 0.8, 0.01)

   evolved_estimator = GAFeatureSelectionCV(
       estimator=clf,
       scoring="accuracy",
       population_size=30,
       generations=20,
       mutation_probability=mutation_scheduler,
       crossover_probability=crossover_scheduler,
       n_jobs=-1)

   # Train and select the features
   evolved_estimator.fit(X_train, y_train)

   # Features selected by the algorithm
   features = evolved_estimator.support_
   print(features)

   # Predict only with the subset of selected features
   y_predict_ga = evolved_estimator.predict(X_test)
   print(accuracy_score(y_test, y_predict_ga))

   # Transform the original data to the selected features
   X_reduced = evolved_estimator.transform(X_test)
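To get a feel for what the two ``ExponentialAdapter`` schedulers in the example do over the 20 generations, here is a minimal standalone sketch. It assumes the three arguments are an initial value, an end value and a decay rate, and that the value decays exponentially from the first toward the second; the helper ``exponential_schedule`` is hypothetical and not part of the package, so check the schedulers documentation for the exact formula.

```python
import math


def exponential_schedule(initial_value, end_value, adaptive_rate, generation):
    """Hypothetical helper (assumed decay form, not the package's code).

    Moves from initial_value toward end_value as the generation grows.
    """
    decay = math.exp(-adaptive_rate * generation)
    return (initial_value - end_value) * decay + end_value


# Same numbers as the schedulers in the example above
mutation = [exponential_schedule(0.8, 0.2, 0.01, g) for g in range(20)]
crossover = [exponential_schedule(0.2, 0.8, 0.01, g) for g in range(20)]

for g in (0, 10, 19):
    print(g, round(mutation[g], 3), round(crossover[g], 3))
```

Under this assumed form the mutation probability slowly falls while the crossover probability rises, and the two always add up to 1 because the schedules mirror each other.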

Changelog

See the `changelog <https://sklearn-genetic-opt.readthedocs.io/en/latest/release_notes.html>`__ for notes on the changes of Sklearn-genetic-opt

Important links

  • Official source code repo: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/
  • Download releases: https://pypi.org/project/sklearn-genetic-opt/
  • Issue tracker: https://github.com/rodrigo-arenas/Sklearn-genetic-opt/issues
  • Stable documentation: https://sklearn-genetic-opt.readthedocs.io/en/stable/

Source code

You can check the latest development version with the command::

   git clone https://github.com/rodrigo-arenas/Sklearn-genetic-opt.git

Install the development dependencies::

   pip install -r dev-requirements.txt

Check the latest in-development documentation: https://sklearn-genetic-opt.readthedocs.io/en/latest/

Contributing

Contributions are more than welcome! There are several opportunities in the ongoing project, so please get in touch if you would like to help out. Make sure to check the current issues and also the `Contribution guide <https://github.com/rodrigo-arenas/Sklearn-genetic-opt/blob/master/CONTRIBUTING.md>`_.

Big thanks to the people who are helping with this project!

|Contributors|_

Testing

After installation, you can launch the test suite from outside the source directory::

   pytest sklearn_genetic

Issues

Question about selection and crossover

opened on 2023-03-15 17:50:43 by mario-sanz

Hello,

I have been trying to understand the selection and crossover methods for GASearchCV but I still have some doubts. I am using the default algorithm (eaMuPlusLambda), but in the implementation it appears that both mu and lambda are set to None.

If I am not wrong, these parameters establish the following:

  • mu: The number of individuals chosen from the previous generation without undergoing mutation or crossover.
  • lambda: The number of individuals in the next generation obtained from crossing and mutating the parents from the previous generation.

If both of them are set to None, then I don't understand which percentage of a new generation is parents from the previous one and which percentage is mutated children of crossed parents.

I believe that the reproduction process is the following one:

  1. Selection: With the chosen selection method, select individuals that will produce next generation.
  2. Crossover: Apply crossover to some of the selected individuals according to crossover probability.
  3. Mutation: Apply mutation to the resulting population according to mutation probability.

My question is: is the next generation composed only of (possibly mutated) children of crossed parents, or are there also parents that are not crossed? In the second case, what is the percentage of children and parents?

Thanks a lot in advance!

Mario
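For reference, DEAP documents its eaMuPlusLambda loop as follows: in each generation, lambda offspring are produced, each one by crossover, by mutation or by plain copy (never more than one of these), and the next generation is then selected as mu individuals out of parents plus offspring. The sketch below reproduces that loop in plain Python on a toy bit-string problem. All function names are hypothetical, truncation selection stands in for whatever selection operator is configured, and it does not show how sklearn-genetic-opt fills in mu and lambda when they are left as None.

```python
import random


def one_max(individual):
    """Toy fitness: number of ones in the bit list."""
    return sum(individual)


def mutate(individual, rng, flip_prob=0.1):
    """Flip each bit independently with probability flip_prob."""
    return [1 - bit if rng.random() < flip_prob else bit for bit in individual]


def crossover(a, b, rng):
    """One-point crossover, keeping a single child."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def var_or(population, lambda_, cxpb, mutpb, rng):
    """Produce lambda_ offspring, each by crossover OR mutation OR copy."""
    assert cxpb + mutpb <= 1.0
    offspring = []
    for _ in range(lambda_):
        choice = rng.random()
        if choice < cxpb:
            a, b = rng.sample(population, 2)
            offspring.append(crossover(a, b, rng))
        elif choice < cxpb + mutpb:
            offspring.append(mutate(rng.choice(population), rng))
        else:
            offspring.append(list(rng.choice(population)))
    return offspring


def mu_plus_lambda(population, mu, lambda_, cxpb, mutpb, generations, rng):
    for _ in range(generations):
        offspring = var_or(population, lambda_, cxpb, mutpb, rng)
        # Parents compete with their offspring; keep the best mu of both.
        # (Simplification: truncation instead of a configurable selector.)
        pool = population + offspring
        population = sorted(pool, key=one_max, reverse=True)[:mu]
    return population


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(12)] for _ in range(10)]
final = mu_plus_lambda(start, mu=10, lambda_=20, cxpb=0.2, mutpb=0.8,
                       generations=15, rng=rng)
print(max(map(one_max, start)), max(map(one_max, final)))
```

In this scheme there is no fixed percentage of parents and children in a generation: any parent survives only if it ranks among the best mu of the combined pool.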

[FEATURE] Conda package

opened on 2022-10-28 20:46:53 by abianco88

Is your feature request related to a problem? Please describe. May I ask if there are plans to release a conda package in the near future?

I want to use this package within a project whose virtual environment is created with conda and where all installed packages also come from conda/conda-forge. I have pip installed in the environment and tried to install sklearn-genetic-opt via pip as stated in the docs (pip install sklearn-genetic-opt). pip identified the dependencies and installed them (deap, numpy, etc.). The problem, though, is that it doesn't integrate well with the environment. For instance, I have pandas 1.5.0 installed in the conda environment, but when I open a Python session and run import sklearn_genetic, the interpreter returns an error claiming that pandas is not installed.

Describe the solution you'd expect The package would be easier to use if it were possible to install it within conda.

Additional context Everything I reported refers to a Windows 10 21H2 machine.

Improve documentation and examples

opened on 2022-06-16 13:47:27 by rodrigo-arenas

I opened this issue for newcomers who would like to contribute to an open-source project.

The idea is to improve the current docs and add more examples using the library. You can see the current docs files here

You could also add external articles to the package showcasing some applications, see these for example

Here is the stable docs

Workflow for mlflow added

opened on 2021-09-11 06:41:14 by Turtle24

I think I might have a solution for MLflow finally. I've been working with docker quite a bit lately so I think this might work. Tell me what you think.

[FEATURE] MLflow tests

opened on 2021-06-23 15:55:47 by rodrigo-arenas

Is your feature request related to a problem? Please describe. Currently there are no unit tests for the integration with MLflow.

Describe the solution you'd expect Create the file sklearn_genetic/tests/test_mlflow.py and add a set of tests covering the MLflow use case from sklearn_genetic.mlflow. The tests should check whether the config creates a new topic or not, exercise each parameter, verify that at the end of the runs the logged artifacts, metrics and hyperparameters exist in the mlflow server, and clean up the resources after the tests end.

Releases

0.10.1 2023-03-15 01:34:40

This is a small release for a minor bug fix

Features:

  • Install TensorFlow when using pip install sklearn-genetic-opt[all]

Bug Fixes:

  • Fixed a bug that wouldn’t allow cloning the GA classes when used inside a pipeline

0.10.0 2023-02-15 02:36:40

This release brings support for Python 3.10. It also comes with several API updates and algorithm optimizations.

API Changes:

  • GAFeatureSelectionCV now mimics the scikit-learn FeatureSelection algorithms API instead of Grid Search. This enables easier implementation as a selection method that is closer to the scikit-learn API
  • Improved GAFeatureSelectionCV candidate generation when max_features is set. It also ensures there is at least one feature selected
  • crossover_probability and mutation_probability are now correctly passed to the mate and mutation functions inside GAFeatureSelectionCV
  • Dropped support for Python 3.7 and added support for Python 3.10
  • Update most important packages from dev-requirements.txt to more recent versions
  • Update deprecated functions in tests

Thanks to the people who contributed with their ideas and suggestions

0.9.0 2022-06-06 22:32:57

This release comes with new features and general performance improvements

Features:

  • Introducing Adaptive Schedulers to enable adaptive mutation and crossover probabilities; currently, supported schedulers are: ConstantAdapter, ExponentialAdapter, InverseAdapter, and PotentialAdapter

  • Added a random_state parameter (default=None) to the Continuous, Categorical and Integer classes from space to fix the random seed during hyperparameter sampling.

API Changes:

  • Changed the default values of mutation_probability and crossover_probability to 0.8 and 0.2, respectively.

  • The weighted_choice function used in GAFeatureSelectionCV was re-written to give more probability to a number of features closer to the max_features parameter

  • Removed unused and broken function plot_parallel_coordinates()

Bug Fixes

  • Now, when using the plot_search_space() function, all the parameters get cast as np.float64 to avoid errors on the seaborn package while plotting bool values.

0.8.1 2022-03-09 19:13:56

This release changes what happens when the max_features parameter of class GAFeatureSelectionCV is set: the initial population is now sampled giving more probability to solutions with fewer than max_features features.
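As a rough illustration of biased sampling for an initial population, the sketch below first draws the number of features with weights that favor counts within the bound, then picks that many features at random. This is only one possible scheme: the function ``sample_feature_mask``, the ``over_weight`` value and the exact weighting are assumptions made for the example and are not taken from the package's implementation.

```python
import random


def sample_feature_mask(n_features, max_features, rng, over_weight=0.1):
    """Hypothetical sampler favoring subsets of at most max_features features.

    Returns a list of booleans, one per feature, with at least one True.
    """
    counts = list(range(1, n_features + 1))  # never sample an empty subset
    # Assumed weighting: full weight within the bound, reduced weight above it
    weights = [1.0 if k <= max_features else over_weight for k in counts]
    k = rng.choices(counts, weights=weights, k=1)[0]
    chosen = set(rng.sample(range(n_features), k))
    return [i in chosen for i in range(n_features)]


rng = random.Random(42)
masks = [sample_feature_mask(9, 3, rng) for _ in range(1000)]
share_within_bound = sum(sum(mask) <= 3 for mask in masks) / len(masks)
print(share_within_bound)
```

Larger subsets can still appear, only less often, which matches the idea of a "soft" bound rather than a hard limit.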

0.8.0 2022-01-05 02:36:13

This release comes with some requested features and enhancements.

Features:

  • Class GAFeatureSelectionCV now has a parameter called max_features, int, default=None. If it's not None, it will penalize individuals with more features than max_features, putting a "soft" upper bound to the number of features to be selected.

  • Classes GASearchCV and GAFeatureSelectionCV now support multi-metric evaluation the same way scikit-learn does; you will see this reflected on the logbook and cv_results_ objects, where now you get results for each metric. As in scikit-learn, if multi-metric is used, the refit parameter must be a str specifying the metric to evaluate the cv-scores.

  • Training gracefully stops if interrupted by some of these exceptions: KeyboardInterrupt, SystemExit, StopIteration. When one of these exceptions is raised, the model finishes the current generation and saves the current best model. It only works if at least one generation has been completed.
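The graceful-stop behavior described in the last item can be pictured with a small generic loop. This is a simplified sketch, not the package's code: ``run_generations`` and ``fake_generation`` are hypothetical names, and unlike the behavior described above, the sketch simply discards the interrupted generation instead of finishing it.

```python
def run_generations(evaluate_generation, generations):
    """Hypothetical loop that keeps the best score seen if interrupted."""
    best = None
    completed = 0
    try:
        for generation in range(generations):
            score = evaluate_generation(generation)
            if best is None or score > best:
                best = score
            completed += 1
    except (KeyboardInterrupt, SystemExit, StopIteration):
        if completed == 0:
            raise  # nothing has been completed yet, so nothing to keep
        print(f"Stopped after {completed} generations, keeping best={best}")
    return best, completed


def fake_generation(generation):
    """Stand-in for one generation; simulates Ctrl+C on the fourth one."""
    if generation == 3:
        raise KeyboardInterrupt
    return 0.5 + 0.1 * generation


best, completed = run_generations(fake_generation, generations=10)
```

The re-raise for zero completed generations mirrors the note that the feature only works once at least one generation has finished.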

API Changes:

  • The following parameters changed their default values to create more extensive and different models with better results:

      • population_size from 10 to 50
      • generations from 40 to 80
      • mutation_probability from 0.1 to 0.2

Docs:

  • A new notebook called Iris_multimetric was added to showcase the new multi-metric capabilities.

0.7.0 2021-11-17 23:14:23

This is an exciting release! It introduces feature selection capabilities to the package

Features:

  • GAFeatureSelectionCV class for feature selection along with any scikit-learn classifier or regressor. It optimizes the cv-score while minimizing the number of features to select. This class is compatible with the mlflow and tensorboard integration, the Callbacks, and the plot_fitness_evolution function.

API Changes:

The module mlflow was renamed to mlflow_log to avoid unexpected errors on name resolutions
