AutoML for clustering models in sklearn.

wywongbd, updated 2023-01-09 05:53:43

autocluster

autocluster is an automated machine learning (AutoML) toolkit for performing clustering tasks.

Report and presentation slides can be found here and here.

Prerequisites

  • Python 3.5 or above
  • Linux, or Windows via WSL

How to get started?

  1. First, install SMAC:
       sudo apt-get install build-essential swig
       conda install gxx_linux-64 gcc_linux-64 swig
       pip install smac==0.8.0
  2. Then install autocluster:
       pip install autocluster

How does it work?

  • autocluster automatically optimizes the configuration of a clustering problem. By configuration, we mean:

    • choice of dimension reduction algorithm
    • choice of clustering model
    • setting of dimension reduction algorithm's hyperparameters
    • setting of clustering model's hyperparameters
  • autocluster provides 3 different approaches to optimize the configuration (with increasing complexity):

    • random optimization
    • bayesian optimization
    • bayesian optimization + meta-learning (warmstarting)
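
The simplest of these approaches, random optimization, can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not autocluster's implementation: the search space, the `random_search` helper, and the use of the silhouette score as the objective are all hypothetical choices made for the example.

```python
import random

from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import silhouette_score

# Hypothetical search space: each entry samples a model with random hyperparameters.
DIM_REDUCERS = {
    "none": lambda rng: None,
    "pca": lambda rng: PCA(n_components=rng.randint(2, 4)),
    "tsvd": lambda rng: TruncatedSVD(n_components=rng.randint(2, 4)),
}
CLUSTERERS = {
    "kmeans": lambda rng: KMeans(n_clusters=rng.randint(2, 8), n_init=10, random_state=0),
    "birch": lambda rng: Birch(n_clusters=rng.randint(2, 8)),
}


def random_search(X, n_evaluations=20, seed=0):
    """Sample configurations at random and keep the one with the best silhouette score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_evaluations):
        # A configuration = choice of reducer + choice of clusterer + their hyperparameters.
        reducer_name = rng.choice(sorted(DIM_REDUCERS))
        cluster_name = rng.choice(sorted(CLUSTERERS))
        reducer = DIM_REDUCERS[reducer_name](rng)
        model = CLUSTERERS[cluster_name](rng)

        Z = X if reducer is None else reducer.fit_transform(X)
        labels = model.fit_predict(Z)
        if len(set(labels)) < 2:
            continue  # silhouette score needs at least two clusters

        score = silhouette_score(Z, labels)
        if best is None or score > best[0]:
            best = (score, reducer_name, cluster_name, model.get_params()["n_clusters"])
    return best


# Toy data: 4 Gaussian blobs in 5 dimensions.
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)
best = random_search(X, n_evaluations=10)
print(best)
```

Note that this sketch scores each configuration in its own reduced space, so scores from different reducers are only roughly comparable. Bayesian optimization replaces the random sampling step with a model that proposes promising configurations based on past evaluations.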

Algorithms/Models supported

  • List of dimension reduction algorithms in sklearn supported by autocluster's optimizer.

  • List of clustering models in sklearn supported by autocluster's optimizer.

Examples

Examples are available in these notebooks.

Experimental results

  • This dataset comprises 16 Gaussian clusters in 128-dimensional space with N = 1024 points. The optimal configuration obtained by autocluster (SMAC + Warmstarting) consists of a Truncated SVD dimension reduction model + Birch clustering model.

  • This dataset comprises 15 Gaussian clusters in 2-dimensional space with N = 5000 points. The optimal configuration obtained by autocluster (SMAC + Warmstarting) consists of a TSNE dimension reduction model + Agglomerative clustering model.
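
To give a feel for what a Truncated SVD + Birch configuration looks like in code, here is a rough sketch in plain scikit-learn. It uses a synthetic stand-in generated with `make_blobs` (not the dataset above), and the hyperparameters (`n_components=10`, `n_clusters=16`) are assumptions for illustration, not the values found by autocluster.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 16 Gaussian clusters, 128 dimensions, 1024 points.
X, y_true = make_blobs(n_samples=1024, n_features=128, centers=16, random_state=0)

# Dimension reduction followed by clustering, chained as one pipeline.
pipeline = make_pipeline(
    TruncatedSVD(n_components=10, random_state=0),  # assumed value
    Birch(n_clusters=16),                           # assumed value
)
labels = pipeline.fit_predict(X)

# Compare predicted clusters with the generating labels (1.0 = perfect match).
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
```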

Links

  • Link to PyPI.
  • Great writeup by Martin Krasser on Bayesian Optimization

Disclaimer

The project is experimental and still under development.

Issues

AttributeError: module 'pynisher' has no attribute 'enforce_limits'

opened on 2023-01-16 05:11:36 by hammad2008

I am getting this error AttributeError: module 'pynisher' has no attribute 'enforce_limits'
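
A quick way to narrow this down is to check which pynisher version is installed and whether it still exposes `enforce_limits`. The snippet below is only a diagnostic sketch; `describe_module` is a hypothetical helper, and it assumes the distribution name and the import name are the same (true for pynisher). As far as I know, `enforce_limits` belongs to the pre-1.0 pynisher API, so pinning pynisher below 1.0 may help, but that is an assumption and not a fix confirmed in this thread.

```python
import importlib
from importlib import metadata


def describe_module(name):
    """Return (installed_version, has_enforce_limits); (None, False) if not installed."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None, False
    module = importlib.import_module(name)
    return version, hasattr(module, "enforce_limits")


print(describe_module("pynisher"))
```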

Installation problems? I am maintaining a version of this until the authors come back

opened on 2023-01-04 00:42:32 by renxida

This is the first Google result that came up when I searched for "python automl clustering", and it is frankly a really great library. However, it's not maintained and installation has broken.

See

https://github.com/renxida/autocluster

for a version that works as of Jan 23 2023.

I have also submitted pull requests in the hope that the author comes back, and will gladly close this issue if this repo gets some love.

SMAC Link Dead in Readme

opened on 2023-01-03 18:46:18 by renxida

https://automl.github.io/SMAC3/stable/installation.html

This link is dead

What does metaknowledge mean and how can I use it ?

opened on 2021-12-01 07:33:20 by haow85

What does metaknowledge mean in this repository ?

Bravo ! Hao Wang

Consistent Core Dump

opened on 2020-06-05 05:08:04 by WolVesz

Also, when attempting to run the system at all, I consistently run into a core dump:

`>>> from autocluster import AutoCluster, get_evaluator
>>> X, y = datasets.make_blobs(n_samples=1000,
...                            n_features=2,
...                            centers=6,
...                            cluster_std=0.5,
...                            shuffle=True, random_state=27)
>>> dummy_df = pd.DataFrame(X)
>>> dummy_df.head(5)
          0         1
0  7.742343 -6.603815
1  8.726121  6.433689
2 -1.427522  5.393546
3  8.801468 -5.185687
4 -1.404321  9.526536
>>> cluster = AutoCluster(logger=None)
>>> fit_params = {
...     "df": dummy_df,
...     "cluster_alg_ls": [
...         'KMeans', 'GaussianMixture', 'MiniBatchKMeans'
...     ],
...     "dim_reduction_alg_ls": [
...         'NullModel'
...     ],
...     "optimizer": 'smac',
...     "n_evaluations": 40,
...     "run_obj": 'quality',
...     "seed": 27,
...     "cutoff_time": 10,
...     "preprocess_dict": {
...         "numeric_cols": list(range(2)),
...         "categorical_cols": [],
...         "ordinal_cols": [],
...         "y_col": []
...     },
...     "evaluator": get_evaluator(evaluator_ls = ['silhouetteScore',
...                                                'daviesBouldinScore',
...                                                'calinskiHarabaszScore'],
...                                weights = [1, 1, 1],
...                                clustering_num = None,
...                                min_proportion = .01,
...                                min_relative_proportion='default'),
...     "n_folds": 3,
...     "warmstart": False,
...     "verbose_level": 1,
... }
>>> result_dict = cluster.fit(**fit_params)
/home/wolvez/.local/lib/python3.8/site-packages/sklearn/ensemble/_iforest.py:252: FutureWarning: 'behaviour' is deprecated in 0.22 and will be removed in 0.24. You should not pass or set this parameter.
  warn(
664/1000 datapoints remaining after outlier removal
Truncated n_evaluations: 40
Segmentation fault (core dumped)`

Matplotlib known build error on version 3.0.3, Upgrade Recommended

opened on 2020-06-04 22:44:05 by WolVecz

https://github.com/matplotlib/matplotlib/issues/13555

When running a base pip install I am consistently having the same issue.

` pip3 --no-cache-dir install autocluster
Looking in indexes: https://pypi.org/simple, https://1205d49dc47b4644d672f57e74f850e6342693e3f0b8cf0b:****@packagecloud.io/agrible/internal/pypi/simple
Collecting autocluster
  Downloading autocluster-0.5.2-py3-none-any.whl (35 kB)
Requirement already satisfied: six>=1.5.0 in /usr/lib/python3/dist-packages (from autocluster) (1.14.0)
Collecting matplotlib==3.0.3
  Downloading matplotlib-3.0.3.tar.gz (36.6 MB)
     |████████████████████████████████| 36.6 MB 3.1 MB/s
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ecu3m8rg/matplotlib/setup.py'"'"'; file='"'"'/tmp/pip-install-ecu3m8rg/matplotlib/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-ecu3m8rg/matplotlib/pip-egg-info
         cwd: /tmp/pip-install-ecu3m8rg/matplotlib/
    Complete output (48 lines):
    Traceback (most recent call last):
      File "", line 1, in
      File "/tmp/pip-install-ecu3m8rg/matplotlib/setup.py", line 225, in
        msg = pkg.install_help_msg()
      File "/tmp/pip-install-ecu3m8rg/matplotlib/setupext.py", line 650, in install_help_msg
        release = platform.linux_distribution()[0].lower()
    AttributeError: module 'platform' has no attribute 'linux_distribution'
    ============================================================================
    Edit setup.cfg to change the build options

BUILDING MATPLOTLIB
            matplotlib: yes [3.0.3]
                python: yes [3.8.2 (default, Apr 27 2020, 15:53:34)  [GCC
                        9.3.0]]
              platform: yes [linux]

REQUIRED DEPENDENCIES AND EXTENSIONS
                 numpy: yes [version 1.18.5]
      install_requires: yes [handled by setuptools]
                libagg: yes [pkg-config information for 'libagg' could not
                        be found. Using local copy.]
              freetype: no  [The C/C++ header for freetype2 (ft2build.h)
                        could not be found.  You may need to install the
                        development package.]
                   png: no  [pkg-config information for 'libpng' could not
                        be found.]
                 qhull: yes [pkg-config information for 'libqhull' could not
                        be found. Using local copy.]

OPTIONAL SUBPACKAGES
           sample_data: yes [installing]
              toolkits: yes [installing]
                 tests: no  [skipping due to configuration]
        toolkits_tests: no  [skipping due to configuration]

OPTIONAL BACKEND EXTENSIONS
                   agg: yes [installing]
                 tkagg: yes [installing; run-time loading from Python Tcl /
                        Tk]
                macosx: no  [Mac OS-X only]
             windowing: no  [Microsoft Windows only]

OPTIONAL PACKAGE DATA
                  dlls: no  [skipping due to configuration]

============================================================================
                        * The following required packages can not be built:
                        * freetype, png
----------------------------------------

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.`
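
The `AttributeError` in the traceback above comes from `platform.linux_distribution()`, which was removed from the standard library in Python 3.8; the log shows the build running on Python 3.8.2. A small check like the following (the helper name `linux_distribution_available` is made up for this example) shows whether a given interpreter still has that function, which is what the matplotlib 3.0.3 setup script relies on when it tries to print its install help message.

```python
import platform
import sys


def linux_distribution_available():
    """True if this interpreter still provides platform.linux_distribution."""
    return hasattr(platform, "linux_distribution")


print("Python", ".".join(map(str, sys.version_info[:3])))
print("platform.linux_distribution available:", linux_distribution_available())
```

The underlying build failure is still the missing freetype and png development packages listed in the output; the removed function only breaks the step that would have printed the install hint.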

Wen Yan

Machine Learning Engineer | Shopee SG | HKUST | Times Series Analysis, Deep Learning & Statistics

GitHub Repository

hyperparameter-optimization bayesian-optimization automl clustering