AutoML for clustering models in sklearn.

wywongbd, updated 🕥 2023-01-09 05:53:43


autocluster is an automated machine learning (AutoML) toolkit for performing clustering tasks.

Report and presentation slides can be found here and here.


  • Python 3.5 or above
  • Linux, or Windows via WSL

How to get started?

  1. First, install SMAC:

    • sudo apt-get install build-essential swig
    • conda install gxx_linux-64 gcc_linux-64 swig
    • pip install smac==0.8.0
  2. Then install autocluster:

    • pip install autocluster
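
After running the install steps, it can help to confirm which versions actually ended up in the environment, since the steps pin smac to 0.8.0. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or newer, which is stricter than the Python 3.5 minimum stated above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the install steps above.
for name in ("smac", "autocluster"):
    print(name, installed_version(name))
```

A result of None for either package suggests the corresponding install step did not complete.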

How does it work?

  • autocluster automatically optimizes the configuration of a clustering problem. By configuration, we mean:

    • choice of dimension reduction algorithm
    • choice of clustering model
    • setting of dimension reduction algorithm's hyperparameters
    • setting of clustering model's hyperparameters
  • autocluster provides 3 different approaches to optimize the configuration (with increasing complexity):

    • random optimization
    • Bayesian optimization
    • Bayesian optimization + meta-learning (warmstarting)
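
To make the idea of a configuration concrete, here is a minimal sketch of the simplest of the three approaches, random optimization, written with the standard library only. It is not autocluster's implementation: the search space, the hyperparameter ranges, and the names `SEARCH_SPACE`, `sample_configuration`, `random_search`, and `toy_objective` are all assumptions made for illustration. In real use the objective would fit the chosen models and return a clustering quality cost.

```python
import random

# Hypothetical search space. The algorithm names mirror sklearn classes, but
# the hyperparameter ranges are illustrative assumptions only.
SEARCH_SPACE = {
    "dim_reduction": {
        "NullModel": {},
        "PCA": {"n_components": [2, 3, 5]},
        "TruncatedSVD": {"n_components": [2, 3, 5]},
    },
    "clustering": {
        "KMeans": {"n_clusters": [2, 4, 8, 16]},
        "Birch": {"n_clusters": [2, 4, 8, 16], "threshold": [0.1, 0.5, 1.0]},
    },
}


def sample_configuration(rng):
    """Pick one algorithm per stage, then one value per hyperparameter."""
    config = {}
    for stage, algorithms in SEARCH_SPACE.items():
        name = rng.choice(sorted(algorithms))
        params = {key: rng.choice(values) for key, values in algorithms[name].items()}
        config[stage] = (name, params)
    return config


def random_search(objective, n_evaluations, seed=0):
    """Evaluate randomly sampled configurations and keep the lowest-cost one."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(n_evaluations):
        config = sample_configuration(rng)
        cost = objective(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_objective(config):
    """Stand-in cost: how far n_clusters is from an assumed true value of 8."""
    _, params = config["clustering"]
    return abs(params["n_clusters"] - 8)


best_config, best_cost = random_search(toy_objective, n_evaluations=50, seed=27)
print(best_config, best_cost)
```

Bayesian optimization differs in how the next configuration is chosen: rather than sampling blindly, it uses the costs observed so far to propose promising candidates, and warmstarting seeds that process with configurations that worked on similar datasets.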

Algorithms/Models supported

  • List of dimension reduction algorithms in sklearn supported by autocluster's optimizer.

  • List of clustering models in sklearn supported by autocluster's optimizer.


Examples are available in these notebooks.

Experimental results

  • This dataset comprises 16 Gaussian clusters in 128-dimensional space with N = 1024 points. The optimal configuration obtained by autocluster (SMAC + Warmstarting) consists of a Truncated SVD dimension reduction model + Birch clustering model.

  • This dataset comprises 15 Gaussian clusters in 2-dimensional space with N = 5000 points. The optimal configuration obtained by autocluster (SMAC + Warmstarting) consists of a TSNE dimension reduction model + Agglomerative clustering model.
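
For readers who want to see what the first configuration looks like in plain sklearn, the sketch below builds a Truncated SVD + Birch pipeline by hand. It is not a reproduction of the experiment: the data is a synthetic stand-in generated with `make_blobs` to match the stated shape, and the hyperparameters (`n_components=8`, `n_clusters=16`) are assumptions, not the values found by the optimizer.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 16 Gaussian clusters in 128 dimensions, N = 1024 points.
X, _ = make_blobs(n_samples=1024, n_features=128, centers=16, random_state=0)

# Assumed hyperparameters, chosen for illustration only.
reducer = TruncatedSVD(n_components=8, random_state=0)
X_reduced = reducer.fit_transform(X)

model = Birch(n_clusters=16)
labels = model.fit_predict(X_reduced)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
score = silhouette_score(X_reduced, labels)
print(f"clusters found: {len(set(labels))}, silhouette score: {score:.3f}")
```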


  • Link to PyPI.
  • Great writeup by Martin Krasser on Bayesian Optimization


The project is experimental and still under development.


AttributeError: module 'pynisher' has no attribute 'enforce_limits'

opened on 2023-01-16 05:11:36 by hammad2008

I am getting this error AttributeError: module 'pynisher' has no attribute 'enforce_limits'

Installation problems? I am maintaining a version of this until the authors come back

opened on 2023-01-04 00:42:32 by renxida

This is the first google result that comes up when I searched for "python automl clustering" and is frankly a really great library. However, it's not maintained and installation has broken.


for a version that works as of Jan 23 2023.

I have also submitted pull requests in the hope that the author comes back, and will gladly close this issue if this repo gets some love.

SMAC Link Dead in Readme

opened on 2023-01-03 18:46:18 by renxida

This link is dead

What does metaknowledge mean and how can I use it ?

opened on 2021-12-01 07:33:20 by haow85

What does metaknowledge mean in this repository ?

Bravo ! Hao Wang

Consistent Core Dump

opened on 2020-06-05 05:08:04 by WolVesz

Also, when attempting to run the system at all, I am consistently running into a core dump issue:

`>>> from autocluster import AutoCluster, get_evaluator
>>> X, y = datasets.make_blobs(n_samples=1000,
...                            n_features=2,
...                            centers=6,
...                            cluster_std=0.5,
...                            shuffle=True, random_state=27)
>>> dummy_df = pd.DataFrame(X)
>>> dummy_df.head(5)
          0         1
0  7.742343 -6.603815
1  8.726121  6.433689
2 -1.427522  5.393546
3  8.801468 -5.185687
4 -1.404321  9.526536
>>> cluster = AutoCluster(logger=None)
>>> fit_params = {
...     "df": dummy_df,
...     "cluster_alg_ls": [
...         'KMeans', 'GaussianMixture', 'MiniBatchKMeans'
...     ],
...     "dim_reduction_alg_ls": [
...         'NullModel'
...     ],
...     "optimizer": 'smac',
...     "n_evaluations": 40,
...     "run_obj": 'quality',
...     "seed": 27,
...     "cutoff_time": 10,
...     "preprocess_dict": {
...         "numeric_cols": list(range(2)),
...         "categorical_cols": [],
...         "ordinal_cols": [],
...         "y_col": []
...     },
...     "evaluator": get_evaluator(evaluator_ls = ['silhouetteScore',
...                                                'daviesBouldinScore',
...                                                'calinskiHarabaszScore'],
...                                weights = [1, 1, 1],
...                                clustering_num = None,
...                                min_proportion = .01,
...                                min_relative_proportion='default'),
...     "n_folds": 3,
...     "warmstart": False,
...     "verbose_level": 1,
... }
>>> result_dict = cluster.fit(**fit_params)
/home/wolvez/.local/lib/python3.8/site-packages/sklearn/ensemble/ FutureWarning: 'behaviour' is deprecated in 0.22 and will be removed in 0.24. You should not pass or set this parameter.
  warn(
664/1000 datapoints remaining after outlier removal
Truncated n_evaluations: 40
Segmentation fault (core dumped)`

Matplotlib known build error on version 3.0.3, Upgrade Recommended

opened on 2020-06-04 22:44:05 by WolVecz

When running a base pip install I am consistently having the same issue.

`pip3 --no-cache-dir install autocluster
Looking in indexes:, https://1205d49dc47b4644d672f57e74f850e6342693e3f0b8cf0b:****
Collecting autocluster
  Downloading autocluster-0.5.2-py3-none-any.whl (35 kB)
Requirement already satisfied: six>=1.5.0 in /usr/lib/python3/dist-packages (from autocluster) (1.14.0)
Collecting matplotlib==3.0.3
  Downloading matplotlib-3.0.3.tar.gz (36.6 MB)
     |████████████████████████████████| 36.6 MB 3.1 MB/s
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ecu3m8rg/matplotlib/'"'"'; file='"'"'/tmp/pip-install-ecu3m8rg/matplotlib/'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);'"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-ecu3m8rg/matplotlib/pip-egg-info
         cwd: /tmp/pip-install-ecu3m8rg/matplotlib/
    Complete output (48 lines):
    Traceback (most recent call last):
      File "", line 1, in
      File "/tmp/pip-install-ecu3m8rg/matplotlib/", line 225, in
        msg = pkg.install_help_msg()
      File "/tmp/pip-install-ecu3m8rg/matplotlib/", line 650, in install_help_msg
        release = platform.linux_distribution()[0].lower()
    AttributeError: module 'platform' has no attribute 'linux_distribution'
    ============================================================================
    Edit setup.cfg to change the build options

            matplotlib: yes [3.0.3]
                python: yes [3.8.2 (default, Apr 27 2020, 15:53:34)  [GCC
              platform: yes [linux]

                 numpy: yes [version 1.18.5]
      install_requires: yes [handled by setuptools]
                libagg: yes [pkg-config information for 'libagg' could not
                        be found. Using local copy.]
              freetype: no  [The C/C++ header for freetype2 (ft2build.h)
                        could not be found.  You may need to install the
                        development package.]
                   png: no  [pkg-config information for 'libpng' could not
                        be found.]
                 qhull: yes [pkg-config information for 'libqhull' could not
                        be found. Using local copy.]

           sample_data: yes [installing]
              toolkits: yes [installing]
                 tests: no  [skipping due to configuration]
        toolkits_tests: no  [skipping due to configuration]

                   agg: yes [installing]
                 tkagg: yes [installing; run-time loading from Python Tcl /
                macosx: no  [Mac OS-X only]
             windowing: no  [Microsoft Windows only]

                  dlls: no  [skipping due to configuration]

                        * The following required packages can not be built:
                        * freetype, png

ERROR: Command errored out with exit status 1: python egg_info Check the logs for full command output.`

Wen Yan

Machine Learning Engineer | Shopee SG | HKUST | Times Series Analysis, Deep Learning & Statistics


hyperparameter-optimization bayesian-optimization automl clustering