HMM and DTW-based sequence machine learning algorithms in Python following an sklearn-like interface.

eonu, updated 2023-02-01 08:37:39


Sequentia

About · Build Status · Features · Documentation · Examples · Acknowledgments · References · Contributors · Licensing

About

Sequentia is a Python package that provides various classification and regression algorithms for sequential data, including methods based on hidden Markov models and dynamic time warping.

Some examples of how Sequentia can be used on sequence data include:

  • determining a spoken word based on its audio signal or alternative representations such as MFCCs,
  • predicting motion intent for gesture control from sEMG signals,
  • classifying hand-written characters according to their pen-tip trajectories.

Build Status

| master | dev |
| -------- | ------|
| CircleCI Build (Master) | CircleCI Build (Development) |

Features

Models

All of the following models provided by Sequentia support variable-length sequences.

Dynamic Time Warping + k-Nearest Neighbors (via dtaidistance)

  • [x] Classification
  • [x] Regression
  • [x] Multivariate real-valued observations
  • [x] Sakoe–Chiba band global warping constraint
  • [x] Dependent and independent feature warping (DTWD/DTWI)
  • [x] Custom distance-weighted predictions
  • [x] Multi-processed predictions
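
To make the terms in the list above concrete, here is a minimal NumPy sketch of exact DTW with a Sakoe–Chiba band, the dependent/independent (DTWD/DTWI) variants, and a distance-weighted k-NN vote. It is an illustration only and not how Sequentia or dtaidistance implement these features: the function names, the `window` and `weighting` parameters, and the choice of a squared Euclidean local cost are all assumptions made for this example.

```python
import numpy as np


def dtw_distance(a, b, window=None):
    """Exact DTW between two sequences of shape (length,) or (length, n_features).

    `window` is a hypothetical Sakoe-Chiba half-width in time steps:
    cells with |i - j| > window are excluded from the warping path.
    With multivariate input, all features share one warping path (DTWD).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a.reshape(len(a), -1)
    b = b.reshape(len(b), -1)
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)  # no constraint
    # The band must span the length difference, or the end cell is unreachable.
    window = max(window, abs(n - m))

    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = np.sum((a[i - 1] - b[j - 1]) ** 2)  # squared Euclidean cost
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def dtw_independent(a, b, window=None):
    """DTWI: warp each feature separately and sum the per-feature distances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return sum(dtw_distance(a[:, k], b[:, k], window) for k in range(a.shape[1]))


def knn_predict(query, train_seqs, train_labels, k=1, window=None, weighting=None):
    """Classify `query` by a (optionally distance-weighted) vote of its k nearest sequences."""
    dists = np.array([dtw_distance(query, seq, window) for seq in train_seqs])
    nearest = np.argsort(dists)[:k]
    # Uniform votes unless a weighting callable (distance -> weight) is given.
    weights = np.ones(len(nearest)) if weighting is None else weighting(dists[nearest])
    scores = {}
    for idx, w in zip(nearest, weights):
        label = train_labels[idx]
        scores[label] = scores.get(label, 0.0) + float(w)
    return max(scores, key=scores.get)
```

For example, `knn_predict(query, train_seqs, train_labels, k=3, weighting=lambda d: np.exp(-d))` lets closer neighbours count for more. The double loop makes the quadratic cost of exact DTW explicit; a narrower band shrinks the inner loop at the price of forbidding large time shifts.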

Hidden Markov Models (via hmmlearn)

Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm [1]

  • [x] Classification
  • [x] Multivariate real-valued observations (Gaussian mixture model emissions)
  • [x] Univariate categorical observations (discrete emissions)
  • [x] Linear, left-right and ergodic topologies
  • [x] Multi-processed predictions
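
As a rough illustration of the ideas listed above, the following NumPy sketch builds transition matrices for the three topologies, scores a categorical observation sequence with the scaled forward algorithm described in [1], and classifies by picking the class whose HMM gives the highest score. It is not Sequentia's or hmmlearn's implementation; every function name and the `(start, trans, emit)` model layout are hypothetical, and the topology definitions (ergodic: all transitions allowed; left-right: no backward transitions; linear: self or next state only) are the conventional ones assumed here.

```python
import numpy as np


def topology_mask(n_states, kind):
    """Return a 0/1 matrix marking which state transitions a topology allows."""
    full = np.ones((n_states, n_states))
    if kind == "ergodic":
        return full  # every transition allowed
    if kind == "left-right":
        return np.triu(full)  # stay, or move to any later state
    if kind == "linear":
        return np.triu(full) - np.triu(full, k=2)  # stay, or move to the next state
    raise ValueError(f"unknown topology: {kind}")


def uniform_transitions(n_states, kind):
    """Row-normalise the mask into a valid initial transition matrix."""
    mask = topology_mask(n_states, kind)
    return mask / mask.sum(axis=1, keepdims=True)


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of integer symbols `obs` under a categorical-emission HMM.

    start: (n_states,), trans: (n_states, n_states), emit: (n_states, n_symbols).
    Rescaling alpha at every step avoids underflow on long sequences.
    """
    alpha = start * emit[:, obs[0]]
    log_lik = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf  # sequence is impossible under this model
        log_lik += np.log(scale)
        alpha = ((alpha / scale) @ trans) * emit[:, obs[t]]
    total = alpha.sum()
    return log_lik + np.log(total) if total > 0.0 else -np.inf


def classify(obs, models, log_priors=None):
    """Pick the class whose HMM (start, trans, emit) best explains `obs`."""
    scores = {}
    for label, (start, trans, emit) in models.items():
        prior = 0.0 if log_priors is None else log_priors[label]
        scores[label] = forward_log_likelihood(obs, start, trans, emit) + prior
    return max(scores, key=scores.get)
```

In practice the parameters would be estimated per class with Baum-Welch rather than set by hand; a constrained topology is typically kept during training because transitions initialised to zero stay at zero.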

Scikit-Learn compatibility

Sequentia aims to follow the Scikit-Learn interface for estimators and transformations, as well as to be largely compatible with three core Scikit-Learn modules to improve the ease of model development: preprocessing, model_selection and pipeline.

Scikit-Learn has many other modules, but maintaining full compatibility with it is challenging and many of its features are inapplicable to sequential data, so we focus only on the relevant core modules.

Despite some deviation from the Scikit-Learn interface in order to accommodate sequences, the following features are currently compatible with Sequentia.

Installation

You can install Sequentia using pip.

Stable

The latest stable version of Sequentia can be installed with the following command.

```console
pip install sequentia
```

C library compilation

For optimal performance when using any of the k-NN based models, it is important that the dtaidistance C libraries are compiled correctly.

Please see the dtaidistance installation guide for troubleshooting if you run into C compilation issues, or if setting use_c=True on k-NN based models results in a warning.

You can use the following to check if the appropriate C libraries have been installed.

```python
from dtaidistance import dtw
dtw.try_import_c()
```

Pre-release

Pre-release versions include new features which are in active development and may change unpredictably.

The latest pre-release version can be installed with the following command.

```console
pip install --pre sequentia
```

Development

Please see the contribution guidelines for installation instructions for contributing to Sequentia.

Documentation

Documentation for the package is available on Read The Docs.

Examples

Demonstration of classifying multivariate sequences with two features into two classes using the KNNClassifier.

This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn.

```python
import numpy as np

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

from sequentia.models import KNNClassifier
from sequentia.pipeline import Pipeline
from sequentia.preprocessing import IndependentFunctionTransformer, mean_filter

# Create input data
# - Sequentia expects sequences to be concatenated into a single array
# - Sequence lengths are provided separately and used to decode the sequences when needed
# - This avoids the need for complex structures such as lists of arrays with different lengths

# Sequences
X = np.array([
    # Sequence 1 - Length 3
    [1.2 , 7.91],
    [1.34, 6.6 ],
    [0.92, 8.08],
    # Sequence 2 - Length 5
    [2.11, 6.97],
    [1.83, 7.06],
    [1.54, 5.98],
    [0.86, 6.37],
    [1.21, 5.8 ],
    # Sequence 3 - Length 2
    [1.7 , 6.22],
    [2.01, 5.49]
])

# Sequence lengths
lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])

# Create a transformation pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a mean filter for each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing by the s.d. for each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
# 4. Pass the resulting transformed data into a KNNClassifier
pipeline = Pipeline([
    ('denoise', IndependentFunctionTransformer(mean_filter)),
    ('scale', IndependentFunctionTransformer(scale)),
    ('pca', PCA(n_components=1)),
    ('knn', KNNClassifier(k=1))
])

# Fit the pipeline to the data - lengths must be provided
pipeline.fit(X, y, lengths)

# Predict classes for the sequences and calculate accuracy - lengths must be provided
y_pred = pipeline.predict(X, lengths)
acc = pipeline.score(X, y, lengths)
```
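
The concatenated-array-plus-lengths layout used in this example can be decoded with plain NumPy. The sketch below shows one way to recover the individual sequences and to apply a function to each sequence on its own, which is conceptually what the "individually standardize each sequence" step describes. The helpers `apply_independently` and `standardize` are hypothetical names for this illustration and are not part of Sequentia's API.

```python
import numpy as np

# Same data layout as the example: three sequences stacked into one array.
X = np.array([
    [1.2, 7.91], [1.34, 6.6], [0.92, 8.08],                              # length 3
    [2.11, 6.97], [1.83, 7.06], [1.54, 5.98], [0.86, 6.37], [1.21, 5.8], # length 5
    [1.7, 6.22], [2.01, 5.49],                                           # length 2
])
lengths = np.array([3, 5, 2])

# The cumulative lengths (minus the last) are the row indices to split at.
sequences = np.split(X, np.cumsum(lengths)[:-1])


def apply_independently(func, X, lengths):
    """Apply `func` to every sequence separately, then re-concatenate."""
    parts = np.split(X, np.cumsum(lengths)[:-1])
    return np.vstack([func(seq) for seq in parts])


def standardize(seq):
    """Zero mean and unit standard deviation per feature, within one sequence."""
    return (seq - seq.mean(axis=0)) / seq.std(axis=0)


X_scaled = apply_independently(standardize, X, lengths)
```

Because each sequence is transformed with its own statistics, the output keeps the same shape as `X` and the same `lengths` array remains valid for downstream steps.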

Acknowledgments

In earlier versions of the package, the approximate DTW implementation fastdtw was used in the hope of speeding up k-NN predictions, as the authors of the original FastDTW paper [2] claim that approximate DTW alignments can be computed in linear time and memory, compared to the O(N²) runtime complexity of the usual exact DTW implementation.

I was contacted by Prof. Eamonn Keogh, whose work shows, surprisingly, that FastDTW is generally slower than the exact DTW algorithm that it approximates [3]. Upon switching from the fastdtw package to dtaidistance (a very solid implementation of exact DTW with fast, compiled pure C functions), DTW k-NN prediction times were indeed reduced drastically.

I would like to thank Prof. Eamonn Keogh for directly reaching out to me regarding this finding.

References

[1] Lawrence R. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77.2 (1989), 257–286.
[2] Stan Salvador & Philip Chan. "FastDTW: Toward accurate dynamic time warping in linear time and space." Intelligent Data Analysis 11.5 (2007), 561–580.
[3] Renjie Wu & Eamonn J. Keogh. "FastDTW is approximate and Generally Slower than the Algorithm it Approximates." IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.

Contributors

All contributions to this repository are greatly appreciated. Contribution guidelines can be found here.

  • eonu
  • Prhmma
  • manisci
  • jonnor

Licensing

Sequentia is released under the MIT license.

Certain parts of the source code are heavily adapted from Scikit-Learn. Such files contain a copy of the Scikit-Learn license.


Sequentia © 2019-2023, Edwin Onuonga - Released under the MIT license.
Authored and maintained by Edwin Onuonga.

Releases

v1.1.1 2023-02-01 08:03:28

Major changes

  • Remove scikit-learn validation constraints from IndependentFunctionTransformer. (#237)

Minor changes

  • Change default mean_filter/median_filter width to 5. (#238)
  • Update repository documentation. (#239)

v1.1.0 2023-01-18 01:13:32

Major changes

  • Set max_nbytes=None to fix read-only buffer source array error in joblib.Parallel (see https://github.com/scikit-learn/scikit-learn/issues/7981). (#235)
  • Added sequentia.preprocessing module with sklearn.preprocessing compatibility. (#234)
  • Added sequentia.pipeline module for sklearn.pipeline compatibility. (#234)

Minor changes

  • Upgrade sklearn version specifier from >=0.22 to >=1.0. (#234)
  • Upgrade development status classifier to stable. (#233)

v1.0.0 2022-12-27 17:30:29

Major changes

  • Fix CategoricalHMM and GaussianMixtureHMM parameter defaults for params/init_params being modified. (#231)
  • Fix CategoricalHMM and GaussianMixtureHMM unfreeze() calling super().freeze() instead of super().unfreeze(). (#231)
  • Fix serialization/deserialization for _KNNMixin when weighting=None. (#231)
  • Add unit tests. (#231)

Minor changes

  • Change load_digits numbers parameter name to digits. (#231)
  • Change SequentialDataset properties to not return copies of arrays. (#231)
  • Remove SequentialDataset.__eq__. (#231)
  • Change HMMClassifier prior default to None. (#231)

v1.0.0a2 2022-12-06 22:15:30

Minor changes

  • Fix broken link on README.md. (#229)

v1.0.0a1 2022-12-06 21:55:16

Major changes

  • Rework interface to follow sklearn-like patterns. (#226)
  • Remove preprocessing module (temporarily until design is finalized). (#226)
  • Add KNN regression. (#226)
  • Add HMM classifier with categorical emissions. (#226)
  • Use Pydantic for better validation. (#226)
  • Add datasets module for sample datasets. (#226)
  • Split KNN logic across more functions. (#226)
  • Better multi-processing for KNN. (#226)
  • Documentation rework + switch Sphinx documentation theme. (#226)
  • Fix Sakoe-Chiba width calculation. (#226)

v0.13.1 2022-06-26 20:18:58

Major changes

  • Add datasets.load_random_sequences for generating an arbitrarily sized dataset of sequences. (#216)
  • Remove DeepGRU and classifier.rnn module. (#215)
  • Add sequentia.datasets module. (#214)
  • Added return_scores argument to KNNClassifier.predict() to return class scores. (#213)
  • Return self in fit() functions. (#213)
  • Update to hmmlearn v0.2.7. (#201)
  • Update HMMClassifier structure to match KNNClassifier. (#200)
  • Remove 'uniform' KNNClassifier weighting option. (#192)
  • Fix major KNNClassifier label scoring bug - thanks @manisci. (#187)

Minor changes

  • Add digits.npz as package data in setup.py. (#221)
  • Update CONTRIBUTING.md CI instructions. (#219)
  • Switch from TravisCI to CircleCI. (#218)
  • Update HMM tests to use datasets module. (#217)
  • Add tslearn as a core dependency. (#216)
  • Remove torchaudio, torchvision and torchfsdd dependencies. (#214)
  • Add playable audio to notebooks via play_audio helper. (#214)
  • Update README.md and documentation. (#202)
  • Add Jinja2 dependency for RTD. (#188)

classification-algorithms machine-learning python time-series time-series-classification multivariate-timeseries dynamic-time-warping hidden-markov-models k-nearest-neighbor-classifier sequential-patterns sequence-classification dtw knn hmm variable-length