A high-performance implementation of Wilkinson formulas for Python.

matthewwardrop, updated 2023-03-27 23:07:56

Formulaic


Formulaic is a high-performance implementation of Wilkinson formulas for Python.

Note: This project, while largely complete, is still a work in progress, and the API is subject to change between major versions (0.<major>.<minor>).

  • Documentation: https://matthewwardrop.github.io/formulaic
  • Source Code: https://github.com/matthewwardrop/formulaic
  • Issue tracker: https://github.com/matthewwardrop/formulaic/issues

It provides:

  • high-performance dataframe to model-matrix conversions.
  • support for reusing the encoding choices made during conversion of one data-set on other datasets.
  • extensible formula parsing.
  • extensible data input/output plugins, with implementations for:
    • input:
      • pandas.DataFrame
      • pyarrow.Table
    • output:
      • pandas.DataFrame
      • numpy.ndarray
      • scipy.sparse.csc_matrix
  • support for symbolic differentiation of formulas (and hence model matrices).

Example code

```python
import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)
```

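y =

|   | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |

X =

|   | Intercept | x[T.B] | x[T.C] | z   |
|---|-----------|--------|--------|-----|
| 0 | 1.0       | 0      | 0      | 0.3 |
| 1 | 1.0       | 1      | 0      | 0.1 |
| 2 | 1.0       | 0      | 1      | 0.2 |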
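
To make the columns of X concrete, here is a minimal pure-Python sketch of treatment (dummy) coding, the scheme behind column names such as x[T.B]. It only illustrates the idea and is not Formulaic's implementation; the helper name `treatment_code` is hypothetical.

```python
def treatment_code(name, values, levels=None):
    """Dummy-code `values`, using the first level as the reference.

    Hypothetical helper for illustration only; not part of Formulaic.
    """
    if levels is None:
        levels = sorted(set(values))
    # The reference level (levels[0]) gets no column of its own.
    columns = [f"{name}[T.{level}]" for level in levels[1:]]
    rows = [[int(value == level) for level in levels[1:]] for value in values]
    return columns, rows


x = ["A", "B", "C"]
z = [0.3, 0.1, 0.2]

dummy_names, dummy_rows = treatment_code("x", x)
column_names = ["Intercept", *dummy_names, "z"]
# Each row is: intercept, one indicator per non-reference level, then z.
X = [[1.0, *dummies, z_i] for dummies, z_i in zip(dummy_rows, z)]
```

Level A is absorbed into the intercept, which is why only B and C get indicator columns.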

Benchmarks

Formulaic typically outperforms R for both dense and sparse model matrices, and vastly outperforms patsy (the existing implementation for Python) for dense matrices (patsy does not support sparse model matrix output).


For more details, see here.

Related projects and prior art

  • Patsy: a prior implementation of Wilkinson formulas for Python, which is widely used (e.g. in statsmodels). It has fantastic documentation (which helped bootstrap this project), and a rich array of features.
  • StatsModels.jl @formula: The implementation of Wilkinson formulas for Julia.
  • R Formulas: The implementation of Wilkinson formulas for R, which is thoroughly introduced here. [R itself is an implementation of S, in which formulas were first made popular].
  • The work that started it all: Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistical Society 22, pp. 392–399, 1973.

Issues

How can the encoding choices for one dataset be reused for another?

opened on 2023-01-10 10:42:45 by JDawson-Camlin

The documentation's introduction says Formulaic provides "support for reusing the encoding choices made during conversion of one data-set on other datasets." How?
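
For readers wondering what "reusing encoding choices" means in practice, the following pure-Python sketch may help. It is not Formulaic code: the `TreatmentEncoder` class is hypothetical and only shows the general idea of remembering the categorical levels seen in one dataset and applying them to another. The release notes further down suggest that the ModelSpec object (with its .get_model_matrix() helper) plays this role in Formulaic itself.

```python
class TreatmentEncoder:
    """Remembers the levels seen during fit so later data is coded identically.

    Hypothetical class for illustration only; not Formulaic's API.
    """

    def fit(self, values):
        self.levels = sorted(set(values))
        return self

    def transform(self, values):
        unknown = set(values) - set(self.levels)
        if unknown:
            raise ValueError(f"Unseen levels: {sorted(unknown)}")
        # The first level is the reference and gets no column.
        return [[int(v == level) for level in self.levels[1:]] for v in values]


encoder = TreatmentEncoder().fit(["A", "B", "C"])

# The second dataset lacks "B", yet it still gets both indicator columns.
reused = encoder.transform(["C", "A"])
# Re-deriving the levels from the new data alone would yield a single column.
refit = TreatmentEncoder().fit(["C", "A"]).transform(["C", "A"])
```

Here `reused` keeps the two-column layout learned from the first dataset, whereas `refit` silently changes the shape of the matrix, which is the failure mode that stored encoding choices are meant to prevent.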

Terms not being evaluated in get_model_matrix()

opened on 2022-12-15 21:40:23 by cmoroney

I am creating factors manually and setting the eval_method for categorical factors as lookup and for numerical factors as python as I am using np.isnan and np.power for transformations. I then create term lists and feed them as the first argument in formulaic.Formula().

After creating the formula, the columns using np transformations show 0 for every observation. If I use a string formula instead of a list of Terms, there doesn't seem to be any issue in evaluating the expression.

DOC: Explicitly mention support for multiple variables on the left hand side

opened on 2022-11-26 10:07:33 by s3alfisc

Hi Matthew,

Yet another question: Are multiple variables on the left hand side supported? It seems like it, but I wanted to be sure before I build on it:

```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
Y1 = np.random.normal(0, 1, N)
Y2 = np.random.normal(0, 1, N)

data = {'Y1':Y1, 'Y2':Y2, 'X':X}
data = pd.DataFrame(data)

fml = 'Y1 + Y2 ~ X'
y, X = model_matrix(fml, data, na_action = 'ignore')

y.head()
#          Y1        Y2
# 0 -0.949247  2.510162
# 1 -0.695311 -0.727821
# 2 -0.072459 -1.172503
# 3  0.422692 -1.223921
# 4  0.224729  0.585468
```

Best, Alex

drop both columns in dependent variable and design matrix when missings occur

opened on 2022-11-26 09:45:05 by s3alfisc

Hi Matthew - thanks for making this super useful package available!

Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, model_matrix() drops the affected row from X but not from Y.

Here's a quick example:

```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
data = {'Y':Y, 'X':X}
data = pd.DataFrame(data)

data.X.iloc[0] = None

fml = 'Y ~ X'
y, X = model_matrix(fml, data, na_action = 'ignore')

y.head()
#           Y
# 0 -0.174508
# 1  0.373280
# 2  1.631371
# 3 -0.622598
# 4 -0.482028
X.head()
#    Intercept         X
# 1        1.0  2.652463
# 2        1.0 -1.356067
# 3        1.0  1.143417
# 4        1.0 -1.020435
# 5        1.0  0.072263

y, X = model_matrix(fml, data)

y.shape
# (10, 1)
X.shape
# (9, 2)
```

The row containing the NaN is dropped from X, but not from Y.

I think it would be nice to add functionality for this (though it might already exist?).

E.g. in R, before calling stats::model.matrix(), one would define a stats::model.frame(), which would then by default drop the entire row where a missing value exists (for both X and Y).

```
N <- 10
X <- rnorm(N)
Y <- rnorm(N)
X[1] <- NA

data <- data.frame(Y = Y, X = X)
mf <- model.frame(Y ~ X, data)
#             Y           X
# 2   0.9418535  0.05795054
# 3  -1.2333905 -1.02186716
# 4  -0.1277604  1.59699265
# 5  -0.1258892 -1.16908339
# 6   0.2176256  0.22375018
# 7  -1.2068559  0.92400472
# 8  -0.5803319  0.55442642
# 9  -1.3511992 -0.34372283
# 10 -2.0518279 -0.31997878

mm <- model.matrix(mf)
depvar <- model.response(mf)
```
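
The model.frame behaviour described above (dropping a row from every column as soon as any column has a missing value) can be sketched in plain Python as follows. The helper `drop_incomplete_rows` is hypothetical and not part of Formulaic; it only illustrates the row-wise deletion being requested.

```python
import math


def drop_incomplete_rows(columns):
    """Drop every row that has a missing value in any column.

    `columns` maps column names to equal-length lists; None and NaN count
    as missing. Hypothetical helper for illustration only.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_rows = len(next(iter(columns.values())))
    # Indices of rows that are complete across all columns.
    keep = [
        i for i in range(n_rows)
        if not any(is_missing(col[i]) for col in columns.values())
    ]
    return {name: [col[i] for i in keep] for name, col in columns.items()}, keep


data = {"Y": [0.5, 1.5, 2.5], "X": [None, 1.0, float("nan")]}
complete, kept_index = drop_incomplete_rows(data)
# Y and X now contain the same rows, so derived matrices stay aligned.
```

Computing the set of complete rows once, before any matrix is built, is what keeps the response and the design matrix the same length.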

Draft: Add support for the patsy `Q` transform.

opened on 2022-10-09 22:05:24 by matthewwardrop

Motivated by #115, this is a quick draft demonstrating a Q transform that replicates the behavior of the patsy Q function.

Notes:

  • This syntax is not as nice as ...., but may be valuable when users are migrating over to formulaic from patsy.
  • Perhaps this should be gated by a patsy compatibility flag.
  • Perhaps _data and _context should be passed in separately, to avoid name collisions / weird data types.
  • Perhaps it isn't worth adding at all? And just have people migrate over to ... syntax, or furnish formulaic with their own Q implementation (only merge the context bits).

Add a patsy to formulaic formula converter

opened on 2022-10-07 19:46:40 by rchui

The formula grammars of patsy and formulaic are close but not 1:1. It would be great if terms like Q('...') could automatically be converted into `...`, etc. This would help ease the adoption curve when migrating from patsy to formulaic with existing formulae.
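
As a rough idea of what such a converter might look like for the Q('...') case alone, here is a naive regex-based sketch. The function name `patsy_q_to_backticks` is hypothetical, and the approach deliberately ignores escaped quotes, nested calls, and names that themselves contain backticks; a robust converter would need to work on parsed tokens instead.

```python
import re

# Matches Q('name') or Q("name"); the backreference \1 requires matching quotes,
# and \b avoids rewriting longer function names that merely end in "Q".
_Q_CALL = re.compile(r"""\bQ\(\s*(['"])(.+?)\1\s*\)""")


def patsy_q_to_backticks(formula):
    """Rewrite patsy-style Q('...') quoting as backtick quoting.

    Naive illustration only; not part of patsy or Formulaic.
    """
    return _Q_CALL.sub(lambda match: f"`{match.group(2)}`", formula)


converted = patsy_q_to_backticks("y ~ Q('my column') + Q(\"weight.kg\") + x")
```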

Releases

2022-09-18 03:22:26

This is a minor patch release that fixes one bug.

Bugfixes and cleanups:

  • Fixed alignment between the length of a Structured instance and iteration over this instance (including Formula instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.
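
The mismatch described in this fix can be pictured with a toy container. `MiniStructured` below is a hypothetical stand-in, not Formulaic's Structured class, and it assumes that a key may hold either a single item or a tuple of items.

```python
class MiniStructured:
    """Toy container mapping keys to items or tuples of items.

    Hypothetical stand-in for illustration; not Formulaic's Structured class.
    """

    def __init__(self, **structure):
        self._structure = structure

    def __iter__(self):
        # Flatten tuples so iteration yields every individual object.
        for value in self._structure.values():
            if isinstance(value, tuple):
                yield from value
            else:
                yield value

    def __len__(self):
        # Count what iteration yields rather than the number of keys.
        return sum(1 for _ in self)


s = MiniStructured(lhs="y", rhs=("x", "z"))
n_keys = len(s._structure)  # what a key-counting __len__ would report
n_items = len(s)            # what iteration actually yields
```

With two keys but three yielded objects, a key-counting `__len__` would disagree with `len(list(s))`, which is the kind of inconsistency the release note describes.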

2022-09-10 03:35:53

This is a minor patch release that fixes two bugs.

Bugfixes and cleanups:

  • Fixed generation of string representation of Formula objects.
  • Fixed generation of formulaic.__version__ during package build.

2022-08-29 05:23:03

This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.

Breaking changes:

  • Accessing named substructures of Formula objects (e.g. formula.lhs) no longer returns a list of terms; but rather a Formula object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)) or looking up the root node (formula.root).

New features and improvements:

  • The ModelSpec object is now the source of truth in all ModelMatrix generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...). Supported specifications include formula strings, parsed formulae, model matrices and prior model specs.
  • The .get_model_matrix() helper methods across Formula, FormulaMaterializer, ModelSpec and model_matrix objects/helper functions are now consistent, and all use ModelSpec directly under the hood.
  • When accessing substructures of Formula objects (e.g. formula.lhs), the term lists will be wrapped as trivial Formula instances rather than returned as raw lists (so that the helper methods like .get_model_matrix() can still be used).
  • FormulaSpec is now exported from the top-level module.

Bugfixes and cleanups:

  • Fixed ModelSpec specifications being overridden by default arguments to FormulaMaterializer.get_model_matrix.
  • Structured._flatten() now correctly flattens unnamed substructures.

2022-08-10 20:31:56

This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.

Breaking changes:

  • include_intercept is no longer an argument to FormulaParser.get_terms; and is instead an argument of the DefaultFormulaParser constructor. If you want to modify the include_intercept behaviour, please use: Formula("y ~ x", _parser=DefaultFormulaParser(include_intercept=False))
  • Accessing terms via Formula.terms is deprecated since Formula became a subclass of Structured[List[Terms]]. You can directly iterate over, and/or access nested structure on the Formula instance itself. Formula.terms has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.
  • ModelSpec.feature_names and ModelSpec.feature_columns are deprecated in favour of ModelSpec.column_names and ModelSpec.column_indices. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.

New features and enhancements:

  • Structured formulae (and their derived matrices and specs) are now mutable. Internally Formula has been refactored as a subclass of Structured[List[Terms]], and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured (ModelMatrices and ModelSpecs respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.
  • ModelMatrices and ModelSpecs are now surfaced as top-level exports of the formulaic module.
  • Structured (and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.
  • ModelSpec was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.
  • FormulaParser was renamed DefaultFormulaParser, and made a subclass of the new formula parser interface FormulaParser. In this process include_intercept was removed from the API, and made an instance attribute of the default parser implementation.

Bugfixes and cleanups:

  • Fixed AST evaluation for large formulae that caused the evaluation to hit the recursion limit.
  • Fixed sparse categorical encoding when the dataframe index is not the standard range index.
  • Fixed a bug in the linear constraints parser when more than two constraints were specified in a comma-separated string.
  • Avoid implicit changing of the sparsity structure of CSC matrices.
  • If manually constructed ModelSpecs are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.
  • Allowed use of older pandas versions. All versions >=1.0.0 are now supported.
  • Various linting cleanups as pylint was added to the CI testing.
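
Regarding the first fix in the list above, one common way to keep tree evaluation clear of Python's recursion limit is to replace recursion with an explicit stack. The release notes do not say how Formulaic solved it, so the following is only a generic sketch on a toy tuple-based tree; `evaluate` and `OPS` are hypothetical names.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def evaluate(node):
    """Evaluate a nested (op, left, right) tuple tree without recursion."""
    stack = [(node, False)]
    results = []
    while stack:
        current, visited = stack.pop()
        if not isinstance(current, tuple):
            results.append(current)  # leaf: a plain number
        elif visited:
            # Both children are done; their values sit on top of `results`.
            right = results.pop()
            left = results.pop()
            results.append(OPS[current[0]](left, right))
        else:
            # Revisit this node after both children have been evaluated.
            stack.append((current, True))
            stack.append((current[2], False))
            stack.append((current[1], False))
    return results[0]


# A left-deep chain far deeper than Python's default recursion limit.
deep = 0
for _ in range(50_000):
    deep = ("+", deep, 1)
total = evaluate(deep)
```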

Documentation:

  • Apart from the .materializer submodule, most code now has inline documentation and annotations.

2022-05-01 04:08:51

This is a backward compatible major release that adds several new features.

New features and enhancements:

  • Added support for customizing the contrasts generated for categorical features, including treatment, sum, deviation, helmert and custom contrasts.
  • Added support for the generation of linear constraints for ModelMatrix instances (see ModelMatrix.model_spec.get_linear_constraints).
  • Added support for passing ModelMatrix, ModelSpec and other formula-like objects to the model_matrix sugar method so that pre-processed formulae can be used.
  • Improved the way tokens are manipulated for the right-hand-side intercept and substitutions of 0 with -1 to avoid substitutions in quoted contexts.
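
To illustrate two of the contrast schemes named in the first item above, here is a small sketch of the textbook treatment and sum coding matrices for a factor with k levels. These follow the standard definitions; Formulaic's own choices (for example which level serves as the reference) may differ, and the function names are hypothetical.

```python
def treatment_contrasts(k):
    """k x (k-1) matrix: level 0 is the reference and is coded as all zeros."""
    return [[int(i == j + 1) for j in range(k - 1)] for i in range(k)]


def sum_contrasts(k):
    """k x (k-1) matrix: the last level is coded -1 in every column."""
    return [
        [-1 if i == k - 1 else int(i == j) for j in range(k - 1)]
        for i in range(k)
    ]


treatment = treatment_contrasts(3)
sum_coded = sum_contrasts(3)
```

Each row gives the coding for one level. Under treatment coding the coefficients compare each level with the reference level, while under sum coding every column sums to zero, so the coefficients are read relative to the mean across levels.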

Bugfixes and cleanups:

  • Fixed variable sanitization during evaluation, allowing variables with special characters to be used in Python transforms; for example: bs(`my|feature%is^cool`).
  • Fixed the parsing of dictionaries and sets within python expressions in the formula; for example: C(x, {"a": [1,2,3]}).
  • Bumped requirement on astor to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. "bs(x, df=10)").

2022-04-04 03:06:52

This is a minor patch release that migrates the package tooling to poetry, solving a version inconsistency when packaging for conda.

Matthew Wardrop

Developer of data/scientific tools that make routine programmatic tasks transparent.
