Formulaic is a high-performance implementation of Wilkinson formulas for Python.

**Note:** This project, while largely complete, is still a work in progress, and the API is subject to change between major versions (0.<major>.<minor>).

- **Documentation**: https://matthewwardrop.github.io/formulaic
- **Source Code**: https://github.com/matthewwardrop/formulaic
- **Issue tracker**: https://github.com/matthewwardrop/formulaic/issues

It provides:

- high-performance dataframe to model-matrix conversions.
- support for reusing the encoding choices made during conversion of one data-set on other datasets.
- extensible formula parsing.
- extensible data input/output plugins, with implementations for:
  - input:
    - `pandas.DataFrame`
    - `pyarrow.Table`
  - output:
    - `pandas.DataFrame`
    - `numpy.ndarray`
    - `scipy.sparse.CSCMatrix`
- support for symbolic differentiation of formulas (and hence model matrices).

```
import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)
```

`y =`

| | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |

`X =`

| | Intercept | x[T.B] | x[T.C] | z |
|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0.3 |
| 1 | 1.0 | 1 | 0 | 0.1 |
| 2 | 1.0 | 0 | 1 | 0.2 |
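The `x[T.B]` and `x[T.C]` columns in `X` reflect treatment (dummy) coding: the first level, `A`, acts as the reference and gets no column of its own. As a rough illustration only, the encoding can be sketched in plain Python. The helper `treatment_code` is hypothetical and is not part of Formulaic, and sorting the levels alphabetically is an assumption made for this sketch.

```python
def treatment_code(values, name):
    """Dummy-code a categorical column, dropping the first level as reference."""
    levels = sorted(set(values))   # assumed ordering: alphabetical
    non_reference = levels[1:]     # the first level becomes the reference
    return {
        f"{name}[T.{level}]": [1 if value == level else 0 for value in values]
        for level in non_reference
    }

encoded = treatment_code(["A", "B", "C"], "x")
# encoded == {"x[T.B]": [0, 1, 0], "x[T.C]": [0, 0, 1]}
```

Read column by column, this matches the `x[T.B]` and `x[T.C]` entries of the table.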

Formulaic typically outperforms R for both dense and sparse model matrices, and vastly outperforms `patsy` (the existing implementation for Python) for dense matrices (`patsy` does not support sparse model matrix output).

For more details, see here.

- Patsy: a prior implementation of Wilkinson formulas for Python, which is widely used (e.g. in statsmodels). It has fantastic documentation (which helped bootstrap this project), and a rich array of features.
- StatsModels.jl `@formula`: The implementation of Wilkinson formulas for Julia.
- R Formulas: The implementation of Wilkinson formulas for R, which is thoroughly introduced here. [R itself is an implementation of S, in which formulas were first made popular].
- The work that started it all: Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392–399, 1973.

The documentation's introduction says Formulaic provides "support for reusing the encoding choices made during conversion of one data-set on other datasets." How?
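One way to picture this, offered as a hedged sketch rather than a description of Formulaic's internals: the encoder records the choices made on the first dataset (here, the category levels) and replays them on later data. The class `LevelEncoder` below is hypothetical and uses only plain Python. The release notes further down suggest that Formulaic keeps the corresponding state on a `ModelSpec` object, which can be passed back to `model_matrix`.

```python
class LevelEncoder:
    """Toy stateful encoder: learn category levels once, reuse them later."""

    def __init__(self):
        self.levels = None

    def fit_transform(self, values):
        # Record the levels seen in the first dataset.
        self.levels = sorted(set(values))
        return self.transform(values)

    def transform(self, values):
        if self.levels is None:
            raise RuntimeError("call fit_transform() on the first dataset")
        # Reuse the stored levels; the first one is the reference level.
        # Values not seen during fitting encode as all zeros in this toy.
        return [
            [1 if value == level else 0 for level in self.levels[1:]]
            for value in values
        ]

encoder = LevelEncoder()
train_rows = encoder.fit_transform(["A", "B", "C"])
new_rows = encoder.transform(["C", "A"])  # only two levels present here
```

Even though the second dataset contains only two of the three levels, `new_rows` still has the same two columns as `train_rows`, which is the point of reusing the stored encoding.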

I am creating factors manually and setting the `eval_method` for categorical factors as `lookup` and for numerical factors as `python`, as I am using `np.isnan` and `np.power` for transformations. I then create term lists and feed them as the first argument in `formulaic.Formula()`.

After creating the formula, the columns using np transformations show 0 for every observation. If I use a string formula instead of a list of Terms, there doesn't seem to be any issue in evaluating the expression.

Hi Matthew,

Yet another question: Are multiple variables on the left hand side supported? It seems like it, but I wanted to be sure before I build on it:

```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
Y1 = np.random.normal(0, 1, N)
Y2 = np.random.normal(0, 1, N)

data = {'Y1': Y1, 'Y2': Y2, 'X': X}
data = pd.DataFrame(data)

fml = 'Y1 + Y2 ~ X'
y, X = model_matrix(fml, data, na_action='ignore')

y.head()
#          Y1        Y2
# 0 -0.949247  2.510162
# 1 -0.695311 -0.727821
# 2 -0.072459 -1.172503
# 3  0.422692 -1.223921
# 4  0.224729  0.585468
```

Best, Alex

Hi Matthew - thanks for making this super useful package available!

Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, `model_matrix()` drops the affected observations from X only, not from both X and Y.

Here's a quick example:

```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
data = {'Y': Y, 'X': X}
data = pd.DataFrame(data)

data.X.iloc[0] = None

fml = 'Y ~ X'
y, X = model_matrix(fml, data, na_action='ignore')

y.head()
#           Y
# 0 -0.174508
# 1  0.373280
# 2  1.631371
# 3 -0.622598
# 4 -0.482028
X.head()
#    Intercept         X
# 1        1.0  2.652463
# 2        1.0 -1.356067
# 3        1.0  1.143417
# 4        1.0 -1.020435
# 5        1.0  0.072263

y, X = model_matrix(fml, data)

y.shape  # (10, 1)
X.shape  # (9, 2)
```

The row containing the NaN is dropped from X, but not from Y.

I think it would be nice to add functionality for this (though it might already exist?).

E.g. in R, before calling `stats::model.matrix()`, one would define a `stats::model.frame()`, which would then by default drop the entire row where a missing value exists (for both X and Y).

```
N <- 10
X <- rnorm(N)
Y <- rnorm(N)
X[1] <- NA

data <- data.frame(Y = Y, X = X)
mf <- model.frame(Y ~ X, data)

mm <- model.matrix(mf)

depvar <- model.response(mf)
```
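The behaviour requested here amounts to listwise deletion applied jointly to the response and the design matrix, so that both keep exactly the same rows. A minimal plain-Python sketch of that idea follows; `drop_incomplete_rows` is a hypothetical helper written for illustration, not a Formulaic or R function.

```python
import math

def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_rows(y, X):
    """Keep only the rows that are complete in both y and X."""
    kept = [
        i
        for i, (response, row) in enumerate(zip(y, X))
        if not _is_missing(response) and not any(_is_missing(v) for v in row)
    ]
    return [y[i] for i in kept], [X[i] for i in kept], kept

y_raw = [0.5, 1.2, -0.3]
X_raw = [[1.0, None], [1.0, 2.0], [1.0, float("nan")]]
y_clean, X_clean, kept = drop_incomplete_rows(y_raw, X_raw)
```

Returning the retained row indices alongside the data makes it straightforward to check that the two outputs stay aligned.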

Motivated by #115, this is a quick draft demonstrating a `Q` transform that replicates the behavior of the patsy `Q` function.

Notes:
- This syntax is not as nice as `....`, but may be valuable when users are migrating over to formulaic from patsy.
- Perhaps this should be gated by a patsy compatibility flag.
- Perhaps `_data` and `_context` should be passed in separately, to avoid name collisions / weird data types.
- Perhaps it isn't worth adding at all? And just have people migrate over to `...` syntax, or furnish formulaic with their own `Q` implementation (only merge the context bits).

The formula grammars of patsy and formulaic are close but not 1:1. It would be great if terms like `Q('...')` could automatically be converted into `` `...` ``, etc. This would help ease the adoption curve when migrating from patsy to formulaic using existing formulae.
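As a rough sketch of what such a conversion could look like, the snippet below rewrites simple `Q('name')` calls into backtick-quoted names using a regular expression. The function `patsy_q_to_backticks` is hypothetical and not part of patsy or formulaic. It assumes the argument is a plain string literal without escaped quotes or backticks, so it is a starting point rather than a full grammar translation.

```python
import re

# Matches Q('name') or Q("name") where the argument is a plain string literal.
_Q_CALL = re.compile(r"""\bQ\(\s*(['"])(.+?)\1\s*\)""")

def patsy_q_to_backticks(formula: str) -> str:
    """Rewrite patsy-style Q('name') terms as backtick-quoted names."""
    return _Q_CALL.sub(lambda match: "`" + match.group(2) + "`", formula)

converted = patsy_q_to_backticks("y ~ Q('my var') + Q(\"weight.kg\") + x")
# converted == "y ~ `my var` + `weight.kg` + x"
```

The `\b` word boundary keeps the pattern from rewriting other functions whose names merely end in `Q`.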

This is a minor patch release that fixes one bug.

**Bugfixes and cleanups:**

- Fixed alignment between the length of a `Structured` instance and iteration over this instance (including `Formula` instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.
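The invariant behind this fix is that `len(obj)` should agree with the number of items produced by iterating over `obj`. The toy container below illustrates the difference between counting keys and counting yielded objects. `ToyStructured` and its `key_count` method are hypothetical stand-ins for illustration only and do not reproduce formulaic's `Structured` class.

```python
class ToyStructured:
    """Toy container mapping keys to values; NOT formulaic's `Structured`."""

    def __init__(self, **named):
        self._structure = named

    def __iter__(self):
        # Lists are flattened, so one key may yield several objects.
        for value in self._structure.values():
            if isinstance(value, list):
                yield from value
            else:
                yield value

    def key_count(self):
        # Counting keys only: the kind of length the release note calls wrong.
        return len(self._structure)

    def __len__(self):
        # Count exactly what iteration yields.
        return sum(1 for _ in self)

terms = ToyStructured(root=["1", "x", "z"])
```

Here `terms.key_count()` is 1 while `len(terms)` is 3, matching the three objects yielded by `list(terms)`.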

This is a minor patch release that fixes two bugs.

**Bugfixes and cleanups:**

- Fixed generation of string representation of `Formula` objects.
- Fixed generation of `formulaic.__version__` during package build.

This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.

**Breaking changes:**

- Accessing named substructures of `Formula` objects (e.g. `formula.lhs`) no longer returns a list of terms, but rather a `Formula` object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (`list(formula)`) or looking up the root node (`formula.root`).

**New features and improvements:**

- The `ModelSpec` object is now the source of truth in all `ModelMatrix` generations, and can be constructed directly from any supported specification using `ModelSpec.from_spec(...)`. Supported specifications include formula strings, parsed formulae, model matrices and prior model specs.
- The `.get_model_matrix()` helper methods across `Formula`, `FormulaMaterializer`, `ModelSpec` and `model_matrix` objects/helper functions are now consistent, and all use `ModelSpec` directly under the hood.
- When accessing substructures of `Formula` objects (e.g. `formula.lhs`), the term lists will be wrapped as trivial `Formula` instances rather than returned as raw lists (so that the helper methods like `.get_model_matrix()` can still be used).
- `FormulaSpec` is now exported from the top-level module.

**Bugfixes and cleanups:**

- Fixed `ModelSpec` specifications being overridden by default arguments to `FormulaMaterializer.get_model_matrix`.
- `Structured._flatten()` now correctly flattens unnamed substructures.

This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.

**Breaking changes:**

- `include_intercept` is no longer an argument to `FormulaParser.get_terms`; and is instead an argument of the `DefaultFormulaParser` constructor. If you want to modify the `include_intercept` behaviour, please use: `Formula("y ~ x", _parser=DefaultFormulaParser(include_intercept=False))`
- Accessing terms via `Formula.terms` is deprecated since `Formula` became a subclass of `Structured[List[Terms]]`. You can directly iterate over, and/or access nested structure on the `Formula` instance itself. `Formula.terms` has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.
- `ModelSpec.feature_names` and `ModelSpec.feature_columns` are deprecated in favour of `ModelSpec.column_names` and `ModelSpec.column_indices`. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.

**New features and enhancements:**

- Structured formulae (and their derived matrices and specs) are now mutable. Internally `Formula` has been refactored as a subclass of `Structured[List[Terms]]`, and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of `Structured` (`ModelMatrices` and `ModelSpecs` respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.
- `ModelMatrices` and `ModelSpecs` are now surfaced as top-level exports of the `formulaic` module.
- `Structured` (and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.
- `ModelSpec` was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.
- `FormulaParser` was renamed `DefaultFormulaParser`, and made a subclass of the new formula parser interface `FormulaParser`. In this process `include_intercept` was removed from the API, and made an instance attribute of the default parser implementation.

**Bugfixes and cleanups:**

- Fixed AST evaluation for large formulae that caused the evaluation to hit the recursion limit.
- Fixed sparse categorical encoding when the dataframe index is not the standard range index.
- Fixed a bug in the linear constraints parser when more than two constraints were specified in a comma-separated string.
- Avoid implicit changing of the sparsity structure of CSC matrices.
- If manually constructed `ModelSpec`s are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.
- Allowed use of older pandas versions. All versions >=1.0.0 are now supported.
- Various linting cleanups as `pylint` was added to the CI testing.
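The recursion-limit fix listed above is not described in detail here. A general technique for this class of problem is to evaluate the tree with an explicit stack instead of recursive calls, so that tree depth no longer consumes Python call frames. The sketch below shows that technique on a toy tree of nested tuples. The node layout and the names `build_sum_tree` and `evaluate_iteratively` are assumptions made for illustration and say nothing about how formulaic actually implements its AST.

```python
import sys

def build_sum_tree(n):
    """Build the left-nested tree ((0 + 1) + 2) + ... + (n - 1)."""
    tree = ("leaf", 0)
    for i in range(1, n):
        tree = ("add", tree, ("leaf", i))
    return tree

def evaluate_iteratively(tree):
    """Post-order evaluation using an explicit stack instead of recursion."""
    stack = [(tree, False)]
    results = []
    while stack:
        node, children_done = stack.pop()
        if node[0] == "leaf":
            results.append(node[1])
        elif children_done:
            # Both operands were evaluated; the right one was pushed last.
            right = results.pop()
            left = results.pop()
            results.append(left + right)
        else:
            # Revisit this node after its children have been evaluated.
            stack.append((node, True))
            stack.append((node[2], False))
            stack.append((node[1], False))
    return results[0]

depth = sys.getrecursionlimit() * 5  # far deeper than recursion would allow
total = evaluate_iteratively(build_sum_tree(depth))
```

A naive recursive evaluator would raise `RecursionError` on a tree this deep, whereas the explicit stack is limited only by available memory.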

**Documentation:**

- Apart from the `.materializer` submodule, most code now has inline documentation and annotations.

This is a backward compatible major release that adds several new features.

**New features and enhancements:**

- Added support for customizing the contrasts generated for categorical features, including treatment, sum, deviation, helmert and custom contrasts.
- Added support for the generation of linear constraints for `ModelMatrix` instances (see `ModelMatrix.model_spec.get_linear_constraints`).
- Added support for passing `ModelMatrix`, `ModelSpec` and other formula-like objects to the `model_matrix` sugar method so that pre-processed formulae can be used.
- Improved the way tokens are manipulated for the right-hand-side intercept and substitutions of `0` with `-1` to avoid substitutions in quoted contexts.
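To make the difference between the contrast schemes mentioned in this release more concrete, here is a small plain-Python sketch of treatment and sum coding matrices. It follows the conventions of R's `contr.treatment` and `contr.sum`. The function names are hypothetical, and Formulaic's own column naming and choice of reference level may differ.

```python
def treatment_contrasts(levels):
    """Treatment coding: the first level is the reference (all zeros)."""
    return {
        level: [1 if level == other else 0 for other in levels[1:]]
        for level in levels
    }

def sum_contrasts(levels):
    """Sum coding: the last level is coded -1 in every column."""
    k = len(levels)
    matrix = {}
    for i, level in enumerate(levels):
        if i == k - 1:
            matrix[level] = [-1] * (k - 1)
        else:
            matrix[level] = [1 if j == i else 0 for j in range(k - 1)]
    return matrix

levels = ["A", "B", "C"]
treatment = treatment_contrasts(levels)
# treatment == {"A": [0, 0], "B": [1, 0], "C": [0, 1]}
sum_coded = sum_contrasts(levels)
# sum_coded == {"A": [1, 0], "B": [0, 1], "C": [-1, -1]}
```

With sum coding each column sums to zero across levels, so coefficients are read as deviations from the mean of the level means rather than from a reference level.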

**Bugfixes and cleanups:**

- Fixed variable sanitization during evaluation, allowing variables with special characters to be used in Python transforms; for example: ``bs(`my|feature%is^cool`)``.
- Fixed the parsing of dictionaries and sets within python expressions in the formula; for example: `C(x, {"a": [1,2,3]})`.
- Bumped requirement on `astor` to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. "bs(x, df=10)").

This is a minor patch release that migrates the package tooling to poetry; solving a version inconsistency when packaging for conda.

Developer of data/scientific tools that make routine programmatic tasks transparent.
