Formulaic is a high-performance implementation of Wilkinson formulas for Python.
Note: This project, while largely complete, is still a work in progress, and the API is subject to change between major versions (0.<major>.<minor>).
It provides:
- input data types: pandas.DataFrame, pyarrow.Table
- output data types: pandas.DataFrame, numpy.ndarray, scipy.sparse.CSCMatrix
```
import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)
```
y =

| | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |

X =

| | Intercept | x[T.B] | x[T.C] | z |
|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0.3 |
| 1 | 1.0 | 1 | 0 | 0.1 |
| 2 | 1.0 | 0 | 1 | 0.2 |
Formulaic typically outperforms R for both dense and sparse model matrices, and vastly outperforms patsy (the existing implementation for Python) for dense matrices (patsy does not support sparse model matrix output).
For more details, see here.
- @formula: The implementation of Wilkinson formulas for Julia.

The documentation's introduction says Formulaic provides "support for reusing the encoding choices made during conversion of one data-set on other datasets." How?
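As a rough illustration of what "reusing encoding choices" means, here is a toy sketch in plain Python. It is not Formulaic's implementation, and the helper names `fit_treatment_levels` and `encode_treatment` are hypothetical. The idea is that the category levels are recorded once, from the first data-set, and then applied unchanged to later data-sets, so the output columns stay the same even when a level is absent from the new data. In Formulaic itself, the release notes further down mention `ModelMatrix.model_spec` and a `.get_model_matrix()` helper on `ModelSpec`, which suggests that the spec attached to a generated matrix is the object to reuse; that is not verified here.

```python
# Toy illustration only: this is NOT how Formulaic is implemented.

def fit_treatment_levels(values):
    """Record the sorted distinct levels seen in the first data-set."""
    return sorted(set(values))


def encode_treatment(values, levels, name="x"):
    """Dummy-encode values against previously recorded levels.

    The first level is the reference category and gets no column.
    """
    return {
        f"{name}[T.{level}]": [int(value == level) for value in values]
        for level in levels[1:]
    }


# Encoding choices are made once, on the first data-set ...
levels = fit_treatment_levels(["A", "B", "C"])
train = encode_treatment(["A", "B", "C"], levels)

# ... and reused on a second data-set that happens to lack level "B".
new = encode_treatment(["C", "A"], levels)
```

Because `levels` is stored rather than recomputed, `new` still contains an `x[T.B]` column (all zeros), so both matrices have the same columns.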
I am creating factors manually and setting the eval_method for categorical factors as lookup and for numerical factors as python, as I am using np.isnan and np.power for transformations. I then create term lists and feed them as the first argument in formulaic.Formula().
After creating the formula, the columns using np transformations show 0 for every observation. If I use a string formula instead of a list of Terms, there doesn't seem to be any issue in evaluating the expression.
Hi Matthew,
Yet another question: Are multiple variables on the left hand side supported? It seems like it, but I wanted to be sure before I build on it:
```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
Y1 = np.random.normal(0, 1, N)
Y2 = np.random.normal(0, 1, N)

data = {'Y1': Y1, 'Y2': Y2, 'X': X}
data = pd.DataFrame(data)

fml = 'Y1 + Y2 ~ X'
y, X = model_matrix(fml, data, na_action='ignore')

y.head()
#          Y1        Y2
# 0 -0.949247  2.510162
# 1 -0.695311 -0.727821
# 2 -0.072459 -1.172503
# 3  0.422692 -1.223921
# 4  0.224729  0.585468
```
Best, Alex
Hi Matthew - thanks for making this super useful package available!
Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, model_matrix() drops observations separately for each matrix, rather than jointly for both X and Y.
Here's a quick example:
```
from formulaic import model_matrix
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
data = {'Y': Y, 'X': X}
data = pd.DataFrame(data)

data.X.iloc[0] = None

fml = 'Y ~ X'
y, X = model_matrix(fml, data, na_action='ignore')

y.head()
#           Y
# 0 -0.174508
# 1  0.373280
# 2  1.631371
# 3 -0.622598
# 4 -0.482028
X.head()
#    Intercept         X
# 1        1.0  2.652463
# 2        1.0 -1.356067
# 3        1.0  1.143417
# 4        1.0 -1.020435
# 5        1.0  0.072263

y, X = model_matrix(fml, data)

y.shape
# (10, 1)
X.shape
# (9, 2)
```
The row containing the NaN is dropped from X, but not from Y.
I think it would be nice to add functionality for this (though it might already exist?).
E.g. in R, before calling stats::model.matrix(), one would define a stats::model.frame(), which would then by default drop the entire row where a missing value exists (for both X and Y).
```
N <- 10
X <- rnorm(N)
Y <- rnorm(N)
X[1] <- NA

data <- data.frame(Y = Y, X = X)
mf <- model.frame(Y ~ X, data)

mm <- model.matrix(mf)

depvar <- model.response(mf)
```
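The joint row-dropping described above can be sketched in plain Python as follows. This is only an illustration of the requested behaviour, not something formulaic is known to provide; `is_missing` and `drop_incomplete_rows` are hypothetical helper names. A row is kept only if every column, response and predictors alike, has a value in it.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(columns):
    """Drop every row that has a missing value in any column.

    `columns` maps column names to equal-length lists. Returns the
    filtered columns together with the indices of the rows that were kept.
    """
    n_rows = len(next(iter(columns.values())))
    keep = [
        i for i in range(n_rows)
        if not any(is_missing(col[i]) for col in columns.values())
    ]
    filtered = {name: [col[i] for i in keep] for name, col in columns.items()}
    return filtered, keep


data = {"Y": [0.5, 1.2, -0.3, 0.8], "X": [float("nan"), 2.0, None, 1.5]}
complete, kept = drop_incomplete_rows(data)
```

Here rows 0 and 2 are removed from both `Y` and `X`, so the response and the design matrix built from `complete` would have the same number of rows.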
Motivated by #115, this is a quick draft demonstrating a Q transform that replicates the behavior of the patsy Q function.
Notes:
- This syntax is not as nice as ...., but may be valuable when users are migrating over to formulaic from patsy.
- Perhaps this should be gated by a patsy compatibility flag.
- Perhaps _data and _context should be passed in separately, to avoid name collisions / weird data types.
- Perhaps it isn't worth adding at all? And just have people migrate over to ... syntax, or furnish formulaic with their own Q implementation (only merge the context bits).
The formula grammars of patsy and formulaic are close but not 1:1. It would be great if terms like Q('...') could automatically be converted into `...`, etc. This would help ease the adoption curve when migrating from patsy to formulaic with existing formulae.
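A minimal sketch of such a conversion, done as a preprocessing step on the formula string, might look like the following. This is an assumption about how one could approach it, not part of formulaic; `q_to_backticks` is a hypothetical helper, and it only handles simple quoted names without embedded quotes, backticks, or backslashes.

```python
import re

# Matches Q('name') or Q("name"), where the name contains no quotes,
# backticks or backslashes. The leading \b avoids matching e.g. myQ('a').
_Q_CALL = re.compile(r"""\bQ\(\s*(['"])([^'"`\\]+)\1\s*\)""")


def q_to_backticks(formula):
    """Rewrite patsy-style Q('name') terms as backtick-quoted names."""
    return _Q_CALL.sub(lambda match: f"`{match.group(2)}`", formula)


converted = q_to_backticks("y ~ Q('my feature') + Q(\"weight (kg)\") + z")
```

A regular expression is enough for the simple case, but names containing quote characters or escapes would need a real tokenizer.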
This is a minor patch release that fixes one bug.
Bugfixes and cleanups:
- Fixed an inconsistency between the length of a Structured instance and iteration over this instance (including Formula instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.

This is a minor patch release that fixes two bugs.
Bugfixes and cleanups:
- [...] Formula objects.
- [...] formulaic.__version__ during package build.

This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.
Breaking changes:
- Accessing substructures of Formula objects (e.g. formula.lhs) no longer returns a list of terms, but rather a Formula object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)) or looking up the root node (formula.root).

New features and improvements:
- The ModelSpec object is now the source of truth in all ModelMatrix generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...). Supported specifications include formula strings, parsed formulae, model matrices and prior model specs.
- The .get_model_matrix() helper methods across Formula, FormulaMaterializer, ModelSpec and model_matrix objects/helper functions are now consistent, and all use ModelSpec directly under the hood.
- When accessing substructures of Formula objects (e.g. formula.lhs), the term lists will be wrapped as trivial Formula instances rather than returned as raw lists (so that the helper methods like .get_model_matrix() can still be used).
- FormulaSpec is now exported from the top-level module.

Bugfixes and cleanups:
- Fixed ModelSpec specifications being overridden by default arguments to FormulaMaterializer.get_model_matrix.
- Structured._flatten() now correctly flattens unnamed substructures.

This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.
Breaking changes:
- include_intercept is no longer an argument to FormulaParser.get_terms, and is instead an argument of the DefaultFormulaParser constructor. If you want to modify the include_intercept behaviour, please use:
```python
Formula("y ~ x", _parser=DefaultFormulaParser(include_intercept=False))
```
- Formula.terms is deprecated since Formula became a subclass of Structured[List[Terms]]. You can directly iterate over, and/or access nested structure on, the Formula instance itself. Formula.terms remains as a deprecated property that returns a reference to the formula itself in order to support legacy use-cases. This will be removed in 1.0.0.
- ModelSpec.feature_names and ModelSpec.feature_columns are deprecated in favour of ModelSpec.column_names and ModelSpec.column_indices. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.

New features and enhancements:
- Formula has been refactored as a subclass of Structured[List[Terms]], and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured (ModelMatrices and ModelSpecs respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.
- ModelMatrices and ModelSpecs are now surfaced as top-level exports of the formulaic module.
- Structured (and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.
- ModelSpec was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.
- FormulaParser was renamed DefaultFormulaParser, and made a subclass of the new formula parser interface FormulaParser. In this process include_intercept was removed from the API, and made an instance attribute of the default parser implementation.

Bugfixes and cleanups:
- When ModelSpecs are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.
- pylint was added to the CI testing.

Documentation:
- [...] .materializer submodule, most code now has inline documentation and annotations.

This is a backward compatible major release that adds several new features.
New features and enhancements:
- [...] ModelMatrix instances (see ModelMatrix.model_spec.get_linear_constraints).
- [...] ModelMatrix, ModelSpec and other formula-like objects to the model_matrix sugar method so that pre-processed formulae can be used.
- [...] 0 with -1 to avoid substitutions in quoted contexts.

Bugfixes and cleanups:
- [...] bs(`my|feature%is^cool`).
- [...] C(x, {"a": [1,2,3]}).
- [...] astor to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. "bs(x, df=10)").

This is a minor patch release that migrates the package tooling to poetry, solving a version inconsistency when packaging for conda.