Python library for AB test analysis.
Spotify Confidence provides convenience wrappers around statsmodels' various functions for computing p-values and confidence intervals. With Spotify Confidence it's easy to compute several p-values and confidence bounds in one go, e.g. one for each country or for each date. Each function comes in two versions:
- one that returns a pandas DataFrame,
- one that returns a Chartify chart.
Spotify Confidence supports calculating p-values and confidence intervals using Z-statistics, Student's T-statistics (or more exactly Welch's T-test), as well as Chi-squared statistics. It also supports a variance reduction technique based on using pre-exposure data to fit a linear model.
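To make the Z-statistic case concrete, here is a minimal standard-library sketch of a two-sided test for a difference in proportions. It is an illustration only: the function name `two_proportion_ztest` is hypothetical, and the use of unpooled variances is an assumption of this sketch rather than a description of what Spotify Confidence does internally.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(success_a, total_a, success_b, total_b, level=0.95):
    """Two-sided z-test for the difference in proportions (b minus a).

    Assumption of this sketch: each group's variance is estimated
    separately (unpooled), and the normal approximation is adequate.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    z = diff / se
    norm = NormalDist()
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Half-width of the confidence interval at the requested level.
    half_width = norm.inv_cdf(0.5 + level / 2) * se
    return {
        "difference": diff,
        "z": z,
        "p_value": p_value,
        "ci": (diff - half_width, diff + half_width),
    }


# Control: 40 successes out of 100; treatment: 50 successes out of 100.
result = two_proportion_ztest(40, 100, 50, 100)
print(result)
```

The library itself computes such values for many groups at once and applies the chosen multiple-testing correction, which this sketch does not attempt.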
There is also a Bayesian alternative in the BetaBinomial class.
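As a rough illustration of the Beta-Binomial idea (not the API of the `BetaBinomial` class), the sketch below updates a Beta prior with observed successes and failures and estimates, by Monte Carlo sampling, the probability that one variation has a higher conversion rate than another. The helper names, the uniform Beta(1, 1) prior, and the number of draws are all assumptions made for this example.

```python
import random


def beta_posterior(successes, total, prior_alpha=1.0, prior_beta=1.0):
    """Return the (alpha, beta) parameters of the Beta posterior.

    With a Beta(prior_alpha, prior_beta) prior and binomial data, the
    posterior is Beta(prior_alpha + successes, prior_beta + failures).
    """
    return prior_alpha + successes, prior_beta + (total - successes)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


control = beta_posterior(40, 100)
treatment = beta_posterior(50, 100)
p_treatment_better = prob_b_beats_a(control, treatment)
print(control, treatment, p_treatment_better)
```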
```
import spotify_confidence as confidence
import pandas as pd

data = pd.DataFrame(
    {'variation_name': ['treatment1', 'control', 'treatment2', 'treatment3'],
     'success': [50, 40, 10, 20],
     'total': [100, 100, 50, 60]
     }
)

test = confidence.ZTest(
    data,
    numerator_column='success',
    numerator_sum_squares_column=None,
    denominator_column='total',
    categorical_group_columns='variation_name',
    correction_method='bonferroni')

test.summary()
test.difference(level_1='control', level_2='treatment1')
test.multiple_difference(level='control', level_as_reference=True)

test.summary_plot().show()
test.difference_plot(level_1='control', level_2='treatment1').show()
test.multiple_difference_plot(level='control', level_as_reference=True).show()
```
See the Jupyter notebooks in the examples folder for more complete examples.
Spotify Confidence can be installed via pip:
pip install spotify-confidence
Find the latest release version here
This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.
This PR was automatically created by Snyk using the credentials of a real user.
Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity
:-------------------------:|-------------------------|:-------------------------|:-------------------------|:-------------------------|:-------------------------
| | 531/1000 Why? Proof of Concept exploit, Has a fix available, CVSS 4.2 | Remote Code Execution (RCE) SNYK-PYTHON-IPYTHON-3318382 | ipython: 7.34.0 -> 8.10.0 | No | Proof of Concept
| | 509/1000 Why? Has a fix available, CVSS 5.9 | Regular Expression Denial of Service (ReDoS) SNYK-PYTHON-SETUPTOOLS-3180412 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit
(*) Note that the real score may have changed since the PR was raised.
Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.
Check the changes in this PR to ensure they won't cause issues with your project.
Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.
For more information: 🧐 View latest project report
📚 Read more about Snyk's upgrade and patch logic
Learn how to fix vulnerabilities with free interactive lessons:
🦉 Regular Expression Denial of Service (ReDoS)
We are considering using Spotify Confidence to report on all the experiments running on our experimentation platform. I ran a sample of our data (see image below) through the ZTest class to see whether it could meet our need to analyze various experiments and conversion events simultaneously. My findings were as follows:
For a single experiment with multiple metrics, the following methods, summary(), difference(), and multiple_difference(), worked correctly.
```
ztest_filtered = confidence.ZTest(
    pandasDF_filtered,
    numerator_column='NUMERATOR',
    numerator_sum_squares_column=None,
    denominator_column='DENOMINATOR',
    categorical_group_columns=['VARIATION_TYPE', 'CONVERSION_EVENT_NAME'],
    interval_size=0.95,
    correction_method='bonferroni',
    # metric_column='CONVERSION_EVENT_NAME',
)

ztest_filtered.summary()
ztest_filtered.difference(level_1="control", level_2="variation_1", groupby="CONVERSION_EVENT_NAME", absolute=False)
ztest_filtered.multiple_difference(level='control', groupby='CONVERSION_EVENT_NAME', level_as_reference=True)
```
For all experiments and events, the results were similar to the previous case: everything works if we concatenate the two fields into a single key, "Experiment_Key~Conversion_Event_Name".
```
ztest_concat = confidence.ZTest(
    pandasDF_updated,
    numerator_column='NUMERATOR',
    numerator_sum_squares_column='NUMERATOR',
    denominator_column='DENOMINATOR',
    categorical_group_columns=['VARIATION_TYPE', 'EXP_n_EVENT'],
    # ordinal_group_column = ,
    interval_size=0.95,
    correction_method='bonferroni',
    # metric_column = 'CONVERSION_EVENT_NAME',
    # treatment_column ,
    # power - 0.8 (default)
)

ztest_concat.summary()
ztest_concat.difference(level_1="control", level_2="variation_1", groupby="EXP_n_EVENT", absolute=False)
ztest_concat.multiple_difference(level='control', groupby='EXP_n_EVENT', level_as_reference=True)
```
The summary() method works even if I move the conversion event from the categorical groups to metric_column. However, difference() and multiple_difference() return errors regardless of the combinations I try, in both the class and the method.
```
# Trial 1: metric_column equals conversion_event_name
ztest = confidence.ZTest(
    pandasDF_updated,
    numerator_column='NUMERATOR',
    numerator_sum_squares_column='NUMERATOR',
    denominator_column='DENOMINATOR',
    categorical_group_columns=['VARIATION_TYPE', 'EXPERIMENT_KEY'],
    # ordinal_group_column = ,
    interval_size=0.95,
    correction_method='bonferroni',
    metric_column='CONVERSION_EVENT_NAME',
    # treatment_column ,
    # power - 0.8 (default)
)

# Trial 2: metric_column hidden and conversion_event_name moved to categorical_group_columns
ztest = confidence.ZTest(
    pandasDF_updated,
    numerator_column='NUMERATOR',
    numerator_sum_squares_column='NUMERATOR',
    denominator_column='DENOMINATOR',
    categorical_group_columns=['VARIATION_TYPE', 'EXPERIMENT_KEY', 'CONVERSION_EVENT_NAME'],
    # ordinal_group_column = ,
    interval_size=0.95,
    correction_method='bonferroni',
    # metric_column = 'CONVERSION_EVENT_NAME',
    # treatment_column ,
    # power - 0.8 (default)
)
```
ztest.multiple_difference(level='control', groupby=['EXPERIMENT_KEY','CONVERSION_EVENT_NAME'], level_as_reference=True)
ValueError: cannot handle a non-unique multi-index! (for both trials)
I've searched the notebooks in the repository, but I couldn't find anything that explains or reproduces this error message.
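One check that might help narrow this down is to look for repeated combinations of the grouping columns in the input data. This is only a sketch under the assumption that duplicated group keys are what makes the index non-unique; that cause is not confirmed. The helper `duplicated_group_keys` and the sample values are hypothetical, and the function works on plain dict rows so it needs no third-party packages (with pandas, the rows could come from `pandasDF_updated.to_dict("records")`).

```python
from collections import Counter


def duplicated_group_keys(rows, key_columns):
    """Return each key combination that appears more than once, with its count.

    rows: iterable of dict-like records, one per row of the input data.
    key_columns: the columns that together should identify a row uniquely.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows; the third row repeats the key of the first.
rows = [
    {"VARIATION_TYPE": "control", "EXPERIMENT_KEY": "exp_1", "CONVERSION_EVENT_NAME": "signup"},
    {"VARIATION_TYPE": "variation_1", "EXPERIMENT_KEY": "exp_1", "CONVERSION_EVENT_NAME": "signup"},
    {"VARIATION_TYPE": "control", "EXPERIMENT_KEY": "exp_1", "CONVERSION_EVENT_NAME": "signup"},
]
keys = ["VARIATION_TYPE", "EXPERIMENT_KEY", "CONVERSION_EVENT_NAME"]
duplicates = duplicated_group_keys(rows, keys)
print(duplicates)
```

An empty result would suggest the cause lies elsewhere, for example in how the groupby columns are combined with the level argument.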
So after this test, I wondered:
Thanks, and looking forward to leveraging this package.
I'm a bit confused about why powered_effect is not calculated in StudentsTTest but is provided in ZTest.
The above is the data frame which I passed into both
stat_res_df = confidence.ZTest(
stats_df,
numerator_column='conversions',
numerator_sum_squares_column=None,
denominator_column='total',
categorical_group_columns='variant_id',
correction_method='bonferroni')
and
stat_res_df = confidence.StudentsTTest(
stats_df,
numerator_column='conversions',
numerator_sum_squares_column=None,
denominator_column='total',
categorical_group_columns='variant_id',
correction_method='bonferroni')
but when I called stat_res_df.difference(level_1='control', level_2='treatment'), I found that the z-test result provides the powered_effect column as below, but it's missing from the t-test result. Another question: why is required_sample_size missing? Is there a way to also provide the sample size estimation in the result? Thanks!
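For reference, a sample size estimate for comparing two proportions can be sketched by hand with the textbook normal-approximation formula. This is not how the library computes required_sample_size (that is exactly what is unclear here); the function name and default values below are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Textbook formula with unpooled variances:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_power) ** 2 * variance / (p_treatment - p_control) ** 2
    return math.ceil(n)


# Detecting a lift from 40% to 50% conversion at 5% significance and 80% power.
n_per_group = required_sample_size_per_group(0.40, 0.50)
print(n_per_group)
```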
If you could tell me. I can't fully understand it from the code
PR for adding support for tanking tests. Some things remain.
Fixed bug in sample size calculator check for binary metrics when there are nans
Fixed bug in SampleSizeCalculator.optimal_weights_and_sample_size.
Added check to make point estimates and variances match for binary metrics when using the sample size calculator.
Changed level_as_reference to default to None in multiple_difference_plot to be consistent with multiple_difference
level_as_reference