data wrangling simplicity, complete audit transparency, and at speed

whythawk, updated 🕥 2023-03-03 07:25:55

whyqd: simplicity, transparency, speed


What is it?

whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.

It aims to get you to the point where you can perform automated data munging prior to committing your data into a database, and no further. It is built on Pandas, and plays well with existing Python-based data-analytical tools. Each raw source file produces a JSON schema and a method file, which defines the set of actions to be performed to produce refined data, as well as a destination file validated against that schema.

whyqd ensures complete audit transparency by saving all actions performed to restructure your input data to a separate JSON-defined method file. This permits others to read and scrutinise your approach, validate your methodology, or even use your methods to import data in production.

Once complete, a method file can be shared, along with your input data, and anyone can import whyqd and validate your method to verify that your output data is the product of these inputs.
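The idea of a shareable, verifiable method can be sketched in plain Python. This is an illustration of the concept only, not whyqd's file format or API: the `build_method` and `validate_method` helpers, the use of SHA-256 checksums, and the two toy actions (named after `DEBLANK` and `DEDUPE`, with assumed behaviour) are all assumptions made for the example.

```python
import hashlib
import json


def deblank(rows):
    """Assumed behaviour: drop rows in which every cell is empty."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def dedupe(rows):
    """Assumed behaviour: drop repeated rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical registry mapping action names to functions.
ACTIONS = {"DEBLANK": deblank, "DEDUPE": dedupe}


def run_actions(rows, action_names):
    """Apply the named actions in order and return the restructured rows."""
    for name in action_names:
        rows = ACTIONS[name](rows)
    return rows


def checksum(rows):
    """Hash a canonical JSON serialisation of the rows."""
    return hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()


def build_method(rows, action_names):
    """Record the actions together with checksums of the input and output."""
    return {
        "actions": list(action_names),
        "input_checksum": checksum(rows),
        "output_checksum": checksum(run_actions(rows, action_names)),
    }


def validate_method(method, rows):
    """Re-run the recorded actions and confirm both checksums still match."""
    if checksum(rows) != method["input_checksum"]:
        return False
    output = run_actions(rows, method["actions"])
    return checksum(output) == method["output_checksum"]
```

Because the method is an ordinary dictionary, it can be written out with `json.dumps` and handed to someone else along with the input data; they re-run `validate_method` to confirm that the output follows from those inputs.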

Read the docs. There are also two worked tutorials that demonstrate how you can use whyqd to support source data curation transparency.

Why use it?

If all you want to do is test whether your source data are even useful, spending days or weeks slogging through data restructuring could kill a project. If you already have a workflow and established software which includes Python and pandas, having to change your code every time your source data changes is really, really frustrating.

If you want to go from a Cthulhu dataset like this:

UNDP Human Development Index 2007-2008: a beautiful example of messy data.

To this:

|     | country_name           | indicator_name | reference | year | values |
| --: | :--------------------- | :------------- | :-------- | ---: | -----: |
|   0 | Hong Kong, China (SAR) | HDI rank       | e         | 2008 |     21 |
|   1 | Singapore              | HDI rank       | nan       | 2008 |     25 |
|   2 | Korea (Republic of)    | HDI rank       | nan       | 2008 |     26 |
|   3 | Cyprus                 | HDI rank       | nan       | 2008 |     28 |
|   4 | Brunei Darussalam      | HDI rank       | nan       | 2008 |     30 |
|   5 | Barbados               | HDI rank       | e,g, f    | 2008 |     31 |

With a readable set of scripts to ensure that your process can be audited and repeated:

scripts = [
    "DEBLANK",
    "DEDUPE",
    "REBASE < [11]",
    f"DELETE_ROWS < {[int(i) for i in np.arange(144, df.index[-1]+1)]}",
    "RENAME_ALL > ['HDI rank', 'Country', 'Human poverty index (HPI-1) - Rank;;2008', 'Reference 1', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Reference 2', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Reference 3', 'Population not using an improved water source (%);;2004', 'Reference 4', 'Children under weight for age (% under age 5);;1996-2005', 'Reference 5', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Reference 6', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Reference 7', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'Reference 8', 'HPI-1 rank minus income poverty rank;;2008']",
    "PIVOT_CATEGORIES > ['HDI rank'] < [14,44,120]",
    "RENAME_NEW > 'HDI Category'::['PIVOT_CATEGORIES_idx_20_0']",
    "PIVOT_LONGER > = ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
    "SPLIT > ';;'::['PIVOT_LONGER_names_idx_9']",
    f"JOIN > 'reference' < {reference_columns}",
    "RENAME > 'indicator_name' < ['SPLIT_idx_11_0']",
    "RENAME > 'country_name' < ['Country']",
    "RENAME > 'year' < ['SPLIT_idx_12_1']",
    "RENAME > 'values' < ['PIVOT_LONGER_values_idx_10']",
]
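To make the reshaping concrete, here is a rough plain-Python sketch of what a pivot-longer step followed by a split on `;;` does to a wide table. It is not how whyqd implements `PIVOT_LONGER` or `SPLIT`; the helper names `pivot_longer` and `split_name` and the column label `HDI rank;;2008` are made up for the illustration, and rows are modelled as simple dictionaries.

```python
def pivot_longer(rows, id_columns, value_columns):
    """Emit one record per (row, value column) pair: wide -> long."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            record = {key: row[key] for key in id_columns}
            record["name"] = column          # the former column header
            record["values"] = row[column]   # the cell that sat under it
            long_rows.append(record)
    return long_rows


def split_name(record, separator=";;"):
    """Split the former header into an indicator name and a year label."""
    remaining = dict(record)
    indicator, _, year = remaining.pop("name").partition(separator)
    return {**remaining, "indicator_name": indicator, "year": year}


# Hypothetical wide-format rows; the header packs indicator and year together.
wide = [
    {"Country": "Singapore", "HDI rank;;2008": 25},
    {"Country": "Cyprus", "HDI rank;;2008": 28},
]

long = [
    split_name(record)
    for record in pivot_longer(wide, ["Country"], ["HDI rank;;2008"])
]
```

Each record in `long` now carries its own `indicator_name`, `year` and `values` fields, which is the shape of the tidy table shown earlier. The year is left as a string because some headers in the script list hold ranges such as `2000-05`.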

There are two complex and time-consuming parts to preparing data for analysis: social, and technical.

The social part requires multi-stakeholder engagement with source data-publishers, and with destination database users, to agree structural metadata. Without any agreement on data publication formats or destination structure, you are left with the tedious frustration of manually wrangling each independent dataset into a single schema.

whyqd allows you to get to work without requiring you to achieve buy-in from anyone or change your existing code.

Wrangling process

  • Create, update or import a data schema which defines the destination data structure,
  • Create a new method and associate it with your schema and input data source/s,
  • Assign a foreign key column and (if required) merge input data sources,
  • Structure input data fields to conform to the requirements for each schema field,
  • Assign categorical data identified during structuring,
  • Transform and filter input data to produce a final destination data file,
  • Share your data and a citation.
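The role the schema plays in these steps can be illustrated with a small stand-alone sketch. This is not whyqd's schema format or API: the `SCHEMA` dictionary (field name mapped to a Python type) and the `conform` and `wrangle` helpers are hypothetical, and real schemas carry far more detail, such as constraints and categories.

```python
# Hypothetical destination schema: field name -> required Python type.
SCHEMA = {
    "country_name": str,
    "indicator_name": str,
    "year": int,
    "values": float,
}


def conform(row, schema):
    """Coerce one row to the schema, raising ValueError on any problem."""
    missing = [name for name in schema if name not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    conformed = {}
    for name, kind in schema.items():
        try:
            conformed[name] = kind(row[name])
        except (TypeError, ValueError) as error:
            raise ValueError(
                f"field {name!r}: cannot read {row[name]!r} as {kind.__name__}"
            ) from error
    return conformed


def wrangle(rows, schema):
    """Split rows into those that conform and those rejected, with reasons."""
    accepted, rejected = [], []
    for row in rows:
        try:
            accepted.append(conform(row, schema))
        except ValueError as error:
            rejected.append((row, str(error)))
    return accepted, rejected
```

Keeping the rejected rows alongside the reason for rejection, rather than silently dropping them, is in the spirit of the audit trail described above. Fields that are not named in the schema are discarded by `conform`.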

Installation and dependencies

You'll need at least Python 3.7, then:

pip install whyqd

Code requirements have been tested on the following versions:

  • numpy>=1.18.1
  • openpyxl>=3.0.3
  • pandas>=1.0.0
  • tabulate>=0.8.3
  • xlrd>=1.2.0
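If you want to check an existing environment against these minimums, something like the following sketch may help. It is an assumption-laden convenience script rather than part of whyqd: it relies on `importlib.metadata`, which is only in the standard library from Python 3.8, and it compares just the leading numeric part of each version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as listed above.
MINIMUMS = {
    "numpy": "1.18.1",
    "openpyxl": "3.0.3",
    "pandas": "1.0.0",
    "tabulate": "0.8.3",
    "xlrd": "1.2.0",
}


def as_tuple(version_string):
    """Keep the leading numeric release segments: '1.18.1' -> (1, 18, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if not match:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare release tuples, padding the shorter one with zeros."""
    have, need = as_tuple(installed), as_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def report(minimums=MINIMUMS):
    """Return one status line per package."""
    lines = []
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            lines.append(f"{name}: not installed (needs >={minimum})")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        lines.append(f"{name} {installed}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Pre-release suffixes such as `rc1` are ignored by this comparison, so treat the result as a quick sanity check rather than a full version-specifier match.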

Version 0.5.0 introduced a new, simplified, API, along with script-based transformation actions. You can import and transform any saved method.json files with:

import whyqd

SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
schema_scripts = whyqd.parsers.LegacyScript().parse_legacy_method(
    version="1", schema=SCHEMA, source_path=METHOD_SOURCE_V1
)

Where SCHEMA_SOURCE is a path to your schema and METHOD_SOURCE_V1 is a path to the saved method.json file you want to convert. Existing schema.json files should still work.

Changelog

The version history can be found in the changelog.

Background and funding

whyqd was created to serve a continuous data wrangling process, including collaboration on more complex messy sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our database, back to source. You can see the product of that at openLocal.uk.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from EOSC Future through the RDA Open Call mechanism, based on evaluations of external, independent experts.

The 'backronym' for whyqd /wɪkɪd/ is Whythawk Quantitative Data. Whythawk is an open data science and open research technical consultancy.

Licence

BSD 3

Issues

Bump certifi from 2021.10.8 to 2022.12.7

opened on 2022-12-08 13:26:14 by dependabot[bot]

Bumps certifi from 2021.10.8 to 2022.12.7.



Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
  • `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
  • `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
  • `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/whythawk/whyqd/network/alerts).

Bump nbconvert from 6.4.0 to 6.5.1

opened on 2022-08-23 18:50:42 by dependabot[bot]

Bumps nbconvert from 6.4.0 to 6.5.1.

Release notes

Sourced from nbconvert's releases.

Release 6.5.1

No release notes provided.

6.5.0

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.5...6.5

6.4.3

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.2...6.4.3


Bump mistune from 0.8.4 to 2.0.3

opened on 2022-07-29 23:42:26 by dependabot[bot]

Bumps mistune from 0.8.4 to 2.0.3.

Release notes

Sourced from mistune's releases.

Version 2.0.2

Fix escape_url via lepture/mistune#295

Version 2.0.1

Fix XSS for image link syntax.

Version 2.0.0

First release of Mistune v2.

Version 2.0.0 RC1

In this release, we have a Security Fix for harmful links.

Version 2.0.0 Alpha 1

This is the first release of v2. An alpha version for users to have a preview of the new mistune.

Changelog

Sourced from mistune's changelog.

Changelog

Here is the full history of mistune v2.

Version 2.0.4


Released on Jul 15, 2022
  • Fix url plugin in <a> tag
  • Fix * formatting

Version 2.0.3

Released on Jun 27, 2022

  • Fix table plugin
  • Security fix for CVE-2022-34749

Version 2.0.2


Released on Jan 14, 2022

Fix escape_url

Version 2.0.1

Released on Dec 30, 2021

XSS fix for image link syntax.

Version 2.0.0


Released on Dec 5, 2021

This is the first non-alpha release of mistune v2.

Version 2.0.0rc1

Released on Feb 16, 2021

Version 2.0.0a6



... (truncated)

Commits
  • 3f422f1 Version bump 2.0.3
  • a6d4321 Fix asteris emphasis regex CVE-2022-34749
  • 5638e46 Merge pull request #307 from jieter/patch-1
  • 0eba471 Fix typo in guide.rst
  • 61e9337 Fix table plugin
  • 76dec68 Add documentation for renderer heading when TOC enabled
  • 799cd11 Version bump 2.0.2
  • babb0cf Merge pull request #295 from dairiki/bug.escape_url
  • fc2cd53 Make mistune.util.escape_url less aggressive
  • 3e8d352 Version bump 2.0.1
  • Additional commits viewable in compare view



Bump notebook from 6.4.7 to 6.4.12

opened on 2022-06-17 00:09:01 by dependabot[bot]

Bumps notebook from 6.4.7 to 6.4.12.


Bump jupyter-server from 1.13.2 to 1.15.4

opened on 2022-03-29 22:30:09 by dependabot[bot]

Bumps jupyter-server from 1.13.2 to 1.15.4.

Release notes

Sourced from jupyter-server's releases.

v1.15.3

1.15.3

(Full Changelog)

Bugs fixed

Maintenance and upkeep improvements

Contributors to this release

(GitHub contributors page for this release)

@​blink1073 | @​codecov-commenter | @​minrk

v1.15.2

1.15.2

(Full Changelog)

Bugs fixed

Maintenance and upkeep improvements

Contributors to this release

(GitHub contributors page for this release)

@​blink1073 | @​minrk | @​Zsailer

v1.15.1

1.15.1

(Full Changelog)

... (truncated)

Changelog

Sourced from jupyter-server's changelog.

Changelog

All notable changes to this project will be documented in this file.

1.15.6

(Full Changelog)

Bugs fixed

  • Missing warning when no authorizer in found ZMQ handlers #744 (@​Zsailer)

Maintenance and upkeep improvements

Contributors to this release

(GitHub contributors page for this release)

@​blink1073 | @​codecov-commenter | @​Zsailer

1.15.5

(Full Changelog)

Bugs fixed

  • Relax type checking on ExtensionApp.serverapp #739 (@​minrk)
  • raise no-authorization warning once and allow disabled authorization #738 (@​Zsailer)

Maintenance and upkeep improvements

Contributors to this release

(GitHub contributors page for this release)

@​blink1073 | @​codecov-commenter | @​minrk | @​Zsailer

1.15.3

(Full Changelog)

... (truncated)


Bump ipython from 7.31.0 to 7.31.1

opened on 2022-01-21 20:40:20 by dependabot[bot]

Bumps ipython from 7.31.0 to 7.31.1.


Releases

whyqd: simplicity, transparency, speed 2021-08-23 17:06:04

whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.

Read the docs. There are also two worked tutorials that demonstrate how you can use whyqd to support source data curation transparency.

Install using pip:

pip install whyqd

Version 0.5.0 introduced a new, simplified, API, along with script-based transformation actions. You can import and transform any saved method.json files with:

SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
schema_scripts = whyqd.parsers.LegacyScript().parse_legacy_method(
    version="1", schema=SCHEMA, source_path=METHOD_SOURCE_V1
)

Where SCHEMA_SOURCE is a path to your schema. Existing schema.json files should still work.

whyqd: simplicity, transparency, speed 2020-05-08 10:06:50

whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.

Read the docs and a full tutorial.

Install using pip:

pip install whyqd

Whythawk

Developing integrated Open Data solutions


data-science data-analysis data-wrangling munging data-management open-data open-science python pandas