whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.
It aims to get you to the point where you can perform automated data munging prior to committing your data to a database, and no further. It is built on pandas and plays well with existing Python-based data-analytical tools. Each raw source file produces a JSON-defined schema and method file specifying the actions performed to produce restructured data, along with a destination file validated against that schema.
whyqd ensures complete audit transparency by saving all actions performed to restructure your input data to a separate JSON-defined methods file. This permits others to read and scrutinise your approach, validate your methodology, or even use your methods to import data in production.
Once complete, a method file can be shared, along with your input data, and anyone can import whyqd and validate your method to verify that your output data is the product of these inputs.
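One part of this verification is confirming that the input data are the same files the method was built against. As an illustration of the idea only (this is generic Python, not whyqd's own API), recording a checksum of each input file and re-checking it before re-running a method is one way to do that:

```python
import hashlib
from pathlib import Path


def file_checksum(path: str) -> str:
    """Return the BLAKE2b hash of a file, read in chunks."""
    digest = hashlib.blake2b()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A method file can record this hash alongside each input reference;
# anyone re-running the method can then confirm the sources are unchanged.
```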
Read the docs; there are two worked tutorials demonstrating how you can use whyqd to support source data curation transparency.
If all you want to do is test whether your source data are even useful, spending days or weeks slogging through data restructuring could kill a project. If you already have a workflow and established software which includes Python and pandas, having to change your code every time your source data changes is really, really frustrating.
If you want to go from a Cthulhu dataset like this:
To this:
|     | country_name           | indicator_name | reference | year | values |
| --: | :--------------------- | :------------- | :-------- | ---: | -----: |
|   0 | Hong Kong, China (SAR) | HDI rank       | e         | 2008 |     21 |
|   1 | Singapore              | HDI rank       | nan       | 2008 |     25 |
|   2 | Korea (Republic of)    | HDI rank       | nan       | 2008 |     26 |
|   3 | Cyprus                 | HDI rank       | nan       | 2008 |     28 |
|   4 | Brunei Darussalam      | HDI rank       | nan       | 2008 |     30 |
|   5 | Barbados               | HDI rank       | e,g, f    | 2008 |     31 |
With a readable set of scripts to ensure that your process can be audited and repeated:
```python
scripts = [
    "DEBLANK",
    "DEDUPE",
    "REBASE < [11]",
    f"DELETE_ROWS < {[int(i) for i in np.arange(144, df.index[-1]+1)]}",
    "RENAME_ALL > ['HDI rank', 'Country', 'Human poverty index (HPI-1) - Rank;;2008', 'Reference 1', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Reference 2', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Reference 3', 'Population not using an improved water source (%);;2004', 'Reference 4', 'Children under weight for age (% under age 5);;1996-2005', 'Reference 5', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Reference 6', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Reference 7', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'Reference 8', 'HPI-1 rank minus income poverty rank;;2008']",
    "PIVOT_CATEGORIES > ['HDI rank'] < [14,44,120]",
    "RENAME_NEW > 'HDI Category'::['PIVOT_CATEGORIES_idx_20_0']",
    "PIVOT_LONGER > = ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
    "SPLIT > ';;'::['PIVOT_LONGER_names_idx_9']",
    f"JOIN > 'reference' < {reference_columns}",
    "RENAME > 'indicator_name' < ['SPLIT_idx_11_0']",
    "RENAME > 'country_name' < ['Country']",
    "RENAME > 'year' < ['SPLIT_idx_12_1']",
    "RENAME > 'values' < ['PIVOT_LONGER_values_idx_10']",
]
```
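To give a sense of what the first two actions do, here are rough pandas equivalents. These are illustrative only; whyqd's own implementations of `DEBLANK` and `DEDUPE` handle more edge cases:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [None, 1, 1, 2],
    "B": [None, "x", "x", "y"],
    "C": [None, None, None, None],  # an entirely blank column
})

# DEBLANK: drop rows and columns that are completely empty
deblanked = df.dropna(how="all", axis=0).dropna(how="all", axis=1)

# DEDUPE: drop fully duplicated rows
deduped = deblanked.drop_duplicates()
```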
There are two complex and time-consuming parts to preparing data for analysis: social, and technical.
The social part requires multi-stakeholder engagement with source data-publishers, and with destination database users, to agree structural metadata. Without any agreement on data publication formats or destination structure, you are left with the tedious frustration of manually wrangling each independent dataset into a single schema.
whyqd allows you to get to work without requiring you to achieve buy-in from anyone or change your existing code.
You'll need at least Python 3.7, then:

```shell
pip install whyqd
```
Code requirements have been tested on the following versions:
Version 0.5.0 introduced a new, simplified API, along with script-based transformation actions. You can import and transform any saved `method.json` files with:
```python
SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
schema_scripts = whyqd.parsers.LegacyScript().parse_legacy_method(
    version="1", schema=SCHEMA, source_path=METHOD_SOURCE_V1
)
```
Where `SCHEMA_SOURCE` is a path to your schema. Existing `schema.json` files should still work.
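For orientation, a destination schema is a JSON file describing the fields your restructured data must conform to. The sketch below is hypothetical, a minimal illustration of the idea only; the authoritative structure and field types are defined in the whyqd documentation:

```python
import json

# Hypothetical, minimal illustration of a destination schema.
# Field names here match the tutorial's output table; the exact
# descriptor format is specified in the whyqd docs.
schema = {
    "name": "human-development-index",
    "title": "UN Human Development Index",
    "fields": [
        {"name": "country_name", "type": "string"},
        {"name": "indicator_name", "type": "string"},
        {"name": "year", "type": "year"},
        {"name": "values", "type": "number"},
    ],
}

# Serialise to the sort of `schema.json` file referenced above
schema_json = json.dumps(schema, indent=2)
```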
The version history can be found in the changelog.
whyqd was created to serve a continuous data wrangling process, including collaboration on more complex messy sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our database, back to source. You can see the product of that at openLocal.uk.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from EOSC Future through the RDA Open Call mechanism, based on evaluations of external, independent experts.
The 'backronym' for whyqd /wɪkɪd/ is Whythawk Quantitative Data. Whythawk is an open data science and open research technical consultancy.