A pyspark lib to validate data quality

ronald-smith-angel, updated 🕥 2022-11-11 07:51:10

Owl Data Sanitizer: A light Spark data validation framework

This is a small framework for data quality validation. This first version reads Spark dataframes from data sources such as the local file system, S3 or Hive, and writes the quality reports to Hive tables.

Let's follow this example:

Input data from a hive table:

+----------+--------------+--------+---------+------------------+---------+
|GENERAL_ID|          NAME|    CODE|ADDR_DESC|ULTIMATE_PARENT_ID|PARENT_ID|
+----------+--------------+--------+---------+------------------+---------+
|         1|Dummy 1 Entity|12000123|     null|              null|     null|
|         2|          null|    null|     null|                 2|        2|
|         3|          null|12000123|     null|                 3|        3|
|         4|             1|       1|     null|                 4|        4|
|         5|             1|12000123|     null|                 5|        5|
|         6|          null|       3|     null|                 6|        6|
|      null|          null|12000123|     null|                11|        7|
|         7|             2|    null|     null|                 8|        8|
+----------+--------------+--------+---------+------------------+---------+

Using the following validation config, which has 5 sections:

  1. source_table including the table metadata.
  2. correctness_validations including correctness validations per column. The rule must be a valid Spark SQL expression.
  3. completeness_validations including validations on the overall row count of the table.
  4. parent_children_constraints including parent-children constraints. This means that any parent id should be a valid id.
  5. compare_related_tables_list including comparisons with other tables, or with the same table in other environments.

{
  "source_table": {
    "name": "test.data_test",
    "id_column": "GENERAL_ID",
    "unique_column_group_values_per_table": ["GENERAL_ID", "ULTIMATE_PARENT_ID"],
    "fuzzy_deduplication_distance": 0,
    "output_correctness_table": "test.data_test_correctness",
    "output_completeness_table": "test.data_test_completeness",
    "output_comparison_table": "test.data_test_comparison"
  },
  "correctness_validations": [
    {
      "column": "CODE",
      "rule": "CODE is not null and CODE != '' and CODE != 'null'"
    },
    {
      "column": "NAME",
      "rule": "NAME is not null and NAME != '' and NAME != 'null'"
    },
    {
      "column": "GENERAL_ID",
      "rule": "GENERAL_ID is not null and GENERAL_ID != '' and GENERAL_ID != 'null' and CHAR_LENGTH(GENERAL_ID) < 4"
    }
  ],
  "completeness_validations": [
    {
      "column": "OVER_ALL_COUNT",
      "rule": "OVER_ALL_COUNT <= 7"
    }
  ],
  "parent_children_constraints": [
    {
      "column": "GENERAL_ID",
      "parent": "ULTIMATE_PARENT_ID"
    },
    {
      "column": "GENERAL_ID",
      "parent": "PARENT_ID"
    }
  ],
  "compare_related_tables_list": ["test.diff_df", "test.diff_df_2"]
}
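To make the parent_children_constraints section more concrete, here is a minimal pure-Python sketch of the check it describes, applied to the sample rows from the input table. This is not the library's implementation (which runs on Spark). The function name `parent_errors` is hypothetical, and the choice not to flag a null parent is an assumption inferred from the example correctness report.

```python
# Sample rows from the input table: (GENERAL_ID, ULTIMATE_PARENT_ID, PARENT_ID).
# IDs are kept as strings; None stands for a SQL null.
ROWS = [
    ("1", None, None),
    ("2", "2", "2"),
    ("3", "3", "3"),
    ("4", "4", "4"),
    ("5", "5", "5"),
    ("6", "6", "6"),
    (None, "11", "7"),
    ("7", "8", "8"),
]


def parent_errors(rows, parent_index):
    """Return 1 per row whose parent id is set but is not an existing GENERAL_ID, else 0."""
    valid_ids = {row[0] for row in rows if row[0] is not None}
    return [
        1 if row[parent_index] is not None and row[parent_index] not in valid_ids else 0
        for row in rows
    ]


ultimate_parent_flags = parent_errors(ROWS, 1)  # GENERAL_ID vs ULTIMATE_PARENT_ID
parent_flags = parent_errors(ROWS, 2)           # GENERAL_ID vs PARENT_ID
print(sum(ultimate_parent_flags), sum(parent_flags))  # prints: 2 1
```

Under these assumptions the two sums agree with IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID_SUM and IS_ERROR_GENERAL_ID_PARENT_ID_SUM in the correctness report: parent ids 11 and 8 do not exist as GENERAL_ID values, while parent id 7 does.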

These results are delivered in three output Hive tables:

a). Correctness Report.

  • You will see an output column per validated column, showing 1 when there is an error or 0 when the value is clean.
  • Sum of errors per column.
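The flags and sums in the correctness report can be reproduced by hand. The sketch below re-implements the three example rules as plain Python predicates over the sample rows. The library itself evaluates the rules as Spark SQL expressions, so the predicates and the names `not_blank` and `RULES` are hypothetical stand-ins used only for illustration.

```python
# Sample rows from the input table: (GENERAL_ID, NAME, CODE); None stands for a SQL null.
ROWS = [
    ("1", "Dummy 1 Entity", "12000123"),
    ("2", None, None),
    ("3", None, "12000123"),
    ("4", "1", "1"),
    ("5", "1", "12000123"),
    ("6", None, "3"),
    (None, None, "12000123"),
    ("7", "2", None),
]


def not_blank(value):
    """Python stand-in for: X is not null and X != '' and X != 'null'."""
    return value is not None and value != "" and value != "null"


# One predicate per validated column; a row is flagged when the predicate is False.
RULES = {
    "GENERAL_ID": lambda row: not_blank(row[0]) and len(row[0]) < 4,
    "NAME": lambda row: not_blank(row[1]),
    "CODE": lambda row: not_blank(row[2]),
}

# 1 marks an error, 0 marks a clean value, as in the IS_ERROR_* columns.
flags = {
    column: [0 if rule(row) else 1 for row in ROWS]
    for column, rule in RULES.items()
}
sums = {column: sum(values) for column, values in flags.items()}
print(sums)  # prints: {'GENERAL_ID': 1, 'NAME': 4, 'CODE': 2}
```

The sums match IS_ERROR_GENERAL_ID_SUM, IS_ERROR_NAME_SUM and IS_ERROR_CODE_SUM in the example report.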

+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|GENERAL_ID|IS_ERROR_CODE|IS_ERROR_NAME|IS_ERROR_GENERAL_ID|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID|IS_ERROR_GENERAL_ID_PARENT_ID|IS_ERROR__ROW|dt                        |IS_ERROR_CODE_SUM|IS_ERROR_NAME_SUM|IS_ERROR_GENERAL_ID_SUM|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID_SUM|IS_ERROR_GENERAL_ID_PARENT_ID_SUM|IS_ERROR__ROW_SUM|
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|null      |0            |1            |1                  |1                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|3         |0            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|7         |1            |0            |0                  |1                                     |1                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|5         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|6         |0            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|4         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|2         |1            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|1         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+

b). Completeness Report.

  • The overall count of the dataframe.
  • Column checking if the overall count is complete, example: IS_ERROR_OVER_ALL_COUNT.

+--------------+-----------------------+--------------------------+
|OVER_ALL_COUNT|IS_ERROR_OVER_ALL_COUNT|dt                        |
+--------------+-----------------------+--------------------------+
|8             |1                      |2020-04-17 09:39:04.783505|
+--------------+-----------------------+--------------------------+
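The completeness check can be read as a single comparison on the row count. The following is a small sketch under that reading, mirroring the example rule `OVER_ALL_COUNT <= 7`. The function `completeness_report` and its parameter names are hypothetical and not part of the library.

```python
def completeness_report(row_count, max_allowed):
    """Mirror the example rule OVER_ALL_COUNT <= max_allowed: flag 1 when it fails."""
    return {
        "OVER_ALL_COUNT": row_count,
        "IS_ERROR_OVER_ALL_COUNT": 0 if row_count <= max_allowed else 1,
    }


# The sample table has 8 rows and the rule allows at most 7.
report = completeness_report(row_count=8, max_allowed=7)
print(report)  # prints: {'OVER_ALL_COUNT': 8, 'IS_ERROR_OVER_ALL_COUNT': 1}
```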

c). Comparison of schema and values with related dataframes.

NOTE: for now the result includes only the ids that differ; a further join with the source data is needed to see the actual differences.

+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|df            |missing_cols_right                |missing_cols_left|missing_vals_right|missing_vals_left|dt                        |
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|test.diff_df_2|GENERAL_ID:string,ADDR_DESC:string|GENERAL_ID:int   |                  |                 |2020-04-17 09:39:07.572483|
|test.diff_df  |                                  |                 |6,7               |                 |2020-04-17 09:39:07.572483|
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
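One way to read the schema columns of this report is as a two-sided set difference over "name:type" entries. The sketch below illustrates that reading in plain Python. The function `compare_schemas`, the two example schemas, and the interpretation of "left" as the source table and "right" as the related table are all assumptions, not confirmed behaviour of the library.

```python
def compare_schemas(left, right):
    """Compare two {column: type} mappings.

    Returns the "name:type" entries present on one side but absent on the other.
    """
    left_entries = {f"{name}:{dtype}" for name, dtype in left.items()}
    right_entries = {f"{name}:{dtype}" for name, dtype in right.items()}
    return {
        "missing_cols_right": sorted(left_entries - right_entries),
        "missing_cols_left": sorted(right_entries - left_entries),
    }


# Hypothetical schemas chosen to resemble the test.diff_df_2 row of the report.
source_schema = {"GENERAL_ID": "string", "NAME": "string", "ADDR_DESC": "string"}
other_schema = {"GENERAL_ID": "int", "NAME": "string"}

result = compare_schemas(source_schema, other_schema)
print(result)
```

A column whose type differs appears on both sides (here GENERAL_ID as string and as int), while a column that exists on one side only appears once. The entries are sorted alphabetically here, so their order may differ from the report.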

Installation

Install owl sanitizer from PyPI:

pip install owl-sanitizer-data-quality

Then you can call the library.

```python
from pyspark.sql import SparkSession

from spark_validation.common.config import Config
from spark_validation.dataframe_validation.dataframe_validator import CreateHiveValidationDF

# Placeholder: replace with the path to your own validation config file.
PATH_TO_CONFIG_FILE = "path/to/config.json"

spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
with open(PATH_TO_CONFIG_FILE) as f:
    config = Config.parse(f)
CreateHiveValidationDF.validate(spark_session, config)
```

To use it in your spark-submit command or Airflow DAG:

  • py_files: [https://pypi.org/project/owl-sanitizer-data-quality/latest/]
  • application : owl-sanitizer-data-quality/latest/src/spark_validation/dataframe_validation/hive_validator.py
  • application_package: https://pypi.org/project/owl-sanitizer-data-quality/latest/owl-sanitizer-data-quality-latest.tar.gz
  • application_params: URL_TO_YOUR_REMOTE_CONFIG_FILE

Contact

Please ask questions about technical issues here on GitHub.

Issues

Bump pyspark from 2.4.5 to 3.2.2

opened on 2022-11-11 07:51:09 by dependabot[bot]

Bumps pyspark from 2.4.5 to 3.2.2.

Commits
  • 78a5825 Preparing Spark release v3.2.2-rc1
  • ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
  • 001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image d...
  • 9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check
  • bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot....
  • c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarInte...
  • 32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecu...
  • be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check
  • 1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec
  • 3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like func...
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
  • `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
  • `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
  • `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/ronald-smith-angel/owl-data-sanitizer/network/alerts).

Bump pyspark from 2.4.5 to 3.2.2 in /lib

opened on 2022-11-11 07:50:26 by dependabot[bot]

Bumps pyspark from 2.4.5 to 3.2.2.




Bump numpy from 1.18.3 to 1.22.0 in /lib

opened on 2022-06-22 01:47:14 by dependabot[bot]

Bumps numpy from 1.18.3 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

  • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
  • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
  • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
  • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
  • A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)


Bump numpy from 1.18.3 to 1.22.0

opened on 2022-06-22 01:46:22 by dependabot[bot]

Bumps numpy from 1.18.3 to 1.22.0.



Bump ipython from 7.13.0 to 7.16.3 in /lib

opened on 2022-01-21 20:22:07 by dependabot[bot]

Bumps ipython from 7.13.0 to 7.16.3.


Bump ipython from 7.13.0 to 7.16.3

opened on 2022-01-21 20:20:36 by dependabot[bot]

Bumps ipython from 7.13.0 to 7.16.3.

Ronald Angel

Software Developer focused on Big data and distributed systems.
