This is a small framework for data quality validation. This first version reads Spark dataframes from local data sources such as the local file system, S3, or Hive, and delivers Hive tables with quality reports.
Let's follow this example:
Input data from a hive table:
+----------+--------------+--------+---------+------------------+---------+
|GENERAL_ID| NAME| CODE|ADDR_DESC|ULTIMATE_PARENT_ID|PARENT_ID|
+----------+--------------+--------+---------+------------------+---------+
| 1|Dummy 1 Entity|12000123| null| null| null|
| 2| null| null| null| 2| 2|
| 3| null|12000123| null| 3| 3|
| 4| 1| 1| null| 4| 4|
| 5| 1|12000123| null| 5| 5|
| 6| null| 3| null| 6| 6|
| null| null|12000123| null| 11| 7|
| 7| 2| null| null| 8| 8|
+----------+--------------+--------+---------+------------------+---------+
following this validation config with five sections:

- `source_table`: the table metadata.
- `correctness_validations`: correctness validations per column; each rule must be a valid Spark SQL expression.
- `completeness_validations`: completeness validations on table-level aggregates, such as the overall row count.
- `parent_children_constraints`: parent-children constraints, meaning that any parent id must be a valid id.
- `compare_related_tables_list`: comparisons with other tables, or with the same table in other environments.

{
"source_table": {
"name": "test.data_test",
"id_column": "GENERAL_ID",
"unique_column_group_values_per_table": ["GENERAL_ID", "ULTIMATE_PARENT_ID"],
"fuzzy_deduplication_distance": 0,
"output_correctness_table": "test.data_test_correctness",
"output_completeness_table": "test.data_test_completeness",
"output_comparison_table": "test.data_test_comparison"
},
"correctness_validations": [
{
"column": "CODE",
"rule": "CODE is not null and CODE != '' and CODE != 'null'"
},
{
"column": "NAME",
"rule": "NAME is not null and NAME != '' and NAME != 'null'"
},
{
"column": "GENERAL_ID",
"rule": "GENERAL_ID is not null and GENERAL_ID != '' and GENERAL_ID != 'null' and CHAR_LENGTH(GENERAL_ID) < 4"
}
],
"completeness_validations": [
{
"column": "OVER_ALL_COUNT",
"rule": "OVER_ALL_COUNT <= 7"
}
],
"parent_children_constraints": [
{
"column": "GENERAL_ID",
"parent": "ULTIMATE_PARENT_ID"
},
{
"column": "GENERAL_ID",
"parent": "PARENT_ID"
}
],
"compare_related_tables_list": ["test.diff_df", "test.diff_df_2"]
}
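To make the correctness rules concrete, here is a minimal sketch of how a rule such as `"CODE is not null and CODE != '' and CODE != 'null'"` flags rows, written in plain Python over dicts. This is a conceptual stand-in: the actual framework evaluates the rule as a Spark SQL expression on the dataframe.

```python
# Conceptual sketch of a correctness rule (assumption: the real framework
# evaluates the rule as a Spark SQL expression, not Python like this).
# The rows mirror the CODE column of the sample input table.
rows = [
    {"GENERAL_ID": "1", "CODE": "12000123"},
    {"GENERAL_ID": "2", "CODE": None},
    {"GENERAL_ID": "3", "CODE": "12000123"},
    {"GENERAL_ID": "4", "CODE": "1"},
    {"GENERAL_ID": "5", "CODE": "12000123"},
    {"GENERAL_ID": "6", "CODE": "3"},
    {"GENERAL_ID": None, "CODE": "12000123"},
    {"GENERAL_ID": "7", "CODE": None},
]

def is_error_code(row):
    # Mirrors: "CODE is not null and CODE != '' and CODE != 'null'".
    # The IS_ERROR_* flag is 1 when the rule evaluates to false.
    code = row["CODE"]
    return 0 if (code is not None and code != "" and code != "null") else 1

flags = [is_error_code(r) for r in rows]
print(sum(flags))  # two rows have a null CODE, matching IS_ERROR_CODE_SUM = 2
```

The per-row flags become the `IS_ERROR_CODE` column of the correctness report, and their sum becomes `IS_ERROR_CODE_SUM`.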
Therefore, these results are delivered in three output Hive tables:
a) Correctness Report.
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|GENERAL_ID|IS_ERROR_CODE|IS_ERROR_NAME|IS_ERROR_GENERAL_ID|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID|IS_ERROR_GENERAL_ID_PARENT_ID|IS_ERROR__ROW|dt |IS_ERROR_CODE_SUM|IS_ERROR_NAME_SUM|IS_ERROR_GENERAL_ID_SUM|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID_SUM|IS_ERROR_GENERAL_ID_PARENT_ID_SUM|IS_ERROR__ROW_SUM|
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|null |0 |1 |1 |1 |0 |1 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|3 |0 |1 |0 |0 |0 |1 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|7 |1 |0 |0 |1 |1 |1 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|5 |0 |0 |0 |0 |0 |0 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|6 |0 |1 |0 |0 |0 |1 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|4 |0 |0 |0 |0 |0 |0 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|2 |1 |1 |0 |0 |0 |1 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
|1 |0 |0 |0 |0 |0 |0 |2020-04-17 09:39:04.783505|2 |4 |1 |2 |1 |5 |
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
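The parent-children constraint columns above can be read as: every non-null parent id must also appear in the id column. A minimal plain-Python sketch of that check, using the `ULTIMATE_PARENT_ID` column from the sample input (the treatment of nulls is an assumption inferred from the sample report):

```python
# Sketch of a parent-children constraint: a parent id is flagged when it
# is non-null and does not appear as a GENERAL_ID. (Assumption: nulls are
# not flagged, which is consistent with the sample report above.)
ids = ["1", "2", "3", "4", "5", "6", None, "7"]
ultimate_parents = [None, "2", "3", "4", "5", "6", "11", "8"]

valid_ids = {i for i in ids if i is not None}

def is_error_parent(parent):
    return 0 if parent is None or parent in valid_ids else 1

flags = [is_error_parent(p) for p in ultimate_parents]
print(sum(flags))  # parents 11 and 8 have no matching GENERAL_ID
```

This reproduces `IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID_SUM = 2` from the report: the parents `11` and `8` do not exist as ids.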
b) Completeness Report.
- The overall count of the dataframe.
- A column checking whether the overall count is complete, for example IS_ERROR_OVER_ALL_COUNT.
+--------------+-----------------------+--------------------------+
|OVER_ALL_COUNT|IS_ERROR_OVER_ALL_COUNT|dt |
+--------------+-----------------------+--------------------------+
|8 |1 |2020-04-17 09:39:04.783505|
+--------------+-----------------------+--------------------------+
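The completeness rule from the config (`OVER_ALL_COUNT <= 7`) fails here because the sample table has 8 rows. Sketched in plain Python (the real framework evaluates the rule as a Spark SQL expression over the row count):

```python
# Completeness check sketch: the rule "OVER_ALL_COUNT <= 7" is applied to
# the dataframe's row count; IS_ERROR_OVER_ALL_COUNT is 1 when it fails.
over_all_count = 8  # row count of test.data_test in the example
is_error_over_all_count = 0 if over_all_count <= 7 else 1
print(over_all_count, is_error_over_all_count)
```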
c) Comparison of schema and values with related dataframes.
NOTE: for now the result includes only the ids that differ; a further join with the source data is needed to see the actual differences.
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|df |missing_cols_right |missing_cols_left|missing_vals_right|missing_vals_left|dt |
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|test.diff_df_2|GENERAL_ID:string,ADDR_DESC:string|GENERAL_ID:int | | |2020-04-17 09:39:07.572483|
|test.diff_df | | |6,7 | |2020-04-17 09:39:07.572483|
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
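Based on the sample output, the schema comparison appears to diff (column, type) pairs in both directions, so a type mismatch (like `GENERAL_ID` being `string` on one side and `int` on the other) shows up in both `missing_cols_right` and `missing_cols_left`. A sketch under that assumption, with hypothetical schemas standing in for `test.data_test` and `test.diff_df_2`:

```python
# Sketch of the schema comparison: columns are compared as (name, type)
# pairs, so a type mismatch appears on both sides. This behavior is
# inferred from the sample output, not taken from the library's code;
# the schemas below are hypothetical.
left_schema = {"GENERAL_ID": "string", "ADDR_DESC": "string", "NAME": "string"}
right_schema = {"GENERAL_ID": "int", "NAME": "string"}

left_pairs = set(left_schema.items())
right_pairs = set(right_schema.items())

missing_cols_right = sorted(f"{c}:{t}" for c, t in left_pairs - right_pairs)
missing_cols_left = sorted(f"{c}:{t}" for c, t in right_pairs - left_pairs)
print(missing_cols_right)  # (column, type) pairs present on the left only
print(missing_cols_left)   # (column, type) pairs present on the right only
```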
Install owl-sanitizer from PyPI:

```
pip install owl-sanitizer-data-quality
```

Then you can call the library:
```
from pyspark.sql import SparkSession

from spark_validation.common.config import Config
from spark_validation.dataframe_validation.dataframe_validator import CreateHiveValidationDF

spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
with open(PATH_TO_CONFIG_FILE) as f:
    config = Config.parse(f)
CreateHiveValidationDF.validate(spark_session, config)
```
To use it in your spark-submit command or Airflow DAG:

- `py_files`: [https://pypi.org/project/owl-sanitizer-data-quality/latest/]
- `application`: `owl-sanitizer-data-quality/latest/src/spark_validation/dataframe_validation/hive_validator.py`
- `application_package`: `https://pypi.org/project/owl-sanitizer-data-quality/latest/owl-sanitizer-data-quality-latest.tar.gz`
- `application_params`: `URL_TO_YOUR_REMOTE_CONFIG_FILE`
Please ask questions about technical issues here on GitHub.
Bumps pyspark from 2.4.5 to 3.2.2.

Commits:

78a5825 Preparing Spark release v3.2.2-rc1
ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image d...
9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check
bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot....
c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarInte...
32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecu...
be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check
1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec
3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like func...

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps numpy from 1.18.3 to 1.22.0.

Sourced from numpy's releases:

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements; highlights are:
- Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
- A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
- NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
- New methods for `quantile`, `percentile`, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
- A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10; Python 3.7 has been dropped. Note that 32-bit wheels are only provided for Python 3.8 and 3.9 on Windows; all other wheels are 64-bit on account of Ubuntu, Fedora, and other Linux distributions dropping 32-bit support. All 64-bit wheels are also linked with 64-bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed. Using the strings `"Bytes0"`, `"Datetime64"`, `"Str0"`, `"Uint32"`, and `"Uint64"` as a dtype will now raise a `TypeError`. (gh-19539)

Expired deprecations for `loads`, `ndfromtxt`, and `mafromtxt` in npyio: `numpy.loads` was deprecated in v1.15, with the recommendation that users use `pickle.loads` instead. `ndfromtxt` and `mafromtxt` were both deprecated in v1.17; users should use `numpy.genfromtxt` instead with the appropriate value for the `usemask` parameter. (gh-19615)

... (truncated)

Commits:

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release
fd66547 REL: Prepare for the NumPy 1.22.0 release.
125304b wip
c283859 Merge pull request #20682 from charris/backport-20416
5399c03 Merge pull request #20681 from charris/backport-20954
f9c45f8 Merge pull request #20680 from charris/backport-20663
794b36f Update armccompiler.py
d93b14e Update test_public_api.py
7662c07 Update init.py
311ab52 Update armccompiler.py

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps ipython from 7.13.0 to 7.16.3.

Commits:

d43c7c7 release 7.16.3
5fa1e40 Merge pull request from GHSA-pq7m-3gw7-gq5x
8df8971 back to dev
9f477b7 release 7.16.2
138f266 bring back release helper from master branch
5aa3634 Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...
bcae8e0 Backport PR #13335: What's new 7.16.2
8fcdcd3 Pin Jedi to <0.17.2.
2486838 release 7.16.1
20bdc6f fix conda build

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.