SumStatsRehab is a universal GWAS SumStats pre-processing tool. SumStatsRehab takes care of each of the original data points to maximize statistical power of downstream calculations. Currently, the only supported processing which may result in a loss of the original data is liftover, which is a common task, and is optional to the user.
Examples of what the tool does:
- data validation (e.g.
- data enrichment (e.g. restoration of a missing Chr and BP fields from rsID in the input GWAS SumStats file by rsID lookup in the input dbSNP dataset),
- data correction (restoration of invariably erroneous values),
- data formatting (e.g. sorting),
- data restoration (e.g. calculation of a StdErr field from present p-value and beta fields).
Example of what the tool does not: - data cleaning/reduction
SumStatsRehab aims to be a production-grade software, but you may find it to be not a complete data preparation solution for sumstats-ingesting pipelines. Yet, the tool streamlines the development of the data preparation part. This comes out from focusing on solving more complex problems that span many different use-cases, instead of covering only specific use-cases.
clone this repo
git clone https://github.com/Kukuster/SumStatsRehab.git && cd SumStatsRehab
pip install -r requirements.txt
install SumStatsRehab as a package
python3 setup.py build
python3 setup.py install
run the command using the following syntax:
SumStatsRehab <command> [keys]
diagnose to check the validity of entries in the GWAS SS file.
fix to restore missing/invalid data in the GWAS SS file.
prepare_dbSNPs to preprocess a given dbSNP dataset into 2 datasets, which are used in the
sort to format the input GWAS SS file and sort either by Chr and BP or by rsID.
To use the
fix command to its fullest, a user needs:
- SNPs datasets in the target build, preprocessed with the
- chain file, if the GWAS SS file is provided in build different from the target build
pip install git+https://github.com/Kukuster/SumStatsRehab.git
or, for specific version:
pip install git+https://github.com/Kukuster/[email protected] --upgrade
This installation method doesn't work with the currently upcoming git protocol security update on github: - https://github.blog/2021-09-01-improving-git-protocol-security-github/
Download dbSNP datasets from NCBI, in the target build, in vcf, vcf.gz, bcf, or bcf.gz format. The latest versions are recommended. dbSNP datasets are used to restore the following data: Chr, BP, rsID, OA, EA, EAF. Although only builds 37 and 38 are explicitly supported, build 36 may work as well.
For example, curently latest datasets for build 38 and build 37 can be downloaded here:
A chain file is necessary to perform liftover. If a GWAS SS file is provided in the target build, then a chain file is not used.
see instructions on their websites and/or githubs
recommended bcftools version: 1.11
NOTE: after preprocessing of the necessary dbSNPs is finished, these tools are no longer needed
prepare_dbSNPs using the following syntax:
SumStatsRehab prepare_dbSNPs --dbsnp DBSNP --OUTPUT OUTPUT --gz-sort GZ_SORT --bcftools BCFTOOLS
DBSNP is the dbSNP dataset in vcf, vcf.gz, bcf, or bcf.gz format referencing build 38 or 37
OUTPUT is the base name for the two output dbSNPs datasets
GZ_SORT is a path to the gz-sort executable
BCFTOOLS is a path to the bcftools executable
BUFFER is buffer size for sorting (size of presort), supports k/M/G suffix. Defaults to 1G. Recommended: at least 200M, ideally: 4G or more
Depending on the size of the dataset, specified buffer size, and specs of the machine, preprocessing may take somewhere from 30 minutes to 6 hours.
After preprocessing, steps 4 and 5 may be repeated ad-lib.
Config file is used as meta data for GWAS SS file, and contains: 1) columns' indices (indices start from 0) 2) input build slug (such as "GRCh38", "GRCh37", "hg18", "hg19")
This config file has to have the same file name as the GWAS SS file but with an additional
For example, if your GWAS SS file is named
WojcikG_PMID_htn.gz, and the first 5 lines in the unpacked file are:
Chr Position_hg19 SNP Other-allele Effect-allele Effect-allele-frequency Sample-size Effect-allele-frequency-cases Sample-size-cases Beta SE P-val INFO-score rsid
1 10539 1:10539:C:A C A 0.004378065 49141 0.003603676 27123 -0.1041663 0.1686092 0.5367087 0.46 rs537182016
1 10616 rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0.9916342 49141 0.9901789 27123 -0.1738814 0.109543 0.1124369 0.604 rs376342519
1 10642 1:10642:G:A G A 0.006042409 49141 0.007277901 27123 0.1794246 0.1482529 0.226179 0.441 rs558604819
1 11008 1:11008:C:G C G 0.1054568 49141 0.1042446 27123 -0.007140072 0.03613677 0.84337 0.5 rs575272151
your config file should have the name
WojcikG_PMID_htn.gz.json and the following contents:
- SumStatsRehab will only consider data from the columns which indices are specified in the config file. If one of the above columns is present in the SS file but wasn't specified in the config file, then SumStatsRehab treats the column as missing.
- In this example, all the 11 columns from the list of supported columns are present. Yet, none of the columns above are mandatory. If certain columns are missing, the
fix command will attempt to restore them.
When the config file is created, and dbSNP datasets are preprocessed, the chain file is downloaded if necessary, then the
fix command can use all its features.
Although it is normally a part of the execution of the
fix command, a user may choose to manually run
diagnose is ran without additional arguments, it is "read-only", i.e. doesn't write into the file system.
diagnose as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE
INPUT_GWAS_FILE is the path to the GWAS SS file with the corresponding config file at
as a result, it will generate the main plot: stacked histogram plot, and an additional bar chart plot for each of the bins in the stacked histogram plot.
These plots will pop up in a new matplotlib window.
The stacked histogram maps the number of invalid SNPs against p-value, allowing assessment of the distribution of invalid SNPs by significance. On the histogram, valid SNPs are shown as blue, and SNPs that have issues are shown as red. The height of the red plot over each bin with the red caption represents the proportion of invalid SNPs in the corresponding bin.
A bar chart is generated for each bin of the stacked histogram plot and reports the number of issues that invalidate the SNP entries in a particular bin.
If a Linux system runs without GUI, the report should be saved on the file system. For this, run the command as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE --REPORT-DIR REPORT_DIR
REPORT_DIR is an existing or not existing directory under which the generated report will be contained. When saved onto a disk, the report also includes a small table with exact numbers of invalid fields and other issues in the GWAS SS file.
Finally, a user may want to decide to run the
A user should run the
fix command as follows:
SumStatsRehab fix --INPUT INPUT_GWAS_FILE --OUTPUT OUTPUT_FILE
[--dbsnp-1 DBSNP1_FILE] [--dbsnp-2 DBSNP2_FILE]
INPUT_GWAS_FILE is the input GWAS SS file with the corresponding
.json config file create at step 4
OUTPUT_FILE is the base name for the fixed file(s)
DBSNP1_FILE is a path to the preprocessed dbSNP #1
DBSNP2_FILE is a path to the preprocessed dbSNP #2
CHAIN_FILE is a path to the chain file
FREQ_DATABASE_SLUG is a slug of a frequency database contained in the dbSNP
SumStatsRehab fix --INPUT "29559693.tsv" --OUTPUT "SumStatsRehab_fixed/29559693" --dbsnp-1 "dbSNP_155_b38.1.tsv.gz" --dbsnp-2 "dbSNP_155_b38.2.tsv.gz" --chain-file "hg19_to_hg38.chain" --freq-db TOPMED --do-not-restore OA EA
As the normal process of
fix, a report will be generated for the input file, as well as for the file after each step of processing. Depending on the availability of invalid/missing data in the GWAS SS file and the input arguments, a different number of steps may be required for a complete run of the
fix command, with 1 or 2 loops performed on the GWAS SS file. All steps are performed automatically without prompt. The process of
fixing is represented in logging to the standard output and may take anywhere from 5 minutes to 1.5 hours, depending on the size of the file and the number of steps.
As a result, if 1 loop was required to fix the file, then the resulting file will be available with the suffix
.rehabed.tsv. If 2 loops were required, then the resulting file is available with the suffix
The report made with a
diagnose command will be available in a separate directory for:
- the input file
- for the file after 1 loop of fixing
- for the file after 2 loops of fixing (applicable only if 2 loops were required)
Please refer to the instructions by running
SumStatsRehab <command> -h
Config file is a json object which supports the following properties:
should be evaluated to an integer: the corresponding column index starting from 0.
should be evaluated to either one of the following (case insensitive):
should be evaluated to an array of integers: indicies of other columns to include
It is also possible to set
EAF to a weighted average of multiple colums. E.g. if there are separate freq. columns for case and control groups, and average freq. is needed, number of participants in each group will serve as weights for the two columns:
fix command, the input sumstats file may undergo sorting. If you want any other columns to be included in the resulting fixed file, add 0-indexed column indices in an array as the
"other" parameter in the config.
"other": [7, 8, 2],
With this entry, the resulting file will have 3 additional columns at the end in the order as their indices appear in this config entry.
sort always output files in this format. Internally, input files are always converted before undergoing any processing.
lib/prepare_GWASSS_columns.py is responsible for formatting the input.
The two key functions of SumStatsRehab are validation and restoration, implemented for 9 data categories: chromosome, base pair position, rsID, effect allele, other allele, allele frequency, standard error, beta, p-value. Each column is validated independently of the others, and is regarded to have two possible states: valid and invalid. With the exception of the case where only one of either the chromosome or base pair position is a valid entry, valid entries are always kept and invalid entries are subject to restoration. 1. Value in the chromosome field is considered to be valid if and only if it includes one of the following values: '1', '01', '2', '02', '3', '03', '4', '04', '5', '05', '6', '06', '7', '07', '8', '08', '9', '09','10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', 'X', 'x', 'Y', 'y', 'M', 'm'. In practice, ‘chr1’, ‘chr01’, ‘Chr1’,‘Chr01’, ‘1’, and ‘01’ are all similarly recognized as referring to chromosome 1. Any entries which do not include a specific chromosome number reference are subject to restoration. 2. Base pair position entry is considered to be valid if and only if it is a non-negative integer. 3. rsID entry is considered valid if and only if it is an arbitrary integer value with ‘rs’ prefix. 4. Effect allele and other allele entries are considered to be valid if and only if the value is either a dash (representing a deleted allele), or any non-empty combination of letters A, T, C, and G. 5. Any floating-point value between 0 and 1 inclusively is considered to be a valid allele frequency entry and p-value entry. 6. Any floating-point value is considered to be valid standard error entry and beta entry.
While most of the columns are assessed independently, the restoration process often involves multiple columns. 1. Chromosome and base pair position entries are only restored together for any particular row. If any of the two values are invalid or missing, and rsID entry in the row is present and valid, then the two columns are subject to restoration. This is the only case where a single potentially valid entry can be rewritten or removed as a result of restoration. Chromosome and base pair position entries are restored by a lookup in the preprocessed DB2 for a matching rsID entry. If a match wasn’t found, both chromosome and base pair position entries are replaced with a dot ‘.’, representing a missing value. 2. Similar to the current implementation of MungeSumstats, SumStatsRehab restores rsID entry by chromosome and base pair position. If rsID entry is missing or invalid, but both chromosome and base pair position are valid, then SumStatsRehab attempts to restore rsID by a lookup in the preprocessed DB1 for matching chromosome and base pair position entries. If a match wasn’t found, rsID is set to a dot ‘.’ This requires chromosome and base pair position entries in the row to be present and valid. 3. EA and OA entries are restored in the same way. If one of the two alleles is invalid or missing and the other one is present, and either rsID or both chromosome and base pair position are valid, then the missing allele can be restored. First, if rsID is present, then the missing allele is restored by a lookup in the preprocessed DB2 for matching rsID entry and the present allele. When rsID is not present, but Chr and BP are present, then the missing allele is restored by a lookup in the preprocessed DB1 for matching Chr and BP entries and the present allele. Otherwise, or if a lookup fails, then no entries are rewritten. 4. Allele frequency is restored as the effect allele frequency, and it’s done in a similar way: first by a lookup in DB1 for matching rsID and EA, if rsID is present; second by a lookup into DB2 for matching Chr and BP and EA. Restoring allele frequency requires frequency entries in the INFO field in the SNPs dataset that was preprocessed. By default, frequencies are taken from the database of Genotypes and Phenotypes (dbGaP), but an alternative database contained in the dbSNP dataset may be used as an argument in the FIX command. 5. Each of the columns, standard error, beta, and p-value, are restored from the other two, using the following relation: s = β/z, where s is the standard error, β is beta, and z is z-score that corresponds to the p-value in the two-tailed test (Zhu et al. 2016). If standard error is signed, then beta can be restored accurately, but if standard error is provided as an absolute value, the restored beta will be correct only as an absolute value, with an unknown sign. Files restored in this manner will have little utility in downstream applications, but may be useful in comparing relative effect sizes.
Warning: Restored betas with an unknown sign should not be utilized in any downstream application.
fixcommand to turn on and off the restoration of the ambiguous effect alleles (in cases when the dbSNPs present multiple options for ALT)
loop_fix.py: make a separate function loopDB1 and loopDB2 that will loop through enough entries in a DB before every resolver and rewrite a "global" object with properties to be fields from the DB: rsID, Chr, BP, alleles, EAF. So resolvers for rsID and ChrBP will be similar to ones for alleles and EAF. Resolvers for these fields then should operate on
fieldsand that object with fields from a DB. This way a really strong optimization, flexibility, and modularity of resolvers will be achieved.
run_alldoesn't have to have resolvers and resolvers_args object to be passed, it can just use the global ones.
wc. Need to do more tests of SumStatsRehab with a different versions of
mawk. E.g. even though
gawkis default for GNU/Linux, Ubuntu has
diagnosecommand script should also generate a bar chart for the bin with missing p-value
diagnosisand removes all the invalid rows.
lib/loop_fix.py: it has to catch speicifc Exceptions everywhere
--verbosekey was not set, then remove the intermediate files.
fixcommand interface: if
--verboseis set, then forbid the
--OUTPUTargument and allow
fixcommand was not set to
--verbose, then if the
--OUTPUTpath ends with
.zip, then compress the resulting file on the output.
Hello, I have been having an issue converting a GWAS file from genome build 38 to genome build 37 (hg19). In order to do this, I specified hg38 for "build" in the .json file of the GWAS, and then provided the two prepared DBSNP files from hg19 (--dbsnp-1 and --dbsnp-2), as well as the --chain-file that specifies conversion from hg38 to hg19.
This results in the program finishing at Step 3 without performing LiftOver as below: ``` SumStatsRehab v1.2.1 - fix command input build: hg38 === Step 1: Format the GWAS SS file === the SumStats file is a gzip. Unpacking Step 1 finished in 67.09121966362 seconds
=== Step 2: Validate entries in the formatted GWAS SS file and save the report === number of lines in the file: 20155434 validating entries : 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [02:08<00:00, 157126.27it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [01:15<00:00, 268250.15it/s] generating reports found issues: rsID: 790496/20155433 (3.92%) Step 2 finished in 205.60818552970886 seconds
=== Step 3: Analyze the report and prepare for REHAB === Step 3 finished in 0.0001633167266845703 seconds
The input file has nothing to resolve
To see if it would do anything, I decided to instead specify the "build" of my hg38 GWAS to hg19 in the .json file, and provide the same --dbsnp and --chain-file files as above, which resulted in LiftOver being performed successfully:
``` SumStatsRehab v1.2.1 - fix command input build: hg19 === Step 1: Format the GWAS SS file === the SumStats file is a gzip. Unpacking Step 1 finished in 68.19040560722351 seconds
=== Step 2: Validate entries in the formatted GWAS SS file and save the report === number of lines in the file: 20155434 validating entries : 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [02:07<00:00, 157908.08it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [01:15<00:00, 265598.63it/s] generating reports found issues: rsID: 790496/20155433 (3.92%) Step 2 finished in 205.75064539909363 seconds
=== Step 3: Analyze the report and prepare for REHAB === lifting over : 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [01:29<00:00, 226464.01it/s] finished liftover to hg38 (saved report) number of lines in the file: 20155434 validating entries : 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [02:10<00:00, 154159.25it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [01:16<00:00, 262338.98it/s] generating reports found issues: rsID: 790496/20155433 (3.92%) Chr: 834211/20155433 (4.14%) BP: 826569/20155433 (4.1%) 790496/20155433 entries are missing rsID Going to sort the GWAS SS file by Chr and BP Sorted by Chr and BP Step 3 finished in 341.96107244491577 seconds
=== Step 4: REHAB: loopping through the GWAS SS file and fixing entries === loop-fix : 100%|████████████████████████████████████████████████████████████████████████| 20155433/20155433 [26:07<00:00, 12858.91it/s] Step 4 finished in 1567.431545972824 seconds
=== Step 5: Validate entries in the fixed GWAS SS file and save the report === number of lines in the file: 20155434 validating entries : 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [02:11<00:00, 153484.10it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████| 20155433/20155433 [01:17<00:00, 259726.21it/s] generating reports found issues: rsID: 784818/20155433 (3.89%) Chr: 834211/20155433 (4.14%) BP: 826569/20155433 (4.1%) Step 5 finished in 211.63463640213013 seconds
=== Step 6: Analyze the report after REHAB === lost 834211 (4.14%) "Chr" fields after liftover lost 826569 (4.1%) "BP" fields after liftover restored 5678 (0.03%) "rsID" fields Step 6 finished in 0.00021147727966308594 seconds
Those issues which were possible to resolve have been resolved
``` This produces a file which appears to be successfully transferred from hg38 to hg19. Is the LiftOver function of SumStatsRehab only intended to be from lower genome builds to hg38? It seems like specifying an hg38 GWAS as hg19, and then matching the dbsnp files to that lower build with a chain file that goes from hg38 to hg19 works. However, specifying the hg38 GWAS as its actual build does nothing if you want to convert it to build hg19.
pip install -r requirements.txt
ERROR: Failed building wheel for scipy Failed to build scipy ERROR: Could not build wheels for scipy, which is required to install pyproject.toml-based projects
Detail: a new conda env based on python3
pip install scipy==1.6.1, same problem occored, I searched and have no idea to slove this problem.
and I tried
pip install pyproject.toml, it seems have no thing to do with this problem
Step 2: Validate entries in the formatted GWAS SS file and save the report === number of lines in the file: 28987535 validating entries : 100%|███████████████████████████████████████████████████████████████████████████████████████████| 28987534/28987534 [03:02<00:00, 159012.85it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 28987534/28987534 [01:04<00:00, 446130.26it/s] generating reports found issues: rsID: 28987534/28987534 (100.0%) EAF: 5145971/28987534 (17.75%) SE: 5145971/28987534 (17.75%) beta: 5145971/28987534 (17.75%) pval: 28987534/28987534 (100.0%) Traceback (most recent call last): File "/usr/local/bin/SumStatsRehab", line 11, in <module> load_entry_point('SumStatsRehab==1.2.1', 'console_scripts', 'SumStatsRehab')() File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/SumStatsRehab.py", line 730, in main fix(args.INPUT_GWAS_FILE, args.OUTPUT_FILE, File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/SumStatsRehab.py", line 199, in fix validate_GWASSS_entries( File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/lib/validate_GWASSS_entries.py", line 551, in validate_GWASSS_entries write_report_to_dir(issues_count, REPORT_ABS_DIR) File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/lib/report_utils.py", line 23, in write_report_to_dir with open(os.path.join(REPORT_DIR, INVALID_ENTRIES_REPORT_FILENAME), 'w') as f: PermissionError: [Errno 13] Permission denied: '/home/carlos/ssfiles/Bronchitis-row1722.bgz_input-report/invalid_entries.csv
The fix command fails halfway through execution, at Step 7. Error log:
``` === Step 6: Analyze the report after REHAB === Going to sort the GWAS SS file by Chr and BP Sorted by Chr and BP Step 6 finished in 115.71317148208618 seconds
=== Step 7: REHAB: loopping through the GWAS SS file again and fixing entries ===
An error occured while looping through the SNPs file (see below)
An error occured on line 1 of the GWAS SS file (see below)
Traceback (most recent call last):
File "/usr/local/bin/SumStatsRehab", line 11, in
Reported by Mahantesh B
summary-statistics gwas gwas-summary-statistics gwas-pipeline data-preparation data-preprocessing data-prep bioinformatics bioinformatics-tool sumstats