Easy genetic ancestry predictions in Python

arvkevi, updated 🕥 2022-12-09 05:56:00

ezancestry


Easily visualize your direct-to-consumer genetics next to 2500+ samples from the 1000 Genomes Project. Evaluate how well a custom set of ancestry-informative SNPs (AISNPs) classifies the genetic ancestry of the 1000 Genomes samples using a machine learning model.

A subset of single nucleotide polymorphisms (SNPs) from the 1000 Genomes Project samples has been parsed from the publicly available .bcf files.
The SNPs in this subset, called ancestry-informative SNPs (AISNPs), were chosen from two publications:

* Set of 55 AISNPs. Progress toward an efficient panel of SNPs for ancestry inference. Kidd et al. 2014
* Set of 128 AISNPs. Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Kosoy et al. 2009 (Seldin Lab)

ezancestry ships with pretrained k-nearest neighbor models for all combinations of the following:

* Kidd (55 AISNPs)
* Seldin (128 AISNPs)

* continental-level population (superpopulation)
* regional population (population)

* principal component analysis (PCA)
* neighborhood component analysis (NCA)
* uniform manifold approximation and projection (UMAP)
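To make "all combinations" concrete, here is a minimal sketch that enumerates them. The lowercase spellings are illustrative and may not match the names ezancestry uses internally or in its model filenames.

```python
from itertools import product

# Values taken from the three lists above; the spellings are illustrative
# and are not guaranteed to match ezancestry's internal names.
AISNP_SETS = ["kidd", "seldin"]
POPULATION_LEVELS = ["superpopulation", "population"]
ALGORITHMS = ["pca", "nca", "umap"]


def model_combinations():
    """Return every (AISNP set, population level, algorithm) triple."""
    return list(product(AISNP_SETS, POPULATION_LEVELS, ALGORITHMS))


if __name__ == "__main__":
    combos = model_combinations()
    print(len(combos))  # 2 * 2 * 3 = 12
    for aisnps_set, level, algorithm in combos:
        print(aisnps_set, level, algorithm)
```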


Installation

Install ezancestry with pip:

```shell
pip install ezancestry
```

Or clone the repository and run pip install from the directory:

```shell
git clone git@github.com:arvkevi/ezancestry.git
cd ezancestry
pip install .
```

Config

The first time ezancestry is run, it generates a conf.ini file and a data/ directory under ${HOME}/.ezancestry. You can edit conf.ini to change the default settings, but doing so is not required to use ezancestry. The settings exist so that you don't have to be verbose when interacting with the software. Each setting is also a keyword argument to the commands in the ezancestry API, so you can always override the defaults.

These will be created in your home directory:

```shell
${HOME}/.ezancestry/conf.ini
${HOME}/.ezancestry/data/
```
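If you want to inspect the generated settings from Python, the standard library can read the file. This is a minimal sketch that assumes conf.ini is a regular INI file; it makes no assumption about section or option names and simply lists whatever is present. The `with_overrides` helper is hypothetical and only illustrates the "defaults plus keyword overrides" idea described above, not ezancestry's own code.

```python
import configparser
from pathlib import Path

# Location taken from the text above.
CONF_PATH = Path.home() / ".ezancestry" / "conf.ini"


def read_settings(conf_path=CONF_PATH):
    """Return {section: {option: value}} for conf.ini, or {} if it is missing."""
    parser = configparser.ConfigParser()
    # ConfigParser.read silently skips files that do not exist.
    parser.read(conf_path)
    return {section: dict(parser[section]) for section in parser.sections()}


def with_overrides(defaults, **overrides):
    """Merge keyword overrides onto defaults, ignoring overrides left as None."""
    merged = dict(defaults)
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


if __name__ == "__main__":
    for section, options in read_settings().items():
        print(f"[{section}]")
        for name, value in options.items():
            print(f"  {name} = {value}")
```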

Each setting is described in the Options section of each command's --help output, for example:

```shell
ezancestry predict --help

Usage: ezancestry predict [OPTIONS] INPUT_DATA

  Predict ancestry from genetic data.

  * Default arguments are from the ~/.ezancestry/conf.ini file. *

Arguments:
  INPUT_DATA  Can be a file path to raw genetic data (23andMe, ancestry.com,
              .vcf) file, a path to a directory containing several raw genetic
              files, or a (tab or comma) delimited file with sample ids as
              rows and snps as columns.  [required]

Options:
  --output-directory TEXT         The directory where to write the prediction
                                  results file
  --write-predictions / --no-write-predictions
                                  If True, write the predictions to a file. If
                                  False, return the predictions as a
                                  dataframe.  [default: True]
  --models-directory TEXT         The path to the directory where the model
                                  files are located.
  --aisnps-directory TEXT         The path to the directory where the AISNPs
                                  files are located.
  --n-components INTEGER          The number of components to use in the PCA
                                  dimensionality reduction.
  --k INTEGER                     The number of nearest neighbors to use in
                                  the KNN model.
  --thousand-genomes-directory TEXT
                                  The path to the 1000 genomes directory.
  --samples-directory TEXT        The path to the directory containing the
                                  samples.
  --algorithm TEXT                The dimensionality reduction algorithm to
                                  use. Choose pca|umap|nca
  --aisnps-set TEXT               The name of the AISNP set to use. To start,
                                  choose either 'Kidd' or 'Seldin'. The
                                  default value in conf.ini is 'Kidd'. If
                                  using your AISNP set, this value will be in
                                  the naming convention for all the new model
                                  files that are created
  --help                          Show this message and exit.
```

Usage

ezancestry can be used as a command-line tool or as a Python library. ezancestry predict comes with pre-trained models when --aisnps-set="Kidd" (default) or --aisnps-set="Seldin".

build-model and generate-dependencies are for advanced users -- they download large amounts of data and build a new model from a custom AISNPs file.

command-line interface

There are four commands available:

  1. predict: predict the genetic ancestry of a sample or cohort of samples using the nearest neighbors model.
  2. plot: plot the genetic ancestry of samples using only the output of predict.
  3. generate-dependencies: generate the dependencies for build-model.
  4. build-model: build a nearest neighbors model from the 1000 genomes data using a custom set of AISNPs. Requires: generate-dependencies to be run first.
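For readers unfamiliar with the term, the "nearest neighbors model" named in the list above can be illustrated with a tiny pure-Python classifier. This is a generic sketch of k-nearest-neighbor majority voting on made-up 2-D points; it is not ezancestry's implementation, and `knn_predict` is a hypothetical name.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up 2-D coordinates standing in for dimensionality-reduced genotypes.
    points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
    labels = ["A", "A", "A", "B", "B", "B"]
    print(knn_predict(points, labels, (0.3, 0.2)))
    print(knn_predict(points, labels, (4.9, 5.2)))
```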

Use the commands in the following way:

predict

ezancestry can predict the genetic ancestry of a sample or cohort of samples using the nearest neighbors model. The input_data can be a path to a raw genetic data file (23andMe, ancestry.com, .vcf), a path to a directory containing several raw genetic data files, or a (tab- or comma-) delimited file with sample ids as rows and SNPs as columns.

This writes a file, predictions.csv, to the output_directory (defaults to the current directory). The file contains the predicted ancestry for each sample.
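To take a quick look at the results from Python without assuming any particular column names, something like the following sketch may help. It assumes predictions.csv is comma-delimited with a header row; `summarize_predictions` is a hypothetical helper, not part of ezancestry.

```python
import csv
from pathlib import Path


def summarize_predictions(path):
    """Return (column_names, row_count) for a predictions CSV file."""
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, len(rows)


if __name__ == "__main__":
    predictions = Path("predictions.csv")
    if predictions.exists():
        columns, n_rows = summarize_predictions(predictions)
        print(f"{n_rows} rows")
        print("columns:", ", ".join(columns))
    else:
        print("predictions.csv not found; run `ezancestry predict` first")
```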

Direct-to-consumer genetic data file (23andMe, ancestry.com, etc.):

```shell
ezancestry predict mygenome.txt
```

Directory of direct-to-consumer genetic data files or .vcf files:

```shell
ezancestry predict /path/to/genetic_datafiles
```

Comma-separated file with sample ids as rows and SNPs as columns, filled with genotypes as values:

```shell
ezancestry predict ${HOME}/.ezancestry/data/aisnps/thousand_genomes.KIDD.dataframe.csv
```
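When scripting several runs, it can be convenient to assemble the command line in Python. The sketch below only uses option names that appear in the `--help` output earlier in this README; `build_predict_command` is a hypothetical helper, and the actual call is left commented out so nothing runs by accident.

```python
import shlex


def build_predict_command(input_data, output_directory=None, aisnps_set=None,
                          algorithm=None, k=None, n_components=None):
    """Build the argv list for `ezancestry predict`, adding only the options given."""
    argv = ["ezancestry", "predict"]
    options = {
        "--output-directory": output_directory,
        "--aisnps-set": aisnps_set,
        "--algorithm": algorithm,
        "--k": k,
        "--n-components": n_components,
    }
    for flag, value in options.items():
        if value is not None:
            argv += [flag, str(value)]
    # INPUT_DATA is the single positional argument and goes last.
    argv.append(str(input_data))
    return argv


if __name__ == "__main__":
    command = build_predict_command("mygenome.txt", aisnps_set="Seldin", algorithm="pca")
    print(shlex.join(command))
    # To actually run it (requires ezancestry on PATH):
    # import subprocess
    # subprocess.run(command, check=True)
```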

plot

Visualize the output of predict using the plot command. This will open a 3d scatter plot in a browser.

```shell
ezancestry plot predictions.csv
```

generate-dependencies

This command will download all of the data required to build a new nearest neighbors model for a custom set of AISNPs. This command will attempt to download all the .bcf files from The 1000 Genomes Project. If you want to use existing models, see predict and plot.

Without any arguments this command will download all necessary data to build new models and put it in the ${HOME}/.ezancestry/data/ directory.

```shell
ezancestry generate-dependencies
```

Now you are ready to build a new model with build-model.

build-model

Test the discriminative power of your custom set of AISNPs.

This command will build all the necessary models to visualize and predict the 1000 Genomes samples as well as user-uploaded samples. A model performance evaluation report will be generated for a five-fold cross-validation on the training set of the 1000 Genomes samples, along with a report for the holdout set.
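The evaluation scheme described above (a holdout set kept aside, plus five-fold cross-validation on the remaining training samples) can be sketched with plain index bookkeeping. This is a generic illustration, not ezancestry's code: the 20% holdout fraction, the seed, and the sample count of 2500 are assumptions chosen for the example.

```python
import random


def holdout_and_folds(n_samples, holdout_fraction=0.2, n_folds=5, seed=0):
    """Split sample indices into a holdout set and n_folds cross-validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_samples * holdout_fraction)
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    # Deal the training indices round-robin so fold sizes differ by at most one.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return holdout, folds


if __name__ == "__main__":
    # 2500 is a round illustrative number, not the exact 1000 Genomes sample count.
    holdout, folds = holdout_and_folds(2500)
    for i, validation in enumerate(folds):
        n_fit = sum(len(fold) for j, fold in enumerate(folds) if j != i)
        print(f"fold {i}: fit on {n_fit}, validate on {len(validation)}")
    print(f"holdout: {len(holdout)} samples evaluated once at the end")
```

Each fold serves once as the validation set while the other four are used for fitting; the holdout samples are never seen during cross-validation.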

Create a custom AISNP file here: ~/.ezancestry/data/aisnps/custom.AISNP.txt. The prefix of the filename, custom, can be whatever you want. Note that this value is used as the aisnps-set keyword argument for other ezancestry commands.

The file should look like this:

```
id chromosome position_hg19
rs731257 7 12669251
rs2946788 11 24010530
rs3793451 9 71659280
rs10236187 7 139447377
rs1569175 2 201021954
```

```shell
ezancestry build-model --aisnps-set=custom
```
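A small sketch for writing and sanity-checking a custom AISNP file before building a model. The three column names and the five example SNPs come from the text above; the tab delimiter used when writing is an assumption (the reader accepts any whitespace), and both helper functions are hypothetical rather than part of ezancestry.

```python
from pathlib import Path

EXPECTED_HEADER = ["id", "chromosome", "position_hg19"]

# The five example SNPs listed above.
EXAMPLE_ROWS = [
    ("rs731257", "7", 12669251),
    ("rs2946788", "11", 24010530),
    ("rs3793451", "9", 71659280),
    ("rs10236187", "7", 139447377),
    ("rs1569175", "2", 201021954),
]


def write_aisnps(path, rows, delimiter="\t"):
    """Write an AISNP file; the tab delimiter is an assumption."""
    lines = [delimiter.join(EXPECTED_HEADER)]
    lines += [delimiter.join(str(field) for field in row) for row in rows]
    Path(path).write_text("\n".join(lines) + "\n")


def read_aisnps(path):
    """Parse an AISNP file and check the header and the position column."""
    lines = Path(path).read_text().splitlines()
    header, *body = [line.split() for line in lines if line.strip()]
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for fields in body:
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {fields}")
        snp_id, chromosome, position = fields
        rows.append((snp_id, chromosome, int(position)))
    return rows


if __name__ == "__main__":
    # Written to the current directory; move it into ~/.ezancestry/data/aisnps/ when ready.
    target = Path("custom.AISNP.txt")
    write_aisnps(target, EXAMPLE_ROWS)
    print(read_aisnps(target))
```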

Python API

See the notebook

Visualization

http://ezancestry.herokuapp.com/


Contributing

Contributions are welcome! Please feel free to create an issue for discussion or make a pull request.

Issues

Bump certifi from 2021.5.30 to 2022.12.7

opened on 2022-12-09 05:56:00 by dependabot[bot]

Bumps certifi from 2021.5.30 to 2022.12.7.


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/arvkevi/ezancestry/network/alerts).

Bump protobuf from 3.17.3 to 3.18.3

opened on 2022-09-23 22:59:07 by dependabot[bot]

Bumps protobuf from 3.17.3 to 3.18.3.

Release notes

Sourced from protobuf's releases.

Protocol Buffers v3.18.3

C++

Protocol Buffers v3.18.2

Java

  • Improve performance characteristics of UnknownFieldSet parsing (#9371)

Protocol Buffers v3.18.1

Python

  • Update setup.py to reflect that we now require at least Python 3.5 (#8989)
  • Performance fix for DynamicMessage: force GetRaw() to be inlined (#9023)

Ruby

  • Update ruby_generator.cc to allow proto2 imports in proto3 (#9003)

Protocol Buffers v3.18.0

C++

  • Fix warnings raised by clang 11 (#8664)
  • Make StringPiece constructible from std::string_view (#8707)
  • Add missing capability attributes for LLVM 12 (#8714)
  • Stop using std::iterator (deprecated in C++17). (#8741)
  • Move field_access_listener from libprotobuf-lite to libprotobuf (#8775)
  • Fix #7047 Safely handle setlocale (#8735)
  • Remove deprecated version of SetTotalBytesLimit() (#8794)
  • Support arena allocation of google::protobuf::AnyMetadata (#8758)
  • Fix undefined symbol error around SharedCtor() (#8827)
  • Fix default value of enum(int) in json_util with proto2 (#8835)
  • Better Smaller ByteSizeLong
  • Introduce event filters for inject_field_listener_events
  • Reduce memory usage of DescriptorPool
  • For lazy fields copy serialized form when allowed.
  • Re-introduce the InlinedStringField class
  • v2 access listener
  • Reduce padding in the proto's ExtensionRegistry map.
  • GetExtension performance optimizations
  • Make tracker a static variable rather than call static functions
  • Support extensions in field access listener
  • Annotate MergeFrom for field access listener
  • Fix incomplete types for field access listener
  • Add map_entry/new_map_entry to SpecificField in MessageDifferencer. They record the map items which are different in MessageDifferencer's reporter.
  • Reduce binary size due to fieldless proto messages
  • TextFormat: ParseInfoTree supports getting field end location in addition to start.
  • Fix repeated enum extension size in field listener
  • Enable Any Text Expansion for Descriptors::DebugString()
  • Switch from int{8,16,32,64} to int{8,16,32,64}_t

... (truncated)


[WIP] versioned UMAP models

opened on 2022-09-11 11:56:27 by arvkevi

Get unit tests working on Actions

opened on 2022-09-01 09:59:36 by arvkevi

Write additional unit tests to test the basic functionality of using ezancestry as a library.

Bump nbconvert from 6.1.0 to 6.5.1

opened on 2022-08-23 18:43:42 by dependabot[bot]

Bumps nbconvert from 6.1.0 to 6.5.1.

Release notes

Sourced from nbconvert's releases.

Release 6.5.1

No release notes provided.

6.5.0

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.5...6.5

6.4.3

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.2...6.4.3

6.4.0

... (truncated)


Unable to load umap model

opened on 2022-08-22 09:38:29 by redjay8

Hi, I'm unable to load the included umap models. PCA models work.

When running the following code,

```python
# write all the super population dimred models for kidd and Seldin
for aisnps_set, df, df_labels in zip(
    ["kidd", "Seldin"],
    [df_kidd_encoded, df_seldin_encoded],
    [df_kidd["superpopulation"], df_seldin["superpopulation"]]
):
    for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
        print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
        df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
        knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)
```

I get the error below:

```
2022-08-22 17:16:03.823 | INFO | ezancestry.dimred:dimensionality_reduction:126 - Successfully loaded a dimensionality reduction model
pca kidd False None
umap kidd False None

AttributeError                            Traceback (most recent call last)
Input In [17], in ()
      7     for algorithm, labels in zip(["pca", "umap", "nca"], [None, None, None, df_labels]):
      8         print(algorithm,aisnps_set,OVERWRITE_MODEL,labels)
----> 9         df_reduced = dimensionality_reduction(df, algorithm=algorithm, aisnps_set=aisnps_set, overwrite_model=OVERWRITE_MODEL, labels=labels, population_level="super population")
     10         knn_model = train(df_reduced, df_labels, algorithm=algorithm, aisnps_set=aisnps_set, k=9, population_level="superpopulation", overwrite_model=OVERWRITE_MODEL)

File ~/ezancestry/ezancestry/dimred.py:107, in dimensionality_reduction(df, algorithm, aisnps_set, n_components, overwrite_model, labels, population_level, models_directory, random_state)
    105 if algorithm in set(["pca", "umap"]):
    106     try:
--> 107         reducer = joblib.load(
    108             models_directory.joinpath(f"{algorithm}.{aisnps_set}.bin")
    109         )
    110     except FileNotFoundError:
    111         return None

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:587, in load(filename, mmap_mode)
    581 if isinstance(fobj, str):
    582     # if the returned file object is a string, this means we
    583     # try to load a pickle file generated with an version of
    584     # Joblib so we load it with joblib compatibility function.
    585     return load_compatibility(fobj)
--> 587 obj = _unpickle(fobj, filename, mmap_mode)
    588 return obj

File ~/opt/anaconda3/lib/python3.9/site-packages/joblib/numpy_pickle.py:506, in _unpickle(fobj, filename, mmap_mode)
    504 obj = None
    505 try:
--> 506     obj = unpickler.load()
    507     if unpickler.compat_mode:
    508         warnings.warn("The file '%s' has been generated with a "
    509                       "joblib version less than 0.10. "
    510                       "Please regenerate this pickle file."
    511                       % filename,
    512                       DeprecationWarning, stacklevel=3)

File ~/opt/anaconda3/lib/python3.9/pickle.py:1212, in _Unpickler.load(self)
   1210     raise EOFError
   1211 assert isinstance(key, bytes_types)
-> 1212 dispatch[key[0]](self)
   1213 except _Stop as stopinst:
   1214     return stopinst.value

File ~/opt/anaconda3/lib/python3.9/pickle.py:1589, in _Unpickler.load_reduce(self)
   1587 args = stack.pop()
   1588 func = stack[-1]
-> 1589 stack[-1] = func(*args)

File ~/opt/anaconda3/lib/python3.9/site-packages/numba/core/serialize.py:97, in _unpickle__CustomPickled(serialized)
     92 def _unpickle__CustomPickled(serialized):
     93     """standard unpickling for _CustomPickled.
     94
     95     Uses NumbaPickler to load.
     96     """
---> 97     ctor, states = loads(serialized)
     98     return _CustomPickled(ctor, states)

AttributeError: Can't get attribute '_rebuild_function' on
```

I have tested that it is certainly the UMAP model that is causing the issue.

```python
import pandas as pd
import joblib

obj = joblib.load(r"/Users/jacksonc08/ezancestry/data/models/umap.kidd.bin")
```

This gives the same error.

Looking online, it seems to be an issue with the numba package (a dependency of umap-learn rather than of joblib), which no longer includes the _rebuild_function function. See here.

Do you have any recommendations on how to fix this error? Many thanks.
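A pickled model stores references to functions by module path, so loading it with a library version that has since removed or renamed one of them fails with an AttributeError like the one above. When reporting or debugging this kind of problem, it helps to record the installed versions. The sketch below uses only the standard library; the package list is an assumption about what is relevant here, not taken from ezancestry's requirements.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI; this selection is an assumption.
PACKAGES = ["ezancestry", "umap-learn", "numba", "joblib", "scikit-learn"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in installed_versions().items():
        print(f"{name}: {installed or 'not installed'}")
```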

Releases

Pin cyvcf2 2022-05-13 23:16:52

Lowercase filenames 2021-10-19 02:11:13

Mixed case filenames were not compatible on all OS.

Python 3.7 support 2021-10-04 11:34:53

v0.0.4 Release 2021-09-21 01:40:55

  • Include population descriptions in the output

v0.0.3 release 2021-09-20 10:44:32

  • Fixed bugs when parsing the sample identifier from files
  • Fixed reading DataFrames as user input

v0.0.2 release 2021-09-14 01:58:10

Include data in the package. Include the package itself.

Kevin Arvai

Data science & clinical genomics


genomics genomics-visualization personal-genomics streamlit data-visualization dimensionality-reduction ancestry genotypes