GEne Cluster prediction with COnditional random fields.

zellerlab, updated 🕥 2023-03-21 16:16:00

Hi, I'm GECCO!

🦎 Overview

GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).


🔧 Installing GECCO

GECCO is implemented in Python and supports Python 3.7 and later. It requires additional libraries that can be installed directly from PyPI, the Python Package Index.

Use pip to install GECCO on your machine:

    $ pip install gecco-tool

If you'd rather use Conda, a package is available in the bioconda channel. You can install it with:

    $ conda install -c bioconda gecco

This will install GECCO, its dependencies, and the data needed to run predictions. This requires around 40MB of data to be downloaded, so it could take some time depending on your Internet connection. Once done, you will have a gecco command available in your $PATH.

Note that GECCO uses HMMER3, which can only run on PowerPC and recent x86-64 machines running a POSIX operating system. Therefore, GECCO will work on Linux and OSX, but not on Windows.
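As a quick sanity check after installing, the two conditions above (a supported operating system and a gecco executable on the $PATH) can be tested from Python. This is only a minimal sketch using the standard library: the helper names platform_supported and gecco_on_path are made up for illustration and are not part of GECCO, and the check does not look at the CPU architecture.

```python
import platform
import shutil

# Operating systems the README describes as supported (Linux and OSX).
# platform.system() reports OSX/macOS as "Darwin".
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}


def platform_supported(system_name=None):
    """Return True if the OS name is one the README lists as supported."""
    if system_name is None:
        system_name = platform.system()
    return system_name in SUPPORTED_SYSTEMS


def gecco_on_path():
    """Return the full path of the `gecco` executable, or None if not found."""
    return shutil.which("gecco")


if __name__ == "__main__":
    print("supported platform:", platform_supported())
    print("gecco executable:", gecco_on_path())
```

If gecco_on_path() returns None right after a successful install, the usual suspect is an environment (virtualenv or Conda) that is not activated in the current shell.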

🧬 Running GECCO

Once gecco is installed, you can run it from the terminal by giving it a FASTA or GenBank file with the genomic sequence you want to analyze, as well as an output directory:

    $ gecco run --genome some_genome.fna -o some_output_dir
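When GECCO has to be called from a script, for instance once per genome in a collection, the same command can be assembled and launched with the standard subprocess module. The sketch below is an assumption about how one might wrap the CLI, not an API provided by GECCO: build_run_command and run_gecco are hypothetical helpers, and extra_args is simply appended to the command line unchanged.

```python
import subprocess


def build_run_command(genome, output_dir, extra_args=()):
    """Build the argument list for `gecco run` (hypothetical helper)."""
    command = ["gecco", "run", "--genome", str(genome), "-o", str(output_dir)]
    # Any additional command-line flags are passed through as strings.
    command.extend(str(arg) for arg in extra_args)
    return command


def run_gecco(genome, output_dir, extra_args=()):
    """Run GECCO and raise CalledProcessError if it exits with an error."""
    command = build_run_command(genome, output_dir, extra_args)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_gecco(...) to actually execute it.
    print(build_run_command("some_genome.fna", "some_output_dir"))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems when file names contain spaces.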

Additional parameters of interest are:

  • --jobs, which controls the number of threads that will be spawned by GECCO whenever a step can be parallelized. The default, 0, will autodetect the number of CPUs on the machine using os.cpu_count.
  • --cds, controlling the minimum number of consecutive genes a BGC region must have to be detected by GECCO. The default is 3.
  • --threshold, controlling the minimum probability for a gene to be considered part of a BGC region. Using a lower number will increase the number (and possibly length) of predictions, but reduce accuracy. The default of 0.8 was selected to optimize precision/recall on a test set of 364 BGCs from MIBiG 2.0.
  • --cds-feature, which can be supplied a feature name to extract genes if the input file already contains gene annotations instead of predicting genes with Pyrodigal. A common value for records downloaded from GenBank is --cds-feature CDS.
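To give an intuition for how --threshold and --cds interact, the sketch below scans a list of per-gene probabilities and keeps only runs of consecutive genes that all reach the threshold and are long enough. It is a simplified illustration, not GECCO's actual post-processing code: the function name candidate_regions, the inclusive comparison, and the example probabilities are all assumptions made for this example.

```python
def candidate_regions(probabilities, threshold=0.8, min_genes=3):
    """Return (start, end) index pairs of runs of genes scoring >= threshold.

    `end` is exclusive. Runs shorter than `min_genes` are discarded.
    """
    regions = []
    start = None
    for index, probability in enumerate(probabilities):
        if probability >= threshold:
            # Open a new run if we are not already inside one.
            if start is None:
                start = index
        elif start is not None:
            # The run just ended; keep it only if it is long enough.
            if index - start >= min_genes:
                regions.append((start, index))
            start = None
    # Close a run that extends to the last gene.
    if start is not None and len(probabilities) - start >= min_genes:
        regions.append((start, len(probabilities)))
    return regions


# Made-up per-gene probabilities, for illustration only.
example = [0.1, 0.9, 0.95, 0.85, 0.2, 0.9, 0.9, 0.3, 0.5, 0.6, 0.7]
print(candidate_regions(example))                 # [(1, 4)]
print(candidate_regions(example, threshold=0.5))  # [(1, 4), (8, 11)]
```

With the default threshold, only the first run of three genes qualifies; lowering the threshold to 0.5 adds a second region at the end of the list, which matches the trade-off described for --threshold above.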

🔎 Results

GECCO will create the following files:

  • {genome}.genes.tsv: The genes file, containing the genes extracted or predicted from the input file, and per-gene BGC probabilities predicted by the CRF.
  • {genome}.features.tsv: The features file, containing the identified domains in the input sequences, in tabular format.
  • {genome}.clusters.tsv: If any were found, a clusters file, containing the coordinates of the predicted clusters along with their putative biosynthetic type, in tabular format.
  • {genome}_cluster_{N}.gbk: If any were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
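Since the tabular outputs are plain TSV files, they can be loaded with the standard csv module. The sketch below assumes the files start with a header row; the only column name it relies on is the cluster identifier, which the changelog says is called cluster_id since v0.9.6 and bgc_id before. The helpers read_tsv and cluster_ids, as well as the toy table, are hypothetical and only serve the example.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def cluster_ids(rows):
    """Collect cluster identifiers, accepting the old or the new column name."""
    ids = []
    for row in rows:
        # `cluster_id` since v0.9.6, `bgc_id` before (see the changelog).
        ids.append(row.get("cluster_id") or row.get("bgc_id"))
    return ids


# Toy table standing in for a real `{genome}.clusters.tsv`; every column
# except `cluster_id` is a placeholder, not GECCO's actual schema.
toy_table = io.StringIO(
    "cluster_id\tplaceholder_column\n"
    "example_cluster_1\tfoo\n"
    "example_cluster_2\tbar\n"
)
rows = read_tsv(toy_table)
print(cluster_ids(rows))
```

For a real run, the in-memory table would be replaced by a file opened with open(path, newline=""), where path points to the clusters file inside the output directory.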

For a more visual way of exploring the predictions, you can open the GenBank files in genome editing software such as UGENE. Alternatively, you can load the results into an antiSMASH report: check the Integrations page of the documentation for a step-by-step guide.

🔖 Reference

GECCO can be cited using the following preprint:

Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

πŸ—οΈ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

βš–οΈ License

This software is provided under the GNU General Public License v3.0 or later. GECCO is developed by the Zeller Team at the European Molecular Biology Laboratory in Heidelberg.

Issues

Visualization issue

opened on 2022-09-09 10:06:12 by OwenNaicker

Hi, I am having trouble visualizing the output from GECCO on antiSMASH. I followed the instructions in the GECCO documentation and uploaded the JSON file generated by GECCO into the "extra annotations" tab on antiSMASH but I do not see any of the information from GECCO after running antiSMASH. I get the same results as if I never uploaded the JSON file. Any idea how to solve this issue?

is it memory issues?

opened on 2021-10-21 16:08:44 by Starcommits

Hello, excuse my ignorance please :) I am having an issue with running gecco on my laptop (MacBook Pro) and I get this issue:

    zsh: segmentation fault  gecco -v run --genome contigs.fa

    /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '

I am trying to search for BGCs in a contigs file after assembling reads from a soil sample (metagenomes). Does gecco work directly on such files, or does it have to be only one genome at a time? Am I missing something? Is it a memory issue or a problem with the installation? I ran gecco before on a RefSeq genome from NCBI and it worked fine.

[Question] Documentation - Gecco use cases for 'annotation', downstream 'antismash'

opened on 2021-05-30 19:27:25 by tamuanand

Hi @althonos

I have some questions pertaining to documentation. I know you mention here some documentation and also have a disclaimer

Before I ask my questions, I think there is a bug or something wrong in the help text for -vvv (verbose debugging). I do not think that -vvv is working. Does it stand for "very very verbose"?

  • When I invoke it, it causes the program to exit: gecco -vvv run --genome GENOME.fasta -o gecco_GENOME >& verbose_GENOME_gecco.txt &
  • However, the same works if I change vvv to vv

Here is the relevant gecco --help text - it states vvv shows debug information

```
gecco --help

Parameters:
    -h, --help       show the message for gecco or for a given subcommand.
    -q, --quiet      silence any output other than errors (-qq silences
                     everything).
    -v, --verbose    increase verbosity (-v is minimal, -vv is verbose,
                     and -vvv shows debug information).
    -V, --version    show the program version and exit.
```

I have some questions/feature requests:

  1. When do you use the gecco annotate command, and what is its purpose?
  2. In what scenarios does one use gecco for downstream post-processing with antismash? I could not understand the use case for it from the preprint.
  3. I am assuming you would have done a downstream BiG-SLiCE process with your datasets. As a feature request or enhancement, it would be nice to have gecco outputs (or scripts) in a compatible way for BiG-SLiCE.
  4. I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE

```
Parameters - Cluster Detection:
    -c, --cds           the minimum number of coding sequences a valid
                        cluster must contain. [default: 3]
    -m , --threshold    the probability threshold for cluster detection.
                        Default depends on the post-processing method
                        (0.4 for gecco, 0.6 for antismash).
    --postproc          the method to use for cluster validation
                        (antismash or gecco). [default: gecco]
```

manual on the tool output

opened on 2021-05-13 09:03:06 by smb20200615

Hello,

Is there a manual that would explain the output files? I am interested in seeing what BGCs are shared by a range of genomes. The command to run the tool seems very simple but I am having trouble interpreting the output.

Many thanks!

Releases

v0.9.6 2023-01-11 19:33:15

Added

  • Gene Ontology annotations to gecco.interpro local metadata.
  • Reference to Gene Ontology terms and derived functions to gecco.model.Domain objects.
  • Gene color based on predicted function in gecco.model.Gene.to_seq_feature.

Fixed

  • Missing gzip import in the CLI preventing usage of gzip-compressed inputs.
  • Invalid coordinates of domains found in reverse-strand genes.
  • Detection of entry points with importlib.metadata on older Python versions.

Changed

  • bgc_id columns of cluster tables are renamed cluster_id.
  • gecco.model.ProductType is renamed to gecco.model.ClusterType.
  • Bumped pyrodigal dependency to v2.0.
  • Bumped pyhmmer dependency to v0.7.

v0.9.5 2022-08-10 12:25:56

Added

  • gecco predict command to predict BGCs from an annotated genome.
  • Protein.with_seq function to assign a new sequence to a protein object.

Fixed

  • Issue with antiSMASH sideload JSON file generation in gecco run and gecco predict.
  • Make gecco.orf handle STOP codons consistently (#9).

v0.9.4 2022-05-31 10:41:35

Added

  • classes_ property to TypeClassifier to access the classes_ attribute of the TypeBinarizer.
  • Alternative ORF finder CDSFinder which simply extracts CDS features from input sequences (#8).
  • Support for annotating domains with "exclusive" HMMs to annotate genes with at most one HMM from the library.

Changed

  • ProductType is not restricted to MIBiG types anymore and can support any string as a base type identifier.
  • PyrodigalFinder now uses multiprocessing.pool.ThreadPool instead of custom thread code thanks to OrfFinder.find_genes reentrancy introduced in Pyrodigal v1.0.
  • PyrodigalFinder can now be used in single / non-meta mode from the API.
  • Bumped minimum rich version to 12.3 to use None total in progress bars when the size of an HMM library is unknown.

Fixed

  • Broken MyPy type annotations in the gecco.model and gecco.cli modules.

v0.9.3 2022-05-13 14:26:09

Changed

  • --format flag of gecco annotate and gecco run CLI commands is now lowercased before its value is passed to Bio.SeqIO.

Fixed

  • Genes with duplicate IDs being silently ignored in HMMER.run.

v0.9.2 2022-04-11 17:07:54

Added

  • Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.

v0.9.1 2022-04-05 15:59:57

Changed

  • Make the genes.tsv and features.tsv table contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
  • Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.
Zeller Lab

Projects Relating to the Zeller Team's Research of Host-Microbiota Interactions

natural-products biosynthetic-gene-clusters python bioinformatics genomics metagenomics secondary-metabolites