GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
GECCO is implemented in Python, and supports all versions from Python 3.7. It requires additional libraries that can be installed directly from PyPI, the Python Package Index.
Use `pip` to install GECCO on your machine:
```console
$ pip install gecco-tool
```
If you'd rather use Conda, a package is available in the `bioconda` channel. You can install it with:
```console
$ conda install -c bioconda gecco
```
This will install GECCO, its dependencies, and the data needed to run predictions. This requires around 40 MB of data to be downloaded, so it could take some time depending on your Internet connection. Once done, you will have a `gecco` command available in your `$PATH`.
Note that GECCO uses HMMER3, which can only run on PowerPC and recent x86-64 machines running a POSIX operating system. Therefore, GECCO will work on Linux and OSX, but not on Windows.
Once `gecco` is installed, you can run it from the terminal by giving it a FASTA or GenBank file with the genomic sequence you want to analyze, as well as an output directory:
```console
$ gecco run --genome some_genome.fna -o some_output_dir
```
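When several genomes need to be processed, the command above can be assembled from a script. The following is a minimal sketch and not part of GECCO itself: the helper name `build_gecco_command` and the file names are hypothetical, and it only uses the `--genome`, `-o` and `--jobs` flags documented in this README.

```python
import shlex
from pathlib import Path


def build_gecco_command(genome, output_dir, jobs=0):
    """Return the argument list for one `gecco run` call.

    Hypothetical helper: it only assembles flags documented in this
    README and does not run anything by itself.
    """
    return [
        "gecco", "run",
        "--genome", str(genome),
        "-o", str(output_dir),
        "--jobs", str(jobs),  # 0 lets GECCO autodetect the CPU count
    ]


# Hypothetical input files; replace them with real FASTA or GenBank paths.
genomes = [Path("genome_a.fna"), Path("genome_b.fna")]
commands = [build_gecco_command(g, Path("results") / g.stem) for g in genomes]

for cmd in commands:
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually launch GECCO.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building an argument list rather than a single shell string avoids quoting problems when file names contain spaces.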
Additional parameters of interest are:
- `--jobs`, which controls the number of threads that will be spawned by GECCO whenever a step can be parallelized. The default, 0, will autodetect the number of CPUs on the machine using `os.cpu_count`.
- `--cds`, controlling the minimum number of consecutive genes a BGC region must have to be detected by GECCO. The default is 3.
- `--threshold`, controlling the minimum probability for a gene to be considered part of a BGC region. Using a lower number will increase the number (and possibly length) of predictions, but reduce accuracy. The default of 0.8 was selected to optimize precision/recall on a test set of 364 BGCs from MIBiG 2.0.
- `--cds-feature`, which can be supplied a feature name to extract genes if the input file already contains gene annotations instead of predicting genes with Pyrodigal. A common value for records downloaded from GenBank is `--cds-feature CDS`.

GECCO will create the following files:
- `{genome}.genes.tsv`: The genes file, containing the genes extracted or predicted from the input file, and per-gene BGC probabilities predicted by the CRF.
- `{genome}.features.tsv`: The features file, containing the identified domains in the input sequences, in tabular format.
- `{genome}.clusters.tsv`: If any were found, a clusters file, containing the coordinates of the predicted clusters along with their putative biosynthetic type, in tabular format.
- `{genome}_cluster_{N}.gbk`: If any were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.

For a more visual way of exploring the predictions, you can open the GenBank files in genome editing software such as UGENE. You can otherwise load the results into an AntiSMASH report: check the Integrations page of the documentation for a step-by-step guide.
GECCO can be cited using the following preprint:
Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
Contributions are more than welcome! See `CONTRIBUTING.md` for more details.
This software is provided under the GNU General Public License v3.0 or later. GECCO is developed by the Zeller Team at the European Molecular Biology Laboratory in Heidelberg.
Hi, I am having trouble visualizing the output from GECCO on antiSMASH. I followed the instructions in the GECCO documentation and uploaded the JSON file generated by GECCO into the "extra annotations" tab on antiSMASH but I do not see any of the information from GECCO after running antiSMASH. I get the same results as if I never uploaded the JSON file. Any idea how to solve this issue?
Hello, excuse my ignorance please :) I am having an issue running gecco on my laptop (MacBook Pro), and I get this error:
zsh: segmentation fault gecco -v run --genome contigs.fa
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '
I am trying to search for BGCs in a contigs file after assembling reads from a soil sample (metagenome). Does gecco work directly on such files, or does it have to be only one genome at a time? Am I missing something? Is it a memory issue or a problem with the installation? I ran gecco before on a RefSeq genome from NCBI and it worked fine.
Hi @althonos
I have some questions pertaining to `documentation`. I know you mention here some documentation and also have a disclaimer.
Before I ask my questions, I think there is a bug or something wrong in the help text for `-vvv` (verbose debugging). I do not think that `-vvv` is working. Does it stand for `very very verbose`?
gecco -vvv run --genome GENOME.fasta -o gecco_GENOME >& verbose_GENOME_gecco.txt &
change vvv to vv
Here is the relevant `gecco --help` text - it states vvv shows debug information
```
gecco --help
Parameters:
    -h, --help       show the message for gecco or
                     for a given subcommand.
    -q, --quiet      silence any output other than errors
                     (-qq silences everything).
    -v, --verbose    increase verbosity (-v is minimal,
                     -vv is verbose, and -vvv shows
                     debug information).
    -V, --version    show the program version and exit.
```
I have some questions/feature requests:
- `gecco annotate` command and what is the purpose of it
- `gecco` for downstream post-processing with `antismash`. I could not understand the use case for it from the preprint
- `feature request` or `enhancement`, it would be nice to have gecco outputs (or scripts) in a compatible way for BiG-SLiCE.
```
Parameters - Cluster Detection:
    -c, --cds
```
Hello,
Is there a manual that would explain the output files? I am interested in seeing what BGCs are shared by a range of genomes. The command to run the tool seems very simple, but I am having trouble interpreting the output.
Many thanks!
- `gecco.interpro` local metadata.
- `gecco.model.Domain` objects.
- `gecco.model.Gene.to_seq_feature`.
- `gzip` import in the CLI preventing usage of gzip-compressed inputs.
- `importlib.metadata` on older Python versions.
- `bgc_id` columns of cluster tables are renamed `cluster_id`.
- `gecco.model.ProductType` is renamed to `gecco.model.ClusterType`.
- `pyrodigal` dependency to `v2.0`.
- `pyhmmer` dependency to `v0.7`.
- `gecco predict` command to predict BGCs from an annotated genome.
- `Protein.with_seq` function to assign a new sequence to a protein object.
- `gecco run` and `gecco predict`.
- `gecco.orf` handle STOP codons consistently (#9).
- `classes_` property to `TypeClassifier` to access the `classes_` attribute of the `TypeBinarizer`.
- `CDSFinder` which simply extracts CDS features from input sequences (#8).
- `ProductType` is not restricted to MIBiG types anymore and can support any string as a base type identifier.
- `PyrodigalFinder` now uses `multiprocessing.pool.ThreadPool` instead of custom thread code thanks to `OrfFinder.find_genes` reentrancy introduced in Pyrodigal `v1.0`. `PyrodigalFinder` can now be used in single / non-meta mode from the API.
- `rich` version to `12.3` to use `None` total in progress bars when the size of an HMM library is unknown.
- `gecco.model` and `gecco.cli` modules.
- `--format` flag of `gecco annotate` and `gecco run` CLI commands is now made lowercase before giving value to `Bio.SeqIO`.
- `HMMER.run`.
- `ClusterCRF`.
- `genes.tsv` and `features.tsv` table contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
- `--force-clusters-tsv` flag with a `--force-tsv` flag to force writing TSV tables even when no genes or clusters were found in `gecco run` or `gecco annotate`.