Graph Clustering Merger

vlasmirnov, updated 2023-02-17 05:36:11

MAGUS

Multiple Sequence Alignment using Graph Clustering


Purpose and Functionality

MAGUS is a tool for piecewise large-scale multiple sequence alignment.
The dataset is divided into subsets, which are independently aligned with a base method (currently MAFFT -linsi). These subalignments are merged together with the Graph Clustering Merger (GCM). GCM builds the final alignment by clustering an alignment graph, which is constructed from a set of backbone alignments. This process allows MAGUS to effectively boost MAFFT -linsi to over a million sequences.

The basic procedure is outlined below. Steps 4-7 are GCM.

  1. The input is a set of unaligned sequences. Alternatively, the user can provide a set of multiple sequence alignments and skip the next two steps.
  2. The dataset is decomposed into subsets.
  3. The subsets are aligned with MAFFT -linsi.
  4. A set of backbone alignments are generated with MAFFT -linsi (or provided by the user).
  5. The backbones are compiled into an alignment graph.
  6. The graph is clustered with MCL.
  7. The clusters are resolved into a final alignment.
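
Step 5 can be illustrated with a toy sketch. It assumes each backbone is given as a list of columns, where each column lists the (subalignment id, column index) nodes that the backbone places together. This input format and the function name build_alignment_graph are assumptions made for illustration, not MAGUS's internal data structures. Edge weights simply count how many backbone columns support each pairing.

```python
from collections import Counter
from itertools import combinations

def build_alignment_graph(backbones):
    """Count how often two subalignment columns are placed together.

    backbones: list of backbone alignments; each is a list of columns, and
    each column is a list of (subalignment_id, column_index) nodes.
    This input format is a simplification made for this sketch.
    """
    weights = Counter()
    for backbone in backbones:
        for column in backbone:
            # Every pair of distinct nodes sharing a backbone column gets one vote.
            for a, b in combinations(sorted(set(column)), 2):
                weights[(a, b)] += 1
    return weights

backbones = [
    [[("A", 0), ("B", 0)], [("A", 1), ("B", 2)]],
    [[("A", 0), ("B", 0)], [("A", 1), ("B", 1)]],
]
graph = build_alignment_graph(backbones)
print(graph[(("A", 0), ("B", 0))])  # 2
```

Clustering a weighted graph of this kind (step 6) then groups the columns that have strong mutual support.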


Installing MAGUS

Deepest thanks to Baqiao Liu for setting up the MAGUS PyPI package (https://pypi.org/project/magus-msa/).
This is currently the easiest way to get started with MAGUS.
The package can be installed with

pip3 install magus-msa

and executed with

magus <arguments>

Alternatively, you can download and extract the code from this repository to a directory of your choice.
Then, you can run MAGUS with

python3 /magus.py


Dependencies

MAGUS requires

  • Python 3
  • MAFFT (linux version is included)
  • MCL (linux version is included)
  • FastTree and Clustal Omega are needed if using these guide trees (linux versions included)

If you would like to use some other version of MAFFT and/or MCL (for instance, if you're using Mac), you will need to edit the MAFFT/MCL paths in configuration.py
(I'll pull these out into a separate config file to make it simpler).
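
One way to find the values to put in configuration.py is to ask Python where the tools live on your PATH. The sketch below uses only the standard library; the helper name find_tool is hypothetical, and the resulting paths still have to be copied into configuration.py by hand.

```python
import shutil

def find_tool(name):
    """Return the path of an executable found on PATH, or None if absent."""
    return shutil.which(name)

for tool in ("mafft", "mcl"):
    path = find_tool(tool)
    print(f"{tool}: {path if path else 'not found on PATH'}")
```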


Getting Started

Please navigate your terminal to the "example" directory to get started with some sample data.
A few basic ways of running MAGUS are shown below.
Run "magus.py -h" to view the full list of arguments.

Align a set of unaligned sequences from scratch
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt

-o specifies the output alignment path
-d (optional) specifies the working directory for GCM's intermediate files, like the graph, clusters, log, etc.

Merge a prepared set of alignments
python3 ../magus.py -d outputs -s subalignments -o magus_result.txt

-s specifies the directory with subalignment files. Alternatively, you can pass a list of file paths.


Controlling the pipeline

Specify subset decomposition behavior
python3 ../magus.py -d outputs -i unaligned_sequences.txt -t fasttree --maxnumsubsets 100 --maxsubsetsize 50 -o magus_result.txt

-t specifies the guide tree method to use, and is the main way to set the decomposition strategy.
Available options are fasttree (default), parttree, clustal (recommended for very large datasets), and random.
--maxnumsubsets sets the desired number of subsets to decompose into (default 25).
--maxsubsetsize sets the threshold to stop decomposing subsets below this number (default 50).
Decomposition proceeds until maxnumsubsets is reached OR all subsets are below maxsubsetsize.
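
The stopping rule can be sketched as a loop. This is only an illustration: MAGUS splits subsets along the guide tree, whereas the sketch below simply halves the largest subset. The function name decompose_sizes is hypothetical, and "below maxsubsetsize" is read here as "at or below", which is an assumption.

```python
def decompose_sizes(total, max_num_subsets=25, max_subset_size=50):
    """Return subset sizes after applying the stopping rule to `total` sequences."""
    subsets = [total]
    # Stop when enough subsets exist OR every subset is small enough.
    while len(subsets) < max_num_subsets and max(subsets) > max_subset_size:
        largest = max(subsets)
        subsets.remove(largest)
        subsets += [largest // 2, largest - largest // 2]
    return sorted(subsets, reverse=True)

print(len(decompose_sizes(1000)))     # 25: the subset-count limit is hit first
print(decompose_sizes(120, 25, 50))   # [30, 30, 30, 30]
```

With the defaults, 1000 sequences stop at 25 subsets even though some are still larger than 50, while 120 sequences stop early because every subset is already small enough.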

Specify backbones for alignment graph
python3 ../magus.py -d outputs -i unaligned_sequences.txt -r 10 -m 200 -o magus_result.txt
python3 ../magus.py -d outputs -s subalignments -b backbones -o magus_result.txt

-r and -m specify the number of MAFFT backbones and their maximum size, respectively. They default to 10 and 200.
Alternatively, the user can provide their own backbones; -b can be used to provide a directory or a list of files.

Specify graph trace method
python3 ../magus.py -d outputs -i unaligned_sequences.txt --graphtracemethod mwtgreedy -o magus_result.txt

--graphtracemethod is the flag that governs the graph trace method. Options are minclusters (default and recommended), fm, mwtgreedy (recommended for very large graphs), rg, or mwtsearch.

Unconstrained alignment
python3 ../magus.py -d outputs -i unaligned_sequences.txt -c false -o magus_result.txt

By default, MAGUS constrains the merged alignment to induce all subalignments. This constraint can be disabled with -c false.
Disabling the constraint drastically slows MAGUS and is strongly discouraged above 200 sequences.


Things to Keep in Mind

  • MAGUS will not overwrite existing backbone, graph and cluster files.
    Please delete them/specify a different working directory to perform a clean run.
  • Related issue: if MAGUS is stopped while running MAFFT, MAFFT's output backbone files will be empty.
    This will cause errors if MAGUS reruns and finds these empty files.
  • A large number of subalignments (>100) will start to significantly slow down the ordering phase, especially for very heterogeneous data.
    I would generally advise against using more than 100 subalignments, unless the data is expected to be well-behaved.
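
A quick way to spot leftovers from an interrupted run is to list zero-byte files in the working directory before rerunning. The sketch assumes the working directory is the one passed with -d (here "outputs", as in the examples) and uses a hypothetical helper name. It only reports files and deletes nothing.

```python
import os

def find_empty_files(working_dir):
    """Return sorted paths of zero-byte regular files anywhere under working_dir."""
    empty = []
    for root, _dirs, files in os.walk(working_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)

for path in find_empty_files("outputs"):
    print("empty file:", path)
```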

Related Publications

  • Original MAGUS paper: Smirnov, V. and Warnow, T., 2020. MAGUS: Multiple Sequence Alignment using Graph Clustering. Bioinformatics. https://doi.org/10.1093/bioinformatics/btaa992
  • GCM-MWT paper:
  • MAGUS on ultra-large datasets:

Issues

Wrong number of characters ERROR

opened on 2022-07-31 19:08:38 by francicco

Hi,

I'm getting:

```
Output: FastTree Version 2.1.11 SSE3, OpenMP (64 threads)
Alignment: /user/work/tk19812/software/WITCH/examples/MAGUS/example/OBP.AA.outputs/decomposition/initial_tree/initial_align.txt
Amino acid distances: BLOSUM45 Joins: balanced Support: none
Search: Fastest+2nd +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Le-Gascuel 2008, CAT approximation with 20 rate categories
Wrong number of characters for Wallacei_Hege.OBPloc26: expected 351 but have 345 instead.
This sequence may be truncated, or another sequence may be too long.
```

when trying to align proteins. I don't understand why.

Thanks
F
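
The FastTree message indicates that the rows of its input alignment differ in length. A small check like the one below can show which records are off. It is a sketch that assumes FASTA-formatted input, and both function names are hypothetical.

```python
from collections import Counter

def read_fasta_lengths(lines):
    """Map each FASTA record name to its total sequence length."""
    lengths, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def odd_length_records(lengths):
    """Return records whose length differs from the most common length."""
    if not lengths:
        return {}
    expected = Counter(lengths.values()).most_common(1)[0][0]
    return {n: l for n, l in lengths.items() if l != expected}

example = [">seq1", "AC-GT", ">seq2", "ACGGT", ">seq3", "ACG"]
print(odd_length_records(read_fasta_lengths(example)))  # {'seq3': 3}
```

To check a file, pass an open file handle instead of the example list, for instance the initial_align.txt named in the log.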

Running on millions of protein sequences

opened on 2022-04-19 06:16:22 by felbecker

Hi,

For a large scale protein MSA test, I compiled the 5 largest Pfam families with 1-3 mio sequences and tried to align them with MAGUS. Unfortunately, MAGUS failed apparently due to a recursion-limit error when reading the tree (see attached error log). I tried:

python3 magus.py -t clustal
python3 magus.py -t clustal --recurse True
python3 magus.py -t clustal --maxnumsubsets 100 --recurse True

All gave the same error. You wrote "clustal" is the recommended option for large scale data, however, another tree algorithm might be a solution? Can you help? log_errors.txt
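
Recursion-limit errors on very large trees typically come from recursive traversal of deep, unbalanced trees. The sketch below is not MAGUS or DendroPy code; it uses a hypothetical nested-tuple tree to show that an explicit stack copes with depths far beyond Python's default recursion limit. Raising the limit with sys.setrecursionlimit is another common workaround, though very high values can exhaust the interpreter's stack.

```python
def count_leaves_iterative(tree):
    """Count leaves of a nested-tuple tree without recursion."""
    stack, leaves = [tree], 0
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            stack.extend(node)   # internal node: visit its children later
        else:
            leaves += 1          # anything that is not a tuple is a leaf
    return leaves

# Build a caterpillar tree far deeper than the default recursion limit.
tree = "leaf0"
for i in range(1, 50_000):
    tree = (tree, f"leaf{i}")

print(count_leaves_iterative(tree))  # 50000
```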

Incompatible with python >3.8?

opened on 2021-11-24 11:55:33 by joelnitta

I encountered this error when using python v3.10.0

(magus-env) [email protected]:/wd/MAGUS/example# python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
Traceback (most recent call last):
  File "/wd/MAGUS/example/../magus.py", line 12, in <module>
    from align.aligner import mainAlignmentTask
  File "/wd/MAGUS/align/aligner.py", line 11, in <module>
    from align.decompose.decomposer import decomposeSequences
  File "/wd/MAGUS/align/decompose/decomposer.py", line 11, in <module>
    from align.decompose import initial_tree, kmh
  File "/wd/MAGUS/align/decompose/initial_tree.py", line 13, in <module>
    from helpers import sequenceutils, hmmutils, treeutils
  File "/wd/MAGUS/helpers/treeutils.py", line 7, in <module>
    import dendropy
  File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/__init__.py", line 24, in <module>
    from dendropy.dataio.nexusprocessing import get_rooting_argument
  File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/__init__.py", line 20, in <module>
    from dendropy.dataio import newickreader
  File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/newickreader.py", line 29, in <module>
    from dendropy.dataio import nexusprocessing
  File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/nexusprocessing.py", line 30, in <module>
    from dendropy.utility import container
  File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/utility/container.py", line 356, in <module>
    class CaseInsensitiveDict(collections.MutableMapping):
AttributeError: module 'collections' has no attribute 'MutableMapping'

It seems similar to this issue, and it went away when I used python 3.8. So I'm guessing it's due to the same problem (use of deprecated collections).
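
The traceback points at collections.MutableMapping, an alias that was removed in Python 3.10; the class is now available only from collections.abc. The sketch below shows the import that works across Python 3 versions. The class body is a hypothetical minimal stand-in, not DendroPy's actual implementation.

```python
from collections.abc import MutableMapping  # valid on Python 3.3 and later

class CaseInsensitiveDict(MutableMapping):
    """Minimal mapping that ignores key case (illustration only)."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __setitem__(self, key, value):
        self._data[key.lower()] = value

    def __delitem__(self, key):
        del self._data[key.lower()]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

d = CaseInsensitiveDict()
d["Taxon"] = 1
print(d["TAXON"])  # 1
```

Since the failing line is inside the dendropy package rather than MAGUS itself, upgrading DendroPy may be enough to resolve it.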

StopIteration error

opened on 2021-11-24 09:06:30 by joelnitta

I am trying to run the example python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt and getting this error:

```
(magus-env) [email protected]:/wd/MAGUS/example# python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
MAGUS was run with: ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
Running a task, output file: /wd/MAGUS/example/magus_result.txt
Aligning sequences /wd/MAGUS/example/unaligned_sequences.txt
Read 1000 sequences from /wd/MAGUS/example/unaligned_sequences.txt ..
Building PASTA-style FastTree initial tree on /wd/MAGUS/example/unaligned_sequences.txt with skeleton size 300..
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/mafft --localpair --maxiterate 1000 --ep 0.123 --quiet --thread 128 --anysymbol /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_sequences.txt > /wd/MAGUS/example/outputs/decomposition/initial_tree/temp_initial_align.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/hmmbuild --ere 0.59 --cpu 1 --symfrac 0.0 --informat afa /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/temp_hmm_model.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt
Read 700 sequences from /wd/MAGUS/example/outputs/decomposition/initial_tree/queries.txt ..
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1_aligned.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/hmmalign -o /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/temp_queries_chunk_1_aligned.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1_aligned.txt
Read 1000 sequences from /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt ..
Found 100% ACGT-N, assuming DNA..
Data type wasn't specified. Inferred data type DNA from /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_tree.tre
Running an external tool, command: /miniconda3/envs/magus-env/bin/fasttree -nt -gtr -fastest -nosupport /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt > /wd/MAGUS/example/outputs/decomposition/initial_tree/temp_initial_tree.tre
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_tree.tre
Built initial tree on /wd/MAGUS/example/unaligned_sequences.txt in 183.0174605846405 sec..
Using target subset size of 50, and maximum number of subsets 25..
Read 1000 sequences from /wd/MAGUS/example/unaligned_sequences.txt ..
Task for /wd/MAGUS/example/magus_result.txt threw an exception:
generator raised StopIteration
Traceback (most recent call last):
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 306, in tree_iter
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/wd/MAGUS/tasks/task.py", line 59, in run
    func(self.taskArgs)
  File "/wd/MAGUS/align/aligner.py", line 45, in runAlignmentTask
    decomposeSequences(context)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 44, in decomposeSequences
    buildDecomposition(context, subsetsDir)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 66, in buildDecomposition
    context.subsetPaths = treeutils.decomposeGuideTree(subsetsDir, context.sequencesPath, guideTreePath,
  File "/wd/MAGUS/helpers/treeutils.py", line 96, in decomposeGuideTree
    guideTree = dendropy.Tree.get(path=guideTreePath, schema="newick", preserve_underscores=True)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2732, in get
    return cls._get_from(kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 155, in _get_from
    return cls.get_from_path(src=src, schema=schema, **kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 216, in get_from_path
    return cls._parse_and_create_from_stream(stream=fsrc,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2633, in _parse_and_create_from_stream
    tree_lists = reader.read_tree_lists(
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/ioservice.py", line 357, in read_tree_lists
    product = self._read(stream=stream,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 322, in _read
    for tree in self.tree_iter(stream=stream,
RuntimeError: generator raised StopIteration

MAGUS aborted with an exception..
Task manager found a failed task: /wd/MAGUS/example/magus_result.txt
Traceback (most recent call last):
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 306, in tree_iter
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "../magus.py", line 29, in main
    mainAlignmentTask()
  File "/wd/MAGUS/align/aligner.py", line 30, in mainAlignmentTask
    task.awaitTask()
  File "/wd/MAGUS/tasks/task.py", line 47, in awaitTask
    awaitTasks([self])
  File "/wd/MAGUS/tasks/task.py", line 94, in awaitTasks
    controller.awaitTasks(tasks)
  File "/wd/MAGUS/tasks/controller.py", line 34, in awaitTasks
    observeTaskManager()
  File "/wd/MAGUS/tasks/controller.py", line 53, in observeTaskManager
    runTask(task)
  File "/wd/MAGUS/tasks/manager.py", line 219, in runTask
    task.run()
  File "/wd/MAGUS/tasks/task.py", line 59, in run
    func(self.taskArgs)
  File "/wd/MAGUS/align/aligner.py", line 45, in runAlignmentTask
    decomposeSequences(context)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 44, in decomposeSequences
    buildDecomposition(context, subsetsDir)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 66, in buildDecomposition
    context.subsetPaths = treeutils.decomposeGuideTree(subsetsDir, context.sequencesPath, guideTreePath,
  File "/wd/MAGUS/helpers/treeutils.py", line 96, in decomposeGuideTree
    guideTree = dendropy.Tree.get(path=guideTreePath, schema="newick", preserve_underscores=True)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2732, in get
    return cls._get_from(kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 155, in _get_from
    return cls.get_from_path(src=src, schema=schema, **kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 216, in get_from_path
    return cls._parse_and_create_from_stream(stream=fsrc,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2633, in _parse_and_create_from_stream
    tree_lists = reader.read_tree_lists(
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/ioservice.py", line 357, in read_tree_lists
    product = self._read(stream=stream,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 322, in _read
    for tree in self.tree_iter(stream=stream,
RuntimeError: generator raised StopIteration

Waiting for 0 tasks to finish..
MAGUS finished in 183.12643766403198 seconds..
```

I had some trouble with dependencies and python versions, so I am running MAGUS in a conda environment, which is specified with the following environment.yml:

name: magus-env
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.8.0
  - dendropy=4.2.0
  - clustalo=1.2.4
  - mafft=7.490
  - mcl=14.137
  - hmmer=3.3.2
  - fasttree=2.1.10
  - raxml-ng=1.0.3

I also modified the paths in configuration.py as follows:

clustalPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/clustalo")
mafftPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/mafft")
mclPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/mcl")
mlrmclPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "tools/mlrmcl/mlrmcl")
hmmalignPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmalign")
hmmbuildPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmbuild")
hmmsearchPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmsearch")
fasttreePath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/fasttree")
raxmlPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/raxml-ng")

(I couldn't find a package for mlrmcl, but that doesn't seem to have anything to do with the error, as far as I can tell).
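
The "generator raised StopIteration" RuntimeError in the log is the behavior introduced by PEP 479: from Python 3.7 on, a StopIteration raised inside a generator body is converted to RuntimeError. The sketch below reproduces the pattern with hypothetical generator names; it is not DendroPy's code. Because the traceback shows the raise coming from dendropy's newickreader.py, a DendroPy release newer than the pinned 4.2.0 may behave differently.

```python
def old_style(items):
    """Pre-PEP 479 idiom: signal the end by raising StopIteration."""
    for item in items:
        if item is None:
            raise StopIteration  # becomes RuntimeError on Python 3.7+
        yield item

def new_style(items):
    """Compatible idiom: end the generator with a plain return."""
    for item in items:
        if item is None:
            return
        yield item

print(list(new_style([1, 2, None, 3])))  # [1, 2]
try:
    list(old_style([1, 2, None, 3]))
except RuntimeError as err:
    print("RuntimeError:", err)  # RuntimeError: generator raised StopIteration
```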

The PermissionError

opened on 2021-11-04 09:55:40 by macelik

Hello again. :)

I am having trouble that I could not troubleshoot. I am receiving PermissionError: [errno 13] permission denied when I run MAGUS on my file, which has nearly 600k sequences. MAGUS is running on WSL2 with 40 cores and 400GB RAM. The command I am using is python ./MAGUS-master/magus.py -d outputs1 -i SGNRDiscarded.fasta -o aligned1.txt -t clustal. It runs without problems on the provided sample file (in the same directory). I have tried to chown the directory and have also tried running with sudo. What I noticed is that the first time it threw the error, the output directory was around 1.10GB; when I ran it again with sudo and it threw the same error, the output directory was 1.45GB. So it is the same error, but I think at different stages. At the moment I have opened WSL2 with admin privileges (right click and select run with admin priv) and am running it again. In the link below you can find the captured error.png and log.txt.

PermissionError

Example fails invoking mafft

opened on 2021-06-04 15:24:42 by rcedgar

Cloned the git repo today on a clean Ubuntu install (AWS c5a.4xlarge instance with Ubuntu 20.04). Installed the dendropy dependency.

```
cd example
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt

...some output deleted...

subprocess.CalledProcessError: Command '/home/ubuntu/magus/MAGUS-master/tools/mafft/mafft --localpair --maxiterate 1000 --ep 0.123 --quiet --thread 16 --anysymbol /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/skeleton_sequences.txt > /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/temp_initial_align.txt' returned non-zero exit status 126.
```
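
Exit status 126 is the shell's conventional code for a command that was found but could not be executed, which often means the file lacks execute permission or was built for a different platform. The sketch below checks the first possibility. The helper name is hypothetical, and the path in the last line is the one from the error message.

```python
import os

def describe_executable(path):
    """Report whether path exists and is executable by the current user."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "executable"

print(describe_executable("/home/ubuntu/magus/MAGUS-master/tools/mafft/mafft"))
```

If the result is "not executable", running chmod +x on the file is the usual remedy.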