Multiple Sequence Alignment using Graph Clustering
MAGUS is a tool for piecewise large-scale multiple sequence alignment.
The dataset is divided into subsets, which are independently aligned with a base method (currently MAFFT -linsi). These subalignments are merged together with the Graph Clustering Merger (GCM). GCM builds the final alignment by clustering an alignment graph, which is constructed from a set of backbone alignments. This process allows MAGUS to effectively boost MAFFT -linsi to over a million sequences.
The basic procedure is outlined below. Steps 4-7 are GCM.
1. The input is a set of unaligned sequences. Alternatively, the user can provide a set of multiple sequence alignments and skip the next two steps.
2. The dataset is decomposed into subsets.
3. The subsets are aligned with MAFFT -linsi.
4. A set of backbone alignments is generated with MAFFT -linsi (or provided by the user).
5. The backbones are compiled into an alignment graph.
6. The graph is clustered with MCL.
7. The clusters are resolved into a final alignment.
Deepest thanks to Baqiao Liu for setting up the MAGUS PyPI package (https://pypi.org/project/magus-msa/)
This is currently the easiest way to get started with MAGUS.
The package can be installed with
pip3 install magus-msa
and executed with
magus <arguments>
Alternatively, you can download and extract the code from this repository to a directory of your choice.
Then, you can run MAGUS with
python3
MAGUS requires
* Python 3
* MAFFT (linux version is included)
* MCL (linux version is included)
* FastTree and Clustal Omega are needed if using these guide trees (linux versions included)
If you would like to use some other version of MAFFT and/or MCL (for instance, if you're using Mac),
you will need to edit the MAFFT/MCL paths in configuration.py
(I'll pull these out into a separate config file to make it simpler).
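One possible way to make this edit, sketched under assumptions: prefer an executable found on the system PATH and fall back to a bundled copy. The variable names `mafftPath` and `mclPath` mirror those in configuration.py; the helper `resolve_tool`, the use of the current directory as base, and the fallback locations are hypothetical and should be adjusted to your checkout.

```python
import os
import shutil


def resolve_tool(name, bundled_path):
    """Return the executable found on PATH, or the bundled fallback path."""
    found = shutil.which(name)
    return found if found is not None else bundled_path


# Assumption: in configuration.py the base would be the directory of that
# file; the current directory is used here so the sketch runs anywhere.
base_dir = os.getcwd()

# Hypothetical fallback locations inside the repository.
mafftPath = resolve_tool("mafft", os.path.join(base_dir, "tools", "mafft", "mafft"))
mclPath = resolve_tool("mcl", os.path.join(base_dir, "tools", "mcl", "bin", "mcl"))
```

With this pattern, a MAFFT or MCL installed through a package manager is picked up without hard-coding an absolute path.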
Please navigate your terminal to the "example" directory to get started with some sample data.
A few basic ways of running MAGUS are shown below.
Run "magus.py -h" to view the full list of arguments.
Align a set of unaligned sequences from scratch
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
-o specifies the output alignment path
-d (optional) specifies the working directory for GCM's intermediate files, like the graph, clusters, log, etc.
Merge a prepared set of alignments
python3 ../magus.py -d outputs -s subalignments -o magus_result.txt
-s specifies the directory with subalignment files. Alternatively, you can pass a list of file paths.
Specify subset decomposition behavior
python3 ../magus.py -d outputs -i unaligned_sequences.txt -t fasttree --maxnumsubsets 100 --maxsubsetsize 50 -o magus_result.txt
-t specifies the guide tree method to use, and is the main way to set the decomposition strategy.
Available options are fasttree (default), parttree, clustal (recommended for very large datasets), and random.
--maxnumsubsets sets the desired number of subsets to decompose into (default 25).
--maxsubsetsize sets the size threshold below which subsets are no longer decomposed (default 50).
Decomposition proceeds until maxnumsubsets is reached OR all subsets are below maxsubsetsize.
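The stopping rule can be illustrated with a toy sketch. It only tracks subset sizes and splits the largest subset in half, whereas MAGUS decomposes along a guide tree; the function `decompose` is hypothetical, and treating "below" as "at or below" is an assumption.

```python
def decompose(n_sequences, max_num_subsets=25, max_subset_size=50):
    """Toy model: split the largest subset until either limit is satisfied."""
    subsets = [n_sequences]
    # Stop when maxnumsubsets is reached OR every subset is small enough.
    while len(subsets) < max_num_subsets and max(subsets) > max_subset_size:
        largest = max(subsets)
        subsets.remove(largest)
        half = largest // 2
        subsets.extend([half, largest - half])
    return sorted(subsets, reverse=True)


default_run = decompose(1000)
wide_run = decompose(1000, max_num_subsets=100, max_subset_size=50)
print(len(default_run), max(default_run))
print(len(wide_run), max(wide_run))
```

In this toy model, 1000 sequences with the defaults stop because the subset count reaches 25 while some subsets are still larger than 50; raising --maxnumsubsets to 100 lets the size threshold become the binding condition instead.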
Specify backbones for alignment graph
python3 ../magus.py -d outputs -i unaligned_sequences.txt -r 10 -m 200 -o magus_result.txt
python3 ../magus.py -d outputs -s subalignments -b backbones -o magus_result.txt
-r and -m specify the number of MAFFT backbones and their maximum size, respectively. They default to 10 and 200.
Alternatively, the user can provide their own backbones; -b can be used to provide a directory or a list of files.
Specify graph trace method
python3 ../magus.py -d outputs -i unaligned_sequences.txt --graphtracemethod mwtgreedy -o magus_result.txt
--graphtracemethod is the flag that governs the graph trace method. Options are minclusters (default and recommended), fm, mwtgreedy (recommended for very large graphs), rg, or mwtsearch.
Unconstrained alignment
python3 ../magus.py -d outputs -i unaligned_sequences.txt -c false -o magus_result.txt
By default, MAGUS constrains the merged alignment to induce all subalignments. This constraint can be disabled with -c false.
This drastically slows MAGUS and is strongly discouraged for more than 200 sequences.
Hi,
I'm getting:
```
Output: FastTree Version 2.1.11 SSE3, OpenMP (64 threads)
Alignment: /user/work/tk19812/software/WITCH/examples/MAGUS/example/OBP.AA.outputs/decomposition/initial_tree/initial_align.txt
Amino acid distances: BLOSUM45 Joins: balanced Support: none
Search: Fastest+2nd +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Le-Gascuel 2008, CAT approximation with 20 rate categories
Wrong number of characters for Wallacei_Hege.OBPloc26: expected 351 but have 345 instead.
This sequence may be truncated, or another sequence may be too long.
```
when trying to align proteins. I don't understand why.
Thanks
F
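FastTree's message indicates that one row of the alignment it was given has a different number of columns than the others. A minimal sketch for locating such rows in an aligned FASTA file is shown below; the helpers `read_fasta` and `find_length_mismatches` and the sample records are made up for illustration and are not part of MAGUS.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def find_length_mismatches(records):
    """Return {name: length} for rows that differ from the most common length."""
    lengths = Counter(len(seq) for seq in records.values())
    if not lengths:
        return {}
    expected = lengths.most_common(1)[0][0]
    return {n: len(s) for n, s in records.items() if len(s) != expected}


# Made-up example: seqC is two columns short.
sample = ">seqA\nMK-LVA\n>seqB\nMKQLVA\n>seqC\nMKQL\n"
mismatches = find_length_mismatches(read_fasta(sample))
print(mismatches)
```

Running the same check on the initial_align.txt named in the log would show whether only the reported sequence is affected or several are.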
Hi,
For a large-scale protein MSA test, I compiled the 5 largest Pfam families with 1-3 million sequences each and tried to align them with MAGUS. Unfortunately, MAGUS failed, apparently due to a recursion-limit error when reading the tree (see attached error log). I tried:
python3 magus.py -t clustal
python3 magus.py -t clustal --recurse True
python3 magus.py -t clustal --maxnumsubsets 100 --recurse True
All gave the same error. You wrote that "clustal" is the recommended option for large-scale data; could another tree algorithm be a solution? Can you help? log_errors.txt
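If the failure is indeed Python's recursion limit being hit while a very deep guide tree is read, one workaround sometimes tried is to raise the limit around the parsing step. The sketch below shows only the general mechanism on a toy nested list; `nested_depth` and `depth_with_limit` are hypothetical helpers, not MAGUS or DendroPy code, and whether this resolves the reported error is untested.

```python
import sys


def nested_depth(node):
    """Recursively measure the depth of a single-child chain of lists."""
    if node is None:
        return 1
    return 1 + nested_depth(node[0])


def depth_with_limit(node, limit):
    """Temporarily raise the recursion limit, then restore the old value."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old_limit, limit))
    try:
        return nested_depth(node)
    finally:
        sys.setrecursionlimit(old_limit)


# Build a chain deeper than the usual default limit of 1000, iteratively.
chain = None
for _ in range(3000):
    chain = [chain]

depth = depth_with_limit(chain, 10000)
print(depth)
```

Very high limits can exhaust the interpreter's stack, so the limit should stay as low as the tree depth allows.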
I encountered this error when using python v3.10.0
(magus-env) [email protected]:/wd/MAGUS/example# python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
Traceback (most recent call last):
File "/wd/MAGUS/example/../magus.py", line 12, in <module>
from align.aligner import mainAlignmentTask
File "/wd/MAGUS/align/aligner.py", line 11, in <module>
from align.decompose.decomposer import decomposeSequences
File "/wd/MAGUS/align/decompose/decomposer.py", line 11, in <module>
from align.decompose import initial_tree, kmh
File "/wd/MAGUS/align/decompose/initial_tree.py", line 13, in <module>
from helpers import sequenceutils, hmmutils, treeutils
File "/wd/MAGUS/helpers/treeutils.py", line 7, in <module>
import dendropy
File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/__init__.py", line 24, in <module>
from dendropy.dataio.nexusprocessing import get_rooting_argument
File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/__init__.py", line 20, in <module>
from dendropy.dataio import newickreader
File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/newickreader.py", line 29, in <module>
from dendropy.dataio import nexusprocessing
File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/dataio/nexusprocessing.py", line 30, in <module>
from dendropy.utility import container
File "/miniconda3/envs/magus-env/lib/python3.10/site-packages/dendropy/utility/container.py", line 356, in <module>
class CaseInsensitiveDict(collections.MutableMapping):
AttributeError: module 'collections' has no attribute 'MutableMapping'
It seems similar to this issue, and it went away when I used python 3.8. So I'm guessing it's due to the same problem (use of deprecated collections).
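For context, the abstract base classes such as `MutableMapping` live in `collections.abc`, and the old aliases in `collections` were removed in Python 3.10, which is consistent with the `AttributeError` above. The sketch below shows the import form that works across Python 3 versions. The class name is borrowed from the traceback, but the body is an illustrative stand-in, not DendroPy's implementation; a newer DendroPy release may already use this import.

```python
from collections.abc import MutableMapping  # works on Python 3.3 and later


class CaseInsensitiveDict(MutableMapping):
    """Illustrative mapping whose key lookups ignore case."""

    def __init__(self, *args, **kwargs):
        self._store = {}
        self.update(dict(*args, **kwargs))

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def __setitem__(self, key, value):
        # Keep the original spelling alongside the value.
        self._store[key.lower()] = (key, value)

    def __delitem__(self, key):
        del self._store[key.lower()]

    def __iter__(self):
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


d = CaseInsensitiveDict(Taxon=1)
print(d["TAXON"], list(d))
```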
I am trying to run the example python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
and getting this error:
```
(magus-env) [email protected]:/wd/MAGUS/example# python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
MAGUS was run with: ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt
Running a task, output file: /wd/MAGUS/example/magus_result.txt
Aligning sequences /wd/MAGUS/example/unaligned_sequences.txt
Read 1000 sequences from /wd/MAGUS/example/unaligned_sequences.txt ..
Building PASTA-style FastTree initial tree on /wd/MAGUS/example/unaligned_sequences.txt with skeleton size 300..
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/mafft --localpair --maxiterate 1000 --ep 0.123 --quiet --thread 128 --anysymbol /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_sequences.txt > /wd/MAGUS/example/outputs/decomposition/initial_tree/temp_initial_align.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/hmmbuild --ere 0.59 --cpu 1 --symfrac 0.0 --informat afa /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/temp_hmm_model.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt
Read 700 sequences from /wd/MAGUS/example/outputs/decomposition/initial_tree/queries.txt ..
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1_aligned.txt
Running an external tool, command: /miniconda3/envs/magus-env/bin/hmmalign -o /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/temp_queries_chunk_1_aligned.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/skeleton_hmm/hmm_model.txt /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1.txt
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/chunks_queries/queries_chunk_1_aligned.txt
Read 1000 sequences from /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt ..
Found 100% ACGT-N, assuming DNA..
Data type wasn't specified. Inferred data type DNA from /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt
Running a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_tree.tre
Running an external tool, command: /miniconda3/envs/magus-env/bin/fasttree -nt -gtr -fastest -nosupport /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_align.txt > /wd/MAGUS/example/outputs/decomposition/initial_tree/temp_initial_tree.tre
Completed a task, output file: /wd/MAGUS/example/outputs/decomposition/initial_tree/initial_tree.tre
Built initial tree on /wd/MAGUS/example/unaligned_sequences.txt in 183.0174605846405 sec..
Using target subset size of 50, and maximum number of subsets 25..
Read 1000 sequences from /wd/MAGUS/example/unaligned_sequences.txt ..
Task for /wd/MAGUS/example/magus_result.txt threw an exception: generator raised StopIteration
Traceback (most recent call last):
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 306, in tree_iter
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/wd/MAGUS/tasks/task.py", line 59, in run
    func(self.taskArgs)
  File "/wd/MAGUS/align/aligner.py", line 45, in runAlignmentTask
    decomposeSequences(context)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 44, in decomposeSequences
    buildDecomposition(context, subsetsDir)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 66, in buildDecomposition
    context.subsetPaths = treeutils.decomposeGuideTree(subsetsDir, context.sequencesPath, guideTreePath,
  File "/wd/MAGUS/helpers/treeutils.py", line 96, in decomposeGuideTree
    guideTree = dendropy.Tree.get(path=guideTreePath, schema="newick", preserve_underscores=True)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2732, in get
    return cls._get_from(kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 155, in _get_from
    return cls.get_from_path(src=src, schema=schema, **kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 216, in get_from_path
    return cls._parse_and_create_from_stream(stream=fsrc,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2633, in _parse_and_create_from_stream
    tree_lists = reader.read_tree_lists(
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/ioservice.py", line 357, in read_tree_lists
    product = self._read(stream=stream,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 322, in _read
    for tree in self.tree_iter(stream=stream,
RuntimeError: generator raised StopIteration

MAGUS aborted with an exception..
Task manager found a failed task: /wd/MAGUS/example/magus_result.txt
Traceback (most recent call last):
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 306, in tree_iter
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "../magus.py", line 29, in main
    mainAlignmentTask()
  File "/wd/MAGUS/align/aligner.py", line 30, in mainAlignmentTask
    task.awaitTask()
  File "/wd/MAGUS/tasks/task.py", line 47, in awaitTask
    awaitTasks([self])
  File "/wd/MAGUS/tasks/task.py", line 94, in awaitTasks
    controller.awaitTasks(tasks)
  File "/wd/MAGUS/tasks/controller.py", line 34, in awaitTasks
    observeTaskManager()
  File "/wd/MAGUS/tasks/controller.py", line 53, in observeTaskManager
    runTask(task)
  File "/wd/MAGUS/tasks/manager.py", line 219, in runTask
    task.run()
  File "/wd/MAGUS/tasks/task.py", line 59, in run
    func(self.taskArgs)
  File "/wd/MAGUS/align/aligner.py", line 45, in runAlignmentTask
    decomposeSequences(context)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 44, in decomposeSequences
    buildDecomposition(context, subsetsDir)
  File "/wd/MAGUS/align/decompose/decomposer.py", line 66, in buildDecomposition
    context.subsetPaths = treeutils.decomposeGuideTree(subsetsDir, context.sequencesPath, guideTreePath,
  File "/wd/MAGUS/helpers/treeutils.py", line 96, in decomposeGuideTree
    guideTree = dendropy.Tree.get(path=guideTreePath, schema="newick", preserve_underscores=True)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2732, in get
    return cls._get_from(kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 155, in _get_from
    return cls.get_from_path(src=src, schema=schema, **kwargs)
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/basemodel.py", line 216, in get_from_path
    return cls._parse_and_create_from_stream(stream=fsrc,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/datamodel/treemodel.py", line 2633, in _parse_and_create_from_stream
    tree_lists = reader.read_tree_lists(
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/ioservice.py", line 357, in read_tree_lists
    product = self._read(stream=stream,
  File "/miniconda3/envs/magus-env/lib/python3.8/site-packages/dendropy/dataio/newickreader.py", line 322, in _read
    for tree in self.tree_iter(stream=stream,
RuntimeError: generator raised StopIteration
Waiting for 0 tasks to finish..
MAGUS finished in 183.12643766403198 seconds..
```
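The final `RuntimeError: generator raised StopIteration` matches the behavior introduced by PEP 479: from Python 3.7 onward, a `StopIteration` raised explicitly inside a generator body is converted into a `RuntimeError`. The sketch below reproduces this with toy generators; `old_style_iter` and `new_style_iter` are hypothetical names and not DendroPy code.

```python
def old_style_iter(items):
    """Pre-PEP 479 idiom: signal the end by raising StopIteration."""
    for item in items:
        yield item
    raise StopIteration


def new_style_iter(items):
    """Current idiom: simply return when the generator is done."""
    for item in items:
        yield item
    return


try:
    list(old_style_iter([1, 2]))
    outcome = "no error"
except RuntimeError as exc:
    outcome = str(exc)

print(outcome)
print(list(new_style_iter([1, 2])))
```

If that is what happens here, a DendroPy release that ends its generators with `return` would likely avoid the error on Python 3.8.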
I had some trouble with dependencies and python versions, so I am running MAGUS in a conda environment, which is specified with the following environment.yml:
name: magus-env
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.8.0
- dendropy=4.2.0
- clustalo=1.2.4
- mafft=7.490
- mcl=14.137
- hmmer=3.3.2
- fasttree=2.1.10
- raxml-ng=1.0.3
I also modified the paths in configuration.py
as follows:
clustalPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/clustalo")
mafftPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/mafft")
mclPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/mcl")
mlrmclPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "tools/mlrmcl/mlrmcl")
hmmalignPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmalign")
hmmbuildPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmbuild")
hmmsearchPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/hmmsearch")
fasttreePath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/fasttree")
raxmlPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), "/miniconda3/envs/magus-env/bin/raxml-ng")
(I couldn't find a package for mlrmcl, but that doesn't seem to have anything to do with the error, as far as I can tell).
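A side note on the paths above: when a later argument to `os.path.join` is absolute, the earlier components are discarded, so the `os.path.dirname(...)` prefix has no effect on the lines that point into `/miniconda3`. The sketch below illustrates this with `posixpath` so it behaves the same on any platform; the base directory `/opt/MAGUS` is made up.

```python
import posixpath

base = "/opt/MAGUS"  # hypothetical directory containing configuration.py

# An absolute second component replaces everything before it.
absolute_case = posixpath.join(base, "/miniconda3/envs/magus-env/bin/mafft")

# A relative second component is appended to the base as expected.
relative_case = posixpath.join(base, "tools/mlrmcl/mlrmcl")

print(absolute_case)
print(relative_case)
```

The configuration therefore resolves to the conda binaries as intended, and writing the absolute paths directly would be equivalent.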
Hello again. :)
I am having trouble that I could not troubleshoot. I am receiving the PermissionError: [errno 13] permission denied
when I try it with my file, which has nearly 600k sequences. MAGUS is running on WSL2 with 40 cores and 400GB RAM. The command I am using is python ./MAGUS-master/magus.py -d outputs1 -i SGNRDiscarded.fasta -o aligned1.txt -t clustal
. It runs without problems on the provided sample file (in the same directory). I have tried chown on the directory as well as running with sudo. What I noticed is that the first time it threw the error the output directory was around 1.10GB; when I ran it again with sudo and it threw the same error, the output directory was 1.45GB. So it is the same error, but I think at different stages. At the moment I have opened WSL2 with admin privileges (right click and select run with admin priv) and am running it again. In the link below you can find the captured error.png and log.txt.
Cloned the git repo today on a clean Ubuntu install (AWS c5a.4xlarge instance with Ubuntu 20.04). Installed the dendropy dependency.
```
cd example
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt

subprocess.CalledProcessError: Command '/home/ubuntu/magus/MAGUS-master/tools/mafft/mafft --localpair --maxiterate 1000 --ep 0.123 --quiet --thread 16 --anysymbol /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/skeleton_sequences.txt > /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/temp_initial_align.txt' returned non-zero exit status 126.
```
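Exit status 126 is the shell's conventional code for a command that was found but could not be executed. A common cause is a missing execute permission on the file, although other causes are possible, such as a binary the system cannot run. The sketch below checks for and adds the execute bit; `ensure_executable` is a hypothetical helper, and any helper programs the bundled script calls may need the same treatment.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission if missing; return True if the mode changed."""
    if os.access(path, os.X_OK):
        return False
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

It could be applied to the path reported in the error, for example `ensure_executable("/home/ubuntu/magus/MAGUS-master/tools/mafft/mafft")`, before running MAGUS again.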