PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)

bacpop, updated 🕥 2023-03-29 13:56:52

POPulation Partitioning Using Nucleotide Kmers



See the documentation, the paper, and databases.

If you find PopPUNK useful, please cite us:

Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research 29:304-316 (2019). doi:10.1101/gr.241455.118

You can also run your command with --citation to get a list of citations and a suggested methods paragraph.



We will retire the PopPUNK website. Databases have been expanded, and can be found here: https://www.bacpop.org/poppunk/.


The change in scikit-learn's API in v1.0.0 and above means that HDBSCAN models fitted with sklearn <=v0.24 will give an error when loaded. If you run into this, the solution is one of:

  • Downgrade sklearn to v0.24.
  • Run model refinement to turn your model into a boundary model instead (this will change clusters).
  • Refit your model in an environment with sklearn >=v1.0.
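The version condition described above can be expressed as a small check. This is a minimal sketch, assuming you know which scikit-learn version the model was fitted with; `parse_version` and `hdbscan_model_may_fail` are hypothetical helpers written for illustration and are not part of PopPUNK or scikit-learn.

```python
import re


def parse_version(version):
    """Turn a version string such as 'v0.24.2' into a tuple of ints, e.g. (0, 24, 2)."""
    numbers = []
    for piece in version.lstrip("v").split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def hdbscan_model_may_fail(fitted_with, running):
    """True when a model fitted with sklearn <=v0.24 is loaded under sklearn >=v1.0."""
    fitted_old = parse_version(fitted_with)[:2] <= (0, 24)
    running_new = parse_version(running) >= (1, 0)
    return fitted_old and running_new


# Hypothetical example: model fitted under 0.24.2, loaded under 1.0.2
print(hdbscan_model_may_fail("0.24.2", "1.0.2"))
```

If the check returns True, one of the three options listed above applies.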

If this is a common problem let us know, as we could write a script to 'upgrade' HDBSCAN models. See issue #213 for more details.


We have fixed a number of bugs which may affect the use of poppunk_assign with --update-db. We have also fixed a number of bugs with GPU distances. These are 'advanced' features and are not likely to be encountered in most cases, but if you do wish to use either of these features, please make sure that you are using PopPUNK >=v2.4.0 with pp-sketchlib >=v1.7.0.


We have discovered a bug affecting the interaction of pp-sketchlib and PopPUNK. If you have used PopPUNK >=v2.0.0 with pp-sketchlib <v1.5.1, label order may be incorrect (see issue #95).

Please upgrade to PopPUNK >=v2.2 and pp-sketchlib >=v1.5.1. If this is not possible, you can either:

  • Run scripts/poppunk_pickle_fix.py on your .dists.pkl file and re-run model fits.
  • Create the database with poppunk_sketch directly, rather than PopPUNK --create-db.
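The version combinations named in the notes above can be collected into one check. This is a sketch under stated assumptions: `as_tuple` and `version_warnings` are hypothetical names, the version strings are assumed to be plain dotted numbers, and the thresholds are copied from the notes rather than from PopPUNK's code.

```python
def as_tuple(version, width=3):
    """Parse 'v2.4.0' or '2.2' into a fixed-width tuple of ints, e.g. (2, 4, 0)."""
    numbers = [int(part) for part in version.lstrip("v").split(".")[:width]]
    return tuple(numbers + [0] * (width - len(numbers)))


def version_warnings(poppunk, sketchlib, advanced_features=False):
    """Return a list of warnings for a PopPUNK / pp-sketchlib version pair."""
    pp, sk = as_tuple(poppunk), as_tuple(sketchlib)
    warnings = []
    # Label order bug: PopPUNK >=v2.0.0 together with pp-sketchlib <v1.5.1
    if pp >= (2, 0, 0) and sk < (1, 5, 1):
        warnings.append("label order may be incorrect (issue #95)")
    # --update-db and GPU distances: PopPUNK >=v2.4.0 with pp-sketchlib >=v1.7.0
    if advanced_features and (pp < (2, 4, 0) or sk < (1, 7, 0)):
        warnings.append("--update-db and GPU distances need PopPUNK >=v2.4.0 "
                        "with pp-sketchlib >=v1.7.0")
    return warnings


for message in version_warnings("2.1.0", "1.5.0", advanced_features=True):
    print("warning:", message)
```

An empty list means that neither note applies to that pair of versions.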


These installation instructions are for the command line version. For more details see installation in the documentation.

There are other interfaces, in-browser and through galaxy, detailed here.

Through conda (recommended)

The easiest way is through conda, which is most easily accessed by first installing miniconda. PopPUNK can then be installed by running:

conda install poppunk

If the package cannot be found you will need to add the necessary channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Quick usage

See the quickstart guide for a brief tutorial.

Docker image

A docker image is available:

docker pull mrcide/poppunk:bacpop-20


Query data not working

opened on 2023-03-11 09:45:33 by Drojok


Command used and output returned

Describe the bug

Increase maxIter with mandrake, and also add option

opened on 2023-03-01 18:44:18 by johnlees

10^7 may not be enough, but should definitely be able to change this in visualise

Investigate slow assignment with full DBs vs ref DBs

opened on 2022-12-07 09:12:19 by johnlees

Replace network code with pp-netlib

opened on 2022-12-07 09:11:47 by johnlees

Running Poppunk v2.5.0 with --multi-boundary --> OverflowError

opened on 2022-11-24 10:05:44 by avonm

Versions

Poppunk v2.5.0
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Command used and output returned

poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_QCd_dbscan_k18_k32 --ref-db /tmp/Ecoli_n79k_db_k18_k32_221115 --output /tmp/Ecoli_n79k_QCd_dbscan_refine_multi_1 --multi-boundary 30 --threads 20

Describe the bug

Below is the output I get, both the progress of the run and the error. The plan is to run poppunk_iterate after this.

Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database

Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.057) to (0.014,0.304)
Searching core intercept from 0.006 to 0.042
Searching accessory intercept from 0.064 to 0.448
█████████████████████████████████| 40/40
Trying to optimise score locally

Optimization terminated successfully; The returned value satisfies the termination criteria (using xtol = 1e-05 )
Creating multiple boundary fits
Search range (0.000,0.044) to (0.006,0.164)
Searching core intercept from 0.004 to 0.022
Searching accessory intercept from 0.044 to 0.231
█▏ | 1/30
Traceback (most recent call last):
  File "/opt/conda/bin/poppunk", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/__main__.py", line 469, in main
    assignments = new_model.fit(distMat, refList, model,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/models.py", line 808, in fit
    multi_refine(scaled_X,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 296, in multi_refine
    growNetwork(sample_names,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 442, in growNetwork
    G.add_edge_list(edge_list)
  File "/opt/conda/lib/python3.9/site-packages/graph_tool/__init__.py", line 2501, in add_edge_list
    libcore.add_edge_list_iter(self.__graph, edge_list, eprops)
OverflowError: can't convert negative value to unsigned int
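The final error says that a negative value reached a place where graph-tool expects an unsigned integer, which suggests that at least one vertex index in the edge list passed to G.add_edge_list was negative. A minimal sketch for locating such entries follows. It assumes the edge list is an iterable of tuples whose first two items are integer source and target indices; `negative_edges` is a hypothetical helper for illustration, not PopPUNK's code, and it does not explain why a negative index was produced.

```python
def negative_edges(edge_list):
    """Return (position, edge) pairs whose source or target index is negative.

    Only the first two items of each edge are inspected, so edges that carry
    an extra weight column are also accepted.
    """
    return [
        (position, edge)
        for position, edge in enumerate(edge_list)
        if edge[0] < 0 or edge[1] < 0
    ]


# Hypothetical edge list with one corrupted entry
example = [(0, 1), (2, -5), (3, 4)]
for position, edge in negative_edges(example):
    print(f"edge {position} has a negative index: {edge}")
```

Running such a check just before the edges are added would show whether the bad values come from the input or from an earlier step.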

Clustering may not be reproducible with some large data sets

opened on 2022-11-21 14:04:11 by stitam


Versions

PopPUNK v2.5.0 in a singularity container. Link to the dockerfile: https://github.com/StaPH-B/docker-builds/tree/master/poppunk/2.5.0. Not sure about pp-sketchlib version because the command returned executable file not found in $PATH.
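When a version command is not on $PATH, the installed version can sometimes be read from Python's package metadata instead. This is a minimal sketch using the standard library; the distribution names "poppunk" and "pp-sketchlib" are assumptions, and a conda or container install may register different names or no metadata at all.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if no metadata is found."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names below are assumptions, see the note above
for name in ("poppunk", "pp-sketchlib"):
    found = installed_version(name)
    print(name, found if found is not None else "not found in this environment")
```

The script has to be run with the same Python interpreter that the container uses for PopPUNK.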

Command used and output returned

poppunk --create-db --output ppdb_13_29_4_0.05_4l --r-files rlist.txt --threads 30 --plot-fit 5 --min-k 13 --max-k 29 --k-step 4 --min-cluster-prop 1e-05 --max-zero-dist 0.005

poppunk --fit-model dbscan --ref-db ppdb_13_29_4_0.05_4 --output results_13_29_4_0.05_4 --threads 30

Describe the bug

My main issue is that poppunk fails to cluster genomes appropriately when I run the tool for 15000 genomes (before scaling up I ran the tool with about 500 genomes, and clustering completed successfully).

First, I ran poppunk with default parameters, and everything clustered into a single poppunk cluster. Then I set up an experiment where I altered the following parameters: min-k (13 or 16), max-k (29 or 31), k-step (3 or 4), max-zero-dist (0.005 or 0.05), min-cluster-prop (0.00001 or 0.0001). The label 13_29_4_0.05_4 in the section above indicates min-k 13, max-k 29, k-step 4, max-zero-dist 0.05, min-cluster-prop 0.0001 (written on a -log10 scale).

Most parameter combinations produced terrible results. Surprisingly, one completed successfully: the one with default parameters. So I ran poppunk with default parameters again, and it failed again. The most obvious explanation is that I keep overlooking something (working on it), but if this is not the case, I am wondering whether poppunk generates random numbers anywhere. Can you please clarify this? Many thanks.
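The parameter experiment described in this report can be written down as a grid, which keeps the run labels and the flags in step with each other. This is a sketch only: the flags and values are taken from the commands and text above, while `GRID`, `run_label` and `create_db_command` are hypothetical names, and the commands are built as argument lists without being executed.

```python
import itertools
import math

# Values taken from the experiment described in the report
GRID = {
    "min_k": (13, 16),
    "max_k": (29, 31),
    "k_step": (3, 4),
    "max_zero_dist": (0.005, 0.05),
    "min_cluster_prop": (0.00001, 0.0001),
}


def run_label(min_k, max_k, k_step, max_zero_dist, min_cluster_prop):
    """Build a label such as '13_29_4_0.05_4' (last field is -log10 of the proportion)."""
    prop_exponent = round(-math.log10(min_cluster_prop))
    return f"{min_k}_{max_k}_{k_step}_{max_zero_dist}_{prop_exponent}"


def create_db_command(r_files="rlist.txt", threads=30, **params):
    """Return the --create-db command for one parameter combination as an argument list."""
    return [
        "poppunk", "--create-db",
        "--output", f"ppdb_{run_label(**params)}",
        "--r-files", r_files,
        "--threads", str(threads),
        "--min-k", str(params["min_k"]),
        "--max-k", str(params["max_k"]),
        "--k-step", str(params["k_step"]),
        "--min-cluster-prop", str(params["min_cluster_prop"]),
        "--max-zero-dist", str(params["max_zero_dist"]),
    ]


combinations = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
print(len(combinations), "combinations")
print(" ".join(create_db_command(**combinations[0])))
```

Generating the output name and the flags from the same dictionary makes it harder for a label and the parameters actually used to drift apart between runs.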


PopPUNK v2.6.0 2022-11-17 17:21:39

Main changes:

  • Lineage fits now use reciprocal best match with --reciprocal-only, --count-unique-distances and --max-search-depth, which gives better results.
  • Fixes for threshold model assignment
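The general idea of a reciprocal match can be illustrated on a toy distance table: an edge between two samples is kept only if each sample is among the other's k nearest neighbours. This is a generic sketch for illustration, not PopPUNK's implementation, and it does not model the exact behaviour of the flags named above; `reciprocal_neighbours` and the sample distances are made up.

```python
def reciprocal_neighbours(distances, k):
    """Return sorted (a, b) pairs where a and b are in each other's k nearest neighbours.

    distances maps each sample to a dict of {other sample: distance}.
    """
    nearest = {
        sample: set(sorted(others, key=others.get)[:k])
        for sample, others in distances.items()
    }
    edges = set()
    for a, neighbours in nearest.items():
        for b in neighbours:
            if a in nearest.get(b, ()):
                edges.add(tuple(sorted((a, b))))
    return sorted(edges)


# Made-up pairwise distances between four samples
pairs = {("A", "B"): 0.1, ("A", "C"): 0.5, ("B", "C"): 0.4,
         ("A", "D"): 0.9, ("B", "D"): 0.8, ("C", "D"): 0.7}
distances = {}
for (a, b), d in pairs.items():
    distances.setdefault(a, {})[b] = d
    distances.setdefault(b, {})[a] = d

print(reciprocal_neighbours(distances, k=1))
print(reciprocal_neighbours(distances, k=2))
```

In this toy data, sample D never gains an edge, because none of the samples it is closest to have D among their own nearest neighbours.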

What's Changed

  • Update parsing fixes branch with new master by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/195
  • Update with recent changes to master by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/205
  • Integrate changes from V250 candidate by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/206
  • Small fixes for beebop by @muppi1993 in https://github.com/bacpop/PopPUNK/pull/217
  • Update poppunk_iterate.py by @BZhao95 in https://github.com/bacpop/PopPUNK/pull/216
  • Read the docs sphinx version by @johnlees in https://github.com/bacpop/PopPUNK/pull/215
  • Adds poppunk_distribute_fit.py by @samhorsfield96 in https://github.com/bacpop/PopPUNK/pull/226
  • Update with changes to master by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/230
  • Patch for relative paths in plot fit by @johnlees in https://github.com/bacpop/PopPUNK/pull/236
  • Fix option names for k-mer range min/max and k-mer step. by @tmaklin in https://github.com/bacpop/PopPUNK/pull/239
  • Fix for assign with threshold models by @johnlees in https://github.com/bacpop/PopPUNK/pull/240
  • Lineage model fitting - PopPUNK changes by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/232
  • Remove blas by @johnlees in https://github.com/bacpop/PopPUNK/pull/244
  • Proceed with update-db on QC failure by @johnlees in https://github.com/bacpop/PopPUNK/pull/245

New Contributors

  • @BZhao95 made their first contribution in https://github.com/bacpop/PopPUNK/pull/216
  • @samhorsfield96 made their first contribution in https://github.com/bacpop/PopPUNK/pull/226
  • @tmaklin made their first contribution in https://github.com/bacpop/PopPUNK/pull/239

Full Changelog: https://github.com/bacpop/PopPUNK/compare/v2.5.0...v2.6.0

PopPUNK v2.5.0 2022-08-25 10:40:19

Minimum sketchlib version for this release is v2.0.0

New features:

  • Dendropy replaced with faster & more reliable alternatives #203
  • A new logo #202
  • Improve iterative PopPUNK code
  • Documentation update and improvements #191
  • Deal better with name clash when querying #190
  • Make manual start a bit easier to use #174
  • Replace t-SNE with mandrake
  • Output .microreact files, and allow direct creation of Microreact instances with an API key
  • Various QC additions to help with multi-cluster merges #194

Bug fixes:

  • Various fixes to cytoscape visualisation #185 #196 #210
  • Hide progress bars when using --plot-fit
  • Stop always checking query-query dists when clustering (and potential bug adding them to network twice)
  • Fix N QC when working with reads #207

What's Changed

  • Upgrade of GPU refinement by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/164
  • Remove start_point concept from refine fit by @johnlees in https://github.com/bacpop/PopPUNK/pull/168
  • Upgrades to refinement functions by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/175
  • Update MST calculation documentation by @nickjcroucher in https://github.com/bacpop/PopPUNK/pull/177
  • Move the extend algorithm into the C++ extension by @johnlees in https://github.com/bacpop/PopPUNK/pull/178
  • Multi-boundary method by @johnlees in https://github.com/bacpop/PopPUNK/pull/180
  • Bacpop 17 by @muppi1993 in https://github.com/bacpop/PopPUNK/pull/201
  • Add rapidnj to docker image by @muppi1993 in https://github.com/bacpop/PopPUNK/pull/212
  • Release for v2.5.0 by @johnlees in https://github.com/bacpop/PopPUNK/pull/204
  • pip installable poppunk, almost by @richfitz in https://github.com/bacpop/PopPUNK/pull/209

New Contributors

  • @muppi1993 made their first contribution in https://github.com/bacpop/PopPUNK/pull/201
  • @richfitz made their first contribution in https://github.com/bacpop/PopPUNK/pull/209

Full Changelog: https://github.com/bacpop/PopPUNK/compare/v2.4.0...v2.5.0

PopPUNK v2.4.0 2021-03-23 17:18:43

Minimum sketchlib version for this release is v1.7.0

Using --gpu-graph requires cudf and cugraph to be installed from the nvidia conda channel (this is not part of the standard installation).

New features:

  • Adds minimum spanning tree computation and visualisation #141 #148
  • Add two new network scores based on betweenness #146
  • Move boundary code into a C++ extension in this package #146 #158
  • Adds GPU accelerated graphs #87 #148
  • Adds a docker container which is used for web.poppunk.net #151 #162
  • New github actions for testing and building the web API #151
  • Add progress bars in for model assignment #155
  • Parallelise model assignment #155
  • Adds the VLKC terminology, and 'unword' cluster names #161

Bug fixes:

  • Correctly specify thread count with rapidnj #139
  • Regenerate random match changes after --update-db #149
  • Fix issue with label order when using --update-db more than once #152
  • Update some scripts/ to work with newer versions of numpy and scikit-learn #160
  • Keep hyphens in sample names in trees #159
  • Fix a plot name #158
  • Pin some package versions #140 #142

PopPUNK v2.3.0 2020-12-31 15:33:02

This is a major (API-breaking) update which moves the assign and visualisation functions into their own programs, to make the program more modular. The minimum version of pp-sketchlib required is 1.6.0.

New features:

  • Lineage assign mode uses matrix code in pp-sketchlib #108
  • New algorithm for clique pruning #110
  • Visualisation and query moved out of main, and into their own programs #112 #115 #129
  • Simpler CLI defaults #125
  • Updated documentation #122
  • Add edge weights to graph #123
  • Add API for use of poppunk_assign with a http server #124 #131
  • Add corrected/uncorrected distances when plotting k-mer fits #136

Bug fixes:

  • More stable generation of documentation #132
  • Fixes continue mode for QC function #134
  • Fixes long length QC fail #137

PopPUNK v2.2.0 2020-09-30 14:40:43

The first bug fix will affect many results, and all users are encouraged to upgrade

New features:

  • More thorough sample QC using pp-sketchlib features (#101)
  • Update to pp-sketchlib v1.5.1 (#104)

Bug fixes:

  • Misordered labels with older versions of pp-sketchlib (#95)
  • TypeError with visualisations (#99)
  • networkx still used in reference prune program (#97)

Sketchlib 1.4.0 2020-07-22 14:52:30

  • Update to sketchlib 1.4.0 (#90) (https://github.com/johnlees/pp-sketchlib/releases/tag/v1.4.0)
  • Use faster, threaded matrix functions (#78, #80)
  • Change from networkx to graph-tool (#83)
  • Use sharedmem (#76)
  • Add --lineage-clustering mode (#72)
  • Better --refine-model default boundary (#94)

NB python >=3.8 is now required (#81, #76)

Bacterial population genetics

Pathogen Informatics and Modelling @ EMBL-EBI / Bacterial Evolutionary Epidemiology Group @ Imperial College London


bacteria genomics population-genetics k-mer sketching