pycoQC computes metrics and generates interactive QC plots from the sequencing summary report generated by Oxford Nanopore Technologies basecallers (Albacore/Guppy)

a-slide, updated 🕥 2022-10-03 20:12:23

pycoQC v2.5.2

PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data

PycoQC relies on the sequencing_summary.txt file generated by Albacore and Guppy, but if needed it can also generate a summary file from basecalled fast5 files. The package supports 1D and 1D2 runs generated with MinION, GridION and PromethION devices, basecalled with Albacore 1.2.1+ or Guppy 2.1.3+. PycoQC is written in pure Python 3; Python 2 is not supported. For a quick introduction, see the tutorial by Tim Kahlke available at https://timkahlke.github.io/LongRead_tutorials/QC_P.html

Full documentation is available at https://a-slide.github.io/pycoQC

Gallery

Example plots: summary, reads_len_1D_example, reads_qual_len_2D_example, channels_activity, output_over_time, qual_over_time, len_over_time, align_len, align_score, align_score_len_2D, alignment_coverage, alignment_rate, alignment_summary

Example HTML reports

Example JSON reports

Disclaimer

Please be aware that pycoQC is a research package that is still under development.

It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.

Thank you

Classifiers

  • Development Status :: 3 - Alpha
  • Intended Audience :: Science/Research
  • Topic :: Scientific/Engineering :: Bio-Informatics
  • License :: OSI Approved :: GNU General Public License v3 (GPLv3)
  • Programming Language :: Python :: 3

Licence

GPLv3 (https://www.gnu.org/licenses/gpl-3.0.en.html)

Copyright © 2020 Adrien Leger & Tommaso Leonardi

Authors

Issues

bam test data

opened on 2023-03-21 16:40:10 by bernt-matthias

Hi

I'm trying to fix the galaxy tool for pycoQC for BAM input (https://github.com/galaxyproject/tools-iuc/pull/5201). Wondering if you have input data that is supposed to work?

With the input data I'm using I get:

```
Job in error state.. tool_id: pycoqc, exit_code: 1, stderr:
Checking arguments values
Check input data files
Parse data files
Merge data
Cleaning data
Discarding lines containing NA values
0 reads discarded
Filtering out zero length reads
0 reads discarded
Sorting run IDs by decreasing throughput
Run-id order ['2bf3a5a5424e9267975cff54d2d8d1731fde919f']
Reordering runids
Processing reads with Run_ID 2bf3a5a5424e9267975cff54d2d8d1731fde919f / time offset: 0
Cast value to appropriate type
Reindexing dataframe by read_ids
9 Final valid reads
WARNING: Low number of reads found. This is likely to lead to errors when trying to generate plots
Loading plotting interface
Found 9 total reads
Found 9 pass reads (qual >= 7.0 and length >= 0)
Generating HTML report
Parsing html config file
Running method run_summary
    Computing plot
Running method basecall_summary
    Computing plot
Running method alignment_summary
/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning:

Mean of empty slice.

/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning:

invalid value encountered in double_scalars

    Computing plot
Running method read_len_1D
    Computing plot
Running method align_len_1D
    Computing plot

Traceback (most recent call last):
  File "/usr/local/bin/pycoQC", line 10, in <module>
    sys.exit(main_pycoQC())
  File "/usr/local/lib/python3.7/site-packages/pycoQC/__main__.py", line 132, in main_pycoQC
    quiet = args.quiet)
  File "/usr/local/lib/python3.7/site-packages/pycoQC/pycoQC.py", line 160, in pycoQC
    skip_coverage_plot=skip_coverage_plot)
  File "/usr/local/lib/python3.7/site-packages/pycoQC/pycoQC_report.py", line 89, in html_report
    fig = method(**method_args)
  File "/usr/local/lib/python3.7/site-packages/pycoQC/pycoQC_plot.py", line 489, in align_len_1D
    height=height)
  File "/usr/local/lib/python3.7/site-packages/pycoQC/pycoQC_plot.py", line 535, in __1D_density_plot
    lab1, dd1, ld1 = self.__1D_density_data ("all", field_name, x_scale, nbins, smooth_sigma)
  File "/usr/local/lib/python3.7/site-packages/pycoQC/pycoQC_plot.py", line 582, in __1D_density_data
    min = np.nanmin(data)
  File "<__array_function__ internals>", line 6, in nanmin
  File "/usr/local/lib/python3.7/site-packages/numpy/lib/nanfunctions.py", line 320, in nanmin
    res = np.fmin.reduce(a, axis=axis, out=out, **kwargs)
ValueError: zero-size array to reduction operation fmin which has no identity
```

seeking help on pycoqc output

opened on 2023-02-13 07:06:10 by rezarahman12

Dear pycoQC developer team, thanks for building this nice tool. My problem is not a bug but rather a request for help with the output file produced by pycoQC.

I generated an HTML file with pycoQC using the command below:

pycoQC -f sequencing_summary.txt -a aligned.bam -o Ctrl1_pycoQC_output.html

However, I cannot open the HTML file in Firefox or Google Chrome. Could you please tell me how to open the HTML file in a browser?

Best regards Reza

Generated summary file missing required column

opened on 2023-02-09 16:15:10 by skchronicles

Hello @a-slide,

I hope you are having a great day! I was testing out pycoQC and ran into an issue after generating a summaries file with Fast5_to_seq_summary.

Describe the bug

The Fast5_to_seq_summary output summaries file was passed to pycoQC and produced the following error:

    Traceback (most recent call last):
      File "/usr/local/bin/pycoQC", line 8, in <module>
        sys.exit(main_pycoQC())
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 115, in main_pycoQC
        pycoQC (
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC.py", line 120, in pycoQC
        parser = pycoQC_parse (
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 96, in __init__
        summary_reads_df = self._parse_summary()
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 136, in _parse_summary
        df = self._select_df_columns (
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 397, in _select_df_columns
        raise pycoQCError("Column {} not found in the provided sequence_summary file".format(col))
    pycoQC.common.pycoQCError: Column read_len not found in the provided sequence_summary file

To Reproduce

Steps to reproduce the behavior:

  1. Fast5_to_seq_summary command to generate the summary file:

         $ Fast5_to_seq_summary --threads 8 -f sample/fast5 -s summary.tsv --verbose 2

     Here are the first few lines of the output summary.tsv file:

         read_id                               run_id                                    channel  start_time
         000a1b52-fad6-4d6f-b113-c4b24013fcf9  8d6deda632c3a7303f91016b7707e7310e0bc054  256      42618
         0026ba30-0061-401d-8dc1-3cb556d71cb9  8d6deda632c3a7303f91016b7707e7310e0bc054  133      29349
         000d264a-1a98-4a55-beb5-9f02dd42fce2  8d6deda632c3a7303f91016b7707e7310e0bc054  170      42809
         001ddc14-ccb8-42c3-9fd3-74db3c431a75  8d6deda632c3a7303f91016b7707e7310e0bc054  110      42649
         0048af85-5c18-4745-b51e-2fab957aceab  8d6deda632c3a7303f91016b7707e7310e0bc054  61       42292
         00519880-3d53-4ee3-8528-7a388ad69b24  8d6deda632c3a7303f91016b7707e7310e0bc054  198      42873

     As you can see here, there is no column containing sequence/read length information.

  2. pycoQC command to generate the report:

         $ pycoQC -f summary.tsv -o test.html -j test.json --verbose

Expected behavior

I was expecting the summaries file generated by Fast5_to_seq_summary to be compatible with pycoQC. I also tried re-running the Fast5_to_seq_summary with the following fields option (to include everything):

--fields barcode_arrangement barcode_full_arrangement barcode_score calibration_strand_end calibration_strand_genome_template calibration_strand_identity calibration_strand_start called_events channel channel_digitisation channel_offset channel_range channel_sampling_rate device_id duration flow_cell_id mean_qscore_template protocol_run_id read_id read_number run_id sample_id sequence_length_template skip_prob start_mux start_time stay_prob step_prob strand_score

however, that did not seem to help, and I am getting the same error message.

I can see here, in your parser, that you are looking for these columns to rename and then check to see if they exist.
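A rough sketch of the "rename, then check" behaviour described here, using only the standard library. It is not pycoQC's actual parser: the alias tuple, the `MissingColumnError` class, the `normalise_header` helper and the choice of required columns are all assumptions made for illustration; only the column names and the error wording come from the messages quoted in this issue.

```python
import csv
import io

# Assumed alias list: names appear in this issue, the mapping itself is a guess.
LENGTH_ALIASES = ("sequence_length_template", "sequence_length_2d", "sequence_length")


class MissingColumnError(Exception):
    """Hypothetical stand-in for the error raised on a missing column."""


def normalise_header(header, required=("read_id", "read_len")):
    """Rename known length aliases to 'read_len', then check required columns."""
    renamed = ["read_len" if col in LENGTH_ALIASES else col for col in header]
    for col in required:
        if col not in renamed:
            raise MissingColumnError(
                "Column {} not found in the provided sequence_summary file".format(col)
            )
    return renamed


def read_header(handle):
    """Return the column names from the first line of a tab-separated file."""
    return next(csv.reader(handle, delimiter="\t"))


# Header as reported in this issue: no length column, so the check fails.
bad = read_header(io.StringIO("read_id\trun_id\tchannel\tstart_time\n"))
# Same header plus a length column under one of the assumed aliases.
good = read_header(
    io.StringIO("read_id\trun_id\tchannel\tstart_time\tsequence_length_template\n")
)

print(normalise_header(good))
try:
    normalise_header(bad)
except MissingColumnError as err:
    print(err)
```

Under these assumptions, a summary file that carries any one of the aliased length columns would pass the check, while the four-column file shown above would not.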

However, if I try to pass sequence_length_2d or sequence_length to the --fields option of Fast5_to_seq_summary, it errors out:

    Check input data and options
    Traceback (most recent call last):
      File "/usr/local/bin/Fast5_to_seq_summary", line 8, in <module>
        sys.exit(main_Fast5_to_seq_summary())
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 168, in main_Fast5_to_seq_summary
        Fast5_to_seq_summary (
      File "/usr/local/lib/python3.10/dist-packages/pycoQC/Fast5_to_seq_summary.py", line 119, in __init__
        raise pycoQCError ("Field {} is not valid, please choose among the following valid fields: {}".format(field, ",".join(self.attrs_grp_dict.keys())))
    pycoQC.common.pycoQCError: Field sequence_length_2d is not valid, please choose among the following valid fields: mean_qscore_template,sequence_length_template,called_events,skip_prob,stay_prob,step_prob,strand_score,read_id,start_time,duration,start_mux,read_number,channel,channel_digitisation,channel_offset,channel_range,channel_sampling_rate,run_id,sample_id,device_id,protocol_run_id,flow_cell_id,calibration_strand_genome_template,calibration_strand_end,calibration_strand_start,calibration_strand_identity,barcode_arrangement,barcode_full_arrangement,barcode_score

Desktop:

  • OS: Ubuntu 20.04
  • pycoQC Version: v.2.5.2, installed from pypi

If you need anything else, please let me know.

Best Regards, @skchronicles

About passed read number

opened on 2022-10-07 19:39:58 by wuy24

Describe the bug

Hi, I have a direct RNA ONT run. I ran pycoQC using the sequencing_summary file generated by Guppy. In the HTML report, the number of passed reads is 170979; however, the actual number of passed reads in the fastq file is 128456. I do not know why, and I wonder how pycoQC counts the passed reads. Attached is the sequencing_summary file. Could you please take a look? Thank you very much!

Best, Ying

Attachment: sequencing_summary_FAT59544_5f60aeab.txt.zip
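For context, the log quoted in the BAM issue above reports "pass reads (qual >= 7.0 and length >= 0)", which suggests the count is derived from thresholds applied to the summary file. The sketch below recounts reads that way with the standard library. It is an illustration under assumptions, not pycoQC's code: the `count_pass_reads` helper and its parameter names are hypothetical, and the column names `mean_qscore_template` and `sequence_length_template` are taken from the field list quoted elsewhere on this page.

```python
import csv
import io


def count_pass_reads(handle, min_pass_qual=7.0, min_pass_len=0):
    """Count total reads and reads meeting both thresholds (hypothetical helper)."""
    total = 0
    passed = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        total += 1
        qual = float(row["mean_qscore_template"])
        length = int(row["sequence_length_template"])
        if qual >= min_pass_qual and length >= min_pass_len:
            passed += 1
    return total, passed


# Tiny made-up summary file: three reads, one below the quality threshold.
example = io.StringIO(
    "read_id\tmean_qscore_template\tsequence_length_template\n"
    "r1\t9.5\t1200\n"
    "r2\t6.9\t800\n"
    "r3\t7.0\t300\n"
)
total, passed = count_pass_reads(example)
print(total, passed)  # 3 2
```

Running a recount like this on the real summary file and comparing it with the number of records in the fastq file may help narrow down where the two numbers diverge.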

Bug fix for Fast5_to_seq_summary for directories with 1 fast5 file

opened on 2022-10-03 20:12:23 by godotgildor

Currently, the code looks at the enumerate index i to see if a fast5 file was found in the given directory. However, if a directory does not have any files in it, then the enumeration loop won't get started and the i variable will actually be unbound, so the code as written would raise an UnboundLocalError. If there is a single fast5 file in the directory, then i will be valid, but it will be ==0, causing the current code to error saying that no fast5 files were found.

I've updated the logic to use a simple flag to indicate when we have actually inserted a fast5 file into the Queue during this loop.
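A minimal sketch of the two behaviours described above. The function names, the file names and the error message are made up for illustration and do not reproduce the code in Fast5_to_seq_summary.

```python
from queue import Queue


def enqueue_buggy(paths, queue):
    """Index-based check, as described: wrong for zero files and for one file."""
    for i, path in enumerate(paths):
        queue.put(path)
    # `i` is unbound when `paths` is empty, and equal to 0 for a single file.
    if i == 0:
        raise ValueError("No fast5 files found")


def enqueue_fixed(paths, queue):
    """Flag-based check: records whether anything was actually queued."""
    found = False
    for path in paths:
        queue.put(path)
        found = True
    if not found:
        raise ValueError("No fast5 files found")


# Empty directory: the index variable was never assigned.
try:
    enqueue_buggy([], Queue())
except UnboundLocalError as err:
    print("buggy, 0 files:", type(err).__name__)

# Single file: the index is 0, so the file is wrongly reported as missing.
try:
    enqueue_buggy(["read_0.fast5"], Queue())
except ValueError as err:
    print("buggy, 1 file:", err)

# The flag-based version accepts the single file.
q = Queue()
enqueue_fixed(["read_0.fast5"], q)
print("fixed, 1 file queued:", q.qsize())
```

The flag keeps the "nothing found" condition independent of how many items the loop visited, which is why it handles both edge cases.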

stat for each file/barcode

opened on 2022-05-25 14:03:32 by Hedi65

Is it possible to use PycoQC to get statistics such as the sequencing start and stop time for each file (each containing 4000 reads) belonging to each barcode?

thanks in advance
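Independently of pycoQC, per-file and per-barcode time spans could in principle be derived directly from the summary file. The sketch below is one way to do that with the standard library, under assumptions: the `time_span_per_group` helper is hypothetical, a `filename` column is assumed to be present in the summary file, and `start_time` and `duration` are assumed to be expressed in the same time unit (the `barcode_arrangement`, `start_time` and `duration` names come from the field list quoted earlier on this page).

```python
import csv
import io


def time_span_per_group(handle, keys=("filename", "barcode_arrangement")):
    """Return {group: (first_start, last_end)} for each combination of `keys`."""
    spans = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        group = tuple(row[key] for key in keys)
        start = float(row["start_time"])
        end = start + float(row["duration"])
        first, last = spans.get(group, (start, end))
        spans[group] = (min(first, start), max(last, end))
    return spans


# Made-up summary rows for illustration only.
example = io.StringIO(
    "filename\tbarcode_arrangement\tstart_time\tduration\n"
    "f0.fastq\tbarcode01\t10.0\t2.5\n"
    "f0.fastq\tbarcode01\t4.0\t1.0\n"
    "f0.fastq\tbarcode02\t7.0\t3.0\n"
    "f1.fastq\tbarcode01\t100.0\t5.0\n"
)
spans = time_span_per_group(example)
for group, (first, last) in sorted(spans.items()):
    print(group, first, last)
```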

Releases

2020-12-16 16:58:56

v2.5.0.23 2020-08-14 11:30:26

  • Add multi-fast5 support to Fast5_to_seq_summary (#49), thanks to @snajder-r's contribution "Made Fast5_to_seq_summary work on multi-fast5 files" (#115)
  • Add link to @timkahlke tutorial in documentation
  • Add bioconda badge

v2.5.0.21 2020-03-06 09:37:47

2.5.0.3 2019-09-27 11:12:07

This major update adds new functionality to pycoQC and improves a number of existing features:

  • BAM files parsing
  • 8 new alignment specific plots
  • Updated json format including alignment information when available
  • Option to filter out duplicated reads due to Guppy over-calling (--filter_duplicated)
  • New tool to split the sequencing summary file by barcode
  • Improve HTML report layout and add list of source files at the end
  • Update documentation and demo notebooks
  • Add Codacy integration

v2.3.1.6 Stable 2019-07-31 13:27:06

This release is mainly a fix for PromethION flowcells, for which pycoQC used to generate massive report files. It is much better now, but if the size is still an issue, it is recommended to leave out the channel activity plot.

v2.2.4 stable 2019-05-07 17:19:13

  • Rewrite argparse including better handling of input file regex

  • Rewrite doc with mkdocs.

  • Add methods to generate data dict and json file on top of HTML

  • Summary can now optionally split per run_id or barcode

Adrien Leger

Research scientist at Oxford Nanopore Technologies

jupyter-notebook generates-plots computing-metrics nanopore