PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data.
PycoQC relies on the sequencing_summary.txt file generated by Albacore and Guppy, but if needed it can also generate a summary file from basecalled fast5 files. The package supports 1D and 1D2 runs generated with MinION, GridION and PromethION devices, basecalled with Albacore 1.2.1+ or Guppy 2.1.3+. PycoQC is written in pure Python 3; Python 2 is not supported. For a quick introduction, see the tutorial by Tim Kahlke available at https://timkahlke.github.io/LongRead_tutorials/QC_P.html
Full documentation is available at https://a-slide.github.io/pycoQC
Please be aware that pycoQC is a research package that is still under development.
It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.
Thank you
GPLv3 (https://www.gnu.org/licenses/gpl-3.0.en.html)
Copyright © 2020 Adrien Leger & Tommaso Leonardi
Hi
I'm trying to fix the galaxy tool for pycoQC for BAM input (https://github.com/galaxyproject/tools-iuc/pull/5201). Wondering if you have input data that is supposed to work?
With the input data I'm using I get:
```
Job in error state.. tool_id: pycoqc, exit_code: 1, stderr:
Checking arguments values
Check input data files
Parse data files
Merge data
Cleaning data
Discarding lines containing NA values
0 reads discarded
Filtering out zero length reads
0 reads discarded
Sorting run IDs by decreasing throughput
Run-id order ['2bf3a5a5424e9267975cff54d2d8d1731fde919f']
Reordering runids
Processing reads with Run_ID 2bf3a5a5424e9267975cff54d2d8d1731fde919f / time offset: 0
Cast value to appropriate type
Reindexing dataframe by read_ids
9 Final valid reads
WARNING: Low number of reads found. This is likely to lead to errors when trying to generate plots
Loading plotting interface
Found 9 total reads
Found 9 pass reads (qual >= 7.0 and length >= 0)
Generating HTML report
Parsing html config file
Running method run_summary
Computing plot
Running method basecall_summary
Computing plot
Running method alignment_summary
/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning:
Mean of empty slice.
/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning:
invalid value encountered in double_scalars
Computing plot
Running method read_len_1D
Computing plot
Running method align_len_1D
Computing plot
Traceback (most recent call last):
File "/usr/local/bin/pycoQC", line 10, in
```
Dear pycoQC developer team. Thanks for building this nice tool. My problem is not an issue but rather a request for help with the output file produced by pycoQC.
I generated an HTML file with pycoQC using the command below:
pycoQC -f sequencing_summary.txt -a aligned.bam -o Ctrl1_pycoQC_output.html
However, I cannot open the HTML file in Firefox or Google Chrome. Could you please tell me how to open the HTML file in a browser?
Best regards Reza
Hello @a-slide,
I hope you are having a great day! I was testing out pycoQC and ran into an issue after generating a summaries file with `Fast5_to_seq_summary`.
Describe the bug
The `Fast5_to_seq_summary` output summaries file was passed to `pycoQC` and produced the following error:
```text
Traceback (most recent call last):
  File "/usr/local/bin/pycoQC", line 8, in <module>
    sys.exit(main_pycoQC())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 115, in main_pycoQC
    pycoQC (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC.py", line 120, in pycoQC
    parser = pycoQC_parse (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 96, in __init__
    summary_reads_df = self._parse_summary()
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 136, in _parse_summary
    df = self._select_df_columns (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 397, in _select_df_columns
    raise pycoQCError("Column {} not found in the provided sequence_summary file".format(col))
pycoQC.common.pycoQCError: Column read_len not found in the provided sequence_summary file
```
To Reproduce
Steps to reproduce the behavior:
`Fast5_to_seq_summary` command to generate the summary file:
```bash
$ Fast5_to_seq_summary --threads 8 -f sample/fast5 -s summary.tsv --verbose 2
```
Here are the first few lines of the output `summary.tsv` file:
```text
read_id run_id channel start_time
000a1b52-fad6-4d6f-b113-c4b24013fcf9 8d6deda632c3a7303f91016b7707e7310e0bc054 256 42618
0026ba30-0061-401d-8dc1-3cb556d71cb9 8d6deda632c3a7303f91016b7707e7310e0bc054 133 29349
000d264a-1a98-4a55-beb5-9f02dd42fce2 8d6deda632c3a7303f91016b7707e7310e0bc054 170 42809
001ddc14-ccb8-42c3-9fd3-74db3c431a75 8d6deda632c3a7303f91016b7707e7310e0bc054 110 42649
0048af85-5c18-4745-b51e-2fab957aceab 8d6deda632c3a7303f91016b7707e7310e0bc054 61 42292
00519880-3d53-4ee3-8528-7a388ad69b24 8d6deda632c3a7303f91016b7707e7310e0bc054 198 42873
```
As you can see here, there is no column containing sequence/read length information.
`pycoQC` command to generate the report:
```bash
$ pycoQC -f summary.tsv -o test.html -j test.json --verbose
```
Expected behavior
I was expecting the summaries file generated by `Fast5_to_seq_summary` to be compatible with `pycoQC`. I also tried re-running `Fast5_to_seq_summary` with the following fields option (to include everything):
--fields barcode_arrangement barcode_full_arrangement barcode_score calibration_strand_end calibration_strand_genome_template calibration_strand_identity calibration_strand_start called_events channel channel_digitisation channel_offset channel_range channel_sampling_rate device_id duration flow_cell_id mean_qscore_template protocol_run_id read_id read_number run_id sample_id sequence_length_template skip_prob start_mux start_time stay_prob step_prob strand_score
however, that did not seem to help, and I am getting the same error message.
I can see here, in your parser, that you are looking for these columns to rename and then check to see if they exist.
However, if I try to pass `sequence_length_2d` or `sequence_length` to the `--fields` option of `Fast5_to_seq_summary`, it errors out:
```text
Check input data and options
Traceback (most recent call last):
  File "/usr/local/bin/Fast5_to_seq_summary", line 8, in <module>
    sys.exit(main_Fast5_to_seq_summary())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 168, in main_Fast5_to_seq_summary
    Fast5_to_seq_summary (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/Fast5_to_seq_summary.py", line 119, in __init__
    raise pycoQCError ("Field {} is not valid, please choose among the following valid fields: {}".format(field, ",".join(self.attrs_grp_dict.keys())))
pycoQC.common.pycoQCError: Field sequence_length_2d is not valid, please choose among the following valid fields: mean_qscore_template,sequence_length_template,called_events,skip_prob,stay_prob,step_prob,strand_score,read_id,start_time,duration,start_mux,read_number,channel,channel_digitisation,channel_offset,channel_range,channel_sampling_rate,run_id,sample_id,device_id,protocol_run_id,flow_cell_id,calibration_strand_genome_template,calibration_strand_end,calibration_strand_start,calibration_strand_identity,barcode_arrangement,barcode_full_arrangement,barcode_score
```
If you need anything else, please let me know.
Best Regards, @skchronicles
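For anyone hitting the same error, a quick pre-check of the summary file header can show whether any read-length column is present before running pycoQC. The following is only a minimal sketch: the helper name `find_length_column` is hypothetical, and the list of candidate column names is taken from this report, not from pycoQC's actual lookup table.

```python
import csv
import io

# Candidate read-length column names mentioned in this report.
# Assumption: this is NOT guaranteed to match pycoQC's internal list.
LENGTH_COLUMNS = ("sequence_length_template", "sequence_length_2d", "sequence_length")


def find_length_column(handle):
    """Return the first candidate length column found in the TSV header, else None."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    for name in LENGTH_COLUMNS:
        if name in header:
            return name
    return None


# Header of the summary.tsv shown in this report: no length column is present.
reported_header = io.StringIO("read_id\trun_id\tchannel\tstart_time\n")
print(find_length_column(reported_header))

# A header that does contain one of the candidate columns.
other_header = io.StringIO("read_id\trun_id\tsequence_length_template\n")
print(find_length_column(other_header))
```

With a real file, the same function can be called on `open("summary.tsv", newline="")`.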
Describe the bug Hi, I have a direct RNA ONT run. I ran pycoQC using the sequencing_summary file generated by Guppy. In the HTML file, I can see that the number of passed reads is 170979; however, the actual number of passed reads in the fastq file is 128456. I do not know why, and I wonder how pycoQC counts the passed reads. Attached is the sequencing_summary file. Could you please take a look? Thank you very much!
Best, Ying sequencing_summary_FAT59544_5f60aeab.txt.zip
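One way to investigate such a discrepancy is to count the reads in the summary file in two ways and compare. This is a rough sketch, not pycoQC's implementation: the thresholds (quality >= 7.0, length >= 0) are the ones printed in a pycoQC log elsewhere on this page, the `passes_filtering` column is assumed to be present as written by Guppy, and the three data rows are made up for illustration. Whether a difference in cutoffs explains the numbers above is not confirmed.

```python
import csv
import io


def count_pass_reads(handle, min_qual=7.0, min_len=0):
    """Return (total, passing by threshold, passing by basecaller flag)."""
    total = by_threshold = by_flag = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        total += 1
        qual = float(row["mean_qscore_template"])
        length = int(row["sequence_length_template"])
        # Threshold-based count, using the cutoffs passed as arguments
        if qual >= min_qual and length >= min_len:
            by_threshold += 1
        # Flag-based count; assumes a Guppy-style TRUE/FALSE column
        if row.get("passes_filtering", "").upper() == "TRUE":
            by_flag += 1
    return total, by_threshold, by_flag


# Made-up rows for illustration only
toy_summary = io.StringIO(
    "read_id\tpasses_filtering\tsequence_length_template\tmean_qscore_template\n"
    "r1\tTRUE\t500\t9.1\n"
    "r2\tFALSE\t300\t7.2\n"
    "r3\tFALSE\t800\t5.0\n"
)
print(count_pass_reads(toy_summary))
```

If the two counts differ on the real file, the basecaller probably applied a different cutoff than the one used for the threshold count.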
Currently, the code looks at the enumerate index `i` to see if a fast5 file was found in the given directory. However, if a directory does not have any files in it, then the enumeration loop won't get started and the `i` variable will actually be unbound, so the code as written would raise an `UnboundLocalError`. If there is a single fast5 file in the directory, then `i` will be valid, but it will be `== 0`, causing the current code to error saying that no fast5 files were found.
I've updated the logic to use a simple flag to indicate when we have actually inserted a fast5 file into the Queue during this loop.
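The pattern can be illustrated with a simplified reconstruction. The function names, the directory listing logic and the error messages below are hypothetical and are not copied from the pycoQC source; only the two failure modes and the flag-based fix follow the description above.

```python
import os
import tempfile
from queue import Queue


def enqueue_fast5_buggy(directory, queue):
    """Hypothetical version of the old logic, relying on the loop index."""
    for i, name in enumerate(sorted(os.listdir(directory))):
        if name.endswith(".fast5"):
            queue.put(os.path.join(directory, name))
    # `i` is unbound if the directory is empty, and 0 if it holds one file
    if i == 0:
        raise ValueError("No fast5 files found in {}".format(directory))


def enqueue_fast5(directory, queue):
    """Flag-based version: record whether anything was actually enqueued."""
    found = False
    for name in sorted(os.listdir(directory)):
        if name.endswith(".fast5"):
            queue.put(os.path.join(directory, name))
            found = True
    if not found:
        raise ValueError("No fast5 files found in {}".format(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Empty directory: the buggy version fails with UnboundLocalError
    try:
        enqueue_fast5_buggy(tmp, Queue())
    except UnboundLocalError as err:
        print("empty directory:", type(err).__name__)

    # Single fast5 file: the buggy version wrongly reports that none was found
    open(os.path.join(tmp, "read.fast5"), "w").close()
    try:
        enqueue_fast5_buggy(tmp, Queue())
    except ValueError as err:
        print("single file:", type(err).__name__)

    # The flag-based version enqueues the single file without error
    q = Queue()
    enqueue_fast5(tmp, q)
    print("enqueued:", q.qsize())
```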
Is it possible to get statistics such as the sequencing start and stop time for each file (each containing 4000 reads) belonging to each barcode using pycoQC?
thanks in advance
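Independently of what pycoQC itself reports, such per-file time windows can be derived from the sequencing summary file. The sketch below is a workaround, not a pycoQC feature: the `filename` column name is an assumption (it varies between basecaller versions), `start_time`, `duration` and `barcode_arrangement` are taken from the field list quoted elsewhere on this page, and the rows are made up.

```python
import csv
import io


def time_window_per_file(handle, file_col="filename", barcode_col="barcode_arrangement"):
    """Map (barcode, file) to (earliest read start, latest read end), in seconds."""
    windows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row[barcode_col], row[file_col])
        start = float(row["start_time"])
        stop = start + float(row["duration"])
        lo, hi = windows.get(key, (start, stop))
        windows[key] = (min(lo, start), max(hi, stop))
    return windows


# Made-up rows for illustration only
toy_summary = io.StringIO(
    "filename\tbarcode_arrangement\tstart_time\tduration\n"
    "f0.fast5\tbarcode01\t10.0\t2.0\n"
    "f0.fast5\tbarcode01\t4.0\t1.5\n"
    "f1.fast5\tbarcode02\t20.0\t3.0\n"
)
for (barcode, filename), (first, last) in sorted(time_window_per_file(toy_summary).items()):
    print(barcode, filename, first, last)
```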
This major update adds new functionalities to pycoQC and improves a number of existing ones:
This release is mainly a fix for PromethION flowcells, for which pycoQC used to generate massive report files. It's much better now, but if the size is still an issue it is recommended to drop the channel activity plot.
Rewrite argparse including better handling of input file regex
Rewrite doc with mkdocs.
Add methods to generate data dict and json file on top of HTML
Summary can now optionally be split per run_id or barcode