Preseq
Supported Tool
Preseq estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing total read count.
Description
The Preseq module parses results generated by Preseq, a tool that estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing total read count.
When preseq lc_extrap
is run with the default parameters, the extrapolation points
reach 10 billion molecules making the plot difficult to interpret in most scenarios.
It also includes a lot of data in the reports, which can unnecessarily inflate report
file sizes. To avoid this, MultiQC trims back the x axis until each dataset
shows 80% of its maximum y-value (unique molecules).
To disable this feature and show all of the data, add the following to your MultiQC configuration:
Using coverage instead of read counts
Preseq reports its numbers as “Molecule counts”. This isn’t always very intuitive, and it’s often easier to talk about sequencing depth in terms of coverage. You can plot the estimated coverage instead by specifying the reference genome or target size, and the read length in your MultiQC configuration:
These parameters make the script take every molecule count and divide it by (genome_size / read_length).
MultiQC comes with effective genome size presets for Human and Mouse, so you can
provide the genome build name instead, like this: genome_size: hg38_genome
. The
following values are supported: hg19_genome
, hg38_genome
, mm10_genome
.
When the genome and read sizes are provided, MultiQC will plot the molecule counts on the X axis (“total” data) and coverages on the Y axis (“unique” data). However, you can customize what to plot on each axis (counts or coverage), e.g.:
Plotting externally calculated read counts
To mark on the plot the read counts calculated externally from BAM or fastq files,
create a file with preseq_real_counts
in the filename and place it with your analysis files.
It should be space or tab delimited with 2 or 3 columns (column 1 = preseq file name,
column 2 = real read count, optional column 3 = real unique read count). For example:
You can generate a line for such a file using samtools:
File search patterns
preseq:
- contents: EXPECTED_DISTINCT
num_lines: 2
- contents: distinct_reads
num_lines: 2
preseq/real_counts:
fn: "*preseq_real_counts*"