Correcting duplication effects in sequencing-based genotypes
Most pipelines for calling genotypes or providing allelic counts from sequencing data assume that each of the possible alleles has been sampled at random. For a biallelic SNP, this corresponds to binomial sampling with the observed read depth as the sample size. However, the processes involved in generating the sequence reads can sometimes lead to duplication events, whereby a particular sequence is copied and read multiple times. This leads to greater variation in allele counts than the binomial model predicts, which can result in false inference of homozygous genotypes. A common practice with randomly sheared DNA fragments is to remove exact duplicate reads, but for restriction enzyme-based reduced representation sequencing (RE-RRS) some exact duplicate reads are expected. We investigate duplication effects in RE-RRS data generated on an Illumina NovaSeq 6000. Duplication events specific to patterned flowcells produce duplicates that tend to be spatially close to the original read, which allows bioinformatic deduplication by unsupervised spatial clustering of candidate duplicates. We have found that this may need to be complemented by statistical models that allow for extra-binomial variation. Model parameters can be estimated using results from parents and their offspring, or from repeated results on the same individual. Combining these two approaches supports suitable downstream analyses despite the presence of duplications in the sequencing results.
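The extra-binomial variation described above can be illustrated with a small simulation. The sketch below is a toy model and not the method used in this work: it assumes a heterozygous biallelic SNP, a fixed read depth, and that every sampled molecule is read exactly `dup_size` times. The function name `simulate_ref_counts`, the depth of 8, and the fixed duplicate size are all hypothetical choices made for illustration.

```python
import random
from statistics import pvariance


def simulate_ref_counts(depth, dup_size, n_sims, seed=0):
    """Simulate reference-allele read counts at a heterozygous biallelic SNP.

    Toy assumption: each original molecule carries the reference allele
    with probability 0.5 and is read exactly `dup_size` times, so
    dup_size=1 corresponds to plain binomial sampling.
    """
    rng = random.Random(seed)
    n_molecules = depth // dup_size  # independent molecules behind the reads
    counts = []
    for _ in range(n_sims):
        ref_molecules = sum(rng.random() < 0.5 for _ in range(n_molecules))
        counts.append(ref_molecules * dup_size)
    return counts


depth = 8  # hypothetical read depth
for dup_size in (1, 2, 4):
    counts = simulate_ref_counts(depth, dup_size, n_sims=20000)
    variance = pvariance(counts)
    # Fraction of simulated heterozygotes in which only one allele was seen.
    apparent_hom = sum(c in (0, depth) for c in counts) / len(counts)
    print(f"dup_size={dup_size}: variance={variance:.2f}, "
          f"apparent homozygotes={apparent_hom:.3f}")
```

Under this toy model the variance of the reference count is `dup_size * depth / 4`, compared with `depth / 4` for binomial sampling, and the chance that a heterozygote shows only one allele is `2 * 0.5 ** (depth / dup_size)`. Real duplication is unlikely to have a fixed copy number, which is one reason a model with an estimated extra-binomial parameter may be needed.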
Rights statement
This is an open-access output. It may be used, distributed or reproduced in any medium, provided the original author and source are credited.
Publication date
2023-11-22
Project number
- 49050
Language
- English
Does this contain Māori information or data?
- No