Welcome to Repeatscape

Repeatscape (http://repeatscape.utgenome.org/) is a database of occurrence frequencies for short-read sequences, ranging from 25 to 36 bases in length.

Motivation

Revolutionary new sequencers can produce billions of nucleotides in a fairly short period, but they suffer from a major drawback: the sequenced reads are only 25-36 nt long and are therefore likely to align to multiple locations in the genome.

This ambiguity prevents a short read from being anchored to its originating location and may lead to the incorrect conclusion that no reads are observed from the original position; the ambiguity should instead be treated as missing information. Moreover, sequencing errors in a short read may cause the read to align, with no mismatches, to a unique but false-positive position.

To resolve these issues, the specificity of every short region in the genome should be precomputed. In regions of low specificity, two possibilities need to be taken into account: the absence of a read may indicate a lack of information, and the presence of reads may reflect false-positive alignments caused by sequencing errors.

In contrast, the absence and presence of reads are more reliable in highly specific regions. We measured the specificity of every short region as the number of its occurrences in the entire genome, while taking into account the effect of sequencing errors. The resulting database is a collection of the frequencies of all short regions and is flexible enough to be combined with a wide variety of genomic information.
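
The idea of measuring specificity as an occurrence count can be illustrated with a small sketch. This is not the Repeatscape implementation: the function names (kmer_counts, hamming, frequency), the toy genome, the forward-strand-only counting, and the choice to model sequencing errors as a bounded number of substitutions are all assumptions made for illustration. A region with a count of 1 is unique; larger counts indicate lower specificity.

```python
from collections import Counter


def kmer_counts(genome: str, k: int) -> Counter:
    """Count exact occurrences of every length-k window (forward strand only)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def frequency(region: str, counts: Counter, max_mismatches: int = 0) -> int:
    """Occurrences of `region`, tolerating up to `max_mismatches` substitutions.

    Assumption: sequencing errors are modelled as substitutions only.
    """
    return sum(n for kmer, n in counts.items()
               if hamming(region, kmer) <= max_mismatches)


# Hypothetical toy genome; real short regions would be 25 to 36 bases long.
toy_genome = "ACGTACGAACGT"
counts = kmer_counts(toy_genome, 4)

exact = frequency("ACGT", counts)                    # exact matches only
near = frequency("ACGT", counts, max_mismatches=1)   # also counts "ACGA"
unique = frequency("GTAC", counts)                   # a region that occurs once
print(exact, near, unique)
```

The linear scan over all distinct windows in frequency is only suitable for toy input; a genome-scale computation would need an indexed data structure, which is outside the scope of this sketch.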