Uppercase vs lowercase letters in reference genome

What does this soft masking actually mean?

A large fraction of the sequence in genomes is repetitive. The human genome, for example, is estimated to be (at least) two-thirds repetitive elements [1].

These repetitive elements are soft-masked by converting their upper-case letters to lower case. An important use of soft-masked bases is in homology searches: a sequence such as atatatatatat will tend to appear in both the human and mouse genomes but is likely non-homologous.
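As a rough illustration of what soft-masking looks like in practice, the sketch below finds the lowercase runs in a sequence string and reports how much of it is masked. The function names and the toy sequence are made up for this example; a real genome would be read from a FASTA file one chromosome at a time.

```python
import re

def soft_masked_intervals(seq):
    """Return half-open (start, end) intervals of lowercase (soft-masked) runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

def soft_masked_fraction(seq):
    """Return the fraction of positions written in lower case."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)

# Toy sequence (hypothetical): two soft-masked stretches inside unmasked sequence.
example = "ACGTacgtacgtGGCCatatatTTAA"

print(soft_masked_intervals(example))
print(round(soft_masked_fraction(example), 3))
```

A homology search tool that honours soft-masking would typically avoid seeding alignments inside these intervals while still being able to extend through them.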

How confident can I be about the sequence in these regions?

As confident as you can be about positions that are not soft-masked. Soft-masking is applied after determining which portions of the genome are likely repetitive. It adds no uncertainty about whether a particular base is 'A' or 'G'; it only records that the base is part of a repeat and is therefore written as 'a' or 'g'.
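To make the point concrete: because soft-masking only changes letter case, the underlying base calls can be recovered by upper-casing. A minimal sketch, with a hypothetical helper name and toy sequences:

```python
def unmask(seq):
    """Drop soft-masking by upper-casing; the base identities are unchanged."""
    return seq.upper()

# Hypothetical example: the same region with and without soft-masking.
soft_masked = "ACGTacgtacgtGGCC"
unmasked = "ACGTACGTACGTGGCC"

# Case is the only difference, so the comparison succeeds after upper-casing.
print(unmask(soft_masked) == unmasked)
```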

What does a lowercase n represent?

UCSC uses Tandem Repeats Finder and RepeatMasker to soft-mask potential repeats. NCBI most likely uses TANTAN. An 'N' indicates that no sequence information is available for that base. Its replacement by 'n' is likely an artifact of the repeat-masking software, which soft-masks an 'N' to an 'n' to indicate that this portion of the genome is likely a repeat too.
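If the goal is simply to locate positions with no sequence information, one option is to treat 'N' and 'n' alike. The sketch below does that on a toy string; the function name and the sequence are illustrative assumptions, not part of any particular tool.

```python
import re

def gap_runs(seq):
    """Return half-open (start, end) intervals of N/n runs, ignoring case."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]

# Toy sequence (hypothetical): one gap that is partly lowercase, one that is not.
example = "ACGTNNNNnnnnacgtNNAC"

print(gap_runs(example))
print(example.count("n"))  # number of lowercase n positions
```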

answered May 24, 2017 at 6:01 by Saket Choudhary

Informative answer, but I think it's controversial to say the human genome is "(at least) two-third repetitive elements"; the P-clouds method you cite is quite permissive and half is a more commonly accepted figure. And soft-masking doesn't involve masking all repeats generally, just interspersed repeats and low complexity sequences. Also there is always uncertainty around base calling and assembly building, and more so for repetitive sequences, although mm10 is one of the best assemblies of course.

Commented May 25, 2017 at 17:04

The use of lower/upper case letters and N/n letters in genome sequences is not completely standardised, and you should always check the specification of the resource you are using.

Lower case letters are most commonly used to represent “soft-masked sequences”, a convention popularised by RepeatMasker, where interspersed repeats (which cover transposons, retrotransposons and processed pseudogenes) and low complexity sequences are marked with lower case letters. Note that larger repeats, such as sizable tandem repeats, segmental duplications, and whole gene duplications, are not generally masked.

However, there are other uses for lower/upper case letters; for example, Ensembl have used upper/lower case letters to represent exonic and intronic sequences respectively.

N and n nucleotides may represent “hard-masked sequences”, where interspersed repeats and low complexity sequences are replaced by Ns. But N/n may alternatively represent ambiguous nucleotides; indeed, this is the IUPAC specification.

Also note that occasionally (although fortunately rarely) X/x is used to represent ambiguous nucleotides or “hard-masked sequences” too.
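The difference between soft- and hard-masking can be sketched in a few lines. The helper below is a hypothetical illustration that converts a soft-masked string to a hard-masked one by replacing every lowercase letter with a mask character; it assumes the input follows the lowercase-means-repeat convention, which, as noted above, should be checked for each resource.

```python
import re

def hard_mask(seq, mask_char="N"):
    """Replace every lowercase (soft-masked) letter with mask_char."""
    return re.sub(r"[a-z]", mask_char, seq)

# Toy sequence (hypothetical): a soft-masked repeat plus a lowercase gap.
soft = "ACGTacgtGGnnTT"

print(hard_mask(soft))
print(hard_mask(soft, mask_char="X"))  # the rarer X convention
```

Unlike soft-masking, this conversion is lossy: once the repeat is written as Ns, it can no longer be told apart from a genuine gap or an ambiguous base.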