A reference genome is a database assembled from whole-genome sequencing data from multiple members of a species of interest. The assembly is treated as a representative example of the genome organization of a typical member of the species. For an organism to be studied effectively from a genetics perspective, a high-quality reference genome is a must.
Among other things, it facilitates comparative genomics studies with other species. The purpose of such studies is usually to identify genetic mutations that might cause disease or confer a particular phenotype. Many discoveries in human medical genomics have their roots in comparative genomics studies. A high-accuracy, ‘gapless’ reference genome ensures that the significance of genetic variation is properly understood and can be used for novel findings. In other words, the reference genome is a ‘gold standard’ against which DNA sequencing information from other members of the same species is compared. If there are mistakes in the reference genome, benign and deleterious mutations, and their frequencies, could be misinterpreted.
The current feline reference genome – positives and negatives
The most current version of the feline reference genome, Felis_catus_9.0, was published in January 2020 and is a significant improvement over the previous reference genome (Felis_catus_8.0). For example, the contig N50 of Felis_catus_9.0 is 42 Mb, meaning that contigs of at least this length contain half of the assembled sequence. This corresponds to a roughly 1000-fold increase in ungapped sequence length compared with Felis_catus_8.0, and it surpasses the N50 values of all other carnivore reference genome assemblies.
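To make the N50 statistic concrete, here is a minimal sketch of how it can be computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the Felis_catus_9.0 assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases).

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled sequence.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy assembly (total length 300), not real feline data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In this toy case the two longest contigs (80 and 70) already cover half of the 300 bases, so the N50 is 70. A larger N50 means that more of the assembly sits in long, uninterrupted stretches of sequence.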
Felis_catus_9.0 was assembled using long-read sequencing technology and is thus a superior tool for identifying structural variants (SVs), especially ones spanning multiple megabases. In fact, more variants were identified with Felis_catus_9.0 than with reference genomes for other mammals, such as dog, sheep, horse, pig and cow. However, long-read sequencing technology is known to be biased towards introducing insertion or deletion (indel) errors in homopolymer regions.
Our analysis of the Basepaws internal genomic database reveals a total of 82,362 sites in Felis_catus_9.0 where the non-reference allele frequency is 100%. Of these sites, 87% (i.e., 72,139 sites) are indels, which indicate potential misassembly errors in the reference genome. Our analysis also shows that 2,794 of these putative misassemblies fall within 2,215 exonic regions.
Reference genome errors in protein-coding regions, particularly exons, can have serious negative consequences for genomic medicine applications. Two such errors (identified among the 82,362 potential misassemblies) are found in the tyrosinase gene (tyr). In Felis_catus_9.0, tyr has a frameshift in the middle of its sequence, resulting in an incorrectly translated protein.
This is illustrated in the figure below, where we compare the tyr gene sequence from a previously published feline study with the tyr sequence obtained from Felis_catus_9.0. The sequence from the reference genome contains an insertion of a G and a deletion of an A (marked in red). These errors lead to an incorrect partial translation, which we identified by aligning the tyr protein sequence from Felis_catus_9.0 with the previously published feline tyr sequence and with tyr sequences from related mammalian species: Puma concolor (cougar), Canis lupus familiaris (dog), Bos taurus (cow) and Homo sapiens (human). The protein sequence region modified as a consequence of the frameshift is highlighted in yellow.
Blue and red colors were used to indicate conserved and altered amino acids, respectively.
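The effect of a paired insertion and deletion on translation can be sketched with a toy example. The sequences below are invented for illustration and are not the real tyr sequence; the names `published`, `reference_like` and `translate` are likewise assumptions. The sketch uses the standard genetic code and shows how an inserted G shifts the reading frame until a downstream deletion of an A restores it, so only the stretch between the two errors is mistranslated.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical "published" coding sequence (toy data, not tyr).
published = "ATGGCTAAACCCGATCTCGAACATTGG"

# Mimic the two reference errors: insert a G, then delete an A downstream.
with_insertion = published[:6] + "G" + published[6:]
reference_like = with_insertion[:21] + with_insertion[22:]

protein_published = translate(published)
protein_reference = translate(reference_like)

# Mark altered residues with '^' and unchanged residues with '.'.
marks = "".join(
    "." if a == b else "^" for a, b in zip(protein_published, protein_reference)
)

print(protein_published)
print(protein_reference)
print(marks)
```

In this toy case the first two and last two residues agree, while the five residues between the insertion and the deletion differ. This mirrors the pattern described above, where a region of the translated protein is altered and the flanking sequence is conserved.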