11/07/2001 (updated 7/15/2002)
Abstract: This is a short summary of a study of the sensitivity and specificity of various alignment strategies for the human and mouse genomes. The tests are all based on the nine testing regions (this allowed us to simulate different levels of shotgun coverage and to verify the correctness of alignments). While these regions are not representative of the whole genome, based on controls we show that our results are applicable to whole genome alignments. Our tests are based on the BLAT (translated and untranslated) alignment tool, although again this is not a particularly restrictive assumption, in the sense that we believe that many of our conclusions should hold for other alignment methods as well. Our main conclusions are:
|
Simulated reads (3x and 5x coverage) were generated for the nine testing regions. Contigs are also used in this testing and were assembled with the Arachne program [1] from these simulated reads (thanks to Serafim Batzoglou). Reads and contigs were masked with RepeatMasker. Alignments were computed with both DNA blat and translated BLAT [2] (against the whole genome - thanks to Jim Kent for use of the BLAT webserver, local BLAT, and of course for generating the mouse BLAT track all of which were used extensively in this study) with simulated reads and contigs at the various levels of coverage, as well as WGS reads. Hits were sorted using Jim Kent's scoring scheme and only hits with a score above 30 were kept. The score is defined by BLAT as:
score = match - mismatch - gapPenalties |
This cutoff was applied by Jim Kent for the UCSC mouse track and for the statistics that have been mailed. We discuss in the next paragraphs alternatives for the selection of "hits". Again, we would like to emphasize that we tested with BLAT because of the speed and availability of the tool, and though results might be different with different aligners, the general patterns of sensitivity/specificity tradeoff we are concerned with should be similar.
The hit specificity is defined by:
specificity = number of correct hits / number of hits returned |
where "correct" means that the hit is inside the region where the read comes from. That is, we are aligning simulated reads from the testing regions against the whole human genome and checking whether they hit in the orthologous human region to the mouse region they were generated from (so we aren't requiring exact placement within the actual region). Hits are pre-selected using a filter and there are different possible filters. We compared two types of filters where the hits are selected according to their score:
![]() Fig.1: Secondary filter: all the hits with a score>30 are kept |
![]() Fig.2: Primary filter, filters out false positives |
The 9 testing regions represent about 5Mb and are covered by 20,367 reads (simulated with 3X coverage). Using Translated Blat 3,699 (18%) of them returned at least one hit.
Using the "secondary filter", this resulted in 13,900 hits with a score>30. Only 8,085 hits were correct, i.e. hit in the same region the reads originated from. This yields a specificity of 58% for the filter, which means that 42% of the hits are outside the regions and somewhere else on the genome.
When we used the "primary filter", we found a much better specificity of 87% (7460/8616) for the hits. In that case only 13% of the hits were placed incorrectly. 3154 reads have been correctly placed with a specificity of 89% (3154/3699).
We also examined alignments of the 16 million WGS mouse reads on the human genome, using "the secondary filter". Because of the poor specificity, there were 23,016 hits with a score > 30 on the 9 testing regions. This is in contrast to the 7,460 correct hits from the simulated 3x reads. This means that roughly 2/3 of the hits come from incorrect placements from the rest of the genome [see Note 1].
Table I shows Translated Blat sensitivities (in bases pair hit) for each of the regions using simulated 3X reads. In the first column, we use the "primary filter" to compute the specificity.
In the second column, blat results for the 9 testing regions are downloaded from UCSC genome browser, and are used to compute the sensitivities, and the specificity come from the analysis described above. As we explained the sensitivities in this column are boosted by misplaced hits from the whole genome.
One can see that translated Blat remains a very sensitive tool, and we have shown that a very high specificity, can be obtained with a tighter filter. The extra specificity does, however, come at a slight cost of sensitivity. Notice that the "only intron" statistics are not really significant in this experiment because of the bias generated by the HoxA cluster which is very highly conserved and has many such "only introns".
Table I: sensitivities (in bases pair hit) of Translated Blat using 3X simulated reads on the nine testing regions. Values are compared with the one obtained from the Mouse Track alignement. |
--------------------------------------------- primary from UCSC's whole genome alignment --------------------------------------------- hits specificity: 87% 58% --------------------------------------------- sensitivity by regions: upstream100 21.4% 30.9% upstream200 18.1% 24.0% upstream400 13.6% 18.7% upstream800 11.1% 14.9% downstream200 7.9% 11.9% mrna 55.4% 62.0% utr 11.2% 19.9% utr5 19.9% 29.4% utr3 7.8% 16.2% CDS 70.3% 76.2% firstCDS 59.9% 67.8% middleCDS 73.5% 82.1% endCDS 55.7% 68.0% onlyCDS 81.9% 48.3% splice 52.4% 67.2% splice5 51.2% 67.9% splice3 53.5% 66.5% intron 2.4% 5.2% firstintron 1.5% 4.6% middleintron 2.4% 5.0% endintron 2.3% 5.5% onlyintron 20.1% 23.2% interGene 2.1% 5.2% --------------------------------------------- |
We have attempted to purge many of the false positive hits from Jim Kent's mouse track for the whole genome. To do that we have downloaded all the mouse tracks (psl files, i.e. blat output) and applied the following scheme:
This left 57% of the hits (3,381,415/5,978,049), 43% were rejected. The specificity should be 87+% (instead of 58%) - although it is not possible for us to directly measure this. The CDS sensitivity decreased in a manner consistent with our simulations.
Table II: Purged mouse track with primary filter. Statistics are for chr 22 and whole genome and are compared with secondary filter. Sensitivity in bases pair hit. |
----------------------------------------------------- secondary primary ----------------------------------------------------- chr22 whole chr22 whole genome genome ----------------------------------------------------- sensitivity by regions: upstream100 20.3% 26.7% 20.0% 25.4% upstream200 15.4% 21.4% 15.1% 20.4% upstream400 10.7% 15.9% 10.3% 15.1% upstream800 7.5% 11.4% 7.2% 10.8% downstream200 7.0% 12.0% 6.6% 11.3% mrna 56.6% 60.5% 52.8% 56.8% utr 16.1% 22.5% 15.1% 21.1% utr5 25.6% 30.6% 25.1% 28.9% utr3 13.5% 20.6% 12.4% 19.4% CDS 80.9% 82.4% 75.4% 77.4% firstCDS 76.8% 76.5% 73.3% 71.8% middleCDS 86.5% 86.3% 81.1% 81.3% endCDS 76.2% 77.0% 68.7% 72.0% onlyCDS 55.7% 64.5% 51.6% 58.9% splice 72.9% 74.1% 69.1% 69.6% splice5 73.8% 75.0% 69.8% 70.5% splice3 72.0% 73.2% 68.3% 68.8% intron 2.7% 4.0% 2.5% 3.6% firstintron 2.3% 3.8% 2.1% 3.4% middleintron 2.8% 4.2% 2.6% 3.8% endintron 3.5% 3.9% 3.4% 3.5% onlyintron 2.3% 4.1% 2.0% 3.6% ----------------------------------------------------- |
We have performed DNA Blat sensitivity and specificity analyses using 3X and 5X simulated reads for the 9 testing regions as well as contigs assembled from these reads. The coverage obtained with these contigs is 0.64X for assembly of 3X reads and 0.75X for the assembly of 5X reads (assuming non overlapping contigs). The specificity was computed using the "primary filter" to select hits. Table III shows specificity (as described above) and sensitivity (in bases pairs hit).
The sensitivity of DNA Blat is not as good as translated Blat for the CDS, but only 10% below. But sensitivity is higher in noncoding regions (e.g. UTR), which results in more CNS covered than with translated Blat.
Table III. specificity and sensitivity using dna Blat with simulated reads and contigs assembled from these simulated reads. Hit specificity in bases pair hit and computed using the primary filter (see text). ------------------------------------------------------------------------- reads contigs from 3X from 5X ------------------------------------------------------------------------- coverage 3X 5X 0.64X 0.75X ------------------------------------------------------------------------- specificity 75% 75% 83% 85% ------------------------------------------------------------------------- sensitivity: upstream100 31.2% 35.7% 24.7% 32.8% upstream200 25.5% 29.0% 21.0% 26.6% upstream400 19.5% 21.8% 17.1% 20.7% upstream800 15.1% 16.3% 13.4% 15.9% downstream200 14.6% 16.1% 12.3% 15.2% mrna 52.6% 57.0% 47.6% 56.6% utr 21.8% 23.9% 17.9% 22.2% utr5 28.7% 30.8% 20.4% 27.6% utr3 19.2% 21.3% 17.0% 20.2% CDS 63.0% 68.2% 57.6% 68.3% firstCDS 57.4% 64.1% 50.0% 62.5% middleCDS 64.7% 70.7% 59.4% 71.8% endCDS 51.8% 54.9% 48.6% 55.6% onlyCDS 73.6% 74.8% 66.6% 67.4% splice 40.3% 44.5% 35.5% 43.6% splice5 38.6% 42.9% 34.1% 40.9% splice3 41.9% 46.2% 37.0% 46.4% intron 3.4% 3.9% 2.9% 3.9% firstintron 2.5% 3.0% 2.2% 3.5% middleintron 3.3% 3.8% 2.8% 3.7% endintron 3.3% 3.3% 2.8% 3.1% onlyintron 25.4% 27.3% 23.3% 28.9% ------------------------------------------------------------------------ |
The specificity of DNA blat makes it difficult to use it as a tool to align mouse reads to the human genome, but using an appropriate filter to select hits can be used with longer contigs.
We have also merged the results obtained from the assembled contigs and the unused reads, i.e. the reads that were not included in the assembly. The sensitivities are shown in Table IV. These values, when compared with the values obtained with the contigs alone, show that the unused reads improved the sensitivies for the assembly from 3X reads and for the assembly from 5X reads. The effect is less important for the 5X assmebly, which confirms that the contigs are of better quality.
Table IV. Sensitivities (in bases pair hit)
of the contigs and the unused reads. ------------------------------------------------------- contigs from 3X contigs from 5X + unused reads + unused reads ------------------------------------------------------- upstream100 26.7% 34.8% upstream200 22.5% 28.2% upstream400 18% 21.9% upstream800 14% 16.8% downstream200 13.5% 16.0% mrna 50.8% 58.2% utr 19.7% 23.5% utr5 24.9% 30.3% utr3 17.7% 20.8% CDS 61.3% 69.9% firstCDS 55.4% 64.4% middleCDS 62.4% 73.1% endCDS 51.7% 55.8% onlyCDS 76.1% 73.8% splice 37.5% 44.4% splice5 35.5% 41.5% splice3 39.4% 47.3% intron 3.1% 4.0% firstintron 2.4% 3.6% middleintron 3% 3.9% endintron 2.8% 3.2% onlyintron 24.5% 29.1% ------------------------------------------------------- |
Note 1 the mouse track is "soft masked", i.e. the translated index is
set up on masked DNA, the reads themselves are not masked. In our tests, the
reads and contigs were masked. Submitting masked sequences generates fewer hits.
References
[1] Batzoglou Serafim, personal communication and Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. 2002. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12:177-89.
[2] Kent J. 2002. BLAT - The BLAST-Like Alignment Tool. Genome
Res. 12: 656-664