Blat analysis

Analysis of sensitivity/specificity tradeoffs for whole genome alignments using BLAT

analysis by Olivier Couronne, Inna Dubchak (LBNL) and Lior Pachter (UC Berkeley)

11/07/2001 (updated 7/15/2002)

Abstract:
This is a short summary of a study of the sensitivity and specificity of various alignment strategies for the human and mouse genomes. The tests are all based on the nine testing regions (this allowed us to simulate different levels of shotgun coverage and to verify the correctness of alignments). While these regions are not representative of the whole genome, based on controls we show that our results are applicable to whole genome alignments. Our tests are based on the BLAT (translated and untranslated) alignment tool, although again this is not a particularly restrictive assumption, in the sense that we believe that many of our conclusions should hold for other alignment methods as well.
Our main conclusions are:

An alignment method should be based on both DNA (better for noncoding regions such as UTRs) and protein (better for CDS),
The sensitivity/specificity tradeoff is a nontrivial issue that must be addressed with care. In particular, obvious heuristics could have a serious specificity problem,
By filtering hits according to slightly more sophisticated criteria it is possible to improve specificity while preserving sensitivity,
Alignments with contigs are significantly more specific than with reads,

Methods

Simulated reads (3x and 5x coverage) were generated for the nine testing regions. Contigs are also used in this testing and were assembled with the Arachne program [1] from these simulated reads (thanks to Serafim Batzoglou). Reads and contigs were masked with RepeatMasker. Alignments were computed with both DNA blat and translated BLAT [2] (against the whole genome - thanks to Jim Kent for use of the BLAT webserver, local BLAT, and of course for generating the mouse BLAT track all of which were used extensively in this study) with simulated reads and contigs at the various levels of coverage, as well as WGS reads. Hits were sorted using Jim Kent's scoring scheme and only hits with a score above 30 were kept. The score is defined by BLAT as:

score = match - mismatch - gapPenalties

This cutoff was applied by Jim Kent for the UCSC mouse track and for the statistics that have been mailed. We discuss in the next paragraphs alternatives for the selection of "hits". Again, we would like to emphasize that we tested with BLAT because of the speed and availability of the tool, and though results might be different with different aligners, the general patterns of sensitivity/specificity tradeoff we are concerned with should be similar.

I. Translated Blat's performance aligning with reads

The hit specificity is defined by:

specificity = number of correct hits / number of hits returned

where "correct" means that the hit is inside the region where the read comes from. That is, we are aligning simulated reads from the testing regions against the whole human genome and checking whether they hit in the orthologous human region to the mouse region they were generated from (so we aren't requiring exact placement within the actual region). Hits are pre-selected using a filter and there are different possible filters. We compared two types of filters where the hits are selected according to their score:

the "secondary filter" where all the hits with a score>30 are kept. This method is the one used by Jim Kent for the mouse track on the browser and the statistics previously sent,
the "primary filter" is to select the first hit (when the hits are sorted by score) as well as the hits (score>30 also) in the same neighborhood in the human genome (within 50kb of the first hit).

Fig.1: Secondary filter: all the hits with a score>30 are kept
Fig.2: Primary filter, filters out false positives

Simulated reads on the nine testing regions

The 9 testing regions represent about 5Mb and are covered by 20,367 reads (simulated with 3X coverage). Using Translated Blat 3,699 (18%) of them returned at least one hit.

Using the "secondary filter", this resulted in 13,900 hits with a score>30. Only 8,085 hits were correct, i.e. hit in the same region the reads originated from. This yields a specificity of 58% for the filter, which means that 42% of the hits are outside the regions and somewhere else on the genome.

When we used the "primary filter", we found a much better specificity of 87% (7460/8616) for the hits. In that case only 13% of the hits were placed incorrectly. 3154 reads have been correctly placed with a specificity of 89% (3154/3699).

We also examined alignments of the 16 million WGS mouse reads on the human genome, using "the secondary filter". Because of the poor specificity, there were 23,016 hits with a score > 30 on the 9 testing regions. This is in contrast to the 7,460 correct hits from the simulated 3x reads. This means that roughly 2/3 of the hits come from incorrect placements from the rest of the genome [see Note 1].

Table I shows Translated Blat sensitivities (in bases pair hit) for each of the regions using simulated 3X reads. In the first column, we use the "primary filter" to compute the specificity.

In the second column, blat results for the 9 testing regions are downloaded from UCSC genome browser, and are used to compute the sensitivities, and the specificity come from the analysis described above. As we explained the sensitivities in this column are boosted by misplaced hits from the whole genome.

One can see that translated Blat remains a very sensitive tool, and we have shown that a very high specificity, can be obtained with a tighter filter. The extra specificity does, however, come at a slight cost of sensitivity. Notice that the "only intron" statistics are not really significant in this experiment because of the bias generated by the HoxA cluster which is very highly conserved and has many such "only introns".

Table I: sensitivities (in bases pair hit) of Translated Blat using 3X simulated reads on the nine testing regions. Values are compared with the one obtained from the Mouse Track alignement.

---------------------------------------------
                    primary      from UCSC's
                                whole genome
                                 alignment
---------------------------------------------
hits specificity:      87%           58%
---------------------------------------------
sensitivity by regions:

  upstream100        21.4%         30.9%
  upstream200        18.1%         24.0%
  upstream400        13.6%         18.7%
  upstream800        11.1%         14.9%
downstream200         7.9%         11.9%
         mrna        55.4%         62.0%
          utr        11.2%         19.9%
         utr5        19.9%         29.4%
         utr3         7.8%         16.2%
          CDS        70.3%         76.2%
     firstCDS        59.9%         67.8%
    middleCDS        73.5%         82.1%
       endCDS        55.7%         68.0%
      onlyCDS        81.9%         48.3%
       splice        52.4%         67.2%
      splice5        51.2%         67.9%
      splice3        53.5%         66.5%
       intron         2.4%          5.2%
  firstintron         1.5%          4.6%
 middleintron         2.4%          5.0%
    endintron         2.3%          5.5%
   onlyintron        20.1%         23.2%
    interGene         2.1%          5.2%
---------------------------------------------

Whole genome analysis

We have attempted to purge many of the false positive hits from Jim Kent's mouse track for the whole genome. To do that we have downloaded all the mouse tracks (psl files, i.e. blat output) and applied the following scheme:

sort by read and score (match-mismatch-gapPenaties),
for each read, keep first hit and neighbors,
reject all others hits.

This left 57% of the hits (3,381,415/5,978,049), 43% were rejected. The specificity should be 87+% (instead of 58%) - although it is not possible for us to directly measure this. The CDS sensitivity decreased in a manner consistent with our simulations.

Table II: Purged mouse track with primary filter. Statistics are for chr 22 and whole genome and are compared with secondary filter. Sensitivity in bases pair hit.

-----------------------------------------------------
                    secondary            primary     
-----------------------------------------------------
                  chr22   whole       chr22   whole  
                          genome              genome 
-----------------------------------------------------
sensitivity by regions:                              

   upstream100    20.3%   26.7%        20.0%   25.4%
   upstream200    15.4%   21.4%        15.1%   20.4%
   upstream400    10.7%   15.9%        10.3%   15.1%
   upstream800     7.5%   11.4%         7.2%   10.8%
 downstream200     7.0%   12.0%         6.6%   11.3%
          mrna    56.6%   60.5%        52.8%   56.8%
           utr    16.1%   22.5%        15.1%   21.1%
          utr5    25.6%   30.6%        25.1%   28.9%
          utr3    13.5%   20.6%        12.4%   19.4%
           CDS    80.9%   82.4%        75.4%   77.4%
      firstCDS    76.8%   76.5%        73.3%   71.8%
     middleCDS    86.5%   86.3%        81.1%   81.3%
        endCDS    76.2%   77.0%        68.7%   72.0%
       onlyCDS    55.7%   64.5%        51.6%   58.9%
        splice    72.9%   74.1%        69.1%   69.6%
       splice5    73.8%   75.0%        69.8%   70.5%
       splice3    72.0%   73.2%        68.3%   68.8%
        intron     2.7%    4.0%         2.5%    3.6%
   firstintron     2.3%    3.8%         2.1%    3.4%
  middleintron     2.8%    4.2%         2.6%    3.8%
     endintron     3.5%    3.9%         3.4%    3.5%
    onlyintron     2.3%    4.1%         2.0%    3.6%
-----------------------------------------------------

II. DNA Blat specificity/sensitivity and Arachne assembly

We have performed DNA Blat sensitivity and specificity analyses using 3X and 5X simulated reads for the 9 testing regions as well as contigs assembled from these reads. The coverage obtained with these contigs is 0.64X for assembly of 3X reads and 0.75X for the assembly of 5X reads (assuming non overlapping contigs). The specificity was computed using the "primary filter" to select hits. Table III shows specificity (as described above) and sensitivity (in bases pairs hit).

The sensitivity of DNA Blat is not as good as translated Blat for the CDS, but only 10% below. But sensitivity is higher in noncoding regions (e.g. UTR), which results in more CNS covered than with translated Blat.

By slicing a region randomly, we can simulate a perfect assembly. In that case it is possible to show that the specificity should increase with the length of the contig. This effect is seen for the contigs coming from 3X reads and from 5X (see figure 3).

Fig.3: specifity of Blat versus length of the query. cnX: contig assembled from nX reads using the Arachne assembler.
The sensitivities for the assemblies must be analyzed in the context of the coverage provided by the contigs: 0.64X for contigs from 3X reads and 0.75X for contigs from 5X reads. This means that the sensitivities are not that bad, and will improve with higher coverage. But the coverage is too low right now to get a significant increase of sensitivity.

Table III. specificity and sensitivity using dna Blat with simulated reads and contigs assembled from these simulated reads. Hit specificity in bases pair hit and computed using the “primary filter” (see text).

-------------------------------------------------------------------------
                           reads                       contigs
                                                  from 3X   from 5X
-------------------------------------------------------------------------
    coverage              3X         5X              0.64X     0.75X               
-------------------------------------------------------------------------
specificity            75%           75%            83%          85%             
------------------------------------------------------------------------- 
sensitivity:                                                         

  upstream100         31.2%        35.7%           24.7%        32.8%
  upstream200         25.5%        29.0%           21.0%        26.6%
  upstream400         19.5%        21.8%           17.1%        20.7%
  upstream800         15.1%        16.3%           13.4%        15.9%
downstream200         14.6%        16.1%           12.3%        15.2%          
         mrna         52.6%        57.0%           47.6%        56.6%          
          utr         21.8%        23.9%           17.9%        22.2%          
         utr5         28.7%        30.8%           20.4%        27.6%
         utr3         19.2%        21.3%           17.0%        20.2%
          CDS         63.0%        68.2%           57.6%        68.3%          
     firstCDS         57.4%        64.1%           50.0%        62.5%          
    middleCDS         64.7%        70.7%           59.4%        71.8%          
       endCDS         51.8%        54.9%           48.6%        55.6%          
      onlyCDS         73.6%        74.8%           66.6%        67.4%
       splice         40.3%        44.5%           35.5%        43.6%
      splice5         38.6%        42.9%           34.1%        40.9%          
      splice3         41.9%        46.2%           37.0%        46.4%           
       intron          3.4%         3.9%            2.9%         3.9%          
  firstintron          2.5%         3.0%            2.2%         3.5%
 middleintron          3.3%         3.8%            2.8%         3.7%
    endintron          3.3%         3.3%            2.8%         3.1%
   onlyintron         25.4%        27.3%           23.3%        28.9%
------------------------------------------------------------------------

The specificity of DNA blat makes it difficult to use it as a tool to align mouse reads to the human genome, but using an appropriate filter to select hits can be used with longer contigs.

We have also merged the results obtained from the assembled contigs and the unused reads, i.e. the reads that were not included in the assembly. The sensitivities are shown in Table IV. These values, when compared with the values obtained with the contigs alone, show that the unused reads improved the sensitivies for the assembly from 3X reads and for the assembly from 5X reads. The effect is less important for the 5X assmebly, which confirms that the contigs are of better quality.

Table IV. Sensitivities (in bases pair hit) of the contigs and the unused reads.

-------------------------------------------------------
                     contigs from 3X   contigs from 5X
                     + unused reads    + unused reads
-------------------------------------------------------

  upstream100            26.7%           34.8%
  upstream200            22.5%           28.2%
  upstream400              18%           21.9%
  upstream800              14%           16.8%
downstream200            13.5%           16.0%
         mrna            50.8%           58.2%
          utr            19.7%           23.5%
         utr5            24.9%           30.3%
         utr3            17.7%           20.8%
          CDS            61.3%           69.9%
     firstCDS            55.4%           64.4%
    middleCDS            62.4%           73.1%
       endCDS            51.7%           55.8%
      onlyCDS            76.1%           73.8%
       splice            37.5%           44.4%
      splice5            35.5%           41.5%
      splice3            39.4%           47.3%
       intron             3.1%            4.0%
  firstintron             2.4%            3.6%
 middleintron               3%            3.9%
    endintron             2.8%            3.2%
   onlyintron            24.5%           29.1%
-------------------------------------------------------

Note 1 the mouse track is "soft masked", i.e. the translated index is set up on masked DNA, the reads themselves are not masked. In our tests, the reads and contigs were masked. Submitting masked sequences generates fewer hits.

References

[1] Batzoglou Serafim, personal communication and Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. 2002. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12:177-89.

[2] Kent J. 2002. BLAT - The BLAST-Like Alignment Tool. Genome Res. 12: 656-664