cshl

Functional Genomics Laboratory: Gingeras Group

STAR

Spliced Transcripts Alignment and Reconstruction


Introduction

Recent advances in sequencing technology have made it an attractive tool for studying the transcriptome at single-nucleotide resolution. The analysis of the tens of millions of relatively short (36-mer) and medium (100-mer) reads produced by sequencing cellular transcriptomes, which comprise spliced and unspliced RNAs, presents unique challenges. Two main tasks make these analyses extremely computationally intensive. The first is the accurate alignment of increasingly long reads that contain growing numbers of mismatches caused by sequencing errors, SNPs, RNA editing, etc. The second is the mapping of reads derived from non-contiguous regions of the genome, for example reads spanning splice junctions or chimeric RNAs.

While the first task is shared with DNA re-sequencing efforts, the second is specific and crucial to RNA sequencing, since it provides the connectivity information that allows the reconstruction of RNA molecules.

Various sequence alignment algorithms have been applied to these problems, including existing general-purpose approaches (e.g., BLAT) and more recent tools designed for high-throughput sequencing reads (e.g., Bowtie/Tophat). The latter strategies rely on previously annotated splice junctions, on the sequence characteristics of annotated junctions (e.g., canonical intron motifs, maximum exon/intron size, etc.), and/or on the construction of a reference database of junction sites. In this paper we describe STAR, the Spliced Transcripts Alignment and Reconstruction tool. STAR’s novel alignment strategy for de novo detection of splice junctions does not require any previous knowledge of the junctions’ loci and does not use a priori properties of the junctions. This unbiased splice junction mapping is imperative for the discovery of novel (un-annotated, non-canonical) splice junctions and isoforms, as well as other increasingly important RNA species such as inter-chromosomal chimeric RNAs. The key algorithm of our alignment strategy is the search for the maximum mappable length of a read, implemented as a speed-efficient suffix array search. Another important novelty of our approach is the split/search/extend algorithm driven by the sequencing quality scores, which allows confident alignment of reads containing a large number of sequencing errors. STAR’s penalty system is highly user-configurable and assigns probabilistic meaning to the alignment scores.
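The idea of searching for the maximum mappable length of a read in a suffix array can be illustrated with a toy example. The sketch below is not STAR's implementation: the function names, the naive suffix array construction and the exact-match-only search are all assumptions made for illustration, and it ignores mismatches, quality scores and the extension step.

```python
def build_suffix_array(genome: str) -> list:
    """Naive suffix array: start positions of all suffixes in sorted order.
    Fine for a toy genome; real genomes need a linear-time construction."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def _char_at(genome, sa, i, depth):
    # Character `depth` bases into the i-th sorted suffix ("" past the end,
    # which sorts before every real base).
    pos = sa[i] + depth
    return genome[pos] if pos < len(genome) else ""


def _bound(genome, sa, lo, hi, depth, base, upper):
    # Binary search inside [lo, hi) for the first suffix whose character at
    # `depth` is >= base (lower bound) or > base (upper bound).
    while lo < hi:
        mid = (lo + hi) // 2
        ch = _char_at(genome, sa, mid, depth)
        if ch < base or (upper and ch == base):
            lo = mid + 1
        else:
            hi = mid
    return lo


def max_mappable_prefix(read, genome, sa):
    """Return (length, positions) of the longest read prefix that occurs
    exactly in the genome, with all genome positions where it occurs."""
    lo, hi, depth = 0, len(sa), 0
    while depth < len(read):
        new_lo = _bound(genome, sa, lo, hi, depth, read[depth], upper=False)
        new_hi = _bound(genome, sa, new_lo, hi, depth, read[depth], upper=True)
        if new_lo == new_hi:  # no suffix continues with this base
            break
        lo, hi, depth = new_lo, new_hi, depth + 1
    positions = sorted(sa[lo:hi]) if depth > 0 else []
    return depth, positions


def split_read(read, genome, sa):
    """Hypothetical 'split' step: cut the read into consecutive maximal
    mappable segments, returned as (read_offset, length, genome_positions)."""
    segments, start = [], 0
    while start < len(read):
        length, positions = max_mappable_prefix(read[start:], genome, sa)
        if length == 0:  # base absent from the genome (e.g. N): skip it
            start += 1
            continue
        segments.append((start, length, positions))
        start += length
    return segments


genome = "AAACCCGGGTTTACGT"
sa = build_suffix_array(genome)
# A read joining two non-contiguous genome regions, as a spliced read would.
print(split_read("CCCGGGACGT", genome, sa))
```

In this toy case the read splits into two segments that map to distant genome positions, and the gap between them is what a spliced aligner would interpret as a candidate junction. No annotation or motif information is used to find the split point, which is the property the text emphasizes.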

Other important advantages of STAR are briefly described below. STAR is capable of finding the multiple loci to which a read can be mapped, providing estimates of the relative probabilities of these alignments. STAR can align reads of any length, working accurately and efficiently for both long and short RNA molecules. STAR can align reads containing any number of splice junctions, indels and/or mismatches, which is important as read length continues to increase rapidly with advances in sequencing technology. STAR can deal with arbitrarily large intron lengths (important, for example, for distal exons and chimeric RNAs), as well as with extremely short exons (micro-exons). STAR performs “auto”-trimming of poor-quality read ends, which are a common occurrence as read length is pushed to the limit. STAR can detect non-templated poly-A tails, thus providing a means to determine the transcription termination site for poly-A+ mRNAs. STAR handles paired-end sequencing in a straightforward way by incorporating the paired-end information directly into the mapping process.

Although the STAR algorithm is heuristic and non-exhaustive (i.e., it does not find all possible alignments, unlike, for example, the Smith-Waterman local search), we showed that it can recover almost all highly probable alignments for almost all reads. STAR is very fast: on a modern but not overly expensive server it can align 75 million 76-mer reads per hour to the human genome, including splice junctions, multiple mappers and numerous (~7) mismatches.

In the following example we use STAR to map 270M 76-mer reads from total RNA samples of a human cell line (K562; ENCODE, Carrie Davis). The table below compares STAR’s results with those of an exhaustive mapping with up to 2 mismatches allowed. For these libraries STAR uniquely maps about twice as many reads as the full-length (76 nt) exhaustive search, and about 50% more reads than the exhaustive search with the reads trimmed to 50 nt, illustrating STAR’s ability to auto-trim poor-quality tails.

Mapping                         Uniquely mapped reads   Mean mapped length (bases)
0-2 mismatches (MM)             51M                     76
0-2 mismatches & trim to 50     72M                     50
STAR                            106M                    64.8
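The ratios quoted for this comparison can be checked with a few lines of arithmetic. The snippet below only restates the table values; the row labels and helper names are hypothetical, and the total of uniquely mapped bases (reads times mean mapped length) is a derived quantity that the original text does not report.

```python
# (uniquely mapped reads, mean mapped length in bases), taken from the table
RESULTS = {
    "exhaustive_76nt": (51_000_000, 76.0),
    "exhaustive_trim50": (72_000_000, 50.0),
    "STAR": (106_000_000, 64.8),
}


def read_ratio(method: str, baseline: str) -> float:
    """Ratio of uniquely mapped read counts between two mappings."""
    return RESULTS[method][0] / RESULTS[baseline][0]


def mapped_bases(method: str) -> float:
    """Derived estimate: uniquely mapped reads times mean mapped length."""
    reads, mean_length = RESULTS[method]
    return reads * mean_length


for baseline in ("exhaustive_76nt", "exhaustive_trim50"):
    print(f"STAR vs {baseline}: {read_ratio('STAR', baseline):.2f}x reads")
for method in RESULTS:
    print(f"{method}: {mapped_bases(method) / 1e9:.2f} Gb uniquely mapped")
```

The read ratios come out near 2.08 and 1.47, consistent with "about twice as many" and "about 50% more".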

We found 3.75M reads that cross splice junctions; of these, 96% cross canonical junctions (GT/AG introns) and 4% cross non-canonical junctions. We found 87k unique junctions in the genome, of which 90% are annotated canonical junctions, 7% are un-annotated canonical junctions, and 3% are non-canonical, mostly un-annotated, junctions. This illustrates STAR’s ability to correctly detect annotated junctions, as well as to predict a large number of un-annotated canonical and non-canonical sites.
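To make the canonical/non-canonical distinction concrete, the sketch below classifies an intron by its terminal dinucleotides and converts the quoted percentages into approximate counts. It is an illustration only: the function names and the 0-based inclusive coordinate convention are assumptions, and this is not how STAR itself reports junctions. A GT/AG intron transcribed from the reverse strand appears as CT/AC on the forward strand, so both forms are treated as canonical here.

```python
# Forward-strand dinucleotides (first two, last two intron bases) counted
# as canonical: GT/AG, and CT/AC for a GT/AG intron on the reverse strand.
CANONICAL_MOTIFS = {("GT", "AG"), ("CT", "AC")}


def intron_motif(genome: str, first: int, last: int):
    """Terminal dinucleotides of an intron whose first and last bases are
    at 0-based inclusive positions `first` and `last` (assumed convention)."""
    return genome[first:first + 2].upper(), genome[last - 1:last + 1].upper()


def is_canonical(genome: str, first: int, last: int) -> bool:
    return intron_motif(genome, first, last) in CANONICAL_MOTIFS


def share_to_count(total: int, percent: int) -> int:
    """Approximate count implied by a rounded percentage of a rounded total."""
    return total * percent // 100


toy = "AAAGTCCCCAGTTT"
print(intron_motif(toy, 3, 10), is_canonical(toy, 3, 10))
print(intron_motif(toy, 5, 8), is_canonical(toy, 5, 8))

# Approximate counts implied by the figures quoted in the text.
print("canonical junction reads:", share_to_count(3_750_000, 96))
print("annotated canonical junctions:", share_to_count(87_000, 90))
print("un-annotated canonical junctions:", share_to_count(87_000, 7))
print("non-canonical junctions:", share_to_count(87_000, 3))
```

Because the source figures are rounded (3.75M, 87k, whole percentages), the resulting counts are order-of-magnitude estimates rather than exact tallies.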




Cold Spring Harbor Laboratory, Genome Research Center © All rights reserved