Wednesday, July 16, 2014

Trinity RNA-Seq

In January a site went live; I suspect, due to the timing and the subject matter, that it was a student project. The idea was to write summaries - no more than 500 words - of scientific papers and allow people to comment on and discuss them. I thought it was a neat idea. They had some ideas for incentivizing writers and so forth, but I didn't have time to contribute anything until this summer, by which time the authors had apparently lost interest. I sent in a summary of a paper, but after several weeks it had not been approved by the moderators. Maybe it just wasn't that good!

The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data

This is what I sent them:

The reconstruction of a transcriptome from the short reads generated by RNA-Seq techniques presents many challenges, particularly in the absence of an existing reference genome with which to compare the reads. Challenges include: (i) uneven coverage of the various transcripts; (ii) uneven coverage inside each transcript; (iii) sequencing errors in highly expressed transcripts; (iv) transcripts encoded by adjacent loci that can overlap and thus be erroneously fused into a chimeric transcript; (v) data structures that must accommodate multiple transcripts per locus, owing to alternative splicing; and (vi) sequences that are repeated in different genes and so introduce ambiguity. The Trinity pipeline leverages several properties of transcriptomes in its assembly procedure: it uses transcript expression to guide the initial assembly in a strand-specific manner; it partitions RNA-Seq reads into sets of disjoint transcriptional loci; and it traverses each transcript graph systematically, exploiting pairs of RNA-Seq reads to explore the sets of transcript sequences that best represent variants resulting from alternative splicing or gene duplication. This series of steps correctly reconstructs a significant percentage of the transcripts without relying on an existing reference genome.

A major data structure used by the pipeline is the de Bruijn graph. A de Bruijn graph places each k-mer in a node and connects two nodes with a directed edge when the k-mers overlap in all but the first or last position, that is, when the (k-1)-length suffix of one matches the (k-1)-length prefix of the other. While this is an efficient structure for representing heavily overlapping sequences, using it raises challenges of its own: (i) efficiently constructing the graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. Sequencing errors in particular introduce false nodes into the graph, potentially wasting a great deal of memory.
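To make the structure concrete, here is a minimal Python sketch (mine, not the paper's code) of building a de Bruijn graph from a set of reads. The function name and the choice of k are my own; Trinity's actual implementation is far more memory-conscious than this:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are k-mers, and a directed edge
    links two k-mers when they overlap in all but one position (the
    (k-1)-suffix of the first equals the (k-1)-prefix of the second)."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]           # current k-mer
            right = read[i + 1:i + 1 + k]  # next k-mer, shifted by one base
            edges[left].add(right)
    return edges

graph = de_bruijn_graph(["ACGTACG"], 3)
# e.g. node "ACG" has an edge to "CGT", "CGT" to "GTA", and so on
```

A sequencing error anywhere in a read changes k consecutive k-mers, which is why error correction before and after graph construction matters so much: each miscalled base can spawn a whole bubble of spurious nodes.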

The Trinity pipeline consists of the following steps. It first analyzes the short reads to create a dictionary of all sequences of length 25 (k-mers) in the reads, indexing the locations where each sequence occurs. After likely errors are removed, the unique k-mers are recombined, starting with the most frequently occurring sequences and extending each combination until no more k-mers can be matched. Each resulting contig is then added to a cluster based on potential alternatively spliced transcripts or otherwise unique portions of paralogous genes. Next, a de Bruijn graph is generated from each cluster, with the weight of each edge set to the number of k-mers in the original read set that support the connection. In the final phase, a merge-and-prune operation is performed on each graph for error correction, followed by an enumeration of potential paths through the graph, with greater likelihood placed on paths with greater read support.
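The first two steps - counting k-mers and greedily extending from the most frequent one - can be sketched in a few lines of Python. This is a toy illustration of the seed-and-extend idea, not Trinity's actual Inchworm code; the function names and the small k used in the example are my own (the paper uses k = 25):

```python
from collections import Counter

def kmer_counts(reads, k=25):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greedy_contig(counts, k=25):
    """Seed with the most frequent k-mer, then repeatedly extend to the
    right with the best-supported overlapping k-mer until none remains."""
    if not counts:
        return ""
    seed = max(counts, key=counts.get)
    contig = seed
    used = {seed}
    while True:
        suffix = contig[-(k - 1):]
        # The four possible k-mers that extend the contig by one base.
        candidates = [suffix + base for base in "ACGT"]
        candidates = [c for c in candidates
                      if counts.get(c, 0) > 0 and c not in used]
        if not candidates:
            return contig
        best = max(candidates, key=counts.get)
        used.add(best)
        contig += best[-1]

counts = kmer_counts(["ACGTACGTT"], k=5)
contig = greedy_contig(counts, k=5)  # reassembles "ACGTACGTT"
```

The real pipeline also extends leftward, handles ties and repeats with much more care, and works at a scale where the k-mer dictionary alone can consume many gigabytes of memory.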

The authors built transcriptomes both from their own data and from reference data sets, with considerable success in both cases.