The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.
This is what I sent them:
The reconstruction of a transcriptome from the
short reads generated by RNA-Seq techniques presents
many challenges, particularly in the absence of an existing reference genome
with which to compare the reads. Challenges include: (i) uneven coverage of the
various transcripts; (ii) uneven coverage inside each transcript; (iii)
sequencing errors in highly expressed transcripts; (iv) transcripts encoded by
adjacent loci can overlap and thus can be erroneously fused to form a chimeric transcript; (v)
data structures need to accommodate multiple transcripts per locus, owing to
alternative splicing; and (vi) sequences that are repeated in different genes
introduce ambiguity. The Trinity pipeline leverages several properties of
transcriptomes in its assembly procedure: it uses transcript expression to
guide the initial transcript assembly procedure in a strand-specific manner, it
partitions RNA-Seq reads into sets of disjoint transcriptional loci, and it traverses
each of the transcript graphs systematically to explore the sets of transcript
sequences that best represent variants resulting from alternative splicing or
gene duplication by exploiting pairs of RNA-Seq reads. The series of steps
performed by the pipeline correctly reconstructs a significant percentage of
the transcripts without relying on the existence of a reference genome.
A major data structure used by the pipeline is the de Bruijn graph. A de Bruijn
graph places each k-mer in a node and connects two nodes when the last k-1
bases of one k-mer match the first k-1 bases of the other. While this is an
efficient structure for representing heavily overlapping sequences, using these
graphs poses several challenges: (i) efficiently
constructing this graph from large amounts (billions of base pairs) of raw
data; (ii) defining a suitable scoring and enumeration algorithm to recover all
plausible splice forms and paralogous transcripts; and (iii) providing
robustness to the noise stemming from sequencing errors and other artifacts in
the data. Sequencing errors would introduce false nodes to the graph,
potentially resulting in a great deal of wasted memory.
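To make the data structure concrete, here is a minimal Python sketch of building a de Bruijn graph from a handful of reads. It is an illustration only, not Trinity's implementation: the function name `build_de_bruijn`, the toy reads, and the choice of k = 4 are all assumptions made for the example.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Return (edges, counts) for a toy de Bruijn graph.

    counts maps each k-mer to how often it occurs in the reads.
    edges maps each k-mer to the set of k-mers whose first k-1 bases
    equal its last k-1 bases (hypothetical helper, not Trinity code).
    """
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    edges = defaultdict(set)
    for kmer in counts:
        for base in "ACGT":
            successor = kmer[1:] + base  # shift by one base
            if successor in counts:      # membership test does not insert
                edges[kmer].add(successor)
    return dict(edges), dict(counts)

# Toy input: two overlapping reads (made up for illustration).
reads = ["ACGTAC", "CGTACG"]
edges, counts = build_de_bruijn(reads, k=4)
for kmer in sorted(edges):
    print(kmer, counts[kmer], sorted(edges[kmer]))
```

Note that nodes are linked whenever their k-mers overlap, even if no single read contains both, which is why the toy graph above contains a cycle (TACG links back to ACGT). A single miscalled base in a read would add up to k new low-count nodes to `counts`, which is the memory cost described above.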
The Trinity pipeline consists of
the following steps: it first analyzes the short reads to create a dictionary
of all sequences of length 25 (k-mers with k = 25) in the reads, indexing the locations where each
sequence may be found. After removing likely errors, the unique k-mers are recombined,
starting with the most frequently occurring sequences and extending the
combination until no more k-mers can be matched. Each contig is then added to a
cluster based on potential alternatively spliced transcripts or otherwise unique
portions of paralogous genes. Then, a de Bruijn graph is generated from each
cluster with the weight of each edge assigned from the number of k-mers in the
original read set that support the connection. In the final phase, a
merge-and-prune operation is performed on each graph for error correction,
followed by an enumeration of potential paths through the graph, with greater
likelihood placed on paths with greater read support.
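The first two steps (k-mer dictionary, then greedy extension from the most abundant k-mer) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline's actual code: the names `count_kmers`, `greedy_contig`, and `assemble` are hypothetical, the `min_count` threshold is a stand-in for the paper's error removal, and the sketch uses k = 4 on toy reads and ignores strand and read pairing.

```python
from collections import Counter

def count_kmers(reads, k):
    """Dictionary of every k-mer in the reads and its abundance."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def greedy_contig(counts, used):
    """Seed with the most abundant unused k-mer, then extend right and left,
    always taking the most abundant unused overlapping k-mer."""
    unused = [km for km in counts if km not in used]
    if not unused:
        return None
    seed = max(unused, key=counts.get)
    used.add(seed)
    contig, k = seed, len(seed)

    for direction in ("right", "left"):
        while True:
            if direction == "right":
                overlap = contig[-(k - 1):]
                options = [overlap + b for b in "ACGT"]
            else:
                overlap = contig[:k - 1]
                options = [b + overlap for b in "ACGT"]
            options = [km for km in options if km in counts and km not in used]
            if not options:
                break  # no more k-mers can be matched on this side
            best = max(options, key=counts.get)
            used.add(best)
            contig = contig + best[-1] if direction == "right" else best[0] + contig
    return contig

def assemble(reads, k, min_count=1):
    """Build contigs greedily; k-mers seen fewer than min_count times are
    dropped first (an assumed, simplistic error filter)."""
    counts = {km: c for km, c in count_kmers(reads, k).items() if c >= min_count}
    used, contigs = set(), []
    while True:
        contig = greedy_contig(counts, used)
        if contig is None:
            return contigs
        contigs.append(contig)

# Toy reads, made up for illustration.
reads = ["GATTACAGG", "TTACAGGCT", "GATTACAGG"]
print(assemble(reads, k=4))
print(assemble(reads, k=4, min_count=2))
```

With `min_count=2`, the k-mers seen only once are discarded before extension, so the contig stops earlier. In real data that trade-off matters: a threshold that removes sequencing errors will also remove genuinely rare transcripts, so the cutoff here should be read as a placeholder rather than a recommendation.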
The authors built transcriptomes
from both original data and reference sets, having a great deal of success in
either case.