First: Principles: 2014

Wednesday, July 16, 2014

Trinity RNA-Seq

In January a site went live called readingroom.info. I suspect due to the timing and the subject matter it was a student project. The idea was to write summaries - no more than 500 words - of scientific papers and allow people to comment on and discuss them. I thought it was a neat idea. They had some ideas for incentivizing writers and so forth, but I didn't have time to contribute anything until this summer, by which time the authors had apparently lost interest. I sent in a summary of a paper, but after several weeks it had not been approved by the moderators. Maybe it just wasn't that good!

The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.

This is what I sent them:

The reconstruction of a transcriptome from the short reads generated by RNA-Seq techniques presents many challenges, particularly in the absence of an existing reference genome with which to compare the reads. Challenges include: uneven coverage of the various transcripts; (ii) uneven coverage inside each transcript; (iii) sequencing errors in highly expressed transcripts; (iv) transcripts encoded by adjacent loci can overlap and thus can be erroneously fused to form a chimeric transcript; (v) data structures need to accommodate multiple transcripts per locus, owing to alternative splicing; and (vi) sequences that are repeated in different genes introduce ambiguity. The Trinity pipeline leverages several properties of transcriptomes in its assembly procedure: it uses transcript expression to guide the initial transcript assembly procedure in a strand-specific manner, it partitions RNA-Seq reads into sets of disjoint transcriptional loci, and it traverses each of the transcript graphs systematically to explore the sets of transcript sequences that best represent variants resulting from alternative splicing or gene duplication by exploiting pairs of RNA-Seq reads. The series of steps performed by the pipeline correctly reconstructs a significant percentage of the transcripts without relying on the existence of a reference genome.

A major data structure used the pipeline is the de Bruijn graph. A de Bruijn graph places each k-mer in a node, and has connected nodes if the k-mers are identical in all but the first or last position. While an efficient structure for representing heavily overlapping sequences, there are challenges in the usage of these graphs: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. Sequencing errors would introduce false nodes to the graph, potentially resulting in a great deal of wasted memory.

The Trinity pipeline consists of the following steps: it first analyzes the short reads to create a dictionary of all sequences of length 25 in the reads, indexing the locations where each sequence may be found. After removing likely errors, the unique k-mers are recombined, starting with the most frequently occurring sequences and extending the combination until no more k-mers can be matched. Each contig is then added to a cluster based on potential alternative spliced transcripts or otherwise unique portions of paralogous genes. Then, a de Bruijn graph is generated from each cluster with the weight of each edge assigned from the number of k-mers in the original read set that support the connection. In the final phase, a merge-and-prune operation on each graph, for error correction, is performed, followed by an enumeration of potential paths through the graph with a greater likelihood placed on paths with greater read support.

The authors built transcriptomes from both original data and reference sets, having a great deal of success in either case.

Tuesday, July 08, 2014

The Slim protocol

I wrote earlier about Fitnesse and Slim style integration testing.

There are only a few commands associated with Slim tables, but the precise meaning of the command is left to the Slim server. For example, according to the Slim documentation, the Import instruction:

causes the <path> to be added to the search path for fixtures. In java <path> gets added to the CLASSPATH. In .NET, the <path> would name a dll.

This explains why import tests are always green. You can pass them any random string and it's just added to the path. In WaferSlim, the concept is extended slightly: if the string has a slash or backslash in it, it's considered a path. Otherwise, it's considered a file and is added to the list of files searched for classes. Let's take an example. Create a file called fixtures.py with the following code:

class SomeDecisionTable:
def setInput(self, x):
self.x = str(x)

def getOutput(self):
return int(self.x) + 1

There's an odd Unicode issue that I think requires context variables to be saved as strings. I don't know if Fitnesse, Slim or Waferslim is responsible for that. At any rate, it's simple enough: set the input to an integer and the output will be the integer plus one.

Save that file. Now the Import table will take two lines, one with the path, one with the file:

!|import |

|/path/to/fixtures|

|fixtures|

Now WaferSlim knows where to find our code, so let's take a look at another table.

Here's what the Slim protocol tells us: it can send Make instructions and Call instructions. A Make instruction tells the server to create an object; a Call instruction tells it to call a method on that object. I'd hoped to go through the Fitnesse code to determine exactly how that works, but I didn't. We'll take it on faith that the first line of a table sends a Make instruction and further lines send Call instructions. So, in order to make calls into our object, we write:

!|SomeDecisionTable|
|input|get output?|
|1 |2 |

|10 |11 |

From the first line of the table, Fitnesse sends a Make: SomeDecisionTable command to the Slim server. Because of our Import statements, the server searches the /path/to/fixtures directory for the fixtures.py file, finds the SomeDecisionTable class, and instantiates it.

The second line of the table tells Fitnesse what Call statements to make. It will call setInput and then getOutput. Each further line of the table gives arguments for the input and output. So the sequence of events from the Slim server's perspective is as follows:

Make: SomeDecisionTable
Call: setInput(1)
Call: getOutput()
Call: setInput(10)
Call: getOutput()

Click the test button now and everything should run green.

Sunday, July 06, 2014

Fitnesse and Slim

Fitnesse is a rather nice testing tool. It's been around since what seems like the beginning of the integration testing movement. It's based around HTML tables that are translated into application code. The wiring that translates tables into code are called fixtures. Since Fitnesse was written in Java, it was mostly useful for testing Java code, although many translations and tricks were devised to allow tests of other languages. Fitnesse also includes a wiki to make the process of creating the test tables easier.

A few years after Fitnesse was introduced, an alternate translation tool, SLIM, was created. The idea behind Slim was to allow testers to implement tests in their language of choice, with the communication between Fitnesse and the tests taking place over a socket. This allows any application that implements the correct protocol to run tests written in Fitnesse tables.

Several applications were created to support the protocol in various languages, including RubySlim for Ruby and WaferSlim for Python. I had occasion to write a set of integration tests recently, so I thought it would be simple to set up a Slim server to get the tests running. Turned out, I was wrong.

For my tests, I wanted to set up the simplest thing that could possibly work. So what is the minimum requirement for a Slim test page? If you look at the RubySlim documentation you end up with a page that looks something like this:

!define TEST_SYSTEM {slim}
!define TEST_RUNNER {rubyslim}
!define COMMAND_PATTERN {rubyslim}
!path your/ruby/fixtures

!|import|
|<ruby module of fixtures>|

|SomeDecisionTable|
|input|get output?|
|1 |2 |

What all these things do is not clearly explained. Click the test button; you get a bunch of cryptic error messages. So what do they mean?

Well, the first line is pretty clear: if you want to use Slim, you need to set the test system. For the rest of it, we pretty much need to go directly to the source code. It turns out that the heart of Slim is in the Java class ProcessBuilder. In the CommandRunner class of Fitnesse we find:

ProcessBuilder processBuilder = new ProcessBuilder(command);

The command argument is no more than the COMMAND_PATTERN variable defined on the page, with a single argument appended, a port number. So if you wanted to use WaferSlim (a Python Slim server), you might say:

!define COMMAND PATTERN {python /home/benfulton/slim-init.py}

(Since we need to understand how Slim works, let's get the WaferSlim source from Github rather than installing it via Pip. The slim-init source is from this gist.)

So, after downloading WaferSlim and the code from the gist, we should be able to get the Simplest Possible Page to work. Here it is:

!define TEST_SYSTEM {slim}

!define COMMAND_PATTERN {python /home/benfulton/slim-init.py }

!|import |

|RubySlim|

|app|

Click the test button, and you should get a green test.

If you don't, you may have a Slim versioning error. Go to protocol.py and change the version number from 0.1 to 0.3. Or Python may not be in your path, or the init-script may not be found. Or - here's a tricky one - you put a space before the python command. If you did that, the first command that gets passed to the ProcessBuilder class will be an empty string, and it won't run. Still, this will run green without too much effort.

But wait! What's the use of this import statement that went green before we wrote anything?