First: Principles

Thursday, February 13, 2020

Take a newick tree and stre-e-etch the leaves to make it ultrametric

I needed to generate some ultrametric Newick trees for some simulations. There's a nice Newick tree generator online at http://trex.uqam.ca/index.php?action=randomtreegenerator&project=trex, but it doesn't have any way to make an ultrametric tree. (Ultrametric means that the lengths from root to tip are all the same.) So I put together a short script using BioPython that lengthens each leaf branch so that every root-to-tip length matches the longest one.

https://gist.github.com/benfulton/a10583c8ae41897489d415c066c7579f
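For a sense of what such a script involves, here is a minimal sketch of the stretching idea. It is not the code from the gist: the function name stretch_to_ultrametric and the example tree are made up for illustration, and it assumes BioPython's Bio.Phylo module is installed.

```python
from io import StringIO

from Bio import Phylo


def stretch_to_ultrametric(tree):
    """Lengthen each leaf branch so every root-to-tip distance equals the longest one."""
    depths = tree.depths()  # maps every clade to its distance from the root
    leaves = tree.get_terminals()
    max_depth = max(depths[leaf] for leaf in leaves)
    for leaf in leaves:
        # Only the final (leaf) branch is changed; internal branches stay as they were.
        leaf.branch_length = (leaf.branch_length or 0.0) + (max_depth - depths[leaf])
    return tree


# Hypothetical example tree: tips A, B and C sit at depths 2.0, 3.0 and 1.5.
tree = Phylo.read(StringIO("((A:1.0,B:2.0):1.0,C:1.5);"), "newick")
stretch_to_ultrametric(tree)
new_depths = tree.depths()
tip_depths = {leaf.name: new_depths[leaf] for leaf in tree.get_terminals()}
```

After the call, every entry in tip_depths is the same value, and Phylo.write(tree, "out.nwk", "newick") would save the result (the file name is just a placeholder).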

How to delete empty rows in an Excel spreadsheet

How to delete empty rows in an Excel spreadsheet:

https://www.itsupportguides.com/knowledge-base/office-2016/excel-2016-how-to-delete-empty-rows/

Monday, August 10, 2015

How Installing a Random Tool Caused Ruby on Rails to Stop Working

At work I maintain a little Ruby on Rails site. Not a particularly interesting site, just a set of pages of statistics gathered from various databases around the organization. It runs on a RHEL6 virtual machine and I write most of the code on my Windows laptop using RubyMine from JetBrains - a product I wholeheartedly endorse, by the way.

It does take a little patience to run Rails on Windows - not due to any deficit in either product, just because the vast majority of people who work with Rails are running it on some other platform, so when you have problems, a lot of times you're on your own.

Take this problem I ran into recently, for example. I launched RubyMine to check out a bug, fired up the ol' testing server, browsed to the web page, and was greeted with this:

JSON::ParserError in Statistics#index
Showing /app/views/layouts/application.html.erb where line #5 raised:

757: unexpected token at 'Node Commands

Syntax:
node {operator} [options] [arguments]

Parameters:
/? or /help - Display this help message.
list - List nodes or node history or the cluster
listcores - List cores on the cluster
view - View properties of a node
online - Set nodes or node to online state
offline - Set one or more nodes to the offline state

For more information about HPC command-line tools,
see http://go.microsoft.com/fwlink/?LinkId=120724.

(in /app/assets/javascripts/accounts.js.coffee)

Now, of all possible error messages to get, one involving HPC command-line tools was not one that I had ever considered. What's going on here?

The line that gives the error is one that occurs in every Rails application:

<%= javascript_include_tag 'application' %>

What it really comes down to saying is, take all of the Javascript and Coffeescript code in the application, and include it here. A little wasteful, perhaps, but in a production environment all of the code will get glommed into a single file appropriate for caching in the browser, so it's not a bad system. Still, where's the error?

Obviously it must be in the Coffeescript file referenced at the end of the error message. But why is it suddenly complaining about HPC? I went and commented that whole file out - no joy. The same error got reported with a different file. I could find no information anywhere about what the cause might have been, and I didn't expect searching for the error message itself would do any good. I work in an HPC - which stands for High Performance Computing, incidentally - environment so I was fairly sure the error was something very specific to my machine.

I tried, of course. Several different Google searches failed - but I finally hit on the right combination of keywords that led me to a Google Plus comment on a YouTube tutorial.

It turns out that Rails, in its infinite struggle for flexibility, relies on a gem called ExecJS to decide how it's going to convert Coffeescript to Javascript. ExecJS, in turn, is of the opinion that Node.js is a really good way to do the conversion. And it determines if Node.js is available by trying to run an application called Node.

Which is fine, unless you happen to run Windows; and you happen to work in an HPC environment; and you happen to have gone to a conference the other week where you decided to install Microsoft's HPC tools; which happen to include a nice command for managing compute nodes; called - you guessed it - Node. ExecJS was trying to use Microsoft's Node command to convert Coffeescript to Javascript, resulting in the error you see. Hats off to Jonathan McDonald for nailing the issue.
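If you suspect the same kind of name collision, a quick check is to see what "node" actually resolves to on your PATH and whether it answers like Node.js. This is only a rough diagnostic sketch in Python, not anything ExecJS itself does; the helper names are made up, and it assumes that Node.js answers `node --version` with something like v18.17.0.

```python
import re
import shutil
import subprocess


def looks_like_nodejs(version_output):
    """Return True if the text looks like a Node.js version string, e.g. 'v18.17.0'."""
    return re.fullmatch(r"v\d+\.\d+\.\d+\S*", version_output.strip()) is not None


def check_node_on_path():
    """Return (path, ok): what 'node' resolves to, and whether it answers like Node.js."""
    path = shutil.which("node")
    if path is None:
        return None, False
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return path, False
    return path, looks_like_nodejs(result.stdout)


if __name__ == "__main__":
    path, ok = check_node_on_path()
    print(path, "looks like Node.js" if ok else "does not look like Node.js")
```

A program that replies with a help screen instead of a version number, like the one in the error message above, fails the check.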

Wednesday, July 16, 2014

Trinity RNA-Seq

In January a site went live called readingroom.info. I suspect due to the timing and the subject matter it was a student project. The idea was to write summaries - no more than 500 words - of scientific papers and allow people to comment on and discuss them. I thought it was a neat idea. They had some ideas for incentivizing writers and so forth, but I didn't have time to contribute anything until this summer, by which time the authors had apparently lost interest. I sent in a summary of a paper, but after several weeks it had not been approved by the moderators. Maybe it just wasn't that good!

The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.

This is what I sent them:

The reconstruction of a transcriptome from the short reads generated by RNA-Seq techniques presents many challenges, particularly in the absence of an existing reference genome with which to compare the reads. Challenges include: (i) uneven coverage of the various transcripts; (ii) uneven coverage inside each transcript; (iii) sequencing errors in highly expressed transcripts; (iv) transcripts encoded by adjacent loci can overlap and thus can be erroneously fused to form a chimeric transcript; (v) data structures need to accommodate multiple transcripts per locus, owing to alternative splicing; and (vi) sequences that are repeated in different genes introduce ambiguity. The Trinity pipeline leverages several properties of transcriptomes in its assembly procedure: it uses transcript expression to guide the initial transcript assembly procedure in a strand-specific manner, it partitions RNA-Seq reads into sets of disjoint transcriptional loci, and it traverses each of the transcript graphs systematically to explore the sets of transcript sequences that best represent variants resulting from alternative splicing or gene duplication by exploiting pairs of RNA-Seq reads. The series of steps performed by the pipeline correctly reconstructs a significant percentage of the transcripts without relying on the existence of a reference genome.

A major data structure used by the pipeline is the de Bruijn graph. A de Bruijn graph places each k-mer in a node, and connects two nodes if their k-mers are identical in all but the first or last position. While an efficient structure for representing heavily overlapping sequences, there are challenges in the usage of these graphs: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. Sequencing errors would introduce false nodes to the graph, potentially resulting in a great deal of wasted memory.
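To make the structure concrete, here is a toy sketch of a de Bruijn graph built from a couple of made-up reads. It is an illustration only, not Trinity's implementation: the function names are invented, k is tiny, and the edge weight is simply the number of times two k-mers appear next to each other in a read.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in the order they occur in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Return a Counter mapping (u, v) edges to the number of reads positions supporting them.

    Nodes are k-mers. Consecutive k-mers in a read share k-1 characters,
    i.e. the end of u (minus its first base) equals the start of v (minus its last base).
    """
    edges = Counter()
    for read in reads:
        ks = kmers(read, k)
        for u, v in zip(ks, ks[1:]):
            edges[(u, v)] += 1
    return edges


# Hypothetical reads, chosen only so that they overlap.
edges = build_de_bruijn(["ACGTAC", "CGTACG"], k=3)
nodes = {kmer for edge in edges for kmer in edge}
```

Here the edge from CGT to GTA is seen in both reads, so it carries more weight than the edge from ACG to CGT, which is seen only once.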

The Trinity pipeline consists of the following steps: it first analyzes the short reads to create a dictionary of all sequences of length 25 in the reads, indexing the locations where each sequence may be found. After removing likely errors, the unique k-mers are recombined, starting with the most frequently occurring sequences and extending the combination until no more k-mers can be matched. Each contig is then added to a cluster based on potential alternative spliced transcripts or otherwise unique portions of paralogous genes. Then, a de Bruijn graph is generated from each cluster with the weight of each edge assigned from the number of k-mers in the original read set that support the connection. In the final phase, a merge-and-prune operation on each graph, for error correction, is performed, followed by an enumeration of potential paths through the graph with a greater likelihood placed on paths with greater read support.
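The first two steps, counting k-mers and greedily extending from the most frequent one, can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the pipeline's actual code: it uses a tiny k instead of 25, skips error removal, builds only one contig, and the function names are made up.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def best_unused(candidates, counts, used):
    """Most frequent candidate k-mer that exists in the reads and has not been used yet."""
    available = [c for c in candidates if c in counts and c not in used]
    return max(available, key=lambda c: counts[c]) if available else None


def greedy_contig(counts, k):
    """Seed with the most frequent k-mer, then extend right and left one base at a time."""
    seed, _ = counts.most_common(1)[0]
    used = {seed}
    contig = seed
    while True:  # extend to the right
        best = best_unused([contig[-(k - 1):] + b for b in "ACGT"], counts, used)
        if best is None:
            break
        used.add(best)
        contig += best[-1]
    while True:  # extend to the left
        best = best_unused([b + contig[:k - 1] for b in "ACGT"], counts, used)
        if best is None:
            break
        used.add(best)
        contig = best[0] + contig
    return contig


# Hypothetical reads: the second one makes GTTG the most frequent 4-mer.
counts = count_kmers(["ACGTTGCA", "GTTG"], k=4)
contig = greedy_contig(counts, k=4)
```

Starting from GTTG, the extension picks up TTGC and TGCA on the right and CGTT and ACGT on the left, recovering the full first read.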

The authors built transcriptomes from both original data and reference sets, having a great deal of success in either case.

Tuesday, July 08, 2014

The Slim protocol

I wrote earlier about Fitnesse and Slim style integration testing.

There are only a few commands associated with Slim tables, but the precise meaning of the command is left to the Slim server. For example, according to the Slim documentation, the Import instruction:

causes the <path> to be added to the search path for fixtures. In java <path> gets added to the CLASSPATH. In .NET, the <path> would name a dll.

This explains why import tests are always green. You can pass them any random string and it's just added to the path. In WaferSlim, the concept is extended slightly: if the string has a slash or backslash in it, it's considered a path. Otherwise, it's considered a file and is added to the list of files searched for classes. Let's take an example. Create a file called fixtures.py with the following code:

class SomeDecisionTable:
    def setInput(self, x):
        self.x = str(x)

    def getOutput(self):
        return int(self.x) + 1

There's an odd Unicode issue that I think requires context variables to be saved as strings. I don't know if Fitnesse, Slim or WaferSlim is responsible for that. At any rate, it's simple enough: set the input to an integer and the output will be the integer plus one.

Save that file. Now the Import table will take two lines, one with the path, one with the file:

!|import |
|/path/to/fixtures|
|fixtures|
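The rule WaferSlim applies to each import cell, as described above, boils down to something like this. It's a hypothetical stand-in for the bookkeeping, not WaferSlim's real code; the ImportRegistry class and its attribute names are made up.

```python
class ImportRegistry:
    """Hypothetical illustration of the import rule; not taken from WaferSlim."""

    def __init__(self):
        self.paths = []  # directories to search for fixture files
        self.files = []  # file (module) names to search for classes

    def add_import(self, argument):
        argument = argument.strip()
        # A slash or backslash marks the string as a path; anything else is a file.
        if "/" in argument or "\\" in argument:
            self.paths.append(argument)
        else:
            self.files.append(argument)


registry = ImportRegistry()
for cell in ["/path/to/fixtures", "fixtures"]:
    registry.add_import(cell)
```

Note that nothing here checks whether the path or the file exists, which fits with import tables always going green.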

Now WaferSlim knows where to find our code, so let's take a look at another table.

Here's what the Slim protocol tells us: it can send Make instructions and Call instructions. A Make instruction tells the server to create an object; a Call instruction tells it to call a method on that object. I'd hoped to go through the Fitnesse code to determine exactly how that works, but I didn't. We'll take it on faith that the first line of a table sends a Make instruction and further lines send Call instructions. So, in order to make calls into our object, we write:

!|SomeDecisionTable|
|input|get output?|
|1 |2 |
|10 |11 |

From the first line of the table, Fitnesse sends a Make: SomeDecisionTable command to the Slim server. Because of our Import statements, the server searches the /path/to/fixtures directory for the fixtures.py file, finds the SomeDecisionTable class, and instantiates it.

The second line of the table tells Fitnesse what Call statements to make. It will call setInput and then getOutput. Each further line of the table gives arguments for the input and output. So the sequence of events from the Slim server's perspective is as follows:

Make: SomeDecisionTable
Call: setInput(1)
Call: getOutput()
Call: setInput(10)
Call: getOutput()

Click the test button now and everything should run green.
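To see that sequence without a Fitnesse server, here is a small simulation that walks the same table against the same fixture class and records the Make and Call steps. It's a sketch under the assumptions stated in this post (first row makes the object, header cells name the methods); the helper names method_name and run_decision_table are made up, and it ignores everything else the real protocol does.

```python
class SomeDecisionTable:
    def setInput(self, x):
        self.x = str(x)

    def getOutput(self):
        return int(self.x) + 1


def method_name(header):
    """Turn a column header into a method name: 'input' -> 'setInput', 'get output?' -> 'getOutput'."""
    words = header.rstrip("?").split()
    camel = words[0] + "".join(w.capitalize() for w in words[1:])
    if header.endswith("?"):
        return camel  # output column: call the method as named
    return "set" + camel[0].upper() + camel[1:]  # input column: call a setter


def run_decision_table(fixture_class, headers, rows):
    """Return (log, results): the instruction sequence and a pass/fail flag per output cell."""
    log = ["Make: " + fixture_class.__name__]
    fixture = fixture_class()
    results = []
    for row in rows:
        for header, cell in zip(headers, row):
            name = method_name(header)
            value = cell.strip()
            if header.endswith("?"):
                log.append("Call: %s()" % name)
                results.append(str(getattr(fixture, name)()) == value)
            else:
                log.append("Call: %s(%s)" % (name, value))
                getattr(fixture, name)(value)
    return log, results


log, results = run_decision_table(
    SomeDecisionTable, ["input", "get output?"], [["1 ", "2 "], ["10 ", "11 "]]
)
```

The log comes out in the same order as the sequence listed above, and both output cells match.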

Sunday, July 06, 2014

Fitnesse and Slim

Fitnesse is a rather nice testing tool. It's been around since what seems like the beginning of the integration testing movement. It's based around HTML tables that are translated into application code. The wiring that translates tables into code is called a fixture. Since Fitnesse was written in Java, it was mostly useful for testing Java code, although many translations and tricks were devised to allow tests of other languages. Fitnesse also includes a wiki to make the process of creating the test tables easier.

A few years after Fitnesse was introduced, an alternate translation tool, SLIM, was created. The idea behind Slim was to allow testers to implement tests in their language of choice, with the communication between Fitnesse and the tests taking place over a socket. This allows any application that implements the correct protocol to run tests written in Fitnesse tables.

Several applications were created to support the protocol in various languages, including RubySlim for Ruby and WaferSlim for Python. I had occasion to write a set of integration tests recently, so I thought it would be simple to set up a Slim server to get the tests running. Turned out, I was wrong.

For my tests, I wanted to set up the simplest thing that could possibly work. So what is the minimum requirement for a Slim test page? If you look at the RubySlim documentation you end up with a page that looks something like this:

!define TEST_SYSTEM {slim}
!define TEST_RUNNER {rubyslim}
!define COMMAND_PATTERN {rubyslim}
!path your/ruby/fixtures

!|import|
|<ruby module of fixtures>|

|SomeDecisionTable|
|input|get output?|
|1 |2 |

What all these things do is not clearly explained. Click the test button; you get a bunch of cryptic error messages. So what do they mean?

Well, the first line is pretty clear: if you want to use Slim, you need to set the test system. For the rest of it, we pretty much need to go directly to the source code. It turns out that the heart of Slim is in the Java class ProcessBuilder. In the CommandRunner class of Fitnesse we find:

ProcessBuilder processBuilder = new ProcessBuilder(command);

The command argument is no more than the COMMAND_PATTERN variable defined on the page, with a single argument appended, a port number. So if you wanted to use WaferSlim (a Python Slim server), you might say:

!define COMMAND_PATTERN {python /home/benfulton/slim-init.py}

(Since we need to understand how Slim works, let's get the WaferSlim source from Github rather than installing it via Pip. The slim-init source is from this gist.)

So, after downloading WaferSlim and the code from the gist, we should be able to get the Simplest Possible Page to work. Here it is:

!define TEST_SYSTEM {slim}
!define COMMAND_PATTERN {python /home/benfulton/slim-init.py }

!|import |
|RubySlim|
|app|

Click the test button, and you should get a green test.

If you don't, you may have a Slim versioning error. Go to protocol.py and change the version number from 0.1 to 0.3. Or Python may not be in your path, or the init-script may not be found. Or - here's a tricky one - you put a space before the python command. If you did that, the first command that gets passed to the ProcessBuilder class will be an empty string, and it won't run. Still, this will run green without too much effort.
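The leading-space problem is easy to picture with a small sketch of how the command line might get built. This is an assumption about the mechanics, not Fitnesse's actual code: it supposes the COMMAND_PATTERN is split on single spaces and the port appended, and the function name build_command and the port number 8085 are made up for the example.

```python
def build_command(command_pattern, port):
    """Split the pattern on single spaces and append the port, as a list of arguments."""
    return command_pattern.split(" ") + [str(port)]


good = build_command("python /home/benfulton/slim-init.py", 8085)
bad = build_command(" python /home/benfulton/slim-init.py", 8085)
```

With the extra space, the first element of the list is an empty string, so there is no program name to launch.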

But wait! What's the use of this import statement that went green before we wrote anything?