Monday, September 03, 2012

Coding errors in DNA analysis software

The problem:

A method for analysing similarities in protein sequences is to use a substitution scoring matrix. The matrix will assign a specific score to each individual protein match, so you can compare the sequences by looking at each individual pair of proteins in the sequence, looking in the matrix to determine the compatibility score for the pair, and adding up (or otherwise aggregating) the total score. The higher the score, the more likely it is, presumably, that the two sequences are actually related.

Emergent Homochirality  - for the Emergence Group
The art and science of comparing proteins

As a completely made-up example, if you have one sequence that goes Alanine - Glutamine - Isoleucine - Serine, and another sequence that goes Alanine - Glutamine - Tryptophan - Serine, you would look in the matrix for the scores for Alanine paired with Alanine, Glutamine with Glutamine, Isoleucine with Tryptophan, and Serine with Serine. If the identically matching pairs all scored 10, and the Isoleucine-Tryptophan pair scored 2, you would have a score of 32. On the other hand, if your matrix says that Isoleucine paired with Tryptophan gives you a -50, your score would total -20.

The program:

It's obviously pretty important to your analysis what matrix you use. Lots of people have come up with lots of matrices. One in particular, known as BLOSUM, was introduced in 1992 by Steven and Jorja Henikoff. The Henikoffs wrote a program (source code here) that analyzed a slew of already-matched sequences and assigned them scores, based on the probability that you would see them in a matching pair and the probability that you would see them at all. (This factor is relevant since some proteins overall appear with more frequency than others. You can find more details in this paper). They wrote it all up in a nice paper and the BLOSUM matrices have been in heavy use ever since.

But the funny thing is, there was a bug.

Yes, yes, I know what you're thinking: a bug? In a piece of software? Impossible! But it's true, and the bug wasn't pointed out, from the time the program was written in 1992, until 2008. It seems that, according to a paper by Mark Styczynski et al, there was an "incorrect normalization during a weighting procedure", which caused the program to generate different matrices based on the ordering of the sequences used as source.

So why did it take so long to find?

I'm thinking that the primary reason is because the source code is pretty incomprehensible. I'm not blaming the authors - I was writing code in 1992 and I'm sure that if I could even find any of it, it would be no easier to understand. My code, like theirs, would have been written in C, and in 1992 I think we were lucky to get code to work at all. Today, we have niceties such as unit testing, powerful IDE's, language features like garbage collection, and automatically generated documentation. In 1992, we had lint, and sometimes we even used it. Today, we have the confidence to make changes to the code strictly for reasons of readability and clarity. In 1992, we ran the program, looked at the output, and hoped it looked right. If it did, we were done. Plus, we had lint.

Here's the second reason the bug took so long to find: The matrix that was generated by the buggy code...worked, as far as anyone could see. People used the matrix for years and were perfectly satisfied with the results. As a matter of fact, Styczynski ran a comparison with the corrected matrices and determined that results were poorer using them than they had been using the buggy ones. I'm not sure why or how that could be. I suppose it might simply be that the Henikoffs happened to get lucky to create the better matrix, or tried a bunch of different ways to build a matrix and chose the most successful one to write about, not realizing that the paper described a slightly different algorithm than the code was actually running. Or even that the method Styczynski used to compare the matrices was wrong.

But there's no question...

Whatever the reason, it's hard to argue with Styczynski's conclusion: "there is significant room for improvement in our understanding of protein evolution.".