Friday, December 21, 2012

Unit Test Coverage in Biopython


This fall, for the first time, I've had the professional opportunity to work on some Python code. I find the language quite elegant and simple to write, much more so than Ruby, although I couldn't really say why. Along with my work on a Master's degree in bioinformatics, I decided to look at the Biopython project and see about adding some code coverage statistics to it.

Biopython is an open-source software project that provides Python libraries for a variety of bioinformatics purposes. It includes, among other features, a variety of file parsers, alignment algorithms, and interfaces with various common bioinformatics tools and databases. To assure quality, a battery of automated tests is run against the Biopython source code before every release. While Biopython has a substantial amount of automated tests, no statistics are gathered concerning the code coverage of the tests.

It took several steps. The first was simply to run the existing test suite; I downloaded the codebase to my Windows machine, and tried to build it, but it turned out to be really difficult. Biopython has a dependency on NumPy as well as several other packages. I shaved several yaks attempting to get it running, but gave up eventually. Instead, I created a virtual Linux machine and installed it on that.

This went much more smoothly. (Note that I don't find Linux to have any natural advantage here. People who play with the Biopython source code run it on Linux, so that install gets a lot more attention.) I was able to run the test suite after a fashion. Not all of the tests ran the first time, which is a bit scary - it's hard to modify code with which you're not familiar without passing tests. But I realized after a while that the suite does some things like connect to remote services to verify that the connections work, and those services were down. There's an option to run only tests that don't attempt connections, and that worked - again, after a fashion. There are so many external dependencies that the test suite is set up to skip tests that require packages that aren't installed. But as all I was looking for was a passing test suite, I was OK with that.

The next step was a coverage tool. I chose Ned Batchelder's Coverage which worked nicely. All I had to do was replace the line "python run_tests.py --offline" with "coverage run run_tests.py --offline" and the coverage ran. I found the architecture of the tool a little strange; you don't specify any kind of output format. Instead, the tool creates its own binary data file (prefixed with a '.' to hide it on Linux systems) and you run a separate command to generate output in various formats. But this worked well and I had my report, in a nice HTML format so I could look at it in a browser.

Coverage is nice, but I don't know that a single report is very useful. Comparing coverage between builds is where the true benefit lies. Are some tests no longer covering code that they were meant to? Were tests disabled for some reason? Why did the coverage percentage drop? Ideally, each time the code is built, a new coverage report would be generated. So, I went on to look at the Biopython build process.

Biopython uses Travis for its builds. Travis is a rather nice public continuous integration server that integrates with Github, where the Biopython source code lives. You provide a well-named build script in your source tree and Travis monitors your repository for changes. It was easy enough to incorporate the coverage tool, but Travis didn't provide a good mechanism for reporting. The simplest thing, I decided, was to use a script to automatically upload a file back to Github. I jumped through several more hoops to get this working - authentication, permissions - and today I find, first, that Github is deprecating file uploads and, second, that Travis is supporting a new artifact system . So I think most of that work is out the window.

Bother.

So I'll have to rework some of that. Nonetheless, the coverage effort gave me some interesting results. Biopython consists of 298 source files comprising 39,805 statements. Automated unit tests covered all but 12,499 of those statements, for a coverage percentage of 69%. 33 files (11%) had no coverage at all, while 60 files (20%) were fully covered. Based on these numbers, Biopython has at this moment an almost acceptable level of coverage, as it is a truism in the software development world that coverage rates above 70% tend to lead to diminishing returns of value.

However, as a code library, Biopython is free from the problems of automated user interface testing, which is one of the more difficult areas of automated testing. For this reason, one might expect a somewhat higher coverage rate. There is reason to suspect, however, that as time goes by, the percentage will go higher. An examination of some of the source files with zero coverage reveal deprecation warnings, i.e. indications to the user that the module in question should not be used for new development. At some point it is to be assumed that these files will no longer be part of Biopython, which will drive the code coverage percentage upwards.

The introduction of a code coverage measuring tool to the Biopython build process is an important step, but only a first one. The code coverage tool selected is particular to Python, while a certain amount of the Biopython source code is written in C, a much more difficult language to instrument even for automated unit testing, much less code coverage reporting on that testing. However, useful work could be done in attempting to add this coverage. The techniques used here only monitor statement counts – perhaps branching counts would be a valuable addition. Biopython also has many dependencies on third-party products, some of which are installed during unit testing and some are not. Tests around these integration packages might prove useful. Finally, some work could be done on analyzing differences in builds, sending out alerts to interested parties if a given build should have a sudden drop in the code coverage percentage, for example.

Code coverage is a useful tool for long-running projects. Hopefully some of this effort will make it into the main Biopython codebase!

Thursday, September 13, 2012

Denisovan Gene Sequencing

What the fossil record shows about Denisovans: They had at least one finger, one toe, and a tooth.

This evidence comes from the Denisova Cave in Siberia.Scientists found the finger, toe, and tooth, which came from three different individuals, in different levels of the cave, and after doing a DNA analysis determined that their last common ancestor with humans lived about a million years ago. (The fossils dated from about 50,000 years ago). Of course, with such a minimal amount of material available, it's a bit tricky to get a complete gene sequence. The fact that the cave is in Siberia helps some (the average temperature in the cave is right around freezing) but some nice work on sequencing from a group led by Matthias Meyer helped as well.

Here's what they did to the source material:

DNA is dephosphorylated, heat denatured, and ligated to a biotinylated adaptor oligonucleotide, which allows its immobilization on streptavidincoated beads.

I'm sure you're kicking yourself for not thinking of it first. At any rate, the immobilization of the DNA on the beads seems to be the important part, as it allows for copying of the sequence thus creating extra source material to work with. We now know more about Denisovan gene seqences than we do about Neanderthal sequences, as the quality of these Denisovan genes is better, less contaminated, than anything we have from the Neanderthals. Pretty cool stuff! Here's an article from Ars Technica if you don't feel like wading through the original paper.


Monday, September 03, 2012

Coding errors in DNA analysis software

The problem:

A method for analysing similarities in protein sequences is to use a substitution scoring matrix. The matrix will assign a specific score to each individual protein match, so you can compare the sequences by looking at each individual pair of proteins in the sequence, looking in the matrix to determine the compatibility score for the pair, and adding up (or otherwise aggregating) the total score. The higher the score, the more likely it is, presumably, that the two sequences are actually related.

Emergent Homochirality  - for the Emergence Group
The art and science of comparing proteins

As a completely made-up example, if you have one sequence that goes Alanine - Glutamine - Isoleucine - Serine, and another sequence that goes Alanine - Glutamine - Tryptophan - Serine, you would look in the matrix for the scores for Alanine paired with Alanine, Glutamine with Glutamine, Isoleucine with Tryptophan, and Serine with Serine. If the identically matching pairs all scored 10, and the Isoleucine-Tryptophan pair scored 2, you would have a score of 32. On the other hand, if your matrix says that Isoleucine paired with Tryptophan gives you a -50, your score would total -20.

The program:

It's obviously pretty important to your analysis what matrix you use. Lots of people have come up with lots of matrices. One in particular, known as BLOSUM, was introduced in 1992 by Steven and Jorja Henikoff. The Henikoffs wrote a program (source code here) that analyzed a slew of already-matched sequences and assigned them scores, based on the probability that you would see them in a matching pair and the probability that you would see them at all. (This factor is relevant since some proteins overall appear with more frequency than others. You can find more details in this paper). They wrote it all up in a nice paper and the BLOSUM matrices have been in heavy use ever since.

But the funny thing is, there was a bug.

Yes, yes, I know what you're thinking: a bug? In a piece of software? Impossible! But it's true, and the bug wasn't pointed out, from the time the program was written in 1992, until 2008. It seems that, according to a paper by Mark Styczynski et al, there was an "incorrect normalization during a weighting procedure", which caused the program to generate different matrices based on the ordering of the sequences used as source.

So why did it take so long to find?


I'm thinking that the primary reason is because the source code is pretty incomprehensible. I'm not blaming the authors - I was writing code in 1992 and I'm sure that if I could even find any of it, it would be no easier to understand. My code, like theirs, would have been written in C, and in 1992 I think we were lucky to get code to work at all. Today, we have niceties such as unit testing, powerful IDE's, language features like garbage collection, and automatically generated documentation. In 1992, we had lint, and sometimes we even used it. Today, we have the confidence to make changes to the code strictly for reasons of readability and clarity. In 1992, we ran the program, looked at the output, and hoped it looked right. If it did, we were done. Plus, we had lint.

Here's the second reason the bug took so long to find: The matrix that was generated by the buggy code...worked, as far as anyone could see. People used the matrix for years and were perfectly satisfied with the results. As a matter of fact, Styczynski ran a comparison with the corrected matrices and determined that results were poorer using them than they had been using the buggy ones. I'm not sure why or how that could be. I suppose it might simply be that the Henikoffs happened to get lucky to create the better matrix, or tried a bunch of different ways to build a matrix and chose the most successful one to write about, not realizing that the paper described a slightly different algorithm than the code was actually running. Or even that the method Styczynski used to compare the matrices was wrong.

But there's no question...


Whatever the reason, it's hard to argue with Styczynski's conclusion: "there is significant room for improvement in our understanding of protein evolution.".

Tuesday, August 21, 2012

On the teaching of genetics

Rosemary J. Redfield wrote an article on the teaching of genetics that resonated with me. Apparently the standard theory for teaching genetics is a sort of reprise of the history of genetics - you start with Mendel and dominant and recessive genes, move on to genes being on chromosomes, and slowly get on, if you're lucky, by the end of the course, to the molecular analysis of genes. The theory behind it is that the students will stand in Mendel's shoes and ask, "Well, why are some genes dominant", which will be answered by the next phase of the course, which will prompt more questions, to be answered by the following unit, and so on.

The problem is, that strategy doesn't work.

(Close-Up) Erratic Black Hole Regulates Inside Quasar (NASA, Chandra, 03/25/09)It reminds me of astronomy classes both in high school and college. Now, astronomy is an awesome and fascinating subject. Go pick up any popular science magazine with an article on astronomy and just check out the language that they use: "Quasar". "Black Hole". "Dark Energy". "Strange Planet". It's like the whole subject was created just to appeal to teenagers. There is a podcast dedicated to astronomy called AstronomyCast that goes over a lot of this stuff, and my eleven-year-old son cannot go to sleep at night without listening to at least a few episodes.

But I hope that the interest isn't torn out of him in high school. If his courses are anything like mine, they will discuss: Stonehenge. Galileo. How, if you stay up night after night, you can see the position of the planets change slightly in relation to the stars. How an optical telescope works.

“Students know what the cool stuff is.”
This isn't the cool stuff. Students know what the cool stuff is. If they don't listen to AstronomyCast, they've probably seen episodes of Nova, or at the very least, played a videogame in which black holes or wormholes play an integral part.

Genetics is just the same. Engage the student's interest by hitting them with the cool stuff first. Don't try to emulate the thinking of Mendel, because the instruments and techniques we use today are so much more powerful than Mendel ever dreamed of, and students know that, and often know what the techniques are. Redfield suggests starting with personal genomics, which seems like a good plan. Students, who know that they have a unique genetic makeup, should be interested in knowing what that makeup is, or at least how to find out. This would lead directly to the ethical questions surrounding that knowledge, and the course is off and running. Redfield is on to something.

Saturday, August 18, 2012

DNA as storage mechanism

It seems some East Coast researchers are pushing the envelope in storing information in DNA. They encoded a book, roughly 5 megabits of data, into oligonucleotides, which were "synthesized by ink-jet printed, high-fidelity DNA microchips" which I don't fully understand but I presume means "made into a glob of DNA". The authors then sequenced the DNA and recovered all the data, with 10 incorrect bits. DNA
The innate four bases (A,G,C,T) of DNA seem to lend themselves to some interesting storage techniques. The authors used simple redundancy for their storage - A and C both represented 0, G and T were 1, which was apparently a departure from earlier attempts which encoded each pair of bits into a single base. This made it easier to construct more robust sequences. I wonder if additional error-handling could have been done by placing checksum bases at intervals along the strand? Two bases would provide a range of 16 possible checksum values which seems it would handle a nice string of bits.

The book that was encoded had 50,000 words and eleven pictures. With an average code space of 40 bits per word, the text should have taken a tiny fraction of the total space, with the images providing the majority. Suppose that all ten bit errors were in one picture? It would be interesting to know how tightly compressed the images were. With a high compression factor, some of the bit errors might be substantial, but small changes to the compression might make a large difference in the visibility of any bit errors.

The authors say that DNA storage is dense, stable, and energy-efficient, but prohibitively expensive and slow to read and write compared to more standard storage. It will be fun to see how this technology evolves!

Tuesday, July 03, 2012

Back to school!

In the fall I'll be enrolling for classes once more. Cathy's now finished her Doctor of Nursing Practice degree so now it's my turn. I've been accepted to the bioinformatics program at Indiana University. It appears that my first classes will be: Intro to Biology, Intro to Informatics, and Intro to Bioinformatics. It should be a lot of fun - I'm really looking forward to it.