Last time I discussed using a program called BWA to try to determine if a given sequence was part of the human genome or not. It didn't seem to do a great job. How do we approach it programmatically?
Let's review the problem first. You have a machine that has analyzed some DNA, and it's given you some sequences it's found. Roughly 300,000 of them. Some with maybe only five characters, some with a few hundred. We call them "reads".
Recall that a DNA sequence, from a programming standpoint, consists of any number of any of four characters: A, C, G, and T. A human genome, in the end, is just a DNA sequence. So if the human sequence is "AAGGTTTCC" and your sequence is "AGGT", bing! You've got a match.
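In Python the toy case is a plain substring search. A minimal sketch, using the nine-character example sequence from above (the variable names are just for illustration):

```python
# Toy example from the text: is the read a substring of the genome?
genome = "AAGGTTTCC"
read = "AGGT"

# str.find returns the 0-based offset of the first match, or -1 if none.
position = genome.find(read)
is_match = position != -1
print(is_match, position)
```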
Now, here's the catch: a read, instead of being four characters, is realistically a minimum of fifty characters. The human genome, instead of being nine characters, is roughly six billion characters. 6,000,000,000 characters. If your first thought is to fire up a text editor and do a search, you had better make sure that (a) the editor is capable of handling a six gigabyte file, and (b) your computer has six gigs available to throw around. (In 2013, a high-end laptop will probably have eight gigs, so just barely enough space to hold the genome and run a couple of other programs.)
Another, minor issue is that most genome files are formatted with sixty-character lines. If your sequence is split across two lines, a simple text search won't find it. You can preprocess the file to remove all the line breaks, but if you do, your text editor had better be able to handle one line with 6 billion characters on it!
Your next idea: grep. Will grep find a 50-character match in a genome? I couldn't find a good way to do it reliably, since, again, genome files tend to be formatted into 60-character lines, and grep isn't really capable of finding a string that breaks across a line. Moreover, grep is a line-based tool, so odds are that if you try preprocessing the file, grep will think (rightly) that you have a file with one six-billion-character line in it, try to read in the line, and run out of memory.
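One way around the line-break problem without loading the whole file is to stream it and carry a short tail from one line to the next. What follows is a rough sketch, not a tuned tool: the function name is made up, it assumes a FASTA-style file where header lines start with '>', and it treats the whole file as one continuous sequence (so it could report a match that straddles two chromosomes).

```python
import io

def find_in_wrapped_sequence(lines, pattern):
    """Return the 0-based offset of pattern in the sequence, or -1.

    `lines` is any iterable of text lines (for example an open file).
    Header lines starting with '>' are skipped, line breaks are dropped,
    and only a small tail of the sequence is kept in memory at a time.
    """
    overlap = len(pattern) - 1
    tail = ""       # last few characters carried over from earlier lines
    consumed = 0    # number of sequence characters seen before `tail`
    for line in lines:
        if line.startswith(">"):
            continue
        window = tail + line.strip().upper()
        hit = window.find(pattern)
        if hit != -1:
            return consumed + hit
        # Keep just enough characters to catch a match that spans lines.
        keep = window[-overlap:] if overlap > 0 else ""
        consumed += len(window) - len(keep)
        tail = keep
    return -1

# The toy genome from earlier, wrapped so that "GTTT" straddles a line break.
wrapped = io.StringIO(">toy\nAAGGT\nTTCC\n")
print(find_in_wrapped_sequence(wrapped, "GTTT"))
```

On a real file you would pass the open file object instead of the StringIO. This only solves the wrapping problem; it is still one full pass over the genome per read.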
Here's the other catch: Suppose you could get grep to work, and it managed to search the whole genome in three seconds. But we have 300,000 reads to get through! That's about 250 hours worth of search time and we're under a three-hour deadline. So we can't really afford to search the entire genome for each read.
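The arithmetic is worth spelling out, because it also tells you how much time each read can actually have. A quick sketch using only the numbers above (the three-second search time is the optimistic guess, not a measurement):

```python
reads = 300_000
seconds_per_search = 3      # the optimistic guess from the text
deadline_hours = 3

# Total time if every read triggers a full three-second search.
total_hours = reads * seconds_per_search / 3600

# Time budget per read if all reads must finish within the deadline.
budget_ms_per_read = deadline_hours * 3600 * 1000 / reads

print(total_hours, budget_ms_per_read)
```

So a one-read-at-a-time approach has about 36 milliseconds per read to work with.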
So we'll write a script. A good processing plan might be this: index the genome, index the reads, then march down the two together to find our list of best prefixes. Sounds great!
But it turns out that even indexing a file the size of the human genome is a pain. I'll think about why next time.
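To make "index the genome" concrete, here is one simple way it could look: a dictionary from every k-character substring to the places it occurs. This is only a sketch of the idea on a toy string; the function names and the choice of k are mine, and it finds exact matches only.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every length-k substring of genome to the offsets where it starts."""
    index = defaultdict(list)
    for offset in range(len(genome) - k + 1):
        index[genome[offset:offset + k]].append(offset)
    return index

def find_read(genome, index, k, read):
    """Return offsets where read occurs exactly in genome, using the index."""
    if len(read) < k:
        return []
    # Look up the read's first k characters, then verify the rest in place.
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if genome.startswith(read, pos)]

genome = "AAGGTTTCCAAGGTA"
index = build_kmer_index(genome, k=4)
print(find_read(genome, index, 4, "AGGT"))
print(find_read(genome, index, 4, "AGGTA"))
```

Note that the index stores one entry for every position in the genome, which hints at why this gets painful at six billion characters.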
Also see: A Million Dollars Up For Grabs
Ramblings of a software developer with a degree in bioinformatics. Agile development mixed with DNA sequencing - what could go wrong?
Monday, May 13, 2013
Saturday, March 23, 2013
Analyzing DNA with BWA
Last time I discussed a million-dollar challenge posted by InnoCentive. The challenge is to analyze a series of DNA reads and determine the source. My professor suggested using a program named BWA, the Burrows-Wheeler Aligner. BWA uses the Burrows-Wheeler transform to build a compact index of a reference sequence, then aligns short reads against that index, even when a read doesn't match exactly or the two sequences are not the same length. Perhaps one sequence has two extra nucleotides in the middle but the entire rest of the sequences match. These kinds of matches are very difficult to find naively, and which match is better can come down to a judgment call.
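To see why "two extra nucleotides in the middle" defeats a plain substring search, it helps to measure how far apart two sequences are. The sketch below is ordinary dynamic-programming edit distance, not what BWA does internally; the sequences are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

reference = "AAGCTCTA"
read = "AAGCTGGCTA"   # same as reference, plus "GG" inserted in the middle
print(read in reference)               # exact search finds nothing
print(edit_distance(reference, read))  # but the two are only two edits apart
```

The cost here grows with the product of the two lengths, which is fine for a toy and hopeless for a read against a whole genome. That gap is the reason indexed aligners exist.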
So the goal is to align each sequence in the example file to the human genome. BWA, for speed, asks that you index the target genome - the human genome, in this case - which takes a fair amount of time. Then, you pass it each sequence on the command line, with a command something like:
>bwa align AAGCTCTA human_genome
and it does its magic and returns the location in the genome of the best several matches.
So I worked up a Python script to analyze the input file and pass each sequence to BWA in that format, and let it run for a few hours. (I'm never sure how many, because I never remember to change the machine's power settings so it doesn't switch itself off after a few minutes of no UI activity.) But it eventually completed. I went to look at the results, and I found them very intriguing. Although, according to the challenge, no less than 90% of the DNA was human, the BWA program only managed to match about 50%, about 150,000 reads.
I took a closer look at the BWA manual, and found this option:
| -o INT | Maximum number of gap opens [1] |
If I understand this correctly, it means that if the alignment has more than one gap in it, BWA will discard it as not being a match. You can change the value of this parameter, but when I did, it seemed to slow down the analysis quite a bit. I didn't let it complete, but based on the portions I did run I suspect it would have gone over the three-hour limit - at least on my workstation, which I'm sure is underpowered compared to the target hardware.
So I started to think about what sort of coding would need to be done to meet this challenge. Next time I'll think about Analyzing DNA Programmatically.
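Here is what that reading of the option would mean in practice. This is purely an illustration of counting gap opens in an alignment written out with '-' characters; it is not BWA code, the function name and the example alignment are made up, and the limit of 1 is the default shown in the manual excerpt.

```python
import re

def count_gap_opens(aligned_read, aligned_reference):
    """Count gap opens in a pairwise alignment written with '-' for gaps.

    A run of consecutive '-' characters counts as one gap open, however
    long the run is.
    """
    gap_run = re.compile(r"-+")
    return (len(gap_run.findall(aligned_read))
            + len(gap_run.findall(aligned_reference)))

max_gap_opens = 1  # default from the manual excerpt above

# One gap (two characters long) in the reference, one gap in the read.
aligned_read = "AAGCTGGCTA-GT"
aligned_reference = "AAGCT--CTACGT"

opens = count_gap_opens(aligned_read, aligned_reference)
print(opens, opens <= max_gap_opens)
```

Under that reading, a read like this one would be thrown out even though most of it lines up.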
Friday, March 22, 2013
A million dollars up for grabs
Of course, you have to do a few things to earn it. InnoCentive is running a challenge, with a million dollar prize, for the ability to take a set of DNA reads, analyze them, and determine the precise species each read came from. Here's the kicker: it has to be done rapidly.
They have some examples available. The input is in a large XML file that contains the DNA reads and some information on the quality of the reads, and they provide you with the output file they would be looking for. One example has a little more than 300,000 reads of between 50 and 200 nucleotides, putatively taken from a human. According to the output, at least 90% of the reads are from human DNA.
So how do we duplicate this output file? There's a program called BLAST available on the web for bioinformatic analysis - you give it a sequence of DNA and it almost immediately comes back with the closest matches across NCBI's entire, huge sequence database.
So, we might be able to slam that database with the reads and get back the results. There's just one problem - notice I said 300,000 reads? Suppose we could get back each one in one tenth of a second. That makes 30,000 seconds, or a total of eight hours and 20 minutes of runtime. Sadly, the million dollars probably won't be given away unless you can get the reads done in under three hours. Oh, and did I mention the application won't actually have internet access?
So the BLAST site is out, which is unfortunate, because it really does an amazing job at matching sequences. What do we do instead?
Due to the nature of science in the United States - "publish or perish" - there are a whole lot of little bioinformatics applications around. Mostly, people will write one, publish a paper about it, and then forget about it. There's no point in maintaining it or going back and improving it since there's no chance of writing another paper about it unless you change the algorithms significantly.
Still, a few of these applications manage to have some shelf life. I'll look at Analyzing DNA with BWA next.
Also see:
Analyzing DNA Programmatically
Thursday, March 21, 2013
RNA Polymerase III and the RIG-I pathway
A little story about immune responses in cells.
Type-I interferons (IFNs) are important for antiviral and autoimmune responses. They interfere with viruses as the viruses try to borrow the cell's replication mechanism to reproduce themselves.
The cell will produce interferons due to a couple of proteins: the retinoic acid-inducible gene I (RIG-I) and mitochondrial antiviral signaling (MAVS) proteins.
These, in turn, start the production process when cytosolic double-stranded RNA or single-stranded RNA containing 5′-triphosphate (5′-ppp) is present.
Here's a surprising thing: Cytosolic B-form double-stranded DNA can also induce IFN-β. For example, a DNA sequence of repeating AT, known as poly(dA-dT), can induce it. But no one knew how, until a paper came out in 2009 by Yu-Hsin Chiu and a couple of other people. It turned out that inside the cell, the poly(dA-dT) was actually being converted into 5′-ppp RNA.
But how? It turns out that an enzyme uses the poly(dA-dT) as a template to synthesize 5′-ppp RNA. The enzyme is DNA-dependent RNA polymerase III (Pol-III). This was interesting because it was known that Pol-III had a role in the nucleus of the cell, but not that it had anything to do with the immune system.
If you inhibit the working of Pol-III in a cell, and then introduce a bacterium like Legionella pneumophila, the bacterium grows in the cell. The implication is that Pol-III senses the DNA of the bacterium and triggers the IFN process.
How did they do it?
In a cell, they attached a luciferase reporter to the IFN-β promoter, so if the cell creates IFN-β, it would bioluminesce. Then, they put different things in the cell. Of all the things tested, only poly(dA-dT) activated IRF3, a transcription factor that switches on IFN-β.
To ensure that there wasn’t something going on at another step in the path, some other things were tried: A silencing RNA strand was introduced into the cell that would stop the production of RIG-I and MAVS. No IFN-β was produced. DNase I is an enzyme that breaks down DNA. When that was introduced, no IFN-β was produced. On the other hand, IFN-β was produced in the presence of RNase I, so breaking down RNA had no effect.
Nucleic acids from the poly(dA-dT) cells were able to induce IFN-β, even in the presence of DNase I, so it wasn’t DNA that was causing it. Production stopped in the presence of RNase I though, so it must have been RNA that was being produced.
Similar tests were done to determine the minimum length of the poly(dA-dT). As few as 30 base pairs were able to trigger the IFN. But longer sequences with G’s and C’s failed to trigger anything.
RNA Characteristics
Two enzymes, polynucleotide kinase (PNK) and shrimp alkaline phosphatase (SAP) are used by chemists: the former adds a phosphate group to a DNA or RNA molecule, the latter removes one. A third enzyme, Terminator Exonuclease, or Ter Ex, breaks apart RNA with exactly one phosphate at the 5’ end.
When SAP was used to remove the phosphates, the RNA no longer induced IFN-β, while PNK on its own had no effect. Even when PNK was then used to add back a phosphate, there was still no induction, implying that a single phosphate was inadequate. Consistent with that, treating the RNA with Ter Ex made no difference.
Another pair of RNase enzymes specifically break apart single-stranded RNA (ssRNA) or double-stranded RNA (dsRNA). RNase III breaks apart dsRNA, while RNase T1 breaks apart ssRNA. RNase III turned out to inhibit the IFN-β, indicating that dsRNA was required.
Put all these together and it seems that the trigger is dsRNA with multiple phosphate groups attached.
So the chain takes you from the poly(dA-dT) to a 5′-ppp RNA.
Conclusion
Other tests bring you to the conclusion that Pol-III is the enzyme that carries out this conversion. Thus, Pol-III, in the cytoplasm of a cell, actually acts as a DNA sensor that will trigger an immune response. An entirely different function from the one it has inside the cell nucleus. Quite a surprise!
Friday, December 21, 2012
Unit Test Coverage in Biopython
This fall, for the first time, I've had the professional opportunity to work on some Python code. I find the language quite elegant and simple to write, much more so than Ruby, although I couldn't really say why. Along with my work on a Master's degree in bioinformatics, I decided to look at the Biopython project and see about adding some code coverage statistics to it.
Biopython is an open-source software project that provides Python libraries for a variety of bioinformatics purposes. It includes, among other features, a variety of file parsers, alignment algorithms, and interfaces with various common bioinformatics tools and databases. To assure quality, a battery of automated tests is run against the Biopython source code before every release. While Biopython has a substantial number of automated tests, no statistics are gathered concerning the code coverage of the tests.
It took several steps. The first was simply to run the existing test suite; I downloaded the codebase to my Windows machine, and tried to build it, but it turned out to be really difficult. Biopython has a dependency on NumPy as well as several other packages. I shaved several yaks attempting to get it running, but gave up eventually. Instead, I created a virtual Linux machine and installed it on that.
This went much more smoothly. (Note that I don't find Linux to have any natural advantage here. People who play with the Biopython source code run it on Linux, so that install gets a lot more attention.) I was able to run the test suite after a fashion. Not all of the tests ran the first time, which is a bit scary - it's hard to modify code with which you're not familiar without passing tests. But I realized after a while that the suite does some things like connect to remote services to verify that the connections work, and those services were down. There's an option to run only tests that don't attempt connections, and that worked - again, after a fashion. There are so many external dependencies that the test suite is set up to skip tests that require packages that aren't installed. But as all I was looking for was a passing test suite, I was OK with that.
The next step was a coverage tool. I chose Ned Batchelder's Coverage, which worked nicely. All I had to do was replace the line "python run_tests.py --offline" with "coverage run run_tests.py --offline" and the coverage ran. I found the architecture of the tool a little strange; you don't specify any kind of output format. Instead, the tool creates its own binary data file (prefixed with a '.' to hide it on Linux systems) and you run a separate command to generate output in various formats. But this worked well and I had my report, in a nice HTML format so I could look at it in a browser.
Coverage is nice, but I don't know that a single report is very useful. Comparing coverage between builds is where the true benefit lies. Are some tests no longer covering code that they were meant to? Were tests disabled for some reason? Why did the coverage percentage drop? Ideally, each time the code is built, a new coverage report would be generated. So, I went on to look at the Biopython build process.
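As a sketch of what comparing builds could look like: given per-file coverage percentages from two builds, flag the files that dropped. Everything here is hypothetical, including the function name, the file names, the numbers, and the one-point threshold; how you would pull the percentages out of the coverage report is left open.

```python
def coverage_drops(previous, current, threshold=1.0):
    """Return files whose coverage fell by more than `threshold` points.

    `previous` and `current` map file names to coverage percentages for
    two builds. Files missing from `current` are ignored here.
    """
    drops = {}
    for filename, old_percent in previous.items():
        new_percent = current.get(filename)
        if new_percent is None:
            continue
        if old_percent - new_percent > threshold:
            drops[filename] = (old_percent, new_percent)
    return drops

# Invented numbers for two hypothetical builds.
build_a = {"module_a.py": 91.0, "module_b.py": 84.0}
build_b = {"module_a.py": 91.0, "module_b.py": 72.5}
print(coverage_drops(build_a, build_b))
```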
Biopython uses Travis for its builds. Travis is a rather nice public continuous integration server that integrates with GitHub, where the Biopython source code lives. You provide a well-named build script in your source tree and Travis monitors your repository for changes. It was easy enough to incorporate the coverage tool, but Travis didn't provide a good mechanism for reporting. The simplest thing, I decided, was to use a script to automatically upload a file back to GitHub. I jumped through several more hoops to get this working - authentication, permissions - and today I find, first, that GitHub is deprecating file uploads and, second, that Travis is supporting a new artifact system. So I think most of that work is out the window.
Bother.
So I'll have to rework some of that. Nonetheless, the coverage effort gave me some interesting results. Biopython consists of 298 source files comprising 39,805 statements. Automated unit tests covered all but 12,499 of those statements, for a coverage percentage of 69%. 33 files (11%) had no coverage at all, while 60 files (20%) were fully covered. Based on these numbers, Biopython has at this moment an almost acceptable level of coverage, as it is a truism in the software development world that coverage rates above 70% tend to lead to diminishing returns of value.
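For anyone checking the math, the percentages fall straight out of the counts above. A small sketch using only the figures quoted in this post:

```python
total_statements = 39_805
missed_statements = 12_499
total_files = 298
files_with_no_coverage = 33
files_fully_covered = 60

covered_statements = total_statements - missed_statements
coverage_percent = 100 * covered_statements / total_statements
no_coverage_percent = 100 * files_with_no_coverage / total_files
full_coverage_percent = 100 * files_fully_covered / total_files

print(covered_statements,
      round(coverage_percent),
      round(no_coverage_percent),
      round(full_coverage_percent))
```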
However, as a code library, Biopython is free from the problems of automated user interface testing, which is one of the more difficult areas of automated testing. For this reason, one might expect a somewhat higher coverage rate. There is reason to suspect, however, that as time goes by, the percentage will go higher. An examination of some of the source files with zero coverage reveals deprecation warnings, i.e. indications to the user that the module in question should not be used for new development. At some point it is to be assumed that these files will no longer be part of Biopython, which will drive the code coverage percentage upwards.
The introduction of a code coverage measuring tool to the Biopython build process is an important step, but only a first one. The code coverage tool selected is particular to Python, while a certain amount of the Biopython source code is written in C, a much more difficult language to instrument even for automated unit testing, much less code coverage reporting on that testing. However, useful work could be done in attempting to add this coverage. The techniques used here only monitor statement counts – perhaps branching counts would be a valuable addition. Biopython also has many dependencies on third-party products, some of which are installed during unit testing and some are not. Tests around these integration packages might prove useful. Finally, some work could be done on analyzing differences in builds, sending out alerts to interested parties if a given build should have a sudden drop in the code coverage percentage, for example.
Code coverage is a useful tool for long-running projects. Hopefully some of this effort will make it into the main Biopython codebase!
