Saturday, March 23, 2013

Analyzing DNA with BWA

Last time I discussed a million-dollar challenge posted by Innocenture. The challenge is to analyze a series of DNA reads and determine the source. My professor suggested using a program named BWA, a Burrows-Wheeler Aligner. Burrows-Wheeler alignment is an algorithm for matching up two sequences which may not be the same length. Perhaps one sequence has two extra nucleotides in the middle but the entire rest of the sequences match. These kinds of sequences are very difficult to find naively, and which match is better can come down to a judgment call.

So the goal is to align each sequence in the example file to the human genome. BWA, for speed, asks that you index the target genome - the human genome, in this case - which takes a fair amount of time. Then, you pass it each sequence on the command line, with a command something like:

>bwa align AAGCTCTA human_genome

and it does its magic and returns the location in the genome of the best several matches.

So I worked up a Python script to analyze the input file and pass each sequence to BWA in that format, and let it run for a few hours. (I'm never sure how many because I never remember to change the machine's power settings to not switch itself off after a few minutes of no UI activity). But it eventually completed. I went to look at the results, and I found them very intriguiing. Although, according to the challenge, no less than 90% of the DNA was human, the BWA program only managed to match about 50%, about 150,000 reads.

I took a closer look at the BWA manual, and found this option:

-o INTMaximum number of gap opens [1]

If I understand this correctly, it means that if the alignment has more than one gap in it, BWA will discard it as not being a match. You can change the value of this parameter, but when I did, it seemed to slow down the analysis quite a bit. I didn't let it complete, but based on the portions I did run I suspect it would have gone over the three-hour limitation - at least on my workstation, which I'm sure is underpowered compared to the target hardware.

So I started to think about what sort of coding would need to be done to meet this challenge. Next time I'll think about Analyzing DNA Programmatically