Friday, March 22, 2013

A million dollars up for grabs

Of course, you have to do a few things to earn it. Innocenture is running a challenge, with a million-dollar prize, for software that can take a set of DNA reads, analyze them, and determine the precise species each read came from. Here's the kicker: it has to be done rapidly.

They have some examples available. The input is a large XML file that contains the DNA reads and some information on the quality of the reads, and they provide you with the output file they would be looking for. One example has a little more than 300,000 reads of between 50 and 200 nucleotides, putatively taken from a human. According to the output, at least 90% of the reads are from human DNA.

So how do we duplicate this output file? There's a program called BLAST available on the web for bioinformatic analysis: you give it a sequence of DNA and it almost immediately comes back with the closest matches across its entire, huge DNA database.

So, we might be able to slam that database with the reads and get back the results. There's just one problem - notice I said 300,000 reads? Suppose we could get each result back in one-tenth of a second. That makes 30,000 seconds, or a total of eight hours and 20 minutes of runtime. Sadly, the million dollars probably won't be given away unless you can get the reads done in under three hours. Oh, and did I mention the application won't actually have internet access?
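To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The function names are my own, and the three-hour budget is only my guess from above, not an official rule of the challenge. It works out the total runtime at a tenth of a second per read, and then flips the question around: how fast would each read have to be to fit the budget?

```python
# Back-of-the-envelope runtime estimate for the figures quoted above.
# Function names are illustrative; the three-hour budget is an assumption.

def total_runtime_seconds(num_reads: int, seconds_per_read: float) -> float:
    """Total time to process the reads one after another."""
    return num_reads * seconds_per_read


def format_hms(seconds: float) -> str:
    """Render a duration in seconds as 'Hh MMm SSs'."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def max_seconds_per_read(num_reads: int, budget_seconds: float) -> float:
    """Largest per-read time that still fits inside the budget."""
    return budget_seconds / num_reads


if __name__ == "__main__":
    reads = 300_000

    # One-tenth of a second per read, processed sequentially.
    print(format_hms(total_runtime_seconds(reads, 0.1)))

    # Per-read time needed to finish inside an assumed three-hour budget.
    per_read = max_seconds_per_read(reads, 3 * 3600)
    print(f"{per_read * 1000:.0f} ms per read")
```

Under those assumptions, each read would need to be handled in about 36 milliseconds, or roughly 28 reads every second, with no network round trip to lean on.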

So the BLAST site is out, which is unfortunate, because it really does an amazing job at matching sequences. What do we do instead?

Due to the nature of science in the United States - "publish or perish" - there are a whole lot of little bioinformatics applications around. Mostly, people will write one, publish a paper about it, and then forget about it. There's no point in maintaining it or going back and improving it since there's no chance of writing another paper about it unless you change the algorithms significantly.

Still, a few of these applications manage to have some shelf life. I'll look at Analyzing DNA with BWA next.

Also see:
Analyzing DNA Programmatically