Last time I discussed using a program called BWA to try to determine if a given sequence was part of the human genome or not. It didn't seem to do a great job. How do we approach it programmatically?
Let's review the problem first. You have a machine that has analyzed some DNA, and it's given you some sequences it's found. Roughly 300,000 of them. Some with maybe only five characters, some with a few hundred. We call them "reads".
Now, here's the catch: a read, instead of being four characters, is reasonably a minimum of fifty characters. The human genome, instead of being nine characters, is roughly six billion characters. 6,000,000,000 characters. If your first thought is to fire up a text editor and do a search, you had better make sure that (a) the editor is capable of handling a six gigabyte file, and (b) that your computer has six gigs available to throw around (In 2013, a high-end laptop will probably have eight gigs, so just barely enough space to hold the genome and run a couple of other programs.)
Another, minor issue is that most genome files are formatted with sixty-character lines. If your sequence is split across two lines, a simple text search won't find it. You can preprocess the file to remove all the carriage returns, but if you do, your text editor had better be able to handle one line with 6 billion characters on it!
Your next idea: grep. Will grep find a 50-character match in a genome? I couldn't find a good way to do it reliably, since, again, genome files tend to be formatted into 60-character lines, and grep isn't really capable of finding a string that breaks across a line. Moreover, grep is a line-based tool, so odds are if you try preprocessing the line, grep will think (rightly) that you have a file with one six-billion character line in it, try to read in the line, and run out of memory.
Here's the other catch: Suppose you could get grep to work, and it managed to search the whole genome in three seconds. But we have 300,000 reads to get through! That's about 250 hours worth of search time and we're under a three-hour deadline. So we can't really afford to search the entire genome for each read.
So we'll write a script. A good processing plan might be this: index the genome, index the reads, then march down the two together to find our list of best prefixes. Sounds great!
But it turns out that even indexing a file the size of the human genome is a pain. I'll think about why next time.
Also see: A Million Dollars Up For Grabs