Sunday, May 19, 2013

Indexing the human genome

Last time I had decided that to efficiently analyze the reads, we had to make an index of the human genome. So how do we go about that?

Library of Congress Classification - Reading Room
Indexing at the Library of Congress
What we need is an efficient way to access any given substring in the genome. It's not quite the same as indexing a book; rather than determining the locations of "dynamo", "father", and "pseudopodia" in the book, we need to be able to find the location of EVERY substring. It's as if, in our book, we had to find instances of "dynamo", "ynamo", "namo" and so on. Not only that, but if the book had the sentence, "A dynamo has unlimited duration." we have to find instances of "namo h", "namo ha" "namo has" and so on.

So we can't just split the genome by word boundaries like we would for a book. Can we split it into even-sized chunks and index those? For example, could we choose a chunk size of five and split every ten characters into two index entries?

It won't work. For example, if the genome was "AGACTTGCTG", we might choose to index every five characters (called a 5-mer). This would give us two strings, "AGACT" and "TGCTG", which is fine, but if we come along later and try to search for "CTTGC", we're out of luck - that's not in our index. But it is in the string.

So we have to choose an index size and go through the genome character by character. In our ten-character genome, indexing by 5-mer, we get the strings:

AGACT
GACTT
ACTTG
CTTGC
TTGCT
TGCTG

(and a few shorter strings at the end, if we so desire.)

To be useful, we'll have to store our index as a dictionary of 5-mers to an array of integers, representing the locations in the genome where that string was found. For our sample, we have 30 characters worth of 5-mers, and just six integers to save, for a total of 54 bytes. What happens when we index the whole genome?

It's a fair assumption that every possible string will be in the genome, so we'll have 4or 1024 entries in our index. The number of values, though...we have to have an index value for every single character in the genome. If our genome was the book Moby Dick, we'd have around a million characters to index. Each index value would have, on average, about 1000 items in it. If we're really hoping to match all of our reads to each item in the index, we're going to go through 1000 entries for each read, which might be a bit slow. Unfortunately, our genome isn't Moby Dick. It's roughly the size of 6,000 Moby Dicks. Each index is going to be in just about six million locations. It can't possibly read all those at the speed we require.

Okay, so if we have too many items per index value, we just have to make our index larger. What if we do 10-mers, or even 20-mers? Well, 10-mers gives us 410 entries - call it a million. That means each index will be found in about 6,000 locations. Kind of a lot, but maybe doable. If we do 20-mers - well, 420 is 1,099,511,627,776. This is more on the lines of what we need in an index - it's well beyond the number of characters in the genome, so each index shouldn't show up in more than one or two locations. There's just one small problem: We now have six billion entries with twenty-character identifiers, and the space our index needs is now up to 120 gigabytes!

Maybe we could tweak and tune and find a sweet spot, but instead we'll try a different approach entirely to indexing. Next time.

Part I: A Million Dollars Up For Grabs
Part II: Analyzing DNA with BWA
Part III: Analyzing DNA Programmatically
Part IV: Indexing the Human Genome



Monday, May 13, 2013

Analyzing DNA programmatically

Last time I discussed using a program called BWA to try to determine if a given sequence was part of the human genome or not. It didn't seem to do a great job. How do we approach it programmatically?

Let's review the problem first. You have a machine that has analyzed some DNA, and it's given you some sequences it's found. Roughly 300,000 of them. Some with maybe only five characters, some with a few hundred. We call them "reads".

Genomes of Canis lupus major ups! familiaris : Assembled chromosomes
Recall that a DNA sequence, from a programming standpoint, consists of any number of any of four characters: A,C,G, and T. A human genome, in the end, is just a DNA sequence. So if the human sequence was "AAGGTTTCC" and your sequence is "AGGT", bing! You've got a match.

Now, here's the catch: a read, instead of being four characters, is reasonably a minimum of fifty characters. The human genome, instead of being nine characters, is roughly six billion characters. 6,000,000,000 characters. If your first thought is to fire up a text editor and do a search, you had better make sure that (a) the editor is capable of handling a six gigabyte file, and (b) that your computer has six gigs available to throw around (In 2013, a high-end laptop will probably have eight gigs, so just barely enough space to hold the genome and run a couple of other programs.)

Another, minor issue is that most genome files are formatted with sixty-character lines. If your sequence is split across two lines, a simple text search won't find it. You can preprocess the file to remove all the carriage returns, but if you do, your text editor had better be able to handle one line with 6 billion characters on it!

Your next idea: grep. Will grep find a 50-character match in a genome? I couldn't find a good way to do it reliably, since, again, genome files tend to be formatted into 60-character lines, and grep isn't really capable of finding a string that breaks across a line. Moreover, grep is a line-based tool, so odds are if you try preprocessing the line, grep will think (rightly) that you have a file with one six-billion character line in it, try to read in the line, and run out of memory.

Here's the other catch: Suppose you could get grep to work, and it managed to search the whole genome in three seconds. But we have 300,000 reads to get through! That's about 250 hours worth of search time and we're under a three-hour deadline. So we can't really afford to search the entire genome for each read.

So we'll write a script. A good processing plan might be this: index the genome, index the reads, then march down the two together to find our list of best prefixes. Sounds great!

But it turns out that even indexing a file the size of the human genome is a pain. I'll think about why next time.

Also see: A Million Dollars Up For Grabs