Actually, there's a fair amount of redundancy in an index. If there are three instances of the word "Sheep" in our book, and we add a fourth instance of it to the index, we're increased the number of sheep, so to speak, by 33%. If we do that for every word in the book we'll add a significant chunk to the overall size.
But suppose, rather than just give the user the page numbers where the word "Sheep" is, we provided them with with the line and word numbers as well. "Sheep" is on page 88, line 14, word six. Now, the index is in alphabetical order of course, so what we can do is simply eliminate all those redundant words from the index. So the index entry for "Sheep" would just say "88,14,6". Say readers want to find the word "Streams". They would go and find an entry "88,14,6" and look up the word in the book. Finding that it's "Sheep", they realize the word is later in the index, since "Streams" is alphabetically after "Sheep". They go to the next entry in the index, maybe "81,19,11", and look up that word in the book. It's "Streams", so they've found their word, and the index didn't require any of those annoying, redundant words in it!
|Sheep. By a stream.|
I won't go into the details of how to create such an index right now, but it can be done in relatively few lines of code. (One easy-to-use library is called SAIS.) Now, the simplest way to write out a suffix array is a text file of the alphabetized suffix indices, in ASCII format, one number per line. This, unfortunately, brings us straight back to the size problem - let's say eight bytes on average, per line, with one line per character in the genome, and we end up with 24 gigabytes worth of index. But at least it's a workable index. If we split the index into several files to keep it manageable, it even suggests a refinement of our attack on the overall problem. We'll see how next time.
(Update: I wrote an article on suffix array algorithms.)
Part I: A Million Dollars Up For Grabs
Part II: Analyzing DNA with BWA
Part III: Analyzing DNA Programmatically
Part IV: Indexing the Human Genome
Part V: Suffix Arrays