GDB (https://www.gnu.org/software/gdb/) is a handy way to debug command-line applications. When an application runs many threads, though, GDB doesn't follow a single thread by default, so as you step through the code it jumps between threads and it's easy to lose track of where you are. The solution is the scheduler-locking setting, which lets only the current thread run while you step.
(gdb) set scheduler-locking on
See here for details: https://sourceware.org/gdb/current/onlinedocs/gdb/All_002dStop-Mode.html
Ramblings of a software developer with a degree in bioinformatics. Agile development mixed with DNA sequencing - what could go wrong?
Thursday, July 23, 2020
Sunday, July 19, 2020
Creating a Hyperbook in Microsoft Word
When someone creates a document they'll possibly set up a table of contents which conveniently links to the chapter headings they've created. They'll very likely provide hyperlinks to their sources or references so it's easy to go out on the web and find the sources. But it's pretty rare to provide handy links inside the document pointing to other places in the document - a hyperbook. Now, there's no reason not to keep providing outside links as well, but a good hyperbook is like a self-contained Wikipedia - lots of good information and lots of links to related subjects of interest directly on the page.
To insert internal links into a document in Microsoft Word, do the following.  On the Insert tab, there's a Links panel. Click that, then Link, then Insert Link. The dialog that comes up offers a variety of ways to insert internal links. Very nice for creating hyperbooks.
Thursday, July 02, 2020
Force-closing an SSH connection
On occasion when I'm using SSH to connect to a remote server, I run an application that hangs. If the terminal's running inside a GUI, you can always close down the entire terminal and restart it, but there's an easier way: hit Enter, then type "~." (squiggle dot). That force-closes the SSH session, leaving your terminal intact. I learned this from SuperUser:
https://superuser.com/q/467398
Friday, June 05, 2020
Downloading files from Google Cloud
In doing some testing with GATK 4, I found myself needing to download files from Google Cloud. Google Cloud likes to use URLs
that start with gs: For example, the URL for some tumor data is
gs://gatk-best-practices/somatic-b37/HCC1143.bam .
You can't just visit that URL in your browser though; or at least I couldn't. I had to install gsutil as described here: https://cloud.google.com/storage/docs/gsutil_install#linux . This is one of those weird installs where they provide a script online that you can run; a bit dangerous, but at least they don't ask for sudo. It downloads about a gazillion files then asks permission to muck with your settings. I said no, of course, and it gave me a couple of files to source if I wanted. One of them had to do with providing autocomplete, but the other one simply added a directory to the path, so I created
an environment module to do that work. Now I can download the files I need:
$ gsutil cp gs://gatk-best-practices/somatic-b37/HCC1143.bam .
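For objects that are publicly readable, a gs: URL also has a plain HTTPS form on storage.googleapis.com that a browser or curl can fetch. Here is a minimal sketch of the mapping in Python; the function name gs_to_https is made up for illustration, and whether this particular bucket allows anonymous reads is an assumption I haven't verified.

```python
from urllib.parse import quote, urlparse

def gs_to_https(gs_url: str) -> str:
    """Map gs://bucket/object to its storage.googleapis.com URL."""
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    # Percent-encode the object name but keep the slashes between path parts.
    object_name = quote(parsed.path.lstrip("/"), safe="/")
    return f"https://storage.googleapis.com/{parsed.netloc}/{object_name}"

print(gs_to_https("gs://gatk-best-practices/somatic-b37/HCC1143.bam"))
```

For private buckets this won't help; gsutil with credentials is still the way to go.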
Friday, May 15, 2020
Github offers successor options
Github is putting some thought into what happens to your repositories if you're "unable" to manage them - a kind way of saying if you die. Nothing I have would have any consequence, but certainly I'm involved with some organizations that would need taking over. Here's how you name a successor for your repositories:
https://github.blog/changelog/2020-05-11-account-successors/
Thursday, April 30, 2020
Covid-19 infections by county, rate of increase
Wednesday, April 29, 2020
Generating sequentially increasing values in C++
Say you need to generate a sequence, in C++, simply consisting of the first so many integers, like, 1,2,3,4,5.
With a little programming experience, you can come up with a dozen different ways to do this, but here's an obscure one:
#include <list>
#include <numeric>  // std::iota
std::list<int> l(5);
std::iota(l.begin(), l.end(), 1);  // l now holds 1, 2, 3, 4, 5
According to CppReference, the function is named after the integer function ⍳ from the programming language APL.
The more you know.
Saturday, April 11, 2020
Command Line tips from CLI Magic
Tips for working with the command line from CLI magic. I like this one to show all listening TCP/UDP ports for the current user:
$ lsof -Pan -i tcp -i udp
https://www.patreon.com/posts/climagic-003-5-35703693
Wednesday, April 01, 2020
Covid-19 infections by county, over time
This is a choropleth visualization I put together of Covid-19 infections for each county over time. It's based on the NY Times data set, and I built the images in Python following a very good, if ancient, tutorial I found at FlowingData.
Edit: Updated through Apr. 13 data, and also made the original larger. Not sure if that matters for this web page or not.
This image is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. I would appreciate attribution if you care to use it!
Tuesday, March 31, 2020
BeautifulSoup4 in Python
A lot of the available code that uses BeautifulSoup tells you to import it like so:
from BeautifulSoup import BeautifulSoup
If you've installed BeautifulSoup4, though, this won't work. The main module name has been changed to bs4. So change the code that does that to:
from bs4 import BeautifulSoup
Please make a note of it.
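If you're not sure which version a given machine has, a quick check like the following can tell you which module name is importable. This is only a sketch: find_beautifulsoup is a name I made up, and it assumes the legacy version 3 module is named BeautifulSoup.

```python
import importlib.util

def find_beautifulsoup():
    """Return the importable Beautiful Soup module name, or None."""
    for name in ("bs4", "BeautifulSoup"):  # version 4 first, then legacy 3
        if importlib.util.find_spec(name) is not None:
            return name
    return None

print(find_beautifulsoup())
```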
Thursday, February 27, 2020
Pre-defined compiler macros
If you're working in a multiple-compiler environment, you may wish to get information at compile time about which compiler you're building with. Here's a list of macros defined by many compilers:
https://sourceforge.net/p/predef/wiki/Compilers/
Friday, February 14, 2020
Enabling wind power nationwide
A report from the Obama Department of Energy in 2015.
https://www.energy.gov/sites/prod/files/2015/05/f22/Enabling%20Wind%20Power%20Nationwide_18MAY2015_FINAL.pdf
Thursday, February 13, 2020
Take a newick tree and stre-e-etch the leaves to make it ultrametric
I needed to generate some ultrametric Newick trees for some simulations. There's a nice Newick tree generator online at http://trex.uqam.ca/index.php?action=randomtreegenerator&project=trex, but it doesn't have any way to make an ultrametric tree. (Ultrametric means that the lengths from root to tip are all the same.) So I put together a short script using BioPython that stretches the branch of each leaf node so that every root-to-tip length matches the longest one.
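The BioPython script itself isn't reproduced here, so what follows is a stand-in rather than the original: a minimal sketch of the same stretching idea on plain nested dictionaries, so it runs without BioPython. The dictionary layout (name, length, children) and the function names are assumptions made for illustration.

```python
def leaf_depths(node, depth=0.0):
    """Yield (leaf, root-to-tip distance) for every leaf under node."""
    depth += node.get("length", 0.0)
    children = node.get("children", [])
    if not children:
        yield node, depth
    for child in children:
        yield from leaf_depths(child, depth)

def make_ultrametric(root):
    """Lengthen each leaf branch so every tip sits as deep as the deepest tip."""
    leaves = list(leaf_depths(root))
    deepest = max(depth for _, depth in leaves)
    for leaf, depth in leaves:
        leaf["length"] = leaf.get("length", 0.0) + (deepest - depth)
    return root

def to_newick(node):
    """Serialize the nested-dict tree as a Newick string (without the final ';')."""
    children = node.get("children", [])
    label = node.get("name", "")
    if children:
        label = "(" + ",".join(to_newick(c) for c in children) + ")" + label
    if "length" in node:
        label += f":{node['length']:g}"
    return label

# The tree ((A:1,B:2):1,C:1); written as nested dicts.
tree = {"children": [
    {"length": 1.0, "children": [
        {"name": "A", "length": 1.0},
        {"name": "B", "length": 2.0},
    ]},
    {"name": "C", "length": 1.0},
]}
print(to_newick(make_ultrametric(tree)) + ";")  # ((A:2,B:2):1,C:3);
```

Only the leaf branches change; internal branch lengths are left alone, which is what gives the stretched look.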
How to delete empty rows in an Excel spreadsheet
How to delete empty rows in an Excel spreadsheet:
https://www.itsupportguides.com/knowledge-base/office-2016/excel-2016-how-to-delete-empty-rows/
Monday, August 10, 2015
How Installing a Random Tool Caused Ruby on Rails to Stop Working
At work I maintain a little Ruby on Rails site. Not a particularly interesting site, just a set of pages of statistics gathered from various databases around the organization. It runs on a RHEL6 virtual machine and I write most of the code on my Windows laptop using RubyMine from JetBrains - a product I wholeheartedly endorse, by the way.
It does take a little patience to run Rails on Windows - not due to any deficit in either product, just because the vast majority of people who work with Rails are running it on some other platform, so when you have problems, a lot of times you're on your own.
Take this problem I ran into recently, for example. I launched RubyMine to check out a bug, fired up the ol' testing server, browsed to the web page, and was greeted with this:
JSON::ParserError in Statistics#index
Showing /app/views/layouts/application.html.erb where line #5 raised:
757: unexpected token at 'Node Commands
Syntax:
node {operator} [options] [arguments]
Parameters:
/? or /help - Display this help message.
list - List nodes or node history or the cluster
listcores - List cores on the cluster
view - View properties of a node
online - Set nodes or node to online state
offline - Set one or more nodes to the offline state
For more information about HPC command-line tools,
see http://go.microsoft.com/fwlink/?LinkId=120724.
(in /app/assets/javascripts/accounts.js.coffee)
Now, of all possible error messages to get, one involving HPC command-line tools was not one that I had ever considered. What's going on here?
The line that gives the error is one that occurs in every Rails application:
<%= javascript_include_tag 'application' %>
What it really comes down to saying is, take all of the Javascript and Coffeescript code in the application, and include it here. A little wasteful, perhaps, but in a production environment all of the code will get glommed into a single file appropriate for caching in the browser, so it's not a bad system. Still, where's the error?
Obviously it must be in the Coffeescript file referenced at the end of the error message. But why is it suddenly complaining about HPC? I went and commented that whole file out - no joy. The same error got reported with a different file. I could find no information anywhere about what the cause might have been, and I didn't expect searching for the error message itself would do any good. I work in an HPC (High Performance Computing) environment, so I was fairly sure the error was something very specific to my machine.
I tried, of course. Several different Google searches failed - but I finally hit on the right combination of keywords that led me to a Google Plus comment on a YouTube tutorial.
It turns out that Rails, in its infinite struggle for flexibility, relies on a gem called ExecJS to decide how it's going to convert Coffeescript to Javascript. ExecJS, in turn, is of the opinion that Node.js is a really good way to do the conversion. And it determines if Node.js is available by trying to run an application called Node.
Which is fine, unless you happen to run Windows; and you happen to work in an HPC environment; and you happen to have gone to a conference the other week where you decided to install Microsoft's HPC tools; which happen to include a nice command for managing compute nodes; called - you guessed it - Node. ExecJS was trying to use Microsoft's Node command to convert Coffeescript to Javascript, resulting in the error you see. Hats off to Jonathan McDonald for nailing the issue.
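A quick way to diagnose this kind of name clash is to ask which "node" your PATH resolves to and whether it answers like Node.js. The sketch below is a diagnostic of my own, not how ExecJS does its detection; it assumes that the real Node.js prints a version string such as v18.17.0 in response to --version.

```python
import shutil
import subprocess

def looks_like_nodejs(version_output: str) -> bool:
    """Node.js answers `node --version` with a line such as 'v18.17.0'."""
    text = version_output.strip()
    return text.startswith("v") and text[1:2].isdigit()

def check_node_on_path():
    """Report which `node` would run and whether it behaves like Node.js."""
    path = shutil.which("node")
    if path is None:
        return None, False
    try:
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return path, False
    return path, looks_like_nodejs(result.stdout)

print(check_node_on_path())
```

On my machine the first element would have pointed into the HPC tools directory, which would have saved a few Google searches.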
Wednesday, July 16, 2014
Trinity RNA-Seq
In January a site went live called readingroom.info. I suspect due to the timing and the subject matter it was a student project. The idea was to write summaries - no more than 500 words - of scientific papers and allow people to comment on and discuss them. I thought it was a neat idea. They had some ideas for incentivizing writers and so forth, but I didn't have time to contribute anything until this summer, by which time the authors had apparently lost interest. I sent in a summary of a paper, but after several weeks it had not been approved by the moderators. Maybe it just wasn't that good! 
The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.
This is what I sent them:
The reconstruction of a transcriptome from the short reads generated by RNA-Seq techniques presents many challenges, particularly in the absence of an existing reference genome with which to compare the reads. Challenges include: (i) uneven coverage of the various transcripts; (ii) uneven coverage inside each transcript; (iii) sequencing errors in highly expressed transcripts; (iv) transcripts encoded by adjacent loci can overlap and thus can be erroneously fused to form a chimeric transcript; (v) data structures need to accommodate multiple transcripts per locus, owing to alternative splicing; and (vi) sequences that are repeated in different genes introduce ambiguity. The Trinity pipeline leverages several properties of transcriptomes in its assembly procedure: it uses transcript expression to guide the initial transcript assembly procedure in a strand-specific manner, it partitions RNA-Seq reads into sets of disjoint transcriptional loci, and it traverses each of the transcript graphs systematically to explore the sets of transcript sequences that best represent variants resulting from alternative splicing or gene duplication by exploiting pairs of RNA-Seq reads. The series of steps performed by the pipeline correctly reconstructs a significant percentage of the transcripts without relying on the existence of a reference genome.
A major data structure used by the pipeline is the de Bruijn graph. A de Bruijn graph places each k-mer in a node, and connects two nodes if their k-mers are identical in all but the first or last position. While this is an efficient structure for representing heavily overlapping sequences, there are challenges in using these graphs: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. Sequencing errors would introduce false nodes to the graph, potentially resulting in a great deal of wasted memory.
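To make the data structure concrete, here is a toy de Bruijn graph builder in Python. It is a sketch of the general idea only, not Trinity's implementation: nodes are k-mers, an edge joins two k-mers that sit next to each other in a read (so they overlap in all but one position), and each edge counts how many times the reads support it. The reads and the function name are invented for the example.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins k-mers that are consecutive in a read."""
    edges = defaultdict(int)  # (from_kmer, to_kmer) -> times seen in the reads
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            edges[(left, right)] += 1
    return dict(edges)

graph = de_bruijn(["AGACTTG", "GACTTGC"], k=5)
for (left, right), support in sorted(graph.items()):
    print(left, "->", right, support)
```

A single sequencing error in a read would add up to k spurious nodes here, which is the memory problem the summary mentions.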
The Trinity pipeline consists of the following steps: it first analyzes the short reads to create a dictionary of all sequences of length 25 in the reads, indexing the locations where each sequence may be found. After removing likely errors, the unique k-mers are recombined, starting with the most frequently occurring sequences and extending the combination until no more k-mers can be matched. Each contig is then added to a cluster based on potential alternative spliced transcripts or otherwise unique portions of paralogous genes. Then, a de Bruijn graph is generated from each cluster with the weight of each edge assigned from the number of k-mers in the original read set that support the connection. In the final phase, a merge-and-prune operation on each graph, for error correction, is performed, followed by an enumeration of potential paths through the graph with a greater likelihood placed on paths with greater read support.
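The first two steps, counting k-mers and greedily extending from the most frequent one, can be sketched in a few lines of Python. This is a heavily simplified illustration, not the pipeline's code: it extends only to the right, does no error removal, uses k=5 instead of 25 so the example fits on a line, and assumes k is at least 2. The function names and reads are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Dictionary of every k-mer in the reads and how often it occurs."""
    return Counter(read[i:i + k] for read in reads
                   for i in range(len(read) - k + 1))

def greedy_contig(counts, k):
    """Extend the most frequent k-mer to the right, using each k-mer once."""
    remaining = dict(counts)
    seed = max(remaining, key=remaining.get)
    del remaining[seed]
    contig = seed
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + base for base in "ACGT"
                      if suffix + base in remaining]
        if not candidates:
            return contig
        best = max(candidates, key=remaining.get)  # most abundant extension
        del remaining[best]
        contig += best[-1]

counts = count_kmers(["AGACTTG", "GACTTGC", "GACTTGC"], k=5)
print(greedy_contig(counts, k=5))  # GACTTGC
```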
The authors built transcriptomes from both original data and reference sets, having a great deal of success in either case.
Tuesday, July 08, 2014
The Slim protocol
I wrote earlier about Fitnesse and Slim style integration testing.
There are only a few commands associated with Slim tables, but the precise meaning of the command is left to the Slim server. For example, according to the Slim documentation, the Import instruction:
causes the <path> to be added to the search path for fixtures. In java <path> gets added to the CLASSPATH. In .NET, the <path> would name a dll.
This explains why import tests are always green. You can pass them any random string and it's just added to the path. In WaferSlim, the concept is extended slightly: if the string has a slash or backslash in it, it's considered a path. Otherwise, it's considered a file and is added to the list of files searched for classes. Let's take an example. Create a file called fixtures.py with the following code:
class SomeDecisionTable:
    def setInput(self, x):
        self.x = str(x)

    def getOutput(self):
        return int(self.x) + 1
There's an odd Unicode issue that I think requires context variables to be saved as strings. I don't know if Fitnesse, Slim or Waferslim is responsible for that. At any rate, it's simple enough: set the input to an integer and the output will be the integer plus one.
Save that file. Now the Import table will take two lines, one with the path, one with the file:
!|import |
|/path/to/fixtures|
|fixtures|
Now WaferSlim knows where to find our code, so let's take a look at another table.
Here's what the Slim protocol tells us: it can send Make instructions and Call instructions. A Make instruction tells the server to create an object; a Call instruction tells it to call a method on that object. I'd hoped to go through the Fitnesse code to determine exactly how that works, but I didn't. We'll take it on faith that the first line of a table sends a Make instruction and further lines send Call instructions. So, in order to make calls into our object, we write:
!|SomeDecisionTable|
|input|get output?|
|1    |2          |
|10   |11         |
From the first line of the table, Fitnesse sends a Make: SomeDecisionTable command to the Slim server. Because of our Import statements, the server searches the /path/to/fixtures directory for the fixtures.py file, finds the SomeDecisionTable class, and instantiates it.
The second line of the table tells Fitnesse what Call statements to make. It will call setInput and then getOutput. Each further line of the table gives arguments for the input and output. So the sequence of events from the Slim server's perspective is as follows:
Make: SomeDecisionTable
Call: setInput(1)
Call: getOutput()
Call: setInput(10)
Call: getOutput()
Click the test button now and everything should run green.
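The Make and Call sequence can be mimicked in plain Python to see what the server has to do with our fixture. This is a sketch of the dispatch idea only: the tuples are a made-up stand-in for the real Slim wire format, and run_instructions is a hypothetical helper, not WaferSlim code.

```python
class SomeDecisionTable:
    def setInput(self, x):
        self.x = str(x)

    def getOutput(self):
        return int(self.x) + 1

def run_instructions(instructions, fixtures):
    """Apply Make/Call instructions in order and collect Call return values."""
    instance = None
    results = []
    for instruction in instructions:
        kind, name, *args = instruction
        if kind == "make":
            instance = fixtures[name]()  # look up the class and instantiate it
        elif kind == "call":
            results.append(getattr(instance, name)(*args))
        else:
            raise ValueError(f"unknown instruction: {kind}")
    return results

instructions = [
    ("make", "SomeDecisionTable"),
    ("call", "setInput", "1"),
    ("call", "getOutput"),
    ("call", "setInput", "10"),
    ("call", "getOutput"),
]
print(run_instructions(instructions, {"SomeDecisionTable": SomeDecisionTable}))
# [None, 2, None, 11]
```

The two non-None results are what Fitnesse compares against the "get output?" column.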
Sunday, July 06, 2014
Fitnesse and Slim
Fitnesse is a rather nice testing tool. It's been around since what seems like the beginning of the integration testing movement. It's based around HTML tables that are translated into application code. The wiring that translates tables into code is called a fixture. Since Fitnesse was written in Java, it was mostly useful for testing Java code, although many translations and tricks were devised to allow tests of other languages. Fitnesse also includes a wiki to make the process of creating the test tables easier.
A few years after Fitnesse was introduced, an alternate translation tool, SLIM, was created. The idea behind Slim was to allow testers to implement tests in their language of choice, with the communication between Fitnesse and the tests taking place over a socket. This allows any application that implements the correct protocol to run tests written in Fitnesse tables.
Several applications were created to support the protocol in various languages, including RubySlim for Ruby and WaferSlim for Python. I had occasion to write a set of integration tests recently, so I thought it would be simple to set up a Slim server to get the tests running. Turned out, I was wrong.
For my tests, I wanted to set up the simplest thing that could possibly work. So what is the minimum requirement for a Slim test page? If you look at the RubySlim documentation you end up with a page that looks something like this:
!define TEST_SYSTEM {slim}
!define TEST_RUNNER {rubyslim}
!define COMMAND_PATTERN {rubyslim}
!path your/ruby/fixtures
!|import|
|<ruby module of fixtures>|
|SomeDecisionTable|
|input|get output?|
|1 |2 |
What all these things do is not clearly explained. Click the test button; you get a bunch of cryptic error messages. So what do they mean?
Well, the first line is pretty clear: if you want to use Slim, you need to set the test system. For the rest of it, we pretty much need to go directly to the source code. It turns out that the heart of Slim is in the Java class ProcessBuilder. In the CommandRunner class of Fitnesse we find:
ProcessBuilder processBuilder = new ProcessBuilder(command);
The command argument is no more than the COMMAND_PATTERN variable defined on the page, with a single argument appended, a port number. So if you wanted to use WaferSlim (a Python Slim server), you might say:
!define COMMAND_PATTERN {python /home/benfulton/slim-init.py}
(Since we need to understand how Slim works, let's get the WaferSlim source from Github rather than installing it via Pip. The slim-init source is from this gist.)
So, after downloading WaferSlim and the code from the gist, we should be able to get the Simplest Possible Page to work. Here it is:
!define TEST_SYSTEM {slim}
!define COMMAND_PATTERN {python /home/benfulton/slim-init.py }
!|import |
|RubySlim|
|app|
Click the test button, and you should get a green test.
If you don't, you may have a Slim versioning error. Go to protocol.py and change the version number from 0.1 to 0.3. Or Python may not be in your path, or the init-script may not be found. Or - here's a tricky one - you put a space before the python command. If you did that, the first command that gets passed to the ProcessBuilder class will be an empty string, and it won't run. Still, this will run green without too much effort.
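The leading-space problem is easy to reproduce outside Fitnesse. The sketch below assumes the command pattern is split on single spaces before the port number is appended; that is my inference from the symptom, not something I confirmed in the Fitnesse source, and the port 8085 is just an example value.

```python
def build_command(command_pattern: str, port: int):
    """Split on single spaces, then append the port as the last argument."""
    return command_pattern.split(" ") + [str(port)]

print(build_command("python /home/benfulton/slim-init.py", 8085))
# ['python', '/home/benfulton/slim-init.py', '8085']
print(build_command(" python /home/benfulton/slim-init.py", 8085))
# ['', 'python', '/home/benfulton/slim-init.py', '8085']
```

With the leading space, the program name handed to the process launcher is the empty string, so nothing runs.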
But wait! What's the use of this import statement that went green before we wrote anything?
Sunday, June 02, 2013
Suffix Arrays
In our quest to win a million dollars, we're trying to Index the Human Genome. A typical index, like in the back of a book, is composed of an entry, say "Sheep", and a list of page numbers, say "88,121,265". But when we went to try this idea on the 3,000,000,000 characters in the genome, we found that it took up significantly more space than the genome itself, which was unworkable. Surely we can find a more efficient solution to the problem. (Or, we could throw up our hands, call it "Big Data", and put a supercomputer to work on it. Let's not do that.)
Actually, there's a fair amount of redundancy in an index. If there are three instances of the word "Sheep" in our book, and we add a fourth instance of it to the index, we've increased the number of sheep, so to speak, by 33%. If we do that for every word in the book we'll add a significant chunk to the overall size.
But suppose, rather than just give the user the page numbers where the word "Sheep" is, we provided them with the line and word numbers as well. "Sheep" is on page 88, line 14, word six. Now, the index is in alphabetical order of course, so what we can do is simply eliminate all those redundant words from the index. So the index entry for "Sheep" would just say "88,14,6". Say readers want to find the word "Streams". They would go and find an entry "88,14,6" and look up the word in the book. Finding that it's "Sheep", they realize the word is later in the index, since "Streams" is alphabetically after "Sheep". They go to the next entry in the index, maybe "81,19,11", and look up that word in the book. It's "Streams", so they've found their word, and the index didn't require any of those annoying, redundant words in it!
OK, not a very simple operation for a human reader. But easy enough for a computer. And since our genome doesn't have pages or lines, we could simply record the location of each individual 20-mer and put the locations in alphabetical order. We can even take it one step further: since 20 is just an arbitrary length that we chose, we'll remove it from the solution and just say that we'll take as many characters as we need to get a unique ordering of all of the strings. Notice that you might need to take all of the remaining characters of the genome to alphabetize it correctly, and if you do, you have a suffix of the genome. If you don't need all the characters, it doesn't matter if you add them or not, so you might as well, and therefore we have an array of the suffixes of the genome. A suffix array.
I won't go into the details of how to create such an index right now, but it can be done in relatively few lines of code. (One easy-to-use library is called SAIS.) Now, the simplest way to write out a suffix array is a text file of the alphabetized suffix indices, in ASCII format, one number per line. This, unfortunately, brings us straight back to the size problem - let's say eight bytes on average, per line, with one line per character in the genome, and we end up with 24 gigabytes worth of index. But at least it's a workable index. If we split the index into several files to keep it manageable, it even suggests a refinement of our attack on the overall problem. We'll see how another time.
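For a feel of how little code the idea needs, here is a naive suffix array in Python with a binary search on top. It is a sketch for short strings only: sorting whole suffixes like this is far too slow and memory-hungry for a genome, which is what libraries like SAIS are for. The function names are mine.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in alphabetical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Binary-search the suffix array for one occurrence of pattern, or -1."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only as many characters of the suffix as the pattern has.
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text.startswith(pattern, sa[lo]):
        return sa[lo]
    return -1

genome = "AGACTTGCTG"
sa = suffix_array(genome)
print(sa)                          # [2, 0, 7, 3, 9, 1, 6, 8, 5, 4]
print(find(genome, sa, "CTTGC"))   # 3
```

The lookup is exactly the reader's procedure from the book example: jump to an entry, look at the text it points to, and decide whether to go earlier or later in the index.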
(Update: I wrote an article on suffix array algorithms.)
Part I: A Million Dollars Up For Grabs
Part II: Analyzing DNA with BWA
Part III: Analyzing DNA Programmatically
Part IV: Indexing the Human Genome
Part V: Suffix Arrays
Sunday, May 19, 2013
Indexing the human genome
Last time I had decided that to efficiently analyze the reads, we had to make an index of the human genome. So how do we go about that?
What we need is an efficient way to access any given substring in the genome. It's not quite the same as indexing a book; rather than determining the locations of "dynamo", "father", and "pseudopodia" in the book, we need to be able to find the location of EVERY substring. It's as if, in our book, we had to find instances of "dynamo", "ynamo", "namo" and so on. Not only that, but if the book had the sentence, "A dynamo has unlimited duration." we have to find instances of "namo h", "namo ha", "namo has" and so on.
So we can't just split the genome by word boundaries like we would for a book. Can we split it into even-sized chunks and index those? For example, could we choose a chunk size of five and split every ten characters into two index entries?
It won't work. For example, if the genome was "AGACTTGCTG", we might choose to index it in chunks of five characters (each chunk is called a 5-mer). This would give us two strings, "AGACT" and "TGCTG", which is fine, but if we come along later and try to search for "CTTGC", we're out of luck - that's not in our index. But it is in the string.
So we have to choose an index size and go through the genome character by character. In our ten-character genome, indexing by 5-mer, we get the strings:
AGACT
GACTT
ACTTG
CTTGC
TTGCT
TGCTG
(and a few shorter strings at the end, if we so desire.)
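The difference between the two approaches is easy to see in a few lines of Python. This is just a sketch; the helper names chunks and kmers are invented for the example.

```python
def chunks(sequence, k):
    """Split the sequence into non-overlapping pieces of length k."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


def kmers(sequence, k):
    """Return every length-k substring, advancing one character at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


genome = "AGACTTGCTG"
print(chunks(genome, 5))             # ['AGACT', 'TGCTG']
print(kmers(genome, 5))              # the six 5-mers listed above
print("CTTGC" in chunks(genome, 5))  # False - the chunked index misses it
print("CTTGC" in kmers(genome, 5))   # True
```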
To be useful, we'll have to store our index as a dictionary of 5-mers to an array of integers, representing the locations in the genome where that string was found. For our sample, we have 30 characters worth of 5-mers, and just six integers to save, for a total of 54 bytes. What happens when we index the whole genome?
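Here is roughly what that dictionary might look like in Python, along with the arithmetic behind the 54 bytes. It's a sketch, not a storage format: build_kmer_index is a made-up name, and the size estimate assumes one byte per character and four bytes per integer, ignoring whatever overhead a real dictionary adds.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for position in range(len(sequence) - k + 1):
        index[sequence[position:position + k]].append(position)
    return dict(index)


genome = "AGACTTGCTG"
index = build_kmer_index(genome, 5)
print(index["CTTGC"])  # [3]

# Rough size: one byte per key character, four bytes per stored position
# (both are assumptions for the estimate, not measured sizes).
key_bytes = sum(len(kmer) for kmer in index)
value_bytes = 4 * sum(len(positions) for positions in index.values())
print(key_bytes + value_bytes)  # 54
```

In this tiny sample every 5-mer shows up exactly once. With a repetitive string such as "ACACAC" and k of 2, the lists start to grow: "AC" maps to [0, 2, 4] and "CA" to [1, 3].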
It's a fair assumption that every possible string will be in the genome, so we'll have 4^5 or 1024 entries in our index. The number of values, though...we have to have an index value for every single character in the genome. If our genome was the book Moby Dick, we'd have around a million characters to index. Each index entry would have, on average, about 1000 items in it. If we're really hoping to match all of our reads to each item in the index, we're going to go through 1000 entries for each read, which might be a bit slow. Unfortunately, our genome isn't Moby Dick. It's roughly the size of 6,000 Moby Dicks. Each 5-mer is going to be found in just about six million locations. We can't possibly read all those at the speed we require.
Okay, so if we have too many items per index entry, we just have to make our index larger. What if we do 10-mers, or even 20-mers? Well, 10-mers gives us 4^10 entries - call it a million. That means each 10-mer will be found in about 6,000 locations. Kind of a lot, but maybe doable. If we do 20-mers - well, 4^20 is 1,099,511,627,776. This is more on the lines of what we need in an index - it's well beyond the number of characters in the genome, so each 20-mer shouldn't show up in more than one or two locations. There's just one small problem: We now have six billion entries with twenty-character identifiers, and the space our index needs is now up to 120 gigabytes!
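The back-of-the-envelope numbers are easy to reproduce. The sketch below assumes the same round figures used above: a genome of six billion characters, four possible letters, and one byte per character of identifier. The function name kmer_index_stats is invented for the example, and the average assumes k-mers are spread evenly, which real DNA certainly isn't.

```python
GENOME_SIZE = 6_000_000_000  # round figure: about 6,000 Moby Dicks


def kmer_index_stats(k, genome_size=GENOME_SIZE):
    """Return (possible k-mers, average locations per k-mer, bytes of identifiers)."""
    possible_kmers = 4 ** k                       # four letters: A, C, G, T
    average_locations = genome_size / possible_kmers
    identifier_bytes = genome_size * k            # one k-character identifier per position
    return possible_kmers, average_locations, identifier_bytes


for k in (5, 10, 20):
    possible, average, identifier_bytes = kmer_index_stats(k)
    print(f"{k}-mers: {possible:,} possible, about {average:,.2f} locations each, "
          f"{identifier_bytes / 1e9:,.0f} GB of identifiers if stored per position")
```

For 20-mers the average drops well below one location per possible string, but the identifiers alone come to 120 gigabytes, which is the problem in a nutshell.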
Maybe we could tweak and tune and find a sweet spot, but instead we'll try a different approach entirely to indexing. Next time.
Part I: A Million Dollars Up For Grabs
Part II: Analyzing DNA with BWA
Part III: Analyzing DNA Programmatically
Part IV: Indexing the Human Genome


