Saturday, March 23, 2013

Analyzing DNA with BWA

Last time I discussed a million-dollar challenge posted by Innocenture. The challenge is to analyze a series of DNA reads and determine the source. My professor suggested using a program named BWA, a Burrows-Wheeler Aligner. Burrows-Wheeler alignment is an algorithm for matching up two sequences which may not be the same length. Perhaps one sequence has two extra nucleotides in the middle but the entire rest of the sequences match. These kinds of sequences are very difficult to find naively, and which match is better can come down to a judgment call.

So the goal is to align each sequence in the example file to the human genome. BWA, for speed, asks that you index the target genome - the human genome, in this case - which takes a fair amount of time. Then, you pass it each sequence on the command line, with a command something like:

>bwa align AAGCTCTA human_genome

and it does its magic and returns the location in the genome of the best several matches.

So I worked up a Python script to analyze the input file and pass each sequence to BWA in that format, and let it run for a few hours. (I'm never sure how many because I never remember to change the machine's power settings to not switch itself off after a few minutes of no UI activity). But it eventually completed. I went to look at the results, and I found them very intriguiing. Although, according to the challenge, no less than 90% of the DNA was human, the BWA program only managed to match about 50%, about 150,000 reads.

I took a closer look at the BWA manual, and found this option:

-o INTMaximum number of gap opens [1]

If I understand this correctly, it means that if the alignment has more than one gap in it, BWA will discard it as not being a match. You can change the value of this parameter, but when I did, it seemed to slow down the analysis quite a bit. I didn't let it complete, but based on the portions I did run I suspect it would have gone over the three-hour limitation - at least on my workstation, which I'm sure is underpowered compared to the target hardware.

So I started to think about what sort of coding would need to be done to meet this challenge. Next time I'll think about Analyzing DNA Programmatically

Friday, March 22, 2013

A million dollars up for grabs

Of course, you have to do a few things to earn it. Innocenture is running a challenge, with a million dollar prize, for the ability to take a DNA read, analyze it, and determine the precise species of each read. Here's the kicker: it has to be done rapidly.

They have some examples available. The input is in a large XML file that contains the DNA reads and some information on the quality of the reads, and they provide you with the output file they would be looking for. One example has a little more that 300,000 reads of between 50 and 200 nucleotides, putatively taken from a human. According to the output, at least 90% of the reads are from human DNA.

So how do we duplicate this output file? There's a program called BLAST available on the web for bioinformatic analysis - you give it a sequence of DNA and it almost immediately comes back with the closest matches across their entire, huge DNA database.

So, we might be able to slam that database with the reads and get back the results. There's just one problem - notice I said 300,000 reads? Suppose we could get back each one in one tenth of a second. That makes 30,000 seconds, or a total of eight hours and 20 minutes of runtime. Sadly, the million dollars probably won't be given away unless you can get the reads done in under three hours. Oh, and did I mention the application won't actually have internet access?

So the BLAST site is out, which is unfortunate, because it really does an amazing job at matching sequences. What do we do instead?

Due to the nature of science in the United States - "publish or perish" - there are a whole lot of little bioinformatics applications around. Mostly, people will write one, publish a paper about it, and then forget about it. There's no point in maintaining it or going back and improving it since there's no chance of writing another paper about it unless you change the algorithms significantly.

Still, a few of these applications manage to have some shelf life. I'll look at Analyzing DNA with BWA next.

Also see:
Analyzing DNA Programmatically

Thursday, March 21, 2013

RNA Polymerase ||| and the RIG-I pathway

A little story about immune responses in cells.

Type-I interferons (IFNs) are important for antiviral and autoimmune responses.  They interfere with viruses as the viruses try to borrow the cell's replication mechanism to reproduce themselves.

The cell will produce interferons due to a couple of proteins: the retinoic acid induced gene I (RIG-I) and mitochondrial antiviral signaling (MAVS) proteins.

These, in turn, start the production process when cytosolic double-stranded RNA or single-stranded RNA containing 5′-triphosphate (5′-ppp) are nearby.

Here's a surprising thing: Cytosolic B-form double-stranded DNA can also induce IFN-β. For example, a DNA sequence of repeating AT can induce it (It’s known as poly(dA-dT). But no one knew how. Until a paper came out in 2009 by Yu-Hsin Chiu and a couple of other people. It turned out that inside the cell, the poly(dA-DT) was actually being converted into 5′-ppp. 

But how? It turns out that an enzyme uses the poly(dA-dT) as a template to synthesizes 5′-ppp RNA. The enzyme is DNA-dependent RNA polymerase III (Pol-III). This was interesting because it was known that the Pol-III had a role in the nucleus of the cell, but not that it had to do with the immune system.

If you inhibit the working of Pol-III in a cell, and then introduce a bacteria like Legionella pneumophil, the bacteria grows in the cell. The implication is that Pol-III senses the DNA of the bacteria and triggers the IFN process.

How did they do it? 

In a cell, they attached a luciferase reporter to the IFN-β promoter, so if the cell creates IFN-β, it would bioluminesce.

Then, they put different things in the cell. Of all the things tested, only poly(dA-dT) activated the IRF3.

To ensure that there wasn’t something going on at another step in the path, some other things were tried: A silencing RNA strand was introduced into the cell that would stop the production of RIG-I and MAVS. No IFN-β was produced. DNASE-I is an enzyme that breaks down DNA. When that was introduced, no IFN-β was produced. On the other hand, IFN-β was produced in the presence of RNASE-I, so breaking down RNA had no effect.

Nucleic acids from the poly(dA-dT) cells were able to induce IFN- β, even in the presence of DNase I, so it wasn’t DNA that was causing it. Production stopped in the presence of RNase I though, so it must have been RNA that was being produced.

Similar tests were done to determine the exact length of the poly. As few as 30 base pairs were able to trigger the IFN. But, longer sequences with G’s and C have failed to trigger anything.

RNA Characteristics

 Two enzymes, polynucleotide kinase (PNK) and shrimp alkaline phosphatase (SAP) are used by chemists: the former adds a phosphate group to a DNA or RNA molecule, the latter removes one. A third enzyme, Terminator Exonuclease, or Ter Ex, breaks apart RNA with exactly one phosphate at the 5’ end.

When the SAP was used to remove the phosphate, the RNA no longer induced IFN- β (the PNK had no effect). Even when the PNK was used to add back the phosphate that was removed, there was still no induction, implying that a single phosphate was inadequate. Similarly, treating the RNA with Ter Ex also made no difference.

Another pair of RNase enzymes break apart specifically single stranded RNA (ssRNA) or double stranded RNA (dsRNA). RNase III breaks apart dsRNA, while RNase T1 breaks apart ssRNA. RNase III turned out to inhibit the IFN- β, indicating that dsRNA was required.

Put all these together and it seems that the trigger is dsRNA with multiple phosphate groups attached.

So the chain takes you from the poly(dA-dT) to a 5′-ppp.


Other tests bring you to the conclusion that Pol-III is the enzyme that triggers this conversion. Thus, Poly-III, in the cytoplasm of a cell, actually acts as a DNA sensor that will trigger an immune response. An entirely different function from the one it has inside the cell nucleus. Quite a surprise!