Ramblings of a software developer with a degree in bioinformatics. Agile development mixed with DNA sequencing - what could go wrong?
Sunday, May 19, 2013
Indexing the human genome
What we need is an efficient way to access any given substring in the genome. It's not quite the same as indexing a book; rather than determining the locations of "dynamo", "father", and "pseudopodia" in the book, we need to be able to find the location of EVERY substring. It's as if, in our book, we had to find instances of "dynamo", "ynamo", "namo" and so on. Not only that, but if the book had the sentence, "A dynamo has unlimited duration." we have to find instances of "namo h", "namo ha", "namo has" and so on.
So we can't just split the genome by word boundaries like we would for a book. Can we split it into even-sized chunks and index those? For example, could we choose a chunk size of five and split every ten characters into two index entries?
It won't work. For example, if the genome was "AGACTTGCTG", we might choose to index every five characters (called a 5-mer). This would give us two strings, "AGACT" and "TGCTG", which is fine, but if we come along later and try to search for "CTTGC", we're out of luck - that's not in our index. But it is in the string.
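Here's a quick sketch of that failure in Python, using the ten-character genome from the example. The variable names are just for illustration.

```python
genome = "AGACTTGCTG"
k = 5

# Split the genome into non-overlapping chunks of k characters.
chunks = [genome[i:i + k] for i in range(0, len(genome), k)]

print(chunks)               # the two 5-mers from the text
print("CTTGC" in chunks)    # is the search string one of the chunks?
print("CTTGC" in genome)    # is it in the genome at all?
```

The search string straddles the boundary between the two chunks, so the chunk list never sees it even though the genome contains it.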
So we have to choose an index size and go through the genome character by character. In our ten-character genome, indexing by 5-mer, we get the strings:
AGACT
GACTT
ACTTG
CTTGC
TTGCT
TGCTG
(and a few shorter strings at the end, if we so desire.)
To be useful, we'll have to store our index as a dictionary of 5-mers to an array of integers, representing the locations in the genome where that string was found. For our sample, we have 30 characters worth of 5-mers, and just six integers to save, for a total of 54 bytes. What happens when we index the whole genome?
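The dictionary described above might look something like this in Python. This is a minimal sketch, not a production indexer: the function name build_kmer_index is made up for this example, and it skips the shorter strings at the end of the genome.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    # Slide a window of width k along the genome one character at a time.
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


index = build_kmer_index("AGACTTGCTG", 5)
for kmer, positions in index.items():
    print(kmer, positions)

# Looking up a read is now a dictionary access instead of a scan.
print(index.get("CTTGC", []))
```

Positions here are zero-based, so the first 5-mer starts at position 0.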
It's a fair assumption that every possible string will be in the genome, so we'll have 4^5, or 1024, entries in our index. The number of values, though...we have to have an index value for every single character in the genome. If our genome was the book Moby Dick, we'd have around a million characters to index. Each index value would have, on average, about 1000 items in it. If we're really hoping to match all of our reads to each item in the index, we're going to go through 1000 entries for each read, which might be a bit slow. Unfortunately, our genome isn't Moby Dick. It's roughly the size of 6,000 Moby Dicks. Each index is going to be in just about six million locations. It can't possibly read all those at the speed we require.
Okay, so if we have too many items per index value, we just have to make our index larger. What if we do 10-mers, or even 20-mers? Well, 10-mers give us 4^10 entries - call it a million. That means each index will be found in about 6,000 locations. Kind of a lot, but maybe doable. If we do 20-mers - well, 4^20 is 1,099,511,627,776. This is more on the lines of what we need in an index - it's well beyond the number of characters in the genome, so each index shouldn't show up in more than one or two locations. There's just one small problem: We now have six billion entries with twenty-character identifiers, and the space our index needs is now up to 120 gigabytes!
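The back-of-the-envelope numbers above are easy to reproduce. This sketch assumes a six-billion-character genome and one byte per character, as the estimates in this post do, and it counts only the space for the k-mer keys, not the integer positions. The function name index_estimate is invented for the example.

```python
GENOME_LENGTH = 6_000_000_000  # the rough genome size used in this post


def index_estimate(k, genome_length=GENOME_LENGTH):
    """Return (possible k-mers, average positions per k-mer, bytes of keys)."""
    possible_kmers = 4 ** k
    positions_per_kmer = genome_length / possible_kmers
    # There can be at most one distinct key per genome position,
    # and each key is k characters long (one byte per character).
    key_bytes = min(possible_kmers, genome_length) * k
    return possible_kmers, positions_per_kmer, key_bytes


for k in (5, 10, 20):
    possible, per_kmer, key_bytes = index_estimate(k)
    print(f"k={k}: {possible:,} possible k-mers, "
          f"~{per_kmer:,.1f} positions each, "
          f"{key_bytes / 1e9:,.1f} GB of keys")
```

For k=20 the key space alone comes out to the 120 gigabytes mentioned above, before a single position has been stored.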
Maybe we could tweak and tune and find a sweet spot, but instead we'll try a different approach entirely to indexing. Next time.
Part I: A Million Dollars Up For Grabs
Part II: Analyzing DNA with BWA
Part III: Analyzing DNA Programmatically
Part IV: Indexing the Human Genome
Monday, May 13, 2013
Analyzing DNA programmatically
Let's review the problem first. You have a machine that has analyzed some DNA, and it's given you some sequences it's found. Roughly 300,000 of them. Some with maybe only five characters, some with a few hundred. We call them "reads".
Recall that a DNA sequence, from a programming standpoint, consists of any number of any of four characters: A,C,G, and T. A human genome, in the end, is just a DNA sequence. So if the human sequence was "AAGGTTTCC" and your sequence is "AGGT", bing! You've got a match.
Now, here's the catch: a read, instead of being four characters, is reasonably a minimum of fifty characters. The human genome, instead of being nine characters, is roughly six billion characters. 6,000,000,000 characters. If your first thought is to fire up a text editor and do a search, you had better make sure that (a) the editor is capable of handling a six gigabyte file, and (b) that your computer has six gigs available to throw around (In 2013, a high-end laptop will probably have eight gigs, so just barely enough space to hold the genome and run a couple of other programs.)
Another, minor issue is that most genome files are formatted with sixty-character lines. If your sequence is split across two lines, a simple text search won't find it. You can preprocess the file to remove all the carriage returns, but if you do, your text editor had better be able to handle one line with 6 billion characters on it!
Your next idea: grep. Will grep find a 50-character match in a genome? I couldn't find a good way to do it reliably, since, again, genome files tend to be formatted into 60-character lines, and grep isn't really capable of finding a string that breaks across a line. Moreover, grep is a line-based tool, so odds are if you try preprocessing the line, grep will think (rightly) that you have a file with one six-billion character line in it, try to read in the line, and run out of memory.
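For what it's worth, the line-break problem on its own can be handled without loading the whole file, by carrying the tail of each line over to the next one. Here's a rough sketch; the function name find_in_wrapped_lines is made up, it assumes the file holds a single sequence, and it assumes header lines start with ">" (which is an assumption about the file format, not something established above).

```python
def find_in_wrapped_lines(lines, pattern):
    """Return 0-based offsets of pattern in the sequence formed by joining lines."""
    hits = []
    tail = ""    # the last len(pattern) - 1 characters already seen
    offset = 0   # position in the full sequence of the first character of the window
    keep = len(pattern) - 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and (assumed) header lines
        window = tail + line
        start = window.find(pattern)
        while start != -1:
            hits.append(offset + start)
            start = window.find(pattern, start + 1)
        # Keep just enough characters to catch a match that spans the next break.
        new_tail = window[-keep:] if keep else ""
        offset += len(window) - len(new_tail)
        tail = new_tail
    return hits


# The nine-character "genome" from earlier, wrapped at three characters per line.
wrapped = ["AAG\n", "GTT\n", "TCC\n"]
print(find_in_wrapped_lines(wrapped, "AGGT"))
```

Passing an open file object as lines reads one line at a time, so memory stays small. It does nothing for speed, though: every search still walks the whole genome.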
Here's the other catch: Suppose you could get grep to work, and it managed to search the whole genome in three seconds. But we have 300,000 reads to get through! That's about 250 hours worth of search time and we're under a three-hour deadline. So we can't really afford to search the entire genome for each read.
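The arithmetic, for anyone who wants to play with the numbers (the function names are just for this sketch):

```python
def total_hours(reads, seconds_per_read):
    """Total runtime in hours if every read takes the same time to search."""
    return reads * seconds_per_read / 3600


def budget_per_read(reads, deadline_hours):
    """Seconds available per read if the whole job must fit in the deadline."""
    return deadline_hours * 3600 / reads


print(total_hours(300_000, 3))       # hours at three seconds per read
print(budget_per_read(300_000, 3))   # seconds per read allowed by a three-hour deadline
```

Turned around, a three-hour deadline leaves only a few hundredths of a second per read.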
So we'll write a script. A good processing plan might be this: index the genome, index the reads, then march down the two together to find our list of best prefixes. Sounds great!
But it turns out that even indexing a file the size of the human genome is a pain. I'll think about why next time.
Also see: A Million Dollars Up For Grabs
Saturday, March 23, 2013
Analyzing DNA with BWA
So the goal is to align each sequence in the example file to the human genome. BWA, for speed, asks that you index the target genome - the human genome, in this case - which takes a fair amount of time. Then, you pass it each sequence on the command line, with a command something like:
>bwa align AAGCTCTA human_genome
and it does its magic and returns the location in the genome of the best several matches.
So I worked up a Python script to analyze the input file and pass each sequence to BWA in that format, and let it run for a few hours. (I'm never sure how many because I never remember to change the machine's power settings to not switch itself off after a few minutes of no UI activity). But it eventually completed. I went to look at the results, and I found them very intriguing. Although, according to the challenge, no less than 90% of the DNA was human, the BWA program only managed to match about 50%, about 150,000 reads.
I took a closer look at the BWA manual, and found this option:
-o INT | Maximum number of gap opens [1] |
If I understand this correctly, it means that if the alignment has more than one gap in it, BWA will discard it as not being a match. You can change the value of this parameter, but when I did, it seemed to slow down the analysis quite a bit. I didn't let it complete, but based on the portions I did run I suspect it would have gone over the three-hour limitation - at least on my workstation, which I'm sure is underpowered compared to the target hardware.
So I started to think about what sort of coding would need to be done to meet this challenge. Next time I'll think about Analyzing DNA Programmatically.
Friday, March 22, 2013
A million dollars up for grabs
They have some examples available. The input is in a large XML file that contains the DNA reads and some information on the quality of the reads, and they provide you with the output file they would be looking for. One example has a little more than 300,000 reads of between 50 and 200 nucleotides, putatively taken from a human. According to the output, at least 90% of the reads are from human DNA.
So how do we duplicate this output file? There's a program called BLAST available on the web for bioinformatic analysis - you give it a sequence of DNA and it almost immediately comes back with the closest matches across their entire, huge DNA database.
So, we might be able to slam that database with the reads and get back the results. There's just one problem - notice I said 300,000 reads? Suppose we could get back each one in one tenth of a second. That makes 30,000 seconds, or a total of eight hours and 20 minutes of runtime. Sadly, the million dollars probably won't be given away unless you can get the reads done in under three hours. Oh, and did I mention the application won't actually have internet access?
So the BLAST site is out, which is unfortunate, because it really does an amazing job at matching sequences. What do we do instead?
Due to the nature of science in the United States - "publish or perish" - there are a whole lot of little bioinformatics applications around. Mostly, people will write one, publish a paper about it, and then forget about it. There's no point in maintaining it or going back and improving it since there's no chance of writing another paper about it unless you change the algorithms significantly.
Still, a few of these applications manage to have some shelf life. I'll look at Analyzing DNA with BWA next.
Also see:
Analyzing DNA Programmatically
Thursday, March 21, 2013
RNA Polymerase III and the RIG-I pathway
Type-I interferons (IFNs) are important for antiviral and autoimmune responses. They interfere with viruses as the viruses try to borrow the cell's replication mechanism to reproduce themselves.
The cell will produce interferons due to a couple of proteins: the retinoic acid induced gene I (RIG-I) and mitochondrial antiviral signaling (MAVS) proteins.
These, in turn, start the production process when cytosolic double-stranded RNA or single-stranded RNA containing 5′-triphosphate (5′-ppp) are nearby.
Here's a surprising thing: Cytosolic B-form double-stranded DNA can also induce IFN-β. For example, a DNA sequence of repeating AT, known as poly(dA-dT), can induce it. But no one knew how, until a paper came out in 2009 by Yu-Hsin Chiu and a couple of other people. It turned out that inside the cell, the poly(dA-dT) was actually being converted into 5′-ppp RNA.
But how? It turns out that an enzyme uses the poly(dA-dT) as a template to synthesize 5′-ppp RNA. The enzyme is DNA-dependent RNA polymerase III (Pol-III). This was interesting because it was known that the Pol-III had a role in the nucleus of the cell, but not that it had to do with the immune system.
If you inhibit the working of Pol-III in a cell, and then introduce a bacterium like Legionella pneumophila, the bacterium grows in the cell. The implication is that Pol-III senses the DNA of the bacterium and triggers the IFN process.
How did they do it?
In a cell, they attached a luciferase reporter to the IFN-β promoter, so if the cell creates IFN-β, it would bioluminesce. Then, they put different things in the cell. Of all the things tested, only poly(dA-dT) activated the IRF3.
To ensure that there wasn’t something going on at another step in the path, some other things were tried: A silencing RNA strand was introduced into the cell that would stop the production of RIG-I and MAVS. No IFN-β was produced. DNASE-I is an enzyme that breaks down DNA. When that was introduced, no IFN-β was produced. On the other hand, IFN-β was produced in the presence of RNASE-I, so breaking down RNA had no effect.
Nucleic acids from the poly(dA-dT) cells were able to induce IFN-β, even in the presence of DNase I, so it wasn’t DNA that was causing it. Production stopped in the presence of RNase I though, so it must have been RNA that was being produced.
Similar tests were done to determine the exact length of the poly. As few as 30 base pairs were able to trigger the IFN. But longer sequences containing G’s and C’s failed to trigger anything.
RNA Characteristics
When the SAP was used to remove the phosphate, the RNA no longer induced IFN-β (the PNK had no effect). Even when the PNK was used to add back the phosphate that was removed, there was still no induction, implying that a single phosphate was inadequate. Similarly, treating the RNA with Ter Ex also made no difference.
Another pair of RNase enzymes break apart specifically single stranded RNA (ssRNA) or double stranded RNA (dsRNA). RNase III breaks apart dsRNA, while RNase T1 breaks apart ssRNA. RNase III turned out to inhibit the IFN-β, indicating that dsRNA was required.
Conclusion
Friday, December 21, 2012
Unit Test Coverage in Biopython
This fall, for the first time, I've had the professional opportunity to work on some Python code. I find the language quite elegant and simple to write, much more so than Ruby, although I couldn't really say why. Along with my work on a Master's degree in bioinformatics, I decided to look at the Biopython project and see about adding some code coverage statistics to it.
Biopython is an open-source software project that provides Python libraries for a variety of bioinformatics purposes. It includes, among other features, a variety of file parsers, alignment algorithms, and interfaces with various common bioinformatics tools and databases. To assure quality, a battery of automated tests is run against the Biopython source code before every release. While Biopython has a substantial amount of automated tests, no statistics are gathered concerning the code coverage of the tests.
It took several steps. The first was simply to run the existing test suite; I downloaded the codebase to my Windows machine, and tried to build it, but it turned out to be really difficult. Biopython has a dependency on NumPy as well as several other packages. I shaved several yaks attempting to get it running, but gave up eventually. Instead, I created a virtual Linux machine and installed it on that.
This went much more smoothly. (Note that I don't find Linux to have any natural advantage here. People who play with the Biopython source code run it on Linux, so that install gets a lot more attention.) I was able to run the test suite after a fashion. Not all of the tests ran the first time, which is a bit scary - it's hard to modify code with which you're not familiar without passing tests. But I realized after a while that the suite does some things like connect to remote services to verify that the connections work, and those services were down. There's an option to run only tests that don't attempt connections, and that worked - again, after a fashion. There are so many external dependencies that the test suite is set up to skip tests that require packages that aren't installed. But as all I was looking for was a passing test suite, I was OK with that.
The next step was a coverage tool. I chose Ned Batchelder's Coverage which worked nicely. All I had to do was replace the line "python run_tests.py --offline" with "coverage run run_tests.py --offline" and the coverage ran. I found the architecture of the tool a little strange; you don't specify any kind of output format. Instead, the tool creates its own binary data file (prefixed with a '.' to hide it on Linux systems) and you run a separate command to generate output in various formats. But this worked well and I had my report, in a nice HTML format so I could look at it in a browser.
Coverage is nice, but I don't know that a single report is very useful. Comparing coverage between builds is where the true benefit lies. Are some tests no longer covering code that they were meant to? Were tests disabled for some reason? Why did the coverage percentage drop? Ideally, each time the code is built, a new coverage report would be generated. So, I went on to look at the Biopython build process.
Biopython uses Travis for its builds. Travis is a rather nice public continuous integration server that integrates with Github, where the Biopython source code lives. You provide a well-named build script in your source tree and Travis monitors your repository for changes. It was easy enough to incorporate the coverage tool, but Travis didn't provide a good mechanism for reporting. The simplest thing, I decided, was to use a script to automatically upload a file back to Github. I jumped through several more hoops to get this working - authentication, permissions - and today I find, first, that Github is deprecating file uploads and, second, that Travis is supporting a new artifact system. So I think most of that work is out the window.
Bother.
So I'll have to rework some of that. Nonetheless, the coverage effort gave me some interesting results. Biopython consists of 298 source files comprising 39,805 statements. Automated unit tests covered all but 12,499 of those statements, for a coverage percentage of 69%. 33 files (11%) had no coverage at all, while 60 files (20%) were fully covered. Based on these numbers, Biopython has at this moment an almost acceptable level of coverage, as it is a truism in the software development world that coverage rates above 70% tend to lead to diminishing returns of value.
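The percentages follow directly from the raw counts in the report. A small sketch to check them (the helper name percent is made up for this example):

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


total_statements = 39_805
missed_statements = 12_499
total_files = 298

print(percent(total_statements - missed_statements, total_statements))  # statement coverage
print(percent(33, total_files))   # files with no coverage
print(percent(60, total_files))   # files fully covered
```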
However, as a code library, Biopython is free from the problems of automated user interface testing, which is one of the more difficult areas of automated testing. For this reason, one might expect a somewhat higher coverage rate. There is reason to suspect, however, that as time goes by, the percentage will go higher. An examination of some of the source files with zero coverage reveals deprecation warnings, i.e. indications to the user that the module in question should not be used for new development. At some point it is to be assumed that these files will no longer be part of Biopython, which will drive the code coverage percentage upwards.
The introduction of a code coverage measuring tool to the Biopython build process is an important step, but only a first one. The code coverage tool selected is particular to Python, while a certain amount of the Biopython source code is written in C, a much more difficult language to instrument even for automated unit testing, much less code coverage reporting on that testing. However, useful work could be done in attempting to add this coverage. The techniques used here only monitor statement counts – perhaps branching counts would be a valuable addition. Biopython also has many dependencies on third-party products, some of which are installed during unit testing and some are not. Tests around these integration packages might prove useful. Finally, some work could be done on analyzing differences in builds, sending out alerts to interested parties if a given build should have a sudden drop in the code coverage percentage, for example.
Code coverage is a useful tool for long-running projects. Hopefully some of this effort will make it into the main Biopython codebase!
Thursday, September 13, 2012
Denisovan Gene Sequencing
This evidence comes from the Denisova Cave in Siberia. Scientists found the finger, toe, and tooth, which came from three different individuals, in different levels of the cave, and after doing a DNA analysis determined that their last common ancestor with humans lived about a million years ago. (The fossils dated from about 50,000 years ago). Of course, with such a minimal amount of material available, it's a bit tricky to get a complete gene sequence. The fact that the cave is in Siberia helps some (the average temperature in the cave is right around freezing) but some nice work on sequencing from a group led by Matthias Meyer helped as well.
Here's what they did to the source material:
DNA is dephosphorylated, heat denatured, and ligated to a biotinylated adaptor oligonucleotide, which allows its immobilization on streptavidin-coated beads.
I'm sure you're kicking yourself for not thinking of it first. At any rate, the immobilization of the DNA on the beads seems to be the important part, as it allows for copying of the sequence thus creating extra source material to work with. We now know more about Denisovan gene sequences than we do about Neanderthal sequences, as the quality of these Denisovan genes is better, less contaminated, than anything we have from the Neanderthals. Pretty cool stuff! Here's an article from Ars Technica if you don't feel like wading through the original paper.
Monday, September 03, 2012
Coding errors in DNA analysis software
The problem:
A method for analysing similarities in protein sequences is to use a substitution scoring matrix. The matrix assigns a specific score to each possible pairing of amino acids, so you can compare two sequences by looking at each aligned pair of amino acids, looking up the compatibility score for that pair in the matrix, and adding up (or otherwise aggregating) the total score. The higher the score, the more likely it is, presumably, that the two sequences are actually related.
The program:
So why did it take so long to find?
But there's no question...
Whatever the reason, it's hard to argue with Styczynski's conclusion: "there is significant room for improvement in our understanding of protein evolution."
Tuesday, August 21, 2012
On the teaching of genetics
The problem is, that strategy doesn't work.
It reminds me of astronomy classes both in high school and college. Now, astronomy is an awesome and fascinating subject. Go pick up any popular science magazine with an article on astronomy and just check out the language that they use: "Quasar". "Black Hole". "Dark Energy". "Strange Planet". It's like the whole subject was created just to appeal to teenagers. There is a podcast dedicated to astronomy called AstronomyCast that goes over a lot of this stuff, and my eleven-year-old son cannot go to sleep at night without listening to at least a few episodes.
But I hope that the interest isn't torn out of him in high school. If his courses are anything like mine, they will discuss: Stonehenge. Galileo. How, if you stay up night after night, you can see the position of the planets change slightly in relation to the stars. How an optical telescope works.
Genetics is just the same. Engage the student's interest by hitting them with the cool stuff first. Don't try to emulate the thinking of Mendel, because the instruments and techniques we use today are so much more powerful than Mendel ever dreamed of, and students know that, and often know what the techniques are. Redfield suggests starting with personal genomics, which seems like a good plan. Students, who know that they have a unique genetic makeup, should be interested in knowing what that makeup is, or at least how to find out. This would lead directly to the ethical questions surrounding that knowledge, and the course is off and running. Redfield is on to something.
Saturday, August 18, 2012
DNA as storage mechanism
The innate four bases (A,G,C,T) of DNA seem to lend themselves to some interesting storage techniques. The authors used simple redundancy for their storage - A and C both represented 0, G and T were 1, which was apparently a departure from earlier attempts which encoded each pair of bits into a single base. This made it easier to construct more robust sequences. I wonder if additional error-handling could have been done by placing checksum bases at intervals along the strand? Two bases would provide a range of 16 possible checksum values which seems it would handle a nice string of bits.
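To make the one-bit-per-base scheme concrete, here's a small Python sketch. The mapping (A or C for 0, G or T for 1) comes from the description above; the rule for choosing between the two candidate bases is purely my own assumption for illustration (pick whichever one doesn't repeat the previous base), and the function names are invented.

```python
ZERO_BASES = "AC"  # either base stands for a 0 bit
ONE_BASES = "GT"   # either base stands for a 1 bit


def encode_bits(bits):
    """Encode a string of '0'/'1' characters as a DNA strand, one base per bit."""
    strand = []
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        choices = ONE_BASES if bit == "1" else ZERO_BASES
        # Assumed rule: use the first candidate unless it repeats the previous base.
        if strand and strand[-1] == choices[0]:
            strand.append(choices[1])
        else:
            strand.append(choices[0])
    return "".join(strand)


def decode_bases(strand):
    """Decode a DNA strand back to bits: A/C become 0, G/T become 1."""
    bits = []
    for base in strand:
        if base in ZERO_BASES:
            bits.append("0")
        elif base in ONE_BASES:
            bits.append("1")
        else:
            raise ValueError(f"not a base: {base!r}")
    return "".join(bits)


strand = encode_bits("00110100")
print(strand)
print(decode_bases(strand))
```

Because two bases map to each bit, the encoder has a free choice at every position, which is what lets it avoid awkward sequences such as long runs of a single base.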
The book that was encoded had 50,000 words and eleven pictures. With an average code space of 40 bits per word, the text should have taken a tiny fraction of the total space, with the images providing the majority. Suppose that all ten bit errors were in one picture? It would be interesting to know how tightly compressed the images were. With a high compression factor, some of the bit errors might be substantial, but small changes to the compression might make a large difference in the visibility of any bit errors.
The authors say that DNA storage is dense, stable, and energy-efficient, but prohibitively expensive and slow to read and write compared to more standard storage. It will be fun to see how this technology evolves!
Tuesday, July 03, 2012
Back to school!
Wednesday, May 14, 2008
DevExpress Appointment Template Exception: The file 'MyControl.ascx.cs' does not exist.
The control allows full customization of the appointment display. On the page that holds the calendar, you define, for example, a "VerticalAppointmentTemplate" item for the daily view, and give it the name of a user control you've defined in order to display an appointment in that particular view. Then the user has the ability to drag the appointment around and do other clever things with it for rescheduling, etc, and the calendar control handles the placement of your user control at the correct time on the calendar in the web page. Pretty nice!
So I set up my controls the way I wanted them, tested to make sure it worked, checked the code in, and sent it to QA to look at. Response: "The page errors out as soon as we navigate to it."
Huh?
Further investigation revealed that an exception was being thrown, with the message "The file 'MyVerticalAppointment.ascx.cs' does not exist". For some reason, it wanted the source code for my user control, and I had no idea why. Like all of our other codebehinds, the code is compiled into an assembly that is published on the web site. No source code is put out there.
If you're an ASP.Net veteran from way back, this is probably throwing up all kinds of red flags for you, but I'm not. Googling for various terms in the exception didn't really turn up much, except that most of the similar solutions seemed to involve converting the file or the application to a web application, something I vaguely remember from around the time we upgraded to VS2005, but never really had to deal with. Besides, I knew that our application was already set up the way we needed it. There was no conversion to be done as far as I could tell.
So after futzing around with it for a while, my coworker pointed out an oddity in the user control. Instead of using a CodeBehind declaration to point to the code, it was using a CodeFile declaration.
That was the problem, of course. It wasn't that I had converted from a Web Application to Web Project or back again, it was simply that I had borrowed a piece of sample code from a DevExpress project that was using a CodeFile declaration, inappropriately for my project. Switched it to CodeBehind, didn't even have to recompile, and everything worked properly.
If it's useful, here's the stack trace of the exception that was thrown:
at System.Web.UI.TemplateParser.ProcessException(Exception ex)
at System.Web.UI.TemplateParser.ParseStringInternal(String text, Encoding fileEncoding)
at System.Web.UI.TemplateParser.ParseString(String text, VirtualPath virtualPath, Encoding fileEncoding)
at System.Web.UI.TemplateParser.ParseFile(String physicalPath, VirtualPath virtualPath)
at System.Web.UI.TemplateParser.ParseInternal()
at System.Web.UI.TemplateParser.Parse()
at System.Web.Compilation.BaseTemplateBuildProvider.get_CodeCompilerType()
at System.Web.Compilation.BuildProvider.GetCompilerTypeFromBuildProvider(BuildProvider buildProvider)
at System.Web.Compilation.BuildProvidersCompiler.ProcessBuildProviders()
at System.Web.Compilation.BuildProvidersCompiler.PerformBuild()
at System.Web.Compilation.BuildManager.CompileWebFile(VirtualPath virtualPath)
at System.Web.Compilation.BuildManager.GetVPathBuildResultInternal(VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile)
at System.Web.Compilation.BuildManager.GetVPathBuildResultWithNoAssert(HttpContext context, VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile)
at System.Web.Compilation.BuildManager.GetVirtualPathObjectFactory(VirtualPath virtualPath, HttpContext context, Boolean allowCrossApp, Boolean noAssert)
at System.Web.Compilation.BuildManager.CreateInstanceFromVirtualPath(VirtualPath virtualPath, Type requiredBaseType, HttpContext context, Boolean allowCrossApp, Boolean noAssert)
at System.Web.UI.PageHandlerFactory.GetHandlerHelper(HttpContext context, String requestType, VirtualPath virtualPath, String physicalPath)
at System.Web.UI.PageHandlerFactory.System.Web.IHttpHandlerFactory2.GetHandler(HttpContext context, String requestType, VirtualPath virtualPath, String physicalPath)
at System.Web.HttpApplication.MapHttpHandler(HttpContext context, String requestType, VirtualPath path, String pathTranslated, Boolean useAppConfig)
at System.Web.HttpApplication.MapHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
Wednesday, April 16, 2008
Dreaming in Code, Scott Rosenberg
I look at this book and remember the many times I was in the same situation as the Chandler folks were in: A product with lots of amorphous design that you can't get coding on because every time a little code gets written, the designers pull back and say, "Whoa, whoa! That's not what we meant at all!" Then you get pulled back into another design meeting and the process starts all over again.
I'm over that now. The company that I work for understands that the important thing is to get something out there that people can get their teeth into, to figure out whether it's any good or not. I think to myself that the bad old days are over...but it scares me to think that a lot of other coders still have to cope with the same sort of situation: the eternal debate between the designers, architects, and coders over whose fault it is that the code is buggy and the users hate the application.
Whether it was his original intention or not, Rosenberg brought back the intense frustration of these times with his description of the flailing of the Chandler product. I suspect a non-coder, and maybe even a lot of coders would look at it differently, thinking that hey, they were trying to do it right for a change, get the design down before they do the coding so the rest of it is just simple plugging and chugging, stuff any code monkey could do. It never works that way, though.
I get the feeling that what Rosenberg was really looking for was a happy ending. You spend lots of money, do the project right, maybe have some interesting pitfalls along the way, then you release the application, everyone loves it, the world changes, and the book ends. It didn't work that way, unfortunately, and a lot of the second half of the book leaves the realm of Chandler to discuss the philosophy of coding, bringing up agile development and the mythical man-month.
But in the end, it's very difficult to separate the software application from the book, and ultimately, since the ending of the one was vague and ambiguous, not with a bang but with a whimper, the ending of the other is too. Still, the book is one-of-a-kind; a detailed, unflinching look at a single software development effort. Every development team should be so lucky as to have a retrospective like this to look back on.
Wednesday, February 27, 2008
Blog anniversary!
Tuesday, February 12, 2008
Startup secrecy
A note I got today warning us of the terrific need for secrecy around the company that was created at the Bloomington Startup Weekend this past weekend. I ended up not being able to participate in any meaningful sense, except maybe for a few hours on Friday night, so I don't know what any of the big secrets are that they need to keep, but one thing I do know is that
there is no business model that is so unique and different that no one has ever thought about it before.
Creating a successful business is about execution, and sweat equity, not about the new and exciting business model. All this sort of insistence on secrecy does is shut down any potential buzz that would be created. I mean, you've got 75 or so people who are probably, or hopefully, really excited about the application they've put together. They should be blogging, twittering, discussing how excited they are about the company. That is a lot of people for a small town like Bloomington - the buzz would probably have a multiplicative effect and people might even start up a buzz about a buzz, so to speak. But they're blowing it by telling everyone that they can't post, can't talk, can't even email.
The PR people and/or the lawyers are probably telling them that they need to present a consistent message, need to prevent any chance of being sued for patent infringement, need to be safe, need to be careful. Sorry, folks, being careful isn't how you create a successful startup. That comes from being bold and taking chances.
I got a separate note telling me I needed to fill out more forms in order to claim the share of the company that I qualified for on Friday night. Meh. I don't think I'll bother.
Wednesday, January 16, 2008
Bloomington Startup Weekend
The week before that we'll have a geek dinner, so I'm guessing the Startup Weekend will be a topic of conversation there too. Hope to see you at El Norteno!
Thursday, December 06, 2007
A Facebook feed for the open web
I do like the Facebook minifeeds, though. A minifeed, if I understand correctly, is an aggregation of all the things that a Facebook user is doing on Facebook - updating status, adding friends, using applications. For each friend, getting updates on what they're doing moment-by-moment on Facebook is interesting, and the Facebook homepage aggregates all my friends' feeds into a single one and sorts it by time. So when I do log on to Facebook, I can see at a glance what all these people are doing, at least in the last few hours.
But there's plenty of stuff on the open web that could go into a minifeed just as easily. A lot of sites are making sure they have Facebook applications now, but not every one, and
who wants to rely on a Facebook app for something that isn't really anything more than an RSS feed?
I ended up creating a web page directly rather than creating a feed - I didn't feel like learning all the ins and outs of RSS or Atom. So, if you want to follow my life, almost minute by minute, check out this page - or just check out my home page, which has a small iframe containing that page, which is how I intended to use the feed anyway. You can't subscribe to my life just yet, but maybe that will be coming soon!
Along with my feeds mentioned above, the page aggregates Twitter posts, and soon I'll add my Flickr pictures and maybe Delicious, Coastr, or Zelky if they have the feeds in the format I need. I'm looking forward to having my own life feed!
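The aggregation step itself is small. Here is a minimal sketch in JavaScript, not the code behind my page: it assumes each source has already been fetched and turned into an array of items, and the `source`, `text`, and `time` fields, the `mergeFeeds` name, and the sample entries are all made up for illustration. It just merges everything and sorts it newest-first, the way the Facebook homepage does.

```javascript
// Merge several already-fetched feeds into one "life feed", newest first.
// The item shape ({ source, text, time }) is an assumption for this sketch.
function mergeFeeds(feeds) {
  return feeds
    .flat() // one new list out of many; the input arrays are left untouched
    .sort((a, b) => b.time - a.time); // Date objects subtract as milliseconds
}

// Made-up sample data standing in for real feeds.
const twitter = [
  { source: "Twitter", text: "Evening post", time: new Date("2007-12-06T18:30:00Z") },
  { source: "Twitter", text: "Morning post", time: new Date("2007-12-06T09:00:00Z") },
];
const blog = [
  { source: "Blog", text: "Midday post", time: new Date("2007-12-06T12:00:00Z") },
];

const lifeFeed = mergeFeeds([twitter, blog]);
for (const item of lifeFeed) {
  console.log(item.time.toISOString(), item.source, item.text);
}
```

Adding Flickr or any other source would then only mean converting its feed into the same item shape and passing one more array in.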
Monday, November 19, 2007
Pair Programming vs. Code Reviews
The comments are already coming in complaining about pairing. I noticed these two particularly:
the obvious conclusion to this is double the hours per project, at minimum (and I'd expect that you would work slower if you had to discuss or explain stuff to someone else the whole day).
I would freak out if someone would watch me every the time I code (and also has a keyboard to interupt me lol)
Sort of the standard responses to pair programming. I'm not so experienced at the art that I can really say the hours don't double; maybe they do - but what I can say is that even if the hours are doubling, the code quality is squared. Maybe it's just a commentary on what lousy code I produce by myself, but there is a big difference when someone else is there looking at the code, even if it's only the "navigator" effect, where the person who isn't actually at the keyboard can allocate the memory space to go back and remember any refactorings or other cleanup that needs to be done. As far as working slower, there are only two possibilities: first, that the other person doesn't know the code as well as you do, in which case the knowledge transfer makes the whole thing worthwhile, or second, that there are a few ways of doing things and you need to decide which way is best. The choice you make when coding by yourself might easily not be the best one.
Insofar as code reviews go, I find them almost unnecessary when pairing. Some teams do peer-review-before-checkin, which I don't really care for - I just can't grok the concept the code is trying to get across just from staring at it for a few seconds while someone explains it to me, but I suppose some people can do that. But we do code reviews for two things: first, to go over legacy code - we have plenty of that in our application - and second, to go over code that's just been checked in. This isn't 100% useful either, but on the other hand we have very few development meetings, and sometimes it's worth it just so someone can point out, "Oh, this should have been done using this brand new language feature" or, "we have a custom library that already handles exactly this case, can we use it here?"
So code reviews can be worthwhile, and they are absolutely necessary in a non-pairing environment. The big thing to watch out for is that you don't spend a lot of time discussing what your internal coding standards are, as I've written about before. But my feeling is that they are not as useful as pair programming.
Sunday, November 11, 2007
iFrame scroll to anchor problem
But it's not like the schedule display needs to be real complicated. I tossed it into an iFrame, stuck the schedule on a separate page, and wrote some simple JavaScript to scroll to a specific game's anchor based on the current date.
But what's this? When the iFrame scrolls, the entire page jumps down to the iFrame to display it. That's not what I wanted, but I couldn't for the life of me figure out a way to stop it from happening, until I finally ran across Jim Epler's blog entry explaining how he simply scrolled the main page back to the top after setting the anchor. So you set the location in the iFrame, the main page jumps down, then you set it back to the top. It's not pretty, but it works. Here's the code in the iFrame:
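// Jump to the anchor inside the iFrame (anchorname includes the leading "#").
location.replace(location.href+anchorname);
// The main page jumps down to the iFrame as a side effect, so put it back at the top.
parent.window.scrollTo(0,0);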
Thanks, Jim!
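For anyone who wants the rest of the script, here is a rough sketch of how the pieces might fit together. It is not the code from my page: the `#game-YYYY-MM-DD` anchor naming, the list of game dates, and both function names are assumptions made up for illustration. Only the replace-then-scroll-back trick comes from the snippet above. The window objects are passed in as parameters so the logic can be tried outside a browser.

```javascript
// Pick the anchor for the first game on or after the given date.
// Dates are "YYYY-MM-DD" strings, which compare correctly as plain text.
// Assumes gameDates is non-empty and sorted in ascending order.
function anchorForDate(gameDates, today) {
  const next = gameDates.find((d) => d >= today);
  // Fall back to the last game once the season is over.
  const chosen = next !== undefined ? next : gameDates[gameDates.length - 1];
  return "#game-" + chosen;
}

// Scroll the iFrame to the anchor, then undo the parent page's jump.
function scrollToGame(frameWindow, parentWindow, anchorname) {
  // Drop any fragment already on the URL so anchors don't pile up.
  const base = frameWindow.location.href.split("#")[0];
  frameWindow.location.replace(base + anchorname);
  parentWindow.scrollTo(0, 0);
}

// Inside the iFrame page this would be called roughly like so
// (toISOString() gives the date in UTC, which is close enough for a sketch):
//   const today = new Date().toISOString().slice(0, 10);
//   scrollToGame(window, parent, anchorForDate(gameDates, today));
```

Stripping the old fragment first is the one change from the two-liner above; without it, calling the function twice would leave two anchors on the end of the URL.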
Thursday, November 08, 2007
Some test code smells
There are a couple of competing dynamics you get when writing tests: the first is that, in general, less code is better than more code. You want the code to express the concepts you need to express without any extra cruft, without huge globs of copy/pasted code here, there, and everywhere that is a nightmare to maintain. But you need tests as well, and tests are either more code, or more people, and people are one heck of a lot more expensive than code. So it's not at all unlikely that you would have as much test code as production code.
But those tests have to be maintained, so it behooves us to figure out the best way to do that, bearing in mind that the goals of test code are not the same as those of production code. So, here are some problems, or code smells, specific to test code that you might run into:
1. Conditional test logic
A lot of people like to say that one assertion per test is plenty. It seems like unreasonable test gold-plating to me, but if you're going so far as to put an if statement in the middle of the test, you need to have multiple tests to test each branch of the statement, each time. Or, the condition might simply be an assertion, where you're saying if (x) keep testing; else assert false. No point in that; just assert x at the beginning and let the code blow up if it needs to. This really helps with the readability of the test report, too, especially if all the report says at the end is "The assertion FALSE occurred". Not helpful, whereas knowing directly from the report that X failed is much more useful.
2. Hardcoded test data
This problem is related to the principle that computer science profs have been tossing around since the beginning of time: that you never want to put numbers or strings in code; always define them somewhere as constants instead, so they can be easily changed. Not a bad principle; certainly it's ideal that anything that needs to be displayed to the user can be easily modified to use another language, so you don't want a bunch of MessageBox.Show( "Are you sure you want to do this?" ) statements scattered through the code. For numbers, though, my general rule is that it doesn't need to be a constant value unless it shows up more than once.
But in a sense, just about every number shows up more than once if written properly: at least once in the code, and at least once in the test for that code. Say for example you're testing a Price object with this line of code:
Assert.Equal( $14, CreatePrice().Retail );
CreatePrice() is part of your fixture, and it sets the list price to $20. Your Price object knocks off 30% to come up with the Retail number.
But now you've got the same number in there twice! See it? $14 is in there, and so is 70% of $20. The same number.
One fix is to move everything to constants. Presumably the Price object has a GetDiscount() method, so you could make the $20 into a constant ListPriceInDollars, and change the expected amount to ListPriceInDollars * (1 - GetDiscount()). Still pretty verbose, but not really bad for this small example. A better solution can be to create an ExpectedObject to compare against what comes back from CreatePrice. In the ideal case, your test would then simplify to
Assert.Equal( ExpectedPrice, CreatePrice() );
Which would cover a boatload of other comparisons as well as your Retail value.