First: Principles

Tuesday, October 20, 2020

Julia on a multi-user system

Had occasion to install Julia on a multi-user system today. I downloaded the tarball to my own directory and ran Make. The instructions say that the install is fully contained in the single directory, so you don't have to worry about files being installed in different locations on the system. Once it finished, I moved the directory to a globally accessible location and tried it out. It worked mostly, but nothing about the package manager would run properly. Eventually I realized that if the directory that I had initially created existed, the package manager worked, but if I deleted it, the package manager stopped working.

I deleted everything, recreated the directory in its final, globally available location, and ran Make again. Success! Apparently something in that compile process is looking to see what directory it is in and going back to it for data. I'd like to know what that is.

Tuesday, October 13, 2020

WSL permissions bits

Conflicts between Linux permissions and Windows permissions are a perennial problem for people who switch back and forth between WSL and Windows. One thing that helps is, when mounting a drive, to provide the -o metadata parameter to make sure that files have both Windows and Linux permission bits:

$ sudo mount -t drvfs g: /mnt/g -o metadata

Here's some good information about how WSL permissions work:

https://devblogs.microsoft.com/commandline/chmod-chown-wsl-improvements/

Thursday, September 03, 2020

Research Computing and Data Capabilities Model

The capabilities model allows an institution to evaluate how well it supports various data and research requirements.

https://carcc.org/2020/07/09/announcing-the-rcd-cm-2020-community-data-participation-window/

What I found most interesting is what it refers to as the "five facings":

Researcher Facing Roles
Includes research computing and data staffing, outreach, and advanced support, as well as support in the management of the research lifecycle.

Examples: Research IT User Support, Research Facilitators, CI engineers, etc.

Data Facing Roles
Includes data creation; data discovery and collection; data analysis and visualization; research data curation, storage, backup, and transfer; and research data policy compliance.

Examples: Research Data Management specialists, Data Librarians, Data Scientists, etc.

Software Facing Roles
Includes software package management, research software development, research software optimization or troubleshooting, workflow engineering, containers and cloud computing, securing access to software, and software associated with physical specimens.

Examples: Research Software Engineers, Research Computing support, etc.

Systems Facing Roles
Includes infrastructure systems, systems operations, and systems security and compliance.

Examples: HPC systems engineers, Storage Engineers, Network specialists, etc.

Strategy and Policy Facing Roles
Includes institutional alignment, culture for research support, funding, and partnerships and engagement with external communities.

Examples: Research IT leadership

Which, at the risk of sounding like a Myers-Briggs evaluation, seems to sum up nicely the important categories of staff in research computing.

Friday, August 21, 2020

Traversing a graph database with Gremlin

This is an invaluable tutorial on how to use the Gremlin query language to get results from a graph database. For some reason, all the internal links seem to be broken, but it's a ten-part series I think. Lesson six on projection and selection is particularly useful.

https://www.datastax.com/blog/2017/09/gremlin-recipes-1-understanding-gremlin-traversals

Thursday, July 23, 2020

GDB in threaded code

GDB (https://www.gnu.org/software/gdb/) is a handy way to debug command line applications. But when an application is running many threads, GDB doesn't follow a single thread by default, so as you step through the code it jumps between threads and it's easy to lose track of where you are. The solution is the scheduler-locking setting, which lets only the current thread run while you step.

(gdb) set scheduler-locking on

See here for details: https://sourceware.org/gdb/current/onlinedocs/gdb/All_002dStop-Mode.html

Sunday, July 19, 2020

Creating a Hyperbook in Microsoft Word

When someone creates a document they'll possibly set up a table of contents which conveniently links to the chapter headings they've created. They'll very likely provide hyperlinks to their sources or references so it's easy to go out on the web and find the sources. But it's pretty rare to provide handy links inside the document pointing to other places in the document - a hyperbook. Now, there's no reason not to keep providing outside links as well, but a good hyperbook is like a self-contained Wikipedia - lots of good information and lots of links to related subjects of interest directly on the page.

To insert internal links into a document in Microsoft Word, do the following. On the Insert tab, there's a Links panel. Click that, then Link, then Insert Link. The dialog that comes up offers a variety of ways to insert internal links. Very nice for creating hyperbooks.

Thursday, July 02, 2020

Force-closing an SSH connection

On occasion when I'm using SSH to connect to a remote server, I run an application that hangs. If the terminal's running inside a GUI, you can always close down the entire terminal and restart it, but there's an easier way: hit Enter, then type "~." (squiggle dot). That force-closes the SSH session, leaving your terminal intact. I learned this from SuperUser:

https://superuser.com/q/467398

Friday, June 05, 2020

Downloading files from Google Cloud

In doing some testing with GATK 4, I found myself needing to download files from Google Cloud. Google Cloud likes to use URLs that start with "gs:". For example, the URL for some tumor data is

gs://gatk-best-practices/somatic-b37/HCC1143.bam .

You can't just visit that URL in your browser though; or at least I couldn't. I had to install gsutil as described here: https://cloud.google.com/storage/docs/gsutil_install#linux . This is one of those weird installs where they provide a script online that you can run; a bit dangerous, but at least they don't ask for sudo. It downloads about a gazillion files then asks permission to muck with your settings. I said no, of course, and it gave me a couple of files to source if I wanted. One of them had to do with providing autocomplete, but the other one simply added a directory to the path, so I created an environment module to do that work. Now I can download the files I need:

$ gsutil cp gs://gatk-best-practices/somatic-b37/HCC1143.bam .

Friday, May 15, 2020

Github offers successor options

Github is putting some thought into what happens to your repositories if you're "unable" to manage them - a kind way of saying if you die. Nothing I have would have any consequence, but certainly I'm involved with some organizations that would need taking over. Here's how you name a successor for your repositories:

https://github.blog/changelog/2020-05-11-account-successors/

Thursday, April 30, 2020

Covid-19 infections by county, rate of increase

So this image shows rate of increase of the number of cases on a daily basis. It's not as smooth as I would like.

Wednesday, April 29, 2020

Generating sequentially increasing values in C++

Say you need to generate a sequence, in C++, simply consisting of the first so many integers, like 1, 2, 3, 4, 5.

With a little programming experience, you can come up with a dozen different ways to do this, but here's an obscure one:

#include <list>
#include <numeric>  // std::iota
std::list<int> l(5);
std::iota(l.begin(), l.end(), 1);

According to CppReference, the function is named after the integer function ⍳ from the programming language APL.

The more you know.

Saturday, April 11, 2020

Command Line tips from CLI Magic

Tips for working with the command line from CLI magic. I like this one to show all listening TCP/UDP ports for the current user:

$ lsof -Pan -i tcp -i udp

https://www.patreon.com/posts/climagic-003-5-35703693

Wednesday, April 01, 2020

Covid-19 infections by county, over time

This is a choropleth visualization I put together of Covid-19 infections for each county over time. It's based on the NY Times data set and I built the images in Python based off a very good, if ancient, tutorial I found at FlowingData.

Edit: Updated through Apr. 13 data, and also made the original larger. Not sure if that matters for this web page or not.

This image is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. I would appreciate attribution if you care to use it!

Tuesday, March 31, 2020

BeautifulSoup4 in Python

A lot of the code out there that uses BeautifulSoup tells you to import it like so:

from BeautifulSoup import BeautifulSoup

If you've installed BeautifulSoup4, though, this won't work. The main module name has been changed to bs4. So change the code that does that to:

from bs4 import BeautifulSoup

Please make a note of it.
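
If a script has to run on machines where BeautifulSoup4 may or may not be installed, a guarded import keeps the failure readable. This is only a minimal sketch: first_paragraph_text is a name I made up for illustration, and "html.parser" is the parser that ships with Python.

```python
try:
    from bs4 import BeautifulSoup  # module name used by BeautifulSoup4
except ImportError:
    BeautifulSoup = None  # BeautifulSoup4 is not installed


def first_paragraph_text(html):
    """Return the text of the first <p> element, or None if it can't be read."""
    if BeautifulSoup is None:
        return None
    paragraph = BeautifulSoup(html, "html.parser").find("p")
    return paragraph.get_text() if paragraph is not None else None


print(first_paragraph_text("<p>Hello</p>"))
```

When bs4 is missing, the function returns None instead of the script dying on the import line.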

Thursday, February 27, 2020

Pre-defined compiler macros

If you're working in a multiple-compiler environment, you may wish to get information at compile time about which compiler you're building with. Here's a list of macros defined by many compilers:

https://sourceforge.net/p/predef/wiki/Compilers/

Friday, February 14, 2020

Enabling wind power nationwide

A report from the Obama Department of Energy in 2015.

https://www.energy.gov/sites/prod/files/2015/05/f22/Enabling%20Wind%20Power%20Nationwide_18MAY2015_FINAL.pdf

Thursday, February 13, 2020

Take a newick tree and stre-e-etch the leaves to make it ultrametric

I needed to generate some ultrametric Newick trees for some simulations. There's a nice Newick tree generator online at http://trex.uqam.ca/index.php?action=randomtreegenerator&project=trex, but it doesn't have any way to make an ultrametric tree. (Ultrametric means that the lengths from root to tip are all the same.) So I put together a short script using BioPython that lengthens each leaf branch so that every root-to-tip length matches the longest one.

https://gist.github.com/benfulton/a10583c8ae41897489d415c066c7579f
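
The idea can be sketched without BioPython at all. To be clear, this is not the code in the gist: the nested-dictionary tree, the "length" and "children" keys, and both function names are assumptions made up for this illustration.

```python
def tip_depths(node, depth=0.0, out=None):
    """Collect (leaf, root-to-tip distance) pairs for every leaf under node."""
    if out is None:
        out = []
    depth += node.get("length", 0.0)
    children = node.get("children", [])
    if not children:
        out.append((node, depth))
    for child in children:
        tip_depths(child, depth, out)
    return out


def make_ultrametric(root):
    """Lengthen each leaf branch so every root-to-tip distance equals the longest."""
    tips = tip_depths(root)
    longest = max(depth for _, depth in tips)
    for leaf, depth in tips:
        leaf["length"] = leaf.get("length", 0.0) + (longest - depth)
    return root


# The Newick tree ((A:1,B:3):2,C:1); written as nested dictionaries
tree = {"children": [
    {"length": 2.0, "children": [
        {"name": "A", "length": 1.0},
        {"name": "B", "length": 3.0},
    ]},
    {"name": "C", "length": 1.0},
]}
make_ultrametric(tree)
print([(leaf["name"], depth) for leaf, depth in tip_depths(tree)])
```

Only the leaf branches change; internal branch lengths are left alone, so the tree keeps its shape and only the tips get stretched.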

How to delete empty rows in an Excel spreadsheet

How to delete empty rows in an Excel spreadsheet:

https://www.itsupportguides.com/knowledge-base/office-2016/excel-2016-how-to-delete-empty-rows/

Monday, August 10, 2015

How Installing a Random Tool Caused Ruby on Rails to Stop Working

At work I maintain a little Ruby on Rails site. Not a particularly interesting site, just a set of pages of statistics gathered from various databases around the organization. It runs on a RHEL6 virtual machine and I write most of the code on my Windows laptop using RubyMine from JetBrains - a product I wholeheartedly endorse, by the way.

It does take a little patience to run Rails on Windows - not due to any deficit in either product, just because the vast majority of people who work with Rails are running it on some other platform, so when you have problems, a lot of times you're on your own.

Take this problem I ran into recently, for example. I launched RubyMine to check out a bug, fired up the ol' testing server, browsed to the web page, and was greeted with this:

JSON::ParserError in Statistics#index
Showing /app/views/layouts/application.html.erb where line #5 raised:

757: unexpected token at 'Node Commands

Syntax:
node {operator} [options] [arguments]

Parameters:
/? or /help - Display this help message.
list - List nodes or node history or the cluster
listcores - List cores on the cluster
view - View properties of a node
online - Set nodes or node to online state
offline - Set one or more nodes to the offline state

For more information about HPC command-line tools,
see http://go.microsoft.com/fwlink/?LinkId=120724.

(in /app/assets/javascripts/accounts.js.coffee)

Now, of all possible error messages to get, one involving HPC command-line tools was not one that I had ever considered. What's going on here?

The line that gives the error is one that occurs in every Rails application:

<%= javascript_include_tag 'application' %>

What it really comes down to saying is, take all of the Javascript and Coffeescript code in the application, and include it here. A little wasteful, perhaps, but in a production environment all of the code will get glommed into a single file appropriate for caching in the browser, so it's not a bad system. Still, where's the error?

Obviously it must be in the Coffeescript file referenced at the end of the error message. But why is it suddenly complaining about HPC? I went and commented that whole file out - no joy. The same error got reported with a different file. I could find no information anywhere about what the cause might have been, and I didn't expect searching for the error message itself would do any good. I work in an HPC - which stands for High Performance Computing, incidentally - environment so I was fairly sure the error was something very specific to my machine.

I tried, of course. Several different Google searches failed - but I finally hit on the right combination of keywords that led me to a Google Plus comment on a YouTube tutorial.

It turns out that Rails, in its infinite struggle for flexibility, relies on a gem called ExecJS to decide how it's going to convert Coffeescript to Javascript. ExecJS, in turn, is of the opinion that Node.js is a really good way to do the conversion. And it determines if Node.js is available by trying to run an application called Node.

Which is fine, unless you happen to run Windows; and you happen to work in an HPC environment; and you happen to have gone to a conference the other week where you decided to install Microsoft's HPC tools; which happen to include a nice command for managing compute nodes; called - you guessed it - Node. ExecJS was trying to use Microsoft's Node command to convert Coffeescript to Javascript, resulting in the error you see. Hats off to Jonathan McDonald for nailing the issue.
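
In hindsight, a quick check of what "node" actually resolves to on the PATH would have saved some time. Here's a rough sketch in Python. It is not what ExecJS does internally; the function names are mine, and it assumes that the real Node.js answers "node --version" with something like v12.18.3.

```python
import re
import shutil
import subprocess


def looks_like_nodejs_version(output):
    """True if the text looks like a Node.js version banner such as 'v12.18.3'."""
    return re.fullmatch(r"v\d+\.\d+\.\d+\S*", output.strip()) is not None


def check_node_on_path():
    """Return (path, ok): the first 'node' on PATH and whether it looks like Node.js."""
    path = shutil.which("node")
    if path is None:
        return None, False
    try:
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return path, False
    return path, looks_like_nodejs_version(result.stdout)


path, ok = check_node_on_path()
if path is None:
    print("no 'node' executable found on PATH")
else:
    print(path, "looks like Node.js" if ok else "does not look like Node.js")
```

A command that answers with a help screen, like the one in the error message above, fails the version check.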

Wednesday, July 16, 2014

Trinity RNA-Seq

In January a site went live called readingroom.info. I suspect due to the timing and the subject matter it was a student project. The idea was to write summaries - no more than 500 words - of scientific papers and allow people to comment on and discuss them. I thought it was a neat idea. They had some ideas for incentivizing writers and so forth, but I didn't have time to contribute anything until this summer, by which time the authors had apparently lost interest. I sent in a summary of a paper, but after several weeks it had not been approved by the moderators. Maybe it just wasn't that good!

The paper is Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.

This is what I sent them:

The reconstruction of a transcriptome from the short reads generated by RNA-Seq techniques presents many challenges, particularly in the absence of an existing reference genome with which to compare the reads. Challenges include: (i) uneven coverage of the various transcripts; (ii) uneven coverage inside each transcript; (iii) sequencing errors in highly expressed transcripts; (iv) transcripts encoded by adjacent loci can overlap and thus can be erroneously fused to form a chimeric transcript; (v) data structures need to accommodate multiple transcripts per locus, owing to alternative splicing; and (vi) sequences that are repeated in different genes introduce ambiguity. The Trinity pipeline leverages several properties of transcriptomes in its assembly procedure: it uses transcript expression to guide the initial transcript assembly procedure in a strand-specific manner, it partitions RNA-Seq reads into sets of disjoint transcriptional loci, and it traverses each of the transcript graphs systematically to explore the sets of transcript sequences that best represent variants resulting from alternative splicing or gene duplication by exploiting pairs of RNA-Seq reads. The series of steps performed by the pipeline correctly reconstructs a significant percentage of the transcripts without relying on the existence of a reference genome.

A major data structure used by the pipeline is the de Bruijn graph. A de Bruijn graph places each k-mer in a node, and connects two nodes if their k-mers are identical in all but the first or last position, that is, if they overlap in k-1 bases. While it is an efficient structure for representing heavily overlapping sequences, there are challenges in using these graphs: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. Sequencing errors would introduce false nodes to the graph, potentially resulting in a great deal of wasted memory.
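
(An aside that wasn't in the summary I sent: the graph described above is easy to sketch in Python. This is a toy illustration, not Trinity's implementation; the function names, the tiny reads, and k=3 are all made up for the example.)

```python
def kmers(read, k):
    """Yield every substring of length k in read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn_graph(reads, k):
    """Map each k-mer to the k-mers that overlap it by k-1 bases and follow it."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    graph = {}
    for kmer in nodes:
        # A successor shares this k-mer's last k-1 bases and adds one new base.
        graph[kmer] = sorted(kmer[1:] + base for base in "ACGT"
                             if kmer[1:] + base in nodes)
    return graph


graph = de_bruijn_graph(["ACGTC", "CGTA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

The node CGT ends up with two successors, GTA and GTC. A branch like that is what the graph uses to represent either a real variant or a sequencing error.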

The Trinity pipeline consists of the following steps: it first analyzes the short reads to create a dictionary of all sequences of length 25 in the reads, indexing the locations where each sequence may be found. After removing likely errors, the unique k-mers are recombined, starting with the most frequently occurring sequences and extending the combination until no more k-mers can be matched. Each contig is then added to a cluster based on potential alternative spliced transcripts or otherwise unique portions of paralogous genes. Then, a de Bruijn graph is generated from each cluster with the weight of each edge assigned from the number of k-mers in the original read set that support the connection. In the final phase, a merge-and-prune operation on each graph, for error correction, is performed, followed by an enumeration of potential paths through the graph with a greater likelihood placed on paths with greater read support.
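
(Another aside that wasn't in the summary: the first steps, counting k-mers and greedily extending from the most frequent one, can be sketched like this. It is a simplification under my own assumptions, not Trinity's code: it extends to the right only, uses k=4 rather than 25 so the example stays readable, and treats any k-mer seen fewer than min_count times as a likely error. The names are hypothetical.)

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each substring of length k occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_contig(counts, min_count=2):
    """Grow one contig rightward from the most frequent k-mer."""
    # Assumption: k-mers seen fewer than min_count times are treated as errors.
    kept = {kmer: n for kmer, n in counts.items() if n >= min_count}
    if not kept:
        return ""
    seed = max(sorted(kept), key=kept.get)
    contig, current, used = seed, seed, {seed}
    while True:
        candidates = [current[1:] + base for base in "ACGT"
                      if current[1:] + base in kept
                      and current[1:] + base not in used]
        if not candidates:
            return contig
        current = max(candidates, key=kept.get)  # best-supported extension
        used.add(current)
        contig += current[-1]


# Three clean copies of a read plus one copy with a single-base error
reads = ["ACGTTGCA"] * 3 + ["ACGTAGCA"]
print(greedy_contig(count_kmers(reads, 4)))
```

The k-mers that only appear in the erroneous read are dropped before extension starts, so the contig follows the well-supported path.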

The authors built transcriptomes from both original data and reference sets, having a great deal of success in either case.