13/05: emacs packages considered useful

Category: workflows
Posted by: dmeliza
Having recently upgraded emacs to the latest version (I was running the last release of CarbonEmacs), I took the opportunity to clean up my .emacs file and add/enable a few new packages:

Packages I've been using for some time:

Deprecated packages:

swbuff: cycles between buffers with a single keystroke, similar to how you can move between tabs in a browser with C-pagedown and C-pageup. I now use ido for this, using the following lines in my .emacs file:


(add-hook 'ido-setup-hook 'ido-my-keys)
(defun ido-my-keys ()
  (define-key ido-completion-map [C-next] 'ido-next-match)
  (define-key ido-completion-map [C-prior] 'ido-next-match)
)
(global-set-key [C-next] 'ido-switch-buffer)
(global-set-key [C-prior] 'ido-switch-buffer)
 

28/01: a modest proposal

Category: General
Posted by: dmeliza
Why are (neuro)scientists such terrible programmers? I just spent most of the morning trying to add some simple functionality to some code from a fairly prominent computational neuroscience lab, and the code was so bad, I began to entertain serious doubts about the method and a whole string of papers that have been published using the method. I briefly debated rewriting the whole thing from scratch, but I have about twelve other projects on my plate now, and the benefits of challenging a fairly accepted method are pretty low, unless you can also upset a few cherished hypothesis at the same time. So I crammed my patches into the spaghetti, prayed it didn't break anything, and started the analysis, which only makes a rather minor contribution to my results anyway. No doubt better programmers than me have felt and done similar things, and in all likelihood the problem is much larger: science is so competitive now, with everyone looking for big, new, flashy results, that verification has become a really low priority. There's not a lot I can do about that, but at least in the realm of computational techniques it may be worthwhile to outline the problem.

What makes code good, and why is it important? Good code is first and foremost, easy to interpret by someone who didn't write it. Anyone with a basic understanding of programming, and the scientific question being addressed, should be able to tell what's going on in every line. This goes far beyond commenting; in fact, really good code doesn't need to have a lot of comments, because the expressions are largely self-explanatory. But regardless of how it's achieved, when code is readable the methodology is clear. It should be obvious that this is important to good science, because it allows your colleagues to evaluate what you did and extend your work. But let's be honest: not everyone shares those goals. A cynical explanation for why there's so much bad scientific code is that the people releasing it want to maintain control over the method, for commercial or competitive reasons. I don't think this motive is all that common in neuroscience, so I prefer to think that it stems from a lack of education, which I am willing to do my part to try to dispel. Toward that end, I'll be posting articles here detailing some common mistakes and mistaken attitudes.

In the long run, as computational techniques become more complex and more common, I would propose that code needs to be treated as a part of a paper's Methods section. If a program hasn't been used previously, it should be subject to review, and even if the method isn't new, authors should be required to deposit the code they used in a given paper somewhere online. Many journals make similar requirements for gene sequences. I think this would do a lot to improve the quality of programming, and it would keep researchers from purposefully obfuscating code to keep one foot in the door for commercial exploitation. But of course there is a much larger discussion about how well scientific and pecuniary motives can coexist in research.

02/07: dynamic coroutines

Category: python
Posted by: dmeliza

Generators are Python functions that have a yield statement instead of a return statement. Or rather, calling a function that has a yield statement returns a generator, which is an iterable object that uses the underlying function to produce values. Each time you call next() on the generator, it yields the next value in the sequence, or throws StopIteration, which indicates that it's out of values. Python's famous list comprehensions are actually generators. For example, you can create a list of square numbers with a command like [x**2 for x in range(5)]. If you leave out the brackets you get a generator:


>>> g = (x**2 for x in range(5))
>>> g
<generator object at 0x2aaaae169f38>
>>> g.next()
0
>>> g.next()
1
 


Python 2.5 added some more sophistication to generators that allow them to be used as coroutines (PEP 342). For instance, generators now have a close() method that causes the generating function to throw a GeneratorExit, which can act like a signal to tell the generator to clean itself up. It's also possible to pass an arbitrary exception to the generator, or to pass values back to the generator.

The advantage of using coroutines is that it allows you to modularize your code without having to break out a lot of heavy object-oriented features. For example, I often have to write scripts that plot a bunch of little figures. I often want to break the figure across multiple pages, so I generate axes until the figure is full, then write the figure to disk and start a new one. There's some bookkeeping involved in this; I have to create the initial figure, then check after every plot if the figure is full and take the appropriate actions if it is. All of that is extraneous to what the script is actually doing, so we want to factor that out. There are many ways of solving this, but a generator is remarkably simple:


def axgriditer(grid=(1,1), figfun=None, **figparams):
    """
    Generates axes for multiple gridded plots.  Initial call
    to generator specifies plot grid (default 1x1).  Yields axes
    on the grid; when the grid is full, opens a new figure and starts
    filling that.

    Arguments:
    grid -    specify the grid layout. Can be a tuple or a function that
              yields a series of axes [signature grid(fig)]

    figfun -  called when the figure is full or the generator is
              closed.  Can be used for final figure cleanup or to save
              the figure.  [signature: figfun(fig)]

    additional arguments are passed to the figure() function
    "
""
    if len(grid)==2:
        nx,ny = grid
        grid = lambda fig: (fig.add_subplot(nx,ny,i) for i in range(1,nx*ny+1))
    elif not callable(grid):
        raise ValueError, "Grid argument must be length 2 or a function"

    fig = figure(**figparams)
    axg = grid(fig)
    try:
        while 1:
            for ax in axg:
                yield ax
            if callable(figfun): figfun(fig)
            fig = figure(**figparams)
            axg = grid(fig)
    except Exception, e:
        # cleanup and re-throw exception
        if callable(figfun): figfun(fig)
        raise e
 


Using this generator is fairly simple. The following code plots 4 figures with 5 panels each, with only a single line of extra code to initialize the coroutine.


axes_grid = axgriditer((5,1))
for i in range(20):
    axes_grid.next().plot(randn(25))
 


Several things to note about this function:


24/06: R tools for scons

Category: workflows
Posted by: dmeliza

The motivation for using scons to automate analysis workflows is to formally specify dependencies in how the analysis gets done. This is the idea behind "reproducible research" (a term clearly coined by non-biologists; I prefer "reproducible analysis"). You run one command and it turns your raw data into figures that are nearly ready for publication, removing a lot of potential sources of error. I've used Sweave for a couple of projects. It's a system that allows you to embed R code chunks in a LaTeX document, and the calculations and figures generated from the embedded code get placed directly into the resulting PDF file. It's a powerful idea, but it doesn't lend itself well to the development of multi-stage analysis workflows where the output of one step in data analysis flows into the next step. Change one thing and you have to rerun the entire script. It makes a lot more sense to design each step as a modular component and formally specify the dependencies between the steps.

In the previous post, I described how to use a custom builder to add python scripts to a scons workflow. You can do the same thing for R scripts and Sweave documents. With a little regular expression kung-fu you can even get scons to recognize which R scripts and data sources are being imported.


import os,re,itertools

source_re = re.compile(r'^source\([\'\"](\S+)[\'\"]\)',re.M)
load_re = re.compile(r'^load\([\'\"](\S+?)[\'\"]\)',re.M)
table_re = re.compile(r'read.\S+\([\'\"](\S+?)[\'\"]',re.M)

def fix_rel(f):
    return f if f.startswith('/') else ('#' + f)
       
def rfile_scan(node, env, path):
    txt = node.get_contents()
    return [fix_rel(f) for f in itertools.chain(source_re.findall(txt),
                                                load_re.findall(txt),
                                                table_re.findall(txt))]

rbuild = Builder(action='R -q --vanilla $SCRIPTOPTS < $SOURCE')

sweavebuild = Builder(action='R CMD Sweave $SOURCE',
                      suffix = '.tex',
                      src_suffix = '.Rnw')

rscan  = Scanner(function = rfile_scan,
                  skeys = ['.R','.Rnw'])
 


You still have to manually specify what each script outputs (except for the Sweave builder, which knows the output will be a .tex file), for instance:


env = Environment()
env.Append(BUILDERS = {'RBuild' : rbuild,
                       'SWeave' : sweavebuild})
env.Append(SCANNERS = rscan)

unit_tbl = env.RBuild('unit_stats.tbl','unit_analysis.R')
 

24/06: python tools for Scons

Category: workflows
Posted by: dmeliza
I've started using scons to manage my data analysis workflows. A lot of the work is done by python scripts that read in a bunch of data from one more sources, crunch it, and spit out a new table. So there is a dependency not only on the input data but on the script and any modules it imports. You can get scons to run a script using a simple Command builder. For instance, you have some "script.py" that expects a command line like "script.py input1 input2 output".


env.Command('output.tbl', ['script.py', 'input1.tbl', 'input2.tbl'], 'python $SOURCES $TARGETS')
 


This works fairly well, but if script.py imports another module (e.g. with functions common to a bunch of scripts) you have to manually specify that dependency. But with a little extra code you can get scons to automatically scan python scripts for import statements and include any imports from the local directory as dependencies. I also like to use an Emitter that will move the script to the front of the list of dependencies, so I don't have to worry about what order I specify them in.


import os,re

import1_re = re.compile(r'^from\s+(\S+)\s+import',re.M)
import2_re = re.compile(r'import\s+(.+)$',re.M)

def pyfile_scan(node, env, path):
    imports = []
    search_path = os.path.join(*os.path.split(str(node))[:-1])
    text = node.get_contents()
    for item in (import1_re.findall(text) + import2_re.findall(text)):
        for x in item.split(','):
            test_file = x.strip() + '.py'
            if os.path.exists(os.path.join(search_path, test_file)): imports.append(test_file)
    return imports

def py_targets(target,source,env):
    """ pulls out the python script from the source list and generates a call to the script """
    out = []
    for x in source:
        if str(x).endswith('.py'):
            out.insert(0,x)
        else:
            out.append(x)
    return target,out

pybuild = Builder(action='python $SOURCES $TARGETS $SCRIPTOPTS',
                  emitter=py_targets)
pyscan  = Scanner(function = pyfile_scan,
                  skeys = ['.py'])

 


Now you can use the custom builder as follows, and scons will recognize any modules script.py depends on.


# add the python builder to the environment
env = Environment()
env.Append(BUILDERS = {'PyBuild' : pybuild})
env.Append(SCANNERS = pyscan)

env.PyBuild('output.tbl',['script.py','intput1.tbl','input2.tbl'])
 

07/05: permutation magic

Category: R
Posted by: dmeliza
R doesn't have great string or set manipulation functions, but you can accomplish a lot by using factors and apply(). For instance, I had a column in a data frame that consisted of 3 two-character tokens concatenated together (e.g. 'abdqbk','cbdabb') that represented sets of 3 sounds played in order. I needed to recode this as a set of two factors, one indicating which set of symbols had been used in a particular trial, and another indicating the order they had been played. Solving the problem turned out to be an interesting illustration of how your tools constrain your thinking, and in this case the result wound up being fairly elegant (IMHO). Whereas if I had been working in Python I probably would have written a couple of loops to run through the strings, one to find the unique sets of symbols, and the other to assign unique values to each permutation.


permutations <- function(stimnames, toklen=2) {
  # assume all stims have the same length name but otherwise stay flexible
  stimlen <- nchar(as.character(stimnames[1]))
  split.points <- rep(codelen, stimlen/toklen)
  toks <- strmsplit(stimnames, split.points)

  # sort tokens and use factor to compute unique sets
  sets <- factor(apply(toks,1,function(x) {paste(sort(x), collapse='')}))

  perms <- lapply(split(stimnames,sets),
                   function(x) {as.numeric(factor(as.character(x)))})

  data.frame(sets, perms=unsplit(perms,sets))
}
 


strmsplit() is another little bit of apply() magic that will split a string at a set of fixed cut points. The code is from one R tip a day.


strmsplit <- function(s, pos){
  # split a string into multiple bits based on cut points
  # e.g. strmsplit('st378akbkzk',c(5,2,2,2)) = c('st378','ak','bk','zk')
  # from http://onertipaday.blogspot.com/2007/06/string-manipulation-insert-delim.html
  start <- head(cumsum(c(1, pos)), -1) # delete last one
  sel <- cbind(start=start,end=start + pos -1)
  apply(sel, 1, function(x) substr(s, x[1], x[2]))
}
 

23/10: numpy array broadcasting

Category: numpy
Posted by: dmeliza

One of the features I appreciate most about numpy is array broadcasting, probably because I hated having to use repmat() in my MATLAB days. With MATLAB, if you want to add the values in a vector to each column of a matrix, you have to replicate the vector to the same size as the matrix first, which is a colossal waste of memory (MATLAB does support something called lazy evaluation, but I'm not sure it applies to these sorts of expressions). In numpy you can simply add the vector to the matrix; however, the rules are a little tricky. There's an excellent FAQ on broadcasting here - how to specify which dimension should be matched between the two arrays, and when broadcasting is actually more inefficient.

21/10: structured data in numpy

Category: numpy
Posted by: dmeliza
Numpy supports structured arrays, which are the nearest thing to R's data.frame class. Data are organized into fields and records. Each field (column) has a name and data type, and each record (row) has a value for all the fields. Columns are indexed by name, and rows are indexed by integers. Recarray objects can be generated from nested Python iterable objects using numpy.rec.fromrecords:


>>> D = [('fair',6.0,1), ('good',12,2)]
>>> D = numpy.rec.fromrecords(D, names='quality,price,size')
>>> D

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

>>> D['quality']

rec.array(['fair', 'good'],
      dtype='|S4')

>> D[0]

('fair', 6, 1)
 


Note that the 'price' field has a float data type because one of the records has a float value, and the field is promoted to the most general data type. For more precise control over field data types, fromrecords() takes a format argument, which is a comma-delimited list of format strings. For instance, to force 'price' to be an integer, call D = numpy.rec.fromrecords(D, names='quality,price,size', formats='S4,i4,i4')

For reading and writing recarrays, use matplotlib.mlab.rec2csv() and matplotlib.mlab.csv2rec(). The format of each field can be specified using a dictionary. There are a number of arguments to both functions that can be used to control how the data is read in (e.g. delimiter, is the first row a list of field names, etc), most of which are documented. The rec2csv() function always outputs field names as headers. To avoid this behavior, or to avoid having a dependency on matplotlib, use numpy.savetxt()


>>> from matplotlib import mlab

>>> formatd = {'quality' : mlab.FormatString(), 'price' : mlab.FormatFloat(2),}
>>> mlab.rec2csv(D, 'test.csv', formatd=formatd)
>>> mlab.csv2rec('test.csv')

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

>>> numpy.savetxt('test.csv', D, delimiter=',', fmt=('%s','%3.2f','%d'))
>>> numpy.loadtxt('test.csv', delimiter=',', dtype={'names': ('quality','price','size'), 'formats' : ('S4', 'f8', 'i4')})

array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

 

25/08: accumarray in numpy

Category: numpy
Posted by: dmeliza
A useful and underappreciated MATLAB function is accumarray(), which aggregates a vector of data elements into a multi-dimensional array based on a separate set of index vectors. It's essentially the same as the R function xtabs(), except that the indices in accumarray() have to be numeric. I found myself needing something like this function in python when I was writing an algorithm for calculating reassigned spectrograms (more on this later; I'm working on a little monograph on time-frequency representations of birdsong). I couldn't find anything, and wound up writing the algorithm in weave/blitz -- which I probably would have done anyway to tweak the performance.

Later I discovered that there is an equivalent to accumarray() in scipy: it's the scipy.sparse.coo_matrix() constructor. Here's the example from the docstring:


row  = array([0,0,1,3,1,0,0])
col  = array([0,2,1,3,1,0,0])
data = array([1,1,1,1,1,1,1])
coo_matrix( (data,(row,col)), shape=(4,4)).todense()

matrix([[3, 0, 1, 0],
            [0, 2, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 1]])

 


The main downside to this method is that it only works with 2D arrays. I also don't know anything about the performance. Probably if your matrix is fairly sparse (i.e. lots of zeros) this is the way to go. If you've got more or fewer dimensions, or the output is dense, you're probably better off just writing the loop in C++.

23/04: discrete contours

Category: matplotlib-0.91
Posted by: dmeliza
Contour plots are useful for visualizing 2D arrays of numbers where the values vary relatively smoothly over space. A less common use is to annotate regions in a plot, in which case the identity of each region is typically coded by an integer. If you have multiple regions, the values aren't going to vary smoothly, and the built-in matplot contour command won't look quite right: higher-valued regions will have multiple borders around them. The solution is to make a separate plot call for each unique value of the label. An optional gaussian filter can be used to smooth off region edges.

dcontour example


import numpy as nx
def dcontour(ax, *args, **kwargs):
    """
    Discrete contour function. Given a matrix I with a discrete number
    of unique levels, plots a contour at each unique level. Values less than 0
    are ignored.
   
    DCONTOUR(axes, I) plots the unique levels in I
    DCONTOUR(axes, X,Y,I) - X,Y specify the (x,y) coordinates of the points in I

    Optional arguments:

    smooth - specify a float or 2-ple of floats, which are used to gaussian filter
             each data level prior to contouring (which gives smoother contour lines)
             
    Other keyword arguments are passed to contour()
    "
""
    from scipy.ndimage import gaussian_filter

    smooth = kwargs.get('smooth', None)
   
    I = args[0]
    if len(args) > 1:
        (X, Y) = args[1:3]
    else:
        (Y, X) = (nx.arange(I.shape[0]), nx.arange(I.shape[1]))
   
    labels = nx.unique(I[I>-1])

    h = []

    kwargs['hold'] = 1
    for i in labels:
        if smooth!=None:
            data = gaussian_filter((I==i).astype('d'), smooth)
        else:
            data = I==i
        hh = ax.contour(X, Y, data,1, colors=colorcycle(i), **kwargs)
        h.append(hh)

    return h
 


The following code, or something equivalent, is necessary to ensure that the contour colors cycle through different values for each level. I find these particular colors pleasing and fairly easy to distinguish, but your mileage may vary, especially if you need to annotate more than 15 different regions.


_manycolors = ['b','g','r','#00eeee','m','y',
               'teal',  'maroon', 'olive', 'orange', 'steelblue', 'darkviolet',
               'burlywood','darkgreen','sienna','crimson',
               ]
   
def colorcycle(ind=None, colors=_manycolors):
    """
    Returns the color cycle, or a color cycle, for manually advancing
    line colors.
    "
""
    return colors[ind % len(colors)] if ind!=None else colors