Software engineer and research scientist with recent experience in computational biology, metagenomics, machine learning, natural language processing, bibliometrics, and open access advocacy. Expert in large-scale data management, database design, and cluster computing. Experienced in project management and system administration. Proficient with a wide variety of computing technologies and platforms. Effective team player; also able to complete entire projects independently, from conception through launch.
A software engineering position involving design and implementation of systems for data management and analysis, ideally in the natural sciences, environmental conservation, or renewable energy. Alternatively, an academic position in computer science, particularly involving computational biology and large-scale computing.
University of California, Berkeley (2003 - 2010)
Stanford University (1995 - 1998)
California Institute of Technology (1993 - 1995)
Effective communicator in spoken and written English. Skilled at discussing technology projects with non-technical clients. Fluent in German.
Software Engineer, Google, Inc.
Large-scale machine learning infrastructure for natural language processing. Advocate for open access and open evaluation of research literature; principal architect, openreview.net.
Lead Bioinformatics Developer, The Molecular Sciences Institute
Summer Intern, Deutsches Elektronen-Synchrotron (DESY), Hamburg, Germany
iesl-sbt-base. SBT plugin providing all manner of boilerplate, so that the Build.scala file for a project can be trivially short. Includes simplified dependency resolution with automatic updating; clarity on what transitive dependencies are used; and unified logging configuration.
namejuggler. Normalizer for person names, parsing a wide variety of name formats into constituent components.
bibmogrify (open-source release planned). A general framework for translating and processing citation metadata, with plugins for reading and writing a variety of formats. Streaming and concurrent operation allows rapid processing of very large datasets (commonly runs on tens of millions of records).
pdf2meta (open-source release planned). Extracts citation metadata from PDFs of scholarly articles, on the basis of numerous text and layout features.
jLibSVM. Heavily refactored Java port of LIBSVM, providing efficient training of Support Vector Machines. Provides many new features, including a fully generified API; the ability to add custom kernels for arbitrary data types; and integrated scaling and normalization.
ml. Generic machine learning package. Provides a framework for supervised and unsupervised clustering (both online and batch), and currently implements naive Bayesian, k-NN, k-means, and Kohonen SOM clustering. Computes Variable Memory Markov models (aka Probabilistic Suffix Trees) on strings. Also implements various Monte Carlo methods, including Metropolis-coupled MCMC.
conja. Library providing functional concurrency in Java. Conja lets code take advantage of multicore processors with no configuration and minimal code changes. Schedules nested concurrent tasks in a memory-efficient depth-first manner.
phyloutils. Provides data structures for weighted phylogenetic trees, and various operations on such trees. Includes phylogenetic alpha and beta diversity measures such as Weighted UniFrac.
pdftank. Automatically navigates journal web sites to download and cache full-text PDFs.
Research Mentor for graduate rotation students, undergraduate research assistants, and software developers. (5 total, 2005-2010)
Graduate Student Instructor for Microbial Genetics and Genomics, UC Berkeley. (2007)
Chang-Lin Tien Scholar in Environmental Sciences and Biodiversity, UC Berkeley. (2008-2010)
Contributing author to a successful NIH R01 grant to Rob Knight. (2011)
Predoctoral Fellow, Howard Hughes Medical Institute. (2003-2008)
National Defense Science and Engineering Graduate Fellowship. (2003, declined)
Caltech and Stanford Summer Undergraduate Research Fellowships. (1994, 1995, 1997)
Caltech Merit Awards. (1994, 1995)
Robert Andrews Millikan Scholar, Caltech. (1993)
Travel Awards. NAS Sackler Colloquium on Tapestry of Life, Irvine, CA (2005); 14th International Conference on Microbial Genomes, Lake Arrowhead, CA (2006).
Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, et al. (2007). The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biology 5: e16.
Lareau LF, Brooks AN, Soergel DAW, Meng Q, Brenner SE. (2007). The coupling of alternative splicing and nonsense-mediated mRNA decay. In Blencowe B and Graveley B, eds., Alternative splicing in the post-genomic era (pp. 190-211), Landes Bioscience.
Soergel DAW, Lareau LF, Brenner SE. (2006). Regulation of gene expression by the coupling of alternative splicing and nonsense-mediated mRNA decay. In Maquat L, ed., Nonsense-mediated mRNA decay (pp. 175-196), Landes Bioscience.
Soergel DAW, Choi K, Thomson T, Doane J, George B, Morgan-Linial R, Brent R, Endy D. (2004). MONOD, a collaborative tool for manipulating biological knowledge. Working paper
Computational approaches to evaluating microbial diversity. UC Berkeley campus seminar in environmental microbiology. (2008)
Sequence compositional biases and microbial diversity. UC Berkeley Graduate Group in Genomic and Computational Biology Retreat. (2008)
Explorations in environmental and medical metagenomics. Metagenomics 2007, San Diego, CA. (2007)
Explorations in environmental and medical metagenomics. HHMI predoctoral fellows meeting, Chevy Chase, MD. (2006)
Interpreting metagenomic data using oligonucleotide signatures. Metagenomics 2006, San Diego, CA. (2006)
Interpreting metagenomic data using oligonucleotide signatures. California Metagenomics Workshop, Berkeley, CA. (2006)
Interpreting environmental sequence data using oligonucleotide signatures. 14th International Conference on Microbial Genomes, Lake Arrowhead, CA. (2006)
MONOD, a collaborative tool for manipulating biological knowledge. Formal Languages for Biological Processes, CSHL Banbury Center, Cold Spring Harbor, NY. (2003)
MONOD, the modeller's notebook and datastore. DARPA BioComp PI meeting, Washington, DC. (2002)
Human Gene Geography: a database of human genome variation. CSHL Conference on Human Evolution, Cold Spring Harbor, NY. (1999)