Résumé: David Soergel, Ph.D.

Work Experience

2013 -

Software Engineer, Google, Inc.

2011 - 2013

Research Scientist and Software Engineer, Information Extraction and Synthesis Laboratory (McCallum lab), School of Computer Science, University of Massachusetts Amherst.

Large-scale machine learning for natural language processing. Open access and open evaluation of research literature; project lead, OpenReview.net.

2001 - 2003

Lead Bioinformatics Developer, The Molecular Sciences Institute.

Databases and web applications supporting basic research in biology.

1999 - 2003

Founder and Principal, Asha Technologies.

Consulting firm focusing on database-driven web applications for socially beneficial purposes.

2000 - 2001

Co-founder and Director of Research and Development, Little Engine, Inc.

Information technology for preschool teachers and parents.

1999 - 2000

Research Associate, Cavalli-Sforza lab, Department of Genetics, Stanford University School of Medicine.

Databases and software for analyzing the geographic distributions of human genes.

1998 - 1999

Vice President for Technology, Padra.org

1997 - 1999

Software Developer, Science and Technology in the Making, Stanford University Libraries

Summer 1996

Research Assistant, Institute for Scientific Computing Research, Lawrence Livermore National Laboratory

Summer 1994

Samuel P. and Frances Krown Summer Undergraduate Research Fellow, San Onofre/Palo Verde Neutrino-Oscillation Experiment, Caltech

Summer 1992

Summer Intern, Deutsches Elektronen-Synchrotron (DESY), Hamburg, Germany

Research Projects

Open scholarship and reproducible research.

OpenReview.net, a platform for open evaluation of scholarly articles. OpenReview.net aims to promote openness in scientific communication, particularly regarding the peer review process. We are implementing a platform for peer review that generalizes over many subtle gradations of openness, allowing conference organizers, journals, and other “reviewing entities” to configure the specific policy of their choice. We intend to act as a testbed for different policies, to help scientific communities experiment with open scholarship while addressing legitimate concerns regarding confidentiality, attribution, and bias. We are collaborating with sociologists in this investigation. Our initial focus is on computer science conferences; to date our system has provided paper submission, reviewing, and public discussion for ICLR 2013, ICLR 2014, ICML/Inferning 2013, ICML/Peer Review 2013, and AKBC 2013.
WorldMake.org, a versioned data analysis tool for reproducible research. WorldMake is a system for describing, sharing, and executing computational workflows in a manner that guarantees reproducible results. It provides a means of ensuring that a set of computational results is up to date with respect to its inputs, that the results are internally consistent, and that their provenance is rigorously tracked. It also provides a means of sharing inputs, intermediate results, and final outputs, so as to facilitate collaboration while avoiding redundant computation. A predecessor of this system drove all of the computations for my dissertation, involving on the order of one million digital artifacts (i.e., files containing inputs, intermediate results, and outputs) and requiring weeks of computation on a large cluster.
MONOD, a collaborative tool for manipulating biological knowledge. MONOD (for “Modeler's Notebook and Datastore”) was a web application designed to capture and communicate knowledge generated during the process of building models of many-component biological systems. We used MONOD to construct a model of the pheromone response signaling pathway of Saccharomyces cerevisiae. MONOD allowed the accumulation, documentation, and exchange of data, valuations, assumptions, and decisions generated during the model building process. MONOD thus helped preserve a record of the steps taken on the path from the experimental data to the computable model. Our goals were to streamline the processes of building models, communicating with other researchers, and managing and manipulating biological knowledge. Once fully realized, “collaborative annotation”—fine-grained, structured, searchable communication enabled by software tools of this type—promises to enhance the practice of research in every field of science and engineering.

Information extraction from scholarly literature.

Extraction of citation metadata from PDFs of scholarly articles, using numerous text and layout features.
Frameworks for translating and processing citation metadata, with plugins for reading and writing a variety of formats. Streaming and concurrent operation allows rapid processing of very large datasets (commonly runs on tens of millions of records).
Normalizing person names, parsing a wide variety of name formats into constituent components.

Concurrent programming.

conja, a library providing functional concurrency in Java. Conja lets code take advantage of multicore processors with no configuration and minimal code changes. It schedules nested concurrent tasks in a memory-efficient, depth-first manner.

Microbial ecology and metagenomics.

Microbial species identification from environmental shotgun sequencing. A foundational problem in metagenomics is the assignment of short next-generation sequencing reads to known microbial taxa, and the clustering of sequences into potentially unknown taxa. The surprising finding that sequence composition (i.e., statistical descriptions of the distribution of short words) can be discriminative of species identity has led to a wide range of proposed methods for both the supervised and the unsupervised variants of this “binning” problem, but the evaluation procedures applied to them have been both inconsistent and unrealistic. It has thus not been clear which method is best, or what performance can be expected in classifying real data. I reimplemented nearly all of the methods in the literature as special cases of a more general framework, allowing me to compare them on a common footing designed to mirror real circumstances.
Microbial community composition using the 16S ribosomal RNA sequence. PCR amplification and sequencing of the gene for the 16S ribosomal RNA subunit directly from environmental samples is a long-standing method of measuring species richness and relative abundance. I demonstrated that the use of sequencing reads that are much shorter than the gene itself (as has recently become economical and thus popular) has the potential to introduce substantial error in such studies. However, I also established, through exhaustive computational experiments, that a judicious choice of PCR and sequencing primers can avoid these errors.
RTAX. Rapid and accurate taxonomic classification of short paired-end sequence reads from the 16S ribosomal RNA gene. Available as part of the QIIME microbial ecology pipeline.

Teaching and Mentoring

Research Mentor for graduate rotation students, undergraduate research assistants, and software developers. (5 total, 2005-2010)

Graduate Student Instructor for Microbial Genetics and Genomics, U.C. Berkeley (2007)

Grants and Awards

Chang-Lin Tien Scholar in Environmental Sciences and Biodiversity, UC Berkeley. (2008-2010)

Contributing author to a successful NIH R01 grant to Rob Knight. (2011)

Predoctoral Fellow, Howard Hughes Medical Institute. (2003-2008)

Caltech and Stanford Summer Undergraduate Research Fellowships. (1994, 1995, 1997)

Caltech Merit Awards. (1994, 1995)

Robert Andrews Millikan Scholar, Caltech. (1993)

Travel Awards. NAS Sackler Colloquium on Tapestry of Life, Irvine, CA (2005); 14th International Conference on Microbial Genomes, Lake Arrowhead, CA (2006).

Publications

Soergel DAW. (2015). Rampant software errors may undermine scientific results. F1000Research 3: 303. Full Text, PDF, Reviews and Discussion

Soergel DAW, Saunders AC, McCallum A. (2013). Open Scholarship and Peer Review: a Time for Experimentation. ICML Workshop on Peer Reviewing and Publishing Models (WPEER). PDF, Discussion

Soergel DAW, Dey N, Knight R, Brenner SE. (2012). Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. The ISME Journal 6: 1440-1444. Full text, PDF

Yooseph S, Sutton G, Rusch DB, … Soergel DAW, … Venter JC. (2007). The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biology 5: e16. Full text, PDF

Lareau LF, Brooks AN, Soergel DAW, Meng Q, Brenner SE. (2007). The coupling of alternative splicing and nonsense-mediated mRNA decay. In Blencowe B and Graveley B, ed., Alternative splicing in the post-genomic era (pp. 190-211), Landes Bioscience. PDF

Soergel DAW, Lareau LF, Brenner SE. (2006). Regulation of gene expression by the coupling of alternative splicing and nonsense-mediated mRNA decay. In Maquat L, ed., Nonsense-mediated mRNA decay (pp. 175-196), Landes Bioscience. PDF

Soergel DAW, Choi K, Thomson T, Doane J, George B, Morgan-Linial R, Brent R, Endy D. (2004). MONOD, a collaborative tool for manipulating biological knowledge. Working paper

Posters and Presentations

Computational approaches to evaluating microbial diversity. UC Berkeley campus seminar in environmental microbiology. (2008)

Sequence compositional biases and microbial diversity. UC Berkeley Graduate Group in Genomic and Computational Biology Retreat. (2008)

Explorations in environmental and medical metagenomics. Metagenomics 2007, San Diego, CA. (2007)

Explorations in environmental and medical metagenomics. HHMI predoctoral fellows meeting, Chevy Chase, MD. (2006)

Interpreting metagenomic data using oligonucleotide signatures. Metagenomics 2006, San Diego, CA. (2006)

Interpreting metagenomic data using oligonucleotide signatures. California Metagenomics Workshop, Berkeley, CA. (2006)

Interpreting environmental sequence data using oligonucleotide signatures. 14th International Conference on Microbial Genomes, Lake Arrowhead, CA. (2006)

MONOD, a collaborative tool for manipulating biological knowledge. Formal Languages for Biological Processes, CSHL Banbury Center, Cold Spring Harbor, NY. (2003)

MONOD, the modeller's notebook and datastore. DARPA BioComp PI meeting, Washington, DC. (2002)

Human Gene Geography: a database of human genome variation. CSHL Conference on Human Evolution, Cold Spring Harbor, NY. (1999)
