Internship Mentors 2005

Cecilie Boysen & Jim Breaux (ViaLogy) | Bruce Hoff (BioDiscovery) | Kenneth Lange (UCLA) | York Marahrens (UCLA) | Matteo Pellegrini (UCLA) | Bruce Shapiro (Cal Tech/JPL) | Janet Sinsheimer (UCLA) | Mike Thompson & Todd Yeates (UCLA) | Robert Vellanoweth (CSULA) | Barbara Wold (Cal Tech)

Cecilie Boysen, Ph.D & Jim Breaux, Ph.D (ViaLogy)

www.vialogy.com

What research does ViaLogy conduct?

ViaLogy develops and markets software solutions for active signal processing using its proprietary technology, Quantum Resonance Interferometry (QRI). ViaLogy's technology is applicable to a broad range of measurements, including DNA and protein microarrays, mass spectrometers, elemental analyzers, electron microscopes, wafer inspection systems, medical imaging, optical and microwave communications, high content screening, and financial analysis. In its initial target market, ViaLogy Microarray Analysis Service (VMAxS) provides order-of-magnitude increases in detection sensitivity, and significantly improved reproducibility and specificity through an on-line user-driven re-analysis of raw data files from market-leading DNA microarray systems, without changing laboratory protocols and without new instrumentation.

ViaLogy is in the process of expanding application of QRI technology into the field of mass spectrometry (MS). We anticipate that the summer projects will thus include both MS and novel microarray applications. Projects will involve analysis of incoming client data file formats, analysis of project studies using currently available (passive analysis) software for image and data processing, and adaptation of the QRI technology to the new data formats. Analysis will include comparison of active (QRI) and passive analysis results according to criteria of sensitivity, specificity, and assay reproducibility.

What does ViaLogy look for in their interns?

Note that all activities at ViaLogy are computational in nature; ViaLogy does not operate a wet lab, and relies on incoming data from clients and prospective customers. Thus, the desired intern skills and interests include familiarity with programming (C++ and/or R in particular), image processing, and data analysis.




Bruce Hoff, Ph.D (BioDiscovery)

www.biodiscovery.com

What is BioDiscovery?

Gene microarrays have become recognized as powerful tools for providing a global view of gene expression regulation for a biological condition of interest. The other side to this double-edged sword is that such studies produce large amounts of interesting numerical results, making it difficult to get an intuitive grasp on what is happening biologically. Our company, BioDiscovery, is dedicated to providing researchers with useful software tools for gleaning biological meaning from large data sets. One area of interest is the discovery of gene interactions, useful in elucidating novel biological mechanisms. The summer internship project will involve applying tools such as clustering analysis, self-organizing maps, and genomic pathways to the discovery of new biological interactions between genes.

What would an intern work on?

Interns will be involved in researching and prototyping novel analytic and statistical tools for processing microarray data and biological information. The ideal candidate is an adept Java programmer, knowledgeable in mathematics, statistics, biology, and publicly available (web-based) Bioinformatics resources (e.g. those of NCBI). Particularly valuable knowledge includes an understanding of gene metabolic and signaling pathways, an understanding of multi-variate and multi-factor statistics, and exposure to existing microarray data analysis software tools.




Kenneth Lange, Ph.D (UCLA)

www.ucla.edu

What is our research group interested in?

Complex calculations underlie many of the recent advances and accomplishments in human genetics. Mathematical models are the lenses through which biomathematicians view and interpret genetic data. We design such models and implement them in computer software to understand the genetic basis of common diseases, such as cancer, Alzheimer's disease and schizophrenia.

What would an intern work on?

We developed the Mendel software program, which performs statistical analysis to solve a variety of genetic problems. Implementations are included for all common and several novel statistical genetic tests. These tests include gene mapping using SNPs or microsatellite markers, identifying potential genotyping errors, performing genetic risk calculations, and testing for paternity or other pedigree relationships. Although Mendel is very versatile, there is still room for improvement. Some possible projects include: (1) Many applications of the software need better example data sets to help users with their own research questions. An intern would learn to use Mendel by beta testing the software, systematically search the web for public domain genetic data sets, and test these data sets using Mendel. (2) The output needs a graphical interface to make interpreting the results more intuitive for users. An intern would help develop this graphical interface.




York Marahrens, Ph.D (UCLA)

www.ucla.edu

What type of research does our group do?

We examine the role of repetitive sequences in determining how our chromosomes are packaged in protein coats. More than half of our genome consists of a small number of DNA sequences that are present in many copies that are scattered throughout the chromosome. These "repeated sequences" are packaged in protein coats that by necessity are tighter than the protein coats of the rest of the chromosome. The tight protein packaging of repeated sequences prevents the repeated sequences from recombining with each other and thus destabilizing the chromosomes. The tight protein coat of a repeated sequence also influences the expression of genes that are in close proximity to the repeated sequence. The Marahrens lab is also interested in a number of genes that are frequently mutated in human cancers and are also mutated in genetic diseases characterized by genome instability and predisposition to cancer. The Marahrens lab is trying to prove that mutations in these cancer genes cause the protein coats of repeated sequences to be abnormally loose, which in turn results in unstable chromosomes and abnormal gene expression, which lead to cancer.

What would an intern work on?

Each person has inherited a set of chromosomes from his/her mother and another set of chromosomes from his/her father. Each gene is therefore present in two copies and for most genes both copies are simultaneously expressed ("biallelically expressed" genes). However, a sizeable subset of our genes are expressed from only one copy while the other is packaged in a tight protein coat ("monoallelically expressed" genes). Two bioinformatics summer students in the Marahrens lab used data mining in conjunction with statistical methods to show that monoallelically expressed genes are surrounded by markedly higher concentrations of a certain repetitive sequence called LINE-1 than genes that are biallelically expressed (Allen, E et al. 2003, "High concentrations of long interspersed nuclear elements sequence distinguish monoallelically-expressed genes." Proc Natl Acad Sci U S A. 100(17): 9940-9945). They also identified a second set of monoallelically expressed genes that are not flanked by high concentrations of LINE-1 sequence and whose flanking sequence characteristics have not yet been determined. The Marahrens lab seeks a summer student to help define the flanking sequence characteristics of this second group of monoallelically expressed genes.

Some regions of our chromosomes, called fragile sites, are inherently unstable and can be the sites of chromosome stretching and breakage. Some fragile sites are associated with certain human diseases (e.g., fragile X) or with cancers, while others only break when cells grown in the laboratory are stressed with certain chemicals. Many fragile sites can be localized to the sites of unusually large (>1Mb) genes. The Marahrens lab suspects that the repetitive sequence content in and around these huge genes is responsible for the large size of the genes and for the chromosome fragility. We seek a summer student to perform data mining in conjunction with statistical methods to determine whether this is the case.




Matteo Pellegrini, Ph.D (UCLA)

www.ucla.edu

What does the Pellegrini lab study?

Our lab is interested in the development of computational approaches to study protein interaction networks. These networks allow us to elucidate the components of protein complexes along with signal transduction and metabolic pathways. Our approach is to integrate varied data that sheds light on protein interactions. This data includes scientific literature, expression microarrays, high throughput protein-protein interaction screens and assays that measure cellular phenotypes. Our research focuses on the development of techniques to integrate, store and present this data along with algorithms to analyze molecular profiling data using networks. Our long term goal is to develop models that quantitatively predict the outcome of perturbations in cells.

What might an intern in the Pellegrini lab do?

An intern would work with existing network datasets to develop novel analysis techniques. This analysis would be applied to relevant molecular profiling datasets. The techniques would be aimed at elucidating which pathways and complexes are activated in a particular experiment.




Bruce Shapiro, Ph.D (Cal Tech/JPL)

www.caltech.edu

I. What does the Computable Plant Research Group do?

The computable plant project (http://www.computableplant.org) is developing an end-to-end research and modeling framework for the Arabidopsis SAM (shoot apical meristem), the growing tip of a plant stem. We observe several cell-type-specific markers for growth and differentiation in real time in live plants with a dedicated confocal laser scanning microscope. Using a combination of computational modeling and image processing techniques, we then infer specific transduction pathway data and fit mathematical models to produce two- and three-dimensional visualizations of the growing SAM, including phyllotactic and leaf vein development. Our aim is to determine the spatial and temporal relationships between different genes in an effort to understand how primordial cells are progressively specified.

What would an intern work on?

(1) Computational Biology. We have developed and are developing several different signal-transduction and gene-regulatory network models of meristem development. To determine the efficacy of these models, various parameters need to be tuned, simulations run, and the resulting predictions compared with observed data. The networks may need to be modified by adding, removing, or changing some of the biochemical reactions involved. Students would coordinate their work with wet-bench researchers at Caltech, but no actual lab work would be involved. Students should have the following background: calculus through partial derivatives; some understanding of differential equations (a full course is not necessary, just knowing what they are and having an interest in solving them numerically); enough background in biology to know what a signal transduction network is; and a desire to work intensively doing computer modeling. Modeling will be done with Cellerator (http://www.cellerator.info), but no prior knowledge of Mathematica is required. The work would be done at Caltech.

(2) Bioinformatics. Help us to develop the first database of computational models of plant development. Do a literature search of existing models of meristem development and implement them in a format compatible with Sigmoid (http://sigmoid.sourceforge.net/), SBML (http://sbml.org), and/or Cellerator (http://www.cellerator.info).

The student would work in coordination with researchers at Caltech as well as other researchers in the lab of Dr. Eric Mjolsness of UC Irvine, although this project is separate from the Sigmoid projects that Dr. Mjolsness is mentoring at Irvine. This work would also be done at Caltech, because the student would need to work closely with plant biologists to identify and understand the models he/she is collecting and implementing. Students should have the following background: some upper-level course work in biology; some prior programming experience is helpful.

II. What does the SBML Project Research Group do?

The Systems Biology Markup Language (SBML) Project (http://sbml.org) is an ongoing effort to develop a common format for software tools to exchange models of biological networks. SBML is a lingua franca, an intermediate format that is not necessarily used internally in any software package, but instead is used to communicate models between different software tools. SBML is today supported by over 75 software packages and databases worldwide.

What would an intern work on?

(1) We have recently created a prototype database of models in SBML format. It is a project in collaboration between the SBML Team at Caltech, and the University of Hertfordshire and the European Bioinformatics Institute in England. We need to populate this database with models. These models need to be created by hand using software that is SBML compatible, based on published articles describing the models and the behavior expected from simulating them. This is a good opportunity to gain basic experience in creating and simulating computational models of biological networks. The work would involve reading journal articles describing mathematical models of biological systems (these models are usually expressed as sets of differential equations) and recreating the models using a tool such as MathSBML, a Mathematica-based SBML model development and simulation environment. No prior knowledge of Mathematica is necessary. A suitable student would have had upper level course work in biology and have a basic understanding of biochemistry, as well as calculus.

(2) We have been developing several software tools for working with SBML files. One of these is libSBML, an open-source software library for programming SBML support into software applications. LibSBML is written in C++ and is portable to Linux, Windows, and MacOS. There are several fairly self-contained projects that could be undertaken by an intern with C++ programming experience. No experience with biology is necessary for these projects:

a) One of our goals for libSBML is to provide support for checking units on quantities in SBML models. For example, in a correct SBML model, the units on the different quantities and mathematical expressions should work out to be consistent. There exist software libraries for performing unit translations, and so this work would involve incorporating such a library into libSBML, and adding code to perform unit conversions and checking on models in SBML.

b) XML files such as SBML are basically ASCII text, and they can become quite large. One way to reduce the size of these files is to compress them. We would like to implement SBML compression into libSBML, such that an application could read and write compressed SBML files. This work would involve taking a compression library (such as those in gzip or similar tools), incorporating it into libSBML, and providing facilities for compressing and uncompressing SBML data streams.

c) (This one is shorter than the others and could be combined with item 'd' below.) We currently do not provide libSBML installations in the various package formats such as RPM, MacOS .dmg, and Debian .deb. Another project would be to develop scripts and procedures for creating packages in the different formats easily, so that when new libSBML versions are released, we can quickly produce installations in the many different formats and for the different architectures supported by libSBML.

d) LibSBML currently does not support the Cygwin environment on Windows. We would like to provide Cygwin support, by modifying the existing Makefile system and making other changes as necessary in order to support compiling and installing libSBML under Cygwin.
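
The unit checking described in item (a) can be illustrated independently of libSBML. The following Python sketch is not libSBML code and does not use any libSBML API; the `Unit` class, the `consistent` helper, and the example rate law are hypothetical and only illustrate the idea. Each unit is represented as a map from base dimension to exponent, and a rate equation is accepted only if both sides reduce to the same map.

```python
class Unit:
    """A unit as a map from base dimension name to integer exponent (hypothetical helper)."""

    def __init__(self, **dims):
        # Drop zero exponents so that equivalent units compare equal.
        self.dims = {name: exp for name, exp in dims.items() if exp != 0}

    def __mul__(self, other):
        # Multiplying units adds the exponents of each base dimension.
        total = dict(self.dims)
        for name, exp in other.dims.items():
            total[name] = total.get(name, 0) + exp
        return Unit(**total)

    def __truediv__(self, other):
        # Dividing units subtracts exponents: multiply by the inverse unit.
        inverse = Unit(**{name: -exp for name, exp in other.dims.items()})
        return self * inverse

    def __eq__(self, other):
        return self.dims == other.dims


def consistent(lhs, rhs):
    """Return True when both sides of an equation carry the same units."""
    return lhs == rhs


MOLE = Unit(mole=1)
LITRE = Unit(litre=1)
SECOND = Unit(second=1)

concentration = MOLE / LITRE               # mole / litre
rate = concentration / SECOND              # mole / (litre * second)
k_second_order = LITRE / (MOLE * SECOND)   # litre / (mole * second)
k_first_order = Unit() / SECOND            # 1 / second

# Assumed mass-action rate law v = k * [A] * [B]: only a second-order
# rate constant makes the units on both sides agree.
ok = consistent(rate, k_second_order * concentration * concentration)
bad = consistent(rate, k_first_order * concentration * concentration)
print(ok, bad)
```

A real implementation would also have to handle scale factors and unit conversions (for example millimole versus mole), which this sketch deliberately leaves out.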




Janet Sinsheimer, Ph.D (UCLA)

www.ucla.edu

What type of research does our group do?

We develop statistical methodology for mapping complex trait and disease genes. Our research has shown that specific interactions between maternal and fetal genes may produce an adverse prenatal environment that increases the risk of complex diseases in later life. For example, we found that matching between maternal and fetal human leukocyte antigen (HLA) genes can lead to increased risk of schizophrenia. This maternal-fetal genotype interaction is consistent with the immunological intolerance hypothesis that posits that HLA similarity between mother and fetus fails to stimulate the blocking antibodies that normally protect the fetus from the mother's immune response.

What would an intern work on?

The major histocompatibility complex region is made up of many genes that play crucial roles in the immune system. One set of genes in this region is the HLA genes, which encode cell-surface antigen-presenting proteins. Currently our models treat the effect of each HLA gene independently. HLA genes in different locations on the chromosome often code for proteins that have similar epitopes (amino acid sequences that are targets for antibodies), and so they should not be treated as independent. This same issue of shared epitopes arising from different genes occurs when matching an organ transplantation recipient to a donor. Models and software, such as HLAMatchMaker, have been developed to predict the best matches based on determining the amino acid polymorphisms in the protein that are accessible to antibodies. The better the matching, the lower the chances of organ rejection. You would help us apply and adapt the HLAMatchMaker algorithm to our study of maternal-fetal genotype incompatibility so that we can better predict which mother-child genotype combinations lead to the greatest risk of disease.




Mike Thompson, Ph.D & Todd Yeates, Ph.D (UCLA)

www.ucla.edu

Summer Research Projects

One potential summer research project would involve extending the phylogenetic profile method to analyze gene duplication events and their correspondence to function. As gene duplication is often posited to be a source of evolutionary novelty in expanding the functional repertoire of organisms, it would be interesting to construct and analyze phylogenetic profiles for duplicated pairs of genes. The duplicated gene in a given pair may have gained a new function (and so the pair is retained in some present day genomes) or it may not have gained a new function and subsequently been lost (the pair is not retained in some organisms). This pattern of co-occurrence of the gene pair, when compared to those of other single proteins, may prove useful in understanding the evolution of protein function via this duplication mechanism.
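
As a rough illustration of the idea, a phylogenetic profile can be treated as a presence/absence vector across a fixed list of genomes, and the profile of a duplicated pair as the genomes in which both copies are retained. The Python sketch below makes these assumptions explicit; the genome names, gene sets, and the choice of Hamming distance for comparing profiles are hypothetical and are not taken from the lab's method.

```python
def profile(present_in, genomes):
    """Presence/absence tuple (1 or 0) for one gene across the given genomes."""
    return tuple(1 if genome in present_in else 0 for genome in genomes)


def pair_profile(profile_a, profile_b):
    """A duplicated pair counts as retained only where both copies are present."""
    return tuple(a & b for a, b in zip(profile_a, profile_b))


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


# Hypothetical genomes and gene occurrences, for illustration only.
genomes = ["genome1", "genome2", "genome3", "genome4", "genome5"]
gene_a = profile({"genome1", "genome2", "genome3", "genome5"}, genomes)
gene_a_dup = profile({"genome1", "genome3", "genome5"}, genomes)
other_gene = profile({"genome1", "genome3", "genome5"}, genomes)

pair = pair_profile(gene_a, gene_a_dup)
print(pair, hamming(pair, other_gene), hamming(gene_a, other_gene))
```

A small distance between the pair profile and the profile of a single protein would suggest that the pair and that protein tend to be retained or lost together.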

A second potential summer research project would involve a survey and analysis of Q/N-rich proteins. These are proteins or subsegments of proteins that contain an abundance of glutamine and asparagine residues. In yeast, some of these proteins have an experimentally demonstrated capacity to become prions. In humans, as in the case of Huntington's disease, there is an association of these proteins with neurodegenerative disorders. While some work has aimed at identifying these Q/N-rich proteins, it would be interesting to see if they can be classified into subtypes based on composition or possible repeating motifs within the Q/N-rich region.
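
One simple way to locate candidate Q/N-rich regions is a sliding-window scan over the protein sequence. The Python sketch below is only an illustration: the function name, the window length, the threshold, and the toy sequence are all assumptions chosen for the example, not values used by the lab.

```python
def qn_rich_windows(sequence, window=20, min_fraction=0.5):
    """Return (start, fraction) for each window whose Q+N fraction meets the threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        # Fraction of residues in this window that are glutamine (Q) or asparagine (N).
        fraction = sum(residue in "QN" for residue in segment) / window
        if fraction >= min_fraction:
            hits.append((start, fraction))
    return hits


# Toy sequence (not a real protein): a ten-residue Q/N stretch between two flanks.
toy = "MSDAGRK" + "QNQQNNQQQN" + "LVEKAGT"
hits = qn_rich_windows(toy, window=10, min_fraction=0.75)
print([start for start, _ in hits])
```

Overlapping hits could then be merged into regions, and each region summarized by its Q-to-N ratio or scanned for repeats as a starting point for defining subtypes.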




Robert Vellanoweth, Ph.D (CSULA)

www.calstatela.edu

What type of research does our group do?

The long-range goal of our research is to understand the biochemical and genetic changes in Arabidopsis leaves that accompany the initiation of flowering. The genetic and biochemical changes in leaves eventually progress to senescence, or programmed aging, the ordered dismantling of leaf macromolecules and their recovery by developing seeds. It is unclear what mechanism(s) alter leaf gene expression after the transition to flowering, nor is the biochemical basis for leaf-flower communication known. Lack of such knowledge is an important problem because it impedes our ability to intervene in the process in order to improve the quality of the agricultural products upon which we all depend for healthful living.

The objectives of our work are to determine the role of an oxylipin pathway in mediating floral transition-associated leaf gene expression changes, to investigate the transcriptional regulation of novel lipid transfer protein genes that are induced at the floral transition, and to identify the potential lipid signal bound by the lipid transfer proteins in vivo. Our central hypothesis is that an oxylipin signal, with a capacity to alter gene expression, is synthesized in leaves at the time of the floral transition and is bound by an extracellular lipid transfer protein found in leaves, stems and flowers that mediates communication between leaves and reproductive tissues.

We formulated this hypothesis on the basis of our previous work and strong preliminary data. We have demonstrated a floral transition-associated transient decline in leaf ascorbate peroxidase (APx) activity, a transient increase in H2O2, and a concomitant increase in 13-lipoxygenase-catalyzed chloroplast lipid peroxidation, the first enzymatic step in the oxylipin biosynthetic pathway. These biochemical events are associated with the induction and repression of several dozen genes, a number of which share a statistically significant, novel, cis-regulatory element in their promoters. Two of these genes encode lipid transfer proteins that are transiently expressed in leaves, and continuously expressed in reproductive structures. At least one member of the Arabidopsis lipid transfer protein family has been shown to be involved in systemic signaling in response to pathogenic infection. The rationale for our research is that by focusing on oxylipin regulators and the proteins that potentially transport them, we can begin to uncover the molecular basis for leaf-flower communication.

Lipid-based signaling pathways are known in plant and animal systems. In plants, they control responses to wounding and pathogen infection and in animals they generate regulatory molecules that govern many cellular processes, including the inflammatory response. Lipid signals also appear to coordinate developmental events, including the fundamental transition from vegetative to reproductive growth in plants. The molecules may be transported via lipid binding proteins, several plant forms of which contribute to adverse allergic responses in humans. An analysis of the molecular details of lipid signaling important to the early post-reproductive genetic program in Arabidopsis leaves can inform our understanding of the sequence of events leading finally to programmed aging, or senescence. Senescence in field-grown, economically important crops decreases seed yield significantly. Genetic interventions to delay the senescence of photosynthetic structures could result in higher yields per plant. Thus, the study proposed here is valuable in multiple ways to human health. First, a mechanistic elucidation of the age-related process that couples reproduction with altered gene expression in somatic plant tissues will give direction to researchers analyzing post-reproductive, age-dependent changes in humans. Second, studies on the in vivo function and binding properties of plant lipid transfer proteins will add to our understanding of this important group of food allergens and may lead to pharmaceutical strategies to prevent an overactive allergic response in affected individuals. Finally, by piecing together the series of gene expression changes that lead to leaf senescence, a purposeful delay of senescence during grain development in agricultural crops becomes possible in order to improve the quality of the world's food and therefore the health of the world's people.

What would an intern work on? (Background & overview)

In collaboration with B Wold and JL Riechmann at Caltech, a former graduate student worked as a SoCalBSI fellow to develop Cistematic, a bioinformatics tool to allow for the detection and identification of novel cis-regulatory elements. We used Cistematic to analyze our microarray data in order to identify both genes and potential cis-elements that would be good candidates for in vivo validation. The rationale was to use the tools to minimize the chances that we will be fruitlessly looking at the wrong genes in vivo. We found a number of motifs in both the up-regulated and down-regulated genes in our microarray data. We have focused on a particular motif (BOLT1), with consensus TCMYCAYYTCCMMC, produced by the program MEME, that shows up in front of an unknown protein (At1g59640), a hypothetical protein (At3g42250), and two non-specific lipid-transfer protein precursors (At5g59310 & At5g59320). The position-specific probability matrix for BOLT1 matched an additional two genes in our dataset of top twenty up-regulated genes: a leucoanthocyanidin dioxygenase (At3g55970, in its 5' UTR) and the vegetative storage protein (At5g24780). Interestingly, the second half of BOLT1 matches At3g55970 perfectly, while the first half of BOLT1 matches At5g24780 perfectly. This suggests that we may be witnessing two separate motifs that bind to proteins cooperatively. Cistematic also fetched a list of the 37 other genes that have a perfect match to the BOLT1 consensus, which we are currently reviewing.
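
The BOLT1 consensus is written with IUPAC nucleotide codes, where M stands for A or C and Y stands for C or T. As a minimal sketch of what a "perfect match to the consensus" means, the Python code below expands a consensus into a regular expression and scans the forward strand of a sequence. It does not use Cistematic or MEME; the function names and the toy sequence are made up for illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_pattern(consensus):
    """Expand an IUPAC consensus string into a regular-expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_matches(consensus, sequence):
    """Return start positions of all (possibly overlapping) forward-strand matches."""
    # A lookahead lets the scan report matches that overlap one another.
    pattern = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


BOLT1 = "TCMYCAYYTCCMMC"

# Toy sequence (not from the Arabidopsis genome): one BOLT1 instance after "GGA".
toy = "GGA" + "TCACCACTTCCACC" + "TTG"
print(find_matches(BOLT1, toy))
```

Scanning the reverse strand would require running the same search on the reverse complement of the sequence, and scoring partial matches would require the position-specific probability matrix rather than the consensus string.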

A SoCalBSI fellow will continue studies on co-regulated Arabidopsis genes using Cistematic to uncover conserved cis-regulatory DNA elements that contribute to the observed co-regulation. We plan to extend our studies beyond the LTPs described above to the large set of duplicated genes in Arabidopsis.




Barbara Wold, Ph.D (Cal Tech)

www.caltech.edu

Information on two internships in the Wold lab:

1. A mix of biology and computing around the software package called Cistematic that would involve definition of cis-regulatory cohorts - protein binding sites for proteins that work together to regulate transcription of a given gene or genes. Some introductory information regarding Cistematic is available at: http://cistematic.caltech.edu

2. Working on further development of BioHub or CompClust.

Information for BioHub can be found at: http://woldlab.caltech.edu/biohub

Information for CompClust can be found at: http://nar.oxfordjournals.org/cgi/content/full/33/8/2580
