Curriculum 2007
Curriculum updates are made throughout the program.
Last Update: 07/06/07
| Lecture Title | Date & Time | html | .ppt | Workshop | References |
| Program Overview | Mon 6/18 9:00-Noon | view | download | | |
| Molecular Life Science Review | Mon 6/18 1:00-2:30pm | | | | |
| Python I | Mon 6/18 2:30-4:00pm | view | download | | |
| Literature Databases | Tues 6/19 9:00-Noon | view | download | | |
| Sequence Comparisons | Tues 6/19 1:00-4:00pm | view | download | workshop | reference |
| Python II | Wed 6/20 9:00-Noon | view | download | | |
| Python III | Wed 6/20 1:00-4:00pm | view | download | | |
| Scoring Matrices | Thurs 6/21 9:00-Noon | view | download | workshop | reference |
| Sequence Databases | Thurs 6/21 1:00-4:00pm | view | download | workshop | |
| Professional Development | Fri 6/22 9:00-Noon | | | | |
| Research Site Visit | Fri 6/22 1:00-4:00pm | | | | |
| Statistics I | Mon 6/25 9:00-Noon | assignment | solution | | |
| Statistics II | Mon 6/25 1:00-4:00pm | assignment | solution | | |
| Statistics III | Tues 6/26 9:00-Noon | references | | | |
| Alignment Methods | Tues 6/26 1:00-4:00pm | view | download | workshop | |
| Longest Common Substring Algorithm (LCS) | Wed 6/27 9:00-Noon | view | download | | |
| Python IV | Wed 6/27 1:00-4:00pm | view | download | | |
| Local and Global Alignment | Thurs 6/28 9:00-Noon | view | download | project | |
| Space Efficient Alignment Algorithms | Thurs 6/28 1:00-4:00pm | view | download | | |
| Multiple Sequence Alignment | Fri 6/29 9:00-Noon | view | download | workshop | references |
| Research Seminar | Fri 6/29 1:00-5:00pm | | | | |
| Protein Structure Prediction | Mon 7/2 9:00-Noon | view | download | workshop | references |
| Ethics of the Human Genome | Mon 7/2 1:00-4:00pm | | | | |
| Protein Structure Manipulation | Tues 7/3 9:00-Noon | view | download | workshop | reference |
| Proteome Analysis | Tues 7/3 1:00-4:00pm | view | download | | |
| Holiday | Wed 7/4 Entire Day | | | | |
| Microarrays I | Thurs 7/5 9:00-Noon | view | download | | |
| Microarrays II | Thurs 7/5 1:00-4:00pm | view | download | | |
| Microarrays III | Fri 7/6 9:00-Noon | view | download | | |
| Sequence Alignment Project Programming | Fri 7/6 1:00-5:00pm | | | | |
Workshop
Literature Databases - Workshop
Workshop A:
Set up a cubby account on a biological topic of interest. Show the instructor the cubby account you set up. Subscribe to NCBI News.
Workshop B:
Go to OMIM Website and type "Breast cancer". Link to MIM#113705. What does the light bulb represent? What do the links with the numbers lead to?
Obtain the following information on BRCA1:
Sequence Comparisons - Workshop
Workshop 2A
Consider the sequence GAACTCATACGAATTCACGTCAGCCCATCG
Use a window of 3 nucleotides and slide the window 1 nucleotide at a time. Calculate the %GC as a function of nucleotide number. Draw a graph. Change the window to 2 nucleotides and repeat. Then overlay the two plots. You may use Excel. Print out the spreadsheet and the graph.
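If you would rather check your spreadsheet with the Python covered in this course, the sliding-window calculation can be sketched as follows. This is only an illustration: the function name `gc_percent_windows` is made up for this example, and each window is reported at the position of its first nucleotide (an assumption; you may prefer the window center).

```python
def gc_percent_windows(seq, window, step=1):
    """Return (start_position, percent_GC) for each window; positions are 1-based."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start + 1, 100.0 * gc_count / window))
    return results


SEQ = "GAACTCATACGAATTCACGTCAGCCCATCG"  # sequence from Workshop 2A
gc_w3 = gc_percent_windows(SEQ, 3)     # window of 3 nucleotides
gc_w2 = gc_percent_windows(SEQ, 2)     # window of 2 nucleotides

# Print the first few points of the 3-nucleotide profile.
for position, percent in gc_w3[:5]:
    print(position, round(percent, 1))
```

The two lists can be pasted into a spreadsheet or plotted with any graphing tool to overlay the two profiles.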
Given the following sequence: PLSQETFSDLWKLLPENNVLSP use the Kyte/Doolittle Hydropathy scale and a sliding window of 7 amino acids to construct a hydropathy plot.
Find the protein sequence for bacteriorhodopsin. Make sure you obtain the full-length sequence. Find the Kyte-Doolittle Hydropathy program software at the Expasy Tools website (TGREASE). Perform Kyte-Doolittle analysis of bacteriorhodopsin. Compare the plot to the one displayed in lecture today. Are there differences in the two plots? If so, why?
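The hydropathy plot for the short peptide can also be checked with a few lines of Python. The sketch below uses the published Kyte-Doolittle values and averages them over a window of 7 residues; the names `KD_SCALE` and `hydropathy_profile` are made up for this example, and the value is assigned to the center residue of each window (an assumption; web tools may normalize or label the axis differently).

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Return (center_position, mean_hydropathy) for each window; positions are 1-based."""
    seq = seq.upper()
    profile = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        mean = sum(KD_SCALE[aa] for aa in chunk) / window
        center = start + window // 2 + 1
        profile.append((center, mean))
    return profile


PEPTIDE = "PLSQETFSDLWKLLPENNVLSP"  # sequence from Workshop 2A
profile = hydropathy_profile(PEPTIDE, window=7)

for center, mean in profile:
    print(center, round(mean, 2))
```

A 22-residue peptide gives 16 windows of 7 residues, so the plot covers residues 4 through 19.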
Workshop 2B
A. Download PAM250 and PAM40 from the internet and print them out. What are the differences between the two matrices? Why do you see these differences?
B. Download BLOSUM80 and BLOSUM45. What are the differences between the two matrices? Why do you see these differences? Download the BLOSUM45 matrix as a text file onto the C drive and name the file BLOSUM45.
C. Import the human p53 (Accession number AAH03596) and squid p53 (Accession number AAA98563) sequences from the protein databases at NCBI onto your hard drive in FASTA format. This can be accomplished by changing the display format on the ENTREZ screen to FASTA. Then highlight the entry and copy onto clipboard. Open NotePad on your local hard drive. Paste each sequence into a separate document and save them in a folder named "sequence" on C drive. Name the documents p53_human and p53_squid.
Type dotter c:\sequence\p53_human.txt c:\sequence\p53_human.txt RETURN
Do you detect some parallel lines? Why? What does the greyramp tool do?
Capture the image and save it.
D. Type dotter c:\sequence\p53_human.txt c:\sequence\p53_squid.txt RETURN
What is the difference between the human vs. human comparison and the human vs. squid comparison?
E. Change the matrix from BLOSUM62 (default) to BLOSUM45 by typing the following command: dotter -M c:BLOSUM45.txt c:\sequence\p53_human.txt c:\sequence\p53_squid.txt. Which scoring matrix produces more lines? Why?
F. If you have time, obtain the mouse p53 sequence and compare it to human p53. According to your analysis, is human p53 more similar to mouse p53 or to squid p53? Are some subregions within the human p53 protein more conserved than others?
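To see why a sequence compared with itself can show parallel lines, it may help to look at a stripped-down dot plot. The sketch below is not Dotter's algorithm: it scores each window by simple identity rather than with a BLOSUM matrix, and the function name `dot_plot` and the toy sequence are made up for this example.

```python
def dot_plot(a, b, window=3, min_matches=3):
    """Return the set of (i, j) window start positions (0-based) that score as a match."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# A toy sequence containing an internal repeat, compared with itself.
toy = "ABCDABCD"
dots = dot_plot(toy, toy)

# Print the plot as text: '*' marks a matching window.
for i in range(len(toy) - 2):
    print("".join("*" if (i, j) in dots else "." for j in range(len(toy) - 2)))
```

The main diagonal comes from the sequence matching itself; the shorter off-diagonal runs come from the repeated segment.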
Scoring Matrices - Workshop
A. Download PAM250 and PAM40 from the internet and print them out. What are the differences between the two matrices? Why do you see these differences?
B. Download BLOSUM80 and BLOSUM45. What are the differences between the two matrices? Why do you see these differences?
C. Obtain the mouse p53 sequence and compare it to human p53. According to your analysis, do you detect more similarity between human and mouse or between human and squid? Are some subregions within the human p53 protein more conserved than others?
D. The p53 protein is known to have a certain number of conserved domains. Use the Dotter program and a series of p53 proteins from different species (at least 5 proteins) to determine the number of conserved domains and the boundaries of the conserved domains. In reporting the boundaries, use the human sequence number as the standard.
E. Create your own scoring matrix that shows fairly good results with simple sequences using the Dotter program. For example, you may choose to create a scoring matrix that only gives high marks for charged residues similarities but low marks for other amino acid similarities. An easy way to do this is to use one of the BLOSUM or PAM matrices as a template and change the numbers slightly. Show that your scoring matrix works for some simple polypeptides that are 50 amino acids in length (you may choose your own sequences). Explain the purpose of your scoring matrix and explain your justification for the numbers you choose.
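One way to prototype the idea in part E before editing a matrix file is to build the scores in Python. The sketch below is only an illustration of a charge-focused matrix: the score values, the choice of K and R as positive and D and E as negative (histidine is left out), and the function names are all assumptions made for this example, and the result is not in Dotter's matrix file format.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POSITIVE = set("KR")   # assumed positively charged residues
NEGATIVE = set("DE")   # assumed negatively charged residues


def build_charge_matrix():
    """Return a dict mapping (residue, residue) pairs to illustrative scores."""
    matrix = {}
    for x in AMINO_ACIDS:
        for y in AMINO_ACIDS:
            x_charged = x in POSITIVE or x in NEGATIVE
            y_charged = y in POSITIVE or y in NEGATIVE
            if x_charged and y_charged:
                same_sign = (x in POSITIVE) == (y in POSITIVE)
                if x == y:
                    score = 5      # identical charged residues
                elif same_sign:
                    score = 3      # same charge, different residue
                else:
                    score = -3     # opposite charges
            elif x == y:
                score = 1          # identical uncharged residues score low
            else:
                score = 0
            matrix[(x, y)] = score
    return matrix


def ungapped_score(a, b, matrix):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    return sum(matrix[(x, y)] for x, y in zip(a, b))


charge_matrix = build_charge_matrix()
print(ungapped_score("KDA", "RDG", charge_matrix))
```

Whatever numbers you choose, be ready to justify them as the exercise asks.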
Workshop A:
Workshop B:
Genes in eukaryotes are often organized into exons and introns, which require post-transcriptional splicing to produce a mature mRNA with a contiguous open reading frame for translation. This broken organization can make gene identification difficult in eukaryotes and particularly in higher eukaryotes with complex gene organization. Prediction of many genes and their organization has been based on similarity searches between genomic sequence and known protein amino acid sequences and/or genomic sequence and the corresponding full-length cDNAs or even ESTs.
Below is a small portion (~1,500 bp) of the C. elegans genome:
ATTTTTAAAAATGTACAAAATCAAACGCCCTACAAATCATGTGTGTGAAGAAGAATAATAACTAACATAT CTATTTATATTTACCGAATAAATATATATTCATCAATTAACCTGAAGAACAAACGAATTCGGCTACAGGC GTCGATCAGTCTCGAATCTAGTAACAACAAGAGAGCAATACGAAAACCGGTAAATCAATAGGGGGAAGCG AAACAGTAGGTACAAATTGGAGGGGAAGCACCAATACATTAGGTGGGGGGTACGACTTGAAAAATGAGCT GATTTTCGAATAGTTAAAGCGATGATCGTGTCCGAAAAACAGTTCATTTTTCAAGACAACATTGAGACTG GGAGTACGGGGAAGCTCATTTACGGTGAGAGGAATTGGTGAGATCTTTAGAATATGCTTAAGGAGTTGGG GTGGCTGGAGAAGTTCCTGTAGCCTCCGTGCCGGGATTCGATGGAGAAGTCGTTGCGGCTGGTCCCTTTT CCTTCACTGGTGCTGGATCCTTGGCTGGAAGACATATGCGTGGCTTGACAGTCGATGAGGTGCGAGCCGA CGAGTCCTTGTGAACTTCGTATCTGGAAATATTTTACTTAGATAGCAAATACTAAAATTGTAAAATTACC TCAAAATCTCAGTATCCGGAATGCTCAATTTCTGCTTCAAAACCTGTCCGATGCGAAGATTGACATCATC GCGAGTAGCATCACGAGTCCACAAGGAAACCTTGTCACCCTTTTGACGAACATTCACGACAGCTCCGCAG ATGTAGTCTCCGTACTCGTCGAATTGCTCTCCAACAATAGCCATCAACAGCTCCAACCAGTAGTGATCGA GCAATTGCGTTCTTCTCTGAAGCTTCTATGATTCATTGAATAAAATATATTTCTCAAAACGTACTTGCTT ATCGACAACAACCAACCAACGTCCACCTTGAACGTTGTTGACGTCCTCCCACATTGGCTTGATTCCTTCC TTGAACAAGTAATAATCGGATCCCCAGTTCAATCCTCCGGCAGACTGAATGTGATTGTACAGCGACCAGA AGTCCTCGACAGTGTCGAAAAGTGAAACCATCTGGAAAAAATCGATAAAAGACGTATTTAAAAATCTTCT ACCTTCAGACAATCCTCCCATTCCTTGTTACGGTCAGCTTTCAAGTACCAGAGAGCCCAGCGATTCTGGA GGGGGTGTCTGGTGAGAAGCTCTGGAGGAACTGAAGCATCGGACGCATTCACATCGCCGGAAGCTGACAA TGCTTTGTTTTCCGCTACGGATGTGCTCATTTAGCTGAAAATAGGTAATATTATATACGATTAGAGCTCG GAAAACGATAAAATAGAGAAGAGTATGAATTTGGTTCAAATAACTCGGATTTTATAGGAAATTTTGTTTT ACTGCACATTTTCGGCTAGTTTCCAAGCTTTTTAGATTTTTCAAGTGTAATTGGTAACATCGGGCACAAT AAATTGATATTAAAGCTTGGAAAACAATAAA
Use this sequence to carry out the following:
Conduct a blastx search (BLAST) of the Swiss-Prot protein database to identify regions in this sequence that encode amino acids with similarity to known proteins in this database. You can get to the blastx software through Expasy: http://www.expasy.org/sprot/.
Write down the answers to the following questions.
B1a. What does the blastx algorithm do with your nucleotide sequence before searching SwissProt for matches?
B1b. Why is the difference between the numbers at the left and right ends of any “Query” line always larger than the difference between the numbers at the left and right ends of the corresponding “Sbjct” line?
B2. From the blastx output, to what protein does this region of genomic DNA have significant similarity?
B3. How can you tell that the coding regions for the amino acids within the matched protein are not located within a single contiguous region of the genomic DNA? (There is more than one way to tell.)
B4. How many separate regions of the genomic DNA align with the highest scoring match in the output?
B5. What essential feature of the organization of the gene does the above information provide?
B6. Note the numbering of the sequences in the alignments. Does the genomic (query) sequence progress in the same direction as the database amino acid (subject) sequences in the alignments? In other words, is it the same orientation (below):
1.................................114 = query
61...............................98 = subject
or opposite orientation (below):
1.................................114 = query
98...............................61 = subject
B7. What does the orientation of the sequences in the alignment relative to each other tell you about the gene orientation relative to the sequence that was used as the query sequence?
http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html
Reverse complement
ATATGTTAGTTATTATTCTTCTTCACACACATGATTTGTAGGGCGTTTGATTTTGTACATTTTTAAAAAT GCCTGTAGCCGAATTCGTTTGTTCTTCAGGTTAATTGATGAATATATATTTATTCGGTAAATATAAATAG CGCTTCCCCCTATTGATTTACCGGTTTTCGTATTGCTCTCTTGTTGTTACTAGATTCGAGACTGATCGAC AGCTCATTTTTCAAGTCGTACCCCCCACCTAATGTATTGGTGCTTCCCCTCCAATTTGTACCTACTGTTT CAGTCTCAATGTTGTCTTGAAAAATGAACTGTTTTTCGGACACGATCATCGCTTTAACTATTCGAAAATC CCCAACTCCTTAAGCATATTCTAAAGATCTCACCAATTCCTCTCACCGTAAATGAGCTTCCCCGTACTCC AAAAGGGACCAGCCGCAACGACTTCTCCATCGAATCCCGGCACGGAGGCTACAGGAACTTCTCCAGCCAC TCGGCTCGCACCTCATCGACTGTCAAGCCACGCATATGTCTTCCAGCCAAGGATCCAGCACCAGTGAAGG GGTAATTTTACAATTTTAGTATTTGCTATCTAAGTAAAATATTTCCAGATACGAAGTTCACAAGGACTCG GATGATGTCAATCTTCGCATCGGACAGGTTTTGAAGCAGAAATTGAGCATTCCGGATACTGAGATTTTGA CTGCGGAGCTGTCGTGAATGTTCGTCAAAAGGGTGACAAGGTTTCCTTGTGGACTCGTGATGCTACTCGC TCGATCACTACTGGTTGGAGCTGTTGATGGCTATTGTTGGAGAGCAATTCGACGAGTACGGAGACTACAT AAGCAAGTACGTTTTGAGAAATATATTTTATTCAATGAATCATAGAAGCTTCAGAGAAGAACGCAATTGC GGAAGGAATCAAGCCAATGTGGGAGGACGTCAACAACGTTCAAGGTGGACGTTGGTTGGTTGTTGTCGAT TCTGGTCGCTGTACAATCACATTCAGTCTGCCGGAGGATTGAACTGGGGATCCGATTATTACTTGTTCAA AGAAGATTTTTAAATACGTCTTTTATCGATTTTTTCCAGATGGTTTCACTTTTCGACACTGTCGAGGACT TCCAGAATCGCTGGGCTCTCTGGTACTTGAAAGCTGACCGTAACAAGGAATGGGAGGATTGTCTGAAGGT TTGTCAGCTTCCGGCGATGTGAATGCGTCCGATGCTTCAGTTCCTCCAGAGCTTCTCACCAGACACCCCC CGAGCTCTAATCGTATATAATATTACCTATTTTCAGCTAAATGAGCACATCCGTAGCGGAAAACAAAGCA AAAACAAAATTTCCTATAAAATCCGAGTTATTTGAACCAAATTCATACTCTTCTCTATTTTATCGTTTTC ATTGTGCCCGATGTTACCAATTACACTTGAAAAATCTAAAAAGCTTGGAAACTAGCCGAAAATGTGCAGT TTTATTGTTTTCCAAGCTTTAATATCAATTT
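The reverse complement above can also be produced in Python, and the same few lines extend to the six conceptual translations (three reading frames on each strand) that a translated search such as blastx compares against a protein database. This is a minimal sketch using the standard genetic code; the function names are made up for this example, and incomplete or ambiguous codons are reported as X (an assumption).

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


example = "ATGGCCTAA"  # short made-up sequence, not taken from the workshop
print(reverse_complement(example))
for frame, protein in six_frame_translations(example).items():
    print(frame, protein)
```

Note that the sequences in this workshop are printed with spaces; remove them before passing a sequence to these functions.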
By hand, perform local alignment on the following two sequences:
Use the BLOSUM45 matrix for scoring with the default gap penalty value of -5. Determine the highest path score and the percent similarity for the local alignment with the highest path score.
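After working the alignment by hand, you can check the mechanics of the dynamic programming table with a short Python sketch of local alignment (Smith-Waterman). To keep it small, the sketch scores pairs with a simple match/mismatch rule instead of BLOSUM45, so its scores will not equal the ones you compute by hand; only the linear gap penalty of -5 follows the exercise. The function name, the match and mismatch values, and the test sequences are assumptions made for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-5):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; scores never drop below zero in a local alignment.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTT", "GGACGGG")
print(score)
print(top)
print(bottom)
```

To use a substitution matrix instead, replace the match/mismatch rule with a lookup of the pair score in the matrix.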
Go to NCBI Website. Open BLAST Website. Perform BLASTP on the sequence:
SSSVPSQKTYQGSYGFRLGFLHSGTAKSVT
Use default settings. Record the total number of letters in the NR database and the E-value for the top hit. Next, change the database to SwissProt and repeat. Record the total number of letters in the SwissProt database and the E-value for the top hit. Compare the E-values for the two searches and explain why they are different. Do you think that you obtained these hits by chance? Can you find another database for which the search gives a hit with a lower E-value?
What other parameters can you change to give you a score with a lower E-value?
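When comparing the two searches, it can help to recall how the E-value relates to the bit score S' and the sizes of the query (m) and the database (n): E = m × n × 2^(-S'). The sketch below evaluates this relation directly. It is a simplification: BLAST uses effective lengths that are somewhat shorter than the raw lengths, and the function name and the numbers in the example are made up.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same hit (same bit score) against a small and a large database (made-up sizes).
small_db = expected_hits(bit_score=40, query_length=30, database_length=1e8)
large_db = expected_hits(bit_score=40, query_length=30, database_length=1e9)
print(small_db, large_db)
```

With the bit score held fixed, the E-value grows in proportion to the number of letters searched.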
Multiple Sequence Alignment - Workshop
Workshop A
Obtain the following protein sequences from any public database you wish and align them using the CLUSTALW program. The protein sequences are: Human MDM2, Hamster MDM2, Murine MDM2, Xenopus MDM2 and Zebrafish MDM2. The human MDM2 sequence is approximately 490 amino acids in length. Name three areas that are structurally conserved amongst these orthologs.
According to the Guide Tree, which two sequences have the highest similarity?
Perform CLUSTALW just using Human MDM2 and Zebrafish MDM2 sequences. Does the human/zebrafish alignment in this run differ from the human/zebrafish alignment obtained in the first run?
Explain why.
Workshop B
There exists a paralog of MDM2. Obtain the sequences of the human paralog MDMX and the mouse paralog MDMX and perform multiple sequence alignment again together with the original five MDM2 sequences. Give the domains (in amino acid number ranges) that are highly conserved within sequences of this entire family. Use the human MDM2 amino acid numbers as the reference when explaining the ranges that are conserved.
Workshop C
The quagga was an African animal that is now extinct. It looked partly like a horse and partly like a zebra. In 1872, the last living quagga was photographed. More recently, mitochondrial DNA was obtained from a museum quagga specimen and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?
Protein Structure Prediction - Workshop
Workshop A-Check to see if the BLIMPs program in the BLOCK searcher can predict the function of PTEN (NP_000305). PTEN is an abbreviation for a protein called the phosphatase and tensin homolog. Obtain the protein sequence from the protein database at NCBI. Convert the sequence to FASTA format. Paste the sequence into the window in BLOCK Searcher (http://blocks.fhcrc.org/blocks_search.html). Determine the major function based on the BLOCK Searcher output. Find out the actual function of PTEN by performing a text search for PTEN in the OMIM database. Did the BLOCK searcher help assess the function of PTEN?
Workshop B-Find the complete amino acid sequence of human p53 and perform a secondary structure prediction with Psi-PRED, GOR, Chou-Fasman, or another secondary structure prediction algorithm.
Workshop C-Calculation of Q3 value of secondary structure prediction program. Go to the Protein Data Bank and obtain the record for the p53 crystal structure (1TSR). There are three identical p53 polypeptides in the record named A, B and C. Choose one of the polypeptides for this exercise. In the remarks section of the record you will observe an assignment of secondary structure for many of the amino acids. These will either be named "helix" or "sheet". For amino acids in the structure that were not assigned to the "helix" or "sheet" class, assume that they adopt a "coil" structure. Create a line graph that places the amino acid sequence in one row and the known secondary structure from the PDB record for each amino acid in the next row. Next, use the predicted structure from Workshop B. Create a third row on the line graph that shows the predicted structure. The 1TSR file only contains the DNA binding domain of p53, so you will only be able to cover about half of the protein. If you can, obtain other portions of p53 where the structure has been solved from the Protein Data Bank (in different records) and fill in those regions in the second row that were not obtained in the 1TSR record. Show the instructor the line figure and calculate the percent accuracy of the Psi-PRED prediction. A hypothetical example is shown below.
Percent accuracy: 14/15 × 100 ≈ 93.3%
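The percent accuracy (Q3) is the fraction of residues whose predicted state (helix, sheet, or coil) agrees with the observed state. A minimal Python sketch is given below; the function name and the two 15-residue strings are hypothetical and were chosen only so that 14 of 15 positions agree, as in the example above (H = helix, E = sheet, C = coil).

```python
def q3_accuracy(predicted, observed):
    """Percent of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


observed = "CCHHHHHHCCEEEEC"   # hypothetical assignment from a PDB record
predicted = "CCHHHHHHCCEEEEE"  # hypothetical prediction; last residue differs
print(round(q3_accuracy(predicted, observed), 1))
```

Only count residues for which a solved structure is available; residues outside the crystallized region have no observed state to compare against.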
Protein Structure Manipulation - Workshop
Download structure coordinates for 1HEW protein from PDB onto your hard drive.
Follow the tutorial for viewing protein structures at:
http://www.usm.maine.edu/~rhodes/SPVTut/text/SPdbVTut.html
You can start the tutorial at Section 2--Windows and help. If you are already advanced in the manipulation of protein structures you may attempt to predict a 3D model of the protein given a primary sequence. The tutorial is located at:
http://expasy.org/spdbv/text/modeling.htm
Choose a disease-related protein that you studied. Obtain its amino acid sequence. Estimate its location on a 2-D-gel with the following tool:
http://www.expasy.ch/cgi-bin/2dregion-for-seq.pl
Can you justify its position on the gel? Show the instructor. Perform an in silico trypsin-mediated digest of your protein. Obtain the monoisotopic masses of the peptides and submit the data to the Mascot server. Determine the minimum number of peptide masses necessary to give a correct identification of your protein. Is there a relationship between the number of peptide masses used and the accuracy of the prediction? What is the number of significant figures needed to get a positive identification? Is there an optimal mass range for positive identification?
Repeat this exercise using Protein Prospector. Which software program is better at retrieving your results?
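To see what an in silico tryptic digest involves, the steps can be sketched in Python. This is a simplified illustration, not a replacement for the web tools: it assumes trypsin cleaves after K or R except when the next residue is P, allows no missed cleavages or modifications, and reports neutral monoisotopic masses (sum of residue masses plus one water). The function names and the example sequence are made up.

```python
# Monoisotopic residue masses (in Da) for the 20 standard amino acids.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # monoisotopic mass of H2O


def trypsin_digest(protein):
    """Cleave after K or R unless the next residue is P; no missed cleavages."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


for pep in trypsin_digest("AKRPGKM"):  # short made-up sequence
    print(pep, round(peptide_mass(pep), 4))
```

Search engines usually expect the masses of singly protonated peptides, so check which mass type the server asks for before submitting values.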
References
Literature Databases - References
http://adonis.creighton.edu/hsl/Searching/Medline-Fields.html
http://cmgm.stanford.edu/classes/csuh/literature/
http://hml.org/WWW/class/help/medcite.html
http://www.nlm.nih.gov/mesh/meshhome.html
Sequence Comparisons - References
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 2001
http://www.infobiogen.fr/doc/dotter.html
Segurado et al., EMBO Reports, 4 1048-1053, 200
Scoring Matrices - References
http://cnx.org/content/m11062/latest/
http://life.umd.edu/labs/delwiche/bsci348s/lec/PAMmatrices.html
Sequence Databases - References
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Pevsner, J., Bioinformatics and Functional Genomics, Wiley-Liss, Hoboken, NJ, 2003
Baxevanis and Ouellette, Bioinformatics 2nd Ed, Wiley-Interscience, New York, 2001
Misener and Krawetz, Bioinformatics Methods and Protocols, Humana Press, Totowa, NJ, 2000
Multiple Sequence Alignment - References
Pevsner, Bioinformatics and Functional Genomics, Wiley-Liss, Hoboken, 2003.
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998.
Feng and Doolittle, J. Mol. Evol. 25, 351-360, 1987.
Thompson et al., Nuc. Acids Res. 22, 4673-4690, 1994.
Protein Structure Manipulation - References
http://swissmodel.expasy.org//course/text/chapter6.htm
Wang et al., Nucleic Acids Research 28, 243-245, 2000.
http://www.iucr.org/iucr-top/comm/ccom/School96/pdf/sb.pdf
http://www.usm.maine.edu/~rhodes/SPVTut/index.html