Curriculum 2007

Curriculum updates are made throughout the program.

Last Update: 07/06/07

Lecture Title | Date & Time | html | .ppt | Workshop | References
Program Overview | Mon 6/18, 9:00-Noon | view | download | |
Molecular Life Science Review | Mon 6/18, 1:00-2:30pm | | | |
Python I | Mon 6/18, 2:30-4:00pm | view | download | |
Literature Databases | Tues 6/19, 9:00-Noon | view | download | |
Sequence Comparisons | Tues 6/19, 1:00-4:00pm | view | download | workshop | reference
Python II | Wed 6/20, 9:00-Noon | view | download | |
Python III | Wed 6/20, 1:00-4:00pm | view | download | |
Scoring Matrices | Thurs 6/21, 9:00-Noon | view | download | workshop | reference
Sequence Databases | Thurs 6/21, 1:00-4:00pm | view | download | workshop |
Professional Development | Fri 6/22, 9:00-Noon | | | |
Research Site Visit | Fri 6/22, 1:00-4:00pm | | | |
Statistics I | Mon 6/25, 9:00-Noon | | pdf | assignment | solution
Statistics II | Mon 6/25, 1:00-4:00pm | | pdf | assignment | solution
Statistics III | Tues 6/26, 9:00-Noon | | pdf | | references
Alignment Methods | Tues 6/26, 1:00-4:00pm | view | download | workshop |
Longest Common Substring Algorithm (LCS) | Wed 6/27, 9:00-Noon | view | download | |
Python IV | Wed 6/27, 1:00-4:00pm | view | download | |
Local and Global Alignment | Thurs 6/28, 9:00-Noon | view | download | project |
Space Efficient Alignment Algorithms | Thurs 6/28, 1:00-4:00pm | view | download | |
Multiple Sequence Alignment | Fri 6/29, 9:00-Noon | view | download | workshop | references
Research Seminar | Fri 6/29, 1:00-5:00pm | | | |
Protein Structure Prediction | Mon 7/2, 9:00-Noon | view | download | workshop | references
Ethics of the Human Genome | Mon 7/2, 1:00-4:00pm | | | |
Protein Structure Manipulation | Tues 7/3, 9:00-Noon | view | download | workshop | reference
Proteome Analysis | Tues 7/3, 1:00-4:00pm | view | download | |
Holiday | Wed 7/4, Entire Day | | | |
Microarrays I | Thurs 7/5, 9:00-Noon | view | download | |
Microarrays II | Thurs 7/5, 1:00-4:00pm | view | download | |
Microarrays III | Fri 7/6, 9:00-Noon | view | download | |
Sequence Alignment Project Programming | Fri 7/6, 1:00-5:00pm | | | |

Workshop

Literature Databases - Workshop

Workshop A:
Set up a cubby account on a biological topic of interest. Show the instructor the cubby account you set up. Subscribe to NCBI News.

Workshop B:
Go to the OMIM website and search for "Breast cancer". Follow the link to MIM #113705. What does the light bulb represent? What do the links with the numbers lead to?

Obtain the following information on BRCA1:


Sequence Comparisons - Workshop

Workshop 2A

Consider the sequence GAACTCATACGAATTCACGTCAGCCCATCG

Use a window of 3 nucleotides and slide the window 1 nucleotide at a time. Calculate the %GC as a function of nucleotide number. Draw a graph. Change the window to 2 nucleotides. Then overlay the two plots. You may use Excel. Print out the spreadsheet and the graph.
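The sliding-window calculation can also be sketched in Python, the language used in this program. This is only an illustration, not a replacement for the spreadsheet: the function name gc_percent_windows is made up for this example, and reporting each window by its 1-based start position is an assumption (you may prefer to report the window centre).

```python
def gc_percent_windows(seq, window, step=1):
    """Return (start_position, %GC) for every window; positions are 1-based."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start + 1, 100.0 * gc_count / window))
    return results


sequence = "GAACTCATACGAATTCACGTCAGCCCATCG"
for window in (3, 2):
    profile = gc_percent_windows(sequence, window)
    print(f"window={window}: {len(profile)} points")
    for position, percent in profile:
        print(f"  {position}\t{percent:.1f}")
```

The two printed columns can be pasted into Excel to draw and overlay the plots.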

Given the following sequence: PLSQETFSDLWKLLPENNVLSP use the Kyte/Doolittle Hydropathy scale and a sliding window of 7 amino acids to construct a hydropathy plot.
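A minimal Python sketch of the same hydropathy calculation is shown here for checking hand or spreadsheet results. The dictionary holds the published Kyte-Doolittle values; the function name hydropathy_profile and the choice to assign each window average to the window's centre residue are assumptions of this sketch.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return (centre_position, mean hydropathy) per window; 1-based positions."""
    protein = protein.upper()
    half = window // 2
    profile = []
    for start in range(len(protein) - window + 1):
        chunk = protein[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in chunk) / window
        profile.append((start + half + 1, mean))
    return profile


for position, value in hydropathy_profile("PLSQETFSDLWKLLPENNVLSP"):
    print(f"{position}\t{value:.2f}")
```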

Find the protein sequence for bacteriorhodopsin. Make sure you obtain the full-length sequence. Find the Kyte-Doolittle Hydropathy program software at the Expasy Tools website (TGREASE). Perform Kyte-Doolittle analysis of bacteriorhodopsin. Compare the plot to the one displayed in lecture today. Are there differences in the two plots? If so, why?

 

Workshop 2B

A. Download PAM250 and PAM40 from the Internet and print them out. What are the differences between the two matrices? Why do you see these differences?

B. Download BLOSUM80 and BLOSUM45. What are the differences between the two matrices? Why do you see these differences? Download the BLOSUM45 matrix as a text file onto the C drive and name the file BLOSUM45.

C. Import the human p53 (Accession number AAH03596) and squid p53 (Accession number AAA98563) sequences from the protein databases at NCBI onto your hard drive in FASTA format. This can be accomplished by changing the display format on the ENTREZ screen to FASTA. Then highlight the entry and copy it to the clipboard. Open NotePad on your local hard drive. Paste each sequence into a separate document and save the documents in a folder named "sequence" on the C drive. Name the documents p53_human and p53_squid.

Type dotter c:\sequence\p53_human.txt c:\sequence\p53_human.txt RETURN

Do you detect some parallel lines? Why? What does the greyramp tool do?

Capture the image and save it.

D. Type dotter c:\sequence\p53_human.txt c:\sequence\p53_squid.txt RETURN

What is the difference between the human vs. human comparison and the human vs. squid comparison?

E. Change the matrix from BLOSUM62 (default) to BLOSUM45 by typing the following command: dotter -M c:\BLOSUM45.txt c:\sequence\p53_human.txt c:\sequence\p53_squid.txt. Which scoring matrix produces more lines? Why?

F. If you have time, obtain the mouse p53 sequence and compare it to human p53. According to your analysis, is human p53 more similar to mouse p53 or to squid p53? Are some subregions within the human p53 protein more conserved than others?
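To see what a dot-matrix program computes, here is a deliberately simplified Python sketch. It is not Dotter's algorithm: Dotter scores windows with a substitution matrix such as BLOSUM62, whereas this toy version only counts identical residues in a window. The function names, the window size, and the short toy sequence are all assumptions chosen for illustration.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) window start positions (0-based) where the
    windows of seq_a and seq_b share at least min_matches identical residues."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(dots, len_a, len_b, window=3):
    """Draw the dots as text: rows follow seq_a, columns follow seq_b."""
    rows = []
    for i in range(len_a - window + 1):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len_b - window + 1)))
    return "\n".join(rows)


toy = "ACDEACDE"  # toy sequence containing an internal repeat
print(render(dot_plot(toy, toy), len(toy), len(toy)))
```

Comparing a sequence with itself always fills the main diagonal; any additional diagonal runs come from repeated segments within the sequence.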


Scoring Matrices - Workshop

A. Download PAM250 and PAM40 from the Internet and print them out. What are the differences between the two matrices? Why do you see these differences?

B. Download BLOSUM80 and BLOSUM45. What are the differences between the two matrices? Why do you see these differences?

C. Obtain the mouse p53 sequence and compare it to human p53. According to your analysis, do you detect more similarity between human and mouse or between human and squid? Are some subregions within the human p53 protein more conserved than others?

D. The p53 protein is known to have a certain number of conserved domains. Use the Dotter program and a series of p53 proteins from different species (at least 5 proteins) to determine the number of conserved domains and the boundaries of the conserved domains. In reporting the boundaries, use the human sequence numbering as the standard.

E. Create your own scoring matrix that shows fairly good results with simple sequences using the Dotter program. For example, you may choose to create a scoring matrix that gives high marks only for similarities between charged residues and low marks for other amino acid similarities. An easy way to do this is to use one of the BLOSUM or PAM matrices as a template and change the numbers slightly. Show that your scoring matrix works for some simple polypeptides that are 50 amino acids in length (you may choose your own sequences). Explain the purpose of your scoring matrix and justify the numbers you choose.
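One possible starting point for exercise E is sketched below in Python. It builds a matrix from scratch rather than from a BLOSUM or PAM template, and every number in it is an arbitrary assumption that you would need to justify or replace. The text layout produced by format_matrix imitates the whitespace-separated style of BLOSUM/PAM text files; check against a matrix file that Dotter already accepts before relying on it.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
ACIDIC, BASIC = set("DE"), set("KR")


def charge_matrix(identical_charged=10, same_charge=5, identical_other=1, mismatch=-1):
    """Build a symmetric matrix that rewards matches between charged residues."""
    matrix = {}
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            same_sign = {a, b} <= ACIDIC or {a, b} <= BASIC
            if a == b and same_sign:
                value = identical_charged
            elif same_sign:
                value = same_charge
            elif a == b:
                value = identical_other
            else:
                value = mismatch
            matrix[(a, b)] = value
    return matrix


def format_matrix(matrix):
    """Return the matrix as text: a header row of letters, then one row per residue."""
    lines = ["  " + " ".join(f"{b:>3}" for b in AMINO_ACIDS)]
    for a in AMINO_ACIDS:
        lines.append(a + " " + " ".join(f"{matrix[(a, b)]:>3}" for b in AMINO_ACIDS))
    return "\n".join(lines)


def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix scores of two equal-length peptides, position by position."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


matrix = charge_matrix()
print(format_matrix(matrix))
print(score_ungapped("DEKA", "EDRA", matrix))
```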


Sequence Databases - Workshop

Workshop A:

  1. Use the following accession number to find a sequence in the nucleotide database: Z68198
  2. Print out the first 1000 nucleotides of the sequence and decipher the open reading frame(s) in this segment from the annotations listed in the flat file.
  3. Underline the open reading frame(s) and use an arrow to give the direction of the coding strand (5' to 3'). (Remember to distinguish between template strand and coding strand.) Show to the instructor.

Workshop B:

Genes in eukaryotes are often organized into exons and introns, which require post-transcriptional splicing to produce a mature mRNA with a contiguous open reading frame for translation. This broken organization can make gene identification difficult in eukaryotes and particularly in higher eukaryotes with complex gene organization. Prediction of many genes and their organization has been based on similarity searches between genomic sequence and known protein amino acid sequences and/or genomic sequence and the corresponding full-length cDNAs or even ESTs.

Below is a small portion (~1,500 bp) of the C. elegans genome:

ATTTTTAAAAATGTACAAAATCAAACGCCCTACAAATCATGTGTGTGAAGAAGAATAATAACTAACATAT CTATTTATATTTACCGAATAAATATATATTCATCAATTAACCTGAAGAACAAACGAATTCGGCTACAGGC GTCGATCAGTCTCGAATCTAGTAACAACAAGAGAGCAATACGAAAACCGGTAAATCAATAGGGGGAAGCG AAACAGTAGGTACAAATTGGAGGGGAAGCACCAATACATTAGGTGGGGGGTACGACTTGAAAAATGAGCT GATTTTCGAATAGTTAAAGCGATGATCGTGTCCGAAAAACAGTTCATTTTTCAAGACAACATTGAGACTG GGAGTACGGGGAAGCTCATTTACGGTGAGAGGAATTGGTGAGATCTTTAGAATATGCTTAAGGAGTTGGG GTGGCTGGAGAAGTTCCTGTAGCCTCCGTGCCGGGATTCGATGGAGAAGTCGTTGCGGCTGGTCCCTTTT CCTTCACTGGTGCTGGATCCTTGGCTGGAAGACATATGCGTGGCTTGACAGTCGATGAGGTGCGAGCCGA CGAGTCCTTGTGAACTTCGTATCTGGAAATATTTTACTTAGATAGCAAATACTAAAATTGTAAAATTACC TCAAAATCTCAGTATCCGGAATGCTCAATTTCTGCTTCAAAACCTGTCCGATGCGAAGATTGACATCATC GCGAGTAGCATCACGAGTCCACAAGGAAACCTTGTCACCCTTTTGACGAACATTCACGACAGCTCCGCAG ATGTAGTCTCCGTACTCGTCGAATTGCTCTCCAACAATAGCCATCAACAGCTCCAACCAGTAGTGATCGA GCAATTGCGTTCTTCTCTGAAGCTTCTATGATTCATTGAATAAAATATATTTCTCAAAACGTACTTGCTT ATCGACAACAACCAACCAACGTCCACCTTGAACGTTGTTGACGTCCTCCCACATTGGCTTGATTCCTTCC TTGAACAAGTAATAATCGGATCCCCAGTTCAATCCTCCGGCAGACTGAATGTGATTGTACAGCGACCAGA AGTCCTCGACAGTGTCGAAAAGTGAAACCATCTGGAAAAAATCGATAAAAGACGTATTTAAAAATCTTCT ACCTTCAGACAATCCTCCCATTCCTTGTTACGGTCAGCTTTCAAGTACCAGAGAGCCCAGCGATTCTGGA GGGGGTGTCTGGTGAGAAGCTCTGGAGGAACTGAAGCATCGGACGCATTCACATCGCCGGAAGCTGACAA TGCTTTGTTTTCCGCTACGGATGTGCTCATTTAGCTGAAAATAGGTAATATTATATACGATTAGAGCTCG GAAAACGATAAAATAGAGAAGAGTATGAATTTGGTTCAAATAACTCGGATTTTATAGGAAATTTTGTTTT ACTGCACATTTTCGGCTAGTTTCCAAGCTTTTTAGATTTTTCAAGTGTAATTGGTAACATCGGGCACAAT AAATTGATATTAAAGCTTGGAAAACAATAAA

Use this sequence to carry out the following:

Conduct a blastx (BLAST) search of the Swiss-Prot database to identify regions of this sequence that encode amino acids with similarity to known proteins in the database. You can reach the blastx software through ExPASy: http://www.expasy.org/sprot/.

Write down the answers to the following questions.

B1a. What does the blastx algorithm do with your nucleotide sequence before searching SwissProt for matches?

B1b. Why is the difference between the numbers at the left and right ends of any “Query” line always larger than the difference between the numbers at the left and right ends of the corresponding “Sbjct” line?

B2. From the blastx output, to what protein does this region of genomic DNA have significant similarity?

B3. How can you tell that the coding regions for the amino acids within the matched protein are not located within a single contiguous region of the genomic DNA? (There is more than one way to tell.)

B4. How many separate regions of the genomic DNA align with the highest scoring match in the output?

B5. What essential feature of the organization of the gene does the above information provide?

B6. Note the numbering of the sequences in the alignments. Does the genomic (query) sequence progress in the same direction as the database amino acid sequences in the alignments? In other words, is it the same orientation (below):

1.................................114 = query

61...............................98 = subject

or opposite orientation (below):

1.................................114 = query

98...............................61 = subject

B7. What does the orientation of the sequences in the alignment relative to each other tell you about the gene orientation relative to the sequence that was used as the query sequence?

http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html

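Reverse complement (each space-separated block below is the reverse complement of the corresponding block of the sequence above; read the blocks from last to first to obtain the reverse complement of the whole sequence)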

ATATGTTAGTTATTATTCTTCTTCACACACATGATTTGTAGGGCGTTTGATTTTGTACATTTTTAAAAAT GCCTGTAGCCGAATTCGTTTGTTCTTCAGGTTAATTGATGAATATATATTTATTCGGTAAATATAAATAG CGCTTCCCCCTATTGATTTACCGGTTTTCGTATTGCTCTCTTGTTGTTACTAGATTCGAGACTGATCGAC AGCTCATTTTTCAAGTCGTACCCCCCACCTAATGTATTGGTGCTTCCCCTCCAATTTGTACCTACTGTTT CAGTCTCAATGTTGTCTTGAAAAATGAACTGTTTTTCGGACACGATCATCGCTTTAACTATTCGAAAATC CCCAACTCCTTAAGCATATTCTAAAGATCTCACCAATTCCTCTCACCGTAAATGAGCTTCCCCGTACTCC AAAAGGGACCAGCCGCAACGACTTCTCCATCGAATCCCGGCACGGAGGCTACAGGAACTTCTCCAGCCAC TCGGCTCGCACCTCATCGACTGTCAAGCCACGCATATGTCTTCCAGCCAAGGATCCAGCACCAGTGAAGG GGTAATTTTACAATTTTAGTATTTGCTATCTAAGTAAAATATTTCCAGATACGAAGTTCACAAGGACTCG GATGATGTCAATCTTCGCATCGGACAGGTTTTGAAGCAGAAATTGAGCATTCCGGATACTGAGATTTTGA CTGCGGAGCTGTCGTGAATGTTCGTCAAAAGGGTGACAAGGTTTCCTTGTGGACTCGTGATGCTACTCGC TCGATCACTACTGGTTGGAGCTGTTGATGGCTATTGTTGGAGAGCAATTCGACGAGTACGGAGACTACAT AAGCAAGTACGTTTTGAGAAATATATTTTATTCAATGAATCATAGAAGCTTCAGAGAAGAACGCAATTGC GGAAGGAATCAAGCCAATGTGGGAGGACGTCAACAACGTTCAAGGTGGACGTTGGTTGGTTGTTGTCGAT TCTGGTCGCTGTACAATCACATTCAGTCTGCCGGAGGATTGAACTGGGGATCCGATTATTACTTGTTCAA AGAAGATTTTTAAATACGTCTTTTATCGATTTTTTCCAGATGGTTTCACTTTTCGACACTGTCGAGGACT TCCAGAATCGCTGGGCTCTCTGGTACTTGAAAGCTGACCGTAACAAGGAATGGGAGGATTGTCTGAAGGT TTGTCAGCTTCCGGCGATGTGAATGCGTCCGATGCTTCAGTTCCTCCAGAGCTTCTCACCAGACACCCCC CGAGCTCTAATCGTATATAATATTACCTATTTTCAGCTAAATGAGCACATCCGTAGCGGAAAACAAAGCA AAAACAAAATTTCCTATAAAATCCGAGTTATTTGAACCAAATTCATACTCTTCTCTATTTTATCGTTTTC ATTGTGCCCGATGTTACCAATTACACTTGAAAAATCTAAAAAGCTTGGAAACTAGCCGAAAATGTGCAGT TTTATTGTTTTCCAAGCTTTAATATCAATTT
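If the web tool is unavailable, a reverse complement can be computed with a few lines of Python. This is a minimal sketch: the function name is illustrative, it handles only the letters A, C, G and T, and it strips whitespace first so that a sequence pasted across several lines is treated as one molecule.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string, ignoring whitespace."""
    joined = "".join(dna.split())           # drop spaces and line breaks
    return joined.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATTTTTAAAAATGTACAAAATC"))
```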


Alignment Methods - Workshop

By hand, perform local alignment on the following two sequences:

Use the BLOSUM45 matrix for scoring with the default gap penalty value of -5. Determine the highest path score and the percent similarity for the local alignment with the highest path score.
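To check a hand-computed local alignment, a small Smith-Waterman sketch in Python can help. Two things here are assumptions and differ from the exercise: the scoring function is a simple match/mismatch scheme (+5/-4) instead of BLOSUM45, and the two sequences are invented. To reproduce the exercise, pass a function that looks up BLOSUM45 values instead. The gap penalty is linear (-5 per gap position), as in the exercise.

```python
def simple_score(a, b):
    """Placeholder scoring (an assumption): +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def smith_waterman(a, b, score=simple_score, gap=-5):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return 100.0 * same / len(aligned_a)


best, top, bottom = smith_waterman("AAAHEAGWW", "PPHEAGPP")
print(best, top, bottom, percent_identity(top, bottom))
```

Note that percent_identity counts only identical columns; percent similarity, as asked for in the exercise, would also count columns with a positive matrix score.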

Go to NCBI Website. Open BLAST Website. Perform BLASTP on the sequence:

SSSVPSQKTYQGSYGFRLGFLHSGTAKSVT

Use default settings. Record the total number of letters in the NR database and the E-value for the top hit. Next, change the database to SwissProt and repeat. Record the total number of letters in the SwissProt database and the E-value for the top hit. Compare the E-values for the two searches and explain why they are different. Do you think that you obtained these hits by chance? Can you find another database for which the search gives a hit with a lower E-value?

What other parameters can you change to give you a score with a lower E-value?
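The relationship between score, search space and E-value can be explored numerically. The sketch below uses the standard Karlin-Altschul relation expressed with bit scores, E = m × n × 2^(-S'), where m is the query length, n is the number of letters in the database and S' is the bit score. The bit score and database sizes in the example are made-up numbers, and BLAST itself uses effective (edge-corrected) lengths, so reported E-values will not match this exactly.

```python
def expected_hits(bit_score, query_length, database_letters):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_letters * 2.0 ** (-bit_score)


query_length = len("SSSVPSQKTYQGSYGFRLGFLHSGTAKSVT")
# Hypothetical database sizes, for illustration only.
for name, letters in (("large database", 1_000_000_000), ("small database", 100_000_000)):
    print(name, expected_hits(40.0, query_length, letters))
```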


Multiple Sequence Alignment - Workshop

Workshop A
Obtain the following protein sequences from any public database you wish and align them using the CLUSTALW program. The protein sequences are: Human MDM2, Hamster MDM2, Murine MDM2, Xenopus MDM2 and Zebrafish MDM2. The human MDM2 sequence is approximately 490 amino acids in length. Name three areas that are structurally conserved amongst these orthologs.

According to the Guide Tree, which two sequences have the highest similarity?

Perform CLUSTALW just using Human MDM2 and Zebrafish MDM2 sequences. Does the human/zebrafish alignment in this run differ from the human/zebrafish alignment obtained in the first run?

Explain why.

Workshop B
There exists a paralog of MDM2. Obtain the sequences of the human paralog MDMX and the mouse paralog MDMX and perform multiple sequence alignment again together with the original five MDM2 sequences. Give the domains (in amino acid number ranges) that are highly conserved within sequences of this entire family. Use the human MDM2 amino acid numbers as the reference when explaining the ranges that are conserved.

Workshop C
The quagga was an African animal that is now extinct. It looked partly like a horse and partly like a zebra. In 1872, the last living quagga was photographed. More recently, mitochondrial DNA was obtained from a museum quagga specimen and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?


Protein Structure Prediction - Workshop

Workshop A: Check whether the BLIMPS program in the BLOCK Searcher can predict the function of PTEN (NP_000305). PTEN is an abbreviation for phosphatase and tensin homolog. Obtain the protein sequence from the protein database at NCBI. Convert the sequence to FASTA format. Paste the sequence into the window in the BLOCK Searcher (http://blocks.fhcrc.org/blocks_search.html). Determine the major function based on the BLOCK Searcher output. Find out the actual function of PTEN by performing a text search for PTEN in the OMIM database. Did the BLOCK Searcher help assess the function of PTEN?

Workshop B: Find the complete amino acid sequence of human p53 and perform a secondary structure prediction with Psi-PRED, GOR, Chou-Fasman, or another secondary structure prediction algorithm.

Workshop C: Calculate the Q3 value of a secondary structure prediction program. Go to the Protein Data Bank and obtain the record for the p53 crystal structure (1TSR). There are three identical p53 polypeptides in the record, named A, B and C. Choose one of the polypeptides for this exercise. In the remarks section of the record you will find an assignment of secondary structure for many of the amino acids. These are labeled either "helix" or "sheet". For amino acids in the structure that were not assigned to the "helix" or "sheet" class, assume that they adopt a "coil" structure. Create a line graph that places the amino acid sequence in one row and the known secondary structure of each amino acid from the PDB record in the next row. Next, use the predicted structure from Workshop B. Create a third row on the line graph that shows the predicted structure. The 1TSR file contains only the DNA binding domain of p53, so you will be able to cover only about half of the protein. If you can, obtain other portions of p53 whose structure has been solved from the Protein Data Bank (in different records) and fill in the regions of the second row that were not covered by the 1TSR record. Show the instructor the line figure and calculate the percent accuracy of the Psi-PRED prediction. A hypothetical example is shown below.

Percent accuracy: 14/15 × 100 = 93.3%
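The percent accuracy (Q3) can be computed from two equal-length strings of states. In the Python sketch below, H, E and C stand for helix, sheet and coil; the function name and the 15-residue strings are hypothetical and were chosen only to reproduce a 14/15 case.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose predicted state (H, E or C) matches the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


observed = "CCHHHHHHCCEEEEC"   # hypothetical states taken from a PDB record
predicted = "CCHHHHHCCCEEEEC"  # hypothetical prediction with one wrong residue
print(f"Q3 = {q3_accuracy(observed, predicted):.1f}%")
```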



Protein Structure Manipulation - Workshop

Download the structure coordinates for the 1HEW protein from the PDB onto your hard drive.
Follow the tutorial for viewing protein structures at:

http://www.usm.maine.edu/~rhodes/SPVTut/text/SPdbVTut.html

You can start the tutorial at Section 2--Windows and help. If you are already advanced in the manipulation of protein structures you may attempt to predict a 3D model of the protein given a primary sequence. The tutorial is located at:

http://expasy.org/spdbv/text/modeling.htm

Choose a disease-related protein that you studied. Obtain its amino acid sequence. Estimate its location on a 2-D-gel with the following tool:

http://www.expasy.ch/cgi-bin/2dregion-for-seq.pl

Can you justify its position on the gel? Show the instructor. Perform an in silico trypsin digest of your protein. Obtain the monoisotopic masses of the peptides and submit them to the Mascot server. Determine the minimum number of peptide masses necessary to give a correct identification of your protein. Is there a relationship between the number of peptide masses used and the accuracy of the prediction? How many significant figures are needed to get a positive identification? Is there an optimal mass range for positive identification?

Repeat this exercise using Protein Prospector. Which software program is better at retrieving your results?
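For comparison with the web tools, an in silico trypsin digest can be sketched in Python. The sketch assumes the commonly used rule (cleave after K or R unless the next residue is P), no missed cleavages, and unmodified residues; the test peptide is invented. The masses are standard monoisotopic residue masses, and one water is added per peptide to give the neutral peptide mass.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def trypsin_digest(protein):
    """Cleave after K or R, except when the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


for peptide in trypsin_digest("AKRPGRC"):  # invented test peptide
    print(peptide, round(monoisotopic_mass(peptide), 4))
```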




References

Literature Databases - References

http://adonis.creighton.edu/hsl/Searching/Medline-Fields.html
http://cmgm.stanford.edu/classes/csuh/literature/
http://hml.org/WWW/class/help/medcite.html
http://www.nlm.nih.gov/mesh/meshhome.html


Sequence Comparisons - References

Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 2001
http://www.infobiogen.fr/doc/dotter.html
Segurado et al., EMBO Reports, 4 1048-1053, 200


Scoring Matrices - References

http://cnx.org/content/m11062/latest/
http://life.umd.edu/labs/delwiche/bsci348s/lec/PAMmatrices.html


Sequence Databases - References

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Pevsner, J., Bioinformatics and Functional Genomics, Wiley-Liss, Hoboken, NJ, 2003
Baxevanis and Ouellette, Bioinformatics 2nd Ed, Wiley-Interscience, New York, 2001
Misener and Krawetz, Bioinformatics Methods and Protocols, Humana Press, Totowa, NJ, 2000


Multiple Sequence Alignment - References

Pevsner, Bioinformatics and Functional Genomics, Wiley-Liss, Hoboken, 2003.
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998.
Feng and Doolittle, J. Mol. Evol. 25, 351-360, 1987.
Thompson et al., Nuc. Acids Res. 22, 4673-4690, 1994.

 


Protein Structure Manipulation - References

http://swissmodel.expasy.org//course/text/chapter6.htm
Wang et al., Nucleic Acids Research 28, 243-245, 2000.
http://www.iucr.org/iucr-top/comm/ccom/School96/pdf/sb.pdf
http://www.usm.maine.edu/~rhodes/SPVTut/index.html