Curriculum 2009

 

Lecture Title | Date & Time | html | .ppt | Workshop | References
Intro to Bioinformatics | Mon, 6/22, 9:00 AM | view | download | |
Python 01 | Mon, 6/22, 1:00 PM | view | download | |
Molecular Life Science Review | Mon, 6/22, 1:00 PM | view | download | |
Sequence Databases | Tues, 6/23, 9:00 AM | view | download | workshop |
Literature Databases | Tues, 6/23, 9:00 AM | view | download | |
Sequence Comparisons | Tues, 6/23, 1:00 PM | view | download | workshop |
Python 02 | Wed, 6/24, 9:00 AM | view | download | |
Python 03 | Wed, 6/24, 1:00 PM | view | download | |
Database Searching - Scoring Matrices | Thurs, 6/25, 9:00 AM | view | download | workshop |
Professional Development | Thurs, 6/25, 1:00 PM | | | |
Statistics | Fri, 6/26, 1:00 PM | view | download | |
Intro to NCBI and BLAST | Fri, 6/26, 1:00 PM | view | download | |
Longest Common Substring Algorithm (LCS) | Mon, 6/29, 9:00 AM | view | download | |
Global and Local Alignment (Sequence Alignment Algorithms) | Mon, 6/29, 1:00 PM | view | download | |
Space Efficient Alignment Algorithms | Tues, 6/30, 9:00 AM | view | download | |
Protein Structure Prediction | Tues, 6/30, 1:00 PM | view | download | |
Protein Structure Manipulation | Wed, 7/01, 9:00 AM | view | download | workshop |
Proteome Analysis | Wed, 7/01, 1:00 PM | view | download | workshop |
Lecture Title | Thurs, 7/02, 9:00 AM | view | download | |
Lecture Title | Thurs, 7/02, 9:00 AM | view | download | |
Holiday Observed | Fri, 7/03, 9:00 AM | | | |
Holiday Observed | Fri, 7/03, 1:00 PM | | | |
RNA-Seq | Fri, 7/10, 9:00 AM | view | download | |
Microarray | Fri, 7/10, 9:00 AM | view | download | |

Workshop

Sequence Comparisons - Workshop

1. Consider the sequence

GAACTCATACGAATTCACGTCAGCCCATCGTGCCACGT

Create a window of 3 nucleotides and slide it 1 nucleotide at a time. Calculate the %G+C as a function of nucleotide number. Use an Excel spreadsheet to create a plot of %G+C vs. nucleotide number. Change the window to 5 nucleotides and create a second plot. Overlay the two plots. Show your instructor the spreadsheet and the graph.
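
As a cross-check on the spreadsheet, the sliding-window calculation can also be sketched in Python. This is a minimal sketch, not part of the original handout: the function name gc_percent_windows is made up for illustration, and each window is reported by the number of its first nucleotide (reporting the window center instead is an equally reasonable choice).

```python
def gc_percent_windows(seq, window):
    """Return (start_position, %G+C) for every window, sliding 1 nt at a time.

    Positions are 1-based and refer to the first nucleotide of the window
    (an assumption; the handout does not say which position to plot against).
    """
    seq = seq.upper()
    results = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start + 1, 100.0 * gc_count / window))
    return results


sequence = "GAACTCATACGAATTCACGTCAGCCCATCGTGCCACGT"
for window in (3, 5):
    print("window =", window)
    for position, percent in gc_percent_windows(sequence, window):
        print(position, round(percent, 1))
```

The printed columns can be pasted into the spreadsheet and compared with the values computed there.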

2. Given the sequence PLSQETFSDLWKLLPENNVLSP, use the Kyte-Doolittle hydropathy scale and a sliding window of 7 amino acids to construct a hydropathy plot.
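
The same sliding-window idea applies to hydropathy. The sketch below is an illustration under stated assumptions rather than the course's reference solution: it uses the published Kyte-Doolittle values, averages them over each window, and assigns the average to the middle residue of the window. The names KD_SCALE and hydropathy_profile are hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Return (center_position, mean hydropathy) for each window.

    Positions are 1-based. Averaging (rather than summing) the window and
    plotting at the window center are assumptions made for this sketch.
    """
    scores = [KD_SCALE[aa] for aa in seq.upper()]
    profile = []
    for start in range(len(scores) - window + 1):
        mean_score = sum(scores[start:start + window]) / window
        center = start + window // 2 + 1
        profile.append((center, mean_score))
    return profile


peptide = "PLSQETFSDLWKLLPENNVLSP"
for position, score in hydropathy_profile(peptide, window=7):
    print(position, round(score, 2))
```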

3. Find the protein sequence for bacteriorhodopsin. Make sure you obtain the full-length sequence. Find the Kyte-Doolittle Hydropathy program at http://www.vivo.colostate.edu/molkit/hydropathy/index.html. Perform Kyte-Doolittle analysis of bacteriorhodopsin. Compare the plot to the one displayed in lecture today. Are there differences in the two plots? If so, why?

4. Import the human p53 (accession number AAH03596) and squid p53 (accession number AAA98563) sequences from the NCBI protein database onto your hard drive in FASTA format. To do this, change the display format on the Entrez screen to FASTA, highlight the entry, and copy it to the clipboard. Open Notepad, paste each sequence into a separate document, and save both in a folder named "temp" on the C drive. Name the documents p53_human and p53_squid.

Type dotter c:\temp\p53_human.txt c:\temp\p53_human.txt RETURN

Do you detect some parallel lines? Why? What does the greyramp tool do?

Capture the image and save.

Type dotter c:\temp\p53_human.txt c:\temp\p53_squid.txt RETURN

What are the similarities and differences in the human vs. human dot plot and the human vs. squid dot plot?
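
To see why repeated regions show up as parallel lines, it can help to build a very small dot plot by hand. The sketch below is a simplified illustration and not how Dotter itself works: it scores windows by exact identity only (Dotter uses a scoring matrix and a greyscale), and the function names and the toy sequence are made up for the example.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) window start positions (0-based) where the
    windows from seq_a and seq_b share at least min_matches identical residues."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits, window=3):
    """Draw the hits as text: one row per window in seq_a, '*' marks a hit."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = "".join(
            "*" if (i, j) in hits else "."
            for j in range(len(seq_b) - window + 1)
        )
        rows.append(row)
    return "\n".join(rows)


# Toy sequence containing the repeat "ACD" twice (not a real protein).
toy = "ACDACD"
print(render(toy, toy, dot_plot(toy, toy)))
```

In a self-comparison the main diagonal is always filled; any additional diagonal runs come from windows that match at a different offset.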


Database Searching - Scoring Matrices - Workshop

A. Download PAM250 and PAM40 from the internet and print them out. What are the differences between the two matrices? Why do you see these differences?

B. Download BLOSUM80 and BLOSUM30. What are the differences between the two matrices? Why do you see these differences?

C. Obtain the mouse p53 sequence and compare it to human p53 with the Dotter program. According to your analysis, do you detect more similarity with human vs. mouse or human vs. squid? Are some subregions within the human p53 protein more conserved than others?

D. The p53 protein is known to have a certain number of conserved domains. Use the Dotter program and a series of p53 proteins from different species (at least 5 proteins) to determine the number of conserved domains and the boundaries of the conserved domains. In reporting the boundaries, use the human sequence number as the standard.

E. Create your own scoring matrix that gives fairly good results with simple sequences in the Dotter program. For example, you may choose to create a scoring matrix that gives high marks only for similarities between charged residues and low marks for other amino acid similarities. An easy way to do this is to use one of the BLOSUM or PAM matrices as a template and change the numbers slightly. Show that your scoring matrix works in the Dotter program for some simple polypeptides that are 50 amino acids in length (you may choose your own sequences). Explain the purpose of your scoring matrix and justify the numbers you chose.
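
One possible starting point for part E is sketched below. It is only an illustration: the scores (5, 3, 1, -1), the choice to treat D/E as acidic and K/R as basic, and the function names are all assumptions made for the example. It also does not write the matrix file that Dotter reads; the dictionary would still have to be saved in whatever matrix format Dotter expects.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ACIDIC = set("DE")   # assumption: histidine is left out of the charged groups
BASIC = set("KR")


def build_charge_matrix(identical_charged=5, same_charge=3,
                        identical_other=1, mismatch=-1):
    """Build a symmetric score table that rewards charged-residue similarity."""
    matrix = {}
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            both_acidic = a in ACIDIC and b in ACIDIC
            both_basic = a in BASIC and b in BASIC
            if a == b and (both_acidic or both_basic):
                score = identical_charged
            elif both_acidic or both_basic:
                score = same_charge
            elif a == b:
                score = identical_other
            else:
                score = mismatch
            matrix[(a, b)] = score
    return matrix


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the pairwise scores of two equal-length sequences (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


matrix = build_charge_matrix()
print(ungapped_score("DEKA", "DDRA", matrix))
```

Scoring a few short test peptides this way makes it easier to justify the chosen numbers before trying the matrix in Dotter.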


Sequence Databases - Workshop

Below is a small portion (~1,500 bp) of the C. elegans genome:

ATTTTTAAAAATGTACAAAATCAAACGCCCTACAAATCATGTGTGTGAAGAAGAATAATAACTAACATAT CTATTTATATTTACCGAATAAATATATATTCATCAATTAACCTGAAGAACAAACGAATTCGGCTACAGGC GTCGATCAGTCTCGAATCTAGTAACAACAAGAGAGCAATACGAAAACCGGTAAATCAATAGGGGGAAGCG AAACAGTAGGTACAAATTGGAGGGGAAGCACCAATACATTAGGTGGGGGGTACGACTTGAAAAATGAGCT GATTTTCGAATAGTTAAAGCGATGATCGTGTCCGAAAAACAGTTCATTTTTCAAGACAACATTGAGACTG GGAGTACGGGGAAGCTCATTTACGGTGAGAGGAATTGGTGAGATCTTTAGAATATGCTTAAGGAGTTGGG GTGGCTGGAGAAGTTCCTGTAGCCTCCGTGCCGGGATTCGATGGAGAAGTCGTTGCGGCTGGTCCCTTTT CCTTCACTGGTGCTGGATCCTTGGCTGGAAGACATATGCGTGGCTTGACAGTCGATGAGGTGCGAGCCGA CGAGTCCTTGTGAACTTCGTATCTGGAAATATTTTACTTAGATAGCAAATACTAAAATTGTAAAATTACC TCAAAATCTCAGTATCCGGAATGCTCAATTTCTGCTTCAAAACCTGTCCGATGCGAAGATTGACATCATC GCGAGTAGCATCACGAGTCCACAAGGAAACCTTGTCACCCTTTTGACGAACATTCACGACAGCTCCGCAG ATGTAGTCTCCGTACTCGTCGAATTGCTCTCCAACAATAGCCATCAACAGCTCCAACCAGTAGTGATCGA GCAATTGCGTTCTTCTCTGAAGCTTCTATGATTCATTGAATAAAATATATTTCTCAAAACGTACTTGCTT ATCGACAACAACCAACCAACGTCCACCTTGAACGTTGTTGACGTCCTCCCACATTGGCTTGATTCCTTCC TTGAACAAGTAATAATCGGATCCCCAGTTCAATCCTCCGGCAGACTGAATGTGATTGTACAGCGACCAGA AGTCCTCGACAGTGTCGAAAAGTGAAACCATCTGGAAAAAATCGATAAAAGACGTATTTAAAAATCTTCT ACCTTCAGACAATCCTCCCATTCCTTGTTACGGTCAGCTTTCAAGTACCAGAGAGCCCAGCGATTCTGGA GGGGGTGTCTGGTGAGAAGCTCTGGAGGAACTGAAGCATCGGACGCATTCACATCGCCGGAAGCTGACAA TGCTTTGTTTTCCGCTACGGATGTGCTCATTTAGCTGAAAATAGGTAATATTATATACGATTAGAGCTCG GAAAACGATAAAATAGAGAAGAGTATGAATTTGGTTCAAATAACTCGGATTTTATAGGAAATTTTGTTTT ACTGCACATTTTCGGCTAGTTTCCAAGCTTTTTAGATTTTTCAAGTGTAATTGGTAACATCGGGCACAAT AAATTGATATTAAAGCTTGGAAAACAATAAA

Use the sequence above to carry out the following:

Run a blastn search of the sequence against the C. elegans genome.
When you get your report, scroll up and down to get an overall sense of the report. 
Use one of the two links to a genomic flat file sequence. 

1. Write down the RefSeq numbers for the a) genomic, b) mRNA, and c) protein sequences for which this sequence codes.

Use the RefSeq links to look at the mRNA and protein entries.

2. What is the name of the protein for which this sequence codes?
3. List the sections of nucleotides that must be joined in order to assemble the mRNA coding sequence for the protein.
4. What feature of genomic organization explains why the coding sequence is not continuous?
5. Is the sequence shown above the sequence of the template strand or the coding strand of the DNA? How do you know?
6. Draw an arrow on or near the sequence above to show the direction of transcription.

Genes in eukaryotes are often organized into exons and introns, which require post-transcriptional splicing to produce a mature mRNA with a contiguous open reading frame for translation. This interrupted organization can make gene identification difficult in eukaryotes, particularly in higher eukaryotes with complex gene organization. Many genes and their organization have been predicted from similarity searches between genomic sequence and known protein amino acid sequences, or between genomic sequence and the corresponding full-length cDNAs or even ESTs. The following exercise challenges you to deduce genomic structure from amino acid information.
Conduct a blastx search (available from the NCBI BLAST homepage) of the protein database(s) to which NCBI is linked, to identify regions in this sequence that encode amino acids with similarity to known proteins. Scroll up and down your results to get an overall picture of the layout.

7. What does the blastx algorithm do with your nucleotide sequence before searching SwissProt for matches?

Locate the 3 alignment entries that relate to C. elegans. Use the first of those three to address the following questions. Add your answers to this document.

8. Why is the difference between the numbers at the left and right ends of any “Query” line always larger than the difference between the numbers at the left and right ends of the corresponding “Sbjct” line?
9. From the blastx output, to what protein does this region of genomic DNA have significant similarity? (There should be no surprises here.)
10. What features of the output show you that the coding regions for the amino acids within the matched protein are not located within a single contiguous region of the genomic DNA? (There is more than one feature.)
11. How many separate regions of the genomic DNA align with the highest scoring match in the output? Does your answer match the information you saw in your earlier nucleotide blast?

Note the numbering of the sequences in the alignments. Does the database genomic sequence progress in the same direction as the database amino acid sequences in the alignments? In other words is it the same orientation (below):

1.................................114 = query
61...............................98 = subject
or opposite orientation (below):
1.................................114 = query
98...............................61 = subject

12. What does the orientation of the sequences in the alignment relative to each other tell you about the gene orientation relative to the sequence that was used as the query sequence? Did the direction of the arrow you first drew above correctly reflect the direction of transcription?

http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html
Reverse complement
TTTATTGTTTTCCAAGCTTTAATATCAATTT ATTGTGCCCGATGTTACCAATTACACTTGAAAAATCTAAAAAGCTTGGAAACTAGCCGAAAATGTGCAGT AAAACAAAATTTCCTATAAAATCCGAGTTATTTGAACCAAATTCATACTCTTCTCTATTTTATCGTTTTC CGAGCTCTAATCGTATATAATATTACCTATTTTCAGCTAAATGAGCACATCCGTAGCGGAAAACAAAGCA TTGTCAGCTTCCGGCGATGTGAATGCGTCCGATGCTTCAGTTCCTCCAGAGCTTCTCACCAGACACCCCC TCCAGAATCGCTGGGCTCTCTGGTACTTGAAAGCTGACCGTAACAAGGAATGGGAGGATTGTCTGAAGGT AGAAGATTTTTAAATACGTCTTTTATCGATTTTTTCCAGATGGTTTCACTTTTCGACACTGTCGAGGACT TCTGGTCGCTGTACAATCACATTCAGTCTGCCGGAGGATTGAACTGGGGATCCGATTATTACTTGTTCAA GGAAGGAATCAAGCCAATGTGGGAGGACGTCAACAACGTTCAAGGTGGACGTTGGTTGGTTGTTGTCGAT AAGCAAGTACGTTTTGAGAAATATATTTTATTCAATGAATCATAGAAGCTTCAGAGAAGAACGCAATTGC TCGATCACTACTGGTTGGAGCTGTTGATGGCTATTGTTGGAGAGCAATTCGACGAGTACGGAGACTACAT CTGCGGAGCTGTCGTGAATGTTCGTCAAAAGGGTGACAAGGTTTCCTTGTGGACTCGTGATGCTACTCGC GATGATGTCAATCTTCGCATCGGACAGGTTTTGAAGCAGAAATTGAGCATTCCGGATACTGAGATTTTGA GGTAATTTTACAATTTTAGTATTTGCTATCTAAGTAAAATATTTCCAGATACGAAGTTCACAAGGACTCG TCGGCTCGCACCTCATCGACTGTCAAGCCACGCATATGTCTTCCAGCCAAGGATCCAGCACCAGTGAAGG AAAAGGGACCAGCCGCAACGACTTCTCCATCGAATCCCGGCACGGAGGCTACAGGAACTTCTCCAGCCAC CCCAACTCCTTAAGCATATTCTAAAGATCTCACCAATTCCTCTCACCGTAAATGAGCTTCCCCGTACTCC CAGTCTCAATGTTGTCTTGAAAAATGAACTGTTTTTCGGACACGATCATCGCTTTAACTATTCGAAAATC AGCTCATTTTTCAAGTCGTACCCCCCACCTAATGTATTGGTGCTTCCCCTCCAATTTGTACCTACTGTTT CGCTTCCCCCTATTGATTTACCGGTTTTCGTATTGCTCTCTTGTTGTTACTAGATTCGAGACTGATCGAC GCCTGTAGCCGAATTCGTTTGTTCTTCAGGTTAATTGATGAATATATATTTATTCGGTAAATATAAATAG ATATGTTAGTTATTATTCTTCTTCACACACATGATTTGTAGGGCGTTTGATTTTGTACATTTTTAAAAAT
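
If the web tool is unavailable, a reverse complement can also be computed with a few lines of Python. This is a minimal sketch with a hypothetical function name. It removes spaces and line breaks first so that the whole sequence is reversed as one string, rather than each 70-base block being reversed separately.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Whitespace (spaces, tabs, line breaks) is stripped before reversing,
    so block-formatted sequences are treated as one continuous string.
    """
    cleaned = "".join(seq.split())
    return cleaned.translate(COMPLEMENT)[::-1]


# First 16 bases of the genomic fragment used in this workshop.
print(reverse_complement("ATTTTTAAAAATGTAC"))
```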


Protein Structure Manipulation - Workshop

Download the structure coordinates for the protein 1HEW from the PDB onto your hard drive.
Follow the tutorial for viewing protein structures at:

http://www.usm.maine.edu/~rhodes/SPVTut/text/SPdbVTut.html

You can start the tutorial at Section 2--Windows and help. If you are already advanced in the manipulation of protein structures you may attempt to predict a 3D model of the protein given a primary sequence. Here is the website for predicting 3D structures:

http://expasy.org/spdbv/text/modeling.htm

A more up-to-date SWISS-MODEL site is located at:

http://swissmodel.expasy.org/workspace/index.php?func=modelling_overview&userid=USERID&token=TOKEN
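
Before opening a coordinate file in a viewer, it can be useful to look at what it contains. The sketch below is an optional illustration, not part of the tutorial: it reads ATOM records using the fixed column positions of the PDB format and counts residues by their alpha-carbon atoms. The function names and the file name 1HEW.pdb are assumptions, and insertion codes and alternate locations are ignored.

```python
def parse_atoms(pdb_text):
    """Extract basic fields from the ATOM records of PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM  "):
            continue  # skip HETATM, header, and other record types
        atoms.append({
            "name": line[12:16].strip(),      # columns 13-16: atom name
            "res_name": line[17:20].strip(),  # columns 18-20: residue name
            "chain": line[21],                # column 22: chain identifier
            "res_seq": int(line[22:26]),      # columns 23-26: residue number
            "x": float(line[30:38]),          # columns 31-38: x coordinate
            "y": float(line[38:46]),          # columns 39-46: y coordinate
            "z": float(line[46:54]),          # columns 47-54: z coordinate
        })
    return atoms


def count_residues(atoms):
    """Count residues by unique (chain, residue number) of CA atoms."""
    return len({(a["chain"], a["res_seq"]) for a in atoms if a["name"] == "CA"})


# Example use, assuming the downloaded file was saved as 1HEW.pdb:
# with open("1HEW.pdb") as handle:
#     atoms = parse_atoms(handle.read())
# print(len(atoms), "atoms;", count_residues(atoms), "residues")
```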


Proteome Analysis - Workshop

You will be assigned a disease-related protein. Obtain its amino acid sequence. Predict its location on a 2-D-gel with the following tool:

http://www.expasy.ch/cgi-bin/2dregion-for-seq.pl

Can you justify its position on the gel? Show the instructor. Perform an in silico trypsin digest of your protein. Obtain the monoisotopic masses of the peptides and submit them to the Mascot server. Determine the minimum number of peptide masses necessary to give a correct identification of your protein. Is there a relationship between the number of peptide masses used and the accuracy of the prediction? How many significant figures are needed to get a positive identification? Is there an optimal mass range for positive identification?

Repeat this exercise using Protein Prospector. Which software program is better at retrieving your results?
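
The digest and mass calculation can be approximated in Python to check the values produced by the online tools. This is a simplified sketch under stated assumptions: trypsin is taken to cleave after K or R except when the next residue is P, no missed cleavages or modifications are considered, and the function names are made up for the example. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (amino acid minus water).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added back per free peptide


def trypsin_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass [M] of an unmodified peptide."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


# Demonstration on the short peptide from the Sequence Comparisons workshop.
for peptide in trypsin_digest("PLSQETFSDLWKLLPENNVLSP"):
    print(peptide, round(monoisotopic_mass(peptide), 4))
```

These are neutral masses; if a search form asks for singly protonated [M+H]+ values, add the mass of a proton (about 1.00728 Da) to each.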



