Curriculum 2004

 

Lecture Title Date html .ppt Reading Workshop Homework References
Introduction to Course 5/28 view download p. 3-12 BFG   #1 #1
Literature Databases 6/5 view download p. 15-37 BFG #2 #2 #2
Sequence Analysis 6/7 view download p. 41-50 BFG #3 #3 #3
Alignment 1 6/12 view download p. 50-71 BFG #4 #4 #4
Alignment 2 6/14 view download p. 71-79 BFG #5 #5 #5
Intro to Python I 6/19 view download ch. 1-5 #6    
Intro to Python II 6/21 view download ch. 1-5 #7 #7  
Intro to Python II cont 6/26        
Sequence Databases 6/28 view download   #9 #9 #9
Intro to Python III 7/3 view download ch. 5-7      
Intro to Python III cont 7/5        
Developing Sequence Alignment Algorithms 7/10 view download Pevzner, Smith & Waterman, Needleman & Wunsch   Final Project #12
Statistics 7/12 view download   #10 #10 #10
Rationale for Searching Databases 7/17 view download p. 87-123 BFG #15 #15 #15
Intro to Python IV 7/19 view download Project Help      
Predicting Secondary Structure 7/24 view download p. 223-247 BFG #17 #17 #17
Protein Structure & Model Building 7/26 view download p. 273-313 #18   #18
Protein Structure Tertiary Structure Modeling 7/31 view download       #19
Programming Workshop / Course Summary 8/2 view download Project Help      
Final Projects 8/9        

Extra Lectures
Phylogenetic Analysis   view download        
Overview of Sequence Searching   view download       #11
Multiple Sequence Alignment   view download p. 319-251 BFG #16   #16

Workshop

Workshop #2

Workshop A: Set up a cubby account on a biological topic of interest. Show the instructor the cubby account you set up. Subscribe to NCBI News.

Workshop B: Go to OMIM Website and type "Breast cancer". Link to MIM#113705. What does the light bulb represent? What do the links with the numbers lead to? Obtain the following information on BRCA1: Chromosome location -- Method used to map the BRCA1 gene to that particular location in the chromosome. Name the disease gene that is telomeric and the disease gene that is centromeric to BRCA1.
Obtain protein sequence and cDNA sequence.


Workshop #3

Consider the sequence GAACTCATACGAATTCACGTCAGCCCATCG

Use a window of 3 nucleotides and slide the window 1 nucleotide at a time. Calculate the %GC as a function of nucleotide number and draw a graph. Change the window to 2 nucleotides and repeat, then overlay the two plots. You may use Excel. Print out the spreadsheet and the graph.

Find the protein sequence for bacteriorhodopsin. Make sure you obtain the full-length sequence. Find the Kyte-Doolittle Hydropathy program software at the Expasy Tools website. Perform Kyte-Doolittle Hydropathy analysis of bacteriorhodopsin. Compare the plot to the one displayed in lecture today. Are there differences in the two plots? If so, why?


Workshop #4

A. Download PAM250 and PAM10 from the internet and print them out. What are the differences between the two matrices? Why do you see these differences?

B. Download BLOSUM67 and BLOSUM30. What are the differences between the two matrices? Why do you see these differences? Download the BLOSUM30 matrix as a text file onto the C drive and name the file BLOSUM30.

C. Import the human p53 (Accession number AAH03596) and squid p53 (Accession number AAA98563) sequences from the protein databases at NCBI onto your hard drive in FASTA format. This can be accomplished by changing the display format on the ENTREZ screen to FASTA. Then highlight the entry and copy it onto the clipboard. Open NotePad on your local hard drive, paste each sequence into a separate document, and save them in a folder named sequence on the C drive. Name the documents p53_human and p53_squid. Then type dotter c:\sequence\p53_human.txt c:\sequence\p53_human.txt RETURN. Do you detect some parallel lines? Why? What does the greyramp tool do? Capture the image and save it.

D. Type dotter c:\sequence\p53_human.txt c:\sequence\p53_squid.txt RETURN. What is the difference between the human vs. human comparison and the human vs. squid comparison?

E. Change the matrix from BLOSUM67 to BLOSUM30 by typing the following command: dotter -M c:BLOSUM30.txt c:\sequence\p53_human.txt c:\sequence\p53_squid.txt. Which scoring matrix produces more lines? Why?

F. If you have time, obtain the mouse p53 sequence and compare it to human. According to your analysis, does human p53 show more similarity to mouse p53 or to squid p53? Are some subregions within the p53 protein more conserved than others?
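
The diagonal lines seen in parts C through F can be illustrated with a toy dot matrix. The sketch below is not Dotter's algorithm (Dotter scores windows with a substitution matrix such as BLOSUM); it assumes exact identity over a short window, and the names `dot_matrix` and `render` as well as the toy sequence are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return a grid of booleans: True where `window` consecutive residues
    of seq_a (rows) and seq_b (columns) are identical."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(grid):
    """Draw the grid as text: '*' for a matching window, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    toy = "ABCABCABC"  # hypothetical repeat-containing sequence, not p53
    # A self-comparison shows the main diagonal plus parallel diagonals
    # wherever the sequence repeats itself.
    print(render(dot_matrix(toy, toy)))
```

Comparing a sequence against itself always fills the main diagonal; internal repeats add parallel diagonals, which is one way to think about the question in part C.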


Workshop #5

1. Use your knowledge of dynamic programming to find the optimal alignments between AATGC and AGGC. Using the payoff matrix discussed in the lecture, what is the score for the optimal path?
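
For checking a hand-filled table, a minimal Needleman-Wunsch scoring sketch may help. The payoff matrix from lecture is not reproduced on this page, so the values here (match +1, mismatch -1, gap -1) are assumptions; substitute the lecture values. The function name `needleman_wunsch_score` is hypothetical, and the traceback that recovers the alignments themselves is left out.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming (Needleman-Wunsch).

    The default payoffs are placeholders, not the lecture's matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: diagonal (pair two letters),
    # up (gap in b), or left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("AATGC", "AGGC"))
```

The optimal path score is the value in the bottom-right cell; the alignments are recovered by tracing back from that cell along the choices that produced each maximum.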


Workshop #6

Write a Python program to compute the hydrophobicity of an amino acid. The program should prompt the user for an amino acid and display its hydrophobicity.


Workshop #7

Write a sliding window program to compute the %GC in a sequence of nucleotides. The program should prompt the user for 1) The DNA sequence. 2) The window size (assume the window increment is 1). Test your program using the data for Workshop 3.

** Demonstrate your solution at the beginning of class on May 3rd. Time will be spent at beginning of class on debugging solutions to get them working properly.
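
One possible starting point is sketched below, assuming Python 3. The function name `gc_percent_windows` is hypothetical, and the prompting step is left out so the function can be tested on its own; a full solution would read the sequence and window size with `input()`.

```python
def gc_percent_windows(sequence, window):
    """Return the %GC of every window of length `window`, sliding 1 nt at a time."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")

    results = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        results.append(100.0 * gc_count / window)
    return results


if __name__ == "__main__":
    workshop3 = "GAACTCATACGAATTCACGTCAGCCCATCG"  # sequence from Workshop #3
    for position, value in enumerate(gc_percent_windows(workshop3, 3), start=1):
        # position is the 1-based index of the first nucleotide in the window
        print(position, round(value, 1))
```

A sequence of length n with window size w yields n - w + 1 values, which is a quick sanity check against the Excel plot from Workshop #3.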


Workshop #9

1) Use the following accession number to find a sequence in a nucleotide database: Z68198.

2) Print out the first 1000 nucleotides of the cosmid and decipher the open reading frames from the annotations listed in the flat file.

3) Underline the open reading frames and give direction of the coding strand (5' to 3').

4) Show to the instructor.


Workshop #10

Go to the NCBI website and open the BLAST page. Perform BLASTP on the sequence SSSVPSQKTYQGSYGFRLGFLHSGTAKSVT using the default settings. Record the total number of letters in the NR database and the E-value for the top hit. Next, change the database to Swiss-Prot and repeat. Record the total number of letters in the Swiss-Prot database and the E-value for the top hit. Compare the E-values for the two searches and explain why they are different. Do you think that you obtained these hits by chance? Try to find another database for which the search gives a hit with an even lower E-value. What other parameters can you change to obtain a lower E-value?


Workshop #15

Use Entrez to find the C-terminal region (approximately 215 residues) of human BRCA1 (SWISS-PROT accession number P38398). Search the NR protein database with this sequence using PSI-BLAST. Why do some new scores have lower E values than the old scores after the second iteration?


Workshop #17

Workshop A: Find the complete amino acid sequence of human p53 and perform a secondary structure prediction with PSIPRED or another secondary structure prediction algorithm. Have the results emailed to you.

Workshop B: Check whether the BLIMPs program in the BLOCK Searcher can predict the function of PTEN (NP_000305). PTEN is an abbreviation for phosphatase and tensin homolog. Obtain the sequence from the protein database at NCBI. Convert it to FASTA format. Paste the sequence into the window in BLOCK Searcher (http://blocks.fhcrc.org/blocks/blocks_search.html). Determine the major function based on the BLOCK Searcher output. Determine the actual function of PTEN by performing a text search for PTEN in the OMIM database. Did the BLOCK Searcher help assess the function of PTEN?


Workshop #18

Download structure coordinates for 1HEW protein from PDB onto your hard drive. Follow the tutorial for viewing protein structures at http://www.usm.maine.edu/~rhodes/SPVTut/text/SPdbVTut.html. You can start the tutorial at Section 2--Windows and help.



Homework

Homework #1

Write a paragraph with at least five sentences to introduce yourself to the instructors. Describe your goals, aspirations and hobbies. List your standing at CSULA (junior, senior, graduate student, etc.), describe the previous computer-related courses and molecular life science courses you have completed. Describe what you hope to get out of this class. Give your return email address, your name, and your CIN#.


Homework #2

From the list below you will be assigned a disease. Recover three review articles published within the last two years that describe the gene and give the citation for each article. Print out one of the articles. Describe the symptoms of the disease and obtain the name of a single gene that is associated with the disease. Indicate the chromosome location, the disease gene that is telomeric and the disease gene that is centromeric to your gene. Obtain the nucleotide and amino acid sequence of the gene. Include a reference for each website you used to retrieve the information and each article you used. Report the information on a Word document, print out and hand in.

List of diseases: Colon cancer, Ataxia-Telangiectasia, Cystic fibrosis, Cardiomyopathy, Bloom syndrome, Aarskog-Scott syndrome, Albinism, Adrenoleukodystrophy, Muscular Dystrophy (Duchenne), Hemophilia, Familial hypercholesterolemia, Huntington's disease, Wilson disease, Menkes disease, Friedreich's ataxia, Werner syndrome, Fragile X RES

Complete problems 1, 3, 5, and 7 from chapter 2.


Homework #3

Given the following sequence: PLSQETFSDLWKLLPENNVLSP use the Kyte/Doolittle Hydropathy scale and a sliding window of 7 amino acids to construct a hydropathy plot.
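
The calculation can be sketched in Python as follows. The dictionary holds the published Kyte-Doolittle hydropathy values; averaging over the window (rather than summing) and reporting each value at the window's center residue are choices made for this sketch, and the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return (center_residue_number, mean_hydropathy) for each window position."""
    seq = protein.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")

    values = [KYTE_DOOLITTLE[residue] for residue in seq]  # KeyError on unknown codes
    profile = []
    for start in range(len(seq) - window + 1):
        mean = sum(values[start:start + window]) / window
        center = start + window // 2 + 1  # 1-based residue at the window midpoint
        profile.append((center, mean))
    return profile


if __name__ == "__main__":
    for center, mean in hydropathy_profile("PLSQETFSDLWKLLPENNVLSP", window=7):
        print(f"{center}\t{mean:.2f}")
```

The tab-separated output can be pasted into Excel to draw the plot. The same function, fed a sequence read from a file, covers most of what Homework #7 asks for.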


Homework #4

1. The p53 protein is known to have a certain number of conserved domains. Use the Dotter program and a series of p53 proteins from different species (at least 5 proteins) to determine the number of conserved domains and the boundaries of the conserved domains. In reporting the boundaries, use the human sequence number as the standard.

2. Create your own scoring matrix that shows fairly good results with simple sequences using the Dotter program. For example, you may choose to create a scoring matrix that gives high marks only for similarities between charged residues and low marks for other amino acid similarities. An easy way to do this is to use one of the BLOSUM or PAM matrices as a template and change the numbers slightly. Show that your scoring matrix works for some simple polypeptides that are 50 amino acids in length. Explain the purpose of your scoring matrix and your justification for the numbers you choose.

3. Print out your homework and hand in hard copy.


Homework #5

Answer 3.1, 3.2, 3.3, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10


Homework #7

Modify your sliding window program from workshop 7 to compute the hydrophobicity of an amino acid sequence. Use the Kyte and Doolittle scale (lecture #3). Obtain the sequence of bacteriorhodopsin and save in a file. Your script should read the sequence from the input file, compute the hydrophobicity as a function of amino acid residue number, and store the results to an output file. Plot the results in EXCEL and compare your plot to the one given in lecture #3 and your answer for workshop 3. Due May 10th.


Homework #9

1) Locate a primary database on the Web not discussed in lecture, describe the experimental data placed in the database and the programming language used to input data into the database. Find a secondary database that uses your primary database as a source. Describe the new information the secondary database provides. Write your answers and include the websites you visited on a Word document.

2) Answer 3.1, 3.2, 3.3, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10


Homework #10

Given that there are 30,000 open reading frames in the human genome and that the average open reading frame codes for 500 amino acids, determine the theoretical number of proteins that would contain the amino acid sequence PQPKKKPL by random chance. This is a putative nuclear localization sequence. Then determine the number of times this sequence shows up in the human genome database by using a BLAST protein search of human sequences in GenBank.
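
The theoretical part reduces to a short expected-value calculation, sketched here under loudly stated assumptions: all 20 amino acids are equally likely, positions are independent, and every protein is exactly 500 residues long. The function name `expected_motif_count` is hypothetical.

```python
def expected_motif_count(num_proteins, protein_length, motif, alphabet_size=20):
    """Expected number of chance occurrences of `motif` across all proteins.

    Assumes uniform, independent residue frequencies (a simplification;
    real proteomes are not uniform).
    """
    start_positions = protein_length - len(motif) + 1
    if start_positions <= 0:
        return 0.0
    # Probability that one given start position spells the motif exactly.
    p_match = (1.0 / alphabet_size) ** len(motif)
    return num_proteins * start_positions * p_match


if __name__ == "__main__":
    print(expected_motif_count(30000, 500, "PQPKKKPL"))
```

Because the expected count is far below 1, it is approximately the expected number of proteins containing the motif. Comparing it with the number of BLAST hits indicates whether the motif occurs more often than chance alone would predict.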


Homework #15

1. Problems 4.6, 4.7, 4.9

2. Genes in eukaryotes are often organized into exons and introns which require post-transcriptional splicing to produce a mature mRNA with a contiguous open reading frame for translation. As we have discussed in class, this often makes gene identification difficult in eukaryotes and particularly in higher eukaryotes with complex gene organization. In the analysis of the sequence derived from the human genome discussed in class, prediction of many genes and their organization was based on similarity searches between gene sequence and known proteins and/or gene sequence and either their corresponding full-length cDNAs or ESTs.

Below is a small portion (~1,500 bp) of the C. elegans genome taken from a cosmid sequence (a cosmid is a vector that can hold ~30-50 kb inserts) used as a template to determine the sequence of the C. elegans genome.

ATTTTTAAAAATGTACAAAATCAAACGCCCTACAAATCATGTGTGTGAAGAAGAATAATAACTAACATAT
CTATTTATATTTACCGAATAAATATATATTCATCAATTAACCTGAAGAACAAACGAATTCGGCTACAGGC
GTCGATCAGTCTCGAATCTAGTAACAACAAGAGAGCAATACGAAAACCGGTAAATCAATAGGGGGAAGCG
AAACAGTAGGTACAAATTGGAGGGGAAGCACCAATACATTAGGTGGGGGGTACGACTTGAAAAATGAGCT
GATTTTCGAATAGTTAAAGCGATGATCGTGTCCGAAAAACAGTTCATTTTTCAAGACAACATTGAGACTG
GGAGTACGGGGAAGCTCATTTACGGTGAGAGGAATTGGTGAGATCTTTAGAATATGCTTAAGGAGTTGGG
GTGGCTGGAGAAGTTCCTGTAGCCTCCGTGCCGGGATTCGATGGAGAAGTCGTTGCGGCTGGTCCCTTTT
CCTTCACTGGTGCTGGATCCTTGGCTGGAAGACATATGCGTGGCTTGACAGTCGATGAGGTGCGAGCCGA
CGAGTCCTTGTGAACTTCGTATCTGGAAATATTTTACTTAGATAGCAAATACTAAAATTGTAAAATTACC
TCAAAATCTCAGTATCCGGAATGCTCAATTTCTGCTTCAAAACCTGTCCGATGCGAAGATTGACATCATC
GCGAGTAGCATCACGAGTCCACAAGGAAACCTTGTCACCCTTTTGACGAACATTCACGACAGCTCCGCAG
ATGTAGTCTCCGTACTCGTCGAATTGCTCTCCAACAATAGCCATCAACAGCTCCAACCAGTAGTGATCGA
GCAATTGCGTTCTTCTCTGAAGCTTCTATGATTCATTGAATAAAATATATTTCTCAAAACGTACTTGCTT
ATCGACAACAACCAACCAACGTCCACCTTGAACGTTGTTGACGTCCTCCCACATTGGCTTGATTCCTTCC
TTGAACAAGTAATAATCGGATCCCCAGTTCAATCCTCCGGCAGACTGAATGTGATTGTACAGCGACCAGA
AGTCCTCGACAGTGTCGAAAAGTGAAACCATCTGGAAAAAATCGATAAAAGACGTATTTAAAAATCTTCT
ACCTTCAGACAATCCTCCCATTCCTTGTTACGGTCAGCTTTCAAGTACCAGAGAGCCCAGCGATTCTGGA
GGGGGTGTCTGGTGAGAAGCTCTGGAGGAACTGAAGCATCGGACGCATTCACATCGCCGGAAGCTGACAA
TGCTTTGTTTTCCGCTACGGATGTGCTCATTTAGCTGAAAATAGGTAATATTATATACGATTAGAGCTCG
GAAAACGATAAAATAGAGAAGAGTATGAATTTGGTTCAAATAACTCGGATTTTATAGGAAATTTTGTTTT
ACTGCACATTTTCGGCTAGTTTCCAAGCTTTTTAGATTTTTCAAGTGTAATTGGTAACATCGGGCACAAT
AAATTGATATTAAAGCTTGGAAAACAATAAA

Using this sequence carry out the following:

Conduct a blastx search (BLAST) of the Swiss Protein database (SwissProt within the "Choose Database" drop down window which has nr as the default) to attempt to identify regions in this sequence encoding amino acids with similarity to known proteins in the current database.

A1. From the blastx output, to what protein does this region of genomic DNA have significant similarity?

Note the number of individual HSPs or alignments within the first 5-10 highest scoring hits.

A2. Are the amino acids within the matched protein located within a single contiguous region of the genomic DNA? If not, how many separate regions of the genomic DNA align with the highest scoring match in the output?

A3. What essential feature of the organization of the gene does the above information provide?

A4. Note the numbering of the sequences in the alignments. Do both the genomic sequence and the protein sequences progress in the same direction within the alignments or not? In other words is it the same orientation:

1.................................38 = query
60...............................98 = subject

or opposite orientation:

1.................................38 = query
98...............................60 = subject

A5. What does the orientation of the sequences in the alignment tell you about the gene orientation and strand that was used as the query sequence?
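
Questions A4 and A5 rest on the fact that blastx translates the nucleotide query in all six reading frames, three on the strand as given and three on its reverse complement. The sketch below only lists the six nucleotide frames; the function names `reverse_complement` and `six_frames` are hypothetical, and translation to protein is left out.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return a dict mapping frame number (+1..+3, -1..-3) to the nucleotide
    string that a translation in that frame would start reading from."""
    frames = {}
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[offset + 1] = dna[offset:]         # forward strand frames
        frames[-(offset + 1)] = reverse[offset:]  # reverse strand frames
    return frames


if __name__ == "__main__":
    for frame, nucleotides in sorted(six_frames("ATGCCCGGGTAA").items()):
        print(frame, nucleotides)
```

A hit reported in a negative frame means the matching protein is encoded on the strand complementary to the query as it was submitted.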


Homework #17

Go to the Protein Data Bank and obtain the record for the p53 crystal structure (1TSR). Create a line graph that places the amino acid sequence in one row and the known secondary structure of each amino acid in the next row. Next, obtain the PSIPRED predicted secondary structure of p53. Create a third row on the line graph that shows the predicted structure. If you can, try to obtain other portions of p53 where the structure has been solved from the Protein Data Bank and fill in those regions in the second row that were not obtained in the 1TSR record. Turn in the hard copy of the line figure and calculate the percent accuracy of the PSIPRED prediction. A hypothetical example is shown below:

Sequence: MEETHAPYRGVCNNM
Actual Structure: CCCCCHHHHHHEEEE
PSIPRED Predict.: CCCCCHHHHHHEEEH

Percent accuracy: 14/15 × 100 = 93.3%
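
For full-length sequences the same per-residue comparison is easier to script than to count by hand. A minimal sketch, assuming both structure strings use the same one-letter states (C, H, E) and have been trimmed to the same residues; the function name `percent_accuracy` is hypothetical.

```python
def percent_accuracy(actual, predicted):
    """Percentage of positions where the predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("structure strings must have the same length")
    if not actual:
        raise ValueError("structure strings must not be empty")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * matches / len(actual)


if __name__ == "__main__":
    actual = "CCCCCHHHHHHEEEE"     # row 2 of the hypothetical example
    predicted = "CCCCCHHHHHHEEEH"  # row 3 of the hypothetical example
    print(round(percent_accuracy(actual, predicted), 1))
```

Residues with no solved structure should be excluded from both strings before the comparison, since they cannot be scored either way.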



References

References for Lecture #1

Gelehrter, Collins, Ginsburg, Principles of Medical Genetics, Williams & Wilkins, Baltimore, 1998.
Baxevanis and Ouellette, Bioinformatics-2nd Edition, Wiley-Interscience, New York, 2001
http://cmgm.stanford.edu/classes/csuh/


References for Lecture #2

http://adonis.creighton.edu/hsl/Searching/Medline-Fields.html
http://cmgm.stanford.edu/classes/csuh/literature/
http://hml.org/WWW/class/help/medcite.html
http://www.nlm.nih.gov/mesh/meshhome.html


References for Lecture #3

Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 2001
http://www.infobiogen.fr/doc/dotter.html
http://info.bio.cmu.edu/Courses/BiochemMols/BCMolecules.html


References for Lecture #4

http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#2page6
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998
http://cmgm.stanford.edu/classes/csuh/literature/
http://www.infobiogen.fr/doc/dotter.html


References for Lecture #5

http://www.finchcms.edu/biochem/Walters/nw.html
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998
http://cmgm.stanford.edu/classes/csuh/search1/
http://bioweb.pasteur.fr/docs/doc-gensoft/EMBOSS/doc/programs/text/needle.txt
http://www.maths.tcd.ie/~lily/pres2/sld003.htm
http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html


References for Lecture #9

http://www.ncbi.nlm.nih.gov/Sitemap/index.html#GenBank
Baxevanis and Ouellette, Bioinformatics 2nd Ed, Wiley-Interscience, New York, 2001
Misener and Krawetz, Bioinformatics Methods and Protocols, Humana Press, Totowa, NJ, 2000


References for Lecture #10

Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998
http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#Expect
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://www.calstatela.edu/faculty/jmomand/FacultyWorkshop_01/ProbabilityMethods_files/v3_document.htm
http://www.wikipedia.org/wiki/Bayes%27_theorem


References for Lecture #11

http://www.math.tau.ac.il/~rshamir/algmb/scribe00/html/lec03/lec03.html
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 1998
http://cmgm.stanford.edu/classes/csuh/literature/
Misener and Krawetz, Bioinformatics Methods and Protocols, Humana Press, Totowa, 2000
http://searchlauncher.bcm.tmc.edu:9331/help/AlignmentScore.html
http://barton.ebi.ac.uk/papers/rev93_1/subsection3_7_6.html
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html


References for Lecture #12

http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
Lectures: Database search (4/16) and Rationale for DB Searching (5/16)
Computational Molecular Biology – An Algorithmic Approach, Pavel Pevzner
Introduction to Computational Biology – Maps, sequences, and genomes, Michael Waterman
Algorithms on Strings, Trees, and Sequences – Computer Science and Computational Biology, Dan Gusfield
http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignments.html


References for Lecture #15

Misener and Krawetz (Eds), Bioinformatics, Methods and Protocols, vol. 132, Humana Press, Totowa, 2000 pp 185-219
Baxevanis and Ouellette, Bioinformatics 2nd ed. , Wiley-Interscience, New York, 2001
http://www.dkfz-heidelberg.de/tbi/bioinfo/Variants/LocalAli/
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/BLAST/slide_list.html

References for Lecture #17

Pietrokovski et al., The Blocks Database-A system for protein classification.
Baxevanis and Ouellette, Bioinformatics 2nd edition, Wiley-Interscience, New York, 2001
http://www.sbc.su.se/~arne/kurser/swell/secstrpred.html
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10493868&dopt=Abstract
http://www.chembio.uoguelph.ca/educmat/chm730/f730.html


References for Lecture #18

http://www.usm.maine.edu/~rhodes/SPVTut/text/SPdbVTut.html
http://www.sbc.su.se/~arne/kurser/swell/secstrpred.html
http://www.expasy.ch/swissmod/course/text/chapter6.htm
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml


References for Lecture #19

http://swissmodel.expasy.org//course/text/chapter6.htm
Wang et al., Nucleic Acids Research 28, 243-245, 2000.
http://www.iucr.org/iucr-top/comm/ccom/School96/pdf/sb.pdf
