Introduction to
Introduction to Bioinformatics.
Introduction toBioinformatics1
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to Bioinformatics.LECTURE 3:SEQUENCE ALIGNMENT*Chapter 3:All in the family2
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.1Eye of the tiger* In 1994 Walter Gehringet alum(Un. Basel) turn thegene eyelessonin various places onDrosophilamelanogaster* Result: on multiple places eyes are formed* ‘eyeless’ is a master regulatory gene that controls +/-2000 other genes* ‘eyeless’ on induces formation of an eye3
Mutant Drosophila melanogaster: gene ‘EYELESS’ turned on
Eyeless Drosophila4
LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes
MutantDrosophila melanogaster: gene ‘EYELESS’ turnedon5
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes6
LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: HOX homeoboxes
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENTHOMEO BOXA homeobox is a DNA sequence found withingenes that are involved in the regulation ofdevelopment (morphogenesis) of animals, fungiand plants.7
LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: PAX homeoboxes
LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: HOX homeoboxes8
LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes
LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: PAX homeoboxes9
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes10
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.2On sequence alignmentSequence alignment is the most important task inbioinformatics!11
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.2On sequence alignmentSequence alignment is important for:* prediction of function* database searching* gene finding* sequence divergence* sequence assembly12
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.3On sequence similarityHomology: genes that derive from a common ancestor-geneare calledhomologsOrthologousgenes are homologous genes indifferentorganismsParalogousgenes are homologous genes inoneorganismthat derive fromgene duplicationGene duplication: one gene is duplicated in multiple copiesthat therefore free to evolve and assume new functions13
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS14
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS versus ANALOGOUS
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS15
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT: sequence similarity
LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS versusANALOGOUS16
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT:sequence similarityCauses for sequence (dis)similaritymutation:a nucleotide at a certain location is replaced byanothernucleotide (e.g.: ATA AGA)insertion:at a certain location one new nucleotide isinserted inbetween two existing nucleotides(e.g.: AA AGA)deletion:at a certain location one existing nucleotideis deleted (e.g.: ACTG AC-G)indel:aninsertion or adeletion17
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.4Sequence alignment: global and localFind the similarity between two (or more)DNA-sequences by finding a goodalignmentbetween them.18
The biological problem ofsequence alignmentDNA-sequence-1tcctctgcctctgccatcat---caaccccaaagt|||| ||| ||||| |||||||||||||||||tcctgtgcatctgcaatcatgggcaaccccaaagtDNA-sequence-2Alignment19
Sequence alignment - definitionSequence alignmentis an arrangement of two or more sequences,highlighting their similarity.The sequences are padded withgaps(dashes) so that wherever possible,columns containidentical charactersfrom the sequences involvedtcctctgcctctgccatcat---caaccccaaagt|||| ||| ||||| |||||||||||||||||tcctgtgcatctgcaatcatgggcaaccccaaagt20
AlgorithmsNeedleman-WunschPairwiseglobalalignment only.Smith-WatermanPairwise,local(or global) alignment.BLASTPairwiseheuristiclocal alignment21
PairwisealignmentPairwise sequence alignmentmethods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA(nucleic acid) sequences.Typically, the purpose of this is to findhomologues(relatives) of a gene or gene-product in a database of known examples.This information is useful for answering a variety of biological questions:1. The identification of sequences ofunknownstructure or function.2. The study ofmolecular evolution.22
GlobalalignmentAglobal alignmentbetween two sequences is an alignment in which all thecharacters in both sequences participate in the alignment.Global alignments are useful mostly for finding closely-related sequences.As these sequences are also easily identified by local alignment methods globalalignment is now somewhat deprecated as a technique.Further, there are several complications to molecular evolution (such asdomainshuffling) which prevent these methods from being useful.23
Global AlignmentFind theglobalbest fit between two sequencesExample: the sequencess=VIVALASVEGASandt=VIVADAVISalign like:A(s,t) =V I V ALASVE G AS| | | ||||V I V ADA-V- -ISindels24
The Needleman-Wunsch algorithmTheNeedleman-Wunsch algorithm(1970, J Mol Biol.48(3):443-53) performs a global alignment on twosequences (sandt) and is applied to align protein ornucleotide sequences.The Needleman-Wunsch algorithm is an example ofdynamic programming, and is guaranteed to find thealignment with the maximum score.25
The Needleman-Wunsch algorithmOf course this works for both DNA-sequencesas for protein-sequences.26
Alignment scoring functionThe cost of aligning two symbolsxiandyjis thescoring functionσ(xi,yj)27
Alignment costThe cost of the entire alignment:28
A simple scoring functionσ(-,a) = σ(a,-) = -1σ(a,b) = -1 ifabσ(a,b) = 1 ifa=b29
The substitution matrixA more realistic scoring function is given by thebiologically inspired substitution matrix :-AGCTA10-1-3-4G-17-5-3C-3-590T-4-308Examples:*PAM(Point Accepted Mutation) (Margaret Dayhoff)*BLOSUM(BLOck SUbstitution Matrix) (Henikoff and Henikoff)30
Scoring functionThe cost for aligning the two sequencess=VIVALASVEGASandt=VIVADAVIS:A(s,t) =is:M(A)= 7 matches + 2 mismatches + 3 gaps= 7 2 3=2V I V ALASVE G AS| | | ||||V I V ADA-V- -IS31
Optimal global alignmentTheoptimal global alignmentA* between two sequencessandtis the alignmentA(s,t) that maximizes the totalalignment scoreM(A) over all possible alignments.A* = argmaxM(A)Finding the optimal alignmentA* looks a combinatorialoptimization problem:i. generate all possible allignmentsii. compute the scoreMiii. select the alignmentA* with the maximum scoreM*32
LocalalignmentLocal alignmentmethods find related regionswithinsequences - they canconsist of asubsetof the characters within each sequence.For example,positions 20-40ofsequence Amight be aligned withpositions50-70ofsequence B.This is a more flexible technique thanglobal alignmentand has the advantagethatrelated regionswhich appear in a different order in the two proteins (which isknown asdomain shuffling) can be identified as being related.This isnotpossible withglobal alignmentmethods.33
The Smith Waterman algorithmTheSmith-Waterman algorithm(1981) is for determining similar regionsbetween two nucleotide or protein sequences.Smith-Waterman is also a dynamic programming algorithm and improveson Needleman-Wunsch. As such, it has the desirable property that it isguaranteed to find theoptimal local alignmentwith respect to the scoringsystem being used (which includes the substitution matrix and the gap-scoring scheme).However, the Smith-Waterman algorithm isdemanding of time andmemoryresources: in order to align two sequences of lengths m and n,O(mn) time and space are required.As a result, it has largely been replaced in practical use by theBLASTalgorithm; although not guaranteed to find optimal alignments, BLAST ismuch more efficient.34
Optimal local alignmentTheoptimal local alignmentA* between twosequencessandtis theoptimal global alignmentA(s(i1:i2),t(j1:j2)) of the sub-sequencess(i1:i2)andt(j1:j2) for some optimal choice of i1, i2,j1and j2.35
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Sequence alignment - meaningSequence alignment is used to study theevolutionof the sequencesfrom acommon ancestorsuch as protein sequences or DNAsequences.Mismatchesin the alignment correspond tomutations,andgapscorrespond toinsertionsordeletions.Sequence alignment also refers to the process ofconstructingsignificantalignmentsin a database of potentially unrelated sequences.36
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT: statistical analysis
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.5Statistical analysis of alignmentsThis works identical to gene finding:* Generate randomized sequences based on thesecond string* Determine the optimal alignments of the firstsequence with these randomized sequences* Compute a histogram and rank the observedscore in this histogram* The relative position defines thep-value.37
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT: statistical analysisHistogram of scores of randomlygenerated strings using permutationof original sequencetoriginal sequences38
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.6BLAST:fast approximate alignmentFast but heuristicMost used algorithm in bioinformaticsVerb:toblast39
40
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
41
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.7Multiple sequence alignment:Determine the best alignment between multiple(more than two) DNA-sequences.42
Multiple alignmentMultiple alignment is an extension of pairwise alignment toincorporate more than two sequences into an alignment.Multiple alignment methods try to align all of the sequencesin a specified set.The most popular multiple alignment tool isCLUSTAL.Multiple sequence alignment is computationally difficult andis classified as an NP-Hard problem.43
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Multiple alignment44
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT3.8Computing the alignments* NW and SW are both based on DynamicProgramming (DP)* A recursive relation breaks down the computation45
Dynamic ProgrammingApproach toSequence AlignmentThedynamic programming approachtosequence alignment always tries tofollow the best prior-resultso far.Try to align two sequences by inserting some gaps at different locations, so as tomaximizethe score of this alignment.Score measurementis determined by"match award", "mismatch penalty" and"gap penalty".The higher the score, the better the alignment.If both penalties are set to 0, it aims to always find an alignment withmaximummatches so far.Maximum match= largest number matches can have for one sequence byallowing all possible deletion of another sequence.It is used to compare the similarity between two sequences of DNA or Protein, topredict similarity of their functionalities.Examples:Needleman-Wunsch(1970), Sellers(1974), Smith-Waterman(1981)46
The Needleman-Wunsch algorithmTheNeedleman-Wunsch algorithm(1970, J Mol Biol. 48(3):443-53) performs aglobal alignment on two sequences (A and B) and is applied to align protein ornucleotide sequences.The Needleman-Wunsch algorithm is an example ofdynamic programming,and is guaranteed to find the alignment with the maximum score.Scores for aligned characters are specified by the transition matrixσ (i,j):thesimilarityof charactersiandj.47
The Needleman-Wunsch algorithmFor example, if the substitution matrix was-AGCTA10-1-3-4G-17-5-3C-3-590T-4-308with agap penaltyof-5, would have the following score...then the alignment:AGACTAGTTACCGA---GACGT48
The Needleman-Wunsch algorithm1.Create a table of size (m+1)x(n+1) for sequencessandtof lengthsmandn,2.Fill table entries (m:1) and (1:n) with the values:3.Starting from the top left, compute each entry using the recursive relation:4.Perform the trace-back procedure from he bottom-right corner49
The Needleman-Wunsch algorithmOnce the F matrix is computed, note that the bottom right hand corner of thematrix is the maximum score for any alignments. To compute which alignmentactually gives this score, you can start from the bottom left cell, and compare thevalue with the three possible sources(Choice1, Choice2, and Choice3 above) tosee which it came from. If it was Choice1, then A(i) and B(i) are aligned, if it wasChoice2 then A(i) is aligned with a gap, and if it was Choice3, then B(i) is alignedwith a gap.50
The Needleman-Wunsch algorithm51
52
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
The Smith-Waterman algorithm1.Create a table of size (m+1)x(n+1) for sequencessandtof lengthsmandn,2.Fill table entries (1,1:m+1) and (1:n+1,1) with zeros.3.Starting from the top left, compute each entry using the recursive relation:4.Perform the trace-back procedure from the maximum element in the table tothe first zero element on the trace-back path.53
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENTEXAMPLE: Eyeless Gene HomeoboxCompare the gene eyeless of Drosophila Melanoganster with the human geneaniridia. They are master regulatory genes producing proteins that control largecascade of other genes. Certain segments of genes eyeless of Drosophilamelanogaster and human aniridia are almost identical. The most important ofsuch segments encodes the PAX (paired-box) domain, a sequence of 128 aminoacids whose function is to bind specific sequences of DNA. Another commonsegment is the HOX (homeobox) domain that is thougth to be part of more than0.2% of the total nummber of vertebrate genes.54
Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT55
Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT
Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT56
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT57
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT58
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT
END of LECTURE 359
Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT60
61