Pattern Discovery and Motif Finding in DNA
with Neil Jones

When William Legrand, from Edgar Allan Poe's short story "The Gold-Bug", found a parchment written by the pirate Captain Kidd:

53& & $305) ) 6*; 4826) 4& .) 4& ) :806*; 48$8{60) ) 85; 1& ( ; :& *8$83( 88) 5*$; 46( 88   *96*?; 8) *& ( ; 485) ; 5*$2:*& ( ; 4956*2( 5*--4) 8{8*; 4069285) ; ) 6$8) 4& & ; 1( & 9; 48081; 8:8& 1; 4885; 4) 485$528806*81( & 9; 48; ( 88; 4( & ?34; 48) 4& ; 161; :188; & ?;

his friend told him, "Were all the jewels of Golconda awaiting me upon my solution of this enigma, I am quite sure that I should be unable to earn them." Mr. Legrand responded, "It may well be doubted whether human ingenuity can construct an enigma of the kind which human ingenuity may not, by proper application, resolve." He noticed that a combination of three symbols, ; 48, appeared very frequently in the text. He also knew that Captain Kidd's pirates spoke English and that the most frequent English word is THE. Assuming that ; 48 coded for THE, Mr. Legrand deciphered the parchment note and found the pirate treasure (and a few skeletons as well). After this insight, Mr. Legrand had a slightly easier text to decipher (try to complete the deciphering):

53& & $305) ) 6*THE26) H& .) H& ) :E06*THE$E{60) ) E5T1& ( T:& *E$E3( EE) 5*$TH6( EE   *96*?TE) *& ( THE5) T5*$2:*& ( TH956*2( 5*--H) E{E*TH0692E5) T) 6$E) H& & T1( & 9T HE0E1TE:E& 1THE$E5TH) HE5$52EE06*E1( & 9THET( EETH( & ?3HTHE) H& T161T1EET& ?T
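The substitution step that turns the parchment into the partially deciphered text above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_partial_key` and the short test fragment (taken from the cipher with its spacing removed) are assumptions, and the key contains nothing beyond the three symbols Mr. Legrand guessed first.

```python
def apply_partial_key(ciphertext: str, key: dict[str, str]) -> str:
    """Replace every symbol that has a known meaning; leave the rest untouched."""
    table = str.maketrans(key)
    return ciphertext.translate(table)


# Partial key from the guess that the three symbols ;48 spell THE.
PARTIAL_KEY = {";": "T", "4": "H", "8": "E"}

if __name__ == "__main__":
    # A short fragment of the cipher, spacing removed (illustrative only).
    fragment = "6*;4826)4"
    print(apply_partial_key(fragment, PARTIAL_KEY))
```

Each further guess simply adds an entry to the key, so the text becomes more readable with every step.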

There are some similarities between Mr. Legrand's task and making sense of the human genome. You may try your hand at deciphering it too, particularly now that the 3 billion letters of the human genome have just become public knowledge. However, DNA texts are not easy to decipher, and there is little doubt that Nature can construct an enigma of the kind which human ingenuity may not resolve. Computational molecular biology has borrowed Mr. Legrand's scientific method, with frequent words corresponding to signals in DNA. However, Nature is more inventive than Captain Kidd and, surprisingly enough, finding frequent words in genomes is a very hard problem. The work on deciphering signals in the human genome is about to start, and it faces a number of combinatorial challenges.
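As a minimal sketch of the frequent-words idea (not the approach developed in this work), the following Python counts every overlapping word of length k in a text and reports the most frequent ones. The function name `most_frequent_words`, the toy sequence, and the choice k = 4 are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_words(text: str, k: int) -> tuple[list[str], int]:
    """Return all length-k words with the highest count, and that count.

    Overlapping occurrences are counted; tied words are all returned, sorted.
    """
    if k <= 0 or k > len(text):
        return [], 0
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    best = max(counts.values())
    return sorted(word for word, c in counts.items() if c == best), best


if __name__ == "__main__":
    # Toy sequence invented for illustration, not real genomic data.
    dna = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
    words, count = most_frequent_words(dna, 4)
    print(words, count)
```

Exact counting of this kind only finds words that are repeated verbatim. Signals in DNA typically vary from one occurrence to the next, which is one reason the simple method that served Mr. Legrand does not carry over directly.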

Signal finding (pattern discovery) in DNA sequences is a fundamental problem in both computer science and molecular biology, with important applications in locating regulatory sites and identifying drug targets. Despite many studies, this problem is far from being resolved: most signals in DNA sequences are so complicated that we don't yet have good models or reliable algorithms for their recognition. We develop new combinatorial approaches to signal finding in DNA. Our recent focus is on using comparative genomics for discovering new regulatory motifs in mammalian genomes.