DNA sequence matching problem

saj1919 · June 1, 2013, 2:37pm

I am having trouble finding an approach to solve this problem.

The input and output sequences are as follows:

input1 : aaagctgctagag

output1 : a3gct2ag2

input2 : aaaaaaagctaagctaag

output2 : a6agcta2ag

The input sequence can be as long as 10^6 characters, and the largest contiguous repeated pattern is the one considered. For example, in input2 the substring “agctaagcta” is encoded as “agcta2”, not as “agcta2gcta”.

Any help is appreciated.

Extra examples:

input : aabbaabb
output : aabb2, not a2b2a2b2
input : aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb
output : a9b9a9b9, not aaaaaaaaabbbbbbbbb2

These examples suggest that the shorter encoding is most likely the expected answer.
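
A minimal sketch of one way to read the examples above, assuming the goal is the shortest encoding built from literal characters and non-nested "pattern followed by count" blocks with a count of at least 2. That rule is an assumption, not something the problem statement confirms, and the function name `shortest_encoding` is made up for illustration. It is a plain dynamic program over prefixes, roughly cubic in the worst case, so it is only good for checking short inputs, not for 10^6 characters.

```python
def shortest_encoding(s: str) -> str:
    """Return one shortest encoding of s (assumed rule, see the notes around this block)."""
    n = len(s)
    # best[i] holds one shortest encoding found for the prefix s[:i].
    best = [""] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: emit s[i-1] as a literal character.
        cand = best[i - 1] + s[i - 1]
        # Option 2: end a repeated block at position i, using a pattern
        # of length p repeated k >= 2 times.
        for p in range(1, i // 2 + 1):
            pattern = s[i - p:i]
            k = 1
            # Extend the run of copies to the left, one copy at a time.
            while i - (k + 1) * p >= 0 and s[i - (k + 1) * p:i - k * p] == pattern:
                k += 1
                option = best[i - k * p] + pattern + str(k)
                # Keep the block only if it is strictly shorter.
                if len(option) < len(cand):
                    cand = option
        best[i] = cand
    return best[n]


print(shortest_encoding("aaagctgctagag"))  # a3gct2ag2
print(shortest_encoding("aabbaabb"))       # aabb2
```

Under this assumption the sketch reproduces output1 and both extra examples. For input2 it finds a 9-character encoding (a7gctaa2g), one character shorter than output2, so the intended rule may not be purely "shortest encoding".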

cyberax · June 2, 2013, 12:59am

I would read the input string character by character and check whether the next character is the first character of a suffix of the already processed data. If so, I increment a counter for that chunk; otherwise I just continue. I don’t know if this method is correct, but that would be my first approach.

cyberax · June 2, 2013, 10:41pm

@rakeshbubli143: why the downvote?