 # What is the best algorithm to find the most occurrence of a word in text?

Does anyone know what is the best algorithm to find the most occurrence of a word in text (very large), so hash map is not a solution I’m looking for. I’m thinking of either “Trie” or “Suffix Array” but I haven’t had a clear idea how to solve it. Any idea or suggestion would be greatly appreciated.

For example: This is is is an example.
Then the most occurrence is “is”.

``````c = first char
while c not text end
c = get char
if c terminates word
get actual word from trie
increment counter and check if it is max
else
move in trie pointer using c
``````

after this word

``````ROOT - t - h - i - s (1) // <- most occurences
``````

after first is

``````ROOT + t - h - i - s (1) // <- most occurences
- i - s (1)
``````

after second is

``````ROOT + t - h - i - s (1)
- i - s (2) // <- most occurences
``````

at the EOF

``````ROOT + t - h - i - s (1)
+ i - s (3) // <- most occurences
+ a - n (1)
- e - x - a - m - p - l - e (1)
``````

edit: but this works only if you can split the text to words… If your question is about substrings instead of words, then the question is if you are looking something better than substrings of length 1 (just characters).

what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.

you can use KMP or Aco-Corasick both used for pattern matching, if u have large number of pattern match with a a input text of large size u can use Aco-Corasick

the question was asked 1 year ago so i think the op would have got his answer by now. its showing up at the top of the feed because of my cmment “what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.” can you please answer it. also please add some details in ur current answer. i really dont understand how KMP or aho corasick could be used here…

I think KMP is the best one with O(n+m) refer to Needle in a haystack problem on SPOJ