The BooSTL Algorithms: Boost Algorithms That Extend the STL (2/3)
One good way to extend our knowledge beyond the STL is to learn the Boost STL-like algorithms. I like to call them the BooSTL algorithms. To cover all the contents in this algorithms library, we have chunked up the story into three parts:
- the BooSTL algorithms on sorting and partitioning,
- the BooSTL algorithms on searching,
- the other BooSTL algorithms.
After seeing the BooSTL algorithms on sorting and partitioning, we are now going to focus on BooSTL algorithms on searching.
The algorithms that Boost offers about searching consist in searching a pattern inside of a range, like a word inside a sentence.
Note that the STL also allows that with the std::search
algorithm, even if it’s not well-known. But the complexity of std::search
is allowed to be (size of pattern) * (size of range), which could be implemented with the naive algorithm of comparing the pattern with the first N elements of the range, then with the next N elements and so on.
But there are faster ways, at least in algorithmic complexity, to perform that search. Boost offers 3 of them (the last two were included in the C++ standard in C++17):
- the Knuth-Morris-Pratt algorithm,
- the Boyer-Moore algorithm,
- the Boyer-Moore-Horspool algorithm.
Knuth-Morris-Pratt
The idea behind the Knuth-Morris-Pratt algorithms is that when a pattern of size N does not correspond to the sub-range [0, N-1), we don’t necessarily try again at sub-range [1, N) which is immediately after it.
Instead, the algorithm considers the first element where the pattern did not match the sub-range, and depending on the pattern, skip some neighbouring sub-ranges that have no chance of matching. For instance, if we search for the pattern “ABCDEF” and the lookup in the sub-range [0, N-1) matches “ABC” but fails at character ‘D’, no need to try comparing the pattern with the sub-ranges [1, N) nor [2, N+1), because they certainly don’t start with an ‘A’ (otherwise the first lookup would not have matched “ABC“).
So for each element in the pattern, there is a new place to start if a lookup fails at that element. All this information is stored in a table. More on the Knuth-Morris-Pratt algorithms on its wikipedia page. The fact that the algorithms skips the places where the search has no chance to succeed gives it a better worse-case complexity of O(size of pattern + size of searched range).
It is interesting to note that the table only depends on the pattern, and not on the range that we search the pattern in. So we can reuse the same table for looking up a pattern in several ranges. This is why Boost lets you build an object that contains the table with make_knuth_morris_pratt
and has an operator()
to search it in a range:
#include <iostream> #include <string> #include <boost/algorithm/searching/knuth_morris_pratt.hpp> int main() { std::string sentence = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, " "it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season " "of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had " "nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the " "period was so far like the present period, that some of its noisiest authorities insisted on its being received, " "for good or for evil, in the superlative degree of comparison only."; std::string word = "incredulity"; auto searcher = boost::algorithm::make_knuth_morris_pratt(word); auto wordPosition = searcher(sentence); if (wordPosition.first != end(sentence)) { std::cout << "The word " << word << " goes from position " << std::distance(begin(sentence), wordPosition.first) << " to position " << std::distance(begin(sentence), wordPosition.second); } }
This function returns a pair of iterators, that contains the begin and end position of the sub-range equal to the pattern (or twice the end of the searched range if it wasn’t found). The above code outputs:
The word incredulity goes from position 158 to position 169
However, if, like in the above code, you only need to perform one search, use the knuth_morris_pratt_search
that builds a table to store the potential places to search, performs the search all in the same function:
#include <iostream> #include <string> #include <boost/algorithm/searching/knuth_morris_pratt.hpp> int main() { std::string sentence = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, " "it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season " "of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had " "nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the " "period was so far like the present period, that some of its noisiest authorities insisted on its being received, " "for good or for evil, in the superlative degree of comparison only."; std::string word = "incredulity"; auto wordPosition = boost::algorithm::knuth_morris_pratt_search(sentence, word); if (wordPosition.first != end(sentence)) { std::cout << "The word " << word << " goes from position " << std::distance(begin(sentence), wordPosition.first) << " to position " << std::distance(begin(sentence), wordPosition.second); } }
The Knuth-Morris-Pratt algorithm is available in the header boost/algorithm/searching/knuth_morris_pratt.hpp.
Boyer-Moore
Boyer-moore is probably the most popular algorithm of string searches. Like Knuth-Morris-Pratt, it consists in not examining hopeless sub-ranges based on a pre-calculated table, but it operates differently.
Boyer-moore starts by comparing the pattern with the first sub-range of the searched range, but performs its comparisons backwards: it compares the last letter of the pattern with the last letter of the sub-range, then the letter before it, and so on. When the pattern fails to match the sub-range, the algorithm has two ways to choose how much to skip to examine a next sub-range (and it chooses whichever one allows to skip farthest):
The first way is very much like Knuth-Morris-Pratt (but backwards): when encountering an element of the pattern that fails to match, the algorithm looks up in its pre-computed table how many neighbouring sub-ranges are hopeless to check, given that the latest element of the pattern matched the latest element of the sub-range. This is called the “bad character rule“.
The second way consists in considering the suffix of the pattern that did match the suffix of the subrange (if there are any elements that did match). The algorithm then moves up the pattern so that the next occurrence of that suffix inside the pattern lines up with the suffix of the sub-range. And if there is no other occurrence of the suffix in the pattern, then it shifts the pattern so that a prefix of the patterns lines up with a suffix of the sub-range. And if there isn’t even such a prefix, then the algorithm shifts the pattern by its whole length. This second way is called the “good suffix rule“.
So every time the pattern fails to match the sub-range, the Boyer-Moore algorithm skips some sub-range based on the bad character rule or the good suffix rule, whichever allows it to skip the most. For more details about the Boyer-Moore algorithm I recommend this visual tutorial from Ben Langmead and the wikipedia page.
Boost offers the Boyer-Moore algorithm with two interfaces (like for Knuth-Morris-Pratt): one with an object that contains the table of the pattern and that can be used for searching it in several ranges:
auto searcher = boost::algorithm::make_boyer_moore(word); auto wordPosition = searcher(sentence); auto wordOtherPosition = searcher(otherSentence);
And one to make only one search of the pattern:
auto wordPosition = boost::algorithm::boyer_moore_search(sentence, word);
The Boyer-Moore algorithm is available in the header boost/algorithm/searching/boyer_moore.hpp.
Boyer-Moore-Horspool
If you’ve understood Boyer-Moore, then you’ll get Boyer-Moore-Horspool immediately, as it is a simplified version of Boyer-Moore, that only has the bad character rule, and not the good suffix rule.
So the Boyer-Moore-Horspool is like Knuth-Morris-Pratt, except that the elements inside the pattern and the searched sub-range are compared backwards (if you understand this sentence, it means you’ve got it all).
Like for the other searching algorithms Boost has two interfaces for Boyer-Moore-Horspool, one with an object that contains the table for a pattern and that can be reused for searching it in several ranges:
auto searcher = boost::algorithm::make_boyer_moore_horspool(word); auto wordPosition = searcher(sentence); auto wordOtherPosition = searcher(otherSentence);
And one with just one function call:
auto wordPosition = boost::algorithm::boyer_moore_horspool_search(sentence, word);
The Boyer-Moore algorithm is available in Boost in the header boost/algorithm/searching/boyer_moore_horspool.hpp.
Those are the searching algorithms Boost brings on top of the STL. If you see some algorithms missing, drop a comment and I’ll add them. Next up, the final chapter on BooSTL algorithms: the other BooSTL algorithms!
Hum… I know this chapter header doesn’t sound very impacting. But it turns our that the rest of the BooSTL algorithms are scattered across various families of algorithms. Anyway they’re cool (and much easier than the searching algorithms), so stay tuned!
Related articles:
- the BooSTL algorithms on sorting and partitioning,
- the other BooSTL algorithms.
Share this post!