
Full Text Search Shenanigans

This great blog post from 2013 describes using bloom filters to build a space-efficient full text search index for small numbers of documents. The algorithm is simple: Per document, create a bloom filter of all its words. To query, simply check each document's bloom filter for the query terms.
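The scheme fits in a few lines of Python. This is a toy sketch, not the blog post's implementation: SHA-256 stands in for the k hash functions, and the filter parameters m and k are untuned.

```python
import hashlib

def bloom_positions(word, m=64, k=3):
    # Derive k bit positions in [0, m) from one word (toy hashing, not tuned).
    digest = hashlib.sha256(word.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]

def make_filter(words, m=64, k=3):
    # One bloom filter per document: set k bits for each word.
    bits = 0
    for word in words:
        for pos in bloom_positions(word, m, k):
            bits |= 1 << pos
    return bits

def search(index, terms, m=64, k=3):
    # Check every document's filter. Bloom filters never miss a real match,
    # but they may report false positives.
    needed = make_filter(terms, m, k)
    return [doc for doc, bits in index.items() if bits & needed == needed]

docs = {
    "a.html": "the cat sat on the mat",
    "b.html": "dogs chase cars",
}
index = {doc: make_filter(text.split()) for doc, text in docs.items()}
print(search(index, ["cat"]))  # always includes "a.html"; "b.html" only on a (rare) false positive
```

Note that `search` visits every document's filter, which is exactly the O(number-of-documents) query cost that makes scaling hard.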

With a query time complexity of O(number-of-documents), we can forget about using this on big corpuses, right? In this blog post I propose a way of scaling the technique to large document corpuses (e.g. the web) and discuss why that is a bad idea.

Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild. But let's get into it.

Why even try this?

The bloom filter index's big selling point is its small size. It allows static websites with dozens of pages to ship a full text search index to the client that is no larger than a small image. An equivalent inverted file index, which is the traditional textbook approach for keyword-based full text search, would be several times bigger.

But index size is not only relevant on small blog websites. If we could scale this technique to larger document corpuses and achieve similar space savings, that would be huge!

The main thing that seems to stand in our way is query performance. Instead of always checking every document's bloom filter, we will try to construct an index that only checks a small subset of filters, but still finds all matching documents.

Some Ideas that don't work at all

Look, I brainstormed a bunch of ideas for how to improve the bloom filter based index. I will quickly go over two of them, because identifying and discarding ideas that will not work is an important part of science and engineering.

Sort the filters

If we sort the filters by some metric, for example by interpreting each filter's bits as a binary number, then we can use a binary search algorithm or something like that, right? - Wrong.

We can construct a simple counter example to show that this does not work: a query that is matched by the first and the last filter in a sorted list of filters, but not by the one in between.
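The counter example in miniature, with 3-bit filters (a filter matches when the query's bits are a subset of its bits):

```python
# Three 3-bit document filters, sorted ascending by integer value.
filters = sorted([0b001, 0b010, 0b101])  # -> [1, 2, 5]
query = 0b001                            # the query term hashed to bit 0

matches = [i for i, f in enumerate(filters) if f & query == query]
print(matches)  # [0, 2] -- the first and last filter match, the middle one does not
```

Because the matches are not contiguous in the sorted order, there is no predicate for a binary search to branch on.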


Tree of filters

Plain sorting does not work, but what if we structure our big set of filters into a tree, with the document filters at the leaves? Imagine it sort of like this, but much bigger.


At each branch node, we construct an aggregate filter that encodes all documents that are reachable from the branch. Aggregate filters are constructed by a simple bitwise OR of their children's filters.


When we get a query, we first check it against the top level branch filters. If a filter does not match e.g. the filter for document 6-10, we can discard that entire branch of the tree for this query.
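The pruning can be sketched like this, reusing the integer-bitmask filters from before and a binary tree built by pairing nodes; the shapes and document names are made up for illustration:

```python
from functools import reduce

def build_tree(leaves):
    # leaves: list of (doc, filter_bits); pair nodes up until one root remains.
    nodes = [{"doc": d, "bits": b, "children": []} for d, b in leaves]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            group = nodes[i:i + 2]
            # The aggregate filter is the bitwise OR of the children's filters.
            agg = reduce(lambda acc, n: acc | n["bits"], group, 0)
            paired.append({"doc": None, "bits": agg, "children": group})
        nodes = paired
    return nodes[0]

def query_tree(node, q):
    if node["bits"] & q != q:
        return []  # no document below this branch can match: prune it all
    if not node["children"]:
        return [node["doc"]]
    return [d for child in node["children"] for d in query_tree(child, q)]

leaves = [("d1", 0b0011), ("d2", 0b0101), ("d3", 0b1100), ("d4", 0b1010)]
root = build_tree(leaves)
print(query_tree(root, 0b0001))  # ['d1', 'd2'] -- the d3/d4 branch is pruned
```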

Ideally we would like to search as few branches of the tree as possible to improve performance. How many branches we need to search depends heavily on how the documents are partitioned. Intuitively, we can think of it like this: branch A should contain all documents that contain the words "dog", "cat", "bird". Branch B should contain documents with "car", "bus", "plane". The fewer branches each word is contained in, the better.

Here comes the problem: What if there is a document that says "I took my cat on the bus today"?
This breaks our assumption above, and suddenly for the query "bus" we need to search both branches. You can imagine this happening for almost every word in the dictionary across all branches, because language is complex and lets us say so many different things in many different contexts.

Or in other words: Text documents are high-dimensional.
I recommend reading about the curse of dimensionality for an intuition of what issues this implies. Here it means that it is basically impossible to cluster text documents into disjoint subsets without significant overlap. For our search index that means that even when using this tree, we would still need to search almost every document for every query.

Inverted Index of Bloom Filters

The problem with our previous tree-based idea is that there is so much overlap between text documents. But I know one book that only contains every word exactly once: The dictionary. We can construct a search tree of the entire dictionary, again based on bloom filters. Each leaf represents a set of words. At each leaf we keep a list of pointers to every document's filter that contains one of those words.

A diagram of a simplified search tree over the english dictionary

Maybe not so coincidentally, this looks a lot like an inverted file index. And it works! For any query term we can walk the tree to the leaf that contains the query term and then we match only against the filters at that leaf. Instead of a hash table, as in the inverted file index, our index uses a tree for the dictionary, but fundamentally it does a similar thing.
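A sketch of the leaf lookup, flattened to a single level for brevity (the filenames are hypothetical, and `filt` is the same toy SHA-256 filter construction as above):

```python
import hashlib

def filt(words, m=32, k=2):
    # Toy bloom filter over a set of words (SHA-256 stands in for k hashes).
    bits = 0
    for w in words:
        digest = hashlib.sha256(w.encode()).digest()
        for i in range(k):
            bits |= 1 << (int.from_bytes(digest[4 * i:4 * i + 4], "big") % m)
    return bits

# Two dictionary leaves, split roughly by topic. Each leaf stores a filter
# of its word range plus a posting list of (doc, doc_filter) pairs.
leaves = [
    {"words": filt(["bird", "cat", "dog"]),
     "postings": [("pets.html", filt(["cat", "dog"]))]},
    {"words": filt(["bus", "car", "plane"]),
     "postings": [("travel.html", filt(["bus", "plane"]))]},
]

def lookup(term):
    q = filt([term])
    results = []
    for leaf in leaves:
        if leaf["words"] & q != q:
            continue  # term cannot be in this part of the dictionary
        results += [doc for doc, f in leaf["postings"] if f & q == q]
    return results

print(lookup("cat"))  # always includes "pets.html"
```

In the full design the leaves hang off a search tree, so finding the right leaf costs a logarithmic walk instead of a scan over all leaves as here.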

The big difference is that the tree can be smaller than the hash table. Remember, size is the main reason to attempt this at all. Not only is there no empty space in a tree, but we also encode all the words in our bloom filters instead of storing them outright. Modern bloom filter alternatives (so-called xor filters) require about ten bits per element [1], much less than the eight bits per character required to store a full word.

Why all of this is still a bad idea

We did it! We have a working idea for a bloom filter based search index that works for large document corpuses. The query time complexity is not as good as for an inverted file index, but it is logarithmic with the number of documents. That is good enough if you ask me. So why do I write that it is still a bad idea?

Let's think about what allows the bloom filter based index to be small again. Instead of storing the entire dictionary in our index, we use bloom filters that require about ten bits per word. Ten bits per word. Ten bits per every word. Not unique word. Every word in our document corpus (except duplicates in the same document). While an inverted file index must store every word in the dictionary exactly once, sharing that space across all documents that reuse the word, bloom filters cannot share space amongst each other. Every document's bloom filter must encode all words in the document from scratch. If a word is contained in hundreds of thousands of documents, that requires much more space than simply storing the word in plain text. As a consequence, the more documents we have, the less efficient our index becomes compared to an inverted file index, eventually requiring more space than the latter.
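A back-of-envelope calculation makes the crossover visible. All numbers here are made up but plausible: ten bits per filter entry, six-letter words, and roughly four bits per delta-compressed posting in the inverted index:

```python
BITS_PER_FILTER_ENTRY = 10   # ~10 bits per element in a modern filter
AVG_WORD_BITS = 6 * 8        # a six-letter word stored as plain text
BITS_PER_POSTING = 4         # assumed cost of one compressed posting

def bloom_index_bits(num_docs, unique_words_per_doc):
    # Every document's filter encodes its own words from scratch.
    return num_docs * unique_words_per_doc * BITS_PER_FILTER_ENTRY

def inverted_index_bits(dictionary_size, total_postings):
    # Each word is stored once; reuse only costs one posting per document.
    return dictionary_size * AVG_WORD_BITS + total_postings * BITS_PER_POSTING

# 500 unique words per document, drawn from a 100,000-word dictionary:
for num_docs in (50, 1_000_000):
    bloom = bloom_index_bits(num_docs, 500)
    inverted = inverted_index_bits(100_000, num_docs * 500)
    print(num_docs, "bloom wins:", bloom < inverted)
# At 50 documents the bloom index wins; at 1,000,000 the inverted index is smaller.
```

With these assumptions the dictionary cost of the inverted index is a fixed overhead, while the bloom index pays the full per-word cost over and over, so it loses once the corpus is large enough.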

Conclusions

When you have a small number of documents relative to the size of your dictionary, bloom filters can indeed achieve a much smaller full text search index than is possible traditionally.

Bloom filters are space efficient when compressing a large dictionary into a small number of filters. As more filters share the same dictionary, this efficiency decreases. Intuitively this is because bloom filters cannot share information amongst each other. Each filter must encode its entire dictionary from scratch. An inverted file index does not do this. It only stores the dictionary once and shares it for all documents, so it gets more space efficient with the number of documents.

More generally, there is no synergy between bloom filters. Each filter on its own is efficient, but as a whole system, a different approach might be more efficient. We can transfer this insight to other problem domains as well. For example, imagine a content moderation system on a social media platform that allows blocking individual accounts. If we have one global blocklist on our platform, a bloom filter can be an efficient (though maybe not ideal) implementation of this. But allow every user to create their own blocklist and a different design will be more scalable.