Stealing techniques from Apache Lucene

A deep-dive into the Lucene source code gave me the education I needed to create my own inverted-index search solution, unlocking the value from multiple terabytes of data.

ElasticSearch haunted the team. They attempted to use it to create a search solution multiple times, but in the end all they gained was massive server bills.

But we had valuable data coming in that would be a lot more valuable if we could search it quicly and easily. I offered up alternatives solutions - let’s use Spark and Hadoop, let’s just crawl the JSON on S3 and look for what we need, let’s use AWS Athena (which is a great product by the way).

No. Management didn’t want the internal users logging into AWS services, and they didn’t want to wait 10 seconds for Athena to return results, they wanted our own web interface with sub-second responses to user queries.

I was asked to setup yet another ElasticSeach solution. I agreed, but on the condition that I would take the time to actually learn about ElasticSearch first. There must be people using ElasticSearch successfully. There must be a reason our teams had failed so spectacularly to make it work.

Of course, ElasticSearch isn’t realy a search solution. ElasticSearch is a cluster computing solution. It let’s you run a cluster of Lucene instances.

My first impression was that it’s odd for us to pay all of this money to AWS for scalable services, and then setup our own cluster of ElasticSearch machines… couldn’t an AWS service worry about the scaling for us?

The “elastic” part of the whole thing was not immediately interesting to me, anyway. I want to understand how the search works. And why it wasn’t working for us.

Thankfully, Lucene has great documentation and source code.

Two concepts are at the forefront when you start looking at Lucene: tokenization and inverted index. Lucene’s inverted index is implemented using a skip list.

For the project I was working on, tokenization by Lucene was 100% not needed. Our data was already processed into structured fields. The tokenization part was done. Lucene is designed for use cases like full-text search, but we didn’t have full-text, we have JSON fields.

So I was able to skip over worrying about tokenization and start learning about inverted indexes and skip lists.

The idea of an inverted-index is that you have a collection of all of the tokens (i.e. in a tree or list), and when you look up a token, it points to all of the documents where that token was found. Specifically, in Lucene the data for each token contains the document, line number and character number, so you can go right to each occurence.

A key thing here is that Lucene not only tokenizes the documents and stores the index, it also stores all of the documents. So even though we already had terabytes of data on S3, we’d have to duplicate all of that data in ElasticSearch. No wonder it was costing so much money… terabytes on EC2/EBS is a lot more expensive than terabytes on S3.

I started picturing ways in which we could implement our own indexes using skip lists. For instance, it would be slow an unwieldy, but the whole index could be implemented as S3 objects.

Fortunately, AWS already provides a service that let’s you create huge and fast indexes without too much effort: DynamoDB. If the company weren’t an AWS shop, I’d also look into just putting the index in Postgres.

After a week of researching, I was able to create a DynamoDB based search solution in just a few hours. I wrote Python code to process the JSON documents and write the tokens to Dynamo. Each Dynamo record contained the URI for an S3 document, along with line and column number, just like how I saw Lucene do it.

I was then able to easily create a serverless JS application to provide access to the search. After logging in with Amazon Cognito, users are provided an IAM role that allows access to Dynamo and S3, a query in the web app becomes a Dynamo query via the AWS API, then Dynamo returns the first page of results extremely quickly, and using the AWS API, the data from the S3 documents start streaming into the app asynchronously, with the DOM continuosly updating to show new results.

I'm working on a long-form article describing this project, complete with code samples, code review and screenshots. If you'd like to access an early draft of the content, enter your email below and I will get in touch.