Bitwise mask emulation with Solr

Warning: This blogpost has been posted over two years ago. That is a long time in development-world! The story here may not be relevant, complete or secure. Code might not be complete or obsoleted, and even my current vision might have (completely) changed on the subject. So please do read further, but use it with caution.
Posted on 28 Feb 2014
Tagged with: [ bit manipulation ]  [ solr

Solr is great for searching through a massive data collection in lots of different ways. But one thing Solr lacks is the possibility to search bitwise. And this by itself makes sense: Solr uses inverted indexing and doing bitwise operations on it’s indexes might result in a loss of performance. There are, however, some plugins that will allow you to use bitwise operations, but there might even be a more native way:

The problem

A client wanted to search through a large set of documents. These documents could either be publicly available, or only available for a specific group of users. This is implemented by the customer as an added bitwise integer field to the document, where every bit corresponds to a certain group.

A simple example on the right structure:

bit 0 - publicly available document
bit 1 - editors
bit 2 - managers
bit 3 - processors
bit 4 - administrators
bit 5 - packers
bit 6 - shippers

If a document would be visible for managers only, it would set bit 2 (access: 4), if the document would be visible for shippers, it would have set bit 6 (access: 64). Setting multiple bits could be possible as well. For instance, when setting bit 2 and 6 (access: 4 + 64 = 68), it would mean that a user needs to be both a manager AND a shipper to view the document. It’s important to realize this is an AND relation, not an OR relation (68 does not mean: the user needs to be either a manager OR a shipper).

From our API, a user could query a document but the access rights must be checked. Normally, our query would look something like this:

q=content:foobar

which would just query a document for content, but does not do any checking.

Suppose the user is part of the managers and packers group, giving the user the access group 36 (32 + 4).

q=content:foobar AND access:36

Depending on your needs, the access could either be part of the query (q), or a filterquery (fq).

This all works perfectly when you need to check against a complete match. When a user is both a manager and a shipper, the document will be returned, as the access matches.

But what if a document is solely for a manager? In that case, the document access rights would be 4, and the query would not match. In fact, there is no way to get this decently matched directly from Solr.

The solution

Even though there are some plugins that allows you to do bitwise operations in Solr, we tried to solve the issue a bit more natively. The first idea is to store all the bits separately into a multivalued field. Access with a document would become something like:

access:[4, 32]

This would make it much easier to check. Because now we could get the separate bits from the user:

q=content:foo AND (access:4 AND access:32)

This would match our document, but unfortunately, it would match much more. For instance, a document that is has access 100 (managers 4 + packers 32 + shippers 64) would match too, even though the user is NOT part of the shippers-group and thus should not be able to view the document.

The real solution

After a lot of trail and error, we didn’t come up with a decent solution that could match bits correctly. However, we tried something completely different: instead of trying to check the set bits, we tried to check the UNSET bits. Basically, all bits that are 0.

Consider we only use 8 bits for our access structure. The user is still a manager AND a packer. In this case, we would have a solr query like:

q=content:foo AND -(access:1 OR access:2 OR access:8 OR access:16 OR access:64 OR access:128)

What we did here, is basically inverting the search. Since we need to look for 4 and 32, as these are the users permissions, we are looking for any documents NOT having 1, 2, 8, 16, 64 or 128. Pretty much we are asking Solr to give back all documents that do not have any permissions the user doesn’t have either.

Does this work? For documents matching the users access this works, but the easier solution would have worked too. What happens when a user is part of the group managers, packers and shippers, totalling to an access group of 100. Could we find a document with access managers and packers?

q=content:foo AND -(access:1 OR access:2 OR access:8 OR access:16 OR access:128)

In this case, yes: our multivalued field would still be 4 and 32. Both are NOT inside the negative list we give to the user.

The other way around: would a user who only is a manager, be able to see a document that should be intended for managers AND packers? Let’s find out:

q=content:foo AND -(access:1 OR access:2 OR access:8 OR access:16 OR access:32 OR access:64 OR access:128)

Since the multivalued access field would contain both 4 AND 32, it would not match the query above, as we are checking against the absence of 32. This works too!

Conclusion:

The problems described in the first (and failing) solutions are all related to one issue: matching bits in AND mode, because you need ALL specified permissions. In the final solution this is switched to an OR search by searching for any documents matching one of the groups the user doesn’t have, and excluding those.

This solution works for this specific use-case, but a small change in the use-case might require a totally different solution. Or in an extreme case it might not even be possible to solve efficiently. Carefully analyse your access rights to find the best solution, and check its impact and limitations.

For instance, the downside of the system is that it would work only with a certain number of specified bits: you need to know which bits to mask. We are currently using multivalued ints for this purpose, so we can use up to 31 bits (could possible be more if we would use 64bit (unsigned) integers, but this hasn’t been tested yet.

All things considered, this solution is much faster and more efficient than adding different plugins into Solr, of trying to filter out documents AFTER they are returned from Solr. Filtering Solr documents afterwards should always be a very last resort. It would likely mess up scores, paging, document counts, facets and more.