Using varnish to offload (and cache) your OAuth requests

Warning: This blogpost has been posted over two years ago. That is a long time in development-world! The story here may not be relevant, complete or secure. Code might not be complete or obsoleted, and even my current vision might have (completely) changed on the subject. So please do read further, but use it with caution.

« Symfony2: Implementing ACL rules in your Data Fixtures MultiParamConverter for Symfony2 »

Posted on 06 Jul 2012
Tagged with: [ oauth ] [ offloading ] [ varnish ]

For a current project both me and a colleague are working on a big API system that authenticates through an OAuth system. Normally, such an API does all the necessary OAuth checking, handling of tokens etc, but we wanted to have a system that actually offloads our authentication just the same way one could offload HTTPS traffic for keeping the API simple, extendible and even performant.

Using varnish as superglue

As a watchful reader noticed: this system only works with Varnish 3.0 or higher. This is mainly because Varnish 2.x does not accept a return(restart) in the vcl_deliver. On a 2.x system this will show up as a: "INCOMPLETE AT: cnt_deliver(196)" inside your varnislog

There are many (software) tools for offloading HTTPS traffic: stunnel, stud, and even nginx or apache can be used for offloading HTTPS so your internal systems are working with HTTP only (which makes is easier to debug traffic, if you need one reason alone to offload it). But offloading your authentication is much more difficult. Somehow, every API call means we need to check access tokens to figure out if the token is allowed. This means extra traffic, processing and worse case scenario: database lookups on EVERY SINGLE REQUEST. No fun!

Our API works with 2 different kind of identification: a context, defining what kind of site, device it is. Either a specific site, a third party side, or maybe even a mobile app. the second identification is the actual user that logged in. So combined we can have the API say that user John is not able to modify a resource on a mobile device, but it is allowed to do so on a special third party website.

Very crude setup:

 client --(https)--> [internet] --> ssl offloader ---(http)--> varnish --> api
                                                                  |
                                                                  v
                                                           oauth server

As you can see, the Varnish Caching proxy is our main system. Every incoming HTTPS request gets offloaded by the SSL offloader so we continue with only HTTP traffic. Straight from the SSL offloader we move into Varnish. Two things are happening here:

If a request to /oauth/… is made, these requests should be directly moved to the OAuth server.
Every other request should be checked for an access token, validated and moved to our real API system.

The first item is pretty easy: we can check the url and change the backend accordingly. The second item is a bit more difficult: we should check for a token in the authorization header but somehow we need Varnish to validate it.

Request hijacking

Because of lack of a better naming, we call the setup we are using “request hijacking”. Which in effect is what we do: we modify the original request into a secondary request, and we transform the result of this request back into the original request. So things might seem a bit complex, but in essence we have to consider two requests in every method. If you have ever use fork() or threads in your code, you probably know what I mean with complex: you have to realize the same code gets passed twice in two different contexts. A change in the code affects both passes, but luckily, our setup isn’t as complex as multi-threading or multi-process programs :)

So again: we take the original request, we modify it and push our verification server. The result headers we get back gets copied inside the original request and we let Varnish “restart” this original request. Sounds complex? We thought so too, which is why i tried to explain it in this blogpost.

Please open another screen and get the Varnish flow diagram out so you can follow along:

backend default {
    .host = "api.example.org";
    .port = "80";
}

backend oauth {
    .host = "oauth.example.internal";
    .port = "80";
}

Here we define the two different backends: the API and our authentication server. Off course, these can be pools of servers if you like.

Our first entrypoint in Varnish is the vcl_recv() method:

sub vcl_recv {
    if (req.restarts == 1) {
        set req.backend = default;
        set req.http.host = "api.example.org";

        return(lookup);
    }

This small block does a lookup if we are on a restart. Since we hijacked the request, we must make sure that we correct our backend (we also reset our http host because we use namebased virtualhosts.

    unset req.http.x-api-user;
    unset req.http.x-api-context;
    unset req.http.x-restart;

If we aren’t on a restart, we unset some standard headers. This to make sure nobody can spoof any headers and bypass security.

    if (req.url ~ "^/oauth/") {
        set req.backend = oauth;
        set req.http.host = "oauth.example.internal";
        set req.url = regsub(req.url, "^/oauth/", "/");

        return(pipe);
    }

If the request URL starts with /oauth/, it’s a request to the authentication server (for instance: /oauth/access, or /oauth/request) in order to get a access token. In this case we change the backend to our Oauth server pool, set the host (we’re still namebased), AND we change the URL so when a client uses /oauth/access as the url, it actually gets to the oauth pool as /access. At the end of the block we do a return(pipe), because from this point on we don’t need (or want) Varnish to handle these requests.

    if (req.http.x-auth-token) {
        set req.backend = oauth;
        set req.http.host = "oauth.example.internal";
        return(lookup);
    }

If a x-auth-token has been set, we assume for now that somebody has made a request to the API. We must check if this token is valid or not, but we handle that later on. For now, we change our backend and return with a lookup.

    error 401 "Not Authorized";
}

At the end of vcl_recv(), we return a 401 status. This is a last resort fallback which should not happen, but if it does, at least we know that we can’t get into the API without proper validation (remember that we take care of the authentication and we assume everybody is authenticated properly once they enter the API).

After the vcl_fetch, Varnish will generate a hash through vcl_hash:

sub vcl_hash {
    if (req.http.x-auth-token && req.backend == oauth) {
        hash_data("TOKEN " + req.http.x-auth-token);
        return(hash);
    }

If we have set a x-auth-token and the backend is oauth (which should always be the case, but again, a sanity check), we change the way we hash. We actually use a special hash function that returns a hash of our token. Everytime we need to validate a request with a token, we use the token as a hash. This is a hash for our “hijacked” request, not our “main” request.

if (req.http.x-api-user) {
    hash_data(req.http.x-api-user);
    hash_data(req.http.x-api-context);
}

This part is the hashing part for when an authentication user wants to access an API page. Normally, pages reqeusts with OAuth cannot really be cached since there is an authorization header which much be adhered and a page for me can display other information as for you while still using the same URL. At this point we actually store the context and header in the hash. As you can see, we DONT issue a return(hash), meaning the original vcl_hash gets called as well, which will hash the url and the host (or the ip), resulting in a unique url-per-person-context hash / cache.

From the vcl_hash() we have two possible outcomes: either an object with the same hash is found in the Varnish cache, or not. In the first case it will continue with vcl_hit(), if it can’t find anything, vcl_miss() will be called. Let’s start with vcl_miss():

sub vcl_miss {
    if (req.http.x-auth-token && req.backend == oauth) {
        set bereq.url = "/checktoken.php";
        set bereq.request = "HEAD";
    }
}

Pretty much the same if-condition which we use in almost all functions to distinguish our real request and the “hijacked” request. From this point on we now know that A) a user has requested a API page, B) a token was given by the user and C) this original request hasn’t been cached.

From this point we need to check with the authorization server to find out if the token is valid. Here is where our hijacking takes place: instead of the original request, we issue a HEAD request to /checktoken.php on the backend. Remember that we set our destination to the backend in the vcl_recv() before.

The only thing this checktoken.php file needs is the actual token (which is given in the header, remember we’re passing the original request, but with some modification), but we don’t need to the complete data. We change the http request method to HEAD, but just using GET is possible to. We are only interested in the headers of the backend response, not its body. (At this point it would be useful if we could actually strip the body content of the backend request, but we haven’t found anything that makes it possible to do this). I doubt the HEAD request will make a substantial difference.

A “normal” request would just pass through vcl_miss() and does it’s default thing.

Now, before I can explain the vcl_hit(), let’s just follow the path after a vcl_miss(). After the miss, we have backend request ready which gets fired and varnish will return the HEADERS of the backend request and saves it into a backend response object. After that it will call the vcl_fetch() method (so note that this happens AFTER the actual fetch).

sub vcl_fetch {
    if (req.http.x-auth-token && req.backend == oauth) {
        if (beresp.status != 200) {
            error 401 "Not Authorized";
        }

Yes, it’s getting boring, but the same if-condition again to check if it was a backend request to the oauth server. However, we check here if the response we got back from our backend request (to checktoken.php, remember we set this is the vcl_miss()), is a 200 request. If it’s not (either a 401, 404, or something completely else), we let varnish return a 401 status to the client. Something went wrong and we weren’t able to process and check the token. We deny access to be on the safe side.

        set req.http.x-api-user = beresp.http.x-api-user;
        set req.http.x-api-context = beresp.http.x-api-context;

        set req.http.x-restart = "1";

        return(deliver);
    }

But if we do get a 200 response, the checktoken.php also returns 2 additional headers as well: a x-api-user and x-api-context. These are the user and the context for this token and which are the actual headers the API can work with. Pretty much we “exchanged” our token for our login information. We store this information in the original request.

Also, an additional header called x-restart will be set, which we will explain later. At the end of the conditional block, we do a return(deliver) to bypass any default implementation of varnish in the vcl_fetch(). And again, if we have a original request, it will never reach the block and will do a default vcl_fetch anyway.

sub vcl_deliver {
    if (req.http.x-restart) {
        unset req.http.x-restart;
        return(restart);
    }
}

Between fetch() and deliver, Varnish will actually fetch and stream the rest of the backend request, but we still haven’t moved it to the client yet. Inside the vcl_deliver() we check if a header called x-restart exists, and if so, we unset this header, and restart varnish. This pretty much means that Varnish will go back to vcl_recv() again, but with the modified request (in our case, we have modified the request with our x-api-user and x-api-context. Also things like which backend to use has been modified. If you look back in the vcl_recv(), you see that a request that has a restart count of 1 is handled differently, it will set the backend to default (because we already changed it for our hijacked request), and we do a return(lookup). Pretty much it skips all the other checks and goes directly to vcl_hash etc, but because the special “req.http.x-auth-token && req.backend == oauth” condition never applies here now, it will do it’s normal flow.

If you have studied the varnish flow chart, or if you happen to know a lot about varnish, you might wonder why we set a x-restart header flag and check for it in vcl_deliver, when in fact you can also do a return(restart) inside the vcl_fetch. This is true, however, between the vcl_fetch and vcl_deliver, the actual backend response gets cached. If we do a restart in vcl_fetch, this response never gets cached. That’s why we need to wait until vcl_deliver to do the actual restart.

The last method we haven’t discussed is the vcl_hit(). This is called when we have found a request that actually was cached.

sub vcl_hit {
    if (req.http.x-auth-token && req.backend == oauth) {
        set req.http.x-api-user = obj.http.x-api-user;
        set req.http.x-api-context = obj.http.x-api-context;

        set req.http.x-restart = "1";
    }
}

Remember: if we hijack the request we also use the access-token as the hash. Meaning that if we got a hit AND the condition is true, we know it’s a hit on the checktoken.php file, which returns if the token the client supplied is valid or not. At this point, we DO NOT PHYSICALLY CALL the oauth server, but use the cached response. This means that our cached object has got the headers we need as well: we copy the x-api headers from the cached obj into the actual request and we set the x-restart flag. vcl_hit() will then go directly to vcl_deliver (it CAN do a pass, but we don’t use passes during our hijack calls). vcl_deliver() sees the x-restart flag and will restart the request with the correct x-api-headers.

Conclusion:

In the end we have managed to following:

100% offload authentication/authorization requests to another system.
Added the possibility to cache API calls per user/context.
Have a separate access token check.
Have these checks cached in varnish, meaning it’s pretty damn hard to do access checks faster.
Have the checktoken decide on how long to cache tokens (10 seconds, 5 minutes etc). But be careful: if a access token needs to be revoked, you have to wait until the cache expires OR clear the varnish cache.

If you want to use or check this concept: a simple gist can be found here: https://gist.github.com/3062338

We’re still in the proof-of-concept phase for this workflow, but it looks pretty solid and fast. We still need to check performance but I’d reckon we can get have one single authorization machine deal with loads of API servers without problem.

Now, it MIGHT be possible that all of this can done a lot easier with another system. If that’s the case, please do tell! I don’t mind spending my time figuring out the nitty-gritty of Varnish, but easier system are much more maintainable in the end. I don’t think this system is too complex, but it can be a lot to grasp for somebody who never deals with systems like this. If you implement such a system, or find (security) flaws or fixes, please don’t hesitate to mail..

« Symfony2: Implementing ACL rules in your Data Fixtures MultiParamConverter for Symfony2 »