Introducing ElasticDNS

ElasticDNS is an intelligent DNS server available as part of ContainerNet (alongside DHCP and the Weave network) to all Cloud 66 Docker stacks. It provides seamless service discovery for containers.

What is Service Discovery?

Containers allow applications to run as multiple services. This is usually called the micro-services architecture. The roots of the micro-services architecture are in Domain Driven Design, where "Services" provide their service to the rest of the application through public contracts. This detangles monolithic applications into separate and more manageable parts. These services can then be developed independently and deployed separately at different life cycles, allowing much greater flexibility and better developer team structure. Remember the rule of developer team size that says you should be able to feed your team with a maximum of two pizzas? Well, this helps a lot with that!

However, breaking your application into multiple services brings its own issues: how are these different parts going to talk to each other? One way is to have fixed addresses for each service, so HOST1 always runs SERVICE1 and HOST2 always runs SERVICE2. But it is clear that this is far from ideal. What if there is a failure on one HOST? What about scalability? How are we going to move these services around to different hosts?

One answer to this problem is Service Discovery. Service Discovery is a way to allow services to find each other automatically.

Different ways of Service Discovery

I'm sure you can imagine many different ways to build a Discovery service into your application. You can build a "service registry" so each service registers itself with the registry and each caller asks the registry for the service it requires. This has the advantage of being simple but can potentially introduce a single point of failure to your system (the registry itself).
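To make the registry idea concrete, here is a minimal in-memory sketch. The class name, method names and addresses are all made up for illustration; a real registry would live in its own process and would need health checks and replication to soften the single point of failure mentioned above.

```python
from collections import defaultdict


class ServiceRegistry:
    """Hypothetical in-memory registry: service name -> list of addresses."""

    def __init__(self):
        self._instances = defaultdict(list)

    def register(self, service, address):
        # Each service instance announces itself when it starts.
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service, address):
        # Called when an instance shuts down or is found to be unhealthy.
        if address in self._instances[service]:
            self._instances[service].remove(address)

    def lookup(self, service):
        # Callers ask the registry where a service currently lives.
        return list(self._instances.get(service, []))


registry = ServiceRegistry()
registry.register("api", "10.0.0.5:8080")
registry.register("api", "10.0.0.6:8080")
print(registry.lookup("api"))
```

Notice that every caller has to know how to talk to the registry, which is exactly the kind of code change discussed later in this post.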

Another way is convention-based discovery (based on naming conventions), which can be very elegant but might require great foresight as to where your application is heading. Not very lean!

You can also build a peer-to-peer system without a single point of failure to act as a collective register.

The problem of code change

There are many different technologies to help you with each one of these solutions. Protocols like Raft make building peer-to-peer, consensus-based systems possible, and stable key-value stores like Redis allow you to build central registries.

One problem, however, is always there: code change. Ideally you don't want to change your application code to work with your discovery service.

The oldest discovery service out there

OK. This title might not be historically accurate but I think it's close enough to be a good one! We all know and use DNS every day. The premise is simple: you give it a name and it gives you an address. That's what we wanted, right?

By using DNS as service discovery, we don't need any client libraries, APIs or other code changes to find out the address of a service. We simply point our service to an A record like myservice.domain.com and we get its IP address back. DNS servers can even do things like round-robin rotation, so different callers get different endpoints, allowing some sort of load balancing.
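As a rough illustration of how little the caller needs, the sketch below resolves a name using only Python's standard library. The helper name resolve_all is made up for this example, and myservice.domain.com is just the placeholder name used above, so it is left commented out.

```python
import socket


def resolve_all(hostname, port=80):
    """Return the distinct IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# With round-robin DNS, the order of the returned addresses may differ
# between calls, so picking the first one spreads callers across endpoints.
# endpoints = resolve_all("myservice.domain.com")
```

No discovery-specific library is involved: the application asks the operating system's resolver, as it would for any other hostname.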

Building a central DNS service

Our first iteration of service discovery for Cloud 66 was based on etcd. Etcd is a great open source tool that acts as a distributed service registry, but it had the following issues:

Clients need to change their code to use it.
The registry needs to be updated on the client side whenever services change (for example, after a deployment or after scaling up or down). These changes happen centrally, but the etcd cluster needs to be updated on each server.
etcd (so far) doesn't cope very well with a node going down. When a node is down, the client is responsible for finding the next node and repeating the request. This makes the client code even more complicated.

All of these reasons meant that we had to start from scratch and build a different solution. We had the following requirements:

Simple to use. No code change
Fault tolerant
Centrally updated

This resulted in ElasticDNS. ElasticDNS is a centrally controlled DNS server with DNS clients on every server.

ElasticDNS consists of two parts: a small client and a central service. The small client (written in Go and itself running in a container) has a DNS server and a local cache. It serves DNS queries ending with .cloud66.local by querying the central service and caching the results for their TTL duration. The connection between the client and the central service is based on ZeroMQ subscriptions, allowing the client to get the latest updates and invalidate its cache. Data crossing between the client and server is encrypted and serialised with Protobuf, for higher compatibility with future clients while keeping the payload efficiently small. The client can tolerate temporary network issues by retrying and by serving data from its local cache.

The central service is built from multiple stateless servers behind load balancers. They all act as gateways to a cluster of Redis servers storing the DNS records. This provides resiliency and fault tolerance at different layers of the central system while avoiding the serving of stale data, which can happen with consensus-based systems.
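The caching behaviour of the client can be sketched in a few lines. This is not the actual ElasticDNS client (which is written in Go); it is a simplified Python model in which query_central is a hypothetical function standing in for the call to the central service, and invalidate stands in for an update pushed over the subscription.

```python
import time


class CachingResolver:
    """Simplified model of a local DNS cache with TTL and pushed invalidation."""

    def __init__(self, query_central, clock=time.monotonic):
        # query_central(name) -> (addresses, ttl_seconds); hypothetical upstream call.
        self._query_central = query_central
        self._clock = clock
        self._cache = {}  # name -> (addresses, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: answer from the local cache
        try:
            addresses, ttl = self._query_central(name)
        except OSError:
            # Central service unreachable: fall back to the last known answer.
            if entry is not None:
                return entry[0]
            raise
        self._cache[name] = (addresses, now + ttl)
        return addresses

    def invalidate(self, name):
        # Called when the central service announces that a record changed.
        self._cache.pop(name, None)
```

Retries are left out to keep the sketch short. The point is that a fresh entry never leaves the host, an expired entry triggers one upstream query, and a pushed update forces the next lookup to go upstream.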

Versioned Service Discovery

ElasticDNS is primarily a discovery service. Clients can use addresses like myservice.cloud66.local to get the address of a container running myservice. But since ElasticDNS is backed centrally, it also knows about the caller, so it can do much more than act as a dumb DNS server. Knowing who is asking for the address of a service is important when you have multiple versions of your app in-flight.

Imagine you have an app consisting of two services: a web service (accessible externally) and an api service (used internally by the web service). Every time you deploy your application, a new version of both services is rolled out to your servers.

When a deployment happens, the load balancer for the externally available services (like web in this case) is instructed to switch new traffic to the new containers while still serving the existing traffic with the old containers. So if a visitor to your site is in the middle of a large file upload, the upload is not going to be interrupted. At this point you have two versions of your web service and two versions of your api service up and running.

ElasticDNS is clever enough to know which version of the app is running in a container. So if an old web container asks for api.cloud66.local, it will get the address of an old api container, but if a new web container asks for the same thing, it will get the address of a new api container.
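A toy model of this caller-aware lookup might look like the following. It assumes the caller is identified by its source IP address and that each container is tagged with a version label; both are assumptions made for this sketch, not a description of how ElasticDNS stores its records.

```python
class VersionedRecords:
    """Toy model: answer a lookup with containers of the caller's own version."""

    def __init__(self):
        self._containers = {}  # container IP -> (service name, version label)

    def add_container(self, ip, service, version):
        self._containers[ip] = (service, version)

    def resolve(self, caller_ip, service):
        caller = self._containers.get(caller_ip)
        if caller is None:
            return []  # unknown caller: nothing to match against in this sketch
        _caller_service, caller_version = caller
        return [
            ip
            for ip, (svc, version) in self._containers.items()
            if svc == service and version == caller_version
        ]


records = VersionedRecords()
records.add_container("10.0.0.1", "web", "v1")  # old web
records.add_container("10.0.0.2", "api", "v1")  # old api
records.add_container("10.0.0.3", "web", "v2")  # new web
records.add_container("10.0.0.4", "api", "v2")  # new api
print(records.resolve("10.0.0.1", "api"))  # the old web container asks for api
print(records.resolve("10.0.0.3", "api"))  # the new web container asks for api
```

Both web containers ask for the same name, yet each is pointed at the api container from its own deployment.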

Self information

Another feature of ElasticDNS is Self Information. The ElasticDNS client also serves a RESTful API to all containers local to its host, so they can ask questions about themselves. This is useful when a container is part of a larger cluster of containers and needs to announce its presence to its peers. Using Self Information, any container can make an HTTP GET call to http://authority.cloud66:5569/whoami and get back a detailed list of information about itself: its Docker and Weave IP addresses, its host IP address, its exposed ports, its container ID and more.

Since the ElasticDNS client is available locally and is caller-aware, the API call is always the same for all containers but will return different answers.
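From inside a container, the call could look something like the sketch below. The URL is the one given above; the assumption that the response body is JSON, and the field name in the commented usage line, are guesses made for illustration only.

```python
import json
import urllib.request

WHOAMI_URL = "http://authority.cloud66:5569/whoami"


def parse_whoami(body):
    # Assumes the endpoint answers with a JSON document (not confirmed here).
    return json.loads(body.decode("utf-8"))


def whoami(url=WHOAMI_URL, timeout=2.0):
    """Ask the local ElasticDNS client about the calling container."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_whoami(response.read())


# Inside a container on a Cloud 66 Docker stack:
# info = whoami()
# print(info.get("container_id"))  # "container_id" is a hypothetical field name
```

Because the endpoint is the same for every container, this snippet can be baked into an image once and reused unchanged across services.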