What happens if Enclave is down?

As described in our how-it-works documentation, the Discovery Service is the only centralised component of Enclave's architecture. It is responsible for enrolling new systems and for computing and distributing policy to all Enclave devices on your network. However, if the Discovery Service goes down, your Enclave network will mostly continue to function.

What will work

Enclave does not route any traffic through the Discovery Service. Most of the time, devices will talk directly to one another. In cases where a direct connection cannot be established, devices will fall back to using one or more traffic relays located in different regions all over the world. That means that in the event that the Discovery Service becomes unavailable, traffic in your Enclave networks will continue to move between your devices.
Devices can continue to communicate with each other until either one of the devices experiences key material expiry, or the network connection between those devices is lost or interrupted (changing from a wired network to a wireless network, for example).

If the Discovery Service is unavailable for long enough, the key material held by each device may expire. The machine keys used to authenticate systems usually have a long lifetime, so if you also use an IdP integration to authenticate users, user identity keys are likely to expire much sooner. Exactly when depends on the device and on when its last user authentication took place, which is likely to differ between devices on the network.
When platform administrators make Policy or configuration changes, those changes are pushed to your Enclave devices. Policy is both cached and enforced locally, meaning that your existing Policies, ACLs and connectivity arrangements will continue to function in the absence of the Discovery Service.

What won't work

On the other hand, without the Discovery Service available:

Any devices which disconnect from one another will be unable to reconnect.
Any devices which attempt to come online won't be able to connect.
Any devices using user identity as part of the authentication requirements to construct tunnels will be unable to renew their keys, meaning that those devices will gradually lose access to each other. However, if the tunnel is authenticated using machine keys only, it's unlikely that those machine keys will expire in a time-frame relevant to a service outage.
Existing Policies and ACLs cannot be updated.
Existing devices cannot be revoked and removed from your account.
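
The effect of key expiry during an outage can be illustrated with a small sketch. This is not Enclave's implementation: the function names and the key lifetimes (one year for machine keys, one hour remaining on a user identity key) are hypothetical values chosen only to show why tunnels that also depend on user identity are likely to drop first.

```python
from datetime import datetime, timedelta
from typing import Optional


def tunnel_usable_until(
    machine_key_expiry: datetime,
    user_key_expiry: Optional[datetime] = None,
) -> datetime:
    """Return the moment the first key authenticating the tunnel expires."""
    if user_key_expiry is None:
        # Tunnel is authenticated by machine keys only.
        return machine_key_expiry
    # Keys cannot be renewed during the outage, so the earliest expiry wins.
    return min(machine_key_expiry, user_key_expiry)


def survives_outage(usable_until: datetime, outage_end: datetime) -> bool:
    """True if the keys outlast the outage, assuming the link is never interrupted."""
    return usable_until >= outage_end


# Hypothetical four-hour outage with illustrative key lifetimes.
outage_start = datetime(2022, 5, 5, 9, 0)
outage_end = outage_start + timedelta(hours=4)

machine_only = tunnel_usable_until(outage_start + timedelta(days=365))
with_user_identity = tunnel_usable_until(
    outage_start + timedelta(days=365),
    user_key_expiry=outage_start + timedelta(hours=1),
)

print(survives_outage(machine_only, outage_end))        # True
print(survives_outage(with_user_identity, outage_end))  # False
```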

How we approach availability

We engineer the Enclave platform to achieve 99.9% availability for our Discovery Service and 99.999% availability for all other SaaS components, such as the management platform https://portal.enclave.io and APIs.

To anybody versed in high-availability system design, 99.9% uptime might not sound highly available. That's up to 8 hours, 45 minutes and 56 seconds of downtime every year, so let's unpick that.
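
The downtime figure above is plain arithmetic on the availability target. A minimal sketch, assuming an average Gregorian year of 365.2425 days (the function name `downtime_budget` is ours, not part of any Enclave tooling):

```python
from datetime import timedelta

# Assumption: an average Gregorian year; a 365-day year gives a slightly smaller figure.
AVERAGE_YEAR = timedelta(days=365.2425)


def downtime_budget(availability_percent: float, period: timedelta = AVERAGE_YEAR) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


print(downtime_budget(99.9))    # Discovery Service target
print(downtime_budget(99.999))  # target for the other SaaS components
```

Under that assumption, 99.9% works out to a little under 8 hours and 46 minutes per year, while 99.999% allows a little over 5 minutes.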

Firstly, Enclave is designed to be a highly available and highly resilient platform. We expect, measure, and exceed 99.9% availability, but there's a difference between designed and achieved uptime.

The central component of the Enclave platform, the Discovery Service, has a unique architecture and a special role in overall availability.

The Enclave Discovery Service is an extremely light-touch component in terms of its impact on connectivity between enrolled systems, and it has relatively infrequent conversations with them. Each enrolled system retains a full cache of the last conversation it had with the Discovery Service and operates with full autonomy if it loses its connection to the Discovery Service. In fact, a brief loss of connection to the Discovery Service has no operational significance for the enrolled system: any established tunnels remain up and online, and the system simply reconnects at the earliest opportunity.
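
The caching behaviour described above can be sketched as follows. This is a simplified illustration, not Enclave's code: the `EnrolledSystem` class, its method names, and the representation of policy as a set of allowed (source, destination) pairs are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# Hypothetical policy shape: the set of allowed (source, destination) pairs.
Policy = Set[Tuple[str, str]]


class EnrolledSystem:
    """Keeps the last policy it received and enforces it locally."""

    def __init__(self) -> None:
        self.cached_policy: Optional[Policy] = None
        self.discovery_connected = False

    def on_policy_pushed(self, policy: Policy) -> None:
        # A push implies the Discovery Service is reachable; replace the cache.
        self.discovery_connected = True
        self.cached_policy = set(policy)

    def on_discovery_lost(self) -> None:
        # The cache is deliberately left untouched when the connection drops.
        self.discovery_connected = False

    def is_allowed(self, source: str, destination: str) -> bool:
        # Enforcement reads only the local cache, never the Discovery Service.
        if self.cached_policy is None:
            return False
        return (source, destination) in self.cached_policy


system = EnrolledSystem()
system.on_policy_pushed({("laptop", "build-server")})
system.on_discovery_lost()

print(system.is_allowed("laptop", "build-server"))  # True
print(system.is_allowed("laptop", "database"))      # False
```

Because enforcement consults only the cached copy, losing the connection changes nothing about which traffic is permitted; it only delays the arrival of new policy.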

This architecture allows us to avoid the diminishing returns of engineering for ever-higher levels of availability and instead embrace the ephemeral role the Discovery Service plays in the Enclave platform architecture. In turn, this allows us to focus on engineering simplicity, which itself produces a platform that is stable, maintainable, predictable, and easier to reason about.

Instead of prioritising uptime as a design goal, we prioritise the lowest possible time-to-recovery in the event of failure. In the spirit of Chaos Monkey, we routinely simulate failure by taking the Discovery Service offline in micro-intervals as part of normal daily operations. This keeps our platform not only resilient to failure, but expectant of it.
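
The link between recovery time and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The failure and recovery figures below are illustrative only and are not measurements of the Enclave platform.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate (roughly one failure a month), different recovery times.
slow_recovery = availability(mtbf_hours=720, mttr_hours=1)       # one hour to recover
fast_recovery = availability(mtbf_hours=720, mttr_hours=1 / 60)  # one minute to recover

print(f"{slow_recovery:.2%}")
print(f"{fast_recovery:.3%}")
```

With these made-up numbers, cutting recovery from an hour to a minute lifts availability from about 99.86% to about 99.998% without making failures any rarer, which is why shortening time-to-recovery can be a more productive target than preventing every failure.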

By exploiting the Enclave architecture and designing our platform to expect failure, we ensure that when failures do happen they have almost no perceptible impact on customer experience or the operation of the platform, beyond slightly increasing latency on administrative transactions that are already eventually consistent, such as pushing updated policy out to enrolled systems.

In summary, the Enclave platform components are designed to achieve 99.9% uptime, but in practice customers can expect Enclave to be one of the most reliable platforms they use.

Last updated May 5, 2022