For a first pass, I wanted to talk about some of the high-level reasons Erlang was chosen for KAZOO back in 2010. Even if you aren't technical I hope you'll find some value in reading this.

Some Erlang History

Erlang was developed at Ericsson, starting in 1986 and was open-sourced in 1998. It was designed to improve the development of telephony applications within Ericsson.

Erlang can be broken up into three main parts:

  1. Erlang the language - a small language with relatively few rules and syntax (compared to other modern languages)
  2. The Open Telecom Platform (OTP) - a collection of middleware, libraries, and tools
  3. The Erlang virtual machine (BEAM) - executes the instructions of compiled Erlang (similar to the JVM if you're familiar).

The features of the language and OTP, combined with running the BEAM, provide a powerful platform on which to build systems that require fault-tolerance and resiliency in the face of errors.

Some KAZOO/2600Hz History

The primary experiences that informed what is now KAZOO were born out of Darren and Karl's experiences running telecom infrastructure as well as the broader community's experiences. As 2600Hz consulted with more and more companies for their telecom infrastructure, patterns and common pain points emerged for how people were building these systems. KAZOO is designed to address these commonalities while also exposing functionality that lets operators and resellers provide real value-add features. At the end of the day, most of telecom is rote, routine, and commoditized - connecting two phones or providing voicemail won't win the hearts and minds of customers.

Under the hood

What customers and even management of service providers may not understand or appreciate is all the complexity and difficulty in:

  • writing software the handles errors gracefully
  • scaling said software across multiple servers
  • scaling said cluster of servers across multiple data centers

Single-server software

Most software (telecom included) is written to be run on a single server. As the load the server is handling increases, the software's ability to cope decreases, eventually hitting the limits of the hardware itself.

Imagine a growing town with only one grocery store and no way for residents to drive to other towns - those aisles are going to be packed with people and wait times in checkout lines will be horrendous (the store can only fit so many clerks and registers).

When things go wrong in the store (dropped eggs on the floor, for instance) it will take longer to fix and unrelated customers will be impacted. The store is not handling errors gracefully.

Erlang provides a way to manage this complexity and keep individual customers (processes) isolated from each other. Now one customer's dropped eggs won't impact anyone else! One customer's large order doesn't have to block all the other customers behind them; registers can be opened and closed dynamically to fit demand (still subject to the store's space constraints of course - we're not breaking the laws of physics here!).

But, software on a single server is a single point of failure (SPOF)! Any number of events, accidental, expected, or malicious, can make a server unreachable.

Adding servers

At a certain point it becomes painfully obvious a new store is needed. The conventional approach is to add a new server (grocery store) and direct work (customers) to one or the other (this is typically a server running a load balancer). Load balancers introduce new complexity as units of work (customers) aren't identical.

Balancing work

The options are to use a "dumb" load-sharing algorithm like round-robin or to create a "smart" load balancer that has to communicate with the underlying servers to know which server has capacity to accept more work.

The big drawback to round-robin and similar strategies is unbalanced load sharing (one grocery store gets all the needy customers and gets bogged down while the other store gets all the "just need eggs" customers who are in and out, and the store stays mostly empty).

Of course, with the "smart" option, those smarts need to be built by somebody and now you have a third server communicating (with all the nastiness of distributed systems that entails).

Coordinating work

Another problem with software that isn't designed for running across servers is that they don't coordinate between each other. In the grocery analogy, if the butcher at store 1 knows a customer comes in around 9:00 to pick up a particular cut of meat, the butcher can have it ready (pre-warming a cache) for that customer, leading to a faster, more pleasant experience for the customer (phone call is connected faster, perhaps). If for some reason the customer deviates and goes to store #2, the service will be sub-optimal compared to previous visits.

Software typically gets around this by using a shared database (generally on a separate machine). Except now you have a SPOF again! If the database server can't be reached, the servers won't be able to function either. So now you need a second database server and some redundancy.

Erlang makes inter-server communication easier

Erlang provides the programmer with tools that make talking to other servers' processes (running a copy of the BEAM) as easy and transparent as talking to the local BEAM and its processes. If a server becomes unreachable, the existing server can be setup to receive these events and react accordingly. In the analogy, the butchers at both stores can communicate about the customer's needs and both be prepared regardless of which store the customer chooses.

Adding data centers

Now that the software has a shared database and the SPOFs have been addressed, the software needs to consider communicating over a wider area network (WAN) - this is commonly the public Internet. See, developers make assumptions about how computers talk to each other - most just assume the packets from one computer will arrive at their destination in a quick and orderly fashion. Software is then written on these false assumptions. And while the software is running in a single data center (typically with pretty good networking cables and equipment) these assumptions seem to hold true.

But even data centers aren't impervious to becoming unreachable. A utility company accidentally cuts a fiber cable, a hurricane floods the data center, the bandwidth provider has an outage - these events happen more than most people might assume! The software then breaks in unexpected ways and becomes a nightmare to figure out why.

Practical concerns

Now that the software must communicate over less reliable networks (inter-data center, public Internet) and over circuits that charge for usage (generally intra-network communications are free or heavily discounted), new issues on the technical (increased latency, bandwidth saturation, packet loss / re-transmission) and business (higher bandwidth costs, data center build-outs, techs local to the new data centers maybe?) become a new reality.

For instance, servers on the local network may be programmed to sync their state (their individual view of the world) periodically to maintain a mostly coherent view. Depending on the system, this could result in many extra mega-, giga-, or even terabytes of bandwidth used, just to sync! (This isn't even counting the bandwidth used to process actual work). That data doesn't get transferred instantaneously nor for free (pricing on Amazon EC2 can start at $0.09/GB). Imagine a phone call being delayed or choppy audio on existing calls because the servers are hogging the bandwidth to sync their state!

Technical concerns

There are whole swaths of computer science dedicated to distributed computing and all the fun that comes with unreliable communication. Sadly, it probably isn't likely that the majority of programmers have dealt with these types of systems or have the tooling and infrastructure in place to build systems that handle these distributed computing problems.

Erlang itself is no panacea either; the default communication mechanisms do not work great over unreliable networks. However, several projects are addressing the gap including Lasp, Partisan and Riak process groups among others. Other technologies can be used too (like RabbitMQ as Kazoo does) to help with this.

To use these infrastructure components with software that wasn't designed for it requires a large investment to rewrite the software, or, more commonly, the software will try to emulate the functionality within the software itself (and often quite poorly).

The problem with most telecom infrastructure

Simply put, it was not designed to operate across geographically-distributed networks and does not take into account the errors and issues inherent in communicating over unreliable networks. Those features are bolted on, typically under pressure from management and by an inexperienced team in distributed systems development, and as a result are often poorly implemented and riddled with bugs and inefficiencies.

However, for better or worse, that software works for the most part because the underlying infrastructure is robust - this breeds a false sense of security, and marketing teams start throwing around "reliability" and "redundancy" to describe their company's offerings.

When things do go wrong (and as Murphy's Law shows, they always do) however, the price is paid many times over. "An ounce of prevention is worth a pound of cure" and all that.

So Why Erlang?

Given that things do go wrong and that even the smartest team can't anticipate all errors, it stands to reason that the work is to handle known and unknown errors gracefully (fault tolerance), starting at the lowest level.

Erlang has a 'Let it crash' philosophy related to units of work - if the data is wrong, the external resource is unavailable, whatever, crash the worker (stop it) and let a supervising process handle the crash. This could be restarting a new worker, bubbling an error up to the API response, logging the crash and moving on, or any number of strategies that are required. The value is that Erlang and OTP put these concerns front and center and give the programmer the tools to handle them in an orderly fashion.

Because it is trivial to start and stop workers, the code's architecture benefits tremendously from separating the "work" from the "management" and "supervision" of the work. As a consequence, the code becomes smaller and more concise (in Erlang, one would say "program the happy case") which makes reasoning easier, testing easier, and should generally produce better code faster.

Wrapping Up

Erlang provides a language (Erlang), a framework (OTP), and a runtime (BEAM) for building fault-tolerant, resilient systems. Embracing the 'Let it crash' philosophy yields code that handles errors more gracefully, allowing the system as a whole to remain stable in the presence of these errors.

From this foundation, highly-available, fault-tolerant systems are built.

In future articles I'll take a deeper dive at some of the architectural decisions KAZOO makes that build on the bedrock of Erlang and how those decisions impact a KAZOO cluster's ability to operate in the face of errors at all levels, from the individual processes running code to the data centers disappearing due to hurricane flooding.

Finally, I hope when marketing teams put out the buzzwords of what their platforms are, the skeptic in all of us will say "Sure, prove it. Show me the architecture you've built that allows that". Because it isn't easy to do, is even harder to get right, and harder still to scale it out. Let's pay attention to the man behind the curtain and stop listening to the great and wonderful Oz!