Snowsuit Zine // issue 02

Table of Contents

  • Teams Are Limited By Their Competence, Not Their Technology
  • Articulations
  • Operational Intelligence: Intuiting Availability
  • The Experience Of Being Creative
  • Monthly Consumption

Teams Are Limited By Their Competence, Not Their Technology

Loosely defined, the operational capacity of a team is the most complicated system it can stably deploy.

For complex engineering efforts, finding a team with a high operational capacity is often the dominant cost. Without such a team, there is no hope of scaling the technology. Hence, the central question of modern software development is: can this specific team deploy a system that addresses the problem at hand?

You can't get there from here

Somehow, this is not the question that usually gets asked. Almost always, the question is something like, “Should I use this particular language or framework to solve this particular problem?” The answers almost always amount to “whatever makes you most productive.”

But the truth is that this sort of question reveals a fundamental misunderstanding of how technical risk should be assessed. Instead of asking what language should be used to solve a problem, a better approach is to rank the product's technical decisions by risk, and then assess the options in terms of the operational capacity of both the current team and the people who can be hired.

For example, in many cases, a company's primary datastore is the riskiest place to try new technology. If a team chooses something like MongoDB over something like Postgres, the operational capacity of the best engineer the team can possibly hire will be much lower, and so will that of the average engineer, because the largest deployed MongoDB clusters don't rival the largest Postgres deployments. This is a bad path to go down if the database is supposed to hold customer data. On the other hand, language tends to be a bit less risky: the average and maximum operational capacity of a C++ programmer might not be that much smaller than, say, that of an OCaml or C# programmer.

Another consideration is that it is often not possible to hire enough people to reach the operational capacity needed to build a viable product.

A good example of this problem is Bing, which in some cases has better technology than Google (for example, the COSMOS paper is much richer than the MapReduce paper) but employs a fraction of the number of search relevance engineers. The fact of the matter is that most of the best search engineers work for Google and will never move, so there is no way for Bing to win in this environment. It will always be trying to catch up to Google's features with a dramatically smaller workforce.

A complexity you can’t escape

One of the unfortunate realities of large software projects is that, invariably, some critical parts of the system will be handled entirely by a third-party library or service. The alternative is to re-invent critical system components, which is usually riskier.

In practice, it is impossible for most complex software projects to avoid this. To some extent, this means that the solution to many hard problems depends on the team's ability to stably deploy third-party software. Sometimes this is trivial, but for complex problems it is often difficult. ZooKeeper, for example, is notorious for being a useful but tricky system to deploy, even for tasks that it is arguably well-suited to.

In general, the most effective mitigation seems to be to use only a few flexible infrastructure projects in the backend: once the team can deploy those systems, it can redeploy them for everything. This keeps the stack easy to reason about and lets the team capitalize on previous experience deploying the same infrastructure.

Conclusions

The success of a project is dependent on asking the right questions. It is important to have the tools to appraise (1) which decisions are risky, (2) which solutions help solve the problems at hand, and (3) how to mitigate threats to operational capacity.

Frameworks are cheap to write. Code is replaceable. These are not worth heavy debate because they will not decide success or failure. Making decisions that in the long term will help you keep the system within your operational capacity will.

Articulations

Operational Intelligence: Intuiting Availability

Over the last decade, the trend in computing has been a push toward centralized services. For many people, their email client is their web browser. Instead of downloading movies and music, they stream them over the Internet. As activities consolidate around fewer providers, the availability of those services becomes more important.

The rise in importance of availability has affected organizations from the outside in. The idea of a service being operationally friendly has only recently become popular. The proliferation of DevOps is one result: it is an organizational solution, something relatively easy to implement with existing staff. But anyone who will be involved in the operations of a system should have a mental model of how availability works and of what to look for in a design.

One popular example of a design shaped directly by availability, and one which has had a profound impact on the technology industry, is the Dynamo (not to be confused with DynamoDB) paper from Amazon. The paper tells the story of how Amazon re-architected its services for high availability after an embarrassing outage. They started from the beginning, asked what availability was desired, and then designed a database around that goal. By defining the problem in terms of availability, Dynamo had significant advantages over the existing solutions. This is in contrast to trying to squeeze more availability out of an existing design.

This article focuses on developing an understanding of the factors that affect availability. How to measure availability is out of scope for such a short article, and which metrics to use depends on the problem one is solving. For example, an Internet service such as Netflix can easily count how much of the traffic that reaches the service is served successfully: once a request hits one of the load balancers, it can be logged. But what if there is an outage that affects the ability to track requests? Or what if the service is only partially available, perhaps with only a subset of the catalog reachable by a user? These problems all have answers, but it is up to the operator to decide how to measure them. For more discussion on how to think about these, see the paper Harvest, Yield, and Scalable Tolerant Systems by Fox and Brewer.

The Theory

The actual availability of a service is ultimately decided by time in production. But a firm grasp of the theoretical aspects provides a guide for real world decisions.

It is hopefully clear that 100% availability is not a realistic design goal. One might get lucky and have nothing go wrong over whatever time scale is being measured, but no component is 100% reliable, which means downtime is a virtual certainty. The goal, then, is to architect a solution to match the desired availability. The table of uptimes below shows what one will commonly design for. The values in the table are calculated by multiplying the total time in a year by the probability of downtime.

Nines     Percent Uptime    Downtime Per Year
1 Nine    90%               36.5 days
2 Nines   99%               3.65 days
3 Nines   99.9%             8.76 hours
4 Nines   99.99%            52.56 minutes
5 Nines   99.999%           5.26 minutes
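
For readers who want to check the arithmetic, here is a minimal sketch in Python (the function name is ours, made up for illustration) that converts an availability figure into downtime, using the same 365-day year the table assumes:

  MINUTES_PER_YEAR = 365 * 24 * 60  # the table above assumes a 365-day year

  def downtime_minutes_per_year(availability):
      # Downtime is the probability of being down times the total time.
      return (1.0 - availability) * MINUTES_PER_YEAR

  # downtime_minutes_per_year(0.999)   -> 525.6 minutes (8.76 hours)
  # downtime_minutes_per_year(0.9999)  -> 52.56 minutes
  # downtime_minutes_per_year(0.99999) -> 5.256 minutes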

Many services are happy with 4 Nines and ecstatic with 5 Nines. Amazon S3, for example, is designed for 4 Nines of availability. How attainable a particular availability is comes down to a mix of the problem itself, the tools at one's disposal, how clever the solution is, and how much one is willing to spend.

While the availability of a service as a whole is what one cares about, a single service is almost always composed of multiple components working together. There are two ways in which those components can be composed together: in serial or in parallel. An example of serial composition is a standard Rails setup: a web server and a database behind it. The web frontend depends on the database. An example of a parallel composition is having multiple web frontends. While all frontends depend on the database, the web frontends are isolated from each other. Losing one web frontend doesn't affect the remaining frontends. Services composed in serial are dependent on each other. Services composed in parallel are independent of each other.

The equations for these two compositions are below. These equations really describe the probability of a component being available, so the availability values are always numbers between 0.0 and 1.0. For example, 4 Nines would be 0.9999.

Composition    Equation
Dependent      A_t = A_s1 * A_s2 * … * A_sN
Independent    A_t = 1 - ((1 - A_s1) * (1 - A_s2) * … * (1 - A_sN))

In English, given N services, the total availability (A_t) when they depend on each other is the product of their individual availabilities. When they are independent of each other, it is one minus the product of their individual unavailabilities.

Two things fall out of the above equations: adding a dependent component will always decrease availability (a number less than one multiplied by another number less than one is an even smaller number), and adding independent components will always increase availability. With that, while it is not easy, the game is at least clear: increasing availability is a matter of how cleverly services can be composed in these two ways.
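
Here is a minimal sketch of the two composition rules in Python; the function names dependent and independent are ours, chosen to match the table above:

  def dependent(*availabilities):
      # Serial composition: the product of the individual availabilities.
      total = 1.0
      for a in availabilities:
          total *= a
      return total

  def independent(*availabilities):
      # Parallel composition: one minus the product of the unavailabilities.
      unavailability = 1.0
      for a in availabilities:
          unavailability *= (1.0 - a)
      return 1.0 - unavailability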

Let's go back to the simple webserver example from above, but with an additional component: the Internet. In this case, the Internet means the cables and equipment that connect the datacenter the service is in to the rest of the Internet. Each component has been assigned a (made up) availability of 3 Nines, or 8.76 hours of downtime per year. In reality, the Internet connection would have an availability better than 3 Nines, but this keeps the example simple. The service is diagrammed below.

Internet               Frontend             Database
+---------+           +---------+          +---------+
|  0.999  | <-------> |  0.999  | <------> |  0.999  |
+---------+           +---------+          +---------+

In this service, all of the components depend on each other, so the total availability is the product of each component's availability.

Table 1: Base availability
Component       Availability                            Downtime per year
A_internet      0.999                                   8.76 hours
A_frontend      0.999                                   8.76 hours
A_database      0.999                                   8.76 hours
A_t             A_internet * A_frontend * A_database
A_t             0.999 * 0.999 * 0.999
A_t             0.997                                   1.08 days
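
As a quick sanity check, the same calculation with the sketch functions from earlier:

  a_total = dependent(0.999, 0.999, 0.999)    # 0.997002999
  downtime_minutes_per_year(a_total)          # ~1575 minutes, a bit over a day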

Almost nine hours of downtime per component is not great, but for a new system it might be an expected lower bound. What is interesting is how the individual values come together. It is not until the relationship between the components is taken into account that the actual cost becomes clear. Because the components depend on each other, the system as a whole could be down for over a day per year. While the downtime is theoretical, it represents the expected downtime over the long-term life of the service.

There are two options for improving the availability of a service: increase the availability of each individual component, or add more independent components.

Taking the first approach, imagine the availability of the web frontend and the database can each be increased by another 9 (the datacenter's Internet connection is not under the control of the application developers and operators).

Table 2: Availability with better components
Component       Availability                            Downtime per year
A_internet      0.999                                   8.76 hours
A_frontend      0.9999                                  52.56 minutes
A_database      0.9999                                  52.56 minutes
A_t             A_internet * A_frontend * A_database
A_t             0.999 * 0.9999 * 0.9999
A_t             0.9988                                  10.5 hours

That looks a lot better, but it is only about a 2.5x improvement, and two components needed a 10x reduction in downtime to get there. That is an order of magnitude more effort than the result. And even if the frontend and database could be engineered to near 100% uptime, it would be a fool's errand: this architecture can never do better than 8.76 hours of downtime because it is bounded by the connection to the Internet.
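
The bound is easy to see with the sketch functions: even a hypothetically perfect frontend and database cannot lift the serial composition above the Internet connection's availability.

  dependent(0.999, 1.0, 1.0)    # 0.999 -- still 8.76 hours of downtime per year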

Making each component better is clearly a viable strategy for improving availability. Companies like Joyent are investing a lot in improving individual components. But improving each component significantly has diminishing returns. The other approach is to add independent components.

By architecting a service such that completely independent instances of it can be run in different datacenters, the limiting factor of the datacenter's Internet connection can be alleviated.

Table 3: Availability with multiple datacenters
Component       Availability                                      Downtime per year
A_datacenter1   0.997                                             1.08 days
A_datacenter2   0.997                                             1.08 days
A_t             1 - ((1 - A_datacenter1) * (1 - A_datacenter2))
A_t             1 - ((1 - 0.997) * (1 - 0.997))
A_t             0.999991                                          4.73 minutes

Without changing the components at all, just by adding another datacenter with the same parts, the service has gone from 1.08 days of downtime to about 4.7 minutes, a better than 300x improvement. Adding a third datacenter brings the expected downtime to under a second per year.
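
The same numbers fall out of the sketch functions from earlier (a third datacenter is included to show the trend):

  a_dc = dependent(0.999, 0.999, 0.999)    # ~0.997 for one datacenter
  independent(a_dc, a_dc)                  # ~0.999991, about 4.7 minutes per year
  independent(a_dc, a_dc, a_dc)            # ~0.99999997, under a second per year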

That is important enough to repeat. The single datacenter setup was embarrassingly poor, but by doing nothing other than adding more of the same, the availability improved dramatically. With three instances of the datacenter, the availability is nearly 100%.

The Practical

The theory sounds nice: just add more independent instances of the service and profit. Unfortunately, reality often rears its ugly head, and it is not that simple.

The value of understanding the math behind availability is being able to put a rough bound on the expected availability of an architecture. But there is a big assumption in the math: that outages of individual components are uncorrelated. In practice, that is often not the case.
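
To see why the assumption matters, here is a made-up example using the sketch functions (the 0.1% figure is invented for illustration): suppose a bad deploy can take both datacenters down at once for about 0.1% of the year, roughly 8.76 hours. That shared failure mode composes serially with the redundant pair and dominates the result.

  redundant_pair = independent(0.997, 0.997)    # ~0.999991 if failures were uncorrelated
  dependent(0.999, redundant_pair)              # ~0.99899, back to roughly 8.8 hours per year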

Anecdotally, the number one cause of outages is human error. It could be operator error, or it could be a bug in the code that was rolled out to both datacenters before it ever presented. The surface area for errors is simply massive. Consider leap seconds, an infrequent event that is generally not explicitly tested. A bug that occurs because of a leap second could have been introduced months or years prior to the actual event. Yet leap seconds are entirely predictable, and testing for them is just a matter of taking the time. For every leap second, there are countless unpredictable events that could interrupt a service.

Humans may be the cause of errors in existing systems, but the problem space itself may also not lend itself to running components completely independently. To achieve availability in such a case, the solution has to become more complex, and complexity feeds back, increasing the chance of human error. No matter how complex the solution, however, it will be built on the same two principles: dependent and independent components.

Conclusion

Whether it is to design a system or to understand a system in operation, having a grasp on the foundations of how systems achieve availability will inform decisions. Time can be saved simply by doing a back-of-the-envelope availability calculation to see if an idea is viable.

Regardless of how one uses the math, the math doesn't lie. Don't listen to the siren song that says downtime is becoming less likely just because it hasn't been experienced recently. Even Twitter, which fought to become stable after a period of only 98% uptime (more than seven days of downtime in a year), still experienced outages in 2014. Building a system without respect for its availability is like tightrope walking without respect for gravity. By the time you realize you've slipped, it might be too late to recover.

The Experience Of Being Creative

Software developers know it is common, if not the norm, to be asked to do something they have never done before. Somehow, solutions appear in their minds: one random thought leads to another, and for some reason that leads to an idea that solves the problem.

Creativity in programming faces a world of black and white truths. It is often obvious when something is not implemented correctly because it simply does not work. Unfortunately for the programmer, there is every reason in the world to think the software is wrong: code does exactly what we tell it to, and saying exactly how something should work is notoriously error prone.

There has been a long-standing discussion in the programming world about the value of various forms of strictness. On the one hand, some languages require very specific statements that can be tricky to write but offer strong guarantees about how the software will behave. On the other hand, some languages feel more like duct taping one idea to another, and do so with the goal of making ideas easier to express.

In a great talk on creativity, John Cleese describes creativity not as a talent but as a way of operating. It cannot be explained, he says, but it can be performed somewhat reliably with some consideration for how it works. Cleese identifies two primary modes of creativity: the open and the closed modes.

The open mode is where ideas get thrown around, free association is encouraged, and there is no such thing as a bad idea. The purpose of the open mode is to get as many ideas up for consideration as possible. One of the challenges of the open mode is creating an environment in which it thrives. To understand what makes a good environment, we can identify a few things that break the open mode: a snide comment, a "well, actually". In general, someone simply has to make a negative contribution of some kind and the open mode will end.

The closed mode is the detail-oriented, perfectionist side of being creative. Its purpose is to evaluate everything the open mode found: form opinions, sort the good from the bad, and find all the errors wherever they may be. Declaring something incorrect is both fine and encouraged during the closed mode. It is the time to put together the best of the ideas and nudge them toward perfection as much as possible.

Pondering a problem requires moving between the open mode and the closed mode. The open mode produces a lot of possibilities; the closed mode then narrows the focus. It is then best to switch back to the open mode to receive feedback, and then again to the closed mode to consider it all. And there, something is being created.

Many things can work against creativity. The open mode's big flaw is that it can be closed so easily. The big flaw of the closed mode is that it has enormous power to stifle productivity. An example is when some code is never quite good enough.

Numerous studies have reported that humans are notoriously bad at managing their time, and programmers are no exception. Every programmer has faced a time when they were too optimistic about how long something would take. Maybe a single component took twice as long as expected and, being new to programming, they didn't budget extra time for absorbing mistakes; budgeting that slack is par for the course for more experienced programmers. Consider that programmers are often asked to do things they have never done before and must rely on their creativity to somehow figure it all out.

Time must be made for the oscillation between the open and closed modes. Someone on a creative endeavor must essentially think long enough to find a great idea. A simple test for knowing whether one has thought long enough is whether an idea inspires them to action. Some folks build too soon, others never quite get around to it, and somewhere in between is the right mixture of discovery and implementation.

An environment must be available for one to be in the open mode long enough to find the great ideas. This environment should be as free of distractions as possible. Cleese captures this sentiment nicely by pointing out that it's easier to do trivial things that are urgent than it is to do important things that are not urgent; and it is easier to do little things we know we can do than it is to do big things we're not so sure about.

Most of the discussion so far has been about creativity with some programming details sprinkled here and there. If we go further into programming, we can find the natures of open and closed thinking across the language spectrum. At one end are languages that are very forgiving of mistakes and sometimes lead to conditions that make no sense, yet sort of do something.

Along the spectrum of specificity and error handling is a tendency towards being in the closed mode. A strong type system will say you're wrong more often than no type system. From the point of view of networks, complication breeds friction. Understanding Haskell requires more upfront knowledge than Python. In other words, languages that promote the open mode also tend to be less rigorous.

Creativity is not a talent. It is a process that requires awareness of the modes and lots of time. Consider that if you go straight from idea to typing, you have skipped the open mode. And if you edit a section over and over and over, you are stuck in an infinite closed-mode loop. The most productive programmers, the mythical 10x programmers, don't type 10x faster than everyone else; they are more creative and find better solutions.

Monthly Consumption

Books

  • The Checklist Manifesto: How to Get Things Right by Atul Gawande
  • Hatching Twitter: A True Story of Money, Power, Friendship, and Betrayal by Nick Bilton
  • Siddhartha by Hermann Hesse

Papers

  • Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center by Hindman et al.