Here are some subjective opinions I've developed on software design - I'm "putting them out there" to see what other people think of them, and to explain to people who hear my opinions the thinking behind them.

Even the greatest dependencies can go down.

When you depend on another system, you should ask what you'll do if that system goes down. If anyone tells you "it will never go down", they probably don't know what they're talking about. Unless "it" is airliner avionics with three independent versions coded by separate companies - and maybe not even then.

Occasional multi-hour service outages are inevitable. Even Gmail can have outages, and who could have bigger incentives, better practices or better employees than that?

Divide your dependencies into those you can and can't work without, and make the second list as short as possible.

If you're writing a service backed by an SQL database, I don't expect it to keep working if the SQL database is unavailable - but if the service also relies on a mail-sending service, you can probably still serve requests if the mail-sending service has a brief outage.
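As a minimal sketch of what this looks like in code - the names here (create_account, send_welcome_mail, the list standing in for the SQL database) are hypothetical, not from any real API:

```python
# Treating the mail service as a dependency the service can work without:
# the critical write propagates failures, the non-critical mail call doesn't.

def send_welcome_mail(address):
    # Stand-in for a call to an external mail service; imagine it raising
    # ConnectionError during an outage.
    raise ConnectionError("mail service unavailable")

def create_account(address, accounts):
    # The database write is critical - let any failure here propagate.
    accounts.append(address)
    # Mail is non-critical: log and carry on if it fails, so a mail outage
    # doesn't take account creation down with it.
    try:
        send_welcome_mail(address)
    except ConnectionError as exc:
        print(f"WARN: welcome mail to {address} failed: {exc}")
    return True

accounts = []
assert create_account("a@example.com", accounts)
assert accounts == ["a@example.com"]
```

The shape of the try/except is the whole point: only the dependencies on the "can't work without" list are allowed to fail the request.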

Cache everything you can, and where feasible, prefer to retrieve and cache the whole dataset.

For any reasonably small, reasonably stable dataset - less than a few hundred kilobytes, say - you should try to retrieve the whole dataset and cache it in memory. For example, if your software needs a list of EC2 regions, you should retrieve the entire list, cache it, and periodically refresh it.

This isn't just because it'll be faster - although it will be. It's also because, if the service you're getting data from is temporarily unavailable, you'll be able to soldier on with your cached data.

The reason I prefer retrieving all the data over fetching it on demand is simple, predictable behaviour. If you load the entire list of EC2 regions, it's either fully populated or not populated at all. But if you load entries on demand, so that some are cached and others are missing, you can end up with confusing failure behaviour where some requests succeed while others fail. Retrieving everything also gives you a simple way to remove items from the cache when they're removed from the underlying dataset, preventing another possible source of confusing behaviour.
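A minimal sketch of this pattern, assuming a hypothetical fetch_regions function standing in for the real call (e.g. an EC2 DescribeRegions request):

```python
import time

class WholeDatasetCache:
    """Fetch an entire small dataset, refresh it periodically, and
    soldier on with stale data if a refresh fails."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._data = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._data is None or now - self._fetched_at > self._ttl:
            try:
                # All-or-nothing: either the full list replaces the cache
                # (which also drops removed items), or the fetch raises.
                self._data = self._fetch()
                self._fetched_at = now
            except Exception:
                if self._data is None:
                    raise          # no stale copy to fall back on
                # otherwise keep serving the stale copy

        return self._data

def fetch_regions():
    return ["us-east-1", "eu-west-1", "ap-southeast-2"]

cache = WholeDatasetCache(fetch_regions, ttl_seconds=300)
assert "eu-west-1" in cache.get()
```

Because every refresh replaces the whole list, the cache is never half-populated, and items deleted upstream disappear on the next successful refresh.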

But raise a big flag if a dependency is down when your service starts up.

Your production environment should be at least as reliable as your integration testing environment - and it's normal for integration test environments to be much less reliable, what with all the testing going on (if you have a complex system with a highly reliable test environment, I'd be glad to be corrected on this one).

If test environments have nine times as many problems as production environments have, and you launch something in your test environment that reports Amazon's production environment is down, nine times out of ten the problem's going to be on your end. Maybe you've messed up a firewall or proxy or test configuration or security group.
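A startup check that raises that flag might look like this minimal sketch - the probe names are hypothetical, and printing a banner is just one way to make the failure impossible to miss:

```python
def check_dependencies(checks):
    """Run a probe for each named dependency at startup; shout loudly
    about any that fail, and return the list of failures."""
    failures = []
    for name, probe in checks.items():
        try:
            probe()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    if failures:
        banner = "!" * 60
        print(banner)
        print("DEPENDENCY CHECK FAILED AT STARTUP - in a test environment")
        print("this is probably a firewall/proxy/security-group problem:")
        for failure in failures:
            print("  " + failure)
        print(banner)
    return failures

def probe_mail():
    raise ConnectionError("connection refused")

failures = check_dependencies({"sql": lambda: None, "mail": probe_mail})
assert failures == ["mail: connection refused"]
```

Returning the failures rather than exiting leaves the choice of whether to refuse to start to the caller.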

You should make this raise a big red flag in your test environment - something so abundantly clear the people testing can't possibly miss it. You could prevent the service from starting up at all - but that might be inconvenient because of...

Test what you fly, fly what you test

When the thing you deploy to production is different to the thing you tested, the differences will bite you in the ass. If at all possible, the file you use for final testing should be exactly the same file you're going to deploy.

Example: your database includes a "last modified by process" column, and any time your program updates a row, it puts its name and version into that column. The version is compiled into your software, and you do a different build for every environment, with the environment name included in the version string. The previous version wrote ProgramName-1.0-Production, and the version in test writes ProgramName-1.0.1-CIT; both were tested and work fine. But ProgramName-1.0.1-Production couldn't update anything - because the version name was too long to fit into the column.

Avoid message queues wherever possible, especially for interfaces between teams.

With a HTTP API, if the sender sends a malformed message, the receiver can return an error code immediately, so the error is reported close to where it's caused. Your automated integration tests can fail, or your operations team can get an alert about the right system.

With a message queue, a malformed message will only cause an error at the receiver. You'll need an extra interface if you want to do automated integration testing; and if malformed messages are being sent, chances are your operations team will get an alert from the receiver, not from the system with the fault.
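The HTTP side of this argument can be sketched in a few lines - the handler shape and the message schema here are made up, but the point is that the sender learns about its own fault in the same request that carried it:

```python
def handle_request(body):
    """Validate the incoming message; return (status_code, response_body)
    like a web framework handler would."""
    if not isinstance(body, dict) or "order_id" not in body:
        # The sender sees this 400 immediately, in its own call stack,
        # instead of the fault surfacing later at a queue consumer.
        return 400, {"error": "missing order_id"}
    return 200, {"accepted": body["order_id"]}

status, resp = handle_request({"order_id": 42})
assert status == 200

status, resp = handle_request({"oops": True})
assert status == 400
```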

I've also seen a bunch of other mistakes and faults:

  • "Our consumer was down and no-one noticed for 4 hours"
  • "We have a queue size alert, but the alarm threshold is set to avoid false alarms from occasional surges. It takes thirty minutes of traffic to build up before the alert triggered, so our consumer was down and no-one noticed for 30 minutes"
  • "To detect producers going down, we have an alert if there's no traffic for a certain time - but because two separate services send to this queue, when one failed the traffic from the other prevented the alert from triggering"
  • "Our consumer took a batch of messages off the queue, but crashed before putting them into the database. Can you resend the recent messages, please?"

There are ways around these problems, but they're all very clunky compared to just replacing your queue with a HTTP API.

If you want to avoid vendor lock-in, you can only use the features of the second-best vendor.

Or more precisely, you can only use the intersection of the features of the first and second best vendors. If the best vendor adds a new feature no-one else has, and you start relying on it, that's vendor lock-in. And that temptation will be very difficult to resist.

I also include this point for people who are trying to convince others of the importance of avoiding vendor lock-in; some people will be difficult to convince for exactly this reason.

Turn your problems that happen once a year into problems that happen once a day.

Code that runs thousands of times a day is easy to trust - if there were any major problems, they would have already been laid bare by the extensive use.

Software that only runs rarely is a different matter. If updating that huge file is a rare event, perhaps you'll find someone else has used the disk space you were relying on; if server failures are rare, perhaps you'll find your failover server is missing some critical configuration, or that your clients can't fail over gracefully; if updating your SSL certificates is a rare event, perhaps you'll forget to do it altogether.

The simplest way to make the rare events as reliable as the regular events is to turn the rare events into regular events. This is the idea behind Netflix's Chaos Monkey - but you don't have to limit it to server failures. If you want people consuming from your queues to support multiple deliveries, make sure they get multiple deliveries; and if you want your web service clients to retry when they get a retryable error, make sure they get them regularly.
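For the multiple-delivery case, that can be as simple as a delivery wrapper that redelivers some fraction of messages on purpose - a minimal sketch with hypothetical deliver/consume names:

```python
import random

def deliver(messages, consume, duplicate_rate=0.1, rng=random.random):
    """Deliver each message, deliberately redelivering a fraction of them
    so consumers are exercised against duplicates every day, not once a year."""
    for msg in messages:
        consume(msg)
        if rng() < duplicate_rate:
            consume(msg)        # intentional duplicate delivery

seen = set()

def consume(msg):
    # An idempotent consumer: a duplicate is simply ignored.
    if msg in seen:
        return
    seen.add(msg)

deliver(range(100), consume, duplicate_rate=0.5)
assert seen == set(range(100))  # duplicates absorbed, nothing double-counted
```

The same trick works for retryable errors: have the server return one deliberately every so often, and broken client retry logic shows up in days rather than in the middle of your next real incident.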

Needless to say, you'll find this a lot easier to introduce at the start of a software project.

Build servers are great, but never lose the ability for developers to build locally.

If you let the build server turn into a black box with behaviour nobody else can replicate, you're going to have a bad time.

Being able to do a full build locally is super useful when you have problems with your build server - the people who aren't fixing the build server can keep working, and the people who are fixing the build server can test things locally and compare their results to those of the build server. And even if you never need to understand your build server in order to fix it, at some point you're going to want to add capabilities or upgrade it, so you still need to understand it.


22 February 2016
