Re: Containerization in Corda

Mike Hearn


In Corda 4 we're shipping an official Docker container, although it's by no means the only way to run things:

So if you want to use a containerized node you can use that.

You're asking for arguments against, but not arguments for, so I guess you've already heard some of those. I'm a bit of a contrarian when it comes to Docker and container tech so here's as good a place as any to lay out why you may decide against using it.

The underlying container technology (kernel cgroups) was developed by Google for Borg, its in-house equivalent to Kubernetes. I spent 7-8 years working with Borg every day, so I ended up using containers and orchestrators a lot. They're awesome if you have a totally consistent internal hardware platform managed centrally, and everything you do handles either huge traffic or huge datasets. Then you want to replicate everything six ways from Sunday and be sure every server is automatically set up and torn down when finished. Containers make it hard for developers to escape the control of the machine coordinator, so you don't get bits of old apps lying around, and they're a very simple packaging format, so there's minimal chance of anything going wrong during setup.

Containers evolved out of chroot jails because scaling hardware vertically is very cost effective, so if you have big computing needs you end up with huge machines with 64-128 cores and terabytes of RAM, because that amortises the cost of all the other components like power supplies. But most apps can't use anywhere near that kind of resources, so you want to run multiple apps on each machine without having them interfere with each other. This is hard! Borg became masterful at it over time, and it is common there for latency-sensitive search/ads servers to run right next to giant batch jobs that want to saturate every machine resource they can, with servers being constantly re-arranged as the giant brain in the sky finds superior binpackings. Yet it all works and users never notice!

However, most companies and situations aren't like this, and the technology comes with costs. What are those costs?

  1. You have to tell the container system precisely what the requirements of your jobs are: memory, disk, network access. This turns out to be hard to do well consistently, especially if you didn't write the software, and especially as versions change.

    Borg started by having fixed resource quotas, like Kubernetes does, but over time those became somewhat advisory. The system takes what you give it as being basically a guess, and adjusts the sizes of your containers over time based on observed usage, task deaths due to running out of resources, etc. In other words actually trusting container sizes causes huge waste, so it does statistical overcommit.

  2. That kind of dynamic resizing and rescheduling is great if your orchestrator coordinates with the rest of your infrastructure to make server deaths and movements totally transparent. Unfortunately that's hard and requires a ton of support code that is not all open sourced. Moreover it requires apps to be written in very particular ways to use all that infrastructure, which normal programs are not.

    This is one reason why Google has such an infamous case of not-invented-here syndrome. Using open source servers in a fully containerised infrastructure like that is very hard because normal programs make assumptions like ... data I write to disk will still be there when I start up next time. Or, I can probably make large temporary allocations because the server can free up space by flushing disk caches. Or, I can use host names/IP addresses in config files as ways to find other servers. Or, HA is implementable by failing back and forth between backup servers.

    None of these things are true in environments where stuff is constantly on the move and can be killed unilaterally without notice. So you end up needing to either patch the software very heavily or do lots of hacky workarounds.

    As an example, containers do things that are normally considered critical bugs, like deleting everything you wrote to disk when the program quits. This is a perfectly reasonable thing to do inside Google because all well-trained Googlers know to store their stuff in Colossus or Spanner using proprietary filesystem APIs. It's total nonsense for running the rest of the world's software. We've already seen people get this wrong and do things like accidentally delete their private keys because they forgot to tell Docker to preserve a particular directory. That would never happen without containers.

  3. Containers are popular partly because the giant loop of version management history is nearly back at the start and everyone seems to have forgotten the last time around. That's not necessarily progress.

    In the early days of UNIX there was no dynamic linker so every binary was static. It was quickly found this is (a) inefficient and (b) insecure, because a bug fix to a library required rebuilding every binary, so security fixes couldn't be quickly applied. DSO/DLL files were the solution, which led to the question of how you manage dependencies between them, so package formats like DEB/RPM/MSI were created to solve that, but everyone had their own way to do things, and in particular everyone had their own way to label things, so there was no way to express dependencies in a unified form. It ended up being kind of a mess, a.k.a. "dependency hell".

    So we reach 2013 and Docker is released. Docker images are in effect statically linked binaries. It's the one true package format we were all waiting for! But, oh dear, it's (a) inefficient and (b) insecure. Now a fix for a security bug in a dependency can't be applied once and affect everything. You have to rebuild every image.

    Google has a solution! "Managed base images" are versioned container layers you can depend on:

    and they patch them for you. You just say "$distro:latest" in your Dockerfile and... hey, doesn't that sound a bit like an ordinary operating system now? You could just distribute a zip file of your software and say "run it on the latest version of Ubuntu 16". It'd boil down to something rather similar.

    The next step will surely be to observe you don't actually need a full blown copy of Ubuntu or whatever installed for every server. You only need parts of it. So then maybe they'll come up with different dependency names to reflect subsets, and then different cloud vendors will name things differently, and then eventually people will be tired of GCP / AWS / Azure specific Docker images, so there'll be a standards body that gives standard names to bundles of dependencies. We'll have reinvented the Linux Standard Base project.

    In the end, we can see where this is all going, because we've been there before. Alpine Linux is a particularly problematic outgrowth of this loop.
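
Going back to point 1 for a moment, here is a rough sketch of what "treat the declared size as a guess and adjust from observed usage" can look like. To be clear, this is not Borg's or Kubernetes' actual algorithm. The function name `suggest_limit`, the 95th percentile, and the 20% headroom are all made up for illustration, and it ignores things a real system would track, such as tasks dying from running out of resources.

```python
import math
from typing import Sequence


def suggest_limit(declared_mb: float,
                  observed_mb: Sequence[float],
                  percentile: float = 0.95,
                  headroom: float = 1.2) -> float:
    """Suggest a memory limit from observed usage instead of the declared guess.

    Hypothetical helper: the percentile and headroom values are arbitrary
    assumptions, not taken from any real orchestrator.
    """
    if not observed_mb:
        # No data yet, so the declared value is all we have.
        return declared_mb
    ordered = sorted(observed_mb)
    # Nearest-rank percentile: the smallest sample that has at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * headroom


declared = 4096.0  # what the job owner asked for (MB)
samples = [900.0, 950.0, 1000.0, 1100.0, 1000.0]  # what the job actually used (MB)
limit = suggest_limit(declared, samples)
print(f"declared={declared:.0f} MB, suggested={limit:.0f} MB, "
      f"reclaimable={declared - limit:.0f} MB")
```

The gap between the declared and suggested numbers is the waste you get from trusting container sizes as written, and reclaiming it is what makes overcommit attractive. The catch is that the suggestion is only a statistical bet, so the job has to tolerate being killed or moved when the bet is wrong.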

So, can you use containers with Corda? Yes, and since Corda 4 it's now officially supported. Should you? Well, try it out and see if you prefer it. A Corda node isn't horizontally scalable (at the moment) so you'll probably only set it up once, and even with fancy HA enterprise setups, each machine will be running different software and want stable DNS names or IP addresses, so you can't just move things around at will to binpack them better either. Given that, if I were setting up a node I'd probably just script it with Ansible, systemd or a deb. But it'll all work. Do what you like best.
