Sep 22, 2021

Infrastructure as Code – The Wrong Way

You are probably familiar with the term “infrastructure as code”. It’s a great concept, and it’s gaining steam in the industry. Unfortunately, just as we had a lot to learn about how to write clean application code, you can easily fail to write clean infrastructure code. Here are a few common ways that we’ve seen infrastructure code done the wrong way, and some ways you can do better.

1. Use latest dependencies

The wrong way: “Don’t peg your dependencies to specific versions, that’s too much trouble! You always want the latest and greatest! If you just set your dependencies to latest, then you’ll get the best version every time you build. This works with Node packages, Docker base images, Python packages, Java libraries. Even if you pin a version, don’t worry about hashsums. They won’t replace the tagged image unless it’s really necessary.”

A better way:

  • Peg your dependencies to a particular version. This will help ensure you get the same result whether you build your project today, tomorrow, or six months from now.
  • Use a lockfile if your system supports it—this will ensure that the package or image will match the desired hashsum.
  • Install all of your dependencies on your image at build time, and avoid running a package install when a server instance comes online. This helps prevent failures at runtime when a new instance starts.
  • From a security perspective, it can be valuable to use tooling that will identify insecure versions of libraries at both build time and runtime. This will help you catch potential vulnerabilities quickly, while still fixing them through a discrete change in source control.
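As a rough illustration of the first bullet, here is a minimal sketch of a check that flags dependencies that are not pinned to an exact version. It assumes a pip-style requirements file; the function name find_unpinned and the sample contents are hypothetical, and a real project would more likely lean on its package manager's lockfile support.

```python
import re

# Matches an exact pin such as "requests==2.31.0"; ranges like ">=2.0" do not count.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s;]+")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as --hash
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements file: one pinned entry, two unpinned ones.
example = """\
requests==2.31.0
flask>=2.0
numpy
"""
print(find_unpinned(example))  # ['flask>=2.0', 'numpy']
```

A check like this could run in the same pipeline step as your linting, so an unpinned dependency fails the build instead of surprising you six months later.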

2. Write big scripts

The wrong way: “Write a quick 2-line Bash script that will launch your infrastructure. Infrastructure As Script! Oh, you need to tear down your infrastructure? We can write another script that will delete it. Need to update the infrastructure, well, let’s write an if-block…”

Your code will quickly start to look something like this (pseudo-code):

asg = get_auto_scaling_group(...)
if !asg.exists() {
   asg = new AutoScalingGroup(min_size=2, max_size=5)
   create_auto_scaling_group(asg)
} else if asg.min_size != 2 {
   asg.min_size = 2
   update_auto_scaling_group(asg)
}
....

This code is not very testable, readable, or maintainable, because you are managing state using imperative code.

A better way:

I prefer the term “declarative infrastructure”. With tools such as Terraform, Kubernetes resources, or CloudFormation, you define the desired state, and the tool compares it to the current state and then performs the necessary actions. That ends up looking like this:

resource "aws_autoscaling_group" "bar" {
  name                      = "foobar3-terraform-test"
  min_size                  = 2
  max_size                  = 5
  ...
}

  • Rather than a script or application with imperative commands, build declarative configuration and use a tool that will figure out what it needs to do to get there.
  • This avoids complex logic, which helps you avoid bugs.
  • Kubernetes codifies this model with an API built around declarative resources (which is why those imperative CLI commands such as kubectl expose or kubectl run are counterproductive, and shouldn’t exist).
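To make the "compare desired state to current state" idea concrete, here is a toy sketch of the comparison step. It is not how Terraform or Kubernetes are implemented; AutoScalingGroupSpec and plan are hypothetical names, and the model only covers the two sizes from the example above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class AutoScalingGroupSpec:
    # Hypothetical, simplified model of the settings used in the example above.
    min_size: int
    max_size: int


def plan(desired: AutoScalingGroupSpec, current: Optional[AutoScalingGroupSpec]) -> list[str]:
    """Compare desired state with current state and list the actions needed."""
    if current is None:
        return [f"create min_size={desired.min_size} max_size={desired.max_size}"]
    want, have = asdict(desired), asdict(current)
    return [f"update {key}: {have[key]} -> {want[key]}" for key in want if have[key] != want[key]]


desired = AutoScalingGroupSpec(min_size=2, max_size=5)
print(plan(desired, None))                                          # ['create min_size=2 max_size=5']
print(plan(desired, AutoScalingGroupSpec(min_size=1, max_size=5)))  # ['update min_size: 1 -> 2']
print(plan(desired, desired))                                       # []
```

The point is that you only ever write the desired state. The if/else branching lives once, inside the tool, rather than being repeated for every resource in your scripts.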

3. Deploy your infrastructure from your workstation

The wrong way: “Why wait for Terraform to run on a CI server? You can run it on your laptop. Don’t worry about the Terraform state – it can be locked. You’ll always remember to commit your local source code! Somebody will eventually review the PR and merge it. And everybody else will have the same tools, access, and local environment that you have.”

In reality, you know that folks make mistakes. They forget to commit a change they have already applied. They accidentally use the wrong version of a tool such as Terraform. They have different access than others on the team.

A better way:

Run your infrastructure through a CI tool such as GitHub Actions, Azure DevOps Pipelines, GitLab, or any similar process.

  • On PR builds, run a lint, dry-run, terraform plan – anything you can to help validate that it will work.
  • Deploy your infrastructure from CI to ensure that the same versions of tools and the same credentials are used, and that there is an auditable change history.

4. Run everything locally

The wrong way: “The only way to be sure that everything will work is to run everything locally! So let’s run Kubernetes, an API gateway, Kafka, databases, and every microservice in the company in a local virtual machine!”

If you choose to unravel this ball of string, eventually you’ll be running the whole internet on your laptop. Obviously, that’s not feasible. Testing this way is the sort of “end-to-end” test that sits at the top of the test pyramid: slower, more expensive, and more difficult to troubleshoot when something goes wrong.

A better way:

I prefer to let engineers run software the way they prefer to run it – often, this means using IDE tools. Requiring an application to run in a local Kubernetes cluster, or even a Docker Compose stack, can make it difficult to use native debuggers or test frameworks.

And if you are attempting to “run the world” locally, consider what problem you are actually trying to solve. Is it that you want to get the Kubernetes config correct? Perhaps consider validating it against a JSON schema. Do you need to interact with an external service? Consider stubbing or mocking out that service.
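Stubbing out an external service can be as small as the sketch below, which uses Python's standard unittest.mock. The shipping_label function, the carrier_client object, and its get_quote method are all hypothetical stand-ins for your own application code and the real API client.

```python
from unittest.mock import Mock


def shipping_label(order_id: str, carrier_client) -> str:
    """Application code under test; carrier_client is a hypothetical external API client."""
    quote = carrier_client.get_quote(order_id)  # would be a network call in production
    return f"{order_id}: {quote['carrier']} ${quote['price']:.2f}"


# The stub returns a canned response, so nothing outside this process is needed.
stub = Mock()
stub.get_quote.return_value = {"carrier": "ACME", "price": 4.5}

print(shipping_label("order-42", stub))  # order-42: ACME $4.50
stub.get_quote.assert_called_once_with("order-42")
```

This runs in milliseconds inside the IDE, with a debugger attached if you like, and no local cluster is required.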

Whole books could be written on this, but my rule of thumb is to try and reduce engineer cycle time as much as possible. If they are writing application code, they should be able to iterate quickly, locally, with unit tests or local “poking around” builds. If they are interacting with another API, they should be able to test against some published contract or specification.

5. Use YAML

The wrong way: “All configurations should be in YAML. Even better—put YAML inside of your YAML. Best—Template your YAML inside your YAML”

Gross! Listen, I use YAML as much as the next person, but we have to recognize that we have a problem. YAML is easier to read (and template) than JSON. But it has a lot of shortcomings. Here are some fun examples I run into, regularly:

  • Boolean or numeric values used where a string is expected, such as an environment variable value. These will cause errors when deserializing the YAML unless they are quoted.
  • 0233849835: AWS account IDs with a leading zero can be interpreted as a number, which removes the leading zero: 233849835.
  • 1.2.0 is interpreted as a string, "1.2.0", but 1.2 is interpreted as the numeric value 1.2.
  • 042 is interpreted by some parsers as an octal number and translated into the decimal 34.

There are just a lot of ways that YAML can go wrong.
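The arithmetic behind those surprises is easy to reproduce. The sketch below uses Python's own conversions as a stand-in for what a YAML parser may do with an unquoted scalar; it does not call a YAML library, and actual behavior varies by parser and YAML version.

```python
# Python's own conversions, used as a stand-in for what a YAML parser may do.
print(int("042", 8))      # 34: the octal reading of 042
print(int("0233849835"))  # 233849835: the decimal reading drops the leading zero
print(float("1.2"))       # 1.2: parses cleanly as a number

try:
    float("1.2.0")
except ValueError:
    print("1.2.0 is not a number, so it stays a string")
```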

A better way:

As mentioned above—schemas are available for Kubernetes APIs and Helm chart values, and you should use them. Insert validation of your YAML files in the “validate” phase of the pipeline so that you can catch issues before merging to the main branch.
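If a full schema is not available, even a small type check in the validate phase can catch the unquoted-value problem. This is a minimal sketch that assumes the YAML has already been parsed into Python values; the function name and the sample data are hypothetical.

```python
def non_string_env_values(env: dict) -> dict:
    """Return the entries whose values are not strings, keyed by variable name."""
    return {name: value for name, value in env.items() if not isinstance(value, str)}


# Hypothetical result of parsing unquoted values:
#   DEBUG: true
#   VERSION: 1.2
#   REGION: us-east-1
parsed_env = {"DEBUG": True, "VERSION": 1.2, "REGION": "us-east-1"}
print(non_string_env_values(parsed_env))  # {'DEBUG': True, 'VERSION': 1.2}
```

Failing the PR build when this returns anything is usually enough to remind people to add the quotes.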

And maybe consider using JSON or another data format!

Conclusion

Hope you enjoyed this post—I had fun writing it and delivering it as a presentation! We run into all of these sins from time to time, and I have perhaps committed most of them myself. But the important thing is that we learn from the mistake, and do better next time!

About the Author

David Norton

Director, Platform Engineering

Passionate about continuous delivery, cloud-native architecture, DevOps, and test-driven development.

  • Experienced in cloud infrastructure technologies such as Terraform, Kubernetes, Docker, AWS, and GCP.
  • Background heavy in enterprise JVM technologies such as Groovy, Spring, Spock, Gradle, JPA, Jenkins.
  • Focus on platform transformation, continuous delivery, building agile teams and high-scale applications.