Validating Terraform Plans using Open Policy Agent

When developing infrastructure as code using terraform, it can be difficult to test and validate changes without executing the code against a real environment. The feedback loop between writing a line of code and understanding its impact on the system is often significantly longer than when developing application code. The goal of this blog post is to explore one way to shorten that feedback loop using Open Policy Agent.

In this blog post, we will explore the possibilities for using Open Policy Agent to validate terraform plans before they are applied. Open Policy Agent uses the Rego policy language to express policies and generate evaluations based on specific input. There is some existing documentation about integrating with Terraform, but that documentation does not go through the process of asking specific questions about Terraform resources. This blog post will take a simple set of cloud resources defined by Terraform and ask whether the plan satisfies a simple set of questions defined in OPA.

The code for this project can be found on GitHub.

Terraform Code

In this terraform code, we will create a basic Azure network, deploy a single node Kubernetes cluster, then deploy nginx to the cluster. Let's walk through the basics: first, initialize an azurerm_virtual_network with an azurerm_subnet and an associated azurerm_network_security_group. We will deploy the node of the AKS cluster into that subnet, and the NSG rules will apply to traffic inbound to and outbound from that cluster.
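As a rough sketch, the network portion might look like the following. The names, address spaces, and resource group references here are assumptions for illustration, not the exact code from the linked repository:

```hcl
# Sketch of the network layer; names and CIDR ranges are assumed.
resource "azurerm_virtual_network" "vnet" {
  name                = "aks-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_subnet" "aks" {
  name                 = "aks-subnet"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefix       = "10.0.1.0/24"
}

resource "azurerm_network_security_group" "aks" {
  name                = "aks-nsg"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

# Attach the NSG to the subnet so its rules apply to the AKS node traffic.
resource "azurerm_subnet_network_security_group_association" "aks" {
  subnet_id                 = azurerm_subnet.aks.id
  network_security_group_id = azurerm_network_security_group.aks.id
}
```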

Then, we initialize the Kubernetes provider and create Kubernetes resources using that provider. The deployment will roll out the pods to the cluster, two replicas at this point, and create a LoadBalancer service targeting the pods from that deployment. This will automatically provision an Azure Load Balancer, which creates a public IP address associated with that deployment. All of the code for our project can be found in a single terraform file here.
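The Kubernetes side might be sketched like this; labels, names, and the nginx image tag are assumptions rather than the repository's exact code:

```hcl
# Sketch of the nginx deployment (two replicas) and its LoadBalancer service.
resource "kubernetes_deployment" "nginx" {
  metadata {
    name = "nginx"
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "nginx" }
    }
    template {
      metadata {
        labels = { app = "nginx" }
      }
      spec {
        container {
          name  = "nginx"
          image = "nginx:1.17"
        }
      }
    }
  }
}

# A LoadBalancer service on AKS provisions an Azure Load Balancer
# and a public IP automatically.
resource "kubernetes_service" "nginx" {
  metadata {
    name = "nginx"
  }
  spec {
    type     = "LoadBalancer"
    selector = { app = "nginx" }
    port {
      port        = 80
      target_port = 80
    }
  }
}
```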

Defining Policies

This is where the rubber hits the road. Our goal is to ask some specific questions and see if the terraform plan validates against those questions. Let’s use some publicly available resources to help us figure out what questions to ask.

Sidenote: In my experience, policy queries like this are best written directly. There's no use in asking "do all resources correspond to X, Y, Z" when each resource type has its own schema. Forcing the resource types into the same standard structure will only create headaches when building good queries. I prefer asking direct questions about particular scenarios that are worth catching, which lets us write very specific clauses that are easy to read. There is a difference between generalizing a query and simplifying the query statement, which I will clarify next.

Now, we want to determine some good questions to ask, one based on each of the respective documents:

  • No network security rule should allow inbound SSH or RDP traffic from the internet.
  • Every Azure Kubernetes cluster should have the kured daemonset deployed.
  • All Kubernetes workloads should run under a non-default service account.

We want to put together a rego query that will validate these three questions.

First Steps

The Terraform documentation for OPA first led me to believe there were built-ins in the language that helped manage terraform. Digging into the code of the OPA project, I found no such code and was left searching for what was missing. In fact, the documentation outlines how to build queries based on the terraform plan JSON document. This is helpful for getting started, but unfortunately I didn't particularly grok the specific queries that they made. First, we will need to dig into a terraform plan document. When terraform generates the JSON, it is structured as follows:

{
    "format_version": "0.1",
    "terraform_version": "0.12.18",
    "planned_values": {
        "root_module": {
            "resources": [
                // lots of stuff
            ]
        }
    },
    "configuration": {
        "provider_config": {
            // provider information
        }
    },
    "resource_changes": [
        // list of changes
    ],
    "root_module": {
        "resources": [
            // current values with reference fields
        ]
    }
}
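For reference, this JSON document can be produced with the terraform CLI (0.12+) by rendering a saved plan file, using the same file names the eval commands below assume:

```shell
# Write a binary plan, then render it as JSON for OPA to consume.
terraform plan -out=plan.tfstate
terraform show -json plan.tfstate > plan.tfstate.json
```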

I will use resource_changes to validate the resources being modified by the plan. I want to write a basic query that collects the planned objects into a structure I can operate on. For these rules, I'm just going to operate on the values that are in the create change set.

package terraform.parsing

import input as tfplan

create_action := "create"

# Map each created resource type to the list of planned ("after") specs
# for all resources of that type in the create change set. Grouping all
# specs per type keeps the object comprehension keys unique even when
# several resources share a type.
created_objects := {type: specs |
    some i
    tfplan.resource_changes[i].change.actions[_] == create_action
    type := tfplan.resource_changes[i].type
    specs := [after_spec |
        some j
        tfplan.resource_changes[j].type == type
        tfplan.resource_changes[j].change.actions[_] == create_action
        after_spec := tfplan.resource_changes[j].change.after
    ]
}

Array iteration in Rego happens implicitly when a variable is placed in an index position. In this case, we declare some i as a local variable because we don't care about the index of the change itself, just that a create action matches up with the spec of the object being created. The after field on the change is the state the plan is targeting.
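As a side note, the two iteration styles are interchangeable; this toy rule (not part of the project) shows that some i simply names the index that the underscore leaves anonymous:

```rego
package example

# Both rules are true if any element of input.ports equals 22.
has_ssh { input.ports[_] == 22 }

has_ssh_explicit {
    some i
    input.ports[i] == 22
}
```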

Confirm Network Security Rule Does Not Allow RDP/SSH

So, with this rule we're going to make some simplifying assumptions. Basically, we're going to ignore the fact that you could allow a port range like 18-30 and assume that you will either specify "22" or "3389" in the port list, or allow everything with "*". Here's the rule:

  • No azurerm_network_security_rule "Allow" rule that accepts traffic from the "Internet" tag on "*", "22", or "3389".

We can use our created_objects query to get the list of azurerm_network_security_rule objects. Then, since we need to find violations of this policy, let's set the default value for this evaluation to false and look for rules matching the statement above:

rules := created_objects["azurerm_network_security_rule"]
 
default found_open_ports = false
found_open_ports {
    some i
    rules[i].direction = "Inbound"
    rules[i].access = "Allow"
    rules[i].source_address_prefix = "Internet"
    # All ports
    rules[i].destination_port_range = "*"
}
found_open_ports {
    some i
    rules[i].direction = "Inbound"
    rules[i].access = "Allow"
    rules[i].source_address_prefix = "Internet"
    # SSH Port
    contains(rules[i].destination_port_range, "22")
}
found_open_ports {
    some i
    rules[i].direction = "Inbound"
    rules[i].access = "Allow"
    rules[i].source_address_prefix = "Internet"
    # RDP Port
    contains(rules[i].destination_port_range, "3389")
}

There’s definitely some refactoring we can do here. Let’s drop the first three statements into a function:

inbound_rule(rule) = check {
    rule.direction = "Inbound"
    rule.access = "Allow"
    rule.source_address_prefix = "Internet"
    check := "true"
} else = check {
    check := "false"
}
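With that helper in place, each clause collapses to two conditions. A sketch of two of the refactored rules, matching the string value that inbound_rule returns:

```rego
found_open_ports {
    some i
    inbound_rule(rules[i]) == "true"
    # All ports
    rules[i].destination_port_range == "*"
}

found_open_ports {
    some i
    inbound_rule(rules[i]) == "true"
    # SSH Port
    contains(rules[i].destination_port_range, "22")
}
```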

Originally, this rule allowed every destination port. Let's make the policy pass by modifying the terraform code. We're only exposing nginx over port 80, so let's change the inbound rule to allow only port 80. After this commit the policy passes.
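The corrected rule might look something like this sketch; the rule name, priority, and referenced resource names are assumptions:

```hcl
# Allow only HTTP (port 80) inbound from the internet; names are assumed.
resource "azurerm_network_security_rule" "allow_http" {
  name                        = "allow-http-inbound"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "80"
  source_address_prefix       = "Internet"
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.rg.name
  network_security_group_name = azurerm_network_security_group.aks.name
}
```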

Validate Daemonset Exists

Let's put together a slightly more complex query, asking whether the kured daemonset is deployed to every Azure Kubernetes cluster that is created. We need to reshape the question in terms of terraform resources:

  • When an azurerm_kubernetes_cluster is created, a kubernetes_daemonset with the image name containing docker.io/weaveworks/kured is also created.

Logical OR is awkward in Open Policy Agent, so let's break it down into multiple statements. The above statement is a logical implication, which is true when the antecedent is false or when both sides are true. That means I can validate it by checking that either of the following holds:

  • No azurerm_kubernetes_cluster is created
  • The same number of azurerm_kubernetes_cluster and kubernetes_daemonset objects are created where the daemonset matches the correct image name.

First, we default a rego statement to false and check whether there are no azurerm_kubernetes_cluster objects:

default kube_daemonset_rule = false
 
kube_daemonset_rule {
    count(created_objects["azurerm_kubernetes_cluster"]) = 0
}

Then, let's validate that the number of kube clusters created equals the number of daemonsets targeting those clusters. First, build a list of the daemonsets that match the image, then compare the sizes of those two lists:

kube_daemonset_rule {
    count(created_objects["azurerm_kubernetes_cluster"]) > 0
    count(created_objects["kubernetes_daemonset"]) > 0
    daemonset_list := [res |
        res := created_objects["kubernetes_daemonset"][_]
        contains(res.spec[_].template[_].spec[_].container[_].image, "docker.io/weaveworks/kured")
    ]
    count(daemonset_list) == count(created_objects["azurerm_kubernetes_cluster"])
}

Now, run the eval without the daemonset present in the plan:

$ opa eval --format pretty --data terraform-plan-parsing.rego --input plan.tfstate.json "data.terraform.parsing.kube_daemonset_rule"
false

Then, define the required kubernetes_daemonset object and get the output we desire. You can see the addition in the terraform code in this commit:

$ opa eval --format pretty --data terraform-plan-parsing.rego --input plan.tfstate.json "data.terraform.parsing.kube_daemonset_rule"
true
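For reference, a minimal kubernetes_daemonset satisfying the image check might look like this sketch; the names and image tag are assumptions:

```hcl
# Daemonset whose container image matches the policy's contains() check.
resource "kubernetes_daemonset" "kured" {
  metadata {
    name = "kured"
  }
  spec {
    selector {
      match_labels = { name = "kured" }
    }
    template {
      metadata {
        labels = { name = "kured" }
      }
      spec {
        container {
          name  = "kured"
          image = "docker.io/weaveworks/kured:1.2.0"
        }
      }
    }
  }
}
```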

Confirm Kubernetes Service Accounts

Now, we can start to put together some more complex rules. This next rule will be stated as the following:

  • All workloads in Kubernetes should run under non-default service accounts to identify the application specifically.

We could extend this to enforce using Azure Pod Identity to identify workloads, but a non-default service account should be good enough for now. Again, we need to start with some simplifying statements:

  • Kubernetes workloads are run by resource types kubernetes_deployment and kubernetes_daemonset.
  • A service account must be specified within the pod spec for each of these workloads.
  • The service account must not be named default.

The goal here will be to demonstrate how to validate the Deployment and Daemonset types in a way abstract enough that you can pick out other resource types as desired. First, let's default the value we want to check to false. The policy fails when has_default_service_account becomes true, since we want it to fail whenever a single offending entry is found.

Then, we need some way to default the value when the service account is not defined. We will do this using a key_func helper that returns one of two strings, and a very simple has_key function that returns true when the object has the key we're looking for.

Then, we can validate both the daemonset and the deployment spec.template.spec objects, as that entry should carry the service_account_name field.

default has_default_service_account = false
sa_key := "service_account_name"
has_default_service_account {
    workloads := array.concat(created_objects["kubernetes_deployment"], created_objects["kubernetes_daemonset"])
    spec := workloads[_].spec[_].template[_].spec[_]
    val(key_func(spec, sa_key), spec, sa_key) == "default"
}
 
val("has_key", spec, key) = ret {
    ret := spec[key]
}
 
val("no_key", spec, key) = ret {
    ret := "default"
}
 
key_func(spec, key) = message {
  message := "has_key"
  has_key(spec, key)
} else = default_out {
  default_out := "no_key"
}
 
has_key(x, k) { x[k] }

The original nginx deployment didn’t set any service_account_name field, so it would run under the default service account. Therefore, when evaluating the policy, the output looks like:

$ opa eval --format pretty --data terraform-plan-parsing.rego --input plan.tfstate.json "data.terraform.parsing.has_default_service_account"
true

Then, after this commit, we get:

$ opa eval --format pretty --data terraform-plan-parsing.rego --input plan.tfstate.json "data.terraform.parsing.has_default_service_account"
false

Putting it all together

Now, we can make an allow rule that combines the three statements:

allow {
    not has_default_service_account
    kube_daemonset_rule
    not found_open_ports
}

Then, the policy can be run using the package name with the data prefix, since each rego policy file is loaded as data.

$ opa eval --format pretty --data terraform-plan-parsing.rego --input plan.tfstate.json "data.terraform.parsing.allow"
true
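As a usage note, a CI pipeline would typically gate on the exit code rather than parse the printed output. opa eval supports a --fail flag that exits non-zero when the result is undefined, which happens here whenever the allow rule's body does not succeed:

```shell
# Exits 0 when the plan passes; non-zero when allow is undefined.
opa eval --fail --format pretty \
    --data terraform-plan-parsing.rego \
    --input plan.tfstate.json \
    "data.terraform.parsing.allow"
```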

The terraform plan now passes all the allow policies defined in our Open Policy Agent policy definition. This would be a great sign that our code is ready to be pushed into the environment.

I hope you can use this to help build complex policy evaluations for your Terraform code using Open Policy Agent. Feel free to drop any questions or interesting rego policy language puzzles in the comments section or in the GitHub issues!

