The document discusses testing infrastructure as code using automated tests. It recommends writing unit tests to test individual components in isolation by deploying real infrastructure, validating it works through methods like HTTP requests or API calls, and then undeploying it. The document provides an example of using Terratest to write a unit test for a Terraform module that deploys a "Hello World" web app. It shows how to build and deploy the infrastructure, validate it works by making an HTTP request, and clean it up after the test.
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
1. Automated testing for:
✓ terraform
✓ docker
✓ packer
✓ kubernetes
✓ and more
Passed: 5. Failed: 0. Skipped: 0.
Test run successful.
How to test infrastructure code
Watch the video with slide synchronization on InfoQ.com:
https://www.infoq.com/presentations/automated-testing-terraform-docker-packer/
Presented at QCon San Francisco
www.qconsf.com
24. We know how to write automated
tests for application code…
25. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
But how do you test that your Terraform code
deploys infrastructure that works?
26. apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-app-deployment
spec:
  selector:
    matchLabels:
      app: hello-world-app
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world-app
    spec:
      containers:
        - name: hello-world-app
          image: gruntwork-io/hello-world-app:v1
          ports:
            - containerPort: 8080
How do you test that your Kubernetes code
configures your services correctly?
27. This talk is about how to write
tests for your infrastructure code.
49. Instead, break your infra code into
small modules and unit test those!
50. With app code, you can test units
in isolation from the outside world
51. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
But 99% of infrastructure code is about
talking to the outside world…
52. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
If you try to isolate a unit from the
outside world, you’re left with nothing!
53. So you can only test infra code by
deploying to a real environment
55. Therefore, the test strategy is:
1. Deploy real infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy the infrastructure
(So it’s really integration testing of a single unit!)
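The three steps can be sketched in plain Go, independent of any testing library. This is a minimal illustration rather than Terratest code: runLifecycle, errValidation, and the fake step functions are hypothetical names introduced here. Cleanup is registered with defer before anything is deployed, so the undeploy step still runs when deployment or validation fails.

```go
package main

import (
	"errors"
	"fmt"
)

// errValidation is a hypothetical error used to simulate a failed check.
var errValidation = errors.New("validation failed")

// runLifecycle deploys, validates, and always undeploys.
// The deferred cleanup runs even if deploy or validate returns an error.
func runLifecycle(deploy, validate, undeploy func() error) (err error) {
	defer func() {
		// Keep the first error; report a cleanup error only if nothing else failed.
		if cleanupErr := undeploy(); err == nil {
			err = cleanupErr
		}
	}()
	if err = deploy(); err != nil {
		return err
	}
	return validate()
}

func main() {
	var steps []string
	// step builds a fake stage that records its name and returns result.
	step := func(name string, result error) func() error {
		return func() error {
			steps = append(steps, name)
			return result
		}
	}
	err := runLifecycle(step("deploy", nil), step("validate", errValidation), step("undeploy", nil))
	fmt.Println(steps, err) // prints "[deploy validate undeploy] validation failed"
}
```

Even though the simulated validation fails, the undeploy step is still recorded, which is the property that keeps failed tests from leaking infrastructure.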
56. Tools that help with this strategy:
Tool               Deploy / Undeploy  Validate  Works with
Terratest          Yes                Yes       Terraform, Kubernetes, Packer, Docker, Servers, Cloud APIs, etc.
kitchen-terraform  Yes                Yes       Terraform
InSpec             No                 Yes       Servers, Cloud APIs
Serverspec         No                 Yes       Servers
Goss               No                 Yes       Servers
57. In this talk, we’ll use Terratest.
58. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
59. Sample code for this talk is at:
github.com/gruntwork-io/infrastructure-as-code-testing-talk
60. An example of a Terraform
module you may want to test:
62. resource "aws_lambda_function" "web_app" {
function_name = var.name
role = aws_iam_role.lambda.arn
# ...
}
resource "aws_api_gateway_integration" "proxy" {
type = "AWS_PROXY"
uri = aws_lambda_function.web_app.invoke_arn
# ...
}
Under the hood, this example runs on
top of AWS Lambda & API Gateway
63. $ terraform apply
Outputs:
url = ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
$ curl ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
Hello, World!
When you run terraform apply, it
deploys and outputs the URL
64. Let’s write a unit test for
hello-world-app with Terratest
68. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
2. Run terraform init and terraform
apply to deploy your module
69. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
3. Validate the infrastructure works.
We’ll come back to this shortly.
70. func TestHelloWorldAppUnit(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/hello-world-app",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
validate(t, terraformOptions)
}
4. Run terraform destroy at the end of
the test to undeploy everything
71. func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}
The validate function
72. func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}
1. Run terraform output to get the web
service URL
73. func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}
2. Make HTTP requests to the URL
74. func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}
3. Check the response for an expected
status and body
75. func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}
4. Retry the request up to 10 times, as
deployment is asynchronous
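To make the retry idea concrete, here is a rough standard-library sketch of polling a URL until it returns the expected status and body. It is not how Terratest implements HttpGetWithRetry; getWithRetry, checkOnce, and newFlakyServer are hypothetical names, and the local test server only simulates a deployment that needs a few attempts before it is ready.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
	"time"
)

// getWithRetry requests url until it returns wantStatus and wantBody,
// trying at most maxRetries times and sleeping between attempts.
// It returns the number of attempts made.
func getWithRetry(url string, wantStatus int, wantBody string, maxRetries int, sleep time.Duration) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		lastErr = checkOnce(url, wantStatus, wantBody)
		if lastErr == nil {
			return attempt, nil
		}
		if attempt < maxRetries {
			time.Sleep(sleep)
		}
	}
	return maxRetries, fmt.Errorf("gave up after %d attempts: %w", maxRetries, lastErr)
}

// checkOnce performs a single GET and compares the status and body.
func checkOnce(url string, wantStatus int, wantBody string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	got := strings.TrimSpace(string(body))
	if resp.StatusCode != wantStatus || got != wantBody {
		return fmt.Errorf("got status %d and body %q", resp.StatusCode, got)
	}
	return nil
}

// newFlakyServer simulates a deployment that answers 503 for the first
// `failures` requests and "Hello, World!" afterwards.
func newFlakyServer(failures int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) <= failures {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "Hello, World!")
	}))
}

func main() {
	srv := newFlakyServer(2)
	defer srv.Close()
	attempts, err := getWithRetry(srv.URL, 200, "Hello, World!", 10, 10*time.Millisecond)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

The first two requests hit a service that is not ready yet, and the third succeeds, so a single failed request does not fail the test.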
76. Note: since we’re testing a
web service, we use HTTP
requests to validate it.
77. Examples of other ways to validate:
Infrastructure  Example             Validate with…  Example tool
Web service     Dockerized web app  HTTP requests   Terratest http_helper package
Server          EC2 instance        SSH commands    Terratest ssh package
Cloud service   SQS                 Cloud APIs      Terratest aws or gcp packages
Database        MySQL               SQL queries     MySQL driver for Go
79. $ go test -v -timeout 15m -run TestHelloWorldAppUnit
…
--- PASS: TestHelloWorldAppUnit (31.57s)
Then run go test. You now have a unit
test you can run after every commit!
80. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
96. func validate(t *testing.T, opts *k8s.KubectlOptions) {
    k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
    http_helper.HttpGetWithRetry(t,
        serviceUrl(t, opts), // URL to test
        200,                 // Expected status code
        "Hello, World!",     // Expected body
        10,                  // Max retries
        3*time.Second,       // Time between retries
    )
}
The validate function
97. func validate(t *testing.T, opts *k8s.KubectlOptions) {
    k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
    http_helper.HttpGetWithRetry(t,
        serviceUrl(t, opts), // URL to test
        200,                 // Expected status code
        "Hello, World!",     // Expected body
        10,                  // Max retries
        3*time.Second,       // Time between retries
    )
}
1. Wait until the service is deployed
98. func validate(t *testing.T, opts *k8s.KubectlOptions) {
    k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
    http_helper.HttpGetWithRetry(t,
        serviceUrl(t, opts), // URL to test
        200,                 // Expected status code
        "Hello, World!",     // Expected body
        10,                  // Max retries
        3*time.Second,       // Time between retries
    )
}
2. Make HTTP requests
99. func validate(t *testing.T, opts *k8s.KubectlOptions) {
    k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
    http_helper.HttpGetWithRetry(t,
        serviceUrl(t, opts), // URL to test
        200,                 // Expected status code
        "Hello, World!",     // Expected body
        10,                  // Max retries
        3*time.Second,       // Time between retries
    )
}
3. Use the serviceUrl function to get the URL
103. $ go test -v -timeout 15m -run TestDockerKubernetes
…
--- PASS: TestDockerKubernetes (5.69s)
Run go test. You can validate your
config after every commit in seconds!
104. Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests
106. Pro tip #1: run tests in completely
separate “sandbox” accounts
107. Pro tip #2: run these tools in cron jobs to clean up left-over resources
Tool              Clouds             Features
cloud-nuke        AWS (GCP planned)  Delete all resources older than a certain date; in a certain region; of a certain type.
Janitor Monkey    AWS                Configurable rules of what to delete. Notify owners of pending deletions.
aws-nuke          AWS                Specify specific AWS accounts and resource types to target.
Azure PowerShell  Azure              Includes native commands to delete Resource Groups
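The age-based rule these tools apply ("delete everything older than a certain date") can be illustrated with a small sketch. It does not call any cloud API or any of the tools listed: the resource type, olderThan, sampleResources, referenceTime, and day are all hypothetical, and the sketch only selects candidates for deletion.

```go
package main

import (
	"fmt"
	"time"
)

// resource is a hypothetical record of something a test left behind.
type resource struct {
	ID        string
	CreatedAt time.Time
}

// day is the maximum age used in this example.
const day = 24 * time.Hour

// referenceTime is a fixed, made-up "now" so the example is deterministic.
var referenceTime = time.Date(2020, 1, 10, 0, 0, 0, 0, time.UTC)

// olderThan returns the resources created more than maxAge before now.
func olderThan(resources []resource, now time.Time, maxAge time.Duration) []resource {
	cutoff := now.Add(-maxAge)
	var stale []resource
	for _, r := range resources {
		if r.CreatedAt.Before(cutoff) {
			stale = append(stale, r)
		}
	}
	return stale
}

// sampleResources builds made-up resources with the given ages in hours.
func sampleResources(now time.Time, agesInHours ...int) []resource {
	var out []resource
	for i, h := range agesInHours {
		out = append(out, resource{
			ID:        fmt.Sprintf("res-%d", i),
			CreatedAt: now.Add(-time.Duration(h) * time.Hour),
		})
	}
	return out
}

func main() {
	resources := sampleResources(referenceTime, 1, 30, 72)
	for _, r := range olderThan(resources, referenceTime, day) {
		fmt.Println("would delete", r.ID)
	}
}
```

In a sandbox account used only for tests, anything older than the longest test run is a safe candidate for deletion, which is why an age cutoff is a reasonable default rule.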
124. func TestProxyApp(t *testing.T) {
webServiceOpts := configWebService(t)
defer terraform.Destroy(t, webServiceOpts)
terraform.InitAndApply(t, webServiceOpts)
proxyAppOpts := configProxyApp(t, webServiceOpts)
defer terraform.Destroy(t, proxyAppOpts)
terraform.InitAndApply(t, proxyAppOpts)
validate(t, proxyAppOpts)
}
6. At the end of the test, undeploy the
proxy app and the web service
135. func TestProxyApp(t *testing.T) {
t.Parallel()
// The rest of the test code
}
func TestHelloWorldAppUnit(t *testing.T) {
t.Parallel()
// The rest of the test code
}
Enable test parallelism in Go by adding
t.Parallel() as the 1st line of each test.
136. $ go test -v -timeout 15m
=== RUN TestHelloWorldAppUnit
=== RUN TestDockerKubernetes
=== RUN TestProxyApp
Now, if you run go test, all the tests
with t.Parallel() will run in parallel
138. resource "aws_iam_role" "role_example" {
name = "example-iam-role"
}
resource "aws_security_group" "sg_example" {
name = "security-group-example"
}
Example: module with hard-coded IAM
Role and Security Group names
139. resource "aws_iam_role" "role_example" {
name = "example-iam-role"
}
resource "aws_security_group" "sg_example" {
name = "security-group-example"
}
If two tests tried to deploy this module
in parallel, the names would conflict!
141. resource "aws_iam_role" "role_example" {
name = var.name
}
resource "aws_security_group" "sg_example" {
name = var.name
}
Example: use variables in all resource
names…
142. uniqueId := random.UniqueId()
return &terraform.Options{
    TerraformDir: "../examples/proxy-app",
    Vars: map[string]interface{}{
        "name": fmt.Sprintf("test-proxy-app-%s", uniqueId),
    },
}
At test time, set the variables to a
randomized value to avoid conflicts
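As a stand-in for the idea behind random.UniqueId (not its actual implementation), the following sketch builds a name with a random suffix using only the standard library. uniqueName is a hypothetical helper, and the 8-character hex suffix is an arbitrary choice made for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueName appends a random 8-character hex suffix to prefix so that
// parallel test runs do not compete for the same resource name.
func uniqueName(prefix string) string {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(buf))
}

func main() {
	// Two calls yield two different names, each usable as the "name" variable.
	fmt.Println(uniqueName("test-proxy-app"))
	fmt.Println(uniqueName("test-proxy-app"))
}
```

Each test run passes its own generated name into the module, so two runs of the same test can deploy side by side in one account.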
146. 1. Deploy web-service
2. Deploy proxy-app
3. Validate proxy-app
4. Undeploy proxy-app
5. Undeploy web-service
When iterating locally, you sometimes
want to re-run just one of these steps.
147. 1. Deploy web-service
2. Deploy proxy-app
3. Validate proxy-app
4. Undeploy proxy-app
5. Undeploy web-service
But as the code is written now, you
have to run all steps on each test run.
148. 1. Deploy web-service (~3 min)
2. Deploy proxy-app (~2 min)
3. Validate proxy-app (~30 seconds)
4. Undeploy proxy-app (~1 min)
5. Undeploy web-service (~2 min)
And that can add up to a lot of
overhead.
150. webServiceOpts := configWebService(t)
defer terraform.Destroy(t, webServiceOpts)
terraform.InitAndApply(t, webServiceOpts)
proxyAppOpts := configProxyApp(t, webServiceOpts)
defer terraform.Destroy(t, proxyAppOpts)
terraform.InitAndApply(t, proxyAppOpts)
validate(t, proxyAppOpts)
The original test structure
151. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
The test structure with test stages
152. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
1. RunTestStage is a helper function
from Terratest.
153. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
2. Wrap each stage of your test with a
call to RunTestStage
154. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
3. Define each stage in a function
(you’ll see this code shortly).
155. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
4. Give each stage a unique name
156. stage := test_structure.RunTestStage
defer stage(t, "cleanup_web_service", cleanupWebService)
stage(t, "deploy_web_service", deployWebService)
defer stage(t, "cleanup_proxy_app", cleanupProxyApp)
stage(t, "deploy_proxy_app", deployProxyApp)
stage(t, "validate", validate)
Any stage foo can be skipped by
setting the env var SKIP_foo=true
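The skip mechanism can be approximated in a few lines of standard-library Go. This is only a sketch of the idea, not the source of test_structure.RunTestStage: runStage and setSkip are hypothetical helpers, and treating any non-empty value as "skip" is an assumption made here.

```go
package main

import (
	"fmt"
	"os"
)

// runStage runs fn unless the environment variable SKIP_<name> is set.
// It reports whether the stage actually ran.
func runStage(name string, fn func()) bool {
	if os.Getenv("SKIP_"+name) != "" {
		fmt.Printf("Skipping stage '%s'\n", name)
		return false
	}
	fmt.Printf("Running stage '%s'\n", name)
	fn()
	return true
}

// setSkip sets or clears SKIP_<name> for the current process.
func setSkip(name string, skip bool) {
	if skip {
		os.Setenv("SKIP_"+name, "true")
	} else {
		os.Unsetenv("SKIP_"+name)
	}
}

func main() {
	setSkip("deploy", true)
	setSkip("validate", false)
	runStage("deploy", func() { fmt.Println("deploying...") })
	runStage("validate", func() { fmt.Println("validating...") })
}
```

Because the decision is read from the environment on every run, the same test code can deploy once and then be re-run with only the stages you are currently working on.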
158. $ go test -v -timeout 15m -run TestProxyApp
Running stage 'deploy_web_service'…
Running stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (105.73s)
That way, after the test finishes, the
infrastructure will still be running.
160. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (14.22s)
This allows you to iterate on solely the
validate stage…
161. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
--- PASS: TestProxyApp (14.22s)
Which dramatically speeds up your
iteration / feedback cycle!
162. $ export SKIP_validate=true
$ unset SKIP_cleanup_web_service
$ unset SKIP_cleanup_proxy_app
When you’re done iterating, skip
validate and re-enable cleanup
163. $ go test -v -timeout 15m -run TestProxyApp
Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
Skipping stage 'validate'…
Running stage 'cleanup_proxy_app'…
Running stage 'cleanup_web_service'…
--- PASS: TestProxyApp (59.61s)
This cleans up everything that was left
running.
164. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
Note: each time you run test stages via
go test, it’s a separate OS process.
165. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
So to pass data between stages, one
stage needs to write the data to disk…
166. func deployWebService(t *testing.T) {
opts := configWebServiceOpts(t)
test_structure.SaveTerraformOptions(t, "/tmp", opts)
terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
opts := test_structure.LoadTerraformOptions(t, "/tmp")
terraform.Destroy(t, opts)
}
And the other stages need to read that
data from disk.
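The write-then-read handoff can be sketched with JSON files and the standard library. This illustrates the idea only and is not how SaveTerraformOptions and LoadTerraformOptions are implemented: stageData, saveStageData, loadStageData, mustTempDir, and the stage-data.json file name are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// stageData is a hypothetical subset of the options one stage hands to the next.
type stageData struct {
	TerraformDir string            `json:"terraform_dir"`
	Vars         map[string]string `json:"vars"`
}

// saveStageData writes data as JSON into dir so a later process can read it.
func saveStageData(dir string, data stageData) error {
	raw, err := json.Marshal(data)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "stage-data.json"), raw, 0o600)
}

// loadStageData reads back what saveStageData wrote.
func loadStageData(dir string) (stageData, error) {
	var data stageData
	raw, err := os.ReadFile(filepath.Join(dir, "stage-data.json"))
	if err != nil {
		return data, err
	}
	err = json.Unmarshal(raw, &data)
	return data, err
}

// mustTempDir creates a scratch directory for the example.
func mustTempDir() string {
	dir, err := os.MkdirTemp("", "test-stages")
	if err != nil {
		panic(err)
	}
	return dir
}

func main() {
	dir := mustTempDir()
	defer os.RemoveAll(dir)

	// The "deploy" stage saves its options...
	opts := stageData{TerraformDir: "../examples/proxy-app", Vars: map[string]string{"name": "test-proxy-app-abc123"}}
	if err := saveStageData(dir, opts); err != nil {
		panic(err)
	}
	// ...and a later "cleanup" stage, possibly in another process, loads them.
	loaded, err := loadStageData(dir)
	fmt.Println(loaded.TerraformDir, loaded.Vars["name"], err)
}
```

Persisting the options matters most when names are randomized: the cleanup stage has to destroy exactly the resources the deploy stage created, so it needs the same generated values.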
174. You could use the same strategy…
1. Deploy all the infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy all the infrastructure
175. But it’s rare to write end-to-
end tests this way. Here’s why:
182. Assume a single resource (e.g.,
EC2 instance) has a 1/1000
(0.1%) chance of failure.
183. The more resources your tests deploy, the flakier they will be.
Test type          # of resources  Chance of failure
Unit tests         10              1%
Integration tests  50              5%
End-to-end tests   500+            40%+
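The percentages follow from the 1/1000 assumption if each resource fails independently (an assumption added here for the calculation): the chance that at least one of n resources fails is 1 - (1 - p)^n. A short sketch, with chanceOfFailure as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"math"
)

// chanceOfFailure returns the probability that at least one of n resources
// fails, assuming each fails independently with probability perResource.
func chanceOfFailure(perResource float64, n int) float64 {
	return 1 - math.Pow(1-perResource, float64(n))
}

func main() {
	for _, n := range []int{10, 50, 500} {
		fmt.Printf("%3d resources: %.1f%%\n", n, 100*chanceOfFailure(0.001, n))
	}
}
```

This gives roughly 1% for 10 resources, just under 5% for 50, and just under 40% for 500, in line with the rounded figures in the table.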
184. You can work around the failure rate for unit & integration tests with retries
Test type          # of resources  Chance of failure
Unit tests         10              1%
Integration tests  50              5%
End-to-end tests   500+            40%+
186. Key takeaway: E2E tests from
scratch are too slow and too
brittle to be useful
193. Technique, strengths, and weaknesses:

Static analysis
Strengths:
1. Fast
2. Stable
3. No need to deploy real resources
4. Easy to use
Weaknesses:
1. Very limited in errors you can catch
2. You don’t get much confidence in your code solely from static analysis

Unit tests
Strengths:
1. Fast enough (1 – 10 min)
2. Mostly stable (with retry logic)
3. High level of confidence in individual units
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code

Integration tests
Strengths:
1. Mostly stable (with retry logic)
2. High level of confidence in multiple units working together
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Slow (10 – 30 min)

End-to-end tests
Strengths:
1. Build confidence in your entire architecture
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Very slow (60 min – 240+ min)*
4. Can be brittle (even with retry logic)*