Terraform at PlushCare

When I started at PlushCare in January 2019, everything was managed by engineers making changes as needed in the AWS Console or in other services’ management UIs. Today this is almost entirely done with code that is checked into GitHub and reviewed in its own repository, aptly named “infrastructure.”

CloudFormation vs Terraform #

The two biggest competitors we investigated as an infrastructure-as-code solution were AWS CloudFormation and HashiCorp Terraform. CloudFormation is the top pick for many organizations for creating and managing AWS resources. Terraform was the new kid on the block and was gaining a lot of traction.

With CloudFormation you upload a JSON template with parameters and it launches a stack, but changes made outside of the stack after it is created are not reconciled. If the template is properly parameterized you can update the stack by changing those parameters, but in my experience most changes end up being made in the AWS Console, which leads to configuration drift over time. The biggest drawback to CloudFormation is that if you would like to manage anything outside of AWS, you are out of luck.

Terraform manages the state of the infrastructure: you run the plan or apply commands to compare the current state with what is declared, and any drift between the two causes Terraform to propose or make changes so that reality matches the code. (Sometimes changes are made in the management console first, so we always verify that the changes Terraform proposes are what we want; if not, we update the Terraform code to match.)
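For example, the bucket and setting below are hypothetical, but they show the idea: if someone turns versioning off in the AWS Console, the next plan reports the attribute changing back, and an apply restores it.

resource "aws_s3_bucket" "reports" {
  # Hypothetical bucket used only to illustrate drift detection.
  bucket = "plushcare-example-reports"

  # If this gets disabled in the Console, `terraform plan` shows it being
  # set back to true and `terraform apply` makes it so.
  versioning {
    enabled = true
  }
}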

I had a few years of experience developing CloudFormation templates with Troposphere and a custom-built Python CLI at a previous company, but at PlushCare we wanted to manage resources outside of AWS, like Cloudflare. We picked Terraform over CloudFormation almost entirely for that reason.

Proof of Concept/Initial Steps #

Our first step in introducing Terraform was to write modules matching the existing AWS resources that engineering had created manually. Once a module was written we would bring the resources into Terraform with the terraform import command, then verify the import by confirming that terraform plan proposed no changes. The goal at this stage was to get as much as possible under Terraform’s management. The only change we made at this point was to add tags wherever possible to identify what was and was not managed by Terraform.
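To make that workflow concrete, here is a rough sketch (the security group name, VPC ID, and import ID are all made up) of writing a definition to match an existing resource and then importing it:

# A definition written to mirror a security group that already exists,
# plus the tag we add to mark it as Terraform-managed.
resource "aws_security_group" "web" {
  name        = "prod-web"                  # hypothetical existing values
  description = "Allow HTTPS from the load balancer"
  vpc_id      = "vpc-0123456789abcdef0"     # hypothetical VPC ID

  tags = {
    terraform = "true"
  }
}

# Pull the real resource into state, then confirm nothing would change:
#   terraform import aws_security_group.web sg-0123456789abcdef0
#   terraform plan    # should report no changes once the definition matches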

In almost all of our modules we have, at a minimum, a common_tags local value that looks like the code below:

locals {
  common_tags = {
    environment   = var.environment
    cost_category = var.cost_category
    terraform     = "true"
  }
}

On each resource we then merge the common tags with any resource-specific AWS tags:

resource "aws_security_group" "this" {
  ...

  tags = merge(
    {
      Name = "${var.environment}-${var.name}"
    },
    local.common_tags
  )
}
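For those snippets to work, the module also needs to declare the variables they reference. A minimal sketch of what that variables file might look like (the types and descriptions are my shorthand, not the exact module code):

variable "environment" {
  description = "Environment the resources belong to (dev, staging, prod, or global)"
  type        = string
}

variable "cost_category" {
  description = "Category used to break down costs by team or function"
  type        = string
}

variable "name" {
  description = "Base name used to build the Name tag"
  type        = string
}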

Rollout #

Where we are today #

After nearly two years of building and updating our infrastructure with Terraform, we have gotten to the point where almost all of our resources are managed in code.

A sample of our current Terraform repository directory tree is below. We have four environments: global, dev, staging, and prod. Global is the only special one: it holds resources that do not fit into one of the others, such as AWS CloudTrail, IAM, our Cloudflare configuration, and GitHub.

├── dev
│   ├── main.tf
│   ├── outputs.tf
│   ├── terraform.auto.tfvars
│   └── versions.tf
├── global
│   ├── aws
│   │   ├── cloudtrail.tf
│   │   ├── iam
│   │   │   ├── data.tf
│   │   │   ├── groups.tf
│   │   │   ├── main.tf
│   │   │   ├── policies.tf
│   │   │   ├── policy_documents.tf
│   │   │   ├── snapshot_lifecycle.tf
│   │   │   └── users.tf
│   │   ├── main.tf
│   │   ├── route53.tf
│   │   └── s3.tf
│   ├── backup.tf
│   ├── cloudflare
│   │   ├── audit-log.tf
│   │   ├── firewall.tf
│   │   ├── main.tf
│   │   ├── page-rules.tf
│   │   ├── rate-limits.tf
│   │   └── versions.tf
│   ├── domains.tf
│   ├── github.tf
│   ├── main.tf
│   ├── terraform.auto.tfvars
│   └── versions.tf
├── modules
│   ├── aws-ec2-instance
│   ├── aws-ecr
│   ├── aws-eks
│   ├── aws-eks-application-load-balancer
│   ├── aws-eks-node-group
│   ├── aws-eks-target-group
│   ├── aws-elasticcache
│   ├── aws-elasticsearch
│   ├── aws-iam
│   ├── aws-rds-mysql
│   ├── aws-rds-postgres
│   ├── aws-region
│   ├── aws-route53
│   ├── aws-route53-domain-redirects
│   ├── aws-s3-bucket
│   ├── aws-s3-bucket-website
│   ├── aws-vpc
│   ├── elasticbeanstalk-application
│   └── elasticbeanstalk-environment
├── prod
│   ├── main.tf
│   ├── outputs.tf
│   ├── terraform.auto.tfvars
│   └── versions.tf
├── roles
└── staging
    ├── main.tf
    ├── outputs.tf
    ├── terraform.auto.tfvars
    └── versions.tf

We also split our Terraform code into two logical directories: “modules” for reusable resources, and “roles” for resources that are generally launched once per environment or belong to a specific application or service.
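Putting the pieces together, an environment’s main.tf is mostly a series of module calls. A rough sketch (the inputs and values here are hypothetical) of what a call from dev/main.tf to the reusable aws-s3-bucket module could look like:

# dev/main.tf: instantiate a reusable module with this environment's values.
module "app_assets_bucket" {
  source = "../modules/aws-s3-bucket"

  name          = "app-assets"     # hypothetical bucket purpose
  environment   = "dev"
  cost_category = "engineering"    # hypothetical cost category
}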

What’s next #

Right now about 90% of the code in each environment is duplicated. The differences are mostly in our production environment, where we create some additional resources for our data and marketing teams. Over the next few months we will look at moving the data and marketing resources into their own top-level environments and removing the remaining differences between dev, staging, and prod. We would then like to merge dev, staging, and prod into a single configuration, with the differences managed through variables. We are not yet sold on a solution for this, since the Terraform documentation recommends against using workspaces for separate environments.
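As a sketch of what managing the differences through variables could look like (the variable and bucket below are hypothetical, not something we have settled on), prod-only resources can be gated behind a boolean and a count:

# Only created when an environment's tfvars sets this to true (prod).
variable "create_data_resources" {
  type    = bool
  default = false
}

resource "aws_s3_bucket" "data_reports" {
  # Hypothetical data-team bucket; count keeps it out of dev and staging.
  count  = var.create_data_resources ? 1 : 0
  bucket = "plushcare-data-reports-example"
}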