Platform Engineers are Breaking Under Terraform Complexity - There's a Better Way

(Estimated Read Time: 10 minutes)

Nov 19, 2024

Imagine you have 200 applications to migrate to the cloud (or even 20). Applications like these have often grown into Frankenstein's monsters, morphing into what they are today with little strategic planning for infrastructure needs. Sure, a lift-and-shift approach can move everything off your hardware and into the cloud (until the first bill comes), but most businesses want to:

Minimize security risks
Manage costs
Have systems scale up/down as needed
Modernize deployment processes
Have robust monitoring

Even if the business does not mention these explicitly, it's what the cloud is thought of and sold as, and it falls on platform engineering teams to deliver, since none of it happens magically just by moving to the cloud.

To achieve this consistently, the team must have a repeatable (but flexible) process to promote applications between environments. In the cloud, because it is easy to obtain compute resources with seemingly endless options and few limits, configuring them through point-and-click menus can result in highly varied environments that cost far more than anticipated. To prevent this, Infrastructure as Code (IaC) allows teams to provision resources using a standardized process. With IaC, the cost of environments can be controlled because they are thought out and built consistently, and speed improves because environments can be configured with effectively the click of a button, thereby reducing risk. Scale that across the number of applications and their associated environments, and IaC starts to become fundamental to leveraging all the benefits of being in the cloud.

There are many providers of this capability, but Terraform is the major player in this space. Its widespread adoption stems from several key advantages:

It works across all major cloud providers (including Azure and AWS)
It has the largest and most active user community
It is a de facto standard that most engineers and teams learn (even in school)

Sure, there are provider-specific tools from companies like Microsoft and Amazon (Azure's Bicep and AWS CloudFormation), but it would be short-sighted for an engineering team to adopt those, as it would limit future growth and integration.

So what does a typical Terraform practice look like in most organizations?

The team has to ramp up its ability to deploy quickly, because modernization programs always need to be done fast
The team just starts building and doesn't have time to adopt any structure or best practices, resulting in:
Massive Terraform modules (super modules) that can take too long to deploy
Only one or two modules with everything in them, so everyone has access to make changes
Code that is harder to test
A lot of repetitive tasks, troubleshooting and struggles to deploy code
No cleanup, as teams are focused on pushing new features and services

If this sounds like the same set of problems teams already have in their on-premises environments, or a new set of problems that will just cost more money, the question becomes: why bother? Just configure manually and deal with what we know.

Addressing The Issues

As you scale, Terraform becomes harder to manage. The truth is that, as with anything, a little planning and some tooling (in that order) can help right the ship, even if you are already down this path.

Terraform Module Design

Modules should be broken down based on several factors including:

Who's responsible
How often it is being deployed
Ephemeral (short-term) vs. long-term environments
Logical groupings
Risk-related changes (e.g. a networking/security change may be higher risk and should be in a different module from frequently changed application resources)

The key takeaway is to have the development, infrastructure and platform engineering teams run whiteboard sessions that map out all environments and their components, and then assign each component to a module.

This separation helps to minimize the blast radius of changes, enables fine-grained access control for different teams, and allows for different testing and approval workflows based on the sensitivity of the infrastructure being managed.
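As a rough sketch of that separation, assume the networking resources live in their own long-lived module with its own state, and a frequently deployed application module only reads its outputs. The file path, bucket name, state key, region, variable and output name (`private_subnet_ids`) below are hypothetical placeholders, not taken from a real setup:

```hcl
# app/main.tf (hypothetical path): frequently deployed, lower-risk module.

variable "ami_id" {
  description = "AMI to launch for the application (placeholder input)."
  type        = string
}

# Read-only view of the state owned by the separate network module.
# Assumes that module stores its state in S3 and declares an output
# named private_subnet_ids.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # The app module consumes the subnet but cannot modify it from here.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}
```

Because the two modules keep separate state, a change to the application never plans or applies against the network resources, which is where the smaller blast radius comes from.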

Core Terraform Issues

While the design practices above will make it easier for you to manage Terraform modules, there are still underlying issues that need to be addressed, namely:

Configurations will still have a lot of repetition
Remote state management will still be complex
Code may still get overwritten
You will still need to pick and choose Terraform modules for each environment, which means maintaining a lot of variations
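To make the repetition and state-management points concrete, here is roughly what plain Terraform tends to require. Backend blocks cannot reference variables or locals, so each module in each environment carries its own near-identical copy. The file paths, bucket name and lock table name are hypothetical:

```hcl
# dev/app/backend.tf
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "dev/app/terraform.tfstate" # the only line that differs
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"
  }
}

# prod/app/backend.tf (a separate file; the same block copied again)
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/app/terraform.tfstate" # the only line that differs
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"
  }
}
```

Partial backend configuration (`terraform init -backend-config=...`) can reduce some of this, but the values then have to be passed consistently on every init.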

For a long time, I thought this was just part and parcel of working with Terraform, until I was introduced to Terragrunt.

What is Terragrunt?

Terragrunt is an open-source wrapper for Terraform that allows the team to:

Keep Terraform configurations DRY (Don't Repeat Yourself) by defining shared configurations that can be applied across multiple Terraform modules
Use features like remote state management, input variable validation, and automated dependency management
Run Terraform commands across multiple Terraform modules with a single Terragrunt command
Apply best practices like ensuring consistent environment configurations across different stages (e.g. dev, staging, prod)
Integrate with Terraform Cloud/Enterprise to manage remote state and other cloud-based resources
Bonus: Even the configuration files are written in HashiCorp's HCL format (with all of the benefits it brings)
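A minimal sketch of how the first few points can look in practice, assuming a repository laid out as `root.hcl`, `modules/app`, `dev/network` and `dev/app`. The file names, bucket, lock table and the `private_subnet_ids` output are hypothetical. The shared file defines the state backend once:

```hcl
# root.hcl (hypothetical file at the top of the repository)
# Terragrunt generates a backend.tf for every unit that includes this file.
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-terraform-state"
    # Each unit gets its own state key derived from its folder path.
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"
  }
}
```

Each environment folder then only declares what is specific to it:

```hcl
# dev/app/terragrunt.hcl (hypothetical unit)
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  # Reusable Terraform module shared by all environments.
  source = "../../modules/app"
}

# Makes the network unit's outputs available and orders it before this unit.
dependency "network" {
  config_path = "../network"
}

inputs = {
  environment        = "dev"
  private_subnet_ids = dependency.network.outputs.private_subnet_ids
}
```

From the `dev` folder, a single command such as `terragrunt run-all apply` (or `terragrunt run --all apply` in newer releases) would then apply every unit in dependency order.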

While Terragrunt is open source, it was developed by Gruntwork, a DevOps company with significant experience using Terraform (it's part of their core business), as a way to solve their own problems. The project is MIT licensed, has over 200 contributors, and is seeing strong growth in usage and downloads, which gives me confidence that it will stick around in the long term.

But wait...wasn't HashiCorp acquired by IBM and Terraform will no longer be open source?!

In 2023, HashiCorp announced that it was changing Terraform's license from the Mozilla Public License 2.0 (MPL 2.0) to the Business Source License (BSL), which means new releases are no longer open source.

This year, IBM announced an agreement to acquire HashiCorp. Was the license changed with IBM's integration plans in mind? Time will tell.

For now we have the following implications of the license change:

It's a time-limited license that converts to open source (MPL 2.0) after 4 years
It prohibits using Terraform to build products or services that compete with HashiCorp's (for Terragrunt this might cause issues)

IBM's answer may be to push Terraform Cloud or other paid services to replace what Terragrunt would do for the team. With that come additional costs, vendor lock-in and other risks that a very stable open-source platform such as Terraform has long operated without.

The good news is that Terragrunt is compatible with OpenTofu (an open-source fork of Terraform), and there is an easy migration path from Terraform to OpenTofu (but this is a topic for a different day).

So, what's the catch?

I am always skeptical when someone pitches a new tool that will solve a problem, because what I have seen in most cases is just tool sprawl with no defined benefits and objectives.

My goal when helping implement tools for any client or project is to see the ROI, and Terragrunt delivers it in several ways:

If your team is new to Terraform (and struggling), adding one more tool will be a struggle at first, but the benefits within 3-9 months will outweigh the pain they are going through now
Avoiding a paid tool like Terraform Cloud means you don't have to worry about cost overruns (for management beyond the free tiers) or vendor lock-in just to use an open-source IaC tool (Terraform)
It bolts on to your existing repository

The best time to implement Terragrunt is at the start of your IaC adoption. The second best time is wherever you are in your journey, unless your team already has mature custom processes that resolve these problems today.
