Cloud misconfiguration is the number one cause of data breaches involving public cloud services such as those offered by Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. According to Neil MacDonald at Gartner, “nearly all successful attacks on cloud services are the result of customer misconfiguration, mismanagement and mistakes.”
Cloud Security Posture Management (CSPM) emerged to address this risk by helping cloud customers make sure their cloud environments are configured securely. CSPM has primarily consisted of “scan and alert” tools that look for “known-bad” misconfigurations while still relying on humans to review alerts, prioritize critical issues, and remediate them. But when bad actors are using automation to find cloud misconfigurations to exploit, humans are too slow and error-prone to stop them.
To address this gap, CSPM has turned to automated remediation of cloud misconfiguration, an approach that is entering mainstream enterprise adoption. Automated remediation (sometimes referred to as "guardrails") helps reduce human errors and can bring the Mean Time to Remediation (MTTR) for cloud misconfiguration down to minutes from hours or days. Finally, there’s a cost driver: manual remediation is an enormous drag on resources.
Two methods of automated remediation for cloud misconfiguration
Let’s take a close look at two methods for automated remediation of cloud misconfiguration and evaluate them against some universal goals we have with CSPM, as well as some universally bad things no one wants to experience in cloud security and operations.
Method 1: Automation scripts and bots
When faced with the never-ending tedious task of managing cloud misconfigurations, engineers have frequently turned to using automation scripts to correct problems. While the implementation details can differ, the fundamentals of how they work are the same. A monitoring tool identifies a problem and creates an alert, which triggers a script or workflow that executes a change via the cloud service provider (CSP) APIs. Typically these scripts or workflows leverage serverless functions, such as AWS Lambda or Azure Functions.
Automated remediation scripts or workflows are also referred to as bots, cloud bots, or “lambdas”.
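To make the alert-to-API-call flow described above concrete, here is a minimal sketch of one such bot. The alert payload and its field names (`finding`, `group_id`) are assumptions made for illustration, not any monitoring tool's schema. The only cloud call used is `revoke_security_group_ingress`, which exists on the boto3 EC2 client; the client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any, Protocol


class Ec2Client(Protocol):
    """The single EC2 call this sketch needs (same name as on boto3's EC2 client)."""

    def revoke_security_group_ingress(self, **kwargs: Any) -> Any: ...


def handle_alert(alert: dict, ec2: Ec2Client) -> bool:
    """Remediate one known-bad finding: SSH open to the world.

    `alert` is a hypothetical payload; its keys are assumptions.
    Returns True if a remediation call was made.
    """
    if alert.get("finding") != "SSH_OPEN_TO_WORLD":
        return False  # this bot knows exactly one misconfiguration
    ec2.revoke_security_group_ingress(
        GroupId=alert["group_id"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )
    return True
```

Inside an AWS Lambda function, this would typically be called as `handle_alert(event, boto3.client("ec2"))`. Note that the bot carries its own hard-coded idea of the target state and handles only the one finding it was written for.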
Method 2: Infrastructure baselining
A cloud infrastructure baseline is a snapshot of a cloud environment—a complete picture of all of the resources you have running and every configuration attribute. Validated against compliance and security policies, a baseline serves as a “contract” between cloud teams (i.e. DevOps; security; developers). It is a single source of trust as to what is “known good”—a “gold image” for your cloud environment.
By continuously surveying the cloud environment, we can detect “drift” from the established baseline, which is helpful to understand change and identify potentially dangerous misconfiguration events. For security-critical resources, we can enforce the baseline by automatically remediating cloud configuration drift back to the established baseline. This provides everyone with a single source of truth for what is running in their cloud environment.
Because baseline enforcement makes cloud resource configurations self-healing without the need for bespoke automation scripts or workflows, we consider it to be autonomous remediation.
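As a rough illustration of drift detection, the sketch below compares a baseline snapshot with a later survey. It assumes each snapshot is a plain dictionary mapping a resource ID to its configuration attributes; the resource names and attributes are made up, and a real tool would build these snapshots from the cloud provider's APIs.

```python
def detect_drift(baseline: dict, current: dict) -> dict:
    """Compare two snapshots keyed by resource ID.

    Each snapshot maps a resource ID to a dict of configuration attributes.
    """
    added = sorted(current.keys() - baseline.keys())
    removed = sorted(baseline.keys() - current.keys())
    modified = {}
    for rid in sorted(baseline.keys() & current.keys()):
        old, new = baseline[rid], current[rid]
        # Record (baseline value, current value) for every attribute that differs.
        changed = {
            attr: (old.get(attr), new.get(attr))
            for attr in sorted(old.keys() | new.keys())
            if old.get(attr) != new.get(attr)
        }
        if changed:
            modified[rid] = changed
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots for illustration only.
baseline = {
    "bucket-logs": {"encryption": "aws:kms", "public": False},
    "sg-web": {"ingress": ["tcp/443"]},
}
current = {
    "bucket-logs": {"encryption": "aws:kms", "public": True},  # drifted
    "sg-web": {"ingress": ["tcp/443"]},
    "bucket-tmp": {"encryption": None, "public": False},  # not in the baseline
}
drift = detect_drift(baseline, current)
```

Here `drift["modified"]` flags that `bucket-logs` became public and `drift["added"]` lists the unexpected `bucket-tmp`, without either case having been predicted in advance.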
Positive Criteria: Things we want with CSPM
Establishing trust between teams
The DevOps (and DevSecOps) movement is as much about culture and process as it is about tools and technology. But the two aren’t mutually exclusive. The right tools and technology can help teams bridge chasms, build trust, and improve collaboration.
But getting developers on the same page with security teams is an age-old challenge. The former are under pressure to innovate fast and ship often and the latter are under pressure to protect data and ensure compliance. This creates friction and tradeoffs between velocity and security.
Baselining helps here because all stakeholders share the same source of trust in “known goodness” for cloud environments, and truth in what’s actually running in their cloud environment. Developers can move faster safely because they understand what’s known-good, and security and compliance teams have a basis for understanding change, identifying issues, securing critical resources, and ensuring compliance.
With scripts and bots, there is no single source of truth or trust. Employing this approach at scale can require hundreds of different scripts or workflows, each with its own logic and target configuration state. In this scenario, each bot becomes a different source of distrust, especially when it delivers surprises.
For example, a bot operated by the Security team may take an action as simple as enabling HTTPS on an Elastic Load Balancer and disabling HTTP to enforce encryption requirements. But if the Security team isn’t aware of a legacy application that still only supports HTTP, this bot will cause an application outage, deepening distrust between teams.
⇒ Advantage: baselining
“Shift Left” on cloud security and compliance
These days, more cloud teams are implementing a “Shift Left” on security and compliance, and this is awesome. When we identify security and compliance problems earlier in the software development life cycle (SDLC), it makes correcting them faster and cheaper than when we discover them later on after a lot more work has been completed.
Baselining gives developers a complete picture of known-good cloud environments. Proposed changes to infrastructure environments can be compared to the production baseline to understand what will change. Once changes are applied, we can check drift events to validate that the change occurred as intended (and nothing else).
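One way to picture the "as intended (and nothing else)" check is sketched below. It assumes three snapshots in the same hypothetical shape (resource ID mapped to attributes): the production baseline, the proposed configuration, and a survey taken after deployment. All function names here are illustrative.

```python
def flatten(snapshot: dict) -> dict:
    """Turn {resource: {attr: value}} into {(resource, attr): value}."""
    return {
        (rid, attr): value
        for rid, attrs in snapshot.items()
        for attr, value in attrs.items()
    }


def changes(before: dict, after: dict) -> dict:
    """Map each changed (resource, attr) pair to its (old, new) values."""
    b, a = flatten(before), flatten(after)
    return {
        key: (b.get(key), a.get(key))
        for key in b.keys() | a.keys()
        if b.get(key) != a.get(key)
    }


def unintended_changes(baseline: dict, proposed: dict, actual: dict) -> dict:
    """Changes observed after deployment that the proposal did not include."""
    planned = changes(baseline, proposed)
    observed = changes(baseline, actual)
    return {key: val for key, val in observed.items() if planned.get(key) != val}


# Hypothetical snapshots for illustration only.
baseline = {"bucket": {"versioning": False}, "sg-web": {"ingress": ("tcp/443",)}}
proposed = {"bucket": {"versioning": True}, "sg-web": {"ingress": ("tcp/443",)}}
actual = {"bucket": {"versioning": True}, "sg-web": {"ingress": ("tcp/443", "tcp/22")}}

surprises = unintended_changes(baseline, proposed, actual)
```

In this example, `changes(baseline, proposed)` is the preview a developer would review before deployment, and `surprises` contains only the unplanned SSH rule on `sg-web`. The sketch does not report planned changes that failed to apply; that would be a separate comparison of `proposed` against `actual`.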
Scripts and bots are solely focused on finding and fixing known-bad misconfigurations in production environments and provide no useful mechanism for a “Shift Left” on cloud security and compliance.
⇒ Advantage: baselining
Eliminate cloud misconfiguration
This is the whole point of automated remediation, right? Cloud resources can drift and create dangerous misconfiguration vulnerabilities. We want automation tools that can find these misconfigurations and fix them before the bad guys do.
Because a baseline is a snapshot—a complete picture—of a cloud environment and all configuration attributes, we can detect any drift from the baseline by continuously re-surveying the environment and comparing it to the baseline. When we enforce the baseline for security-critical resources (e.g. an S3 bucket containing sensitive data; a VPC network), we are reverting the resource’s configurations back to the established baseline when drift is detected for those resources.
This is a good time to note that not all bad drift is a security issue. Sometimes drift can cause system downtime events, such as when a legitimate Security Group port is mistakenly deleted and the application goes down. Baselining will detect this drift event and enforcement can put it back, restoring the application automatically.
Scripts and bots address misconfiguration automatically, but this approach of making automated changes to production infrastructure has at least three significant disadvantages:
- Lack of context. Bots do not have knowledge of prior state in order to restore what was deleted. If someone removes resource tags or Security Group rules, a bot cannot put things back to their original state. At best, a bot can try to infer what tags or Security Group rules should exist, based on contextual factors such as account ID or region. But given the diversity and complexity of enterprise environments, these inferences are likely to be wrong.
- Complexity at scale. Making configuration changes to production environments without knowledge of prior state involves complex challenges. For example, a change as simple as enabling VPC flow logging requires specifying an S3 bucket to store the logs in and an IAM policy or S3 bucket policy to restrict access to that bucket. Unless you want to use a single S3 bucket and a single IAM or S3 bucket policy for every VPC in the enterprise (very unlikely), you will need to map every VPC to its own remediation parameters: an S3 bucket and an IAM or S3 bucket policy. And this is only for a single remediation action on a single resource type. If you want to automate many different types of remediation actions, such as specifying Customer Managed Keys for encrypting data or choosing which availability zones to use for auto scaling groups, the complexity of managing these parameters multiplies with every action and resource type.
- Bot proliferation. Using bots to address misconfiguration requires a different bot for each type of misconfiguration. I talk later in this post about the disadvantages of “blacklist” approaches to remediation, but suffice it to say that creating a separate bot for every possible type of misconfiguration is an unwieldy task, made even more complex by the need to manage remediation parameters as mentioned above.
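The parameter-mapping burden described in the list above can be sketched as follows. The VPC IDs, bucket names, and policy names are all hypothetical; the point is only that a flow-logging bot needs a hand-maintained entry for every VPC it might touch, and has nothing to act on when an entry is missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowLogParams:
    """Remediation parameters one VPC needs before flow logging can be enabled."""
    log_bucket: str
    bucket_policy: str


# Hypothetical mapping that the bot's owners must curate and keep current.
FLOW_LOG_PARAMS = {
    "vpc-payments": FlowLogParams("payments-flow-logs", "payments-logs-policy"),
    "vpc-analytics": FlowLogParams("analytics-flow-logs", "analytics-logs-policy"),
}


def plan_flow_log_fix(vpc_id: str) -> FlowLogParams:
    """Look up where a VPC's flow logs should go, or fail loudly if nobody said."""
    try:
        return FLOW_LOG_PARAMS[vpc_id]
    except KeyError:
        raise LookupError(
            f"no flow-log remediation parameters for {vpc_id}"
        ) from None
```

Every additional remediation type (encryption keys, availability zones, and so on) would need its own table like `FLOW_LOG_PARAMS`, which is where the maintenance cost accumulates.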
Ensure continuous compliance
Compliance frameworks like HIPAA, PCI, NIST 800-53, SOC 2, GDPR, and ISO 27001 all address how cloud infrastructure shall be configured and used. We need to ensure our cloud resources adhere to policy on day one, and have procedures in place to ensure they stay compliant on day two.
We establish a baseline once our cloud environment has been validated for compliance (including any agreed-upon exceptions or waivers). Therefore, when we enforce the baseline for security-critical cloud resources that have been validated against policy, we’re effectively enforcing compliance for those resources. Where we aren’t using baseline enforcement, we’re detecting drift, which can help compliance and DevOps teams identify issues that need to be addressed. With baselining, we have full and continuous visibility into the security and compliance posture of our cloud environment.
Bots can contribute somewhat to compliance assurance by automatically correcting certain misconfigurations that take cloud resources out of compliance. However, bots don’t provide us with a complete picture of the security and compliance posture of our cloud environment.
⇒ Advantage: baselining
Negative Criteria: Things we don’t want with CSPM
Overhead and maintenance burdens
Overhead and maintenance burdens are a fact of life with just about everything in IT, but our goal is to minimize them wherever possible, without making undue tradeoffs.
Baselining requires a solution that can 1) work against the cloud provider’s APIs to gather all information about the resources running and how they’re configured, 2) baseline the environment (i.e., take a configuration snapshot), 3) compare information from subsequent surveys to detect drift, and 4) revert drift back to the baseline where needed. With such a solution, there’s no code to write and no scripts or bots to customize, maintain, and scale, so the overhead for managing such a solution is minimal (I’ll note here that Fugue does this).
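The four steps above can be sketched as a single enforcement pass. This is an illustration under stated assumptions, not how any particular product implements it: `InMemoryCloud` is a hypothetical stand-in for the provider's APIs, and `enforced_ids` is an assumed way of marking which resources are security-critical.

```python
import copy


class InMemoryCloud:
    """Hypothetical stand-in for a cloud provider's APIs."""

    def __init__(self, resources: dict):
        self.resources = resources

    def survey(self) -> dict:
        """Step 1: gather every resource and its configuration."""
        return copy.deepcopy(self.resources)

    def apply(self, rid: str, config: dict) -> None:
        """Step 4: write a configuration back to a resource."""
        self.resources[rid] = copy.deepcopy(config)


def enforce(cloud, baseline: dict, enforced_ids: set) -> tuple:
    """Steps 3 and 4: detect drift, revert it only for enforced resources.

    Returns (reverted, reported). Resources absent from the baseline are
    ignored here to keep the sketch short.
    """
    current = cloud.survey()
    reverted, reported = [], []
    for rid, expected in baseline.items():
        if current.get(rid) == expected:
            continue
        if rid in enforced_ids:
            cloud.apply(rid, expected)
            reverted.append(rid)
        else:
            reported.append(rid)
    return reverted, reported


cloud = InMemoryCloud({
    "sg-app": {"ingress": ["tcp/443", "tcp/8443"]},
    "vm-batch": {"size": "small"},
})
baseline = cloud.survey()  # steps 1 and 2: survey, then keep the snapshot

cloud.resources["sg-app"]["ingress"].remove("tcp/8443")  # rule deleted by mistake
cloud.resources["vm-batch"]["size"] = "large"            # change to an unenforced resource

reverted, reported = enforce(cloud, baseline, enforced_ids={"sg-app"})
```

After the pass, the deleted rule on `sg-app` is back in place and the change to `vm-batch` is reported as drift for a human to review. The target state comes entirely from the snapshot, so no per-misconfiguration logic is involved.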
A single script or bot designed to remediate a specific AWS Security Group rule misconfiguration can exceed 300 lines of code or require bespoke customization using a UI workflow. Rinse and repeat for every misconfiguration event we want to protect against, and then scale that to protect multiple environments, regions, and cloud accounts for which every bot must be customized. Adopting this method means taking on an ever-growing engineering project that requires a team of cloud automation experts to build and maintain it all.
⇒ Advantage: baselining
Alert fatigue and human error
One of the key drivers of adopting automated remediation is to eliminate alert fatigue and human error associated with the manual identification, prioritization, and remediation of cloud misconfiguration.
Baselining enables us to identify drift from our established baseline, which avoids the false positives and noise that can contribute to alert fatigue and associated human error. By definition, any change from a baseline is suspect because it was made outside of the approved deployment workflow such as a CI/CD pipeline. There’s no need to burn calories determining whether what a scan and alert tool deems problematic actually is or not. Our focus moves to making security-critical resources self-healing with baseline enforcement.
With scripts and bots, any automation action taken should be reviewed to ensure that it did not damage the environment by making a change that resulted in an application outage. The risk of human error with bots can be significant because what the security team deems a “safe” target remediation state may or may not be what the application requires, which brings risk of downtime.
⇒ Advantage: baselining
Predictions and blacklists
Both blacklisting and whitelisting approaches have been employed in IT security for decades, and both have their place. Blacklists can be effective in protecting against known exploits like certain malware, but cloud security is focused on the configuration of resources. And with at-scale cloud environments, the number of configurations and combinations of configurations is, for all practical purposes, infinite. An over-reliance on blacklists for addressing cloud misconfiguration is certain to miss critical vulnerabilities.
Baselines enable a whitelist approach to cloud security. Once a cloud environment is validated against policy, it is baselined. This becomes your whitelist of what can and should be running—it’s your gold image for your cloud infrastructure environment. Any drift from the baseline can be reviewed and, where necessary, autonomously remediated back to the baseline. There’s no need to predict every possible misconfiguration.
Scripts and bots are a blacklist approach to cloud security. Every bot deployed addresses a specific misconfiguration you don’t want to happen. Can your team cover every possible risky misconfiguration with individual scripts or bots? Will your bots correct for disabled encryption or logging, or a deleted Security Group rule? How many bots can your team customize and maintain?
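The difference between the two approaches can be shown with a small sketch. The rule names and configuration attributes below are hypothetical. The blacklist checks only for patterns someone thought to write down, while the whitelist check compares against the validated baseline.

```python
# Blacklist: one check per known-bad pattern (hypothetical examples).
KNOWN_BAD = {
    "public_bucket": lambda cfg: cfg.get("public") is True,
    "ssh_open": lambda cfg: "tcp/22" in cfg.get("ingress", []),
}


def blacklist_findings(config: dict) -> list:
    """Names of the known-bad patterns this configuration matches."""
    return sorted(name for name, is_bad in KNOWN_BAD.items() if is_bad(config))


def deviates_from_baseline(baseline_cfg: dict, current_cfg: dict) -> bool:
    """Whitelist: anything that differs from the validated baseline is flagged."""
    return baseline_cfg != current_cfg


baseline_cfg = {"public": False, "logging": True, "ingress": ["tcp/443"]}
current_cfg = {"public": False, "logging": False, "ingress": ["tcp/443"]}  # logging disabled

missed = blacklist_findings(current_cfg)
caught = deviates_from_baseline(baseline_cfg, current_cfg)
```

Nobody wrote a blacklist rule for disabled logging, so `missed` comes back empty, while `caught` is true because the configuration no longer matches the baseline.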
⇒ Advantage: baselining
Unanticipated Application Downtime
There’s been long-running resistance in IT to security automation, but the growing adoption of automated remediation of cloud misconfiguration shows that the risks are serious enough to override this resistance. That said, this resistance is well-founded, as automated remediation has a long history of invoking destructive change and application downtime. It’s good to avoid this when choosing an approach.
Baselining removes the risk of inadvertent destructive changes because it reverts misconfiguration back to the established baseline. This approach helps build trust in security automation with developers and DevOps teams. As stated earlier, baseline enforcement can aid in system reliability, such as when it restores a Security Group rule required by the application but mistakenly deleted.
Because scripts and bots operate independently of the software development life cycle, they can come into conflict with what developers provision and what the application needs to function. When false positives trigger those bots to invoke changes to the cloud environment, there’s a decent chance someone’s going to have a bad day. With dozens or hundreds of bots in use, it’s impossible to understand all the potential outcomes resulting from actions they may take.
⇒ Advantage: baselining
Conclusion: Baselining is the superior approach to automated remediation for cloud misconfiguration
When it comes to Cloud Security Posture Management automation, baselining, including drift detection and baseline enforcement, has a number of advantages over automation scripts and bots. Whether it’s eliminating cloud misconfiguration, building trust between teams, shifting left, or avoiding maintenance burdens and app downtime, baselining is the clear winner.
Now, I know what you’re going to say: of course we think baselining is better because that’s what Fugue offers. It’s true, we’re biased. But if you disagree with any of the points made here, I’d love to hear from you. My email is drew at fugue dot co.
And if your team is using scripts or bots, or you are considering it, give us a half hour to show you how much better life can be with baselining.