In this post, Kleopatra Chatziprimou, Cloud Systems Developer at Cloudreach, shares some tips on building cloud-native infrastructures.
An important part of my job as Cloud Systems Developer at Cloudreach is to keep up to date with the latest trends in cloud-native systems design. In November 2018, I attended the O’Reilly Velocity Conference, which featured speakers sharing best practices and lessons learned on building cloud-native infrastructure, or Infrastructure-as-Code (IaC). The event got me thinking about how IaC spans much more than server racks, network and storage. It is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to reliably deliver applications and services at high velocity and low cost. In this article, I will discuss challenges around building evolutionary infrastructure and provide a best practice methodology to get started with IaC, including continuous environment testing, infrastructure automation, configuration management and more. Let’s get to it!
Challenges in infrastructure design
Modern distributed systems are inherently complex, spanning multiple technologies, groups, and sometimes different organizations altogether. Most importantly, they can fail in unexpected ways. The adoption of dynamic infrastructure technologies like the Cloud, containers, and serverless makes it easy to provision, configure, update, and maintain systems, and therefore facilitates continuous quality improvements. IaC can make systems consistent, reliable, and easy to manage, but a codebase can easily become complicated, fragile, and messy, and hence impossible to change or improve. The ease of provisioning new infrastructure can lead to an ever-growing portfolio of systems, and it takes an ever-increasing amount of time just to keep everything from collapsing. Managing changes in a way that improves consistency and reliability doesn’t come out of the box with IaC. In order to routinely change, extend, and improve infrastructure, teams need to have confidence that changes will work correctly and that the impact of failures is low and easily corrected. Performant infrastructure can only be delivered by building on the industry's core principles and key practices.
A recommended methodology to get started with IaC
Moving at high velocity allows customers to innovate faster, adapt to changing markets, and grow more efficiently. To achieve this, engineering teams must choose a development approach in which a low proportion of time is spent on fixing issues and more is invested in adding value. A common development approach, mostly observed in startups, is to prioritize speed over correctness, often resulting in fast failures at low cost. The opposite approach, typical of large financial corporations, prioritizes correctness over speed at a high cost. However, the resulting long lead times and complex processes eventually end up posing barriers to quality. In such scenarios, both the list of known issues and the time pressure for new releases keep growing.
Having dealt with both extremes described above at Cloudreach, I am an advocate of building minimum viable products (MVPs), releasing frequently to production, and adopting test-driven development with frequent tests and fast feedback from clients. This way, engineers can progress quickly and ensure the correctness of systems in controlled development cycles. Delivering MVPs is possible only when the customer, development, and operations teams are no longer “siloed” but merged into a single team where the engineers work across the entire application lifecycle, from development and test to deployment and operations, and develop a range of skills not limited to a single function.
A cloud-native technology stack and tooling can help teams operate and evolve applications quickly and reliably. These tools also help engineers independently accomplish tasks (for example, deploying code or provisioning infrastructure) that would normally have required help from external teams, which further increases a team’s velocity. The standard cloud tooling categories are listed below:
- Application Development: IDE, Build Tool, Unit testing.
- Application Packaging: Docker, WAR, JAR, RPM, .deb.
- Application Runtime Platforms: PaaS, Serverless, Container Orchestration.
- Change Delivery: Jenkins, Pipeline Orchestration, Static Code Analysis.
- Operational Management: Monitoring tools.
- Dynamic Infrastructure: Cloud APIs, Virtualisation, Containers, Serverless.
- Pool of resources: Physical infrastructure (compute, network, storage).
- Stack Management Tooling: Terraform, CloudFormation, GCP Deployment Manager / Azure Resource Manager, Sceptre.
- Server Configuration tools: Ansible, Chef, Puppet, SaltStack.
- Server Image Creation: Packer.
At Cloudreach, we use the full stack to develop end-to-end products: to build infrastructure-agnostic applications, to implement automated infrastructure deployment, testing and validation, and to design pipelines that package and deploy changes across environments. We also handle configuration in systems where infrastructure is dynamically rebuilt, expanded, and collapsed. I like using Sceptre, a tool developed internally at Cloudreach, which allows customers to provision, modify, and destroy CloudFormation stacks in a repeatable manner, so that developers can optimize their time and concentrate on building better environments. It also adds extensible features built on the facilities offered by raw CloudFormation, which adds to the richness of the AWS ecosystem.
Configuration of the infrastructure and applications is important for developing a maintainable, easily repeatable codebase across environments. The best practice is to keep configuration in a standardized and easily maintainable format, made up of independently deployable components with high functional cohesion and the following attributes:
- Externalised: Visible, auditable, and manageable with standard tools.
- Composable: Can be split between teams; changes can be released independently to different instances; configuration is testable at different, fine-grained levels.
- Unattended: Supports automated usage, e.g. CI/CD.
- Integrates: Plays nicely with other tools as described above.
To achieve this, engineers have to allocate time to approach configuration critically, identify the variable space, and separate it from the code. In my experience, if configuration is missing the above attributes, it becomes a monolith. A monolithic codebase is fast to develop but also fragile, prone to errors, and difficult to test. Eventually it will slow down delivery, as it cannot be extended or shared among teams.
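As a minimal sketch of separating the variable space from the code, the example below keeps configuration as plain data and merges a shared base with per-environment overrides. All names and values here (`BASE_CONFIG`, `ENV_OVERRIDES`, `load_config`, `describe_stack`, the instance types and region) are illustrative assumptions, not taken from any Cloudreach codebase.

```python
import json

# Hypothetical configuration, kept as data outside the provisioning code.
# In practice each document would likely live in its own version-controlled file.
BASE_CONFIG = json.loads(
    '{"instance_type": "t3.micro", "instance_count": 1, "region": "eu-west-1"}'
)
ENV_OVERRIDES = {
    "staging": {"instance_count": 2},
    "production": {"instance_type": "m5.large", "instance_count": 4},
}


def load_config(environment: str) -> dict:
    """Merge the shared base with the overrides for one environment."""
    if environment not in ENV_OVERRIDES:
        raise KeyError(f"unknown environment: {environment}")
    return {**BASE_CONFIG, **ENV_OVERRIDES[environment]}


def describe_stack(config: dict) -> str:
    """Stand-in for provisioning code: it only reads the values passed to it."""
    return (
        f"{config['instance_count']} x {config['instance_type']} "
        f"in {config['region']}"
    )


if __name__ == "__main__":
    print(describe_stack(load_config("staging")))
```

Because the provisioning function never hard-codes environment details, the same code path can be exercised for every environment, and the configuration itself can be reviewed and tested as data.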
Testing infrastructure should include automated testing of stacks. There are two main approaches to testing:
- Ephemeral instance testing: Run the tests and then destroy all ephemeral infrastructure created to run them. This method is cost-efficient but slow, since the whole solution needs to be created from scratch every time.
- Persistent instance testing: Run the tests and leave the infrastructure in place. This method makes it fast to apply changes to existing environments; however, it involves the higher operating costs of permanently deployed infrastructure.
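The ephemeral approach can be sketched with a context manager that guarantees teardown even when a test fails. `FakeCloud` is an in-memory stand-in invented for this example; a real test would call a provider SDK or a stack management tool instead.

```python
from contextlib import contextmanager


class FakeCloud:
    """In-memory stand-in for a cloud provider (an assumption for this sketch)."""

    def __init__(self):
        self.stacks = {}

    def create_stack(self, name, resources):
        self.stacks[name] = list(resources)

    def delete_stack(self, name):
        self.stacks.pop(name, None)


@contextmanager
def ephemeral_stack(cloud, name, resources):
    """Create a stack for the duration of a test, then always destroy it."""
    cloud.create_stack(name, resources)
    try:
        yield cloud.stacks[name]
    finally:
        # Teardown runs whether the test body passed or raised.
        cloud.delete_stack(name)


def run_ephemeral_test(cloud):
    """Example test: check that the stack contains the expected resource."""
    with ephemeral_stack(cloud, "test-stack", ["vpc", "subnet"]) as resources:
        return "vpc" in resources


if __name__ == "__main__":
    cloud = FakeCloud()
    print(run_ephemeral_test(cloud), cloud.stacks)
```

A persistent variant would simply skip the `delete_stack` call and reuse the stack on the next run, trading teardown safety and cost for speed.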
Syntax validation and linting are prerequisites to effective testing. They are the only preparatory steps that do not require the deployment of new infrastructure. Contrary to traditional software development, where mocking can be used to decouple the components under test, less value is gained from mocking cloud APIs during infrastructure testing. The test infrastructure can be created within test code (makefile, make targets), within the build scripts (e.g. Terraform), or with tooling like Test Kitchen (a Chef tool that orchestrates test environments).
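To illustrate checks that need no deployed infrastructure, here is a small sketch that validates a JSON template offline. It assumes a CloudFormation-style layout with a top-level `Resources` section in which every resource declares a `Type`; `lint_template` is a hypothetical helper and covers far less than a real linter would.

```python
import json


def lint_template(text: str) -> list:
    """Return a list of problems found in a JSON template, without deploying it."""
    try:
        template = json.loads(text)
    except json.JSONDecodeError as exc:
        # Syntax validation: the document is not valid JSON at all.
        return [f"syntax error: {exc.msg} at line {exc.lineno}"]

    if not isinstance(template, dict):
        return ["template must be a JSON object"]

    resources = template.get("Resources")
    if not isinstance(resources, dict) or not resources:
        return ["template must define at least one resource under 'Resources'"]

    # Linting: every resource needs a 'Type' entry.
    problems = []
    for name, body in resources.items():
        if not isinstance(body, dict) or "Type" not in body:
            problems.append(f"resource '{name}' is missing 'Type'")
    return problems


if __name__ == "__main__":
    good = '{"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}}'
    print(lint_template(good))
```

Checks like these are cheap enough to run on every commit, before any stack is created.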
Multiple environments can be used to manage the complexity of testing (e.g. staging, QA). However, multiple test environments can also cause inconsistencies due to the variability among them. To avoid this issue, engineers should minimize the configurable surface of the infrastructure. A high number of configurable parameters typically reduces the power of testing, because a deployment validated in one environment is not guaranteed to work in the others.
Changes should be promoted using a pipeline that moves code from one environment to the next. Source control with branches and tags is necessary to decide which version of the code applies to which environment. As a guideline to the uses of different environments, the following is recommended: Sandbox Environment for Individual Engineers, Test Environment for Product Squad Role, Customer Environment for the Product Squad (breakglass processes applicable), Delivery Service Environment for the Deliver Ops Squad, and Operational Service Environment for the Relevant Ops Squad.
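A rough sketch of promotion through a pipeline follows. The environment order in `PIPELINE`, the `promote` function, and the `checks_pass` callback are assumptions made for illustration; real pipelines are usually defined in a CI/CD tool rather than in application code.

```python
# Assumed ordering of environments, from least to most critical.
PIPELINE = ["sandbox", "test", "customer"]


def promote(deployed: dict, version: str, checks_pass) -> dict:
    """Deploy `version` to each environment in order, stopping at the first failure.

    `deployed` maps environment name to the version currently running there.
    `checks_pass(env, version)` is a hypothetical callback that runs the
    tests for one environment and returns True or False.
    """
    result = dict(deployed)  # leave the caller's record untouched
    for env in PIPELINE:
        result[env] = version
        if not checks_pass(env, version):
            # Later environments keep the version they already had.
            break
    return result


if __name__ == "__main__":
    current = {"sandbox": "v1", "test": "v1", "customer": "v1"}
    # Pretend the checks fail in the test environment.
    print(promote(current, "v2", lambda env, version: env != "test"))
```

The version string here would typically be a source control tag, so that the record of what runs where can always be traced back to a commit.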
One of the fundamental ingredients in the adoption of infrastructure automation is the notion of immutable changes being packaged in the form of server images. “Baked” or “golden” images are tested once and then used multiple times. As a result, baking is recommended for heavy solutions that do not change often. “Fried” images, on the other hand, have their configuration applied at instance creation time. Hence they are more appropriate for lighter-weight use cases.
Two development approaches are available for “baked” images:
- Push (e.g. Ansible): The automation tool performs installs by gaining privileged access to the servers, and does not need configuration management software running on them.
- Pull (e.g. Chef & Puppet): Configuration management software on the instance pulls configurations from a central location to configure the instance, rather than a tool gaining privileged access to the servers.
I prefer the pull method because it is simpler to manage configuration from a single central location, and because it avoids the use of elevated privileges on servers, which increases security and reduces risk.
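The pull model can be sketched as an agent that fetches its desired state from a central store and applies only the differences. `CENTRAL_CONFIG`, `Agent`, and the package versions are hypothetical and do not reflect how Chef or Puppet are implemented.

```python
# Hypothetical central store: desired packages per server role.
CENTRAL_CONFIG = {"web": {"nginx": "1.24", "openssl": "3.0"}}


class Agent:
    """Runs on a server and converges it towards the centrally defined state."""

    def __init__(self, role: str):
        self.role = role
        self.installed = {}

    def pull_and_apply(self, central: dict) -> dict:
        """Fetch the desired state and apply only what differs; return the changes."""
        desired = central[self.role]
        changes = {
            package: version
            for package, version in desired.items()
            if self.installed.get(package) != version
        }
        self.installed.update(changes)
        return changes


if __name__ == "__main__":
    agent = Agent("web")
    print(agent.pull_and_apply(CENTRAL_CONFIG))  # first run applies everything
    print(agent.pull_and_apply(CENTRAL_CONFIG))  # second run finds nothing to change
```

Because the agent initiates every connection, nothing outside the server needs privileged access to it, and repeated runs are safe since an already converged server reports no changes.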
Combining the practices discussed at Velocity with our experience at Cloudreach, I invite DevOps practitioners to consider the following patterns and anti-patterns when building cloud-native solutions:
- Each infrastructure stack should be on a delivery pipeline.
- Each application has its own pipeline with its own stack.
- Use shared infrastructure stacks for resources shared across the team (e.g. a shared VPC pipeline); this way engineers can build on top of the extensible code and avoid duplicating effort.
- The provider stack should not have knowledge of its consumers, and should be kept minimal and split into smaller stacks to avoid rigid, untestable monoliths.
- In the “shared-nothing” stacks approach, teams do not share anything and use separate infrastructure such as subnets and Kubernetes clusters.
- Avoid tight coupling.
- Anti-pattern: running configuration management tools only when you have specific changes to make. This way, changes accumulate over time and lead to inconsistencies.