In this blog post, Cloud Systems Developer Mykal Dortch shares his experience tackling unexpected capacity limitations when working with cloud service providers (CSPs) and suggests several deployment strategies that can help overcome these kinds of issues.
I recently had the pleasure of helping a client migrate several applications from their expensive, on-premises data center into the cloud. Having done many migrations for past clients, I went through my handy checklist and validated that everything was in place to get things rolling. VPN? Check. DNS? Check. Capacity planning, licensing, networking? Check, check, check.
By all accounts, we’d dotted all the i’s and crossed all the t’s. We’d worked together for weeks to verify that all the images were in place across the various environments and that our primary and failover stacks were connecting to the appropriate databases with multi-regional availability. We checked microservices, dependencies, updates, patches, OS versions; effectively everything that might cause post-migration headaches. We were ready.
An Unexpected Problem
On our first full-stack deployment day, we “flicked the switch” to test what would become the production environment. Almost immediately, we ran into issues. I first noticed that several database instances had deployed, but without a couple of their attached volumes. Then, I saw that some of the infrastructure instances were slow to deploy and eventually failed. Unusual, I thought, but the purpose of this “dry run” was to expose any issues that we might have during the go-live period. I combed back through the deployment scripts and reviewed the errors in hopes of identifying the root cause quickly. The good news was that there were no syntax errors to be found. However, we did notice that several of the deploys returned the same two kinds of errors: insufficient disk quota and insufficient CPU quota.
After a quick consultation with the billing department, I compiled a list of quota increase requests and sent them off to the CSP. What happened next came as a complete surprise. Our request was denied! After verifying that it wasn’t April Fools’ Day, I replied in hopes of getting a more thorough explanation and was informed that, due to capacity limitations in my desired region, no further quota increases would be granted at this time. I thought this was a little vague, so I pushed back a little harder to get a timeline on when the resources would become available. The response was a little disheartening: no fulfillment date could be provided.
Having worked with cloud service providers for years, I was surprised at this outcome. Many CSPs are guarded about the full scale of their underlying infrastructure but generally give the impression of unlimited capacity. The surprise of discovering this limit was akin to peeking behind a magician’s curtain. In one failed deploy, the magic of infinite scale was reduced to a parlor trick.
Updating My Deployment Playbook
Fortunately, all was not lost. With a bit of retooling and some configuration changes, we spread the deploy across multiple regions and leveraged a CDN to make sure content was readily available to users, no matter their location. While this change did make us rethink how the application was deployed, it also made me rethink my planning steps for cloud deployments. My deployment playbook now includes a number of actions specifically designed to avoid exactly this sort of mishap. I’ve grouped these additions into a section that I refer to as “trust but verify”. While I still firmly believe in the cloud and all its trappings, I have added these caveats to ensure a better, might I say, “more perfect” deployment procedure.
As a first step, I take my existing list of requirements and, after allocating resources to each application or instance as usual, I add them all together and sort by region. Once everything is arranged and sorted, I check that my needs come to no more than 50% of the allocated quota, which leaves room for future growth. If not, I bulk-submit increase requests with detailed explanations of the resources I plan to deploy, regions, requirements, etc. At this stage, I also like to check in with the CSP’s technical staff to verify whether any upcoming announcement might impact my deployment objectives. Hopefully, this step serves to identify any miscalculations I’ve made about the CSP’s future planning. I’ve also gotten lucky a few times and found out about a little-known, cost-saving setting or yet-to-be-released feature and taken advantage of it in the planning phase. It never hurts to do some relationship building.
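To make the headroom check concrete, here is a minimal sketch in Python. The application names, regions, resource amounts, and quota figures are all made up for illustration, and the quotas are hard-coded rather than fetched from any real provider API. It simply totals the planned resources per region and flags anything that would use more than half of the granted quota.

```python
from collections import defaultdict

# Hypothetical planned resources: (application, region, resource type, amount).
REQUIREMENTS = [
    ("web", "region-a", "cpus", 16),
    ("web", "region-a", "disk_gb", 500),
    ("db", "region-a", "cpus", 32),
    ("db", "region-a", "disk_gb", 4000),
    ("web", "region-b", "cpus", 16),
]

# Hypothetical quotas currently granted, keyed by (region, resource type).
QUOTAS = {
    ("region-a", "cpus"): 72,
    ("region-a", "disk_gb"): 10000,
    ("region-b", "cpus"): 24,
}

# Planned usage should stay at or below this fraction of the quota.
MAX_UTILIZATION = 0.5


def total_by_region(requirements):
    """Sum the planned amounts per (region, resource type)."""
    totals = defaultdict(int)
    for _app, region, resource, amount in requirements:
        totals[(region, resource)] += amount
    return dict(totals)


def quota_increase_candidates(requirements, quotas, max_utilization=MAX_UTILIZATION):
    """Return (region, resource, needed, quota) for every total that leaves too little headroom."""
    candidates = []
    for (region, resource), needed in sorted(total_by_region(requirements).items()):
        quota = quotas.get((region, resource), 0)  # a missing quota is treated as zero
        if needed > quota * max_utilization:
            candidates.append((region, resource, needed, quota))
    return candidates


if __name__ == "__main__":
    for region, resource, needed, quota in quota_increase_candidates(REQUIREMENTS, QUOTAS):
        print(f"{region}: need {needed} {resource}, quota {quota}; request an increase")
```

With these sample numbers, the CPU totals in both regions are flagged, while the disk total in region-a stays under the 50% line.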
After the initial request, I test the quotas by automatically creating a set of resources sized to exercise the quota for each resource type. For example, I may attempt to provision disks with capacities right up to the hard quota. Similarly, for CPUs, memory, and other quota-limited resources, I’ll test the limits to validate that they are, in fact, available for use. The first time, I did this by hand, but I have since created a script that automates the process and ensures that the created resources are launched, tested, and destroyed with minimal cost and no negative effects. In the event of an error, I can also see the log information and identify which cloud provider, resource, and quota caused the error. This provides valuable information for me, the client, and the cloud provider, removing any doubt about the deployability of the proposed infrastructure and applications.
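My actual script is specific to the providers I work with, but the launch, test, and destroy pattern can be sketched roughly as follows. Everything here is hypothetical: FakeProvider is an in-memory stand-in for a real CSP client, and QuotaExceeded and probe_quota are names invented for this example. The important part is the try/finally, which tears the resource down whether or not the probe succeeds.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("quota_probe")


class QuotaExceeded(Exception):
    """Raised by the (hypothetical) provider client when a request exceeds quota."""


@dataclass
class FakeProvider:
    """In-memory stand-in for a real CSP client; not any vendor's actual API."""
    name: str
    capacity: dict  # resource type -> units that can actually be provisioned
    in_use: dict = field(default_factory=dict)

    def create(self, resource, amount):
        used = self.in_use.get(resource, 0)
        if used + amount > self.capacity.get(resource, 0):
            raise QuotaExceeded(f"insufficient {resource} quota")
        self.in_use[resource] = used + amount
        return (resource, amount)  # opaque handle used for cleanup

    def destroy(self, handle):
        resource, amount = handle
        self.in_use[resource] -= amount


def probe_quota(provider, resource, amount):
    """Try to provision `amount` units of `resource`, then always clean up.

    Returns True if the resources could be created, False otherwise.
    """
    handle = None
    try:
        handle = provider.create(resource, amount)
        log.info("%s: provisioned %d %s", provider.name, amount, resource)
        return True
    except QuotaExceeded as exc:
        # Record which provider, resource, and amount caused the failure.
        log.error("%s: could not provision %d %s: %s", provider.name, amount, resource, exc)
        return False
    finally:
        if handle is not None:
            provider.destroy(handle)  # keep the cost of the test minimal


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    provider = FakeProvider("csp-a", {"cpus": 64, "disk_gb": 2000})
    probe_quota(provider, "cpus", 64)        # right up to the limit
    probe_quota(provider, "disk_gb", 4000)   # beyond what is available
```

Against a real provider, create and destroy would wrap that provider’s own SDK calls, and the error handling would need to match whatever exception or error code it returns for quota failures.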
In several other use cases, specifically where high availability was a focus, we’ve also side-stepped large resource quotas by deploying resources and applications across multiple cloud providers. Depending on a customer’s needs, we’ve also spanned multiple zones and/or regions. The overall idea is to reduce the likelihood of hitting resource quotas by broadening potential deployments to encompass resources from multiple cloud providers, getting extra redundancy and/or disaster recovery in the process. The three most common schemes for this configuration are outlined in the diagram below, and I’ve nicknamed them (1) solo regional mesh, (2) multi-cloud regional mesh, and (3) multi-cloud, multi-regional mesh.
Solo regional mesh uses multiple zones in the same region but sticks to a single cloud provider. Spreading resources out this way has the added benefit of providing disaster recovery without having to deal with any of the nuances between different CSPs. It also presents less network complexity in the form of firewall rules, routes, and public IP addresses. Since many resources are allocated by zone, this can multiply the effective quota by the number of zones, a huge potential benefit.
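As a rough illustration of why per-zone quotas help, the sketch below spreads a number of instances round-robin across zones without exceeding any single zone’s limit. The zone names and quota values are invented, and spread_across_zones is a hypothetical helper rather than part of any provider’s tooling.

```python
def spread_across_zones(instance_count, zone_quota):
    """Assign instances to zones round-robin without exceeding any zone's quota.

    zone_quota maps a zone name to the maximum number of instances allowed there.
    Returns a dict of zone name -> number of instances placed in that zone.
    """
    if instance_count > sum(zone_quota.values()):
        raise ValueError("combined zone quota is too small for this deployment")

    placement = {zone: 0 for zone in zone_quota}
    remaining = instance_count
    while remaining > 0:
        # Each pass places at most one instance per zone that still has room.
        for zone in sorted(zone_quota):
            if remaining == 0:
                break
            if placement[zone] < zone_quota[zone]:
                placement[zone] += 1
                remaining -= 1
    return placement


if __name__ == "__main__":
    # Ten instances would not fit in one zone capped at four,
    # but three such zones can hold them.
    print(spread_across_zones(10, {"zone-a": 4, "zone-b": 4, "zone-c": 4}))
```

In this example a single zone with a limit of four instances could not host the deployment, while three zones together can, with the load spread as evenly as the limits allow.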
Multi-cloud regional mesh expands on this idea by creating the resources in two geographically close regions, each with a different cloud provider. This keeps the benefits of solo regional mesh, increases resiliency, and removes the dependency on a single provider. It isn’t without negatives, as traffic between the cloud providers will likely incur a fee. I’ve limited this in the past by having applications run active on one CSP and passive on the other.
Lastly, multi-cloud, multi-regional mesh spans resources across both regions and cloud providers for maximum effective distribution. In my experience, this is often overkill, but I thought I’d mention it for the sake of being thorough. It’s very resilient, but it can also be unnecessarily complicated. This approach effectively removes quotas as a concern, as applications/resources can be adjusted based on cost, resource availability, and other factors, at the cost of a larger bill.
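To show what adjusting placement by cost and availability might look like, here is a small sketch that picks the cheapest provider/region pair with enough free capacity. The Candidate type, the provider and region names, the capacity figures, and the prices are all hypothetical. In practice these values would come from each provider’s quota and pricing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible landing spot for a workload (all values are illustrative)."""
    provider: str
    region: str
    available_cpus: int
    hourly_cost_per_cpu: float


CANDIDATES = [
    Candidate("csp-a", "region-1", available_cpus=16, hourly_cost_per_cpu=0.03),
    Candidate("csp-a", "region-2", available_cpus=64, hourly_cost_per_cpu=0.05),
    Candidate("csp-b", "region-1", available_cpus=48, hourly_cost_per_cpu=0.04),
]


def choose_placement(candidates, cpus_needed):
    """Return the cheapest candidate with enough free CPUs, or None if none fits."""
    viable = [c for c in candidates if c.available_cpus >= cpus_needed]
    if not viable:
        return None
    return min(viable, key=lambda c: c.hourly_cost_per_cpu)


if __name__ == "__main__":
    choice = choose_placement(CANDIDATES, cpus_needed=32)
    print(choice)
```

With these sample numbers, the cheapest option is skipped because it lacks capacity, and the workload lands on the next cheapest candidate that can actually hold it.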
As an aside to these three deployment strategies, I also use application segregation as a means of quota avoidance. Since most resource quotas are typically given by account/region, deploying large applications into their own account/region protects them from any resource-hogging from other applications that may normally share the same infrastructure. I’ll expound on this in a future post dedicated to application segregation.
Given all this, some might feel the urge to run back to the perceived security of a traditional CapEx server model and eschew the cloud altogether. This would be a mistake. The benefits of moving your applications and infrastructure to the cloud are well documented, with realizable cost and time savings. Taking quotas into consideration for your cloud deployment should present little increased difficulty and, if done early in the planning phase, minimal increased cost. It isn’t my intent here to present every possible solution, but rather to provide some insight into a validation test I’ve now scoped into any potential deploy. May it save you many hours of frustration as well. Be great!