Lessons From Cloudreach’s Journey To A Site Reliability Engineering Model
Lex Holt, Head of Core Operations at Cloudreach, shares some lessons learned from Cloudreach's journey towards embracing SRE and applying it to our Cloud Management engagements with our customers.
Site Reliability Engineering (SRE) is an approach, originally popularised by Google, that applies software engineering to the challenges of Operations and can help organizations create more scalable and reliable software and cloud applications. To get there, though, organizations must keep a number of components and considerations in mind as they begin their SRE journey. At Cloudreach, we are applying some of the insights of Site Reliability Engineering to cloud management, and in many ways we are still on the journey to full ‘SRE-ification’. Here are some key lessons we’ve learned along the way.
SRE is equally about culture and technology
As a managed service provider (MSP), Cloudreach faces unique challenges. Our customers come to us with different problems and at different stages of their cloud journey. We also have varying amounts of visibility into, and control over, our customers’ technology stacks, which can make it difficult to create an approach that successfully manages their full cloud environment. Because of these differences, our management approach cannot be one-size-fits-all. Our own approach to SRE has therefore prioritized adaptability and agility, so that we can adjust to the various challenges that our own organization and our customers face.
One of the great aspects of the cloud is that it is soft: from a technical perspective, and at least in principle, everything can be changed relatively quickly and easily. This means that we have a wide range of options for managing our customers’ cloud environments depending on their needs, and that we can implement a chosen strategy fairly quickly.
But this also means that a single misstep can wreak havoc on an organization’s security and cost management. Because the cloud is so easy to change, an inexperienced (or even an experienced) developer can make a seemingly small change in a few lines of code that accidentally opens a big security hole, exposing potentially sensitive or private information to unintended audiences and bad actors.
This is why cloud management is so important, and why it can’t just be about a technology change. In some ways, it is more about the organizational and process changes that prevent these situations from happening. Cloudreach guides our customers in implementing the right changes so that security risks are mitigated and costs do not skyrocket beyond the organization’s budget. By working with our customers to transform not only their technology stack but also their culture, we make sure they are ready for the fast-paced and ever-changing world of the cloud with a personalized approach that works for them.
Humans are bad machines, so let them do what humans do best
Throughout our own SRE journey, we’ve learned that reliability is the key to success - after all, it is the ultimate goal of SRE. The challenge here is that humans are unreliable. If a task can be given to a machine, doing so will generally make that task (or the service that relies on it) more reliable. Turning to automation for simple maintenance tasks, rather than hiring and building out a new team to take care of them, can save an organization money. But more than that, automation is important because humans are bad machines.
Any time an organization does something that relies on a human behaving like a machine, it is setting itself up for failure. Typically, organizations have a set way of doing maintenance tasks, like a recipe or an assembly line that employees should follow. When something goes wrong or someone makes a mistake in that process, our first thought is often how we can adjust the process to prevent it from happening again. But in the fast-paced world of the cloud and digital transformation, organizations cannot always afford to learn from these mistakes one at a time. In these instances, companies should ask themselves whether a human should be completing the task at all, or whether it might be better suited to automation. Not only will this save organizations headaches, it also frees up engineers to focus on bigger and more important issues rather than mundane management tasks.
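As an illustration of handing a repetitive review task to a machine, here is a minimal Python sketch of an automated check for the kind of accidental exposure described earlier. It is not Cloudreach tooling and does not call any cloud provider's API: the IngressRule shape, the list of sensitive ports and the find_risky_rules helper are all assumptions made up for this example.

```python
from dataclasses import dataclass


# Hypothetical rule shape for illustration only; a real check would read
# rules from a provider API or an infrastructure-as-code plan.
@dataclass(frozen=True)
class IngressRule:
    name: str
    port: int
    source_cidr: str


# CIDR ranges that mean "any address on the internet".
OPEN_TO_WORLD = {"0.0.0.0/0", "::/0"}

# Assumed list of ports that should never be open to the world:
# SSH, RDP and PostgreSQL.
SENSITIVE_PORTS = {22, 3389, 5432}


def find_risky_rules(rules):
    """Return the rules that expose a sensitive port to any address."""
    return [
        rule
        for rule in rules
        if rule.source_cidr in OPEN_TO_WORLD and rule.port in SENSITIVE_PORTS
    ]


rules = [
    IngressRule("web", 443, "0.0.0.0/0"),           # public HTTPS, allowed
    IngressRule("ssh-admin", 22, "0.0.0.0/0"),      # SSH open to the world
    IngressRule("db-internal", 5432, "10.0.0.0/16"),  # private range only
]

for rule in find_risky_rules(rules):
    print(f"Risky rule: {rule.name} exposes port {rule.port} to {rule.source_cidr}")
```

Run on every proposed change, a check along these lines applies the same rule every time, which is exactly the kind of consistency that a manual review step struggles to deliver.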
Adaptability is key
Regardless of where customers are in their cloud journey when they come to us, it is important that they keep an open mind and remain willing to change as they move to the cloud and develop a cloud management strategy. With the cloud, things are always changing - new methods and technologies are developed at a more rapid pace than ever before, so a company’s cloud management strategy, culture and tech stack must be able to support these changes.
With our unique approach to cloud management that blends technical expertise with an awareness of the importance of business processes and culture, Cloudreach helps customers navigate these changes and ensures they are set up for success. Contact us today to learn more.