Rapidly innovating and improving the machine learning algorithms powering the Aware platform is a paramount business imperative. The organization needed a platform where data scientists and analysts could quickly iterate on their TensorFlow image recognition models without being constrained by infrastructure limitations.
Image recognition models often require standardization and pre-processing of images before being fed into the training model. Prior to migrating to AWS, data scientists at Aware were typically idle for 3+ hours waiting for each pre-processing cycle to finish. This loss of productivity was forcing data scientists to run fewer iterations, thus limiting the pace of innovation.
Aware wanted their data scientists to spend more than 80% of their time iterating on their models; in reality, they were spending about that much time just managing infrastructure. What they truly needed was a machine learning pipeline that would automate the entire process of training, testing, and publishing new ML models while elastically scaling the underlying infrastructure.
Aware needed a partner to create an architecture which would help eliminate some of the friction Aware data scientists were experiencing. As an AWS Partner Network Premier Partner with Machine Learning Competency, Cloudreach had deep expertise in solving similar problems.
The primary reason AWS was selected as the cloud platform of choice was its Amazon SageMaker machine learning service. Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. The benefits of the solution include:
- Easily “bring your own model”: existing models could be trained with minimal modifications to the team’s existing, Docker-based workflow.
- Build and use custom, in-house machine learning and feature engineering toolsets.
- No need to maintain, manage, and pay for the continued use of large machine learning VMs.
- Availability of high-powered GPU machines.
- Support for many ML platforms, including TensorFlow, PyTorch, and R.
- Managed, scalable platform for model deployment.
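To make the “bring your own model” benefit concrete, a custom Docker-based training job of the kind described above can be expressed through the SageMaker `CreateTrainingJob` API. The sketch below builds the request as a plain dictionary; the job name, ECR image URI, IAM role, and bucket are hypothetical placeholders, not Aware’s actual resources.

```python
def build_training_job_request(job_name, image_uri, role_arn, bucket,
                               instance_type="ml.p3.2xlarge"):
    """Build a SageMaker CreateTrainingJob request for a custom
    Docker-based TensorFlow model (bring-your-own-container)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # The team's existing Docker image, pushed to Amazon ECR,
            # runs as-is -- no code rewrite required.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/preprocessed/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/models/"},
        "ResourceConfig": {
            # GPU-backed instances are provisioned only for the
            # duration of the job, then released.
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# The request would be submitted with boto3, e.g.:
#   boto3.client("sagemaker").create_training_job(**request)
request = build_training_job_request(
    "aware-imgrec-poc-001",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/imgrec:latest",
    "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "aware-ml-poc",
)
```

Because SageMaker spins the instance up for the job and tears it down afterward, there is no standing GPU VM to maintain or pay for between runs.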
On top of that, AWS Lambda and AWS Batch provided a straightforward platform for Aware to perform data ingestion and preprocessing at scale, again without having to maintain and manage large clusters of VMs.
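A minimal sketch of how such large-scale preprocessing can be fanned out with an AWS Batch array job is shown below. The queue name, job definition, and chunk size are illustrative assumptions, not Aware’s actual configuration.

```python
def build_batch_preprocess_job(job_queue, job_definition,
                               total_images, images_per_task=1000):
    """Build an AWS Batch SubmitJob request that splits image
    pre-processing across an array job; each child task handles
    one chunk of images in parallel."""
    array_size = -(-total_images // images_per_task)  # ceiling division
    return {
        "jobName": "preprocess-images",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # AWS Batch runs `array_size` copies of the container; each
        # copy reads AWS_BATCH_JOB_ARRAY_INDEX to select its chunk.
        "arrayProperties": {"size": array_size},
        "containerOverrides": {
            "environment": [
                {"name": "IMAGES_PER_TASK", "value": str(images_per_task)},
            ],
        },
    }

# 100K images split into 100 parallel tasks of 1,000 images each;
# the request would be sent with boto3.client("batch").submit_job(**req).
req = build_batch_preprocess_job("preprocess-queue",
                                 "preprocess-jobdef:1", 100_000)
```

Spreading the work across many short-lived containers, rather than one long-running VM, is what makes a minutes-long preprocessing cycle feasible.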
As the team at Aware was new to AWS, Cloudreach suggested running a quick POC (Proof of Concept) with the following high-level goals in mind:
- No code rewrites: as significant investment had gone into fine-tuning the image recognition model using TensorFlow, rewriting the code into another deep learning library wasn’t a viable option. The existing codebase had to work as is on the new platform.
- Self-managing infrastructure: Aware wanted the new platform to be elastic and easy to manage so that data scientists could focus on improving their ML models and not worry about managing the underlying infrastructure.
- Faster pre-processing cycle: Fetching and pre-processing of 100K test images from the image library should be completed in 5 minutes or less
- Continue using MongoDB: as the team had significant experience with MongoDB, switching to another NoSQL database was not an option
- Model reproducibility: Data scientists should have the ability to rerun a model with the same set of training data used during the initial run
- Cost-based execution “lanes”: not all models needed to run on the most expensive environment. Aware needed a flexible environment where data scientists could pick a cheaper option for non-critical jobs.
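One way to express the cost-based “lanes” goal above is a simple mapping from lane name to SageMaker resource settings, with cheaper spot capacity for non-critical jobs. The lane names and instance types here are illustrative assumptions, not the configuration Cloudreach delivered.

```python
# Hypothetical lane definitions: cheaper CPU/spot capacity for
# non-critical experiments, on-demand GPU capacity for critical runs.
LANES = {
    "economy":  {"InstanceType": "ml.m5.xlarge",  "UseSpot": True},
    "standard": {"InstanceType": "ml.p2.xlarge",  "UseSpot": True},
    "priority": {"InstanceType": "ml.p3.2xlarge", "UseSpot": False},
}

def lane_config(lane):
    """Return SageMaker training-job parameters for the chosen lane."""
    cfg = LANES[lane]
    params = {
        "ResourceConfig": {"InstanceType": cfg["InstanceType"],
                           "InstanceCount": 1, "VolumeSizeInGB": 50},
    }
    if cfg["UseSpot"]:
        # Managed spot training trades possible interruption for a
        # substantially lower price -- acceptable for non-critical jobs.
        params["EnableManagedSpotTraining"] = True
        params["StoppingCondition"] = {"MaxRuntimeInSeconds": 3600,
                                       "MaxWaitTimeInSeconds": 7200}
    else:
        params["StoppingCondition"] = {"MaxRuntimeInSeconds": 3600}
    return params
```

A data scientist then only picks a lane name at submission time; the instance type and spot settings follow from that single choice.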
Cloudreach built a POC solution for Aware using the following AWS services:
- Amazon SageMaker
- Amazon DocumentDB (with MongoDB compatibility)
- Amazon S3
- AWS Batch
- Amazon ECR
- AWS CLI
With a small engineering team, Aware didn’t have the resources to make the move quickly or to learn a completely new platform while also continuing to produce ML features vital to its growing SaaS platform. Cloudreach’s efforts to accommodate Aware’s existing image pipeline when developing the POC on AWS were essential to the customer’s quick and successful migration. Building on the foundations Cloudreach provided, the Aware Data Science team was able to complete the migration to AWS in less than 6 weeks.
As a result of moving to AWS, the Aware data science team has dramatically reduced the amount of time spent managing and maintaining infrastructure. With this time saved, coupled with the flexibility of AWS Batch and Amazon SageMaker, the team has successfully deployed vastly more sophisticated ML data pipelines, allowing it to solve complex business needs more effectively.
Cloudreach’s expertise was critical to the successful migration of Aware’s existing ML pipeline to AWS. Not only did the proposed POC solution reduce image preprocessing from hours to minutes; by working closely with Cloudreach engineers, the Aware team was also able to influence the solution in such a way that it was easily extensible to other use cases. Cloudreach’s openness to this collaborative process was essential to the long-term success of the migration.