TfL Hackday - The Data Challenge
I recently got involved in a Hackday that energised both me and a room full of very clever people.
Experiencing what can be achieved when collaboration, creativity and the latest in cloud technology enable, rather than inhibit, was a great thing. Let me start at the beginning, with the challenge…
"Can you persist data being sent every 1/4s by 12,000 sensors scattered across London?"
This was the question put to us by Transport for London (TfL) back in August 2015. At the time, TfL were investigating the use of Amazon Web Services (AWS) as a possible candidate to host a replacement for their Traffic Data Enquiry System (TDES). Cloudreach was brought in to build a proof of concept (PoC) system to demonstrate that a cloud-based system could be a good base to build upon for future requirements, with the particular benefits of flexibility and scale.
What is TDES?
TDES is the system that is responsible for collecting and making the data from those 12,000 sensors available for analysis. The sensors themselves are simple magnetometers placed under the road that have been configured to periodically check whether a metallic object is present. This results in a stream of 1s and 0s that represents the traffic flow on a given road.
Each sensor sends this stream to the traffic junction that it is attached to. The data for all sensors is then aggregated into 5-minute batch files. These are finally dispatched to TDES for long-term persistence.
Building a cloud-based replacement for TDES
Our cloud-based PoC makes use of various AWS services such as Lambda, DynamoDB, S3 and Redshift to provide the data in both unstructured (raw text files) and structured (SQL interface) formats. We maximised the use of AWS Platform as a Service (PaaS) offerings to minimise operational cost and improve integration between all infrastructure components.
The result was a system that could ingest incoming 5-minute batch files and make the content available for analysis in less than 40s. This was a big improvement over the existing solution, which took almost half a day.
With access to the TDES data made easy thanks to the cloud, this was the first time that TfL could make this untapped source available to third parties.
A Traffic Data Hackday was held on the 6th of April at the AWS offices in London.
We were kindly invited to assist during the day having worked on the PoC. This was also an opportunity for us to revisit the system that we had built, the data it had generated and investigate new ways to improve it in the future. One of the unsolved challenges that we’d faced during the PoC was how to load large batches of historical data. Assisted by some talented data scientists, engineers and AWS solution architects, our team set out to tackle this problem using the power of AWS’s Elastic MapReduce (EMR) service.
Improving the original PoC
The PoC system is ideal as a live system ingesting real-time data. However, it was not designed to handle a large set of historical data because of how the sensor data is stored.
As seen above, each row in the file represents the output from a single junction (blue box) that can have up to 8 sensors attached to it (red box). During the PoC, TfL had asked us if we could transform the magnetometer data from a series of 1s and 0s to a single event containing the vehicle headway (# of consecutive 0s) and its length (# of consecutive 1s).
This operation was performed using an AWS Lambda function which parses the raw data input file and outputs a comma separated values file containing the sensor ID, timestamp, headway and length. However, the way the AWS Lambda function was designed meant that data files had to be processed sequentially, as you needed to keep track of ongoing events between two consecutive 5 minute batch files.
This worked well for the live system but not so much for historical data batches.
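To make the sequential constraint concrete, here is a minimal sketch of that "1s and 0s to headway/length" transformation. This is not TfL's actual Lambda code; the function name and the `(zeros, ones)` carry tuple are illustrative. The carry is the state that must be threaded from one 5-minute file to the next, which is exactly what forces the files to be processed in order.

```python
def stream_to_events(bits, carry=(0, 0)):
    """Run-length encode a sensor's bit stream into (headway, length) events.

    headway = number of consecutive 0s before a vehicle arrives,
    length  = number of consecutive 1s while it sits over the magnetometer.
    `carry` is the (zeros, ones) count left over from the previous
    5-minute file; returns (events, new_carry) so the caller can feed
    the leftover run into the next file.
    """
    zeros, ones = carry
    events = []
    for bit in bits:
        if bit == 0:
            if ones:  # a vehicle has just cleared the sensor: emit an event
                events.append((zeros, ones))
                zeros, ones = 0, 0
            zeros += 1
        else:
            ones += 1
    return events, (zeros, ones)

# One 5-minute file ends mid-event; its carry seeds the next file.
events, carry = stream_to_events([0, 0, 1, 1, 1, 0, 0, 0, 1])
# events == [(2, 3)], carry == (3, 1) -- an unfinished vehicle event
more_events, carry = stream_to_events([1, 0], carry)
# more_events == [(3, 2)] -- completed using the carried-over state
```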
Transforming the data in new and wonderful ways
Instead of transforming the raw data to headway/length on the fly using Lambda, our team used EMR to allow us to process multiple sensor data batch files in parallel, rather than the PoC’s sequential workflow. This was achieved by changing the ETL process to:
- For each 5-minute raw batch data file, generate a file per sensor that contains the continuous stream of 1s and 0s only for that sensor over those 5 minutes.
- Concatenate all 5-minute files for a single sensor and sort the rows by timestamp.
- Run each concatenated sensor file through the same "1s and 0s stream to headway/length" transformation code that was used by the Lambda. This time, we no longer need to keep track of ongoing events between files as we now have the full continuous stream for each sensor.
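The three steps above can be sketched in plain Python. This is an illustration rather than the actual EMR job: the `sensor_id,timestamp,bits` row format is an assumption, and on EMR the split step would run as parallel map tasks rather than a loop.

```python
from collections import defaultdict

def split_by_sensor(batch_files):
    """Step 1: emit one fragment per sensor per 5-minute batch file
    (on EMR this would be a parallel map stage)."""
    fragments = defaultdict(list)
    for rows in batch_files:
        for row in rows:
            sensor_id, ts, bits = row.split(",")
            fragments[sensor_id].append((int(ts), bits))
    return fragments

def to_continuous_stream(fragments):
    """Step 2: concatenate a sensor's fragments, sorted by timestamp,
    into one continuous bit stream."""
    return "".join(bits for _, bits in sorted(fragments))

def stream_to_events(stream):
    """Step 3: the same run-length transformation as the Lambda, but with
    no carry-over state -- the full stream is available up front."""
    events, zeros, ones = [], 0, 0
    for bit in stream:
        if bit == "0":
            if ones:  # vehicle cleared the sensor: emit (headway, length)
                events.append((zeros, ones))
                zeros = ones = 0
            zeros += 1
        else:
            ones += 1
    return events

# Two illustrative 5-minute batch files, each covering two sensors.
batches = [["s1,100,00111", "s2,100,11100"],
           ["s1,400,00010", "s2,400,00110"]]
per_sensor = split_by_sensor(batches)
events = {sid: stream_to_events(to_continuous_stream(frags))
          for sid, frags in per_sensor.items()}
# events["s1"] == [(2, 3), (3, 1)]; events["s2"] == [(0, 3), (4, 2)]
```

Because each sensor's stream is now self-contained, every concatenated sensor file can be transformed independently, which is what makes the parallel EMR workflow possible.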
The result is the ability to process a month's worth of all sensor data in less than 1h. That is 8,928 5-minute batch files (288 per day over 31 days), each containing approximately 226,200 junction data points!
What does this all mean?
It is important to highlight the opportunities that a public cloud vendor such as AWS offers when tackling big data problems.
The challenges have evolved. Where once we were asking "can we store this data?", now we're asking how we can drive new value from our easily accessible data.
Instead of asking "how many months/pounds is it going to take before we can process the data?", we are now wondering what insights we can gain from this quicker, cheaper processing of large datasets that will make our decision-making more effective.