Freight Management: Solving real-world problems using GCP data tools
Helping a freight management customer prepare better for heavy traffic and congestion
We were asked by GCP to demo the power of their data tools to a prospective customer who is new to the world of public cloud and wants to find new ways to use their data to drive efficiency, reduce cost, and provide real-time reporting for their business needs.
Traffic congestion data exploration on GCP
We were a team of three engineers with three days, aiming to display and analyse live and historic traffic congestion data while exploring GCP data tools. Tools we explored and used include Cloud Functions, Cloud Composer, Cloud Dataprep, Cloud Dataflow, BigQuery, Cloud Storage, Data Studio and Cloud Datalab. For traffic data we used a trial of the HERE API.
Extract, transform, load
We began by searching for a source of traffic data and settled on HERE's 90-day API trial. The API returns JSON with many details about roads within a specified area, including the one we were most interested in: the JamFactor, a measure from 0 to 10 describing how congested a section of road is. The API provides a number of other metrics and relevant information, including the stretch of road the measurements refer to as a series of (lat, lon) pairs.
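The nested structure means each response needs flattening before analysis. As an illustrative sketch, using simplified, hypothetical key names (the real HERE flow response nests several levels deeper, but the flattening idea is the same):

```python
import json

# Simplified, hypothetical response shape -- the real HERE flow JSON
# nests several levels deeper, but the flattening idea is the same.
raw = json.loads("""
{
  "roads": [
    {"description": "M20",
     "shape": [[51.2, 0.70], [51.2, 0.74]],
     "jamFactor": 3.4},
    {"description": "A20",
     "shape": [[51.1, 0.65], [51.1, 0.66]],
     "jamFactor": 8.1}
  ]
}
""")

# One flat row per (road, shape point) -- ready to load into a columnar store
rows = [
    {"Road": road["description"], "Latitude": lat, "Longitude": lon,
     "JamFactor": road["jamFactor"]}
    for road in raw["roads"]
    for lat, lon in road["shape"]
]
```

Each road segment becomes several flat rows, one per shape point, which is the kind of table that loads cleanly into BigQuery.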
Dataprep and BigQuery
With access to raw data in hand, we needed a way to take it from heavily nested JSON to a format we could load into BigQuery. Our options were to write our own standalone script to parse the JSON and load it into BigQuery, to create a Dataflow pipeline, or to explore the Dataprep service, which sits on top of Dataflow.
Dataprep is described as an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. With a bit of clicking around to familiarise ourselves with the tool, we found Dataprep to be capable of wrangling the data as required. We were able to create a recipe which produced a flat table containing only the relevant parts of the JSON. Dataprep integrates well with BigQuery and was able to directly export the resulting table for us, covering both the transform and load parts of our ETL process. We made use of scheduled runs to have Dataprep run the same job every 10 minutes.
With features such as automatically plotted histograms and an easy-to-use interface (no coding required), we were impressed with the Dataprep service and feel that it lives up to its description. We can see it being useful to data analysts who are less familiar with code, to data scientists who want an initial overview of their data, and for ad-hoc data wrangling. In a productionised system of this type, we expect exporting the recipe to Dataflow would provide more control and customisation, but for PoC purposes the tool was useful.
BigQuery is a fast, highly scalable, cost-effective, and fully managed cloud data warehouse for analytics, making it an ideal service to store our transformed data. This was the simplest part of the architecture to set up - we created a new dataset and a table within it using the fields as displayed in the Dataprep image above. While Dataprep can automatically create a table for you, we created ours manually to make use of a timestamp-partitioned table for better query efficiency. Bigtable could have been another option, as we had no immediate plans to use BigQuery's SQL-specific features; we chose BigQuery out of interest in learning more about its functionality as a data warehouse solution.
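For reference, a timestamp-partitioned table along these lines can be declared with BigQuery DDL. The project, dataset, and table names match the query later in this post, but the exact schema is a sketch rather than the one we deployed:

```sql
CREATE TABLE `traffic_project.traffic_dataset.traffic_table` (
  Timestamp TIMESTAMP,
  Latitude  FLOAT64,
  Longitude FLOAT64,
  JamFactor FLOAT64,
  Direction STRING
)
PARTITION BY DATE(Timestamp);
```

Partitioning on the ingestion timestamp means queries over a single day only scan that day's partition, which keeps costs down as the table grows.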
Cloud Composer / Cloud Functions
With Dataprep running every 10 minutes, we needed to have fresh data for it to run against. We explored two ways to achieve this: using Cloud Composer and using Cloud Functions. In both cases the underlying Python code was identical: it simply called the HERE API and uploaded the resulting JSON file to Cloud Storage.
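That shared fetch-and-upload code looked roughly like the sketch below. The endpoint, bucket name, and placeholder credentials are illustrative assumptions, not the exact values we used:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and bucket name -- illustrative, not the exact values we used
HERE_URL = "https://traffic.api.here.com/traffic/6.2/flow.json"
BUCKET = "traffic-raw-data"

def build_url(bbox, app_id, app_code):
    """Build the HERE flow API request URL for a bounding box."""
    return HERE_URL + "?" + urlencode(
        {"bbox": bbox, "app_id": app_id, "app_code": app_code})

def fetch_and_store(event=None, context=None):
    """Background Cloud Function: fetch one traffic snapshot, store it in GCS."""
    with urlopen(build_url("51.1,0.6;51.3,0.8", "<app_id>", "<app_code>")) as resp:
        payload = resp.read()
    # Imported lazily so the module also loads outside GCP
    from google.cloud import storage
    name = datetime.now(timezone.utc).strftime("traffic-%Y%m%dT%H%M%S.json")
    storage.Client().bucket(BUCKET).blob(name).upload_from_string(
        payload, content_type="application/json")
```

The same function body can be registered as a Cloud Function entry point or wrapped in an Airflow PythonOperator, which is what made the two approaches easy to compare.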
Composer is described as a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines, and is a managed version of the Apache Airflow tool. Functions is Google's Function-as-a-Service offering, allowing you to execute code on demand. The main benefits are that you only pay for compute while your code is executing and there are no instances to manage.
We worked in parallel to explore both options and ultimately settled on Functions as it was the first to be running successfully. With experience using Airflow and AWS Lambda, we were able to draw the following conclusions about using Composer and Functions for this proof-of-concept.
| Service | Pros | Cons |
| --- | --- | --- |
| Functions | Python 3.7 support; native `requirements.txt` support | No native cron functionality; online code editor could be better |
| Composer | Quick and easy to spin up a cluster, with good default Python libraries installed | Python 2.7 only; limited ability to interact with Airflow services as they are abstracted away |
Creating the Function and getting it to run was simple and intuitive. However, scheduling it to run every 10 minutes turned out to be more hassle than we had anticipated, particularly compared to AWS Lambda. Functions does not have a well-integrated way to achieve cron triggers; the suggested alternatives include using App Engine cron to publish a Pub/Sub message that triggers the function, or using Stackdriver uptime checks to call an HTTP endpoint for the function. Improved cron scheduling would greatly simplify the service, and it feels quite far behind AWS Lambda in this respect at the moment. To solve our scheduling issue, we used a running instance to trigger the function from the CLI via cron, which, whilst not production ready, got the job done!
Apache Airflow is much more complex than a simple function, but in return is much more powerful as an orchestration tool. We won’t look deeply at this here as there are many blog posts on the subject online. Getting set up with Composer was simple using the web console and the default options. After that, we experienced some issues with jobs not being recognised, and with limited access to the internals this was hard to immediately diagnose. Due to time constraints and the Function being the quickest to set up, we did not look into this much further. However, Composer shows promise and is worth investigating further as a way to run a managed Airflow cluster.
With new data being loaded into BigQuery every 10 minutes, we needed a way to analyse it. We were interested in mapping the most recent data and creating a chart of past congestion levels. We settled on Datalab, but also briefly experimented with Data Studio.
Data Studio claims to turn your data into informative dashboards and reports that are easy to read, easy to share, and fully customizable. It is similar to Google Docs in its UI, sharing, and collaborative features. We started by attempting to map the latest data by attaching our BigQuery table as a data source and using the built-in map feature. Unfortunately, the map feature proved to be very limited for our purpose: outside of the USA it can only display aggregates at a country level, rather than the road-specific view we wanted. Other data presentation features of Data Studio appear more developed. Connecting to our BigQuery table as a data source was simple, and as a collaborative tool it could fit well with other GSuite tools. However, it was not fit-for-purpose in this instance.
Datalab is described as a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform. It is based on the popular Jupyter project for code notebooks that allow you to create and share documents that contain live code, equations, visualizations and narrative text. Datalab supports Python 2 and 3, and it has strong integration with GCP data storage services, including BigQuery.
As an example, first we loaded the BigQuery plugin:
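Assuming the `google.cloud.bigquery` IPython extension, which provides the `%%bigquery` cell magic used below, that load step is simply:

```python
%load_ext google.cloud.bigquery
```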
and we are then able to run a Datalab cell such as
```sql
%%bigquery df
SELECT Timestamp, Latitude, Longitude, JamFactor
FROM `traffic_project.traffic_dataset.traffic_table`
WHERE Timestamp BETWEEN TIMESTAMP('2018-08-04') AND TIMESTAMP('2018-08-05')
  AND Direction = "+"
```
which in just over a second returns a Pandas DataFrame with all the data for a particular day in a particular direction.
We used this data to generate rolling plots of the distribution of the JamFactor value over time to get an initial insight into the data:
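The computation behind those plots can be sketched as follows; the synthetic DataFrame here stands in for the query result `df`, whose columns are assumed from the query above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the query result: 144 ten-minute snapshots,
# three road segments per snapshot (the shape is an assumption)
snapshots = pd.date_range("2018-08-04", periods=144, freq="10min")
df = pd.DataFrame({
    "Timestamp": np.repeat(snapshots, 3),
    "JamFactor": np.random.default_rng(1).uniform(0, 10, len(snapshots) * 3),
})

# Average JamFactor per snapshot, then a one-hour (six-snapshot) rolling mean
per_snapshot = df.groupby("Timestamp")["JamFactor"].mean()
rolling = per_snapshot.rolling(window=6).mean()
```

Plotting `rolling` over the day gives the smoothed congestion profile we used, and the same averaged series feeds the forecasting experiment described below.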
We then wrote some code to find the nearest timestamp in the data to any user provided timestamp (ts) and to plot the data on a map for the nearest time. We used the gmplot Python library to generate the (lat, lon) points from the data on top of Google Maps and coloured them according to the JamFactor.
```python
ts = "2018-08-05 17:00:00+00:00"

# Find the snapshot closest to the requested timestamp
nearest_ts_inx = np.argmin(np.abs(df['Timestamp'] - pd.Timestamp(ts)))
nearest_ts = df.iloc[nearest_ts_inx]['Timestamp']
df_at_ts = df.loc[df['Timestamp'] == nearest_ts]

# Plot each JamFactor group in its own colour on top of Google Maps
gmap = gmplot.GoogleMapPlotter(51.2, 0.74, 11)
for jf, group in df_at_ts.groupby('JamFactor'):
    lat = group['Latitude'].values
    lon = group['Longitude'].values
    gmap.scatter(lat, lon, colors[int(jf)], size=100, marker=False)
gmap.draw('my_map.html')
```
Finally, to prove the potential value of the data, we applied a simple ARIMA model to the time series of average JamFactor we had previously created. Being able to predict congestion would allow a customer to make decisions in advance about their logistics network, allowing them to operate more efficiently and less reactively. In a productionised system, more useful variables and a refined model would be a good starting point for this feature.
Datalab proved to be a quick way to get an initial insight into the data set. It was simple to set up and connect to (although this is currently available only through the CLI). However, the interface in Datalab is fairly cut down in comparison to Jupyter and in particular the recent Jupyter Lab interface.
Real time traffic analysis using public data
In just 3 days, our skilled team of data experts and engineers built and showcased the power of GCP by analysing live and historic traffic congestion on a major road in the UK to predict delays to freight. Using simple data visualisation tools the team were able to set rules to predict whether connecting freight should be delayed or not.
By making use of GCP's data technologies we were able to quickly prototype this demo. A benefit of the GCP platform is that it leverages a lot of existing open source tooling, such as Jupyter and Airflow. This provides familiarity and makes it very easy to get up and running. However, we'd expect to use slightly different technologies than the ones outlined here to achieve the same results in a production system. We would most likely take the time to fully leverage Composer to orchestrate this data workflow and switch to a streaming pipeline with Dataflow rather than relying on a scheduled Dataprep job.