As a cloud consulting company, we witness enterprise clients with a lot of data; and typical for most of these clients is that data is siloed with universal access to the information not easily transparent. Client libraries are essentially islands of misfit toys.
During an internal hackathon, Nick Centola and I decided to take up the challenge of creating an enterprise class solution that would extract, transform and load (ETL) data from multiple sources to a data warehouse, with the capability of performing advanced forecasting and in turn be 100% serverless by design that inherently keeps running cost to a minimum.
We decided to keep the scope relatively simple and used the publicly available Citi Bike NYC dataset. The Citi Bike NYC dataset has monthly trip data exported as CSV files and public and a near real-time API, which from our experience is a pattern we often see in enterprises. The diagram below represents what we were trying to achieve.
Extract Transform Load (ETL)
At 2nd Watch, we love Functions-as-a-Service (FaaS) and Cloud Functions as we can create very scalable solutions, have no infrastructure to manage, and in most instances, we will not have to worry about the cost associated with the Cloud Functions.
There were two ETL jobs to write. One was to take the zipped CSV data from the public S3 trip data bucket and land it in our Google Cloud Storage Bucket for an automated daily import into BigQuery. The other function was to grab data from the stations’ near real-time restful API endpoint and insert it into our BigQuery table.
Nick is most efficient with Python; I am most efficient with NodeJS. As both languages are acceptable production code languages for most organizations we work with, we decided to write a function in our respected preferred languages.
The data that we pulled into BigQuery was already clean. We did not need to enrich or transform the data for our purpose – this is not always the case, and cleaning and enriching data are areas where we usually spend most of our time when building similar solutions for our customers.
We wanted to enable a relatively simple forecast on bike demand on individual stations across New York City. BigQuery ML is incredibly powerful and has more than 30 built-in machine learning models. The model of choice for our use case would be the ARIMA model, which takes time series data as an input. I won’t go into too much in detail on why the ARIMA model is a good model for this as compared to the multitude of google cloud functions; the full form of the acronym describes why; Auto Regressive (AR) Integrated (I) Moving Average (MA).
Bringing it all together, we created our LookML models in Looker and interacted with the data exceptionally easily. We made a couple of heat map-based visualizations of New York City to easily visualize the popular routes and stations and a station dashboard to monitor expected supply and demand over the next hour. With the bike stations API data flowing into BQ every 5 seconds, we get a close-to-real-time dashboard that we can use for the basis of alerting staff of an inadequate number of bikes at any station across NYC.
The station forecast shows the upper and the lower bound forecast for each hour over the next month. We use the upper bound forecast for our predicted “amount of bikes in the next hour” and pull in our available bikes from the real-time API. If you use your imagination, you can think of other use cases where a similar prediction could be relevant; franchise restaurant ingredient forecasting or forecasting at retailers for inventory or staffing needs to service customers – the possibilities are endless.
One of the coolest things we did from Nick and my perspective was to drive model training and forecasting straight from Looker and LookML allowing us to essentially kick off our model training every time we receive new data in BigQuery – all from the convenient interface of Looker.
As this was a quick prototyping effort, we took a few shortcuts compared to our delivery standards at 2nd Watch. We did not use infrastructure as code, a best practice we implement for all production-ready customer engagements. Second, we decided not to worry about data quality, which would be something we would clean, enrich, and transform based on your documented business requirements. Third, we did not set up telemetry that would allow us to respond to things like slow queries and broken ETL jobs or visualizations.
Is this hard?
Yes and no. For us it was not – Nick and my combined experience accumulates to thousands of hours building and documenting data pipelines and distributed systems. If you are new to this and your data footprint includes more than a few data sources, we highly recommend that you ask for enterprise expertise in building out your pipeline. You’ll need a team with in-depth experience to help you set up LookML as this will be the foundation for self-service within your organization. Ultimately though, experiments like this can serve to create both business intelligence and allow your organizations to proactively respond to events to meet your corporate and digital transformation initiatives.
Do you want to see a demo of our solution, check out our webinars below:
- Webinar: 6 Essential Tactics for your Data & Analytics Strategy
- Webinar: Building an ML Foundation for Google BigQuery ML & Looker
Aleksander Hansson, 2nd Watch Google Cloud Specialist