Building Data Science Platforms – A Solution for Powercor
On Thursday, 8 September 2016, I gave a talk on Powercor’s data-science platform at the Melbourne Data Science Meetup, hosted at the AGL office. The talk briefly covered how the architecture of the platform was conceived and how the platform was implemented. Below is an excerpt from my talk, giving some insight into building a platform that can handle the rigour of data modelling and analysis.
My presentation was the final part of a three-part session given by members of the Powercor team who helped put together the data-science platform: its research, design, and implementation. The other speakers were Peter McTaggart and Adel Foda, with an introduction by Jonathan Chang.
You can find the same content on the Silverpond blog.
Platform
Problems → Think → Enlightenment → Build → Solution
Problems
If you aim to build a data-science platform, you are primarily motivated to enable and facilitate doing data science. This generally means that you want to be able to run models and perform analysis over your organisation’s data at scale and at speed. Although scale and speed are the constraints that immediately jump out, they are by no means the only factors that will determine the success of such a platform. All of the following are critical to the construction, uptake, maintenance, operation, and further development of a data-science platform:
- Reproducibility
- Scalability
- Historicity
- Interoperability
- Portability
- Longevity
- Comprehensibility
- Tractability
- Sensitivity
Why are these aspects significant?
They can be thought of as requirements: without any one of them, a platform becomes untenable.
Think
Factoring and Reduction of Problems
Although the requirements as stated are all individually important, they are not orthogonal. There is a smaller set of success factors that can be tackled individually and that, together, address all of the requirements. After enough navel-gazing, these principles emerge (illustrated in the sketch after this list):
- Encapsulation
- Idempotency / Repeatability
- Purity
- Compositionality
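To make these principles concrete, here is a minimal Python sketch of a task runner exhibiting purity and idempotency. It is illustrative only; the names and structure are invented for this post and are not taken from the platform itself:

```python
import hashlib
import json


def run_task(transform, inputs, store):
    """Run `transform` over `inputs`, caching the result under a key
    derived purely from the task name and its inputs.

    Purity: the result depends only on the arguments.
    Idempotency: re-running with the same inputs is a cache hit,
    not a recomputation.
    Encapsulation: callers interact only through this interface.
    """
    key = hashlib.sha256(
        json.dumps({"task": transform.__name__, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in store:
        store[key] = transform(inputs)
    return store[key]


def double_readings(readings):
    return [2 * r for r in readings]


store = {}
run_task(double_readings, [1, 2, 3], store)  # computes and caches
run_task(double_readings, [1, 2, 3], store)  # idempotent: cache hit
```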
Enlightenment
Build Systems ~ Functions
The conclusion we can draw from these factored requirements is that we essentially want a build system. If you’re familiar with make, then you’ll know what I’m talking about. On top of this, the system should be as simple as possible, ideally representable conceptually as a pure function:
result = function( arguments )
There are several arguments common to all models running on the platform:
report = build( data-collected-to-date, date-range-of-query, …)
Any additional arguments are model-specific. For example:
field-inspection-priorities = rank-faults( data, 2014-2015, threshold → 1ohm )
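Rendered as actual Python, such a model-as-pure-function might look like the following. This is purely illustrative; the field names and ranking logic are invented for this post:

```python
from datetime import date


def rank_faults(data, date_range, threshold_ohms=1.0):
    """A model as a pure function: the report depends only on the
    data snapshot and the query parameters, never on hidden state."""
    start, end = date_range
    in_range = [r for r in data if start <= r["date"] <= end]
    faults = [r for r in in_range if r["resistance_ohms"] <= threshold_ohms]
    return sorted(faults, key=lambda r: r["resistance_ohms"])


field_inspection_priorities = rank_faults(
    data=[{"asset": "pole-7", "date": date(2014, 6, 1), "resistance_ohms": 0.4}],
    date_range=(date(2014, 1, 1), date(2015, 12, 31)),
    threshold_ohms=1.0,
)
```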
With this concept at the heart of our platform, compositionality becomes the enabling factor of construction and reuse.
With such a methodology in play, many of the traditional pain points, such as synchronisation issues, difficulty running reports over “new-historical” data, caching, work distribution, and validation, simplify greatly or evaporate entirely.
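This is exactly the style of workflow that Luigi, the dependency-resolution tool in the component table below, encodes: each task declares its inputs, and an output that is a pure function of its parameters, so completed work is never redone and pipelines compose. A minimal sketch, with task names, paths, and logic invented for illustration:

```python
import luigi


class CollectData(luigi.Task):
    """Upstream task producing the data snapshot for a date range."""
    date_interval = luigi.DateIntervalParameter()

    def output(self):
        # The output path is a pure function of the parameters, so a
        # completed task is detected and never re-run (idempotency).
        return luigi.LocalTarget(f"data/{self.date_interval}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("asset,resistance_ohms\n")  # extraction logic elided


class RankFaults(luigi.Task):
    """Downstream model; composition is declared via requires()."""
    date_interval = luigi.DateIntervalParameter()
    threshold = luigi.FloatParameter(default=1.0)

    def requires(self):
        return CollectData(date_interval=self.date_interval)

    def output(self):
        return luigi.LocalTarget(
            f"reports/faults-{self.date_interval}-{self.threshold}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # ranking logic elided
```

Invoking `luigi.build([RankFaults(...)], local_scheduler=True)` resolves the whole dependency graph and skips any task whose output already exists, which is where the caching and work-distribution pain points dissolve.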
Solution
A solution doesn’t exist in a purely conceptual space, however, and a lot of engineering work is required to realise an architecture that enables such theoretical principles. At Powercor we adopted a layered architecture to provide different “contexts” for the implementation of the platform:
- Environment
- System
- Controller
- Pipeline
- Task
- Cluster
- Boundary
- Process
This not only provided clarity about what each piece of development was mandated to interact with, but also leveraged the expertise and roles within the team, so that each individual could focus on their own work and trust that the boundaries were codified with enough rigour that their work and the work of their colleagues would play nicely together. Thus the roles of data scientist, infrastructure engineer, platform engineer, tester, and tech lead were kept as focused and clearly defined as possible.
The components that allowed us to build these contextual layers were as follows:
| Component | Technology |
| --- | --- |
| Cloud Platform | AWS |
| Data Storage | AWS S3 |
| Data-Event Propagation | AWS SQS / Apache Kafka |
| Distributed Computation | Apache Spark / AWS EMR |
| SQL Data-Query Interface | Apache Drill |
| Container Execution | Docker / Docker Swarm |
| Data Dependency Resolution | Luigi |
| Interactive Exploratory Environment | Zeppelin |
| Metadata Storage | Postgres |
| Infrastructure Automation | Ansible |
| Version Control | BitBucket |
| Internal Applications | Internal |
Illuminated
The project was a successful collaboration between Powercor, Silverpond, and Peter to create a new platform for a powerful data-science capability within the business. This result was achieved in a short time span and, with Powercor’s data, enabled deeper insights than were previously within reach.
Thanks to Powercor and Peter for the opportunity to share our collaboration, as well as to Data Science Melbourne and AGL for hosting a fantastic event.