Shopify Ecosystem

PowerReviews Engineering Digest: How We Modernized Our Data ETLs

powerreviews-engineering-digest:-how-we-modernized-our-data-etls

At PowerReviews, our engineering team works on a diverse range of projects. It’s a great place to learn a ton, build skills and grow your career. Read on to learn about just one of these projects – which we were recently working on.

We are always looking for ways to evolve our technology stack with an eye towards scalability.  A recent challenge we faced was the stability of our data ETL processes, which are critical to supporting our analytics platform. 

For context, our ETL processes handle up to a million updates every day and account for 250gb of daily data transfer.  So to say they are a big deal is an understatement.

We recently adopted AWS’s Data Migration Service to overhaul how we move data from our core transactional databases to our Analytics Warehouse in Snowflake. This adoption reduced a multi-day, high-maintenance, data load process to a mostly automated, hours long job.

We wanted to share how we accomplished this 1) to give a flavor of the sorts of challenges we face every day at PowerReviews and 2) in case you are facing a similar challenge.

We started by evaluating different approaches and technologies for streamlining our warehouse loading process. These included AWS Kinesis, RDS Postgres Snapshot Exports, and using the wal2json plugin offered by the Postgres community. 

Each of these solutions were evaluated against our previous ETL implementation in the following areas: operational cost, scope of architecture changes required to implement, and overall completeness of the data capture capability offered by the technology.

We landed on AWS’s Database Migration Service (DMS) because it offered us a managed solution towards leveraging Postgres Write-ahead Logging (WAL) replication to source data for our analytics platform.

AWS DMS – at its core – is an abstraction layer over Postgres’s WAL replication that allows us to concentrate on the data itself, leaving the low level details of the implementation to AWS to manage. In general we look toward managed solutions to keep our DevOps team lean and mean.

With our technology choice made, we went to work on an implementation. Our DMS solution connected DMS replication instances with each of our databases.

These instances included DMS task definitions which defined which tables to capture change data and deliver that change data as JSON files to S3 storage buckets. The file uploads to S3 trigger notifications that initiate loading of files into Snowflake, our data warehouse technology.  

The file loading process then orchestrates reconciliation of insert/update/delete actions to our analytics source of record.  

Another choice we made along the way was to use a tool called dbt. dbt lets our data analysts take ownership of the entire analytics engineering workflow.

With dbt in place, we are able to hand off further ingestion of our data to our analysts – who handle the incorporation of this data into aggregate views. These drive our public facing analytics reports in Tableau.

As a result of our migration to DMS, and the introduction of dbt, we’ve realized a ton of benefits to the business. 

These include moving our engineering focus from tuning of bulky ETL queries, managing database performance, and chasing down data discrepancies to instead expanding the reach of our analytics platform and exploring new and valuable insights in our data that we can provide to our customers.

If you’re facing similar ETL challenges, we would love to hear from you. And if you’re interested in tackling similar tasks, we are in growth mode so are currently hiring extensively in engineering.

Special thanks to our friends at PowerReviews for their insights on this topic.
I'm also on
Latest posts by Steve Hutt (see all)