Green Bean Solutions Data Project
Green Bean Solutions Demo Video
We make data eat its greens.
The client was Terrific Totes, a company that makes branded and customised tote bags for retail and corporate clients.
Our brief was to take their existing commercial data, generated by their Sales, Finance, Purchasing, Dispatch, Production and Design teams, and transform it into a format their business analysts can easily interrogate, allowing the company to build insight and improve performance.
An automated pipeline was required to extract any updates made to Terrific Totes’ relational database, Totesys, transform the data, and load it into a data warehouse, ready for querying. New data had to reach the warehouse within a maximum of 30 minutes of its appearance in Totesys, whilst maintaining a complete history of changes.
The Team
Ruby Hirsch
Tom Bracey
Daniele Conti
Minal Patil
Donald Futers
Alistair Shepherd
Technologies
We used a wide range of tech to build our data pipeline. Interactions between our Python code and Terrific Totes’ existing SQL database are handled by the Postgres driver module pg8000, which allows us to connect to the database and continually identify any new data. A serverless Lambda function in Amazon Web Services, or AWS, ingests the new data, converts it into the flexible JSON format, and uploads it to an S3 bucket in the cloud.

Once all of the new data has been ingested into the first S3 bucket, a second AWS Lambda function is automatically invoked. This function uses pandas to transform the data from the original, write-optimised schema into the star schema of the destination warehouse, reorganising and in some cases combining tables, and automatically locating the most up-to-date version of each row. The processed data is stored in the highly efficient, columnar parquet format, further reducing the amount of storage space needed, and saved to a second S3 bucket. At that point, a third Lambda function loads the rows prepared in the parquet files into our AWS RDS-based data warehouse.

The workflow of this pipeline is orchestrated by an AWS Step Functions state machine, triggered automatically at regular intervals by a job scheduler. Errors at any stage of the process are picked up by the AWS CloudWatch logs attached to each of the Lambda functions, and serious errors trigger email alerts. Finally, once we had finished building the pipeline, we used Apache Superset to visualise the most important business insights we drew from the data.
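As a rough illustration of the transform step, here is a minimal sketch, assuming hypothetical bucket names and using the design table as an example; the real function applies the same pattern to every table in the star schema.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")


def transform_design(ingestion_key: str) -> None:
    """Reshape the raw 'design' JSON into a dim_design parquet file (sketch)."""
    # Read the ingested JSON file for this batch from the first bucket.
    raw = s3.get_object(Bucket="ingestion-bucket", Key=ingestion_key)
    df = pd.read_json(io.BytesIO(raw["Body"].read()))

    # Keep only the most recent version of each row, then select the columns
    # the warehouse dimension table expects.
    dim_design = (
        df.sort_values("last_updated")
        .drop_duplicates("design_id", keep="last")
        [["design_id", "design_name", "file_location", "file_name"]]
    )

    # Write the result as parquet (requires pyarrow) to the processed bucket.
    buffer = io.BytesIO()
    dim_design.to_parquet(buffer, index=False)
    s3.put_object(
        Bucket="processed-bucket",
        Key=ingestion_key.replace(".json", ".parquet"),
        Body=buffer.getvalue(),
    )
```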
We used AWS because it provides a comprehensive and integrated suite of cloud-computing services, which allowed us to efficiently bring together the different parts of the project while reducing setup time and effort.
Challenges Faced
There were challenges along the way, from the restrictive layer size limits in AWS Lambda to the need to provide sensitive login details to remotely executed scripts. The latter called for a combination of GitHub Actions secrets and AWS Secrets Manager, allowing our ETL functions and test suites to make the connections they needed without compromising on security.
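For example, a Lambda function can fetch the Totesys credentials at runtime rather than storing them anywhere in the codebase. The sketch below uses a hypothetical secret name, but it reflects the approach:

```python
import json

import boto3
import pg8000.native


def get_totesys_connection() -> pg8000.native.Connection:
    """Build a database connection from credentials held in Secrets Manager."""
    client = boto3.client("secretsmanager")
    # Hypothetical secret name; the secret value is a JSON string of credentials.
    secret = json.loads(
        client.get_secret_value(SecretId="totesys/credentials")["SecretString"]
    )
    return pg8000.native.Connection(
        user=secret["user"],
        password=secret["password"],
        host=secret["host"],
        port=int(secret.get("port", 5432)),
        database=secret["database"],
    )
```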
A particularly consequential challenge arose in directing the second and third stages of our ETL process. The extraction stage recognises new rows in the Totesys database by comparing their ‘last updated’ time to the time of its own last check. If the transform and load stages had employed similar logic, they would only have repeated work already done, costing Terrific Totes money and placing unnecessary strain on server resources.
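In rough terms, that check looks like the hypothetical helper below (not our exact code, but the same idea):

```python
import pg8000.native


def fetch_new_rows(conn: pg8000.native.Connection, table: str, since: str) -> list[dict]:
    """Return only the rows updated since the previous successful extraction."""
    # Table names come from a fixed allow-list, never from user input.
    rows = conn.run(
        f"SELECT * FROM {table} WHERE last_updated > :since", since=since
    )
    columns = [col["name"] for col in conn.columns]
    return [dict(zip(columns, row)) for row in rows]
```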
We opted for a low-overhead output format: a tiny packet detailing which tables have and haven’t received updates, along with the date and time at which the current extraction began. That timestamp is carried through the whole execution of the pipeline; it’s how we identify all of the files stored in the buckets for the current batch, with the result that the processing and loading functions have no preliminary work left to do. They’re told exactly what to look for and where to look, and can just get on with it. And in the worst-case scenario, the packet’s human-readable format also makes it easy for maintenance teams to bring the system back up to date if they’re alerted to errors occurring mid-process, whether they do so by hand or automatically using the execution logs.
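To give a feel for it, the packet looks roughly like the dictionary below (the field names are illustrative rather than the exact production schema), and the downstream functions can derive every S3 key for the batch directly from it:

```python
# Illustrative shape of the packet handed from the extraction Lambda to the
# later stages.
packet = {
    "extraction_started_at": "2024-05-01T09:30:00",
    "updated_tables": ["sales_order", "design"],
    "unchanged_tables": ["staff", "currency", "counterparty"],
}


def keys_for_batch(packet: dict) -> list[str]:
    """Derive the ingestion-bucket keys for the current batch's files."""
    timestamp = packet["extraction_started_at"]
    return [f"{table}/{timestamp}.json" for table in packet["updated_tables"]]


# e.g. ['sales_order/2024-05-01T09:30:00.json', 'design/2024-05-01T09:30:00.json']
print(keys_for_batch(packet))
```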