Team Sorceress Data Engineering Project
present

Conjuring up analytic gold!

Our goal as Team Sorceress was to create a streamlined, scheduled Extract, Transform, Load (ETL) pipeline for the fictional Terrific Totes. By combining several AWS services for orchestration, including Lambda, EventBridge, and the Simple Notification Service (SNS), we transformed raw data into a predefined star schema suitable for online analytical processing (OLAP). The schema was provided by Northcoders.

We were required to implement both data lake and data warehouse storage strategies hosted on AWS. An S3 bucket for raw data served as our data lake, while the data warehouse consisted of a transformation bucket that stored our processed data.
To ensure data quality and integrity, all data within our pipeline remained immutable. The processed data was then used to populate our chosen fact table and its corresponding dimension tables.
Our ETL pipeline involved converting data between structured and semi-structured formats. To counter performance issues, we converted our Pandas DataFrames into Parquet files, whose built-in compression and columnar storage layout allow for fast query processing.
Deduplication was critical to the quality of the data used for analysis. To avoid mutating our real AWS infrastructure during testing, we used Moto to mock AWS services in place of real credentials.
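The deduplication step can be illustrated with a minimal Pandas sketch. The column names (`sales_order_id`, `last_updated`) and the rule of keeping the most recently updated version of each record are assumptions made for this example, not the project's actual schema or logic. The function returns a new DataFrame and leaves its input untouched, in keeping with the immutability principle described above.

```python
import pandas as pd


def deduplicate_latest(df: pd.DataFrame, key: str, updated_col: str) -> pd.DataFrame:
    """Keep only the most recently updated row for each key (hypothetical rule)."""
    return (
        df.sort_values(updated_col)                # oldest first, newest last
        .drop_duplicates(subset=key, keep="last")  # keep the newest row per key
        .sort_values(key)
        .reset_index(drop=True)
    )


# Hypothetical sample rows: order 1 appears twice because it was updated.
orders = pd.DataFrame(
    {
        "sales_order_id": [1, 2, 1],
        "units_sold": [10, 5, 12],
        "last_updated": pd.to_datetime(
            ["2024-01-01 09:00", "2024-01-01 09:05", "2024-01-02 14:30"]
        ),
    }
)

clean_orders = deduplicate_latest(orders, "sales_order_id", "last_updated")
```

Here `clean_orders` holds one row per order, with the later update of order 1 retained, while `orders` still contains all three original rows.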

The project culminated in loading the processed data into Amazon QuickSight, which integrated seamlessly with our data warehouse and allowed us to visualise product performance and customer behaviour by metrics such as location, financial quarter, and product design.

The Team

  • Zahra Ali

  • Adrian Bingham-Walker

  • Alicia Rodriguez

  • Venkata Mora

  • Filipe Orfao

Technologies

We used AWS (EventBridge, CloudWatch, S3, Lambda, Parameter Store, Secrets Manager, and the Simple Notification Service), along with Moto and Boto.

For testing and code quality, we used Pytest to write simple, scalable tests ensuring code reliability and reducing errors, and Black to enforce code style consistency.

Challenges Faced

We faced several challenges, including performance issues, data cleaning, and ensuring data integrity. To address the performance issues, we converted Pandas DataFrames into Parquet files, which offer built-in compression and efficient storage. This sped up our Lambda function dramatically and improved the flow of the pipeline.