Team Sorceress Data Engineering Project
Conjuring up analytic gold!
Our goal as Team Sorceress was to create a streamlined, scheduled Extract, Transform, Load (ETL) pipeline for the fictional Terrific Totes. By combining several orchestration tools, including AWS services such as Lambda, EventBridge, and the Simple Notification Service (SNS), we transformed raw data into a predefined star schema suitable for online analytical processing (OLAP). This schema was provided by Northcoders.
We were required to implement both data lake and data warehouse storage strategies hosted on AWS. An S3 bucket for raw data served as our data lake, while the data warehouse consisted of a transformation bucket that stored our processed data.
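One common way to keep raw extracts in a data lake bucket from overwriting each other is to write every extract under a unique, timestamped object key. The sketch below illustrates the idea only. The function name `build_raw_key`, the key layout, and the `sales_order` table name are assumptions made for this example, not details taken from the project.

```python
from datetime import datetime, timezone


def build_raw_key(table_name: str, extracted_at: datetime) -> str:
    """Return a time-partitioned object key for one raw extract.

    The layout (table/yyyy/mm/dd/table-HHMMSS.json) is a hypothetical
    convention chosen for illustration.
    """
    # Normalise to UTC so keys sort consistently regardless of the caller's zone.
    ts = extracted_at.astimezone(timezone.utc)
    return f"{table_name}/{ts:%Y/%m/%d}/{table_name}-{ts:%H%M%S}.json"


# "sales_order" is a hypothetical table name.
key = build_raw_key(
    "sales_order", datetime(2024, 5, 17, 9, 30, 0, tzinfo=timezone.utc)
)
print(key)  # sales_order/2024/05/17/sales_order-093000.json
```

Because each run produces a new key, earlier objects are left untouched and can be replayed later if a transformation needs to be rerun.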
To ensure data quality and integrity, all data within our pipeline remained immutable. The processed data was then used to populate our chosen fact table and its corresponding dimension tables.
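As an illustration of what populating a dimension table can involve, here is a minimal sketch that derives one date-dimension row from a calendar date. The column names and the function `build_dim_date_row` are hypothetical, since the actual star schema is not reproduced in this document, and the quarter is assumed to be a calendar quarter.

```python
from datetime import date


def build_dim_date_row(d: date) -> dict:
    """Build one date-dimension row (hypothetical column names)."""
    return {
        "date_id": d.isoformat(),
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.isoweekday(),      # 1 = Monday ... 7 = Sunday
        "quarter": (d.month - 1) // 3 + 1,  # calendar quarter (assumption)
    }


row = build_dim_date_row(date(2024, 5, 17))
print(row["quarter"], row["day_of_week"])  # 2 5
```

A fact row would then reference this dimension through `date_id`, which is what lets a reporting tool group figures by quarter without recomputing it.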
Our ETL pipeline involved converting data between structured and semi-structured formats. To counter performance issues, we wrote our Pandas DataFrames out as Parquet files, whose built-in compression and columnar storage format allow for fast query processing.
Deduplication was critical to the quality of the data used for analysis. To avoid mutating our real AWS infrastructure during testing, we used Moto to mock AWS services, so no real credentials were needed.
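One simple form of deduplication keeps only the most recent version of each record, identified by its key columns. The sketch below uses plain Python and assumes the rows arrive ordered from oldest to newest. The function name `deduplicate_rows` and the sample columns are hypothetical.

```python
def deduplicate_rows(rows: list[dict], key_columns: list[str]) -> list[dict]:
    """Keep the latest row per key, assuming rows are ordered oldest first."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        latest[key] = row  # a later row replaces an earlier one with the same key
    return list(latest.values())


# Hypothetical sample rows, for illustration only.
rows = [
    {"staff_id": 1, "name": "Ada"},
    {"staff_id": 2, "name": "Bo"},
    {"staff_id": 1, "name": "Ada L."},
]
print(deduplicate_rows(rows, ["staff_id"]))
# [{'staff_id': 1, 'name': 'Ada L.'}, {'staff_id': 2, 'name': 'Bo'}]
```

With data already held in a Pandas DataFrame, `DataFrame.drop_duplicates(subset=..., keep="last")` expresses the same idea.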
The project culminated in loading the processed data into Amazon QuickSight, which integrated seamlessly with our data warehouse and allowed us to visualise product performance and customer behaviour by metrics such as location, financial quarter, and product design.
The Team
Zahra Ali
Adrian Bingham-Walker
Alicia Rodriguez
Venkata Mora
Filipe Orfao
Technologies
We used AWS (EventBridge, CloudWatch, S3, Lambda, Parameter Store, Secrets Manager, and the Simple Notification Service), Moto, and Boto.
For testing and code quality, we used Pytest to write simple, scalable tests ensuring code reliability and reducing errors, and Black to enforce code style consistency.
Challenges Faced
We faced several challenges, including performance issues, data cleaning, and ensuring data integrity. To address the performance issues, we converted our Pandas DataFrames into Parquet files, which offer built-in compression and efficient storage. This sped up the Lambda function dramatically and improved the flow of our process.