Dimensional Transformers Data Project

Dimensional Transformers Demo Video

From Data to Discovery: Shaping the Future with Engineering

Our project automates the ETL workflow for Terrific Totes, leveraging Python, AWS, database modeling, and CI/CD practices. It extracts data from a PostgreSQL database on AWS RDS, transforms it with Python Lambda functions, and loads it into a PostgreSQL data warehouse. With secure credential management, CloudWatch monitoring, and Terraform and GitHub Actions automation, this scalable solution ensures quality through rigorous testing and PEP8 compliance, delivering actionable insights via a Tableau dashboard.
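To make the pipeline concrete, here is a minimal sketch of the extract stage. It assumes psycopg2 for the database connection and environment variables for credentials; the table name, variable names, and helper function are illustrative placeholders rather than the project's actual code.

    # Minimal sketch of the extract stage, assuming psycopg2 for database access.
    # The table name and environment variable names are illustrative placeholders.
    import os

    import psycopg2


    def extract_table(table_name: str) -> list[dict]:
        """Pull every row from one source table and return the rows as dicts."""
        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )
        try:
            with conn.cursor() as cursor:
                # table_name is trusted here: it comes from a fixed list, not user input
                cursor.execute(f"SELECT * FROM {table_name};")
                columns = [col[0] for col in cursor.description]
                return [dict(zip(columns, row)) for row in cursor.fetchall()]
        finally:
            conn.close()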

The Team

  • David Sheffield
  • Dorota Sawicki
  • Hamza Nazar
  • Laura Pugsley
  • Rohail Zaheer
  • William Robb

Technologies

We used: Python, AWS, Postgres, Terraform, GitHub Actions, Parquet, Tableau, Pytest

Terraform: Used for managing and deploying infrastructure as code.
Python: Used for programmatically interacting with and manipulating data.
SQL: Used to query the relational source database and load the data warehouse.
Parquet: Chosen for its efficient columnar storage of the transformed data (a short sketch of this step follows this list).
GitHub Actions: Implemented to automate workflows and minimize manual intervention throughout the project.
AWS: Provided the hosting environment for all cloud infrastructure.
Pytest: Used for test-driven development.
Tableau: Used for efficient data visualisation.
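
As an illustration of how Parquet and PyArrow fit the transform-and-store step, here is a minimal sketch that serialises extracted rows to Parquet in memory and uploads the result to S3. The bucket, key, and function names are placeholders, not the project's real configuration.

    # Minimal sketch: serialise extracted rows to Parquet in memory and upload to S3.
    # The bucket and key are illustrative, not the project's actual configuration.
    import io

    import boto3
    import pyarrow as pa
    import pyarrow.parquet as pq


    def rows_to_parquet_s3(rows: list[dict], bucket: str, key: str) -> None:
        """Write a list of row dicts to an S3 object in Parquet format."""
        table = pa.Table.from_pylist(rows)   # infer the schema from the row dicts
        buffer = io.BytesIO()
        pq.write_table(table, buffer)        # serialise to Parquet in memory
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())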

Challenges Faced

Statelessness of Lambda Functions: Lambda functions are stateless, so we needed a way to track the last updated date for data extraction. We solved this by storing the date in AWS Secrets Manager, allowing us to persist the value between invocations; on the first run, we reset the date so that all source data is processed (a sketch of this pattern follows below).
Fact Table Load Order: We encountered an issue where the fact table was loading before the dimension tables, causing reference errors. To fix this, we added a short delay after loading the dimension tables, ensuring they were fully populated first and maintaining data integrity (see the second sketch below).
Lambda Layer Dependency Issues: We faced challenges with Lambda layer compatibility for PyArrow and SQLAlchemy because of package size limits. For PyArrow, we used an AWS-managed layer; for SQLAlchemy, we created a custom GitHub library to package and deploy the specific version as a Lambda layer, resolving the compatibility issues.
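
Here is a minimal sketch of the Secrets Manager approach described above, assuming boto3 and a secret that already exists; the secret name is an illustrative placeholder.

    # Minimal sketch, assuming boto3 and a secret that already exists.
    # The secret name is an illustrative placeholder, not the project's real name.
    from datetime import datetime, timezone

    import boto3

    SECRET_NAME = "last_extracted_timestamp"  # hypothetical secret name
    client = boto3.client("secretsmanager")


    def get_last_extracted() -> str:
        """Read the timestamp recorded by the previous extraction run."""
        return client.get_secret_value(SecretId=SECRET_NAME)["SecretString"]


    def set_last_extracted(timestamp: str | None = None) -> None:
        """Persist this run's timestamp so the next invocation can pick it up."""
        timestamp = timestamp or datetime.now(timezone.utc).isoformat()
        client.put_secret_value(SecretId=SECRET_NAME, SecretString=timestamp)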
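
And a minimal sketch of the load-order fix, with illustrative table names and a generic insert helper standing in for the project's actual load code.

    # Minimal sketch of the load-order fix. Table names and the insert helper
    # are illustrative placeholders for the project's actual load code.
    import time

    DIM_TABLES = ["dim_date", "dim_staff", "dim_counterparty"]  # hypothetical names
    FACT_TABLES = ["fact_sales_order"]                          # hypothetical name


    def load_warehouse(insert_table) -> None:
        """Load dimension tables first, pause, then load the fact table(s)."""
        for table in DIM_TABLES:
            insert_table(table)   # populate dimensions so foreign keys can resolve
        time.sleep(30)            # give the dimension loads time to commit
        for table in FACT_TABLES:
            insert_table(table)   # fact rows can now reference the dimension keys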

GitHub: https://github.com/hamza8599/nc-final-project