Dimensional Transformers Data Project

Dimensional Transformers Demo Video

From Data to Discovery: Shaping the Future with Engineering

Our project automates the ETL workflow for Terrific Totes, leveraging Python, AWS, database modeling, and CI/CD practices. It extracts data from a PostgreSQL database on AWS RDS, transforms it with Python Lambda functions, and loads it into a PostgreSQL data warehouse. With secure credential management, CloudWatch monitoring, and Terraform and GitHub Actions automation, this scalable solution ensures quality through rigorous testing and PEP8 compliance, delivering actionable insights via a Tableau dashboard.
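To make the pipeline concrete, here is a minimal sketch of the extract stage. It assumes psycopg2 for the database connection and environment variables for credentials; the table name, variable names, and helper function are illustrative placeholders rather than the project's actual code.

    # Minimal sketch of the extract stage, assuming psycopg2 for database access.
    # The table name and environment variable names are illustrative placeholders.
    import os

    import psycopg2


    def extract_table(table_name: str) -> list[dict]:
        """Pull every row from one source table and return the rows as dicts."""
        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )
        try:
            with conn.cursor() as cursor:
                # table_name is trusted here: it comes from a fixed list, not user input
                cursor.execute(f"SELECT * FROM {table_name};")
                columns = [col[0] for col in cursor.description]
                return [dict(zip(columns, row)) for row in cursor.fetchall()]
        finally:
            conn.close()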

The Team

  • David Sheffield
  • Dorota Sawicki
  • Hamza Nazar
  • Laura Pugsley
  • Rohail Zaheer
  • William Robb

Technologies

We used: Python, AWS, Postgres, Terraform, GitHub Actions, Parquet, Tableau, Pytest

Terraform: Used for managing and deploying infrastructure as code.
Python: Used for programmatically interacting with and manipulating data.
SQL: Used to query the relational source database and load the data warehouse.
Parquet: Chosen for its efficient columnar storage of the transformed data (a short sketch of this step follows this list).
GitHub Actions: Implemented to automate workflows and minimize manual intervention throughout the project.
AWS: Provided the hosting environment for all cloud infrastructure.
Pytest: Used for test-driven development.
Tableau: Used for efficient data visualisation.
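
As an illustration of how Parquet and PyArrow fit the transform-and-store step, here is a minimal sketch that serialises extracted rows to Parquet in memory and uploads the result to S3. The bucket, key, and function names are placeholders, not the project's real configuration.

    # Minimal sketch: serialise extracted rows to Parquet in memory and upload to S3.
    # The bucket and key are illustrative, not the project's actual configuration.
    import io

    import boto3
    import pyarrow as pa
    import pyarrow.parquet as pq


    def rows_to_parquet_s3(rows: list[dict], bucket: str, key: str) -> None:
        """Write a list of row dicts to an S3 object in Parquet format."""
        table = pa.Table.from_pylist(rows)   # infer the schema from the row dicts
        buffer = io.BytesIO()
        pq.write_table(table, buffer)        # serialise to Parquet in memory
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())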

Challenges Faced

Statelessness of Lambda Functions: Lambda functions are stateless, so we needed a way to track the last updated date for data extraction. We solved this by storing the date in AWS Secrets Manager, allowing us to persist the value between invocations; on the first run, we reset the date so that all source data is processed (a sketch of this pattern follows below).
Fact Table Load Order: We encountered an issue where the fact table was loading before the dimension tables, causing reference errors. To fix this, we added a short delay after loading the dimension tables, ensuring they were fully populated first and maintaining data integrity (see the second sketch below).
Lambda Layer Dependency Issues: We faced challenges with Lambda layer compatibility for PyArrow and SQLAlchemy because of package size limits. For PyArrow, we used an AWS-managed layer; for SQLAlchemy, we created a custom GitHub library to package and deploy the specific version as a Lambda layer, resolving the compatibility issues.
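
Here is a minimal sketch of the Secrets Manager approach described above, assuming boto3 and a secret that already exists; the secret name is an illustrative placeholder.

    # Minimal sketch, assuming boto3 and a secret that already exists.
    # The secret name is an illustrative placeholder, not the project's real name.
    from datetime import datetime, timezone

    import boto3

    SECRET_NAME = "last_extracted_timestamp"  # hypothetical secret name
    client = boto3.client("secretsmanager")


    def get_last_extracted() -> str:
        """Read the timestamp recorded by the previous extraction run."""
        return client.get_secret_value(SecretId=SECRET_NAME)["SecretString"]


    def set_last_extracted(timestamp: str | None = None) -> None:
        """Persist this run's timestamp so the next invocation can pick it up."""
        timestamp = timestamp or datetime.now(timezone.utc).isoformat()
        client.put_secret_value(SecretId=SECRET_NAME, SecretString=timestamp)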
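
And a minimal sketch of the load-order fix, with illustrative table names and a generic insert helper standing in for the project's actual load code.

    # Minimal sketch of the load-order fix. Table names and the insert helper
    # are illustrative placeholders for the project's actual load code.
    import time

    DIM_TABLES = ["dim_date", "dim_staff", "dim_counterparty"]  # hypothetical names
    FACT_TABLES = ["fact_sales_order"]                          # hypothetical name


    def load_warehouse(insert_table) -> None:
        """Load dimension tables first, pause, then load the fact table(s)."""
        for table in DIM_TABLES:
            insert_table(table)   # populate dimensions so foreign keys can resolve
        time.sleep(30)            # give the dimension loads time to commit
        for table in FACT_TABLES:
            insert_table(table)   # fact rows can now reference the dimension keys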

GitHub: https://github.com/hamza8599/nc-final-project