Lambda Legends Data Project
Lambda Legends Demo Video
Work of Legends
This is a data engineering project that implements an end-to-end ETL (extract, transform, load) pipeline. It extracts data from an operational database, transforms it into a star schema, and loads it into an AWS-hosted data warehouse.
Current features:
Data Extraction: A Python application automatically ingests data from the totesys operational database into an S3 bucket in AWS (see the sketch after this list).
Data Transformation: A Python application processes the raw data to conform to a star schema for the data warehouse; the transformed data is stored in Parquet format in a second S3 bucket.
Data Loading: Loads the transformed data into an AWS-hosted data warehouse, populating the dimension and fact tables.
Automation: The end-to-end pipeline runs automatically, with each stage triggered by the completion of the previous data job.
Monitoring and Alerts: Logs to CloudWatch and sends SNS email alerts in case of failures.
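As a rough illustration of the Extract step, the sketch below queries one totesys table with pg8000 and writes the rows to S3 with boto3. The connection details, bucket name, and function name are illustrative assumptions, not the project's exact code.

```python
import json
from datetime import datetime, timezone

import boto3
import pg8000.native


def extract_table(table_name: str, bucket: str) -> str:
    """Query one totesys table and upload its rows to S3 as JSON (sketch)."""
    # Hypothetical connection details; a real pipeline would load credentials
    # from configuration or a secrets store, never hard-code them.
    conn = pg8000.native.Connection(
        user="totesys_user", password="***", host="db-host", database="totesys"
    )
    try:
        # table_name is assumed to come from a fixed allow-list of totesys tables.
        rows = conn.run(f"SELECT * FROM {table_name};")
        columns = [col["name"] for col in conn.columns]
        records = [dict(zip(columns, row)) for row in rows]
    finally:
        conn.close()

    # Timestamped key so each scheduled run lands in its own S3 object.
    key = f"{table_name}/{datetime.now(timezone.utc).isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(records, default=str),  # default=str covers dates/decimals
    )
    return key
```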
The Team
Pratik Shrestha
Rrezon Mripa
Joshua Man
Mirriam Karimi
Eloise Holland
Technologies
We used: pg8000, pandas, boto3, AWS Wrangler, pytest, moto, Terraform, Git and GitHub Actions.
pg8000 for connecting to and querying the PostgreSQL database.
pandas for manipulating data and transforming it into tables.
boto3 for interacting with AWS services.
AWS Wrangler for simplifying the process of writing transformed dataframes back to S3 in Parquet format during the Transform phase (see the sketch after this list).
pytest for testing.
moto for mocking AWS services during testing.
Terraform for defining and provisioning the AWS infrastructure.
Git for version control, tracking changes to the project code.
GitHub Actions for automated testing and deployment workflows, ensuring code quality and streamlining the CI/CD pipeline.
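To show how pandas and AWS Wrangler fit together in the Transform phase, here is a minimal sketch; the dimension table, column names, and bucket are assumptions for illustration rather than the project's actual schema.

```python
import awswrangler as wr
import pandas as pd


def transform_dim_currency(raw: pd.DataFrame) -> pd.DataFrame:
    """Shape a raw currency table into a star-schema dimension (illustrative columns)."""
    return raw[["currency_id", "currency_code"]].drop_duplicates()


def write_parquet(df: pd.DataFrame, bucket: str, table_name: str) -> None:
    # awswrangler collapses the dataframe -> parquet -> S3 plumbing into one call,
    # which is the main reason to reach for it over hand-rolled serialisation.
    wr.s3.to_parquet(df=df, path=f"s3://{bucket}/{table_name}.parquet")
```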
Challenges Faced
We faced challenges during data extraction, as we wanted to avoid saving the data to our local machines (see the sketch below). We also found that Terraform changes were not automatically reflected in our Lambda functions.
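One way to keep extracted data off the local filesystem, sketched here under assumed names and not necessarily the exact approach we shipped, is to serialise the dataframe into an in-memory buffer and hand the bytes straight to S3:

```python
import io

import boto3
import pandas as pd


def upload_parquet_in_memory(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Serialise a dataframe to Parquet in memory and upload it to S3."""
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # requires pyarrow or fastparquet
    buffer.seek(0)
    # Nothing is ever written to the Lambda's filesystem; the bytes go
    # directly from memory to the S3 object.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
```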