Team Ketts Lough Final Project

Team Ketts Lough Demo Video

ETL Data Platform for ToteSys to AWS Data Warehouse

This project involved designing and implementing a data platform to extract, transform, and load (ETL) data from ToteSys (a PostgreSQL database) into a Data Warehouse hosted on AWS. The system runs on a 30-minute schedule and follows these steps:

Extraction: A Lambda function extracts data from ToteSys and stores it in the Ingestion S3 bucket.
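
A minimal sketch of what the extraction Lambda could look like, assuming psycopg2 as the PostgreSQL driver and boto3 for S3; the table name, bucket name, and connection details are illustrative rather than taken from the real codebase:

```python
import json
from datetime import datetime, timezone

import boto3
import psycopg2  # assumed PostgreSQL driver; any client library would do

s3 = boto3.client("s3")
INGESTION_BUCKET = "ingestion-bucket"  # illustrative name


def lambda_handler(event, context):
    """Extract rows updated since the last run and store them in Ingestion S3 as JSON."""
    # In the real pipeline the connection details come from Secrets Manager.
    conn = psycopg2.connect(
        host="totesys-host", dbname="totesys", user="user", password="password"
    )
    try:
        with conn.cursor() as cur:
            # Only pick up rows changed since the previous 30-minute run.
            cur.execute(
                "SELECT * FROM sales_order WHERE last_updated > %s",
                (event.get("last_extracted", "1970-01-01"),),
            )
            columns = [col[0] for col in cur.description]
            records = [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        conn.close()

    key = f"sales_order/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(
        Bucket=INGESTION_BUCKET, Key=key, Body=json.dumps(records, default=str)
    )
    return {"extracted": len(records), "key": key}
```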

Transformation: Another Lambda function processes the latest data from Ingestion S3, converts it into a star schema, and saves the transformed data in the Processed S3 bucket.
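
To illustrate the idea, here is a hedged sketch of a transformation step that reshapes one raw JSON file into star-schema tables and writes them as Parquet; the column and table names are examples, not the project's actual schema:

```python
import io
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")
INGESTION_BUCKET = "ingestion-bucket"   # illustrative names
PROCESSED_BUCKET = "processed-bucket"


def transform_sales(raw_key: str) -> None:
    """Read the latest raw JSON from Ingestion S3, split it into fact and
    dimension tables, and write each one to Processed S3 as Parquet."""
    body = s3.get_object(Bucket=INGESTION_BUCKET, Key=raw_key)["Body"].read()
    df = pd.DataFrame(json.loads(body))

    # Fact table: measures plus foreign keys pointing at the dimensions.
    fact_sales_order = df[[
        "sales_order_id", "staff_id", "counterparty_id",
        "units_sold", "unit_price", "created_at",
    ]]

    # Example dimension: de-duplicated staff records.
    dim_staff = df[["staff_id", "first_name", "last_name"]].drop_duplicates()

    for name, table in {"fact_sales_order": fact_sales_order, "dim_staff": dim_staff}.items():
        buffer = io.BytesIO()
        table.to_parquet(buffer, index=False)  # needs pyarrow or fastparquet installed
        s3.put_object(Bucket=PROCESSED_BUCKET, Key=f"{name}/latest.parquet",
                      Body=buffer.getvalue())
```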

Loading: A final Lambda function loads the processed data from Processed S3 into the Data Warehouse.
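
A possible shape for the loading step, assuming the warehouse is reachable as a PostgreSQL database and using pandas with SQLAlchemy; the connection string and key layout are placeholders:

```python
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")
PROCESSED_BUCKET = "processed-bucket"  # illustrative name


def load_table(parquet_key: str, table_name: str) -> int:
    """Read one processed Parquet file and append its rows to the warehouse table."""
    body = s3.get_object(Bucket=PROCESSED_BUCKET, Key=parquet_key)["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))

    # In the real pipeline the credentials would come from Secrets Manager.
    engine = create_engine("postgresql+psycopg2://user:password@warehouse-host:5432/warehouse")
    df.to_sql(table_name, engine, if_exists="append", index=False)
    return len(df)
```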

The Team

  • Vojtech Nozicka

  • Matthew Reynolds

  • Marcin Sodel

  • Hussein Alsakkaf

  • Yanrong Zhang

Technologies

Python, AWS Services, S3, Lambda, CloudWatch, EventBridge, Pandas, JSON, Parquet, Terraform

Tech Stack & Key Components: Python: v3.12 AWS Services: - S3: Storage for raw and transformed data. - Lambda: Serverless compute for ETL tasks. - CloudWatch: Monitoring and logging. - EventBridge: Triggers for scheduled execution. - Step Functions: Workflow orchestration. - IAM: Manages permissions for various services - Secrets Manager: Securely stores credentials Pandas: Used for data transformation. Data Formats: JSON (raw ingestion) and Parquet (processed storage). Infrastructure as Code (IaC): Terraform for provisioning AWS resources. CI/CD: Automated deployment and testing.

This system ensures efficient and reliable data movement from ToteSys to the Data Warehouse, supporting analytics and reporting needs.

Challenges Faced

Aside from the expected challenges, such as limited time, handling large and continuously updated datasets, and managing timestamps to filter data appropriately, one of the more unexpected ones was building the date dimension table. Our initial approach was to generate dates dynamically, but because of the long processing time this required, we switched to a static date range to improve performance. This ensured that the Lambda created the date dimension only once and did not recreate it if it already existed.
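
A rough sketch of that approach, assuming the date dimension is stored as Parquet in the Processed bucket; the bucket name, key, and date range are placeholders:

```python
import io

import boto3
import pandas as pd
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
PROCESSED_BUCKET = "processed-bucket"          # illustrative names
DIM_DATE_KEY = "dim_date/dim_date.parquet"


def create_dim_date_once(start="2020-01-01", end="2030-12-31") -> None:
    """Build the date dimension from a fixed range, but only if it is not
    already present in the Processed bucket."""
    try:
        s3.head_object(Bucket=PROCESSED_BUCKET, Key=DIM_DATE_KEY)
        return  # already created on a previous run, nothing to do
    except ClientError:
        pass  # treated as missing (a stricter check would inspect the error code)

    dates = pd.DataFrame({"date_id": pd.date_range(start=start, end=end, freq="D")})
    dates["year"] = dates["date_id"].dt.year
    dates["month"] = dates["date_id"].dt.month
    dates["day"] = dates["date_id"].dt.day
    dates["day_of_week"] = dates["date_id"].dt.dayofweek

    buffer = io.BytesIO()
    dates.to_parquet(buffer, index=False)
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=DIM_DATE_KEY, Body=buffer.getvalue())
```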