team-01-data-squid Final Project

team-01-data-squid Demo Video
For ink-redible insights, dive into data with Data Squid
As part of this three-week project, we developed a robust ETL (Extract, Transform, Load) data pipeline. Built by Carlos Byrne, Liam Biggar, Nicolas Tolksdorf, Shay Doherty, Girish Joshi, and Ethan Labouchardiere, the pipeline extracts data from an operational database (totesys) and loads it into an AWS-based data lake and data warehouse. Our architecture leverages AWS services such as S3 for data storage, Lambda functions for data processing, EventBridge for orchestration, and CloudWatch for monitoring. We implemented a CI/CD pipeline using GitHub Actions and used Terraform for infrastructure as code. The project focuses on creating a scalable, automated data platform to support analytical reporting and business intelligence, with a Sales star schema as the initial Minimum Viable Product (MVP).
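To give a flavour of the ingestion stage, here is a minimal sketch of how an extraction Lambda could be wired together: credentials are fetched from Secrets Manager, rows are read from totesys with PG8000, and the results are written to the raw-data S3 bucket as JSON. The secret name, bucket name, and table list below are illustrative placeholders, not the project's actual identifiers.

```python
# Sketch of an extraction Lambda; the secret name, bucket name, and table
# list are assumptions for illustration, not the real project values.
import json
from datetime import datetime, timezone

import boto3
import pg8000


def get_connection():
    """Read database credentials from Secrets Manager and open a connection."""
    secret = boto3.client("secretsmanager").get_secret_value(
        SecretId="totesys-db-credentials"  # assumed secret name
    )
    creds = json.loads(secret["SecretString"])
    return pg8000.connect(
        user=creds["user"],
        password=creds["password"],
        host=creds["host"],
        port=int(creds["port"]),
        database=creds["database"],
    )


def lambda_handler(event, context):
    """Extract a few tables and land them in S3 as timestamped JSON files."""
    conn = get_connection()
    s3 = boto3.client("s3")
    run_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")

    for table in ("sales_order", "staff", "currency"):  # illustrative subset
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table};")
        columns = [desc[0] for desc in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]

        s3.put_object(
            Bucket="data-squid-ingestion",  # assumed bucket name
            Key=f"{table}/{run_time}.json",
            Body=json.dumps(rows, default=str),  # default=str handles datetimes
        )

    conn.close()
    return {"tables_extracted": 3, "run_time": run_time}
```

In the real pipeline this would be triggered on a schedule by EventBridge, with CloudWatch capturing the logs and SNS notifying us of failures.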
The Team
Liam Biggar
Ethan Labouchardiere
Carlos Byrne
Girish Joshi
Shay Doherty
Nicolas Tolksdorf
Technologies

Core Scripting & Data Handling: - Python – Core scripting language - PG8000 – Secure PostgreSQL interaction - SQL – Querying for data extraction Serverless & Storage: - AWS Lambda – Executes Python code - AWS S3 – Stores raw & processed data - Parquet – Optimised columnar storage Data Processing: - JSON – Used for structured data exchange - Pandas – Data manipulation & transformation - AWS RDS – Managed relational database Orchestration & Automation: - AWS Event Bridge – Schedules & triggers Monitoring & Alerts: - AWS Cloud Watch – Logging & Monitoring - AWS SNS – Email notifications for failures Security & Integration: - AWS Secrets Manager – Manages credentials securely - Boto3 – Connects Python to AWS services Deployment & Infrastructure: - Terraform – Defines & manages AWS infrastructure - GitHub Actions – Automates CI/CD pipeline.
- Python: Powered the ETL (Extract, Transform, Load) functions for data processing.
- PG8000 and SQL: Facilitated database interactions and queries.
- AWS Lambda: Automated the execution of the various ETL tasks.
- S3: Acted as the data lake for storing raw and processed data.
- JSON and Pandas: Used for transforming and manipulating data structures.
- Parquet: Stored processed data efficiently before loading it into the data warehouse (see the transform sketch after this list).
- AWS EventBridge: Managed event-driven task execution.
- CloudWatch: Enabled logging and monitoring of the pipeline.
- SNS (Simple Notification Service): Sent error alerts for better error handling.
- Boto3: Simplified AWS service interactions.
- Secrets Manager: Provided secure storage for sensitive credentials.
- Terraform: Facilitated infrastructure management as code.
- GitHub Actions: Automated CI/CD (Continuous Integration and Deployment) for the pipeline.
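The sketch below shows roughly how these tools combine in the transform stage: a raw JSON extract is read from S3 with Boto3, reshaped into a dimension table with Pandas, and written back to the processed bucket as Parquet. The bucket names, the dim_currency columns, and the transform_currency helper are assumptions for illustration only, and writing Parquet assumes a pyarrow (or fastparquet) layer is attached to the Lambda.

```python
# Sketch of a transform step: raw JSON in, dim_currency Parquet out.
# Bucket names and column choices are illustrative assumptions.
from io import BytesIO

import boto3
import pandas as pd

INGESTION_BUCKET = "data-squid-ingestion"  # assumed name
PROCESSED_BUCKET = "data-squid-processed"  # assumed name


def transform_currency(raw_key: str) -> str:
    """Turn a raw currency extract into a dim_currency Parquet file."""
    s3 = boto3.client("s3")

    # Load the raw JSON extract straight into a DataFrame.
    obj = s3.get_object(Bucket=INGESTION_BUCKET, Key=raw_key)
    df = pd.read_json(obj["Body"])

    # Keep only the columns the star schema needs and drop any duplicates.
    dim_currency = df[["currency_id", "currency_code"]].drop_duplicates()

    # Serialise to Parquet in memory and upload to the processed bucket.
    buffer = BytesIO()
    dim_currency.to_parquet(buffer, index=False)
    out_key = "dim_currency/" + raw_key.replace(".json", ".parquet")
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=out_key, Body=buffer.getvalue())
    return out_key
```

A separate load Lambda then reads the Parquet files and inserts them into the warehouse tables on AWS RDS.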
Challenges Faced
- Transforming datetime columns into the correct format (see the sketch after this list)
- Package size limit for the Lambda layers
- IAM role testing and permissions
- CI/CD deployment
- Getting testing coverage above 90%
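As an example of the kind of datetime wrangling involved, the snippet below shows one way to parse a raw timestamp column with Pandas and split it into the separate date and time columns a star-schema table typically expects. The created_at column name and sample values are placeholders, not taken from totesys itself.

```python
# Illustration of reformatting a raw timestamp column with Pandas;
# the column names and sample values are assumptions for the example.
import pandas as pd

raw = pd.DataFrame(
    {"created_at": ["2024-11-03 14:20:52.186000", "2024-11-03 14:20:52.188000"]}
)

# Parse the strings into proper datetimes, then split out the date and
# time parts expected by the warehouse schema.
raw["created_at"] = pd.to_datetime(raw["created_at"])
raw["created_date"] = raw["created_at"].dt.date
raw["created_time"] = raw["created_at"].dt.time

print(raw.dtypes)
print(raw.head())
```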
FAQs
Q: Why are we called team Data Squid?
A: Why not??