Lullymore-west Final Project

Lullymore-west Demo Video
"This should only take 30 minutes"
As part of the 13-week “Data Engineering in Python” bootcamp with Northcoders, we worked on developing an Extract, Transform, Load (ETL) pipeline for the fictional company Totesys. The goal was to automate data ingestion, transformation, and loading using AWS services and infrastructure as code.
Our pipeline was designed to run on a 30-minute schedule, with each step orchestrated by EventBridge. The plan involved extracting data from an RDS operational database using a Lambda function, with credentials securely managed in Secrets Manager. The extracted data would then be stored in S3 as JSON before undergoing transformation via another Lambda function, which used Pandas to parse the data and convert it into Parquet format conforming to a specified star schema. Parameter Store was used to manage state between runs, such as the timestamp of the last extraction.
Once transformed, the data would be stored back in S3, and a final Lambda function would load it into a remodelled Data Warehouse for future business intelligence analysis. CloudWatch was also set up for monitoring and logging.
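As an illustration of the ingest stage described above, here is a minimal sketch of how the first Lambda handler might pull credentials from Secrets Manager, query the Totesys database with Psycopg2, and write the results to S3 as JSON. The secret name, bucket name and table are hypothetical placeholders rather than our exact configuration.

```python
import json

import boto3
import psycopg2


def ingest_handler(event, context):
    """Minimal sketch of the extract step: RDS -> JSON in S3."""
    # Retrieve database credentials from Secrets Manager
    # ("totesys-db-credentials" is a placeholder secret name)
    secrets = boto3.client("secretsmanager")
    creds = json.loads(
        secrets.get_secret_value(SecretId="totesys-db-credentials")["SecretString"]
    )

    # Query the operational database
    conn = psycopg2.connect(
        host=creds["host"],
        dbname=creds["dbname"],
        user=creds["user"],
        password=creds["password"],
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM sales_order;")  # placeholder table
        columns = [desc[0] for desc in cur.description]
        rows = [dict(zip(columns, record)) for record in cur.fetchall()]
    conn.close()

    # Store the extracted rows in the ingestion bucket as JSON
    # ("ingestion-bucket" is a placeholder bucket name)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="ingestion-bucket",
        Key="sales_order.json",
        Body=json.dumps(rows, default=str),
    )
    return {"statusCode": 200, "rows_ingested": len(rows)}
```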
However, due to time constraints, the full implementation of the pipeline was not completed.
The Team
Iris Araneta
Ollie Lawn
Callum Bain
Dani Ghinchevici
Lucy Milligan
Joss Sessions
Technologies

We used: Terraform, AWS (S3, EventBridge, CloudWatch, Lambda, Secrets Manager, SSM Parameter Store), Python (including Pandas, Psycopg2, FastParquet), and GitHub Actions.
- Terraform: Used for infrastructure-as-code (IaC) to ensure reproducibility and automated deployment.
- S3: scalable data storage.
- EventBridge: to automate the workflow and trigger Lambda functions from S3 "PutObject" notifications (see the transform sketch after this list). Although we initially tried using a State Machine, we transitioned to EventBridge for ease of deployment.
- CloudWatch: logging and monitoring of the Lambda functions' behaviour and overall pipeline execution.
- Lambda: allowed us to write Python applications that could be triggered by events, with associated utility functions for the ingest, transform and load stages of the pipeline. It's also serverless and therefore cost-efficient.
- Secrets Manager: helped us manage database credentials securely.
- SSM Parameter Store: helped us keep track of state within the Lambda functions (e.g. storing the timestamp variables needed to check whether the database had been updated since the last invocation); a sketch follows this list.
- Python: core programming language for data extraction, transformation, and loading (ETL).
- Pandas: used for manipulating and transforming the ingested JSON data (see the transform sketch after this list).
- Psycopg2: used for interacting with the PostgreSQL Totesys database, hosted in AWS. This was chosen for its efficiency and versatility.
- FastParquet: used to write data efficiently in Parquet format; it has a smaller package size than PyArrow, which mattered for keeping the Lambda layers small.
- GitHub Actions: continuous integration and deployment (CI/CD). Ensured automated testing, security checks and PEP8 compliance. This was also used to deploy the Terraform code.
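As referenced in the EventBridge, Pandas and FastParquet bullets above, the sketch below shows roughly how the transform Lambda could unpack the S3 object details from an EventBridge notification, reshape the ingested JSON with Pandas, and write Parquet with FastParquet. The event shape, bucket names and column names are assumptions for illustration, not our exact schema.

```python
import json

import boto3
import pandas as pd


def transform_handler(event, context):
    """Minimal sketch of the transform step: JSON in S3 -> Parquet in S3."""
    # Unpack the S3 object details from the EventBridge notification
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # Read the ingested JSON into a DataFrame
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.DataFrame(json.loads(body))

    # Illustrative star-schema shaping: split a timestamp into the date and
    # time columns used by a fact table (column names are made up)
    df["created_at"] = pd.to_datetime(df["created_at"])
    df["created_date"] = df["created_at"].dt.date
    df["created_time"] = df["created_at"].dt.time
    fact = df.drop(columns=["created_at"])

    # Write Parquet with FastParquet, staging the file in Lambda's
    # writable /tmp directory before uploading to the processed bucket
    local_path = "/tmp/transformed.parquet"
    fact.to_parquet(local_path, engine="fastparquet", index=False)
    s3.upload_file(local_path, "processed-bucket", key.replace(".json", ".parquet"))

    return {"statusCode": 200, "rows_transformed": len(fact)}
```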
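Similarly, for the SSM Parameter Store bullet above, here is a small sketch of how the last-ingested timestamp could be read and updated with boto3 (the parameter name is a placeholder):

```python
from datetime import datetime, timezone

import boto3

ssm = boto3.client("ssm")
PARAM_NAME = "/totesys/last_ingested_timestamp"  # placeholder parameter name


def get_last_ingested():
    """Return the timestamp stored by the previous run, or None on the first run."""
    try:
        return ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        return None


def set_last_ingested():
    """Record the current time so the next run only fetches newly updated rows."""
    ssm.put_parameter(
        Name=PARAM_NAME,
        Value=datetime.now(timezone.utc).isoformat(),
        Type="String",
        Overwrite=True,
    )
```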
Challenges Faced
Given it was the group's first time creating an ETL pipeline, we faced numerous challenges, including:
- AWS Terraform configuration: Small oversights in our Terraform scripts led to unexpected difficulties.
- Transitioning from a State Machine to an EventBridge trigger.
- Lambda response formatting: troubleshooting the response format so that the second Lambda could unpack the data correctly.
- Pandas layer attachment: Identifying and attaching a suitable AWS Lambda layer that hosts the Pandas library proved to be a challenge.
- Parameter Store and Secrets Manager: Managing sensitive information securely using AWS Parameter Store and Secrets Manager was essential but complex.
- Testing: testing the interaction of the Lambda functions with operational databases (unit testing vs. integration testing).
- Completing the project within 2 weeks
We are really pleased with how we worked together as a group and the progress we made, despite not completely finishing the project. Although we have learnt a lot over the course of the bootcamp, the project also highlighted how much more there is to learn within the data engineering field!