Rossolimo Project
present

The Good, The Great, and The Excellent

Team Rossolimo successfully designed and deployed an ETL pipeline. A Python Lambda function extracted data from Totesys' remote database and loaded it into an AWS S3 bucket. A second Lambda function used the pandas library to transform and format the data into dataframes, which it stored as Parquet files in our S3 bucket. In the final stage, a third Lambda function loaded the processed data into an external PostgreSQL data warehouse built on a star schema design.
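
To give a feel for the star schema step, here is a minimal sketch of splitting flat rows into a dimension table and a fact table. It uses plain Python rather than pandas so that it runs without extra dependencies, and every table and column name in it (`staff_id`, `staff_name`, `sales_order_id`, `units_sold`) is a hypothetical stand-in, not the actual Totesys schema.

```python
from __future__ import annotations

from typing import Iterable


def split_star_schema(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split flat sales rows into a staff dimension and a sales fact table.

    Column names here are illustrative assumptions only.
    """
    dim_staff: dict[int, dict] = {}
    fact_sales: list[dict] = []
    for row in rows:
        # Each staff member appears once in the dimension, keyed by staff_id.
        dim_staff.setdefault(
            row["staff_id"],
            {"staff_id": row["staff_id"], "staff_name": row["staff_name"]},
        )
        # The fact row keeps the measure plus a foreign key to the dimension.
        fact_sales.append(
            {
                "sales_order_id": row["sales_order_id"],
                "staff_id": row["staff_id"],
                "units_sold": row["units_sold"],
            }
        )
    return list(dim_staff.values()), fact_sales
```

Descriptive attributes such as the staff name live only in the dimension, so the fact table stays narrow and joins back to it through `staff_id`.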

We automated infrastructure deployment with Terraform, an Infrastructure as Code (IaC) tool, which gave us scalable and consistent resource management. Continuous Integration/Continuous Deployment (CI/CD) with GitHub Actions also improved workflow efficiency.

Each Lambda function was meticulously logged using AWS CloudWatch. The resulting data warehouse can be seamlessly queried using Tableau for in-depth data analysis and visualisation.
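
As an illustration of the logging approach, the sketch below uses Python's standard `logging` module, whose output the Lambda Python runtime forwards to CloudWatch Logs. It also shows one way a step could log a failure for a single table and carry on with the rest. The function name `process_tables` and the `process_one` callback are hypothetical, not taken from our codebase.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def process_tables(table_names, process_one):
    """Run process_one for each table, logging failures instead of stopping."""
    succeeded, failed = [], []
    for name in table_names:
        try:
            process_one(name)
            logger.info("Processed table %s", name)
            succeeded.append(name)
        except Exception:
            # logger.exception records the traceback alongside the message.
            logger.exception("Failed to process table %s", name)
            failed.append(name)
    # The caller can decide whether any failures should raise an alert.
    return {"succeeded": succeeded, "failed": failed}
```

Returning both lists lets the calling handler report a partial success rather than hiding the failures.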

With test-driven development at the forefront of our operating practices, we designed a comprehensive testing suite for every aspect of our project. We made significant use of mocking and patching for cloud-based technologies, such as moto for boto3, and achieved over 90% coverage of the codebase.
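
The sketch below shows the general idea of testing cloud code without touching AWS. Note that it swaps in `unittest.mock.MagicMock` from the standard library in place of moto, so that it runs with no extra packages. The `upload_rows` helper, bucket name and key are hypothetical; `put_object` with `Bucket`, `Key` and `Body` is the real boto3 S3 client call being imitated.

```python
import json
from unittest.mock import MagicMock


def upload_rows(s3_client, bucket, key, rows):
    """Serialise rows as JSON and store them under the given S3 key."""
    body = json.dumps(rows).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


def test_upload_rows_calls_put_object_once():
    fake_s3 = MagicMock()  # stands in for boto3.client("s3")
    upload_rows(fake_s3, "ingestion-bucket", "staff/2024.json", [{"id": 1}])
    # The mock records the call, so we can assert on the exact arguments.
    fake_s3.put_object.assert_called_once_with(
        Bucket="ingestion-bucket",
        Key="staff/2024.json",
        Body=b'[{"id": 1}]',
    )
```

Passing the client in as an argument keeps the function easy to test. With moto, the same function could be called with a real boto3 client against an in-memory fake of S3.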

To develop strong teamwork and collaboration skills, we rotated the role of Scrum Master across our daily stand-ups and end-of-sprint retros. Through the Agile framework, we communicated openly and shared ownership of tasks, which we visualised on our GitHub Projects and Jira Kanban boards. This transparent progress tracking ensured alignment with project objectives and deadlines.

The Team

  • Mostyn Jefferson
  • Michael Connolly
  • Nicholas Slocombe
  • Heiman Kwok
  • Leonette Dias

Technologies

Code:
Python: For Lambda functions
SQL: For database interaction
YAML: Orchestration (GitHub Actions)
Make: Orchestration (Makefile)
Terraform: Infrastructure

Cloud services:
Lambda: Running the Python ETL functions
CloudWatch: Monitoring of Lambda functions and EventBridge
SNS: Emailing alerts on errors
S3: Data storage
EventBridge: Scheduler for the step function
Step Functions: Triggering Lambdas in sequence
Secrets Manager: Storing sensitive information
IAM: Managing access between different AWS services

Methodology:
Jira: Kanban board and Scrum methodology
Miro: Diagrams, charts and planning
Slack: Communication, pair programming
Tableau: Business insights and analysis

Challenges Faced

Collaboration and communication

Moving from pair and individual programming to programming in a team
Solutions:
Daily stand-ups and meetings
Collaborative tools, e.g. Jira and Miro

Learning curve

New/unfamiliar software, languages and problems
Solutions:
Shared ideas as a group for approach to problems
Pair programming
Documentation

Dynamic Times for Lambda functions - allowing both full data dumps and time-limited ones
Graceful Failure - allowing Lambdas to continue after an error rather than stopping outright
Utility Lambda File Structure - redesigning file structures as they grew
Layer Size - AWS places tight limits on Lambda layer size
Dynamic Dim Tables - some of the dim tables were updating and some weren't, which we hadn't initially accounted for
IAM Permissions - archaic and byzantine
SQL load function rewrite for error handling - the lack of error handling in the pandas to_sql method required a full rewrite of the load functions
Dynamic Data Testing - testing data that is continuously evolving
Test Database - PostgreSQL database hosted in GitHub Actions
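
To illustrate the dynamic times challenge, here is a minimal sketch of building either a full-dump query or a time-limited one from the same function. The `last_updated` column, the `build_extract_query` name and the `%s` placeholder style (as used by common PostgreSQL drivers) are assumptions for illustration, not our actual implementation.

```python
from __future__ import annotations

from datetime import datetime


def build_extract_query(
    table: str, allowed_tables: set[str], since: datetime | None = None
):
    """Return (sql, params) for a full dump or for rows changed after `since`."""
    # Table names cannot be passed as query parameters, so check them
    # against a known list before formatting them into the SQL string.
    if table not in allowed_tables:
        raise ValueError(f"Unknown table: {table}")
    if since is None:
        # No previous run recorded: extract everything.
        return f"SELECT * FROM {table};", ()
    # Otherwise only fetch rows updated since the last successful run.
    return f"SELECT * FROM {table} WHERE last_updated > %s;", (since,)
```

A caller might store the timestamp of the last successful run and pass it as `since`, leaving it as `None` on the first run to trigger the full dump.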