Rossolimo Project
The Good, The Great, and The Excellent
Team Rossolimo successfully designed and deployed an ETL pipeline, which extracted data from Totesys' remote database and loaded it into an AWS S3 bucket using a Lambda function written in Python. A subsequent Lambda function transformed the data with the pandas library, formatting it into dataframes that were stored as Parquet files in our S3 bucket. In the final stage, an additional Lambda function loaded the processed data into an external PostgreSQL data warehouse built on a star schema design.
We automated infrastructure deployment with Terraform, an Infrastructure as Code (IaC) tool, which gave us scalable and consistent resource management. Workflow efficiency was also improved with Continuous Integration/Continuous Deployment (CI/CD) using GitHub Actions.
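As a rough illustration of the extract stage, the sketch below shows one way a single function could serve both full data dumps and time-limited runs, and keep going when one table fails (two of the challenges listed further down). Everything named here is an assumption made for the example, not the team's actual code: `build_extract_query`, `extract_tables`, the `fetch` callable, the table names and the `last_updated` column.

```python
from datetime import datetime
from typing import Callable, Optional

# Hypothetical allow-list. Table names cannot be bound as SQL parameters,
# so they are validated before being interpolated into the query text.
KNOWN_TABLES = {"sales_order", "staff", "currency"}


def build_extract_query(table: str, since: Optional[datetime] = None):
    """Return (sql, params): a full dump when `since` is None, else only newer rows."""
    if table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {table}")
    if since is None:
        return f"SELECT * FROM {table};", ()
    # Assumption: every source table carries a `last_updated` timestamp column.
    return f"SELECT * FROM {table} WHERE last_updated > %s;", (since,)


def extract_tables(tables, fetch: Callable, since: Optional[datetime] = None):
    """Extract each table in turn, recording failures instead of aborting the run."""
    results, failures = {}, {}
    for table in tables:
        try:
            sql, params = build_extract_query(table, since)
            results[table] = fetch(sql, params)
        except Exception as exc:  # broad on purpose: one bad table must not stop the rest
            failures[table] = str(exc)
    return results, failures
```

Passing `since=None` gives the full dump, while passing the timestamp of the previous successful run gives the incremental one. The `failures` dictionary can then be logged or sent on as an alert without losing the tables that did extract cleanly.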
Each Lambda function was meticulously logged using AWS CloudWatch. The resulting data warehouse can be seamlessly queried using Tableau for in-depth data analysis and visualisation.
With test-driven development at the forefront of our operating practices, we designed a comprehensive testing suite for every aspect of our project. We made significant use of mocking and patching for cloud-based technologies, such as moto for boto3, and achieved over 90% coverage of the codebase.
Developing strong teamwork and collaboration skills, we rotated the role of Scrum Master across our daily stand-ups and end-of-sprint retros. Working within the Agile framework, we communicated openly and shared ownership of tasks, which we visualised on our GitHub Projects and Jira Kanban boards. This transparent progress tracking ensured alignment with project objectives and deadlines.
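To give a flavour of what testing cloud code without touching AWS looks like, here is a minimal sketch. It deliberately swaps moto for the standard library's `unittest.mock.MagicMock` so that it runs with no extra dependencies; moto goes further by emulating the AWS services themselves. The function `upload_rows`, the bucket name and the object key are hypothetical.

```python
import json
from unittest.mock import MagicMock


def upload_rows(s3_client, bucket: str, key: str, rows: list) -> int:
    """Serialise rows to JSON and store them in S3; returns the number of rows written."""
    body = json.dumps(rows).encode("utf-8")
    # put_object(Bucket=..., Key=..., Body=...) is the boto3 S3 client call.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(rows)


def test_upload_rows_writes_one_object():
    fake_s3 = MagicMock()  # stands in for boto3.client("s3")
    written = upload_rows(fake_s3, "ingestion-bucket", "staff/2024.json", [{"id": 1}])
    assert written == 1
    # The mock records the call, so the test can check exactly what was sent.
    fake_s3.put_object.assert_called_once_with(
        Bucket="ingestion-bucket",
        Key="staff/2024.json",
        Body=b'[{"id": 1}]',
    )
```

Passing the client in as an argument, rather than creating it inside the function, is what makes the substitution possible without patching.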
The Team
Mostyn Jefferson
Michael Connolly
Nicholas Slocombe
Heiman Kwok
Leonette Dias
Technologies
Code:
Python: For Lambda functions
SQL: For database interaction
YAML: Orchestration (GitHub Actions)
Make: Orchestration (Makefile)
Terraform: Infrastructure
Cloud services:
Lambda: To run Python ETL functions
CloudWatch: Monitoring of Lambda functions and EventBridge
SNS: Emailing alerts on errors
S3: Data storage
EventBridge: Scheduler for the Step Functions state machine
Step Functions: Trigger Lambdas in sequence
Secrets Manager: Storing sensitive information
IAM: Managing access between different AWS services
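As an illustration of how Step Functions can run Lambdas in sequence, the sketch below builds a small Amazon States Language definition in which each Task state names the next one. The ARN pattern, account ID, region and function names are placeholders invented for the example and do not come from the project.

```python
import json

# Placeholder ARN: the account ID, region and function names are made up.
LAMBDA_ARN = "arn:aws:lambda:eu-west-2:123456789012:function:{name}"


def build_state_machine(function_names):
    """Chain Lambda tasks so each state hands over to the next and the last one ends."""
    states = {}
    for index, name in enumerate(function_names):
        state = {"Type": "Task", "Resource": LAMBDA_ARN.format(name=name)}
        if index + 1 < len(function_names):
            state["Next"] = function_names[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": function_names[0], "States": states}


definition = build_state_machine(["extract", "transform", "load"])
print(json.dumps(definition, indent=2))
```

With Terraform, a JSON document of this shape would typically be supplied as the `definition` argument of an `aws_sfn_state_machine` resource, with an EventBridge schedule starting each execution.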
Methodology:
Jira: Kanban board, and scrum methodology
Miro: Diagrams, charts and planning
Slack: Communication, pair programming
Tableau: Business insights and analysis
Challenges Faced
Collaboration and communication
Moving from pair and individual programming to programming in a team
Solutions:
Daily stand-ups and meetings
Collaborative tools, e.g. Jira and Miro
Learning curve
New/unfamiliar software, languages and problems
Solutions:
Shared ideas as a group for approach to problems
Pair programming
Documentation
Dynamic Times for Lambda functions - allowing both full data dumps, and time-limited ones
Graceful Failure - allowing Lambdas to continue despite failures
Utility Lambda File Structure - redesigning file structures as they grew
Layer Size - AWS places strict limits on Lambda layer size (250 MB unzipped for a function and all of its layers combined)
Dynamic Dim Tables - some of the dim tables were updating, some weren't, which we hadn't initially accounted for
IAM Permissions - archaic and byzantine
SQL load function rewrite for error handling - the lack of error handling capability in the pandas to_sql method required a full rewrite of the load functions
Dynamic Data Testing - testing data that is continuously evolving
Test Database - PostgreSQL database hosted in GitHub Actions
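To illustrate the kind of load function the rewrite above points to, here is a minimal sketch that inserts rows one at a time and collects the rejects instead of failing the whole batch. It uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere; `load_rows`, the `dim_currency` table and its columns are assumptions for the example, not the team's implementation.

```python
import sqlite3


def load_rows(conn, table: str, columns: list, rows: list):
    """Insert rows one at a time; return (loaded_count, rejected) rather than raising."""
    # Table and column names are interpolated, so they must come from trusted code.
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    loaded, rejected = 0, []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back this row on error
                conn.execute(sql, row)
            loaded += 1
        except sqlite3.Error as exc:
            rejected.append((row, str(exc)))
    return loaded, rejected


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_currency (currency_id INTEGER PRIMARY KEY, code TEXT NOT NULL)"
)
loaded, rejected = load_rows(
    conn,
    "dim_currency",
    ["currency_id", "code"],
    [(1, "GBP"), (1, "USD"), (2, None), (3, "EUR")],
)
print(loaded, len(rejected))  # prints: 2 2
```

Row-by-row inserts are slower than a bulk write, so this trades throughput for knowing exactly which rows were refused and why. Against PostgreSQL the same shape would apply with a driver's own exception types and `%s` placeholders.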