Blackwater Data Engineering Project

Blackwater Demo Video

Streamlining data, empowering decisions

Team Blackwater were tasked with creating and deploying an ETL pipeline. Data was extracted from the remote Totesys database into an AWS S3 bucket by a Python Lambda function, transformed and formatted by a second Lambda function, and written to another S3 bucket. Finally, a third Lambda function loaded the output into an external data warehouse. All of the Lambda functions logged to CloudWatch. Ultimately, the data warehouse could be queried using QuickSight for comprehensive data analysis and visualisation.
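
As a rough illustration of the extraction step described above, the sketch below shows how such a Lambda handler can be wired together with pg8000, Secrets Manager and boto3. The secret name, bucket name and table are hypothetical placeholders, and the project's actual handler differs in details such as how it selects tables and handles logging.

```python
import json
from datetime import datetime, timezone

import boto3
import pg8000.native


def get_credentials(secret_name="totesys-credentials"):
    """Fetch database credentials stored as a JSON secret in Secrets Manager.
    The secret name is a hypothetical placeholder."""
    response = boto3.client("secretsmanager").get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def lambda_handler(event, context):
    """Extract one table from Totesys and write it to the ingestion bucket as JSON."""
    creds = get_credentials()
    conn = pg8000.native.Connection(
        user=creds["user"],
        password=creds["password"],
        host=creds["host"],
        port=int(creds["port"]),
        database=creds["database"],
    )
    try:
        rows = conn.run("SELECT * FROM sales_order;")
        columns = [col["name"] for col in conn.columns]
        records = [dict(zip(columns, row)) for row in rows]
    finally:
        conn.close()

    # Timestamped key so each scheduled run lands in its own object.
    key = f"sales_order/{datetime.now(timezone.utc).isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket="blackwater-ingestion-bucket",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(records, default=str),
    )
    return {"statusCode": 200, "rows_extracted": len(records)}
```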

The Team

  • Richard Wilding

  • Tom Jackson

  • Mike Lee

  • Alex Coe

  • Ben Turner

Technologies


We used: Python, including the following packages: pg8000, boto3, pandas, awswrangler, pytest, moto, bandit, safety, black, flake8 and coverage; AWS, including Lambda, S3, Secrets Manager, CloudWatch, SNS, EventBridge, IAM and QuickSight; PostgreSQL; Terraform; Git; GitHub Actions and Secrets; and Trello.

Python was used to write the extract, transform and load (ETL) functions, utilising additional libraries when relevant:

  • pg8000 - to connect to the PostgreSQL databases (Totesys and the Data Warehouse, both production and test)
  • boto3 - to interact with other AWS resources (primarily S3 and Secrets Manager)
  • pandas / awswrangler - to read from and write to S3 buckets and to transform the extracted data (see the transform sketch after this list)
  • pytest - to test the Python code
  • moto - to mock the AWS infrastructure while testing (a test sketch appears at the end of this Technologies section)
  • bandit - to ensure there were no vulnerabilities in the code we had written
  • safety - to ensure there were no vulnerabilities in the Python libraries we used
  • black and flake8 - to check and reformat our code to ensure PEP8 compliance
  • coverage - to ensure our code was well tested
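
As a minimal sketch of how pandas and awswrangler can be combined in the transform step, the example below reads extracted JSON from S3, reshapes it into a dimension-style table and writes parquet back to S3. The bucket names and column selection are illustrative assumptions, not the project's exact schema.

```python
import awswrangler as wr

INGESTION_BUCKET = "blackwater-ingestion-bucket"  # hypothetical name
PROCESSED_BUCKET = "blackwater-processed-bucket"  # hypothetical name


def transform_staff(key):
    """Read an extracted staff JSON file from S3, reshape it into a
    dim_staff-style table and write it back to S3 as parquet."""
    # awswrangler reads the JSON object straight into a pandas DataFrame.
    staff = wr.s3.read_json(f"s3://{INGESTION_BUCKET}/{key}")

    # Keep only the columns needed for the dimension table (illustrative subset).
    dim_staff = staff[["staff_id", "first_name", "last_name", "email_address"]]

    # Write the transformed frame to the processed bucket in parquet format.
    out_path = f"s3://{PROCESSED_BUCKET}/dim_staff/{key.removesuffix('.json')}.parquet"
    wr.s3.to_parquet(df=dim_staff, path=out_path)
    return out_path
```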
AWS was used to host our ETL pipeline, and we made use of the following services:

  • Lambda - to run our Python ETL functions (a sketch of a load function follows this list)
  • S3 - to store the extracted and transformed data
  • Secrets Manager - to store database credentials for the PostgreSQL databases
  • CloudWatch - to receive logs of the status of our functions and to raise alarms when errors were received
  • SNS - to send emails when CloudWatch alarms were raised
  • EventBridge - to schedule the extract Lambda to run every 5 minutes
  • IAM - to grant the permissions allowing all of the AWS resources to work together as an ETL pipeline
  • QuickSight - to visualise data in the data warehouse
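
The load step combined the same building blocks: awswrangler to read the transformed parquet files from S3 and pg8000 to write rows into the warehouse. The sketch below is an assumed, simplified version; the table, column and bucket names are placeholders, and the actual function's handling of credentials, updates and errors is not shown.

```python
import awswrangler as wr
import pg8000.native

PROCESSED_BUCKET = "blackwater-processed-bucket"  # hypothetical name


def load_dim_staff(key, warehouse_credentials):
    """Read a transformed parquet file from S3 and insert its rows into the
    data warehouse; credentials are assumed to be a dict of pg8000 kwargs."""
    dim_staff = wr.s3.read_parquet(f"s3://{PROCESSED_BUCKET}/{key}")

    conn = pg8000.native.Connection(**warehouse_credentials)
    try:
        for row in dim_staff.to_dict(orient="records"):
            # Parameterised statements let pg8000 handle quoting of the values.
            conn.run(
                "INSERT INTO dim_staff "
                "(staff_id, first_name, last_name, email_address) "
                "VALUES (:staff_id, :first_name, :last_name, :email_address);",
                **row,
            )
    finally:
        conn.close()
    return len(dim_staff)
```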
PostgreSQL was used for both the Totesys database and the Data Warehouse; we also used it to create a database for testing our load function
Terraform was used to deploy all of the AWS infrastructure and upload the Python code
Git version control was used, via the command line and GitHub
GitHub Actions and Secrets were used as part of our CI/CD pipeline
All of our work was planned and assigned using a Trello board
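
To show how pytest and moto fit together in the testing approach described above, the sketch below mocks S3 so a small write helper can be tested without touching real AWS infrastructure. The helper, bucket name and key are hypothetical stand-ins for the project's own functions.

```python
import json

import boto3
from moto import mock_aws  # moto >= 5; earlier versions expose mock_s3 instead


def write_records_to_s3(records, bucket, key):
    """Hypothetical helper mirroring the S3 write in the extract function."""
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(records))


@mock_aws
def test_write_records_to_s3_stores_json():
    # With moto active, boto3 clients talk to an in-memory fake of AWS.
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="test-ingestion-bucket",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )

    write_records_to_s3(
        [{"staff_id": 1, "first_name": "Richard"}],
        bucket="test-ingestion-bucket",
        key="staff/2024-01-01.json",
    )

    stored = s3.get_object(Bucket="test-ingestion-bucket", Key="staff/2024-01-01.json")
    assert json.loads(stored["Body"].read()) == [
        {"staff_id": 1, "first_name": "Richard"}
    ]
```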

Challenges Faced

Ensuring that our functions could access the required credentials securely when running locally, when deployed to AWS, and during the CI/CD pipeline.
Git branching and merging was challenging at times.
Testing our load function required building our own copy of the data warehouse.
Getting layers deployed to AWS when using large packages such as awswrangler/pandas that exceeded the maximum layer size.
Handling records containing apostrophes required changes to the transform function (see the sketch below).
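
Apostrophes are a classic problem when string values end up inside generated SQL: a surname such as O'Brien closes the quoted literal early. The sketch below shows one common fix, doubling single quotes during the transform step; this is an assumption about the approach rather than the team's exact change, and parameterised inserts (as in the load sketch above) sidestep the issue entirely.

```python
import pandas as pd


def escape_apostrophes(df):
    """Double any single quotes in string columns so values such as "O'Brien"
    survive later SQL statement generation. (Illustrative approach only.)"""
    escaped = df.copy()
    for column in escaped.select_dtypes(include="object"):
        escaped[column] = escaped[column].str.replace("'", "''", regex=False)
    return escaped


staff = pd.DataFrame({"staff_id": [1], "last_name": ["O'Brien"]})
print(escape_apostrophes(staff)["last_name"].iloc[0])  # O''Brien
```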