Hedgehogpresent
Team Hedgehog Demo Video
"Data Engineering: Turning '404 not found' into '200 OK'." - ChatGPT
This Northcoders Data Engineering Final Project is a pipeline that transforms a SQL database of a shop's sales records (in Online Transaction Processing, "OLTP", format) into a structured data warehouse (in Online Analytical Processing, "OLAP", format), all hosted on Amazon Web Services (AWS). See the end of the ReadMe for images of the entity relationship diagrams (ERDs) for the initial and transformed databases.
The Pipeline:
1. The pipeline's "extraction" Lambda function collects both archive and new data entries by scanning the database periodically for updates. It converts new, unique data to CSV files, which are stored in an S3 bucket, and writes its logs to CloudWatch. The database credentials are stored in Secrets Manager, and Systems Manager is used to store timestamps.
2. Any upload event on that bucket triggers a second "processing" Lambda function, which transforms and normalises the data using Pandas DataFrames and stores the results in Parquet format in a second S3 bucket.
3. Finally, a third "storage" Lambda function scans the second bucket periodically for updates, converts the new data back to SQL, and loads it into a data warehouse with a star schema.
The entire pipeline infrastructure is managed using Terraform.
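As a rough illustration of the extraction step, the sketch below filters rows against a stored timestamp and serialises the new ones to CSV. It is not the project's actual code: the function names, the column names (`sales_order_id`, `design_name`, `last_updated`) and the sample rows are all hypothetical, and everything stays in memory, whereas the real pipeline reads from the database, keeps the timestamp in Systems Manager and writes the CSV to S3.

```python
import csv
import io
from datetime import datetime


def extract_new_rows(rows, last_extracted):
    """Return the rows updated after the stored timestamp, plus the new timestamp."""
    new_rows = [row for row in rows if row["last_updated"] > last_extracted]
    # If nothing is new, keep the old timestamp so the next run starts from the same point.
    new_timestamp = max(
        (row["last_updated"] for row in new_rows), default=last_extracted
    )
    return new_rows, new_timestamp


def rows_to_csv(rows, fieldnames):
    """Serialise rows to CSV text; the csv module quotes any field containing a comma."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for a database table.
SAMPLE_ROWS = [
    {"sales_order_id": 1, "design_name": "Wooden",
     "last_updated": datetime(2024, 1, 1, 9, 0)},
    {"sales_order_id": 2, "design_name": "Steel, brushed",
     "last_updated": datetime(2024, 1, 2, 10, 30)},
    {"sales_order_id": 3, "design_name": "Granite",
     "last_updated": datetime(2024, 1, 3, 8, 15)},
]

# Hypothetical timestamp left behind by the previous run.
previous_timestamp = datetime(2024, 1, 1, 12, 0)

new_rows, new_timestamp = extract_new_rows(SAMPLE_ROWS, previous_timestamp)
csv_text = rows_to_csv(
    new_rows, ["sales_order_id", "design_name", "last_updated"]
)
print(csv_text)
```

Running the extraction again with `new_timestamp` as the starting point returns no rows, which is one way to keep repeated scans from producing duplicate files.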
The Team
Philippa Clarkson
Dylan Hickman Singh
Victoria Messam
Tom Avery
Phillip Taylor
Technologies
We used: Python; PSQL; AWS Lambda, S3, CloudWatch, Systems Manager (SSM), Secrets Manager; Terraform; GitHub; Trello.
These were the most efficient resources for our required project.
Challenges Faced
Terraform layers and GitHub merges were tricky initially, as were CSV delimiters.
FAQs
Why did we choose Hedgehog as a team name?
Philippa was late to the first team meeting because a hedgehog had crawled into the foundations of her house and she was busy pulling up floorboards to find it.