Cloudy With A Chance of Terraform Data Project
Cloudy With A Chance of Terraform Demo Video
Data journey - from raw to ready!
A small-scale data platform that employs an OLTP database, data lakes (AWS S3), a set of Lambdas (to extract, transform and load data) and an RDS warehouse. Pandas is used to manipulate the data, and Terraform provisions the cloud infrastructure, including EventBridge scheduling and CloudWatch monitoring.
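As a rough illustration of the extract stage, the sketch below shows how a Lambda might pull a table from the OLTP database with pg8000, load it into a Pandas DataFrame and land it in the S3 ingestion bucket as CSV. The table name, bucket name, object key and environment variables are placeholders rather than our real configuration.

```python
# Illustrative sketch only: table name, bucket name, key and env vars are placeholders.
import io
import os

import boto3
import pandas as pd
import pg8000.native


def extract_handler(event, context):
    """Pull one table from the OLTP database and land it in the ingestion bucket as CSV."""
    conn = pg8000.native.Connection(
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        host=os.environ["DB_HOST"],
        database=os.environ["DB_NAME"],
    )
    rows = conn.run("SELECT * FROM sales_order;")
    columns = [col["name"] for col in conn.columns]
    df = pd.DataFrame(rows, columns=columns)

    # Write the DataFrame to the ingestion bucket as an in-memory CSV.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket="my-ingestion-bucket",   # placeholder bucket name
        Key="sales_order/latest.csv",   # placeholder object key
        Body=buffer.getvalue(),
    )
    conn.close()
```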
The Team
Irina Ponomarenko
Ciaran Kyle
Tes Ryu
William Tait
Bradley Clayton
Technologies
We used: Python, PostgreSQL, AWS S3, AWS Lambda, AWS Secrets Manager, AWS CloudWatch, AWS SNS and Terraform, along with Python modules including Pandas, pg8000, boto3, patch, mock, logging ...
We chose these technologies for our ETL pipeline as they were the core technologies we covered throughout the Data Engineering course. Python, with libraries like Pandas and boto3, allows us to manipulate data and integrate with AWS services. PostgreSQL offers querying capabilities for structured data. AWS Lambda enables serverless processing, and Secrets Manager ensures secure handling of credentials. We use CloudWatch for monitoring and SNS for notifications. Terraform automates our infrastructure so we can quickly and easily apply changes to the pipeline architecture. Python modules like patch and mock help us test and debug effectively, ensuring the reliability of our pipeline.
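As an example of the Secrets Manager piece, a minimal sketch might look like the following, assuming the database credentials are stored as a JSON secret with user/password/host/dbname keys. The secret name and key names here are illustrative, not our actual values.

```python
# Illustrative sketch only: the secret name and the JSON keys inside it are assumptions.
import json

import boto3
import pg8000.native


def get_db_connection(secret_name="oltp-db-credentials"):  # placeholder secret name
    """Fetch database credentials from Secrets Manager and open a pg8000 connection."""
    client = boto3.client("secretsmanager")
    secret = json.loads(client.get_secret_value(SecretId=secret_name)["SecretString"])
    return pg8000.native.Connection(
        user=secret["user"],
        password=secret["password"],
        host=secret["host"],
        database=secret["dbname"],
        port=int(secret.get("port", 5432)),
    )
```

Keeping credentials out of the code and fetching them at runtime like this means the same Lambda package can be deployed against different databases without any source changes.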
Challenges Faced
The project provided plenty of challenges every step of the way! We knew we had what it takes to complete the pipeline, but some hurdles proved particularly stubborn:
One of the first was the creation of the dependency layers for the Lambda functions using Terraform. It took our best brains a lot of blood, sweat and tears, but they eventually cracked it.
Another was deciding how to trigger each of the Lambda functions in the pipeline. Our initial focus was on successfully completing the initial extraction and filling our ingestion bucket with data. It only became apparent after accomplishing this that we would have to go back and edit our extract Lambda so it could supply data in batches to our transform Lambda. We then had to refactor our transform Lambda in a similar way when it became apparent that our load Lambda was throwing 'duplicate key' errors whenever we restarted the pipeline. So there was a fair bit of backtracking, as at the beginning of the project we couldn't quite see the forest for the trees. Of course, we've learnt a lot from this process.
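For context on the 'duplicate key' problem, one common way to make a warehouse load idempotent is PostgreSQL's ON CONFLICT clause, so rerunning the pipeline doesn't re-insert rows that already exist. The sketch below is a generic illustration with placeholder table and column names, not the exact code in our load Lambda.

```python
# Illustrative sketch only: table and column names are placeholders, and this is one
# generic way to make a load idempotent rather than our load Lambda's exact approach.
import pg8000.native


def load_rows(conn: pg8000.native.Connection, rows: list[dict]) -> None:
    """Insert rows into the warehouse, skipping any that are already present."""
    for row in rows:
        conn.run(
            """
            INSERT INTO dim_staff (staff_id, first_name, last_name)
            VALUES (:staff_id, :first_name, :last_name)
            ON CONFLICT (staff_id) DO NOTHING;  -- tolerate pipeline reruns
            """,
            **row,
        )
```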