Vox Indicium Data Engineering Project Presentation.
Continuous Development
The project is an Extract, Transform, Load (ETL) pipeline consisting of three AWS Lambda functions written in Python and two S3 buckets, all with appropriate event monitoring and logging, and concluding with Amazon QuickSight for interrogating the data. The pipeline reads from a database, transforms the data into a star schema and loads it into a data warehouse, with one Lambda function for each of the three stages. The first Lambda reads the database and stores CSV files in an S3 bucket (Extract); the second reads the CSV files, transforms them into star schema format and saves the result as Parquet files in a second S3 bucket (Transform); and the final Lambda reads the Parquet files and loads the data into the data warehouse (Load).
The first time the database is read, the data warehouse is fully populated; on subsequent runs the database is queried only for changes. The first Lambda runs on a schedule that can easily be adjusted through a CloudWatch event rule. Each subsequent Lambda is triggered when a text file is deposited in the relevant S3 bucket; this text file is written only after all the other necessary files have been transferred. With the current parameters, database changes are reflected in the data warehouse in less than 30 minutes.
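The text-file trigger described above can be illustrated with a minimal sketch of how a downstream Lambda might pick the marker file out of an S3 event notification. This is not the project's actual code: the ".txt" suffix, the function names and the idea that the marker shares a key prefix with its batch of data files are all assumptions made for illustration.

```python
from urllib.parse import unquote_plus

# Hypothetical marker suffix; the write-up only says "a text file".
MARKER_SUFFIX = ".txt"


def marker_objects(event):
    """Return (bucket, key) pairs for marker files named in an S3 event."""
    found = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info.get("object", {}).get("key", ""))
        if bucket and key.endswith(MARKER_SUFFIX):
            found.append((bucket, key))
    return found


def batch_prefix(marker_key):
    """Assumption: the marker sits under the same prefix as its data files."""
    head, _, _ = marker_key.rpartition("/")
    return head + "/" if head else ""


def lambda_handler(event, context):
    """Entry point: act only when a marker file has arrived."""
    batches = [
        {"bucket": bucket, "prefix": batch_prefix(key)}
        for bucket, key in marker_objects(event)
    ]
    # A real handler would now read the data files under each prefix,
    # process them, and finish by writing its own marker file.
    return {"batches": batches}
```

Because the marker is uploaded last, its arrival signals that every data file in the batch is already in the bucket, so the next stage never starts on a partial set of files.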
Continuous Integration and Continuous Deployment (CI/CD) was implemented using Terraform to manage AWS resources and GitHub Actions to deploy code. AWS Secrets Manager was used to store and retrieve sensitive information. All Python code was security tested, unit tested and PEP 8 compliant, with average test coverage of around 90%.
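As an illustration of retrieving sensitive information from AWS Secrets Manager, a Lambda might fetch database credentials along the following lines. This is only a sketch under stated assumptions: the function name, the secret name and the choice to store the secret as a JSON string are hypothetical, not taken from the project. The only AWS call used is boto3's `get_secret_value`.

```python
import json


def get_db_credentials(secret_id, client=None):
    """Fetch a JSON-formatted secret and return it as a dict.

    Assumption: the secret is stored as a JSON string, for example
    {"user": "...", "password": "..."}.
    """
    if client is None:
        # Imported lazily so the function can be unit tested without AWS.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

Passing the client in as a parameter keeps the function easy to unit test with a stub, which fits the testing approach described above.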

Vox Indicium Project Video
Team Vox Indicium

Simon Lee

Michael Fay

Andrei Gradinaru

Erikas Kacerauskas

Mark Walsh

Amy Yang
Tech Stack
We used AWS S3, Lambda, EC2, CloudWatch, IAM, Secrets Manager and Parameter Store, along with Python, SQL (PostgreSQL), Terraform, GitHub, GitHub Actions and Trello.
These technologies allowed us to organise effectively as a team to deliver an effective ETL pipeline.
Now that we have had a taste, it's time for a bigger challenge.
