In da Bag Data Engineering Project Presentation.
Sealing Success, In da Bag!
We created a Lambda function to extract data from the operational database into a DataFrame, then used awswrangler to convert it to Parquet and store it in an S3 ingestion bucket. The extraction Lambda runs every 10 minutes and pulls the last 15 minutes of data. If the Lambda is deployed for the first time and the ingestion bucket is empty, it queries the database for all of the data.
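A minimal sketch of the extraction step is below. The helper names, bucket name and the column used for the time filter are assumptions for illustration rather than our real code; it covers the incremental 15-minute pull.

```python
from datetime import datetime, timedelta, timezone

import awswrangler as wr
import pandas as pd

INGESTION_BUCKET = "in-da-bag-ingestion"  # assumed bucket name


def extract_recent_rows(conn, table: str) -> pd.DataFrame:
    """Pull the last 15 minutes of rows from one table via a psycopg2 connection."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=15)
    with conn.cursor() as cur:
        # Table names come from a fixed internal list; the cutoff is parameterised.
        cur.execute(f"SELECT * FROM {table} WHERE last_updated >= %s;", (cutoff,))
        rows = cur.fetchall()
        columns = [desc[0] for desc in cur.description]
    return pd.DataFrame(rows, columns=columns)


def write_to_ingestion(df: pd.DataFrame, table: str) -> None:
    """Store the extracted rows as Parquet in the ingestion bucket."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    wr.s3.to_parquet(df=df, path=f"s3://{INGESTION_BUCKET}/{table}/{timestamp}.parquet")
```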
A second Lambda was triggered whenever an object was inserted into the ingestion bucket. It read the data into DataFrames so we could easily reshape it into the form expected by the data warehouse, then uploaded the results to a separate processed S3 bucket, again in Parquet format.
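Below is a minimal sketch of that transform Lambda, triggered by an S3 put event. The bucket name, the example dim_staff reshaping and the column names are assumptions for illustration.

```python
from urllib.parse import unquote_plus

import awswrangler as wr

PROCESSED_BUCKET = "in-da-bag-processed"  # assumed bucket name


def lambda_handler(event, context):
    # The S3 event tells us which ingestion object was just written.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    # Read the raw Parquet file back into a DataFrame.
    df = wr.s3.read_parquet(path=f"s3://{bucket}/{key}")

    # Reshape into the warehouse schema (illustrative dimension table).
    dim_staff = df[["staff_id", "first_name", "last_name", "email_address"]].drop_duplicates()

    # Write the remodelled table to the processed bucket, again as Parquet.
    wr.s3.to_parquet(
        df=dim_staff,
        path=f"s3://{PROCESSED_BUCKET}/dim_staff/{key.split('/')[-1]}",
    )
```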
When an object was added to the processed bucket, the loading Lambda would activate and read the data. It would then dynamically build the SQL query from the table name, column names and values. After the data was inserted, the folder containing the objects was renamed so the Lambda would know to ignore those files in the future.
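Below is a minimal sketch of how such an INSERT can be built dynamically while staying parameterised, using psycopg2's sql module; the helper name is an assumption and our real implementation may differ.

```python
import pandas as pd
from psycopg2 import sql


def insert_dataframe(conn, table: str, df: pd.DataFrame) -> None:
    """Build an INSERT from the DataFrame's columns and load every row."""
    columns = list(df.columns)
    query = sql.SQL("INSERT INTO {table} ({cols}) VALUES ({vals});").format(
        table=sql.Identifier(table),
        cols=sql.SQL(", ").join(map(sql.Identifier, columns)),
        vals=sql.SQL(", ").join(sql.Placeholder() * len(columns)),
    )
    with conn.cursor() as cur:
        # One parameterised statement per row; identifiers are safely quoted.
        cur.executemany(query, list(df.itertuples(index=False, name=None)))
    conn.commit()
```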
Team In da Bag

Hasan Sattar

Kabilan Thayaparan

Huzaifa Patel

Chuanjiao Zong

Caspar Wilson
Tech Stack
We used Python, Pandas, AWS Wrangler, Pytest, Psycopg2, AWS, QuickSight, Terraform and GitHub Actions.
Python is one of the most commonly used languages for data engineering.
Pandas makes it easy to read and manipulate data into various formats/shapes.
Pytest is the testing library we have the most experience with, and it makes it easy to test for exceptions (see the example after this list).
Psycopg2 allows us to connect to a Postgres database and makes it easy to create parameterised queries to prevent SQL injection.
We used several AWS services to deploy our pipeline to the cloud relatively cheaply.
QuickSight allowed us to visualise and analyse the data in our warehouse.
Terraform was the best solution for creating our infrastructure on AWS, doing most of the heavy lifting for us.
GitHub Actions allowed us to run code quality checks and unit tests before deploying our code to AWS, making sure it wouldn't break the pipeline.
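As a quick illustration of the Pytest point above, here is a small exception test; the function and error type are made up for the example rather than taken from our codebase.

```python
import pytest


class MissingTableError(Exception):
    """Raised when an expected table is absent from an extract."""


def get_table(tables: dict, name: str):
    if name not in tables:
        raise MissingTableError(f"table '{name}' not found in extract")
    return tables[name]


def test_get_table_raises_for_unknown_table():
    # pytest.raises asserts both the exception type and (optionally) its message.
    with pytest.raises(MissingTableError, match="not found"):
        get_table({"staff": []}, "design")
```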
Challenges
Mocking the DB connection and cursor - we researched how to do this using MagicMock and Namespace (see the sketch after this list).
Lambda size limits on AWS - we had to use Lambda layers.
Tests passed locally on our machines but failed in the CI/CD pipeline due to differences in settings and configuration.
Sometimes there was duplicate data to deal with, or some tables contained NA values which we weren't able to insert into the warehouse.
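Below is a minimal sketch of the MagicMock approach, testing the hypothetical extract_recent_rows helper sketched earlier; the module name and columns are illustrative assumptions.

```python
from unittest.mock import MagicMock

from extract_lambda import extract_recent_rows  # hypothetical module from the earlier sketch


def test_extract_builds_dataframe_from_cursor_rows():
    mock_cursor = MagicMock()
    mock_cursor.fetchall.return_value = [(1, "Hasan"), (2, "Caspar")]
    # psycopg2 exposes column names via cursor.description; the first field is the name.
    mock_cursor.description = [("staff_id",), ("first_name",)]

    mock_conn = MagicMock()
    # The code under test uses the cursor as a context manager, so stub __enter__.
    mock_conn.cursor.return_value.__enter__.return_value = mock_cursor

    df = extract_recent_rows(mock_conn, "staff")

    assert list(df.columns) == ["staff_id", "first_name"]
    assert len(df) == 2
```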
