Sperrins Syndicate Final Project

Sperrins Syndicate Demo Video

Bound by the peaks, driven by purpose.

Terrific Totes is a fictional company that operates an OLTP database and a data warehouse used for reporting and visualisations. The goal of this project was to develop applications that Extract, Transform, and Load (ETL) data from an OLTP database into a data lake and warehouse, hosted in AWS. Our solution is reliable, scalable, and fully managed using Infrastructure-as-Code.

The features of our project include:

**Automated Data Processing**
- EventBridge Scheduler: Triggers data ingestion every 5 minutes.
- Step Functions State Machine with JSON Payloads: Orchestrates the workflow.
- Three Lambda Functions (& Layers):
  - One Python application ingests all tables from the OLTP database, then applies incremental refresh so that only new and updated data is processed on subsequent runs, saving resources (a minimal sketch of this idea follows the list).
  - Another Python application remodels the data into a predefined schema and stores it in the "processed" S3 bucket in Parquet format.
  - A third Python application loads the transformed data into the data warehouse, with new and updated data reflected in the target database within 30 minutes. Fact tables retain a full history of how the data has evolved.
- CloudWatch Monitoring & SNS Alerts: Logs errors, tracks performance, and sends critical failure notifications via SNS.
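
For context, here is a minimal sketch of the incremental-refresh idea used by the ingestion application. The bucket name, checkpoint key, and query shape are assumptions for illustration, not the project's exact implementation:

```python
import json
from datetime import datetime, timezone

import boto3

# Illustrative names only; the real bucket and checkpoint layout differ.
S3_BUCKET = "raw-data-bucket"
CHECKPOINT_KEY = "checkpoints/last_ingested_at.txt"


def get_checkpoint(s3):
    """Read the timestamp of the last successful ingestion from S3."""
    try:
        obj = s3.get_object(Bucket=S3_BUCKET, Key=CHECKPOINT_KEY)
        return obj["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        return "1970-01-01 00:00:00"  # first run: ingest everything


def ingest_table(conn, s3, table, since):
    """Pull only rows updated since the checkpoint and store them as JSON.

    `conn` is a pg8000.native.Connection; `table` comes from a fixed
    allow-list of source tables, so the f-string is safe here.
    """
    rows = conn.run(f"SELECT * FROM {table} WHERE last_updated > :since", since=since)
    columns = [col["name"] for col in conn.columns]
    records = [dict(zip(columns, row)) for row in rows]
    if records:
        key = f"{table}/{datetime.now(timezone.utc).isoformat()}.json"
        s3.put_object(Bucket=S3_BUCKET, Key=key, Body=json.dumps(records, default=str))
    return len(records)
```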

**Secure Data Management**
- IAM Policies: Implements the principle of least privilege.
- Secrets Manager: Manages database credentials securely.
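
As an illustration of how credentials might be pulled from Secrets Manager at runtime, here is a sketch; the secret name and the field names inside the secret are assumptions:

```python
import json

import boto3


def get_db_credentials(secret_name="totesys/oltp-credentials"):
    """Fetch database credentials from AWS Secrets Manager.

    The secret name and keys below are illustrative; the project may store
    them differently.
    """
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    secret = json.loads(response["SecretString"])
    return {
        "user": secret["username"],
        "password": secret["password"],
        "host": secret["host"],
        "port": int(secret["port"]),
        "database": secret["dbname"],
    }
```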

**Data Storage in S3**
- Raw Data Bucket: Stores ingested data in its original form and acts as a data lake.
- Processed Data Bucket: Holds transformed data in an immutable, well-structured format.
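
To give a flavour of how transformed data could end up in the processed bucket as Parquet, here is a minimal sketch; the bucket name and key layout are assumptions, and a Parquet engine (e.g. pyarrow) is assumed to be available via a Lambda layer:

```python
import io

import boto3
import pandas as pd


def write_parquet_to_processed(df: pd.DataFrame, table_name: str,
                               bucket: str = "processed-data-bucket") -> str:
    """Serialise a transformed DataFrame to Parquet and upload it to S3."""
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # requires pyarrow or fastparquet
    key = f"{table_name}/{table_name}.parquet"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
    return key
```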

**Code Quality & Security**
- Python code is PEP8 compliant, thoroughly tested (with unit and integration tests), and checked for security vulnerabilities using pip-audit and bandit.

The Team

  • Anas Elsafi
  • Angelo Rohanathan
  • Chiheng Jiang
  • Leila Carrier-Sippy
  • Sezgi Khajotia
  • Youness Senhaji

Technologies

We used AWS, GitHub, Terraform, Python, PostgreSQL, and Tableau.

Using the technologies introduced throughout the Data Engineering course allowed us to consolidate our knowledge of the course material.

The key languages used in this project were Python (75%), Terraform (24%), and Make (~1%), as reflected in our GitHub repository.

AWS services were used to build the infrastructure for our ETL pipeline. The services we used included CloudWatch, Lambda, EventBridge, Step Functions, SNS, S3, and Secrets Manager.

In addition, by using Terraform and GitHub, we were able to define our entire infrastructure as code and fully automate deployment of the project.

PostgreSQL integrated easily with Python (via the pg8000 driver), enabling us to extract data from the OLTP database and load it into our data warehouse.
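
For example, a pg8000 connection can be opened and a table read into pandas roughly as follows (the connection details are placeholders; in the pipeline they come from Secrets Manager):

```python
import pandas as pd
import pg8000.native

# Placeholder connection details for illustration only.
conn = pg8000.native.Connection(
    user="etl_user",
    password="example-password",
    host="oltp.example.com",
    port=5432,
    database="totesys",
)

rows = conn.run("SELECT staff_id, first_name, last_name FROM staff;")
columns = [col["name"] for col in conn.columns]
staff_df = pd.DataFrame(rows, columns=columns)
conn.close()
```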

Tableau was chosen as the application for generating our data visualisations, as it was readily available on all of our operating systems.

Finally, the following libraries and tools provided essential additional functionality for our Python code: boto3, moto, pandas, numpy, pg8000, freezegun, coverage, bandit, and black.
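
As an example of how moto supports the unit tests, a pattern like the one below fakes S3 entirely in memory. The helper function and bucket name are invented for illustration, and moto 5's mock_aws decorator is assumed (older releases use per-service decorators such as mock_s3):

```python
import json

import boto3
from moto import mock_aws


def write_table_to_s3(client, bucket, table, rows):
    """Hypothetical helper: dump a list of rows to S3 as JSON."""
    client.put_object(Bucket=bucket, Key=f"{table}/latest.json", Body=json.dumps(rows))


@mock_aws
def test_write_table_to_s3():
    # moto intercepts the boto3 calls, so no real AWS resources are touched.
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="raw-data-bucket",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )
    write_table_to_s3(s3, "raw-data-bucket", "sales_order", [{"sales_order_id": 1}])
    body = s3.get_object(Bucket="raw-data-bucket", Key="sales_order/latest.json")["Body"].read()
    assert json.loads(body) == [{"sales_order_id": 1}]
```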

Challenges Faced

Here are the top three challenges that we overcame:

  1. The first challenge arose because the transform Lambda sometimes needed to join two tables to build a dimension table. Because our pipeline uses incremental refresh, the transform Lambda did not always have access to the latest data for the table it needed to join with. The solution was to add logic to the ingest Lambda that sends a full, current snapshot of the join table whenever it detects changes in the related tables (see the sketch after this list).
  2. The next challenge concerned the pandas layer needed for the transform and ingest Lambdas. We hit compatibility issues because the Lambda runtime uses a different operating system from our development machines, and size issues because the pandas package exceeded the size constraints of a Lambda layer. The solution was to use the AWS-managed Lambda layer for the pandas dependency.
  3. Finally, we had issues with module imports for Python files that pulled in functions from across a complex folder structure. The solution was to use sys.path.append so that both the deployed and local versions of our code ran without import errors.
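
A minimal sketch of the idea behind the first fix, with an invented dependency map (the real mapping of which source tables join together lives in the ingest Lambda):

```python
# Which additional tables to export in full when a given table has new or
# updated rows. The pairs below are illustrative assumptions.
JOIN_DEPENDENCIES = {
    "counterparty": ["address"],
    "staff": ["department"],
}


def extra_snapshots(changed_tables):
    """Return the tables to re-export in full because a join partner changed."""
    snapshots = set()
    for table in changed_tables:
        snapshots.update(JOIN_DEPENDENCIES.get(table, []))
    return sorted(snapshots)


print(extra_snapshots(["staff", "sales_order"]))  # ['department']
```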