Various Artists
present

Various Artists Demo Video

We'll find a way

Our project is an Extract, Transform and Load (ETL) pipeline that takes data from an operational database for our client Terrific Totes: a company that makes branded and customised tote bags for retail and corporate clients.

The data is archived in a data lake and made available in a remodelled OLAP data warehouse hosted in AWS. Movement of data is handled by discrete AWS Lambda functions, scripted in Python and triggered by EventBridge scheduling and object-creation events. The data is extracted via SQL queries into CSV files, then transformed through Pandas DataFrames into Parquet format, moving through the ingestion and processed AWS S3 buckets, each of which keeps a complete timestamped record of all past data states. At each stage of the process, AWS SNS alerts trigger emails in the case of errors, and the entire pipeline is deployed with Terraform via GitHub Actions, enabling continuous integration and deployment.
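
As a minimal sketch of the object-creation trigger (assuming S3 event notifications are routed through EventBridge, and using illustrative bucket and key names), the transform Lambda can read the location of a newly created CSV straight from the event it receives:

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Sketch of a transform Lambda fired by an S3 "Object Created" event.

    With S3 notifications routed through EventBridge, the event's detail block
    carries the bucket and key of the CSV that has just landed in the
    ingestion bucket.
    """
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]  # e.g. "sales_order/2024-01-01T12:00:00.csv"

    # Fetch the newly ingested CSV so it can be transformed and re-saved as Parquet.
    response = s3.get_object(Bucket=bucket, Key=key)
    csv_body = response["Body"].read().decode("utf-8")
    return {"source_key": key, "size": len(csv_body)}
```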

Utilising best team practices at every stage, such as Kanban through GitHub Projects, Agile methodologies, daily stand-up meetings and version control in GitHub, we were able to break the project into individual, digestible tasks, deliver incrementally and closely monitor progress. Because our ETL pipeline runs frequently, we are able to display useful projections from our data warehouse through business insight dashboards, built with tools such as QuickSight and Matplotlib, providing up-to-date, accessible and relevant information tailored to the business requirements of our client.

The Team

  • Angus Hirst

  • Atif Hussain Khan

  • Louis Kelly

  • Lucy Adams

  • Dave Geddes

  • Weronika Elzbieta Falinska

Technologies


We used: Python, PG8000, SQL, AWS SDK for Python (Boto3), AWS S3, AWS EventBridge, AWS SNS, AWS CloudWatch, AWS Lambda, Pandas, AWS SDK for Pandas (AWS Wrangler), NumPy, AWS Secrets Manager, JSON, pytest, moto, datetime, unittest, QuickSight, Matplotlib, Terraform, YAML, GitHub Actions, GitHub Projects.


Python - Utilised for scripting our Extract, Transform and Load functions, along with a library of utility functions that those functions call as needed.

PG8000 - A Python PostgreSQL driver used for interacting with our RDBMSs.
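
For illustration, a query through PG8000's native interface looks roughly like this (the credentials are placeholders, in the project they come from AWS Secrets Manager, and the table and column names are only examples):

```python
from pg8000.native import Connection

# Placeholder credentials: the real values are retrieved from AWS Secrets Manager.
conn = Connection(
    user="example_user",
    password="example_password",
    host="example-host.rds.amazonaws.com",
    port=5432,
    database="example_db",
)

rows = conn.run(
    "SELECT * FROM sales_order WHERE last_updated > :since;",
    since="2024-01-01 00:00:00",
)
columns = [col["name"] for col in conn.columns]  # column names for the CSV header
conn.close()
```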

SQL - Structured Query Language is used to construct the queries that interact with our relational databases, for tasks such as extracting, adding and updating data.

AWS SDK for Python (Boto3) - Employed to interact with AWS services directly from within our Python code.

AWS S3 - An object storage service used to create our data lake, keeping an accurate record of all data ingested from the RDBMS as well as a repository of transformed data that is subsequently uploaded to the data warehouse.
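
As a sketch of how every past state can be kept (the bucket name and key layout are illustrative), each extract is written to a new, timestamped key rather than overwriting the previous one:

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_csv_record(table_name, csv_string, bucket="example-ingestion-bucket"):
    """Store a CSV extract under a timestamped key so earlier states are preserved."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    key = f"{table_name}/{timestamp}.csv"
    s3.put_object(Bucket=bucket, Key=key, Body=csv_string.encode("utf-8"))
    return key
```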

AWS EventBridge - An event management service that enabled us to schedule and trigger events in the cloud. We also used events raised by one stage of the ETL process to initiate the next.
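
In the project the schedule itself is provisioned through Terraform, but the equivalent Boto3 calls illustrate the idea; the rule name, schedule expression and Lambda ARN below are placeholders:

```python
import boto3

events = boto3.client("events")

# A rule that fires on a fixed schedule, with the extract Lambda as its target.
events.put_rule(
    Name="example-extract-schedule",
    ScheduleExpression="rate(30 minutes)",
    State="ENABLED",
)
events.put_targets(
    Rule="example-extract-schedule",
    Targets=[{
        "Id": "extract-lambda",
        "Arn": "arn:aws:lambda:eu-west-2:123456789012:function:example-extract",
    }],
)
```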

AWS SNS - A notification service that sends emails to inform us about Lambda events and to alert us in case of failures or errors.
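
Raising an alert reduces to a single publish call on the topic, whose email subscribers then receive the message (the topic ARN is a placeholder):

```python
import boto3

sns = boto3.client("sns")

def alert_on_error(error, topic_arn="arn:aws:sns:eu-west-2:123456789012:example-etl-alerts"):
    """Publish an error to the alerts topic; email subscribers are notified."""
    sns.publish(
        TopicArn=topic_arn,
        Subject="ETL pipeline error",
        Message=f"An error occurred during the ETL process: {error}",
    )
```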

AWS CloudWatch - Used for monitoring and logging cloud activity, creating a historical record of each run of the pipeline and the data states it produced.
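
Because anything written through Python's logging module inside a Lambda is captured in that function's CloudWatch log group, a handler only needs a standard logger; this is a generic pattern rather than project-specific code:

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Log lines are collected automatically in the function's CloudWatch log group.
    logger.info("Extraction started for event: %s", event)
    return {"statusCode": 200}
```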

AWS Lambda - An event-driven compute service that runs our Python code at different stages of the ETL process.

Pandas - A software library that we utilised to process our data tables into DataFrames, allowing us to manipulate the data before performing conversions into required file formats using the AWS SDK for Pandas (AWS Wrangler).
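
For example, an extracted CSV can be loaded into a DataFrame and reshaped before conversion (the table and column names here are purely illustrative):

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "staff_id,first_name,last_name,department_id\n"
    "1,Ada,Lovelace,2\n"
    "2,Alan,Turing,3\n"
)

# Load the extracted CSV into a DataFrame, then reshape it for the warehouse.
df = pd.read_csv(csv_data)
dim_staff = df.rename(columns={"staff_id": "staff_record_id"}).set_index("staff_record_id")
print(dim_staff.head())
```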

AWS SDK for Pandas (AWS Wrangler) - Employed to seamlessly convert our Pandas DataFrames between CSV and Parquet file formats. This conversion facilitates storage in our Ingestion and Processed S3 buckets, respectively.
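
Writing a DataFrame out to the processed bucket as Parquet is then a single call (the bucket path is a placeholder):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"currency_id": [1, 2], "currency_code": ["GBP", "USD"]})

# Convert the DataFrame to Parquet and store it in the processed bucket.
wr.s3.to_parquet(
    df=df,
    path="s3://example-processed-bucket/dim_currency/2024-01-01T12:00:00.parquet",
)
```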

NumPy - Used to enrich table data where required.

AWS Secrets Manager - Utilised for securely storing and retrieving our database credentials and table names.

JSON - The format used for storing our AWS secrets in Secrets Manager, and later extracting them into a usable format.
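
Retrieving a secret and turning the stored JSON string back into a usable dictionary looks roughly like this (the secret name and its keys are placeholders):

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

def get_db_credentials(secret_name="example-database-credentials"):
    """Fetch database credentials stored as a JSON string in Secrets Manager."""
    response = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])  # e.g. {"user": ..., "password": ...}
```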

pytest - For testing our code to ensure it works as required, and can be deployed with confidence.
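
A typical test is a plain function whose name starts with test_ and whose assertions pytest collects and runs; format_key below is a hypothetical utility, shown only to illustrate the pattern:

```python
def format_key(table_name, timestamp):
    """Hypothetical utility: build an S3 key from a table name and a timestamp."""
    return f"{table_name}/{timestamp}.csv"

def test_format_key_builds_expected_s3_key():
    assert format_key("sales_order", "2024-01-01T12:00:00") == "sales_order/2024-01-01T12:00:00.csv"
```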

moto - A Python library that provides mock AWS services for unit testing. It enables us to verify that our functions make the correct calls without those calls reaching real AWS resources, and to test connection handling without establishing real connections to the underlying components.
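
With moto, a test can exercise S3-facing code against an in-memory bucket instead of real AWS. This sketch uses the mock_aws decorator from moto 5; older releases expose per-service decorators such as mock_s3:

```python
import boto3
from moto import mock_aws

@mock_aws
def test_object_is_written_to_mocked_bucket():
    # Inside the decorated test, boto3 talks to moto's in-memory AWS, not the real thing.
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="example-ingestion-bucket",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )
    s3.put_object(Bucket="example-ingestion-bucket", Key="test/key.csv", Body=b"a,b\n1,2\n")

    listed = s3.list_objects_v2(Bucket="example-ingestion-bucket")
    assert listed["KeyCount"] == 1
```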

datetime - A Python standard-library module, enabling us to effectively manage timestamps and transform them according to our needs.
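
For example, timestamps can be formatted for use in S3 keys and parsed back when the time of the last extraction is needed; the format string is an assumption, not the project's exact convention:

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"

# Format the current time for use in a timestamped S3 key...
now = datetime.now(timezone.utc)
key_fragment = now.strftime(TIMESTAMP_FORMAT)  # e.g. "2024-01-01T12:00:00"

# ...and parse a previously stored fragment back into a datetime for comparison.
last_run = datetime.strptime("2024-01-01T11:30:00", TIMESTAMP_FORMAT)
print(now.replace(tzinfo=None) > last_run)  # True: the current time is after the last run
```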

unittest - For building unit tests when creating our own MockDB file.

QuickSight - A software tool we used to visually display our data on a dashboard, which is intended for the client's use.

Matplotlib - Another software tool we used to visually display our data on a dashboard, which is intended for the client's use.
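
A dashboard-style chart can be produced in a few lines; the figures below are made-up sample values standing in for results queried from the data warehouse:

```python
import matplotlib.pyplot as plt

# Made-up sample values, standing in for totals queried from the data warehouse.
months = ["Jan", "Feb", "Mar", "Apr"]
total_sales = [12500, 14300, 13800, 15900]

plt.bar(months, total_sales)
plt.title("Total sales by month")
plt.ylabel("Sales (GBP)")
plt.savefig("sales_by_month.png")
```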

Terraform - An infrastructure-as-code software tool used to create our individual AWS components and integrate them into the infrastructure that supports our ETL process.

YAML - Used to script the configuration file, defining the workflow of GitHub Actions during the setup of our infrastructure for code deployment.

GitHub Actions - Follows the configuration outlined in the YAML file to implement our CI/CD pipeline, which runs whenever updates are merged into the main branch.

GitHub Projects - Utilised for planning, tracking project progress and flagging when code was up for review. It played an essential role in our daily stand-up meetings, where we shared the progress of our individual tasks and discussed the overall status of the project.

Challenges Faced