Gooseberries Data Engineering Project Presentation.
Elevating Data Engineering with (aws)ome Innovations
We worked on a comprehensive data engineering project that gave us a hands-on opportunity to apply a broad range of technical skills and knowledge. The primary objective was to design, develop, and deploy a data platform that efficiently handles Extract, Transform, Load (ETL) processes, culminating in a well-structured data warehouse.
Our project's core was built around a Minimum Viable Product (MVP), a foundational framework that encompassed essential processes and capabilities. This MVP served as the basis for showcasing our proficiency in fundamental data engineering tasks and acted as a launching pad for future enhancements.
We started by focusing on the data source and its structure. Our primary data source was a simulated operational database named "totesys," which reflects the back-end data of a commercial application. To keep extraction and manipulation precise and targeted, we worked with specific tables such as sales_order, dim_staff, and dim_location. The extracted data was then organized and archived in structured Amazon S3 buckets, keeping it accessible and well managed.
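As a rough illustration of this ingestion step, the Python below extracts one table and archives it as JSON in S3. It is a minimal sketch: the pg8000 driver, the credentials, the host, and the bucket and key names are assumptions, not the project's actual configuration.

import json
import boto3
import pg8000.native  # assumed driver choice; any PostgreSQL client would work

def extract_table_to_s3(table_name, bucket_name):
    # Pull every row from one totesys table and archive it as JSON in S3.
    conn = pg8000.native.Connection(
        user="example_user",            # placeholder credentials
        password="example_password",
        host="totesys.example.com",     # placeholder host
        database="totesys",
    )
    rows = conn.run(f"SELECT * FROM {table_name};")  # table_name is illustrative and trusted here
    columns = [col["name"] for col in conn.columns]
    records = [dict(zip(columns, row)) for row in rows]

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket_name,
        Key=f"ingestion/{table_name}.json",
        Body=json.dumps(records, default=str),  # default=str handles dates and decimals
    )
    conn.close()
    return len(records)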
The heart of our project's success lay in the data processing flow, where we orchestrated the intricate steps of data handling. Automation was key, as we established a systematic sequence for data extraction, transformation, and loading. This automated workflow eliminated the need for manual intervention, ensuring consistent and efficient data processing. Python applications played a central role in this process, serving as the driving force behind task automation. Once data was ingested, it underwent transformation before being loaded into the data warehouse, all seamlessly orchestrated by our Python-powered workflow.
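To make that workflow concrete, here is a minimal sketch of a Lambda handler sitting in the middle of the pipeline, reading a raw extract from S3, applying a simple transformation, and writing the result for loading. The bucket names, the default table, and the column filter are illustrative assumptions rather than the project's exact code.

import json
import boto3

INGESTION_BUCKET = "gooseberries-ingestion-example"   # placeholder bucket names
PROCESSED_BUCKET = "gooseberries-processed-example"

def lambda_handler(event, context):
    # Triggered on a schedule or by an S3 event; reads a raw extract,
    # applies a simple transformation, and writes the result for loading.
    s3 = boto3.client("s3")
    table = event.get("table", "sales_order")  # table to process, defaulted for illustration

    raw = s3.get_object(Bucket=INGESTION_BUCKET, Key=f"ingestion/{table}.json")
    records = json.loads(raw["Body"].read())

    # Example transformation: keep only the columns the warehouse schema needs.
    wanted = {"sales_order_id", "staff_id", "units_sold", "unit_price"}
    transformed = [{k: v for k, v in rec.items() if k in wanted} for rec in records]

    s3.put_object(
        Bucket=PROCESSED_BUCKET,
        Key=f"processed/{table}.json",
        Body=json.dumps(transformed),
    )
    return {"table": table, "rows": len(transformed)}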
Our project also showcased proficiency in technical aspects such as AWS utilization, data modeling, and Agile practices. We leveraged AWS services like Lambda, S3, and CloudWatch for automation, storage, monitoring, and alerts. Our data modeling skills were put to the test as we structured the data to fit efficiently into the data warehouse's schema. Throughout the project, we followed Agile methodologies, embracing iterative development and collaborative teamwork.
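To give a sense of the data modeling work, the sketch below derives a dim_staff-style dimension from two raw extracts using pandas. The column names follow a typical star schema and are assumptions rather than the project's exact warehouse schema.

import pandas as pd

# Illustrative raw extracts; in the project these would come from the ingested totesys data.
staff = pd.DataFrame([
    {"staff_id": 1, "first_name": "Ada", "last_name": "Lovelace",
     "department_id": 10, "email_address": "ada@example.com"},
])
department = pd.DataFrame([
    {"department_id": 10, "department_name": "Sales", "location": "Leeds"},
])

# Join the raw tables and keep only the columns the dimension needs (assumed schema).
dim_staff = (
    staff.merge(department, on="department_id")
         [["staff_id", "first_name", "last_name",
           "department_name", "location", "email_address"]]
)
print(dim_staff)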
As a result of our efforts, we successfully constructed a platform that demonstrated practical application of data engineering skills. Our project stood as a testament to our proficiency in Python programming, AWS utilization, data manipulation, ETL processes, and data visualization. Our MVP not only showcased our abilities but also laid the groundwork for potential extensions. Overall, this project provided a real-world simulation of data engineering challenges and allowed us to put our skills to the test, emphasizing the journey of learning and accomplishment.

Gooseberry Project Video
Team Gooseberries

Valentine Gakunga

Harley Cole

Harry Hainsworth-Staples

Holly Salthouse

David Luke

Amreet Notay
Tech Stack
We used GitHub, AWS, Python, PostgreSQL, Bash scripting, and Terraform.
The selection of these technologies was driven by the need for a comprehensive and efficient data engineering solution that could meet the project's objectives. Each technology played a specific role in different aspects of the project, contributing to its success. Here's why these technologies were chosen:
GitHub: Chosen as a version control system to facilitate collaborative development, track changes, and maintain codebase integrity. It allowed for seamless teamwork among project members.
AWS (Amazon Web Services): Leveraged for its versatile cloud services that provided a scalable, reliable, and cost-effective infrastructure. AWS services like Lambda, S3, and CloudWatch were used for automation, data storage, monitoring, and alerts; a small monitoring sketch follows this list.
Python: Selected for its versatility and ease of use in creating automated scripts for data extraction, transformation, and other tasks. Python's libraries and frameworks supported efficient data manipulation.
PostgreSQL: Chosen as the relational database management system for structured data storage. It facilitated data modeling and offered strong data integrity mechanisms.
Bash Scripting: Utilized for scripting automation tasks, particularly in the deployment process. Bash scripts ensured consistent and repeatable infrastructure setup.
Terraform: Adopted for infrastructure-as-code, allowing for streamlined and consistent deployment of resources on AWS. Terraform provided an efficient way to manage and version infrastructure.
Each of these technologies was strategically chosen to address specific project requirements, ensuring the successful development, deployment, and management of the data engineering solution.
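As a small example of the monitoring and alerting side mentioned in the AWS entry above, the sketch below publishes a custom metric that a CloudWatch alarm could alert on. The namespace, metric name, and values are illustrative assumptions rather than the project's real configuration.

import boto3

def report_rows_processed(table_name, row_count):
    # Publish a custom metric so a CloudWatch alarm can flag unusual volumes.
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="GooseberriesETL",  # illustrative namespace
        MetricData=[
            {
                "MetricName": "RowsProcessed",
                "Dimensions": [{"Name": "Table", "Value": table_name}],
                "Value": row_count,
                "Unit": "Count",
            }
        ],
    )

report_rows_processed("sales_order", 120)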
One challenge we faced was that our first implementation uploaded the entire table to the final database on every run, which was time-consuming and forced us to delete the existing data each time. We solved this by appending only the newly arrived data on each run, which streamlined the process and removed the need to delete anything.
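A minimal sketch of that fix, assuming the source tables carry a last_updated timestamp and that both connections are pg8000-style; the table and column names are placeholders rather than the project's actual schema.

def load_new_rows(source_conn, warehouse_conn, last_run_timestamp):
    # Fetch only rows updated since the previous run and append them to the
    # warehouse, instead of deleting and reloading the whole table.
    new_rows = source_conn.run(
        "SELECT sales_order_id, units_sold, unit_price "
        "FROM sales_order WHERE last_updated > :since;",
        since=last_run_timestamp,
    )
    for order_id, units, price in new_rows:
        warehouse_conn.run(
            "INSERT INTO fact_sales_order (sales_order_id, units_sold, unit_price) "
            "VALUES (:order_id, :units, :price);",
            order_id=order_id, units=units, price=price,
        )
    return len(new_rows)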
