Team Deliverance
present

Bootcamp Born, Data Delivered

Driven by the collective efforts of Az, Billy, Miles, Oliver, and Prubh, our team has successfully implemented Agile methodologies in the construction of a robust and adaptable data pipeline. Leveraging AWS services, Python, SQL, and modern DevOps practices, our mission was to develop an ETL process to extract, transform, and load data from the Totesys operational database into a data warehouse for comprehensive analysis and reporting.

Fostering a culture of collaboration, our team embraced pair programming sessions, daily stand-ups, and the utilisation of GitHub Projects with a Kanban board. This framework facilitated open communication, shared ownership of tasks, and transparent progress tracking, ensuring alignment with project objectives and deadlines.

Our technical approach centred on AWS Lambda functions orchestrated by AWS Step Functions, enabling efficient data processing and integration. To keep the warehouse current, we employed an EventBridge scheduler to trigger the pipeline at regular intervals, ensuring new data is available within a 30-minute window.
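As a rough illustration of this orchestration pattern, the sketch below shows two Lambda stages chained together: Step Functions passes each state's output as the input of the next, so the extract stage can hand the transform stage the location of the staged file. The handler names, keys, and payload fields are hypothetical rather than the project's actual code.

```python
# Minimal sketch of two pipeline stages chained by Step Functions.
# Handler names, keys, and payload fields are hypothetical.
from datetime import datetime, timezone


def extract_handler(event, context):
    """First state: pull new rows from Totesys and stage them as JSON Lines in S3."""
    last_checked = event.get("last_checked", "1970-01-01T00:00:00")
    # ... query Totesys for rows updated since last_checked and write them to the ingestion bucket ...
    staged_key = f"staging/sales_order/{datetime.now(timezone.utc):%Y-%m-%dT%H-%M-%S}.jsonl"
    # The returned dict becomes the input event of the next state in the state machine.
    return {"staged_key": staged_key, "last_checked": datetime.now(timezone.utc).isoformat()}


def transform_handler(event, context):
    """Second state: receives the previous state's output via `event`."""
    staged_key = event["staged_key"]
    # ... read the JSON Lines extract, remodel it into the warehouse schema, write Parquet ...
    return {"processed_key": staged_key.replace("staging/", "processed/").replace(".jsonl", ".parquet")}
```

An EventBridge schedule expression such as rate(30 minutes) can then start the state machine on the required cadence.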

We initially stored data in JSON Lines format and transitioned to Parquet for improved efficiency and performance. Data storage was streamlined through S3 buckets, monitoring and alerting were integrated using AWS CloudWatch, and Tableau dashboards provided intuitive data visualisations for stakeholders.
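As a simple, hedged illustration of that format change, the snippet below converts a small JSON Lines payload into Parquet using pandas with the pyarrow engine; the column names and values are invented for the example.

```python
# Hedged example: converting a JSON Lines extract into Parquet with pandas/pyarrow.
# Column names and values are illustrative only.
import io

import pandas as pd

jsonl_payload = (
    '{"sales_order_id": 1, "units_sold": 10, "unit_price": 2.5}\n'
    '{"sales_order_id": 2, "units_sold": 4, "unit_price": 3.1}\n'
)

# JSON Lines: one JSON object per line, easy to append to during extraction.
df = pd.read_json(io.StringIO(jsonl_payload), lines=True)

# Parquet: columnar and compressed, cheaper to store and faster to scan analytically.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False, engine="pyarrow")

# buffer.getvalue() could then be uploaded to the processed-data S3 bucket.
print(f"{len(jsonl_payload.encode())} bytes of JSONL -> {buffer.getbuffer().nbytes} bytes of Parquet")
```

Because Parquet is columnar and compressed, the resulting files are typically smaller and much quicker to scan for the analytical queries run against the warehouse.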

Our infrastructure deployment was automated using Terraform, managed through a CI/CD pipeline integrated with GitHub Actions, reflecting our dedication to continuous integration and delivery practices.

We extend our thanks to our client and mentor, Alex, for his consistent encouragement and support throughout the project. His belief in our abilities has been invaluable, motivating us to overcome challenges and achieve success.

This project stands as a testament to our growth and learning as data engineers. Through hands-on experience and collaborative effort, we have deepened our understanding of Agile methodologies and refined our skills in data engineering.

The Team

  • Billy Hopkinson-Wood

  • Miles Phillips

  • Azmol Miah

  • Oliver Boyd

  • Prubh Singh

Technologies

Logos of AWS, Lambda, Terraform, Parquet, CloudWatch and S3

We used:

  • Programming Languages and Frameworks: Python

  • Databases and Data Storage: SQL, Amazon S3, data warehouse

  • Data Formats: Parquet, JSON Lines

  • Cloud Services: AWS Lambda, AWS S3, AWS CloudWatch, AWS Step Functions

  • Visualisation: Tableau

  • DevOps and Infrastructure Management: Terraform, GitHub Actions

For our ETL project, we selected a range of technologies and practices that best address the needs of data extraction, transformation, and loading, as well as data storage, monitoring, and visualisation. Here's why these technologies were chosen:

Python: We utilised Python due to its versatility and extensive library support, which significantly simplifies the development of ETL processes. Its strong community and rich ecosystem make it an ideal choice for scripting and automation tasks in ETL pipelines.

SQL: SQL is essential for querying and manipulating data within relational databases, ensuring efficient data extraction and transformation.
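The sketch below shows what an incremental extraction query might look like in Python. It assumes a PostgreSQL-compatible Totesys instance, the psycopg2 driver, and an illustrative sales_order table with a last_updated column; the real schema and driver may differ.

```python
# Hedged sketch of an incremental extraction query.
# Assumes a PostgreSQL-compatible Totesys database reachable via psycopg2;
# the table and column names are illustrative, not the real Totesys schema.
import os

import psycopg2


def fetch_new_rows(since_timestamp):
    """Return rows updated after `since_timestamp` from a hypothetical sales_order table."""
    conn = psycopg2.connect(
        host=os.environ["TOTESYS_HOST"],
        dbname=os.environ["TOTESYS_DB"],
        user=os.environ["TOTESYS_USER"],
        password=os.environ["TOTESYS_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Parameterised query: only pull rows changed since the last run.
            cur.execute(
                "SELECT * FROM sales_order WHERE last_updated > %s ORDER BY last_updated;",
                (since_timestamp,),
            )
            columns = [col[0] for col in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        conn.close()
```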

Amazon S3: S3 provides scalable, durable, and cost-effective object storage, making it suitable for storing raw data, intermediate files, and final transformed datasets. Its integration with other AWS services enhances data accessibility and processing capabilities.
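A minimal boto3 sketch of how raw extracts might be staged and retrieved is shown below; the bucket names and key layout are assumptions for illustration, not the project's actual configuration.

```python
# Hedged sketch of reading and writing pipeline objects in S3 with boto3.
# Bucket names and key layout are hypothetical.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

INGESTION_BUCKET = "deliverance-ingestion-zone"   # raw JSON Lines extracts (assumed name)
PROCESSED_BUCKET = "deliverance-processed-zone"   # transformed Parquet files (assumed name)


def stage_raw_extract(table_name, jsonl_bytes):
    """Write a raw extract under a timestamped key so each run is kept separately."""
    key = f"{table_name}/{datetime.now(timezone.utc):%Y/%m/%d/%H-%M-%S}.jsonl"
    s3.put_object(Bucket=INGESTION_BUCKET, Key=key, Body=jsonl_bytes)
    return key


def read_raw_extract(key):
    """Fetch a staged extract back for the transform step."""
    response = s3.get_object(Bucket=INGESTION_BUCKET, Key=key)
    return response["Body"].read()
```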

Parquet and JSON Lines: These data formats were chosen for their efficiency and compatibility with big data processing frameworks. Parquet's columnar storage format is optimised for analytical queries, while JSON Lines offers a flexible and readable format for semi-structured data.

AWS Services (Lambda, CloudWatch, Step Functions): AWS Lambda enables serverless data processing, reducing infrastructure management overhead. AWS CloudWatch provides monitoring and logging capabilities, ensuring we can track and troubleshoot the ETL processes effectively. AWS Step Functions orchestrate complex workflows, ensuring reliable and scalable execution of ETL tasks.
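The hedged sketch below shows one way a Lambda stage could surface progress to CloudWatch: standard logging output is forwarded to CloudWatch Logs automatically, and a custom metric can feed dashboards or alarms. The namespace and metric name are invented for the example.

```python
# Hedged sketch: Lambda logging plus a custom CloudWatch metric.
# The namespace and metric name are illustrative, not the project's actual configuration.
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

cloudwatch = boto3.client("cloudwatch")


def record_rows_processed(table_name, row_count):
    """Log progress (forwarded to CloudWatch Logs) and emit a custom metric for dashboards or alarms."""
    logger.info("Processed %s rows from %s", row_count, table_name)
    cloudwatch.put_metric_data(
        Namespace="EtlPipeline",  # assumed namespace
        MetricData=[
            {
                "MetricName": "RowsProcessed",
                "Dimensions": [{"Name": "Table", "Value": table_name}],
                "Value": row_count,
                "Unit": "Count",
            }
        ],
    )
```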

Tableau: For data visualisation, Tableau offers powerful tools to create interactive and shareable dashboards. This enhances our ability to derive insights from the data and communicate findings effectively.

Terraform and GitHub Actions: Terraform allows us to manage infrastructure as code, ensuring reproducibility and scalability of our environment. GitHub Actions facilitates continuous integration and continuous deployment (CI/CD), enabling automated testing and deployment of our ETL processes.

By combining these technologies and practices, we can build a robust, scalable, and maintainable ETL pipeline that meets our data processing and analysis needs.

Challenges Faced

  • Triggering Process Steps: We implemented AWS Step Functions to orchestrate the sequential execution of the pipeline stages, ensuring a smooth and coordinated workflow.

  • AWS Lambda Layer Size Limit: Due to the size constraints of AWS Lambda layers, we optimised our deployment by removing the unnecessary Pandas test suite, ensuring the essential functionality remained intact.

  • Handling JSON DateTime and NaN Values: We standardised the payload format to handle JSON DateTime and NaN values effectively, ensuring data consistency and compatibility throughout the pipeline (see the sketch after this list).

  • Maintaining Referential Integrity: We carefully planned the payload format and operation sequence to maintain referential integrity, ensuring that data relationships remained accurate and reliable.
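To give a flavour of the datetime and NaN handling, the sketch below serialises a record using ISO 8601 strings for datetimes and converts NaN values to JSON null; the exact conventions our payload format used may differ.

```python
# Hedged sketch of serialising payloads that contain datetimes and NaN values.
# The conventions shown (ISO 8601 strings, NaN -> null) are assumptions about the approach.
import json
import math
from datetime import datetime


class PayloadEncoder(json.JSONEncoder):
    """Render datetime objects as ISO 8601 strings; NaN floats are handled separately below."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)


def clean_nans(record):
    """Replace float NaN values with None so they serialise as JSON null."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


record = {"order_id": 7, "created_at": datetime(2024, 3, 1, 9, 30), "discount": float("nan")}
print(json.dumps(clean_nans(record), cls=PayloadEncoder))
# -> {"order_id": 7, "created_at": "2024-03-01T09:30:00", "discount": null}
```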