Team Vinson Project

In Project Vinson, our team of six (Andrea, Daniel, Elliot, Mohsin, Edith, and Svetlana) built an automated process to extract data from the Totesys database. Extracted data was stored in S3 buckets, transformed into a star schema model, and then loaded into a data warehouse, making it well suited to business analysis. Throughout each phase of the process, we monitored and logged all activities to ensure accuracy and efficiency. This project reflects our growth as data engineers, deepening our understanding of Agile methodologies and refining our data engineering skills through hands-on experience and collaboration.
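
As a rough illustration of the transform step, the sketch below splits flat records into a dimension table and a fact table that references it by surrogate key. The record shape, the column names (`staff_name`, `units_sold`) and the `to_star_schema` helper are all hypothetical; this is not the Totesys schema or the team's actual code.

```python
def to_star_schema(rows):
    """Split flat sales rows into a staff dimension and a sales fact table."""
    staff_keys = {}  # natural key (staff name) -> surrogate key
    fact_sales = []
    for row in rows:
        # Assign a new surrogate key the first time a staff member is seen.
        staff_id = staff_keys.setdefault(row["staff_name"], len(staff_keys) + 1)
        fact_sales.append({"staff_id": staff_id, "units_sold": row["units_sold"]})
    dim_staff = [
        {"staff_id": staff_id, "staff_name": name}
        for name, staff_id in staff_keys.items()
    ]
    return dim_staff, fact_sales


# Hypothetical input rows, for illustration only.
rows = [
    {"staff_name": "Ana", "units_sold": 3},
    {"staff_name": "Ben", "units_sold": 5},
    {"staff_name": "Ana", "units_sold": 2},
]
dim_staff, fact_sales = to_star_schema(rows)
```

Each staff member appears once in the dimension table, while the fact table keeps one row per sale and only the key that points back to the dimension.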

The Team

  • Daniel Smyth
  • Elliot Lyman
  • Edith Cheler
  • Mohsin Sarker
  • Svetlana Wise
  • Andrea Biro

Technologies

We used: Python, SQL, Amazon S3, Parquet and CSV, AWS services (Lambda, CloudWatch, EventBridge, Secrets Manager, IAM), Tableau, Terraform and GitHub Actions.

  • Python: Python was used to write the AWS Lambda functions, which interact with AWS infrastructure, for example to retrieve secrets and passwords.
  • SQL: SQL is the standard language for querying relational databases and it allows the user to manipulate, analyse and integrate data.
  • Amazon S3: We used Amazon S3 to store our ingested and processed data, as it provides scalability, security and performance when interacting with the Lambda functions.
  • Parquet and CSV: Parquet's efficient compression improved the overall performance of the Lambda function. CSV was used for its simplicity in the Lambda functions that follow.
  • AWS Services (Lambda, CloudWatch, EventBridge, Secrets Manager, IAM): We used AWS Lambda for serverless computing, which let us run our functions at low cost. CloudWatch was used for logging and monitoring all activities within the S3 buckets. EventBridge was used to invoke the extract Lambda function automatically on a set schedule. Secrets Manager was used for secure storage and management of sensitive information, ensuring security and compliance. IAM was used to manage permissions between the AWS services.
  • Tableau: Tableau was used to visualise and subsequently analyse data to gain insights about business performance.
  • Terraform and GitHub Actions: Terraform was used to define and deploy our AWS infrastructure as code. GitHub Actions was used to automate testing and deployment through a continuous integration and continuous deployment (CI/CD) pipeline.
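
To make the secrets retrieval mentioned above more concrete, here is a minimal sketch of how a Lambda function might read database credentials from Secrets Manager. It uses boto3's `get_secret_value` call; the helper name `get_db_credentials`, the secret name `totesys-db-credentials`, the JSON layout of the secret and the `FakeSecretsClient` stand-in are all assumptions for illustration, not the team's actual code.

```python
import json


def get_db_credentials(secret_id, client=None):
    """Fetch a JSON secret from AWS Secrets Manager and return it as a dict."""
    if client is None:
        import boto3  # imported lazily so the function can be exercised without AWS

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as a JSON string of key/value pairs.
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Hypothetical stand-in for the boto3 client so the example runs offline."""

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps({"user": "demo", "password": "not-a-real-password"})}


credentials = get_db_credentials("totesys-db-credentials", client=FakeSecretsClient())
```

Passing the client in as a parameter keeps the function easy to test with a fake, which avoids touching real secrets in unit tests.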

Challenges Faced

  • Handling records containing apostrophes and None values required changes to the transform function.
  • Data organisation in the data warehouse caused issues, which we addressed by changing how the extract function ordered records.
  • Writing mock tests that do not give false passes, while also avoiding real data in tests.
  • Creating branches before the CI/CD pipeline was in place left the initial extract function untested for too long.
  • Creating a custom layer so the AWS Lambda functions could access the pandas library led to errors importing numpy.
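
One common way to handle the apostrophe and None problem from the first challenge is to pass values as query parameters rather than building SQL strings by hand. The sketch below shows the idea with Python's built-in `sqlite3` module as a stand-in for the warehouse; the table, columns and sample records are hypothetical, and this is not necessarily the fix the team applied.

```python
import sqlite3

# Hypothetical records: one name contains an apostrophe, one value is None.
records = [
    {"name": "O'Brien", "city": None},
    {"name": "Smith", "city": "Leeds"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_staff (name TEXT, city TEXT)")
# "?" placeholders let the driver bind the values, so apostrophes need no
# manual escaping and None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO dim_staff (name, city) VALUES (?, ?)",
    [(r["name"], r["city"]) for r in records],
)
stored = conn.execute("SELECT name, city FROM dim_staff ORDER BY name").fetchall()
conn.close()
```

Placeholder syntax varies between database drivers, so the `?` style shown here would need adapting to whichever driver connects to the warehouse.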