Team Vinson Project
In Project Vinson, our team of six (Andrea, Daniel, Elliot, Mohsin, Edith, and Svetlana) built an automated process to extract data from the Totesys database. Extracted data was stored in S3 buckets, transformed into a star schema model, and then loaded into a data warehouse, making it well suited to business analysis. Throughout each phase of the process, we monitored and logged all activity to ensure accuracy and efficiency. This project reflects our growth as data engineers, deepening our understanding of Agile methodologies and refining our data engineering skills through hands-on experience and collaboration.
The Team
Daniel Smyth
Elliot Lyman
Edith Cheler
Mohsin Sarker
Svetlana Wise
Andrea Biro
Technologies
We used: Python, SQL, Amazon S3, Parquet and CSV, AWS services (Lambda, CloudWatch, EventBridge, Secrets Manager, IAM), Tableau, Terraform and GitHub Actions.
- Python: Python was used to write the AWS Lambda functions, which interact with AWS infrastructure, for example to retrieve secrets and passwords.
- SQL: SQL is the standard language for querying relational databases and it allows the user to manipulate, analyse and integrate data.
- Amazon S3: We used Amazon S3 to store our ingested and processed data, as it provides scalability, security and performance when interacting with the Lambda functions.
- Parquet and CSV: The Parquet file format provides efficient compression, which improved the overall performance of the Lambda function. CSV was used for simplicity in the subsequent Lambda functions.
- AWS Services (Lambda, CloudWatch, EventBridge, Secrets Manager, IAM): We used AWS Lambda for serverless computing, which let us run our functions at low cost. CloudWatch was used for logging and monitoring all activity within the S3 buckets. EventBridge was used to invoke the extract Lambda function automatically on a set timer. Secrets Manager was used for secure storage and management of sensitive information, ensuring security and compliance. IAM was used to manage permissions between the AWS services.
- Tableau: Tableau was used to visualise and subsequently analyse data to gain insights about business performance.
- Terraform and GitHub Actions: Terraform was used to define and deploy our AWS infrastructure as code. GitHub Actions was used to automate testing and deployment through a continuous integration and continuous deployment (CI/CD) pipeline.
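As an illustration of how a Lambda function written in Python might retrieve database credentials from Secrets Manager, here is a minimal sketch. It is not the project's actual code: the function name `get_db_credentials`, the secret ID and the `FakeSecretsClient` class are hypothetical, and the sketch assumes the secret is stored as a JSON string. The only AWS call used is boto3's `get_secret_value`.

```python
import json


def get_db_credentials(secret_id, client=None):
    """Fetch a JSON secret from AWS Secrets Manager and return it as a dict."""
    if client is None:
        # Imported lazily so the function can be exercised without AWS access.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret was stored as a JSON string, not binary.
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Hypothetical stand-in for the boto3 client, returning a canned response."""

    def get_secret_value(self, SecretId):
        payload = {"user": "example", "password": "not-a-real-password"}
        return {"SecretString": json.dumps(payload)}


# The secret ID below is made up for illustration.
creds = get_db_credentials(
    "hypothetical/totesys-credentials", client=FakeSecretsClient()
)
```

Passing the client in as a parameter keeps the function testable with a fake object, so no real credentials or AWS calls are needed in tests.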
Challenges Faced
- Handling records containing apostrophes and None values required changes to the transform function.
- Data organisation in the data warehouse caused issues, which we addressed by changing how the extract function ordered records.
- Writing mock tests that do not give false passes, while also avoiding real data in tests.
- Creating branches before the CI/CD pipeline was in place left the initial extract function untested for too long.
- Creating a custom layer so the AWS Lambda functions could access the pandas library led to errors importing numpy.
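One common way to handle the apostrophe and None problem in the first challenge above is to pass values as query parameters rather than building SQL strings by hand. The sketch below shows that technique only; it is not necessarily the change the team made to the transform function. It uses an in-memory SQLite database as a stand-in for the warehouse, and the table name `dim_staff` and its columns are hypothetical.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert (name, notes) tuples using placeholders, not string formatting."""
    conn.executemany(
        "INSERT INTO dim_staff (name, notes) VALUES (?, ?)",
        rows,
    )
    conn.commit()


# In-memory database as a stand-in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_staff (name TEXT, notes TEXT)")

# One value contains an apostrophe and one is None.
load_rows(conn, [("O'Brien", None), ("Smith", "it's fine")])

stored = conn.execute(
    "SELECT name, notes FROM dim_staff ORDER BY name"
).fetchall()
```

With placeholders, the driver takes care of quoting, and Python's None is stored as SQL NULL. Other database drivers use a different placeholder style from SQLite's `?`, so the query text would need adjusting for the real warehouse connection.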