404 team name not found presents
404 team name not found Demo Video
Finding effective data solutions
Our Data Warehouse Pipeline is a robust and scalable solution designed to seamlessly transform the company's transactional database into a powerful analytics data warehouse. Leveraging the capabilities of AWS services, the pipeline ensures efficient extraction, transformation, and loading (ETL) processes to convert data from an Online Transaction Processing (OLTP) database into an Online Analytical Processing (OLAP) format.
The Team
Sam Jukes
Josh Lee
Zhuangliang Cao
Alex Chan
Ryhan Uddin
Pavelas Zarikovas
Technologies
We used: Terraform, Docker, pg8000, pandas, GitHub, Slack, Trello, Excalidraw and Python.
We used Slack, Trello and Excalidraw for our planning and organisation. Slack huddles combined with Excalidraw were perfect for our morning and afternoon stand-ups and sprint planning ceremonies, and Trello's kanban board made organising tickets simple and efficient. For Infrastructure as Code we used Terraform, and GitHub Actions gave us continuous integration and deployment, guiding each step of the project through automated testing and security checks. The transactional and analytics databases were hosted in PostgreSQL, and we used Docker to mock them for unit and integration testing.
Our Lambdas were written in Python; its extensive libraries and modules made completing many of the tasks involved simpler and faster. We used pg8000 as our SQL driver and pandas for our data manipulation, as these were the technologies we as a team were most familiar with.
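To illustrate how the two fit together, here is a minimal sketch of an extract-and-transform step using pg8000 and pandas. The connection details, table and column names are hypothetical, not our actual schema:

```python
import pg8000.native
import pandas as pd

# Hypothetical connection details -- in a real pipeline these would come
# from AWS Secrets Manager or environment variables, not hard-coded values.
conn = pg8000.native.Connection(
    user="etl_user", password="secret", host="localhost", database="transactions_db"
)

# Extract: pull raw rows from a (hypothetical) transactional table.
rows = conn.run("SELECT sale_id, amount, created_at FROM sales;")
columns = [col["name"] for col in conn.columns]

# Transform: load into a DataFrame and derive the fields the warehouse needs.
df = pd.DataFrame(rows, columns=columns)
df["created_date"] = pd.to_datetime(df["created_at"]).dt.date

conn.close()
```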
Challenges Faced
Regarding CI/CD, we initially structured our GitHub Actions workflow so that our Terraform was deployed in full on every push. This meant that if a team member began working on a file without first pulling from main, their push would re-deploy the old Terraform and overwrite our latest work. We resolved this by having our GitHub Actions trigger only on a push to main, though in future it would be wiser to deploy only on a push to main while still running the tests and safety checks on every push.
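A workflow along those lines might look something like this (a rough sketch with hypothetical job names and test commands, not our actual configuration):

```yaml
on:
  push:                  # tests and safety checks run on every push
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test safety-checks   # hypothetical make targets
  deploy:
    needs: checks
    if: github.ref == 'refs/heads/main'   # Terraform deploys only from main
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init && terraform apply -auto-approve
```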
AWS permissions: getting the infrastructure deployed was a walk in the park, but giving each component the permissions to do its job was a fair challenge.
Another challenge was escaping characters. In our Loading Lambda, Parquet and PostgreSQL each had different ways of escaping characters in a string, which made dealing with apostrophes somewhat frustrating, though we found a solution eventually.
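One common way to sidestep the problem on the PostgreSQL side (a sketch of the general technique rather than our exact fix) is to let pg8000 handle the escaping through parameterised queries instead of building SQL strings by hand:

```python
import pg8000.native

# Hypothetical connection details for illustration only.
conn = pg8000.native.Connection(
    user="etl_user", password="secret", host="localhost", database="warehouse"
)

# A value containing an apostrophe that would break naive string formatting:
# f"INSERT INTO dim_staff (name) VALUES ('{staff_name}')" produces invalid SQL.
staff_name = "O'Connor"

# With a parameterised query the driver sends the value separately from the
# statement, so the apostrophe never needs escaping at all.
conn.run(
    "INSERT INTO dim_staff (name) VALUES (:name)",  # dim_staff is hypothetical
    name=staff_name,
)

conn.close()
```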
Finally, we used Docker, which we were unfamiliar with at the outset and had to get to grips with. We faced some challenges with integration testing against Dockerised databases, but overcame them in the end.
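For anyone attempting something similar, here is a minimal sketch of the kind of integration test we mean, assuming a throwaway PostgreSQL container is already running locally (the credentials and table are hypothetical):

```python
import pg8000.native
import pytest


@pytest.fixture
def db_connection():
    # Assumes a disposable Postgres container is running, started with e.g.
    # docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres
    conn = pg8000.native.Connection(
        user="postgres", password="secret", host="localhost", database="postgres"
    )
    yield conn
    conn.close()


def test_can_round_trip_a_row(db_connection):
    # A temporary table used only to exercise the real database connection.
    db_connection.run("CREATE TEMP TABLE smoke (id INT, label TEXT)")
    db_connection.run("INSERT INTO smoke VALUES (:i, :l)", i=1, l="ok")
    rows = db_connection.run("SELECT id, label FROM smoke")
    assert rows == [[1, "ok"]]
```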