de-neural-normalisers Final Project

de-neural-normalisers Demo Video
This project was carried out by a six-person team with the aim of developing infrastructure, in this case a resilient, reliable and monitored data pipeline, deployed primarily as code. At the heart of the project is the totesys operational database, a moderately complex structure holding a variety of raw datasets.

The first phase of the project ingested data from the totesys database using Python scripts deployed in AWS (Amazon Web Services) Lambda, storing the extracted datasets in JSON (JavaScript Object Notation) format in an S3 bucket. The second phase transformed and modelled the data held in the ingestion bucket using the third-party pandas package, again running in Lambda. Data modelling followed a star schema approach, and the output of these transformations was stored as parquet files in a processed S3 bucket. In the third phase, data from the processed bucket was loaded into a data warehouse (an RDS database) on AWS, again via code embedded in an AWS Lambda function.

All three phases were monitored with the CloudWatch service on AWS: alarms are triggered in the event of a major error, and the error details are sent to a designated email address via an SNS (Simple Notification Service) topic. The pipeline was deployed with CI/CD using GitHub Actions and orchestrated with Step Functions, with an EventBridge scheduler running every 15 minutes so that updates and changes to the totesys database flow through end to end. Finally, BI tools (Power BI) are plugged into our data warehouse to visually create insights that answer specific totesys business questions.
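To make the first phase concrete, the sketch below shows roughly what an ingestion Lambda of this kind can look like: it reads rows from the totesys database with pg8000, serialises them to JSON and writes them to the ingestion S3 bucket. The bucket name, table list and connection details here are placeholders for illustration, not the project's actual configuration (credentials would normally come from environment variables or a secrets store rather than being hard-coded).

```python
import json
import datetime as dt

import boto3
import pg8000.native  # made available to the Lambda via a custom layer

# Hypothetical names used for illustration only.
INGESTION_BUCKET = "totesys-ingestion-bucket"
TABLES = ["sales_order", "staff", "currency"]


def lambda_handler(event, context):
    """Extract rows from the totesys database and store them as JSON in S3."""
    s3 = boto3.client("s3")
    conn = pg8000.native.Connection(
        user="totesys_user",      # placeholder credentials
        password="***",
        host="totesys-host",
        database="totesys",
    )
    timestamp = dt.datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
    try:
        for table in TABLES:
            rows = conn.run(f"SELECT * FROM {table};")
            columns = [col["name"] for col in conn.columns]
            records = [dict(zip(columns, row)) for row in rows]
            s3.put_object(
                Bucket=INGESTION_BUCKET,
                Key=f"{table}/{timestamp}.json",
                Body=json.dumps(records, default=str),
            )
    finally:
        conn.close()
    return {"statusCode": 200, "ingested_tables": TABLES}
```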
The technical side of this project involved creating a Lambda layer for each phase of the pipeline. The ingestion Lambda had compatibility constraints with the pg8000 module, so a layer containing all of the module's dependencies was created to allow a seamless connection to the totesys database. Likewise, the second phase required a layer that made the pandas package available alongside boto3, which made data manipulation in our modelling stage straightforward. Finally, a layer was deployed for the SQLAlchemy package, used to connect to our data warehouse and load data into it according to a predefined schema. It is also noteworthy that changes to the totesys database were handled in code and reflected in our data warehouse within 30 minutes. Further technical details of this project can be found at https://github.com/Mpjames217/de-project-specification.
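A minimal sketch of the load phase, assuming the processed bucket stores one parquet file per warehouse table and the Lambda receives the object key in its event: pandas reads the parquet file from S3 and SQLAlchemy appends the rows to the matching warehouse table. The bucket name, connection URL and event shape are assumptions for illustration, and reading parquet requires pyarrow (or fastparquet) in the layer alongside pandas.

```python
from io import BytesIO

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical names used for illustration only.
PROCESSED_BUCKET = "totesys-processed-bucket"
WAREHOUSE_URL = "postgresql+pg8000://user:password@warehouse-host:5432/warehouse"


def lambda_handler(event, context):
    """Load a processed parquet file into the corresponding warehouse table."""
    s3 = boto3.client("s3")
    engine = create_engine(WAREHOUSE_URL)

    # The event is assumed to carry the key of the parquet file to load,
    # e.g. "dim_staff/2024-01-01T12-00-00.parquet".
    key = event["parquet_key"]
    table_name = key.split("/")[0]

    obj = s3.get_object(Bucket=PROCESSED_BUCKET, Key=key)
    frame = pd.read_parquet(BytesIO(obj["Body"].read()))

    # Append the modelled rows to the warehouse table defined by the schema.
    frame.to_sql(table_name, engine, if_exists="append", index=False)
    return {"statusCode": 200, "loaded_table": table_name}
```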
Technologies

We used: Python, Terraform, the pandas package, boto3 and moto, AWS services such as Lambda, S3, CloudWatch, SNS topics, Step Functions and EventBridge, YAML, and GitHub Actions for CI/CD.
The tech stack we utilised for this project consisted mostly of technologies taught during the course. However, we chose pandas for our star-schema data transformation because of the ease of data manipulation this package offers. In particular, it is relatively straightforward to filter datasets using the loc and iloc methods, as the short sketch below shows.
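The toy dataframe below is purely illustrative (the column names are not the real totesys schema), but it shows the loc/iloc filtering pattern mentioned above.

```python
import pandas as pd

# Toy dataset standing in for an ingested totesys table (illustrative only).
staff = pd.DataFrame({
    "staff_id": [1, 2, 3],
    "first_name": ["Ada", "Grace", "Alan"],
    "department_id": [10, 10, 20],
    "created_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# loc: label/condition-based selection, e.g. keep one department's staff
dept_10 = staff.loc[staff["department_id"] == 10, ["staff_id", "first_name"]]

# iloc: position-based selection, e.g. the first two rows and three columns
preview = staff.iloc[:2, :3]

print(dept_10)
print(preview)
```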
Challenges Faced
Due to time constraints, we could not make full use of our code-review process; instead, we reviewed code briefly after every stand-up. Versioning our fact table according to the project specification also proved difficult. Lastly, we concluded that checking for changes in the totesys database could have been done more efficiently, for example by querying only rows modified since the previous run, as sketched below.
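A possible sketch of that more efficient check, assuming each totesys table carries a last_updated timestamp column and that the cut-off from the previous run is persisted somewhere (e.g. S3 or SSM Parameter Store) between Lambda invocations; the function and column names are assumptions, not part of our implementation.

```python
import datetime as dt

import pg8000.native


def fetch_changed_rows(conn: pg8000.native.Connection, table: str,
                       last_ingested: dt.datetime):
    """Return only the rows modified since the previous ingestion run.

    Assumes the table has a last_updated timestamp column; the cut-off
    would be stored between runs rather than recomputed each time.
    """
    return conn.run(
        f"SELECT * FROM {table} WHERE last_updated > :cutoff;",
        cutoff=last_ingested,
    )
```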