Alapin Projectpresent
resilient, resourceful and responsive
The challenge was to create an ETL data pipeline for Terrific Totes: to extract the data from an OLTP database and save it in a data lake, transform it into a star schema, and then load it into a data warehouse where we can perform analysis and create visualisations.
In our solution, Eventbridge starts the execution of our Step Function every 15 minutes. The Step Function invokes three lambda functions in order:
- The Extract Lambda Function takes the data from the OLTP warehouse converts it to parquet format and then saves it in an S3 Bucket.
- The transform function takes this ingested data and transforms it into the given star schema, saving each table as a parquet file in another S3 Bucket.
- The Load Function takes the transformed data and inserts it as new rows in the data warehouse.
Each function logs information to Cloudwatch and email alerts are set up to inform developers of any errors in executing the code.
Our repo used GitHub actions to run checks on our code before every pull request and deploy our terraform infrastructure to AWS. Our Python code is pep-8 compliant with 97% test coverage.
Once our data was loaded into the warehouse, we developed an interactive dashboard using Streamlit. This allowed users to explore data in real-time with flexible filters and customizable views, providing immediate insights. We also integrated Tableau into the Streamlit app, combining custom Python analysis with polished, pre-built reports.
Streamlit was chosen for its flexibility, enabling us to connect to PostgreSQL using pg8000, process data with pandas, and create dynamic visualizations using matplotlib.
Key features of our app include powerful data management tools, such as a data quality viewer where users can filter table names, clean data, handle nulls, and run descriptive statistics directly in the app. Users can also filter queries and analyze sales performance. Dynamic charts provide instant insights, and Tableau integration ensures seamless reporting and deeper analysis.
The Team
Alex Beveridge
Hanna Wang
Jessica Marcell
Mohankumar Nanjegowda
Sarah Dankwa
Technologies
We used: AWS services including Lambda, S3, SNS, Cloudwatch, Eventbridge, Step Functions and Secrets Manager.
Terraform was used to deploy our AWS infrastructure.
Python packages including Boto3, pandas, pg8000 and forex were used in our code to deliver the project
Testing was completed using Pytest and moto for mocking AWS resources.
GitHub and GitHub actions were used for our CICD pipeline.
Visualisation was done using Tableau and streamlit
Challenges Faced
Lambda layers: The pandas dependencies files needed for the Python application to run were too big, so we used a pre-built pandas layer in addition to our custom layer folder.
CI/CD: The tool Safety that we used to check for vulnerabilities in the requirements used by our Python application flagged an issue with 2 of our necessary services; Black – a Pep 8-compliant Python code formatter – and Jinja2 – a web template engine needed for pg8000. We decided to add ignore flags for each of these requirements when running our safety checks.
AWS SNS topics: We needed the SNS topics and subscription to remain persistent and not destroy on each terraform destroy command, so we created the topic in the console and referred to it in our terraform files.