Team Girley Project

Team Girley Demo Video
From Raw to Refined
Our team developed a scalable data pipeline to automate the extraction, transformation, and loading (ETL) of data from an operational database into an AWS-hosted data lake and data warehouse. The project was designed to enable efficient data processing, improve data accessibility, and provide structured insights for analytics and reporting.
Key features of the project:
Data Ingestion: Extracted data from a PostgreSQL database using an AWS Lambda function, secured with GitHub secrets. Used pg8000 for query execution and parameterized queries for dynamic data extraction.
Data Transformation & Storage: Stored raw data in Amazon S3, applied transformations in an AWS Lambda function, and loaded the structured data back into Amazon S3 using a star-schema model (a transformation sketch follows this list).
Automation & Deployment: Used Terraform to provision AWS infrastructure, including S3, IAM, Lambda, and Step Functions. Integrated CI/CD with GitHub Actions for automated deployment.
Monitoring & Error Handling: Configured CloudWatch Logs to track execution and failures. Implemented structured logging with timestamps and error messages. Set up SNS alerts for critical failures.
Branching & Documentation: Followed GitHub Flow for development; created a Wiki for project documentation.
Security & Compliance:
GitHub Secrets: Secure storage of credentials; restricted access in GitHub Actions.
AWS Security: IAM roles/policies via Terraform; CloudTrail logging; Security Groups for access control.
Terraform Security: Remote state in versioned S3; restricted .tfvars in version control.
Code Security: Branch protection rules; .gitignore for sensitive files.
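To make the star-schema step concrete, here is a minimal transformation sketch. It assumes a pandas-based transform (the write-up does not name the library), and the table and column names are hypothetical, not the project's actual schema:

```python
import pandas as pd

def split_star_schema(raw_sales: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Split a flat raw extract into one dimension table and one fact table."""
    # Dimension table: one deduplicated row per staff member.
    dim_staff = (
        raw_sales[["staff_id", "staff_name", "staff_email"]]
        .drop_duplicates(subset="staff_id")
        .reset_index(drop=True)
    )
    # Fact table: measures plus the foreign key pointing into dim_staff.
    fact_sales = raw_sales[["sale_id", "sale_date", "amount", "staff_id"]]
    return {"dim_staff": dim_staff, "fact_sales": fact_sales}
```

Separating descriptive attributes into dimension tables this way keeps the fact table narrow, which is what makes star-schema queries fast to join and aggregate.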
This project provided hands-on experience with AWS data engineering, infrastructure as code, automation, and cloud security. It also involved working in a team, applying Agile methodologies, and ensuring best practices for logging, monitoring, and optimization.
The Team
Sathiyavathi Anandkumar
Tom Ashford
Prince Olubari
Muhammad Alom
Zidan Wang
Pablo Caldas
Technologies

AWS Lambda: Used for data extraction, transformation, and loading automation.
PostgreSQL: The operational database from which data was extracted.
AWS S3: For storing raw data and processed data in the data lake.
AWS IAM: For secure access management and role-based access control.
Terraform: For provisioning and managing AWS infrastructure, ensuring consistency and version control.
AWS Step Functions: To automate workflows for data processing and movement.
GitHub Actions: For CI/CD automation, enabling streamlined deployment and testing.
AWS CloudWatch: For monitoring, logging, and tracking system health and performance.
SNS: For alerting in case of critical errors.
pg8000: For querying PostgreSQL with parameterized queries in Python.
AWS Lambda was chosen for its serverless nature, scalability, and ease of integration with other AWS services like S3 and Step Functions.
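As a rough, self-contained sketch (not the team's actual handler), a Lambda entry point for one pipeline stage might look like this; the event shape and default table name are assumptions:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Entry point invoked by the Step Functions workflow for one pipeline stage."""
    table = event.get("table", "sales")  # hypothetical input passed by the state machine
    logger.info("Starting stage for table %s", table)
    # ...extraction/transformation work would happen here...
    return {"statusCode": 200, "body": json.dumps({"table": table})}
```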
PostgreSQL was the data source due to its robustness, relational nature, and compatibility with the required data processing needs.
AWS S3 provided a cost-effective and scalable storage solution for raw and processed data.
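A minimal sketch of landing a raw extract in the data lake with boto3; the bucket name is a placeholder, and the date-based key layout is an assumed convention rather than the project's documented one:

```python
import csv
import io
from datetime import datetime, timezone

import boto3

def write_raw_csv(rows: list[dict], table: str, bucket: str = "girley-raw-data") -> str:
    """Serialize extracted rows to CSV and store them under a timestamped key."""
    if not rows:
        raise ValueError("no rows to write")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    # A timestamped key keeps every extraction run separately addressable in the lake.
    key = f"{table}/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.csv"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
    return key
```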
AWS IAM was used to ensure security and enforce the principle of least privilege across the team.
Terraform was selected for its Infrastructure-as-Code capabilities, enabling automated and repeatable provisioning of AWS resources.
AWS Step Functions allowed us to automate workflows in a visual, easy-to-manage way, ensuring that data was processed and moved correctly.
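To illustrate what such a workflow looks like, here is a minimal Amazon States Language definition chaining three Lambdas, expressed as a Python dict. The ARNs, region, and account ID are placeholders; in the project itself the definition would be rendered and deployed by Terraform:

```python
import json

# Hypothetical three-stage definition; real ARNs would come from Terraform outputs.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:extract",
            "Next": "Transform",
        },
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:transform",
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:load",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))  # paste or template this into the Terraform resource
```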
GitHub Actions helped automate deployment pipelines, making it easier to continuously integrate and deploy changes.
AWS CloudWatch offered real-time logging and monitoring, ensuring visibility into the Lambda functions' execution and allowing quick responses to failures.
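The structured logging mentioned earlier can be as simple as the standard-library pattern below; Lambda forwards anything written through `logging` to CloudWatch Logs automatically. The exact format string is an assumption:

```python
import logging

# Lambda ships the logging module's output to CloudWatch Logs, so a
# timestamped format is enough for structured, searchable entries.
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    raise ValueError("simulated transformation failure")
except ValueError:
    # exception() records the message at ERROR level plus the full traceback.
    logger.exception("Transform step failed")
```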
SNS provided a reliable mechanism for sending notifications on critical issues, keeping the team informed.
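A minimal sketch of publishing a critical-failure alert; the topic ARN is a placeholder:

```python
import boto3

def alert_on_failure(error: Exception, step: str) -> None:
    """Publish a critical-failure notification to the team's SNS topic."""
    sns = boto3.client("sns")
    sns.publish(
        TopicArn="arn:aws:sns:eu-west-2:123456789012:pipeline-alerts",  # placeholder ARN
        Subject=f"Pipeline failure in {step}",
        Message=f"{type(error).__name__}: {error}",
    )
```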
pg8000 was chosen for interacting with PostgreSQL in a lightweight, Pythonic way, supporting dynamic queries.
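For reference, a hedged sketch of a parameterized extraction query with pg8000's native interface; the environment variable names and the `last_updated` column are illustrative, not the project's actual schema:

```python
import os

from pg8000.native import Connection, identifier

def fetch_updated_rows(table: str, since: str) -> list[list]:
    """Pull rows updated after `since` using a parameterized query."""
    conn = Connection(
        user=os.environ["DB_USER"],          # injected from GitHub Secrets at deploy time
        password=os.environ["DB_PASSWORD"],
        host=os.environ["DB_HOST"],
        database=os.environ["DB_NAME"],
    )
    try:
        # :since is bound by pg8000, so the value is never interpolated into the SQL;
        # identifier() safely quotes the table name, which cannot be a bound parameter.
        return conn.run(
            f"SELECT * FROM {identifier(table)} WHERE last_updated > :since",
            since=since,
        )
    finally:
        conn.close()
```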
Challenges Faced
During this project, we encountered two significant challenges:
Database Credentials Exposure: We initially discovered that our database credentials were accidentally exposed on GitHub. To resolve this, we implemented GitHub Secrets for secure storage of sensitive data, ensuring that we never hardcode credentials in the repository. Additionally, we used a .gitignore file to prevent sensitive files from being committed.
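The fix boils down to reading credentials from the environment at runtime rather than committing them. A minimal sketch, with an assumed variable name:

```python
import os

def get_required_secret(name: str) -> str:
    """Read a credential injected via GitHub Secrets into the Lambda environment.

    Failing fast here means a misconfigured deployment surfaces immediately,
    instead of tempting anyone to fall back to a hardcoded value.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage: DB_PASSWORD = get_required_secret("DB_PASSWORD")
```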
Lambda Layers Implementation: We faced some difficulty in setting up and using Lambda layers. There was confusion around the process and the right way to install dependencies within layers. We spent a considerable amount of time debugging and researching solutions. Ultimately, we successfully implemented Lambda layers, but this was a learning curve that slowed our progress.
Despite these challenges, we were able to overcome them through teamwork, persistence, and further research.
The project provided valuable hands-on experience with AWS services, Terraform, and CI/CD practices. Working in a team, we also applied Agile methodologies, and throughout the project, we focused on best practices for logging, monitoring, and automation. It was a challenging yet rewarding project, and we gained a deeper understanding of cloud infrastructure and data engineering.