Bag End: The Fellowship of the Sales Final Project

Bag End: The Fellowship of the Sales Demo Video
One Data Pipeline to Rule Them All.
Bag End: The Fellowship of the Sales is a data pipeline designed to extract, transform, and load sales data efficiently. It retrieves raw data from PostgreSQL, processes it into a structured warehouse format using Pandas and PyArrow, and stores it in S3 for downstream analytics. Built with AWS Lambda, Step Functions, and SQLAlchemy, it delivers scalable, automated data processing while handling schema changes and empty extracts, with indexing applied at the load stage.
The Team
Nicole Rutherford
Bonnie Packer
Beth Dunstan
Pieter van den Broek
Luke Guantlett
Technologies

Cloud & Infrastructure
- AWS Lambda – Serverless computing to extract, transform, and load (ETL) data efficiently without managing infrastructure.
- AWS S3 – Cloud storage for raw and processed sales data, allowing scalable and cost-effective data management.
- AWS Step Functions – Orchestrates the ETL pipeline, ensuring a smooth workflow between Lambda functions.
- AWS CloudWatch – Logs and monitors the pipeline, alerting us to errors and performance issues.
- AWS SNS – Sends notifications for critical failures in the pipeline.

Database & Querying
- PostgreSQL – Reliable, SQL-based database storing structured data from sales transactions.
- SQLAlchemy (pg8000 driver) – ORM and database toolkit for interacting with PostgreSQL in a Pythonic way.

Data Processing & Format
- Pandas & PyArrow – Used in the transform Lambda to efficiently handle and convert data to Parquet.
- Parquet Format – Optimized for analytics, reducing storage costs and improving query speed.

Testing & Development
- Moto (mock_aws) – Mocks AWS services like S3 and SNS for unit testing without incurring AWS costs.
- Pytest – Structured and modular testing for each part of the pipeline.
- unittest.mock – Used to replace database connections and AWS services with controlled mock objects.
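As an illustration of the testing approach, here is a minimal Pytest sketch using Moto's mock_aws decorator; the write_report helper, bucket name, and region are invented for the example rather than taken from our codebase.

```python
import boto3
from moto import mock_aws


def write_report(s3_client, bucket, key, body):
    """Hypothetical helper standing in for one of our Lambda utilities."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


@mock_aws
def test_write_report_puts_object_in_s3():
    # moto intercepts the boto3 calls, so the test touches no real AWS resources
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="test-ingestion-bucket",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )

    write_report(s3, "test-ingestion-bucket", "reports/demo.txt", b"hello")

    response = s3.get_object(Bucket="test-ingestion-bucket", Key="reports/demo.txt")
    assert response["Body"].read() == b"hello"
```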
Why SQLAlchemy?
We needed a reliable and flexible way to interact with our PostgreSQL database from AWS Lambda. SQLAlchemy was chosen because:
- Abstraction & Maintainability – It allows us to write database queries in a Pythonic way while keeping our code modular.
- Safe Query Handling – Using bound parameters (e.g., WHERE created_at > :last_extract_time) prevents SQL injection.
- Flexibility – We can write both raw SQL queries and ORM-based models if needed.
- Works with pg8000 – Since AWS Lambda has issues with some PostgreSQL drivers (like psycopg2 due to compiled dependencies), pg8000 is pure Python and lightweight, making it a better fit.
Alternative Considered: Using raw SQL with pg8000 alone – but SQLAlchemy improves readability, safety, and maintainability.
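A minimal sketch of how this looks in practice; the engine URL, helper name, and query are illustrative rather than copied from our extract Lambda:

```python
from sqlalchemy import create_engine, text

# Illustrative connection string only – not our real configuration.
engine = create_engine("postgresql+pg8000://user:password@host:5432/sales_db")


def extract_new_rows(table_name, last_extract_time):
    """Fetch rows added since the previous extract.

    The timestamp is passed as a bound parameter (:last_extract_time), so the
    value is never interpolated into the SQL string; table_name comes from our
    own fixed list of source tables, not from user input.
    """
    query = text(f"SELECT * FROM {table_name} WHERE created_at > :last_extract_time")
    with engine.connect() as conn:
        rows = conn.execute(query, {"last_extract_time": last_extract_time})
        return [dict(row._mapping) for row in rows]
```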
Why PyArrow?
When transforming and storing data, we needed a format that is optimized for analytics and efficiently handles large datasets. PyArrow was chosen because:
- Parquet Support – It natively supports Parquet, which is an optimal format for our data warehouse.
- High Performance – It’s faster than Pandas for processing large tables (columnar memory layout).
- Efficient Storage – Parquet is compressed and columnar, making it faster to read in analytics workloads.
- AWS-Compatible – Amazon Athena, Redshift, and other services prefer Parquet over JSON/CSV for querying.
Alternative Considered: CSV or JSON – But they are slower, take up more space, and are inefficient for analytical queries.
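A short sketch of the conversion step (the helper name is made up for illustration); the resulting bytes can then be uploaded with boto3's put_object:

```python
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def dataframe_to_parquet_bytes(df: pd.DataFrame) -> bytes:
    """Convert a transformed DataFrame into Parquet bytes ready for S3."""
    table = pa.Table.from_pandas(df)  # columnar Arrow table
    buffer = io.BytesIO()
    pq.write_table(table, buffer)     # compressed, columnar Parquet output
    return buffer.getvalue()
```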
Why Pandas?
When transforming raw sales data, we needed a tool to clean, manipulate, and structure the data before writing it to S3. Pandas was chosen because:
- DataFrame Operations – Easy filtering, grouping, and type conversion.
- Integrates with PyArrow – Allows seamless conversion to Parquet.
- Handles Missing Data – Pandas is great for dealing with nulls and inconsistencies in the source data.
- Well-Supported – A widely-used industry standard for Python data processing.
Alternative Considered: Dask (for parallel processing) – But our dataset is small enough that Pandas + PyArrow works efficiently.
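A hedged sketch of the kind of cleaning the transform step performs; the column names and rules here are placeholders rather than our actual warehouse schema:

```python
import pandas as pd


def clean_sales_rows(raw_rows: list[dict]) -> pd.DataFrame:
    """Illustrative cleaning step – column names and rules are placeholders."""
    df = pd.DataFrame(raw_rows)

    # Type conversion: make sure timestamps are real datetimes, not strings
    df["created_at"] = pd.to_datetime(df["created_at"])

    # Missing data: fill gaps in optional fields rather than dropping rows
    df["comments"] = df["comments"].fillna("")

    # Restructure: rename source columns to the warehouse's expected names
    return df.rename(columns={"staff_id": "sales_staff_id"})
```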
Why This Combination?
SQLAlchemy + pg8000 → Safely fetch data from PostgreSQL
Pandas → Clean, transform, and prepare structured data
PyArrow → Convert data efficiently and save as Parquet
This stack ensures our pipeline is fast, scalable, and cost-effective while avoiding unnecessary complexity.
Challenges Faced
- Pandas Size Limitation – Our Lambda function exceeded the AWS package size limit when including Pandas, so we switched to using AWS Managed Layers to keep deployments lightweight.
- Indexing in the Transform Step – Initially, we tried indexing columns in our Transform Lambda, but this made data retrieval from S3 too complex and increased processing time in the Load step. We simplified this by moving indexing to the Load function instead.
- Handling Empty Files – Our Extract function initially assumed that every table would always have new data, but in reality, updates happen randomly per table. We had to adapt our code to handle cases where some or all tables had no new data, ensuring empty files were still created as expected.
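One way to implement that last idea is to always write a file body, even when a table has no new rows; the payload shape and table name below are placeholders, not our exact file format:

```python
import json


def build_extract_payload(table_name, rows):
    """Always produce a file body, even for an empty extract, so the
    transform step never has to special-case a missing object.

    Payload shape and table names are illustrative only.
    """
    payload = {"table": table_name, "row_count": len(rows), "rows": rows}
    return json.dumps(payload, default=str)


# An empty extract still yields a valid, loadable object:
# build_extract_payload("design", [])
#   -> '{"table": "design", "row_count": 0, "rows": []}'
```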