GovTech
600K+ Files Processed
Voter Data ETL Pipeline
The Challenge:
Process 600,000+ scanned PDFs containing 3-column images that failed standard OCR.
The Engineering:
Built a Google Cloud Function that crops each page image into its three columns and stacks them vertically into a single column that OCR can read. Parsed the OCR output with plain Bash/Awk scripts and ran vernacular names through a custom transliteration engine.
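The column-stacking step can be sketched as follows. This is a minimal illustration, not the production Cloud Function: it assumes equal-width columns, represents the page as a plain list of pixel rows instead of using an image library, and the name `stack_columns` is hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values (assumed representation)


def stack_columns(image: Image, n_columns: int = 3, background: int = 255) -> Image:
    """Cut an image into equal-width vertical strips and stack them top to bottom."""
    if not image or n_columns < 1:
        raise ValueError("need a non-empty image and at least one column")
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")

    strip_width = -(-width // n_columns)  # ceiling division
    stacked: Image = []
    for i in range(n_columns):
        left = i * strip_width
        right = min(left + strip_width, width)
        for row in image:
            strip = row[left:right]
            # Pad a narrower final strip so every output row has the same width.
            stacked.append(strip + [background] * (strip_width - len(strip)))
    return stacked
```

The output is a single tall column, so an OCR engine sees the three original columns in reading order rather than interleaved line by line.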
Power Stack
Python
Bash
Google Cloud
MySQL
Analytics / Dashboards
10M+ Rows Analyzed
Global Pandemic Readiness Index
The Challenge:
Build a real-time rating system for 184 countries by aggregating data from 100+ large Excel sheets and disparate live APIs (Johns Hopkins, Oxford), then scoring each country across 16 dynamic policy dimensions.
The Engineering:
Developed an automated Python ETL pipeline, scheduled with cron, that fetches daily GitHub CSVs and API data. A scoring engine normalizes these diverse inputs into a "Travel Readiness Index" and pushes clean, structured data to a MySQL-backed visualization dashboard.
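One way such a scoring engine could normalize mixed inputs is sketched below. The source does not describe the actual formula, so everything here is an assumption: min-max scaling per dimension, equal weights, a 0-100 scale, and the hypothetical names `min_max_normalize` and `readiness_index`.

```python
from typing import Dict

Scores = Dict[str, float]  # country code -> value


def min_max_normalize(values: Scores, higher_is_better: bool = True) -> Scores:
    """Rescale one dimension to the 0..1 range across all countries."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # No spread in this dimension: give every country a neutral score.
        return {country: 0.5 for country in values}
    scaled = {}
    for country, value in values.items():
        score = (value - lo) / (hi - lo)
        scaled[country] = score if higher_is_better else 1.0 - score
    return scaled


def readiness_index(raw: Dict[str, Scores], higher_is_better: Dict[str, bool]) -> Scores:
    """Average the normalized dimensions into a 0..100 index (equal weights assumed)."""
    normalized = {
        dim: min_max_normalize(values, higher_is_better[dim])
        for dim, values in raw.items()
    }
    # Only score countries that have a value in every dimension.
    countries = set.intersection(*(set(values) for values in raw.values()))
    return {
        country: round(100 * sum(normalized[dim][country] for dim in raw) / len(raw), 1)
        for country in sorted(countries)
    }
```

Flipping the score for "lower is better" dimensions (for example, case rates) keeps every dimension pointing the same way before averaging.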
Power Stack
Python
MySQL
Cron
Pandas
Google Cloud
Social Research / Big Data
1TB+ Data Harvested
Distributed Public Data Ingestion System
The Challenge:
Aggregate 1TB of fragmented public records from legacy government servers. The source servers were low-bandwidth, unstable, and implemented aggressive rate-limiting, making standard extraction impossible.
The Engineering:
We architected a distributed scraping cluster using rotating proxies and intelligent throttling. The system managed 50+ concurrent workers via Redis queues and recovered from server timeouts gracefully, achieving 99.9% data completeness without overloading the source servers.
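The timeout handling can be illustrated with a retry loop that backs off exponentially. This is a sketch under stated assumptions, not the cluster's actual code: the Redis queue and proxy rotation are omitted, and `fetch_with_backoff`, its parameters, and the injected `fetch` callable are hypothetical.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on timeouts with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller requeue or log the URL
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Doubling the pause after each failure gives a struggling server progressively more room to recover, which is the general idea behind throttling a fragile source.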
Power Stack
Python
Google Cloud
MySQL
GovTech / Non-Profit
100K+ Daily Records Synced
IoT-Based Application Status Sync
The Challenge:
The client had submitted 100,000+ applications for government benefits on behalf of citizens. Manually checking the status of each application on the central portal (NVSP) was impractical, and cloud server costs for scraping were prohibitive for a non-profit.
The Engineering:
We developed a lightweight, headless automation script designed to run on low-cost edge hardware (a Raspberry Pi). It uses regular expressions for fast text parsing and runs as a nightly cron job, syncing 100,000 status updates without expensive cloud infrastructure.
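Regex-based status extraction of this kind might look like the sketch below. The page layout is invented for illustration: the `Reference ID:` and `Status:` labels, the pattern `ROW_RE`, and the function `parse_statuses` are all assumptions, not the portal's real markup.

```python
import re
from typing import Dict

# Hypothetical layout: a "Reference ID: <id>" label followed later by "Status: <text>".
ROW_RE = re.compile(
    r"Reference\s+ID\s*:\s*(?P<ref>[A-Z0-9]+)"
    r".*?"
    r"Status\s*:\s*(?P<status>[A-Za-z][A-Za-z ]*[A-Za-z])",
    re.DOTALL,
)


def parse_statuses(page: str) -> Dict[str, str]:
    """Return {reference_id: status} for every record found in the page text."""
    return {m.group("ref"): m.group("status") for m in ROW_RE.finditer(page)}
```

A compiled pattern applied directly to the response text avoids building a full DOM, which keeps memory and CPU use low on a small device.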
Power Stack
Python
Bash
Raspberry Pi (Edge)
Regex
Linux
Public Relations / Large Scale Events
Sub-second Face Matching on 50K+ Photos
AI-Powered Face Search for Mass Event
The Challenge:
During a massive, nationwide public awareness march, hundreds of photos were uploaded daily. Participants struggled to find their specific pictures among thousands of unsorted uploads. Scrolling through endless cloud folders was inefficient and led to a poor user experience.
The Engineering:
We built a custom face recognition engine using Dlib and facial landmark detection. Users upload a selfie, and the system scans the entire event database for matches. A "Sensitivity Slider" lets users adjust how strict the match is.
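The matching step behind a sensitivity slider can be sketched as a distance threshold over face embeddings. This is an illustration only: it assumes embeddings have already been computed (Dlib's face recognition model produces 128-dimensional descriptors, typically compared by Euclidean distance with a threshold around 0.6), and the names `find_matches` and `tolerance` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def find_matches(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    tolerance: float = 0.6,
) -> List[str]:
    """Return photo IDs whose embedding lies within `tolerance` of the query, closest first."""
    scored = [(math.dist(query, embedding), photo_id) for photo_id, embedding in gallery.items()]
    return [photo_id for distance, photo_id in sorted(scored) if distance <= tolerance]
```

In this sketch the slider simply maps to `tolerance`: a lower value returns fewer, more certain matches, while a higher value returns more photos at the cost of occasional false positives.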
Power Stack
Python
Dlib (C++ Library)
OpenCV
Django
Apache