Case study · Forthcoming
Why we built our own report ingestion instead of using a managed ETL
This piece is a skeleton. Each
TODOmarks a paragraph Naman is writing. Numbers are placeholders until validated against production.
Context
TODO: Explain the business setup at Vegapay — credit card management for issuer clients, the reporting obligations that come with regulated financial data, and the specific category of files this pipeline handles (daily delinquency, settlement, charge-off, etc.). One paragraph, no more. Anchor the rest of the piece in why correctness and auditability dominate ergonomics here.
What existed before
TODO: Describe the failing state. The cron job, the shared SFTP user, the manual reconciliation step at month-end. What broke and how often. The specific incident that triggered the rebuild — name it (sanitized), date it, quantify it.
Considered alternatives
We did not jump to "build it ourselves" — that's the easy story to tell afterward and a hard sell up front. Three managed options were evaluated:
- A managed ETL service. TODO: which one, what the POC showed, why it was rejected (latency? cost? data residency?).
- Airflow on a managed runner. TODO: where this nearly worked and what specifically pushed back. Schedulers vs. event-driven is the heart of this discussion.
- A thin internal service over the existing batch host. TODO: the cheapest option on paper; explain why "cheapest" misread the actual constraint.
The architecture
TODO: Walk through each component in 2–3 sentences. The choices to defend: why Chronos and not Quartz, why Kafka and not SQS, why raw to S3 and not Postgres bytea, why a separate metadata table at all.
Hard decisions
1. Event-driven vs. polled
TODO: The SFTP server doesn't push. Something polls. The decision was whether to poll from the ingestion service or to put a watcher next to the SFTP host. We chose poll-from-ingestion to keep the SFTP host stateless — describe the latency cost (60s p99) and why that was acceptable.
2. Exactly-once vs. at-least-once with idempotency
TODO: Why we landed on at-least-once with a content-hash key in Postgres, not Kafka transactions. Be honest about how this couples Postgres availability to ingestion correctness.
3. Schema-on-read vs. schema-on-write
TODO: The clients' file schemas drift quarterly. We kept raw bytes in S3 and built a thin parser layer that fails loudly on schema deviation but doesn't block ingestion. Explain why "fail loud, don't drop" was the right shape here.
Numbers
| Metric | Before | After | |---|---|---| | Time from file drop → queryable | TODO: ~45 min | TODO: ~90 s | | Manual reconciliation hours / month | TODO: ~16 | TODO: ~0 | | First-request p99 latency | TODO: 150 ms | 30 ms | | Unique alerts / week | TODO: 12 | TODO: 1 |
TODO: One paragraph after the table that explains which numbers matter and why. The latency win is the headline; the alert reduction is the durable one.
What I'd revisit
TODO: Two or three honest critiques. Candidates: (1) Chronos is fine for now but couples scheduling to deploys; (2) the content-hash idempotency key fails for legitimate corrections — a file revision with a typo fix gets deduped; (3) we kept everything in one Kafka topic for simplicity and that's already showing strain.