Case study · Forthcoming

Why we built our own report ingestion instead of using a managed ETL

Vegapay · 2025~12 min read

This piece is a skeleton. Each TODO marks a paragraph Naman is writing. Numbers are placeholders until validated against production.

Context

TODO: Explain the business setup at Vegapay — credit card management for issuer clients, the reporting obligations that come with regulated financial data, and the specific category of files this pipeline handles (daily delinquency, settlement, charge-off, etc.). One paragraph, no more. Anchor the rest of the piece in why correctness and auditability dominate ergonomics here.

What existed before

TODO: Describe the failing state. The cron job, the shared SFTP user, the manual reconciliation step at month-end. What broke and how often. The specific incident that triggered the rebuild — name it (sanitized), date it, quantify it.

Considered alternatives

We did not jump to "build it ourselves" — that's the easy story to tell afterward and a hard sell up front. Three managed options were evaluated:

A managed ETL service. TODO: which one, what the POC showed, why it was rejected (latency? cost? data residency?).
Airflow on a managed runner. TODO: where this nearly worked and what specifically pushed back. Schedulers vs. event-driven is the heart of this discussion.
A thin internal service over the existing batch host. TODO: the cheapest option on paper; explain why "cheapest" misread the actual constraint.

The architecture

┌──────────┐ │ client │ │ SFTP │ └────┬─────┘ │ file drop ▼ ┌─────────────┐ │ watcher │ (Chronos schedule, debounced) └──────┬──────┘ │ publish ▼ ┌─────────────┐ │ Kafka │ │ (topic A) │ └──┬───────┬──┘ │ │ ▼ ▼ ┌──────┐ ┌────────┐ │ S3 │ │Postgres│ │ raw │ │ meta │ └──────┘ └────────┘

SFTP → ingestion service → Kafka → consumers → Postgres + S3

TODO: Walk through each component in 2–3 sentences. The choices to defend: why Chronos and not Quartz, why Kafka and not SQS, why raw to S3 and not Postgres bytea, why a separate metadata table at all.

Hard decisions

1. Event-driven vs. polled

TODO: The SFTP server doesn't push. Something polls. The decision was whether to poll from the ingestion service or to put a watcher next to the SFTP host. We chose poll-from-ingestion to keep the SFTP host stateless — describe the latency cost (60s p99) and why that was acceptable.

2. Exactly-once vs. at-least-once with idempotency

TODO: Why we landed on at-least-once with a content-hash key in Postgres, not Kafka transactions. Be honest about how this couples Postgres availability to ingestion correctness.

3. Schema-on-read vs. schema-on-write

TODO: The clients' file schemas drift quarterly. We kept raw bytes in S3 and built a thin parser layer that fails loudly on schema deviation but doesn't block ingestion. Explain why "fail loud, don't drop" was the right shape here.

Numbers

| Metric | Before | After | |---|---|---| | Time from file drop → queryable | TODO: ~45 min | TODO: ~90 s | | Manual reconciliation hours / month | TODO: ~16 | TODO: ~0 | | First-request p99 latency | TODO: 150 ms | 30 ms | | Unique alerts / week | TODO: 12 | TODO: 1 |

TODO: One paragraph after the table that explains which numbers matter and why. The latency win is the headline; the alert reduction is the durable one.

What I'd revisit

TODO: Two or three honest critiques. Candidates: (1) Chronos is fine for now but couples scheduling to deploys; (2) the content-hash idempotency key fails for legitimate corrections — a file revision with a typo fix gets deduped; (3) we kept everything in one Kafka topic for simplicity and that's already showing strain.

Footnotes

[1]
TODO: A note on file size distributions — most are under 1 MB, but a few clients drop 800 MB quarterlies and that changed the back-pressure design.
[2]
TODO: A note on alarms — why we route to PagerDuty for ingestion failures but only Slack for parse failures.