Case study · Vegapay · 2025

A reporting service for regulated credit data

Architect, Backend · 2025 · ~12 min read

↓80% p99 latency · 30 ms first-request · 0 manual recon / mo · In production

TODO: One-paragraph narrative TL;DR. The business problem in one sentence, the technical shape of the answer in one sentence, the outcome that matters in one sentence. Aim for what a CTO would forward to a colleague.

Context

TODO: One paragraph on the business setup — credit card management for issuer clients, the reporting obligations that come with regulated financial data, and the file category this pipeline handles. Anchor the rest of the piece in why correctness and auditability dominate ergonomics here.

What existed before

TODO: The failing state. A cron job, a shared SFTP user, a manual reconciliation step at month-end. What broke and how often. The specific incident that triggered the rebuild — sanitized, dated, quantified.

Alternatives considered

TODO: Don't open with "we built it ourselves." Walk through the managed options first:

A managed ETL service. TODO: which one, POC outcome, why rejected.
Airflow on a managed runner. TODO: where it nearly worked; why event-driven won.
A thin internal service on the batch host. TODO: cheapest on paper; why "cheapest" misread the constraint.

The architecture

      ┌──────────┐
      │  client  │
      │   SFTP   │
      └────┬─────┘
           │ file drop
           ▼
    ┌─────────────┐
    │   watcher   │  (Chronos schedule, debounced)
    └──────┬──────┘
           │ publish
           ▼
    ┌─────────────┐
    │    Kafka    │
    │  (topic A)  │
    └──┬───────┬──┘
       │       │
       ▼       ▼
    ┌──────┐ ┌────────┐
    │  S3  │ │Postgres│
    │ raw  │ │  meta  │
    └──────┘ └────────┘

SFTP → ingestion → Kafka → consumers → Postgres + S3

TODO: 2–3 sentences per component. Decisions to defend: Chronos vs Quartz, Kafka vs SQS, raw to S3 vs Postgres bytea, separate metadata table at all.

Hard decisions

1. Event-driven vs. polled

TODO: SFTP doesn't push. Something polls. We chose poll-from-ingestion to keep the SFTP host stateless. Note the latency cost (60s p99) and why it was acceptable.
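One way to picture the debounced poll, as a minimal sketch rather than the production watcher: the ingestion side lists the SFTP directory on a schedule and only publishes a file once its size has stayed the same for a set number of consecutive polls. The `Debouncer` class, the `stable_polls` threshold, and the `{path: size}` listing shape are all assumptions made for illustration; none of them come from the actual service.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Debouncer:
    """Hypothetical helper: releases a path once its size is stable across polls."""

    stable_polls: int = 2  # assumed threshold, not a production value
    # path -> (last observed size, consecutive polls at that size)
    _seen: dict = field(default_factory=dict)
    _published: set = field(default_factory=set)

    def poll(self, listing: dict) -> list:
        """Take one directory listing ({path: size}) and return paths ready to publish."""
        ready = []
        for path, size in sorted(listing.items()):
            if path in self._published:
                continue  # already handed to the publisher once
            last_size: Optional[int]
            last_size, count = self._seen.get(path, (None, 0))
            # Reset the counter whenever the size changes (upload still in progress).
            count = count + 1 if size == last_size else 1
            self._seen[path] = (size, count)
            if count >= self.stable_polls:
                self._published.add(path)
                ready.append(path)
        return ready
```

Because all state lives in the poller, the SFTP host itself keeps no bookkeeping, which is the property the poll-from-ingestion choice is meant to protect. The cost is that a file waits at least `stable_polls` polling intervals before it is published.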

2. Exactly-once vs. at-least-once with idempotency

TODO: Why we landed on at-least-once with a content-hash key in Postgres, not Kafka transactions. Be honest about how this couples Postgres availability to ingestion correctness.
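A minimal sketch of at-least-once delivery made safe by a content-hash key, under stated assumptions: SQLite stands in for Postgres so the example runs on its own, and the table name `ingested_files`, its columns, and the function names are hypothetical. In Postgres the same idea would typically be written as `INSERT ... ON CONFLICT DO NOTHING` against a unique key.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the metadata database (assumption: SQLite, not Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ingested_files ("
        " content_hash TEXT PRIMARY KEY,"
        " source_path TEXT NOT NULL)"
    )
    return conn


def record_once(conn: sqlite3.Connection, source_path: str, payload: bytes) -> bool:
    """Return True the first time these bytes are seen, False on any redelivery."""
    digest = hashlib.sha256(payload).hexdigest()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO ingested_files (content_hash, source_path)"
            " VALUES (?, ?)",
            (digest, source_path),
        )
    # rowcount is 1 when the row was inserted, 0 when the primary key already existed.
    return cur.rowcount == 1
```

A consumer would call `record_once` before doing any downstream work and skip the message when it returns `False`. The sketch also makes the coupling visible: if the database is unreachable, the consumer cannot tell a duplicate from a new file, so ingestion has to stop or retry rather than guess.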

3. Schema-on-read vs. schema-on-write

TODO: Client file schemas drift quarterly. Raw bytes in S3 + a thin parser that fails loud on deviation but doesn't block ingestion. Why "fail loud, don't drop" was the right shape.
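The "fail loud, don't drop" shape can be sketched roughly as follows: raw bytes are stored before any parsing, and a schema deviation is recorded as a failure without discarding the file. Everything named here is an assumption for illustration: the CSV format, the `EXPECTED_HEADER` columns, the in-memory dicts standing in for S3 and the parsed store, and the `parse_failures` list standing in for an alert.

```python
import csv
import io
from dataclasses import dataclass, field

EXPECTED_HEADER = ["account_id", "balance"]  # hypothetical schema


class SchemaDeviation(Exception):
    """Raised when a file does not match the expected layout."""


def parse_strict(payload: bytes) -> list:
    """Parse a CSV payload, raising on any header or row-width deviation."""
    reader = csv.reader(io.StringIO(payload.decode("utf-8")))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise SchemaDeviation(f"expected {EXPECTED_HEADER}, got {header}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise SchemaDeviation(f"line {line_no}: expected {len(header)} fields")
        rows.append(dict(zip(header, row)))
    return rows


@dataclass
class Pipeline:
    raw_store: dict = field(default_factory=dict)       # stand-in for S3
    parsed: dict = field(default_factory=dict)          # stand-in for queryable rows
    parse_failures: list = field(default_factory=list)  # stand-in for an alert

    def ingest(self, key: str, payload: bytes) -> bool:
        self.raw_store[key] = payload  # raw bytes are kept no matter what happens next
        try:
            self.parsed[key] = parse_strict(payload)
        except (SchemaDeviation, UnicodeDecodeError) as exc:
            self.parse_failures.append((key, str(exc)))  # loud, but not a drop
            return False
        return True
```

Because the raw copy is written first, a parser fix can be replayed over the stored bytes later instead of asking the client to resend the file.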

Numbers

| Metric | Before | After |
|---|---|---|
| Time from file drop → queryable | TODO | TODO |
| Manual reconciliation hours / month | TODO | ~0 |
| First-request p99 latency | TODO | 30 ms |
| Unique alerts / week | TODO | TODO |

TODO: One paragraph after the table on which numbers matter and why. Latency is the headline; alert reduction is the durable win.

What I'd revisit

TODO: 2–3 honest critiques. Candidates: Chronos couples scheduling to deploys; the content-hash idempotency breaks for legitimate corrections; single-topic design is already showing strain.

Footnotes

[1] TODO: File-size distribution note. Most under 1 MB; some clients drop 800 MB quarterlies. How that changed back-pressure design.
[2] TODO: Alarm routing: PagerDuty for ingestion failures, Slack for parse failures, and why.