A reporting service for regulated credit data
TODO: One-paragraph narrative TL;DR. The business problem in one sentence, the technical shape of the answer in one sentence, the outcome that matters in one sentence. Aim for what a CTO would forward to a colleague.
Context
TODO: One paragraph on the business setup — credit card management for issuer clients, the reporting obligations that come with regulated financial data, and the file category this pipeline handles. Anchor the rest of the piece in why correctness and auditability dominate ergonomics here.
What existed before
TODO: The failing state. A cron job, a shared SFTP user, a manual reconciliation step at month-end. What broke and how often. The specific incident that triggered the rebuild — sanitized, dated, quantified.
Alternatives considered
TODO: Don't open with "we built it ourselves." Walk through the managed options first:
- A managed ETL service. TODO: which one, POC outcome, why rejected.
- Airflow on a managed runner. TODO: where it nearly worked; why event-driven won.
- A thin internal service on the batch host. TODO: cheapest on paper; why "cheapest" misread the constraint.
The architecture
TODO: 2–3 sentences per component. Decisions to defend: Chronos vs Quartz, Kafka vs SQS, raw to S3 vs Postgres bytea, separate metadata table at all.
Hard decisions
1. Event-driven vs. polled
TODO: SFTP doesn't push. Something polls. We chose poll-from-ingestion to keep the SFTP host stateless. Note the latency cost (60s p99) and why it was acceptable.
2. Exactly-once vs. at-least-once with idempotency
TODO: Why we landed on at-least-once with a content-hash key in Postgres, not Kafka transactions. Be honest about how this couples Postgres availability to ingestion correctness.
3. Schema-on-read vs. schema-on-write
TODO: Client file schemas drift quarterly. Raw bytes in S3 + a thin parser that fails loud on deviation but doesn't block ingestion. Why "fail loud, don't drop" was the right shape.
Numbers
| Metric | Before | After | |---|---|---| | Time from file drop → queryable | TODO | TODO | | Manual reconciliation hours / month | TODO | ~0 | | First-request p99 latency | TODO | 30 ms | | Unique alerts / week | TODO | TODO |
TODO: One paragraph after the table on which numbers matter and why. Latency is the headline; alert reduction is the durable win.
What I'd revisit
TODO: 2–3 honest critiques. Candidates: Chronos couples scheduling to deploys; the content-hash idempotency breaks for legitimate corrections; single-topic design is already showing strain.