What I Learned Building AI Features Into a Fintech Product

5 min read
ai · product-engineering · fintech · llm

Building AI features at a fintech infrastructure company taught me that the hard part isn't the model. It's defining quality, handling failures gracefully, and resisting the urge to ship a demo as a product.

Three weeks ago, we shipped an AI-powered transaction categorization feature for a fintech infrastructure company. The demo took two days to build. Getting it production-ready took six weeks. That ratio tells you everything about AI feature development.

The demo was impressive. You paste in a batch of transactions, the model categorizes them, and the output looks clean. The CEO loved it. The PM loved it. I loved it too, right up until we started testing edge cases.

A wire transfer labeled “SEPA CT REF-8847291 ACME GMBH” landed in “Entertainment.” A recurring subscription payment got labeled differently every time we ran it. And the model confidently categorized a clearly fraudulent transaction as “Regular business expense” without any hesitation.

This is the gap between demo and product, and it’s where most AI features die.

The thing nobody talks about: defining “good”

Before writing any production code, I made the team answer one question: what does a correct categorization look like, and what happens when it’s wrong?

For traditional features, this is obvious. A button either works or it doesn’t. A calculation is either right or wrong. For AI features, “right” is fuzzy. Is 85% accuracy good enough? Depends. If you’re categorizing expense reports for internal review, maybe. If you’re categorizing transactions for regulatory reporting, absolutely not.

We built an eval set of 200 transactions with hand-labeled categories. Not exciting work. Took the team three days. But that eval set became the foundation for every decision that followed: prompt changes, model selection, fallback logic, launch criteria.
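To make this concrete, here's a minimal sketch of what an eval harness like ours can look like. The transaction descriptions, category names, and function names below are illustrative, not our actual labels:

```python
# Minimal eval harness: hand-labeled examples plus an accuracy check.
# Examples and categories are illustrative stand-ins.

EVAL_SET = [
    {"description": "SEPA CT REF-8847291 ACME GMBH", "label": "Supplier payment"},
    {"description": "SPOTIFY AB STOCKHOLM", "label": "Software & subscriptions"},
    {"description": "DELTA AIR 0062 ATLANTA", "label": "Travel"},
    # ...in practice, ~200 hand-labeled transactions
]

def evaluate(categorize, eval_set):
    """Return the accuracy of a categorizer (any callable) on the eval set."""
    correct = sum(categorize(ex["description"]) == ex["label"] for ex in eval_set)
    return correct / len(eval_set)
```

The point of the callable interface is that the same harness scores every candidate: a prompt variant, a different model, or the rules-based fallback.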

The rule I enforce now: if you can’t write down what “good” looks like in concrete examples before you start building, you’re not ready to build.

Architecture for uncertainty

AI features sit inside the same product architecture as everything else. But they need extra layers to manage the fundamental uncertainty of probabilistic output. Here’s what the production feature stack actually looked like:

Input validation. Transactions go through a normalizer before the model sees them. Strip reference codes, standardize currency formats, expand abbreviations. The cleaner the input, the more consistent the output. This sounds boring. It improved accuracy by 8 points.
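A normalizer along those lines can be sketched in a few lines. The regex and abbreviation table here are assumptions for illustration; the real pipeline handles many more cases:

```python
import re

# Sketch of the input normalizer: strip reference codes, collapse
# whitespace, expand common abbreviations. Table is illustrative.
ABBREVIATIONS = {"CT": "CREDIT TRANSFER"}

def normalize(description: str) -> str:
    text = description.upper().strip()
    text = re.sub(r"\bREF-?\d+\b", "", text)   # drop opaque reference codes
    text = re.sub(r"\s+", " ", text).strip()   # collapse leftover gaps
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)                     # expand abbreviations
```

So `"SEPA CT REF-8847291 ACME GMBH"` comes out as `"SEPA CREDIT TRANSFER ACME GMBH"`, which gives the model a consistent surface to categorize.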

The model call. GPT-3.5-turbo with a tight prompt, structured JSON output, and a confidence score. We tried GPT-4 initially: better accuracy, but 10x the cost and 3x the latency. For this use case, 3.5-turbo plus good input normalization was the better trade-off.

Output validation. Every categorization gets checked against the valid category list. If the model returns a category that doesn’t exist (and it does, about 2% of the time), we fall back. If the confidence score is below our threshold, we fall back.
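The validation gate itself is small. A sketch, assuming the model returns a parsed dict with `category` and `confidence` fields (the category list and threshold below are placeholders):

```python
# Assumed category list and threshold; tune the threshold against the eval set.
VALID_CATEGORIES = {"Travel", "Payroll", "Supplier payment",
                    "Software & subscriptions", "Office & equipment"}
CONFIDENCE_THRESHOLD = 0.7

def accept(model_output: dict):
    """Return the model's category if it passes both checks.
    None means: ignore the model and use the fallback."""
    category = model_output.get("category")
    if category not in VALID_CATEGORIES:  # model invented a category
        return None
    if model_output.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return None
    return category
```

Returning `None` rather than raising keeps the decision explicit: the caller always has exactly one question to answer, model result or fallback.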

The fallback. This is the part most teams skip. Our fallback is a rules-based categorizer that handles ~40% of transactions using keyword matching and counterparty lookup. It’s not as good as the model, but it’s deterministic and always available. When the model is uncertain, the user gets the rules-based result with a flag saying “review suggested.”
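The fallback's shape, sketched with toy rule tables (the real counterparty and keyword lists are far larger):

```python
# Illustrative rules; production tables would be much larger.
COUNTERPARTIES = {"ACME GMBH": "Supplier payment"}
KEYWORD_RULES = [
    (("HOTEL", "AIRLINE", " AIR "), "Travel"),
    (("SPOTIFY", "GITHUB", "AWS"), "Software & subscriptions"),
]

def rules_categorize(description: str):
    """Deterministic fallback: counterparty lookup first, then keywords.
    Returns (category, flag); category is None when no rule matches."""
    text = f" {description.upper()} "
    for name, category in COUNTERPARTIES.items():
        if name in text:
            return category, "review suggested"
    for keywords, category in KEYWORD_RULES:
        if any(k in text for k in keywords):
            return category, "review suggested"
    return None, "review suggested"
```

Nothing clever here, which is exactly the point: it behaves identically every time, and it still works when the model API is down.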

Feedback loop. Users can correct categorizations. Those corrections feed back into our eval set and, eventually, into prompt improvements. This is the part that compounds over time.

Testing probabilistic systems

Unit tests cover the deterministic parts: input normalization, output validation, the rules-based fallback. Those work exactly like traditional testing.

Model behavior gets tested differently. We run the full eval set (200 transactions) on every prompt change and daily against the live model. The metrics we track:

  • Accuracy against our labeled set (target: >88%)
  • Category distribution (catches when the model starts favoring certain categories)
  • Confidence score distribution (catches when the model becomes less certain overall)
  • Fallback rate (how often we’re bypassing the model)

We don’t assert on individual outputs. That’s a trap. The same input might get slightly different wording each time, and that’s fine as long as the category is right. We assert on aggregate metrics across the eval set.
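Those aggregate checks can be sketched like this, assuming each eval run produces a list of per-transaction records (the schema below is an assumption for illustration):

```python
from collections import Counter

def aggregate_metrics(results):
    """results: per-transaction dicts with keys 'predicted', 'expected',
    'confidence', 'used_fallback' (an assumed schema)."""
    n = len(results)
    return {
        "accuracy": sum(r["predicted"] == r["expected"] for r in results) / n,
        "category_distribution": Counter(r["predicted"] for r in results),
        "mean_confidence": sum(r["confidence"] for r in results) / n,
        "fallback_rate": sum(r["used_fallback"] for r in results) / n,
    }
```

The CI assertion then targets the aggregate, e.g. `assert aggregate_metrics(run)["accuracy"] > 0.88`, never any single transaction's output.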

One thing that bit us: OpenAI changed something in the model (not a version bump, just inference-time behavior) and our accuracy dropped 3 points overnight. We caught it because we run the eval set daily. Teams that don’t do this are flying blind.

Launching without embarrassment

We launched to 5% of users first. Internal users, specifically the finance team. They’re the harshest critics and the most forgiving audience: they understand the constraints and give precise feedback.

Two things came out of the soft launch that we hadn’t anticipated:

  1. Multi-currency transactions confused the model more than we expected. A USD payment from a EUR account got categorized based on the currency, not the merchant. We added currency normalization to the input pipeline.

  2. Users didn’t trust the output even when it was correct. Adding the confidence score as a visual indicator (“high confidence” / “review suggested”) dramatically improved trust. People are fine with AI output when they know the system is honest about its uncertainty.

After two weeks with the finance team, we expanded to 50%, then 100%. The feedback loop was running, the eval metrics were stable, and the fallback was handling edge cases gracefully.

What I’d tell another team

Define quality before you write code. Build the eval set first. It’s tedious, and it’s the most important thing you’ll do.

Design the fallback before the happy path. What happens when the model is wrong or unavailable? If that experience is terrible, your feature is fragile. If it’s graceful, you can ship confidently.

Instrument everything. Not just errors and latency. Quality metrics, confidence distributions, fallback rates, user corrections. You need to see the model’s behavior as a continuous signal, not a binary pass/fail.

Resist the demo. The demo is a trap. It shows the best case. Production shows the average case and every edge case you didn’t think of. Ship the demo only after you’ve built the scaffolding that makes the average case acceptable and the worst case survivable.

AI features are software. Treat them that way, with tests, rollout controls, monitoring, and healthy skepticism. The model is the least interesting part of the system. The interesting part is everything around it.