Multimodal AI: Five Use Cases That Actually Work (and Three That Do Not)

| 5 min read |
ai multimodal gpt-4v vision

GPT-4V is out and everyone is building vision features. After testing it across real workflows, here is what ships well and what falls apart.

Quick take

Vision-capable models are legitimately useful for document extraction, UI review, and accessibility. They’re unreliable for precise measurements, tiny text, and anything that requires counting. Treat it like a smart intern who’s great at describing what they see but bad at details. Build for uncertainty, validate outputs, and keep a fallback path.

GPT-4V dropped and my first reaction was to throw every image I could find at it. Receipts. Architecture diagrams. Screenshots. Photos of whiteboards from meetings. The results ranged from “holy shit, this actually works” to “that’s confidently wrong in a way that would cost money.”

After a few weeks of serious testing, I have a clearer picture of where multimodal AI is ready for production and where it will get you in trouble.

What Actually Ships

1. Invoice and Receipt Extraction

This is the killer use case for my team. I work at a fintech company, and we process financial documents all day. Extracting vendor name, amount, date, and line items from a photo of a receipt used to require a dedicated OCR pipeline, post-processing rules, and a prayer. Now I send the image to GPT-4V with a structured prompt and get JSON back.

Analyze this invoice image. Return JSON with these fields:
- vendor_name (string)
- total_amount (string, include currency)
- invoice_date (string, YYYY-MM-DD)
- line_items (array of {description, amount})
If a field is not visible, return null.

Hit rate on clean documents is around 90%. On crumpled receipts with bad lighting, it drops to maybe 65%. Good enough for a first pass with human review on low-confidence results.
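For reference, here is roughly what the call looks like. This is a sketch, not a drop-in client: `build_receipt_request` is my own hypothetical helper, and the message shape follows the vision format OpenAI documented at GPT-4V launch (base64 data URL plus a `detail` setting) — check your SDK version before copying it.

```python
import base64

# The structured prompt from above, verbatim.
EXTRACTION_PROMPT = """Analyze this invoice image. Return JSON with these fields:
- vendor_name (string)
- total_amount (string, include currency)
- invoice_date (string, YYYY-MM-DD)
- line_items (array of {description, amount})
If a field is not visible, return null."""

def build_receipt_request(image_bytes: bytes, detail: str = "high") -> dict:
    """Build the Chat Completions payload for one receipt image.
    Hypothetical helper; pass the result to your OpenAI client."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",
        "max_tokens": 500,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                               "detail": detail}},
            ],
        }],
    }

payload = build_receipt_request(b"\xff\xd8fake-jpeg-bytes")
```

The interesting part isn't the plumbing, it's the prompt: spelling out the exact fields and the null rule is what makes the output parseable.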

2. UI Review

I started using it to review screenshots of our admin dashboards. “List any layout issues, missing states, or accessibility concerns in this screenshot.” The results aren’t comprehensive, but they catch obvious problems – misaligned elements, missing error states, low contrast text – faster than a manual review pass.

3. Accessibility

Alt text generation. Genuinely good at this. Feed it a product image or a chart and ask for a concise description. The output is usually better than what most developers write manually, which is a low bar, but still.

4. Architecture Diagram Interpretation

This one surprised me. I photographed a whiteboard diagram from a system design session and asked the model to describe the components and data flow. It got the high-level architecture right. Not perfect on every label, but the structure was correct. Useful for converting whiteboard photos into documentation drafts.

5. Visual Anomaly Detection

For predictable environments – “does this photo show the expected setup?” – the model is decent at spotting obvious differences. Missing components, wrong configurations, visible damage. It works best when you can describe what “normal” looks like and ask the model to flag deviations.

What Doesn’t Work (Yet)

Counting

Ask it to count items in a busy image. Watch it fail. It will confidently give you a number that’s wrong. Small objects, overlapping items, dense arrangements – the model can’t reliably count. Don’t build features that depend on this.

Precise Measurements

“How far apart are these two components?” The model doesn’t do spatial precision. It can tell you something is “on the left” or “near the top” but asking for millimeter-level accuracy is asking for trouble.

Tiny or Low-Quality Text

Faded labels, handwritten notes in bad lighting, text smaller than about 10px on a screenshot – all unreliable. The model will either skip the text entirely or hallucinate plausible content. This is the failure mode that scares me most because it’s indistinguishable from correct output unless you verify.

The Cost Problem

Vision calls are expensive. A single image analysis costs roughly 10-20x what a text-only call costs, depending on image size and detail level. At scale, this adds up fast.
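Where does that multiplier come from? A sketch of the tile-based accounting OpenAI documented for GPT-4V at launch — low detail is a flat 85 tokens, high detail scales the image and charges per 512px tile. Treat the exact numbers as a snapshot; pricing details change.

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough input-token cost for one image under GPT-4V's documented
    accounting: low detail is a flat 85 tokens; high detail scales the
    image, cuts it into 512px tiles, and charges 170 tokens per tile
    plus an 85-token base."""
    if detail == "low":
        return 85
    # Scale down to fit within 2048x2048, then shortest side down to 768.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

A 1024x1024 photo at high detail lands at 765 tokens before you've asked a single question, which is why the rules below all amount to "send fewer pixels."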

My rules:

  • Resize aggressively. Crop to the region of interest. A full-resolution photo of a receipt when all you need is the total amount is wasting tokens and money.
  • Use low detail mode for simple tasks. GPT-4V supports a detail parameter. Use “low” for tasks like “is there text in this image?” and “high” only when you need it.
  • Cache everything. Same image, same question, same answer. Don’t re-process.
  • Batch when possible. Multiple questions about the same image should be a single API call, not five separate ones.
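The caching rule is trivial to implement and pays for itself immediately. A minimal sketch — an in-memory dict keyed on a hash of the image bytes plus the question; in production you'd back this with Redis or SQLite, but the shape is the same:

```python
import hashlib

class VisionCache:
    """Memoize (image, question) pairs so the same analysis is never
    paid for twice. analyze_fn is whatever function makes the actual
    vision API call."""

    def __init__(self, analyze_fn):
        self._analyze = analyze_fn
        self._store = {}
        self.calls = 0  # how many times we actually hit the API

    def ask(self, image_bytes: bytes, question: str) -> str:
        key = hashlib.sha256(
            image_bytes + b"\x00" + question.encode("utf-8")
        ).hexdigest()
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._analyze(image_bytes, question)
        return self._store[key]
```

Hash the raw bytes, not the filename — the same receipt gets uploaded under different names more often than you'd think.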

Building for Uncertainty

The single most important design principle: assume the model will be wrong sometimes, and build your product flow to handle it gracefully.

For our document extraction pipeline, every result goes through a confidence check. If any field comes back null or if the extracted amount doesn't parse as a valid number, it routes to human review. The model handles the easy 70-80% automatically. Humans handle the rest. The total cost is still lower than having humans process everything manually.
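The routing logic itself is a few lines. A sketch under the field names from the extraction prompt earlier — deliberately conservative, since a false "auto" costs far more than a false "human_review":

```python
import re

REQUIRED_FIELDS = ("vendor_name", "total_amount", "invoice_date")

def route(extraction: dict) -> str:
    """Return 'auto' or 'human_review' for one extracted invoice.
    Any missing required field, or an amount that doesn't parse as a
    plain decimal number, goes to a human."""
    if any(extraction.get(f) is None for f in REQUIRED_FIELDS):
        return "human_review"
    # Strip currency symbols and commas, then require digits with at
    # most two decimal places.
    amount = re.sub(r"[^\d.]", "", str(extraction["total_amount"]))
    if not re.fullmatch(r"\d+(\.\d{1,2})?", amount):
        return "human_review"
    return "auto"
```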

Ask the model to cite visible evidence. “What text did you read to determine the vendor name?” If it can’t point to specific text in the image, the answer is probably a hallucination.

Keep an OCR fallback for critical text extraction. The vision model is better at understanding context. Traditional OCR is better at reading exact characters. Use both.
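The cross-check can be as simple as asking whether the value the vision model extracted actually appears in the raw OCR dump (from tesseract or whatever engine you keep around). A sketch — `agrees` is my own hypothetical helper, and a mismatch doesn't prove the model is wrong, but it's a cheap hallucination tripwire:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop everything but letters and digits, so the
    comparison ignores spacing and punctuation differences."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def agrees(vision_value: str, ocr_text: str) -> bool:
    """True if the vision model's extracted value also shows up in the
    OCR engine's raw output for the same image."""
    return normalize(vision_value) in normalize(ocr_text)
```

Anything that fails the check joins the human-review queue alongside the low-confidence extractions.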

Multimodal AI isn’t magic. It’s a new tool with a specific reliability profile. Know where it’s strong, know where it fails, and design your system to handle both. That’s the boring answer. It’s also the right one.