Wiring GPT-4o-mini into a repair dispatch platform
How FixFlow turns a tenant's paragraph ("the dishwasher is making a noise and also leaking") into a structured work order with priority, equipment type, and parts likelihood, without the model fabricating unit numbers.
When I started wiring GPT-4o-mini into FixFlow, I thought the job was prompt engineering. It wasn't. The real job is deciding which of the model's very confident outputs to trust, which to double-check, and which to throw away.
FixFlow is a multi-tenant intake platform. Property-management companies get a branded portal where tenants submit appliance-repair tickets. The output has to be structured (priority, equipment type, likely parts, urgency) because dispatch logic reads from it downstream. Tenants, of course, don't write structured text. That's the whole problem.
The shape of the problem
A typical intake reads: "hi my name is Jane and the dishwasher in unit 4B has been leaking for three days and now it won't start either, I work 9-5 M-F." From that we need a tenant name, unit identifier, equipment type, symptom cluster, priority signal, and a contact window. What we can't have is any of it invented. If GPT makes up a unit number, dispatch shows up at the wrong door and the tenant calls in angry.
// Stage 1 — strict extraction, no reasoning
The first call has one job: pull structured fields out of free text. JSON schema, temperature at 0.1, a prompt that tells the model to return null instead of guessing. No priority reasoning here, no summaries. Just extraction.
{
  "name": "Jane",
  "unit": "4B",
  "equipment": "dishwasher",
  "symptoms": ["leaking", "not starting"],
  "window": "9-5 M-F",
  "confidence": { "unit": 0.95, "equipment": 0.98 }
}
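In code, the stage looks roughly like this, assuming the official OpenAI Node SDK and its structured-outputs support; the schema here is an illustrative subset of the real one, with every field nullable so "return null instead of guessing" is something the model can actually do:

import OpenAI from "openai";

const client = new OpenAI();

// Illustrative subset of the extraction schema. Strict mode requires every
// property to be listed in "required" and additionalProperties: false.
const intakeSchema = {
  type: "object",
  properties: {
    name: { type: ["string", "null"] },
    unit: { type: ["string", "null"] },
    equipment: { type: ["string", "null"] },
    symptoms: { type: "array", items: { type: "string" } },
    window: { type: ["string", "null"] },
  },
  required: ["name", "unit", "equipment", "symptoms", "window"],
  additionalProperties: false,
};

export async function extractIntake(text: string) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.1,
    messages: [
      {
        role: "system",
        content:
          "Extract the fields from the tenant message. If a field is not " +
          "explicitly present, return null. Never guess.",
      },
      { role: "user", content: text },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "intake", strict: true, schema: intakeSchema },
    },
  });
  return JSON.parse(res.choices[0].message.content ?? "{}");
}

With strict: true the decoder is constrained to the schema, so an invented extra field can't leak downstream; a missing one arrives as an honest null.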
// Stage 2 — priority signal, bounded
Priority comes from a second call with a tight prompt: "given these symptoms, rate urgency 1-5 where 1 is cosmetic and 5 is emergency (flooding, no heat in winter)." The model returns a number and a short rationale. The rationale never gets shown to users. It's there as telemetry, so when the model is wrong (and it is, regularly) I can read back why.
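A sketch of that second call, same SDK, with the rationale captured as a schema field rather than free text (the function and schema names are illustrative):

import OpenAI from "openai";

const client = new OpenAI();

export async function ratePriority(symptoms: string[]) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "Given these appliance symptoms, rate urgency 1-5 where 1 is " +
          "cosmetic and 5 is emergency (flooding, no heat in winter). " +
          "Return the rating and a one-sentence rationale.",
      },
      { role: "user", content: symptoms.join("; ") },
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "priority",
        strict: true,
        schema: {
          type: "object",
          properties: {
            urgency: { type: "integer" },
            rationale: { type: "string" }, // telemetry only, never shown to users
          },
          required: ["urgency", "rationale"],
          additionalProperties: false,
        },
      },
    },
  });
  return JSON.parse(res.choices[0].message.content ?? "{}") as {
    urgency: number;
    rationale: string;
  };
}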
// Stage 3 — validation layer
The LLM is never the source of truth for anything the system acts on. Unit numbers get checked against the property's unit roster in Supabase. Equipment types get matched against a known vocabulary. Priority is clamped to the 1-5 scale, with anything weird falling back to a sane default, and a 5 triggers an immediate SMS to the on-call dispatcher.
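A condensed sketch of that pass, assuming supabase-js; the table, columns, vocabulary, and SMS helper are all hypothetical stand-ins for FixFlow's actual setup:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!,
);

// Hypothetical SMS helper (e.g. a thin Twilio wrapper), declared, not shown.
declare function pageOnCallDispatcher(ticket: unknown): Promise<void>;

const KNOWN_EQUIPMENT = new Set([
  "dishwasher", "refrigerator", "oven", "washer", "dryer", "water_heater", "hvac",
]);

export async function validateIntake(
  intake: { unit: string | null; equipment: string | null; urgency: number },
  propertyId: string,
) {
  // The roster, not the model, decides whether a unit exists.
  const { data: unit } = await supabase
    .from("units") // hypothetical table name
    .select("id")
    .eq("property_id", propertyId)
    .eq("unit_number", intake.unit ?? "")
    .maybeSingle();

  // Equipment must match the known vocabulary or it degrades to "unknown".
  const equipment =
    intake.equipment && KNOWN_EQUIPMENT.has(intake.equipment)
      ? intake.equipment
      : "unknown";

  // Out-of-range priority falls back to a sane default instead of being trusted.
  const urgency =
    Number.isInteger(intake.urgency) && intake.urgency >= 1 && intake.urgency <= 5
      ? intake.urgency
      : 3;

  if (urgency === 5) await pageOnCallDispatcher({ propertyId, intake });

  return { unitId: unit?.id ?? null, equipment, urgency, needsHumanReview: !unit };
}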
What GPT-4o-mini is good at
- Parsing fuzzy text into structured fields (with temperature near zero).
- Rewriting tenant complaints into dispatch-readable short descriptions.
- Detecting urgency patterns (water + "unit below" = high).
- Multi-symptom decomposition ("leaking and won't start" → two tracked symptoms).
What it's terrible at
- Unit identifiers. Always verify against the roster.
- Dates. "Three days ago" becomes whatever date it feels like. Resolve relative dates in code instead (sketch after this list).
- Rare equipment models. Stick to known types, don't let it invent.
- Reading your mind about priority scales (see sidebar).
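For the dates problem above, the fix is to not ask the model for a date at all: extract the raw phrase, then resolve it deterministically against the submission timestamp. A minimal sketch, a hypothetical helper that only handles the "N days ago" shape:

// Resolve "three days ago" against when the ticket was submitted, instead
// of letting the model pick a date. Returns null for unrecognized phrases.
const WORDS: Record<string, number> = {
  one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
};

export function resolveRelativeDate(phrase: string, submittedAt: Date): Date | null {
  const m = phrase.toLowerCase().match(/(\d+|one|two|three|four|five|six|seven)\s+days?\s+ago/);
  if (!m) return null;
  const n = WORDS[m[1]] ?? Number(m[1]);
  const resolved = new Date(submittedAt);
  resolved.setDate(resolved.getDate() - n);
  return resolved;
}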
The infrastructure around it
Every LLM call gets logged with its input, output, token count, latency, and rationale. Three reasons that matters: I can audit bad decisions weeks later, replay prompts when I upgrade model versions, and watch for prompt drift before it starts costing me. I run a nightly job that samples 10 random intakes and I review them over coffee. If more than one feels wrong, the prompt goes back on the bench.
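The wrapper is nothing fancy. A sketch, assuming the OpenAI Node SDK, supabase-js, and a table called llm_calls (table and column names are illustrative):

import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const client = new OpenAI();
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!,
);

// Every call goes through here, so input, output, tokens, and latency
// land in one auditable table.
export async function loggedCompletion(
  stage: string,
  params: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming,
) {
  const started = Date.now();
  const res = await client.chat.completions.create(params);
  await supabase.from("llm_calls").insert({
    stage,
    model: params.model,
    input: JSON.stringify(params.messages),
    output: res.choices[0].message.content,
    total_tokens: res.usage?.total_tokens ?? null,
    latency_ms: Date.now() - started,
  });
  return res;
}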
A good LLM architecture asks the model for less and verifies everything it gets back.
What I'd do differently
Start with the validation layer, not the prompts. I burned a week tuning prompts that were fighting a problem the roster check solved in twenty minutes. Second thing: track model confidence scores separately from the structured output, and refuse to auto-act on anything below 0.6. Low-confidence items go to a human queue. The dispatchers don't mind an extra click. They mind showing up at the wrong door.
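The routing rule itself is one function. A sketch, with a hypothetical shape for the confidence map:

// Auto-act only when every extracted field clears the threshold;
// anything weaker goes to the dispatcher review queue.
const AUTO_ACT_THRESHOLD = 0.6;

export function routeIntake(
  confidence: Record<string, number>,
): "auto_dispatch" | "human_review" {
  const values = Object.values(confidence);
  if (values.length === 0) return "human_review";
  return Math.min(...values) >= AUTO_ACT_THRESHOLD ? "auto_dispatch" : "human_review";
}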
FixFlow is live at fixflow-intake.vercel.app. Full case study is here.