A busy Swedish pizzeria. A 69-item menu. Regional accents, phonetic mishears, and a peak-hour chaos problem. Here's exactly what we built and what changed.
This pizzeria was doing well — good location, loyal regulars, solid kitchen. But the phone line was a constant source of friction. Staff spent Friday nights trying to hear orders over kitchen noise. Customers repeated themselves. Errors made it to the kitchen. They tried an off-the-shelf AI voice tool and it made things worse, not better.
Swedish pizza names didn't match anything in the model's training data. "Capricciosa" came through as "kabbalisch ås". The AI confidently passed garbage to the kitchen.
When the AI failed to parse an order, it asked the customer to repeat. Then again. On a Friday at 18:30, most just hung up. Lost order, frustrated customer.
The AI would produce a phonetic transcript and staff would type it into the POS manually. This introduced a second point of failure — and removed any efficiency gain.
Between 17:00–21:00 on weekends, the phone rang continuously. Staff were split between the floor, the kitchen, and a phone that required constant attention. Orders were missed entirely.
Standard AI voice tools aren't trained for Swedish accents, business-specific menus, or regional dialects. They're trained on broadcast Swedish — not the casual, noisy, phonetically compressed way people actually order pizza over the phone. Off-the-shelf solutions failed 30% of the time on this menu alone.
Seven stages from a customer saying "hej" to a kitchen ticket printing — designed around the specific failure modes we found in the audit.
Customer calls the restaurant's existing number. No new number required.
AI agent picks up in <1s. Greets in Swedish. Begins streaming audio to Whisper ASR.
Claude parses the transcript: items, quantity, size, modifiers, allergies.
Levenshtein + Dice bigram against the 69-item menu. Phonetic alias table applied first.
Regex + NLP extracts gluten-free, sauce, sliced, extra/remove toppings. Validated against schema.
Zod-validated order written to PostgreSQL. Full audit trail. Two-pass validation runs here.
Structured order pushed to kitchen display in real time. Staff never touch a phone for this order.
Steps 4 and 5 (menu matching and modifier extraction) are the parts that off-the-shelf solutions skip entirely, and where 90% of errors originated.
We built a hybrid matching algorithm combining Levenshtein distance (weighted 40%) and Dice bigram similarity (weighted 60%). Pure Levenshtein penalises long menu names unfairly — Dice bigram handles phonetic similarity better for Swedish compound words. The matcher runs against every item in the menu and returns the best candidate above a confidence threshold.
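To make the weighting concrete, here is a minimal Python sketch of a 40/60 Levenshtein and Dice bigram scorer. It is an illustration, not the production code: the menu excerpt, the function names, and the 0.6 confidence threshold are assumptions, since the text does not state the real threshold.

```python
from collections import Counter
from typing import List, Optional, Tuple

LEVENSHTEIN_WEIGHT = 0.4    # weighting stated in the text
DICE_WEIGHT = 0.6           # weighting stated in the text
CONFIDENCE_THRESHOLD = 0.6  # assumed value, not given in the text

# Hypothetical excerpt; the real menu has 69 items.
MENU = ["Capricciosa", "Vesuvio", "Hawaii", "Kebabpizza", "Quattro Stagioni"]


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalised to 0..1 by the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, counted as multisets."""
    ba = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((ba & bb).values()) / total


def hybrid_score(heard: str, item: str) -> float:
    a, b = heard.lower().strip(), item.lower().strip()
    return (LEVENSHTEIN_WEIGHT * levenshtein_similarity(a, b)
            + DICE_WEIGHT * dice_similarity(a, b))


def best_match(heard: str, menu: List[str] = MENU,
               threshold: float = CONFIDENCE_THRESHOLD) -> Optional[Tuple[str, float]]:
    """Score every menu item; return the best one, or None below the threshold."""
    score, item = max((hybrid_score(heard, m), m) for m in menu)
    return (item, score) if score >= threshold else None
```

With this sketch, a near miss such as `best_match("kapriciosa")` resolves to "Capricciosa", while a transcript like "kabbalisch ås" shares no bigrams with it and returns `None`. That second case is the kind of input a fuzzy matcher alone cannot recover.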
Before fuzzy matching runs, each input token is checked against a hand-curated phonetic alias table built from the actual mishears found in the 400-call audit. This handles the systematic errors — "svepperoni" will never fuzzy-match to "Pepperoni" without a hint, but with the alias table it maps cleanly.
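A minimal sketch of the alias step in Python, assuming the table is a plain mapping from a normalised mishear to a canonical menu name. The two entries are the mishears quoted in this case study; the function names and the normalisation rules are illustrative assumptions.

```python
import re
import unicodedata
from typing import Dict

# Mishear -> canonical menu name. Both entries are the examples quoted in the
# text; a real table would be built from the call audit.
PHONETIC_ALIASES: Dict[str, str] = {
    "kabbalisch ås": "Capricciosa",
    "svepperoni": "Pepperoni",
}


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def apply_aliases(transcript: str,
                  aliases: Dict[str, str] = PHONETIC_ALIASES) -> str:
    """Replace known mishears with canonical names before fuzzy matching.

    Longer aliases are tried first so a multi-word mishear is not split up
    by a shorter alias that happens to overlap it.
    """
    result = normalise(transcript)
    for alias in sorted(aliases, key=len, reverse=True):
        canonical = aliases[alias]
        # Lookarounds keep the match on whole words, including å, ä and ö.
        pattern = r"(?<!\w)" + re.escape(normalise(alias)) + r"(?!\w)"
        result = re.sub(pattern, lambda _m, c=canonical: c, result)
    return result
```

Matching on phrases rather than single tokens matters here because a mishear like "kabbalisch ås" spans two words. Anything the table does not cover passes through unchanged and falls to the fuzzy matcher.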
Getting the pizza name right is 50% of the problem. The other 50% is extracting modifications accurately — especially allergy-related ones where errors have real consequences. We built a regex + NLP pipeline that processes Swedish keywords and maps them to structured flags.
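The regex side of such a pipeline could look roughly like the Python sketch below. The keyword lists, the topping and sauce vocabularies, and the `Modifiers` fields are all assumptions made for illustration; the text does not list the real keywords or schema.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies; a real deployment would load these from the menu.
KNOWN_TOPPINGS = {"ost", "lök", "svamp", "skinka", "ananas"}
KNOWN_SAUCES = {"vitlökssås", "kebabsås", "bearnaisesås"}

GLUTEN_FREE_RE = re.compile(r"\bglutenfri(?:tt|a)?\b")      # glutenfri / glutenfritt / glutenfria
SLICED_RE = re.compile(r"\b(?:skuren|skurna|skivad)\b")     # "sliced"
EXTRA_RE = re.compile(r"\bextra\s+(\w+)")                   # "extra ost"
REMOVE_RE = re.compile(r"\butan\s+(\w+)")                   # "utan lök" = without onion


@dataclass
class Modifiers:
    gluten_free: bool = False
    sliced: bool = False
    sauce: Optional[str] = None
    extra: List[str] = field(default_factory=list)
    remove: List[str] = field(default_factory=list)
    # Words that followed "extra"/"utan" but are not known toppings.
    unrecognised: List[str] = field(default_factory=list)


def extract_modifiers(utterance: str) -> Modifiers:
    text = utterance.lower()
    mods = Modifiers(
        gluten_free=bool(GLUTEN_FREE_RE.search(text)),
        sliced=bool(SLICED_RE.search(text)),
    )
    # First known sauce mentioned anywhere in the utterance.
    for word in re.findall(r"\w+", text):
        if word in KNOWN_SAUCES:
            mods.sauce = word
            break
    for pattern, target in ((EXTRA_RE, mods.extra), (REMOVE_RE, mods.remove)):
        for word in pattern.findall(text):
            if pattern is REMOVE_RE and word == "gluten":
                mods.gluten_free = True          # "utan gluten" is an allergy flag
            elif word in KNOWN_TOPPINGS:
                target.append(word)
            else:
                mods.unrecognised.append(word)   # surface it rather than guess
    return mods
```

For example, `extract_modifiers("En glutenfri Vesuvio utan lök med extra ost och vitlökssås, skuren")` sets the gluten-free and sliced flags, records the sauce, and puts "ost" and "lök" in the extra and remove lists. Keeping unknown words in a separate `unrecognised` list is one way to avoid silently dropping something allergy-related.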
The first pass runs immediately on the live transcript. The second pass runs asynchronously: Whisper re-processes the full call audio, and the two outputs are compared for discrepancies. If they disagree above a threshold, the order is flagged for staff review before it prints to the kitchen. In practice, this catches about 95% of the errors that would otherwise slip past the first pass.
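One way to express the comparison step, sketched in Python. The discrepancy measure (a Jaccard-style distance over order lines) and the 0.2 threshold are assumptions chosen for illustration; the text says only that orders are flagged when the two passes disagree above a threshold.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List

DISCREPANCY_THRESHOLD = 0.2  # assumed value, not given in the text


@dataclass(frozen=True)
class OrderLine:
    """One line of a parsed order; frozen so it can be counted and compared."""
    item: str
    quantity: int = 1
    modifiers: FrozenSet[str] = frozenset()


def discrepancy(live: List[OrderLine], review: List[OrderLine]) -> float:
    """0.0 when both passes agree on every line, 1.0 when no line matches."""
    a, b = Counter(live), Counter(review)
    union = sum((a | b).values())
    if union == 0:
        return 0.0
    return 1.0 - sum((a & b).values()) / union


def needs_staff_review(live: List[OrderLine], review: List[OrderLine],
                       threshold: float = DISCREPANCY_THRESHOLD) -> bool:
    """True when the order should pause for staff instead of printing."""
    return discrepancy(live, review) > threshold
```

Because a line only counts as matching when item, quantity, and modifiers all agree, a pass that hears "glutenfri" and a pass that does not will disagree on that line even if the pizza name matches.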
Every previous system had a human in the loop between phone and kitchen. Our integration pushes validated orders directly to the kitchen display via webhook — zero manual re-entry, zero transcription lag. The kitchen sees the order within seconds of the customer finishing the call, with the same structure every time.
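As a rough illustration of the webhook step, the Python sketch below builds the HTTP request for a validated order using only the standard library. The endpoint URL, the payload fields, and the function names are hypothetical; the kitchen display's real API is not described in the text.

```python
import json
import urllib.request
from typing import Any, Dict

# Hypothetical endpoint; the real kitchen display URL is not given in the text.
KITCHEN_DISPLAY_URL = "https://kitchen-display.example/api/orders"


def build_ticket_request(order: Dict[str, Any],
                         url: str = KITCHEN_DISPLAY_URL) -> urllib.request.Request:
    """Serialise a validated order as JSON and wrap it in a POST request."""
    # ensure_ascii=False keeps å, ä and ö readable in the payload.
    body = json.dumps(order, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def push_ticket(order: Dict[str, Any], timeout_seconds: float = 5.0) -> int:
    """Send the ticket and return the HTTP status code."""
    request = build_ticket_request(order)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status
```

Splitting request construction from sending keeps the payload shape easy to check without a network call. A production version would also need retries and a fallback path for when the display is unreachable.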
Measured over a 60-day period post-deployment against a 60-day pre-deployment baseline.
"Staff morale" is qualitative — reported by the owner in a post-deployment review. The phrase used was: "the team doesn't dread the phone anymore."
Stack, algorithms, schema, and database design
Claude extracts structured order from real-time transcript during the call. Order confirmed verbally with customer. Provisional kitchen ticket queued.
Whisper re-processes the full call audio independently. Output compared with Pass 1. Discrepancy above threshold flags the order for staff review before final print.
Clean orders print automatically. Flagged orders pause at the POS station for a 10-second staff confirmation. Catches ~95% of remaining errors.
Broadcast-quality Swedish and phone-quality colloquial Swedish are very different inputs. Models trained on the former consistently fail on the latter, especially under background noise. You need a model or a post-processing layer that accounts for how people actually talk in context.
No general-purpose LLM knows your 69-item pizza menu. The model doesn't need to be perfect at transcription if your matching layer is good enough to recover from imperfect input. A well-tuned fuzzy matcher against a curated menu will outperform a better ASR model with no post-processing.
Everyone focuses on getting the item name right. In practice, modifier parsing is where errors with real consequences occur. Allergy information, sauce preferences, and removal requests are communicated in highly variable Swedish phrasing. Getting this right requires a purpose-built extraction layer, not a generic one.
A single extraction pass on a noisy call transcript will always have some error rate. Running a second pass on the audio, independently and with a different model, and comparing outputs is a reliable way to catch the cases where the first pass was wrong. The 8-second async window this adds is acceptable; most kitchens don't need the ticket faster than that.
When orders go directly to the kitchen without human re-entry, errors become immediately visible and correctable. Staff stop being transcription operators. The kitchen gets consistent, structured input. This operational change has downstream effects on speed, morale, and error rates that are hard to fully attribute to any single part of the system.
The specific challenge here was Swedish pizza names. But the underlying pattern — generic AI failing on domain-specific language — shows up everywhere.
This work is likely relevant if your business deals with any of these:
30 minutes. No pitch deck. We'll listen to your current setup and tell you honestly whether this is the right fit — and what the build would look like.