What the output looks like
Actual scores, actual deductions. Nothing hypothetical.
Maria Gonzalez
Medical Interpretation · English → Spanish
Deductions
Generated in 18s · 3-agent consensus
100 - 16 = 84
Before Agent One
12 calls reviewed this week
all done by hand
3 reviewers, scores all over the place
up to 18 pts apart
45 minutes per report
that is 9 hrs/week just on QA
After Agent One
487 calls reviewed this week
zero manual effort
Same scoring logic on every call
no reviewer variance
Report done in 18 seconds
0.15 hrs/week (about 9 minutes) on QA
Mistakes we catch on every call
These show up constantly. The AI flags them automatically.
Summarizing instead of interpreting
Critical: The interpreter paraphrases what was said instead of rendering it faithfully. This changes meaning.
Role confusion
Critical: Saying "I think..." as themselves instead of interpreting in first person for the speaker.
Dropping medical or legal terms
Moderate: Leaving out dosages, diagnoses, or legal terms. The kind of detail that matters most.
Inserting personal opinions
Critical: The interpreter adds their own commentary or advice. Not their role.
Skipping parts of the conversation
Moderate: Cherry-picking what to interpret. Side comments, overlapping speech, quiet remarks get left out.
Not flagging communication breakdowns
Moderate: When a speaker is unclear or uses idioms, the interpreter should step in. Often they do not.
Things we have learned about interpretation QA
Short reads. Straight to the point.
The 5 interpreter errors that keep showing up
Summarizing instead of interpreting is still number one. It also carries the biggest penalty.
Pass rate is a vanity metric
80% pass rate sounds fine until you realize the threshold is 70. That is a low bar. QA depth tells you more.
Most QA systems score the wrong thing
They score the call. We score the interpretation. Big difference. One measures outcomes, the other measures the interpreter.
Why summarizing costs 50 points
It is not a minor slip. When an interpreter summarizes instead of interpreting, the meaning changes. Automatic 50-point deduction.
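The scoring is deduction-based: start at 100 and subtract a penalty for each flagged error, which is how a sample report arrives at 100 - 16 = 84. A minimal sketch of that logic; only the 50-point summarizing penalty comes from this page, and the other weights and error names are hypothetical placeholders:

```python
# Deduction-based scoring sketch. "summarizing" = 50 is stated on this page;
# the other penalty weights and error keys are illustrative assumptions.
PENALTIES = {
    "summarizing": 50,     # meaning-changing error, stated penalty
    "role_confusion": 10,  # hypothetical weight
    "dropped_term": 6,     # hypothetical weight
}

def score_call(flagged_errors):
    """Start at 100, subtract the penalty for each flagged error, floor at 0."""
    total = sum(PENALTIES.get(err, 0) for err in flagged_errors)
    return max(0, 100 - total)

print(score_call(["role_confusion", "dropped_term"]))  # 100 - 16 = 84
print(score_call(["summarizing"]))                     # 100 - 50 = 50
```

The floor at 0 keeps a call with many flagged errors from going negative; the exact weights would come from your own rubric.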
Why we use three AI agents, not one
A single model can make things up. Three independent evaluations reach consensus. That is harder to fool than any single reviewer.
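One common way to combine independent evaluations is majority voting: a flag only counts if at least two of the three agents raise it, so a single model's hallucinated error gets discarded. The page does not describe the product's actual consensus mechanism, so this is a sketch of that scheme with hypothetical error names:

```python
from collections import Counter

def consensus_flags(agent_flags, quorum=2):
    """Keep an error flag only if at least `quorum` agents raised it.

    Each element of agent_flags is one agent's independent list of flags;
    set() dedupes within an agent so no single agent can vote twice.
    """
    counts = Counter(flag for flags in agent_flags for flag in set(flags))
    return sorted(flag for flag, n in counts.items() if n >= quorum)

# Three independent evaluations of the same call (hypothetical flags):
agents = [
    ["summarizing", "dropped_term"],   # agent 1
    ["summarizing"],                   # agent 2
    ["summarizing", "role_confusion"], # agent 3
]
print(consensus_flags(agents))  # ['summarizing'] survives the 2-of-3 vote
```

The two minority flags are treated as single-agent noise and dropped, which is the sense in which consensus is harder to fool than any one reviewer.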
Going from 5% to 100% call coverage
Most teams review a handful of calls. When you review all of them, you start seeing patterns that were completely hidden.
Done listening to calls manually?
Score every call with your rubric. You can be up and running in minutes. No integrations required.