iBridge Digital

Procurement · Vendor evaluation

Five questions to ask any voice AI vendor — including us

May 2026 · 6 min read

Author

Kalpesh Upadhyay

Founder, iBridge


If you are evaluating voice AI for a clinic group or an RCM operation, the difference between a serious vendor and a wrapper is not always obvious in the first conversation. Demos are scripted. Marketing is polished. The same architecture problems that ruin production deployments are completely invisible at the demo stage.

Five questions can change that. Each one is designed to expose the difference between a vendor who has built something real and a vendor who has wrapped a UI around three commodity APIs. Use them on every voice AI vendor you evaluate. Use them on us. The vendors who answer them well are the small set of companies actually building production systems. The point is to find them.

1. Show me the audit log. Live.

Not in a deck. Live. Open your dashboard, run a test call, and let me see what gets captured. I want to see every prompt, every tool call, every decision, every knowledge consultation, every routing choice. I want to query it.

A serious vendor has a hash-chained, append-only audit log that the customer's engineering team can query directly. The audit captures not just what the agent did, but why it did it — the reasoning chain, the knowledge sources consulted, the confidence scores at each step. If the vendor cannot show this in five minutes during the demo, they do not have it. That should end the evaluation.

The audit log is not a feature. It is the evidence that the vendor has built a real system rather than a black box. Without it, you cannot debug, you cannot improve, you cannot defend the deployment in a compliance review.
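
The "hash-chained, append-only" property is easy to sketch: each entry records the digest of the entry before it, so editing any earlier record breaks every hash that follows. A minimal Python illustration — the field names and event shapes here are invented for the example, not any vendor's actual schema:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(log, event):
    """Append an event to a hash-chained audit log.

    Each entry embeds the previous entry's hash, so the log can only
    grow; rewriting history invalidates the chain.
    """
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"event": event, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})
    return log

def verify_chain(log):
    """Recompute every hash; return True only if the chain is intact."""
    prev_hash = GENESIS
    for entry in log:
        body = {"event": entry["event"], "prev_hash": entry["prev_hash"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"type": "tool_call", "tool": "eligibility_lookup"})
append_entry(log, {"type": "decision", "action": "escalate", "confidence": 0.41})
assert verify_chain(log)

# Tampering with an earlier entry is detectable:
log[0]["event"]["tool"] = "something_else"
assert not verify_chain(log)
```

This is the property to test in the live demo: ask the vendor to modify one historical record and show you the verification failing.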

2. Tell me what your agent will refuse.

If the answer is "our agent is helpful and will try to answer anything," you are looking at a wrapper. The agent will say anything the underlying language model can plausibly produce, including things that are wrong, things that are out of scope, and things that create liability for your organization.

A serious agent has explicit, customer-editable scope. A defined list of topics it will engage with, topics it will decline, and topics it will escalate. The customer's training authority can edit the scope. The scope is enforced at runtime, not just in the prompt.

Ask: when the agent encounters something outside its scope, what happens? A real answer involves a defined escalation path, a logged flag, and a human-readable reasoning chain. A wrapper's answer is usually some variation of "the LLM is pretty good about that." That is not an answer. That is a hope.
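
Runtime-enforced scope, as opposed to prompt-only scope, can be as simple as an explicit lookup that routes every classified topic to a defined action and escalates anything it does not recognize. A toy sketch, with hypothetical topic names standing in for a real customer-editable scope list:

```python
# Customer-editable scope: every topic maps to exactly one action.
SCOPE = {
    "engage": {"appointment_scheduling", "insurance_verification"},
    "decline": {"clinical_advice", "billing_disputes"},
    "escalate": {"complaints", "legal_requests"},
}

def route(topic, scope=SCOPE):
    """Return the action for a classified topic.

    Unknown topics escalate by default: the agent never improvises
    outside its defined list, regardless of what the LLM could produce.
    """
    for action, topics in scope.items():
        if topic in topics:
            return action
    return "escalate"

assert route("insurance_verification") == "engage"
assert route("clinical_advice") == "decline"
assert route("crypto_tips") == "escalate"  # out of scope, never answered
```

The design choice that matters is the default: a wrapper's implicit default is "answer anyway"; a production system's default is "escalate and log."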

3. Where does PHI live, and for how long?

Specifically: from the moment a patient's name appears on the call, where does that data sit? In what database? Encrypted at what level? Accessible to which engineers? Retained for how long?

The honest answer for most production-grade systems should be: PHI lives only at the boundary, only for the duration of one interaction, and is destroyed before the call is written to any persistent system. Patient identifying information should never reach the vendor's database, the vendor's reasoning models, the vendor's observability stack, or any system the vendor's engineers can read.

If the vendor cannot describe this architecture in two minutes, they do not have one. What they have is a database with patient data in it, and a hope that nothing goes wrong. That is a compliance liability disguised as a product, and it becomes your liability the moment you sign the contract.
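
The architecture in question implies a redaction step between the live call and anything durable. A deliberately simplified illustration — real systems use dedicated PHI-detection pipelines, not two regexes; the patterns and tokens here are placeholders:

```python
import re

# Placeholder patterns; a production system detects far more identifier types.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DOB]"),
]

def redact(transcript, patient_name):
    """Strip identifiers from a transcript before it leaves the boundary.

    Only the redacted form may reach the database, the reasoning
    models, or the observability stack.
    """
    out = transcript.replace(patient_name, "[PATIENT]")
    for pattern, token in PHI_PATTERNS:
        out = pattern.sub(token, out)
    return out

live = "Jane Doe, DOB 04/12/1987, SSN 123-45-6789, is eligible."
stored = redact(live, "Jane Doe")
assert stored == "[PATIENT], DOB [DOB], SSN [SSN], is eligible."
```

The question to press on is where this step sits in the vendor's data flow diagram: if redaction happens after logging, the logs are the database with patient data in it.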

4. What happens when your agent doesn't know?

Every voice AI agent will encounter situations beyond its training: unfamiliar payer IVR menus, novel rep questions, edge-case clinical scenarios, dropped calls, contradictory information from multiple knowledge sources. What happens in those moments determines whether the deployment succeeds or fails.

Three patterns to watch for:

The hallucination pattern. The agent confidently makes up an answer. Authorization numbers that do not exist. Patient eligibility status pulled from thin air. This is the worst outcome and the most common one with wrapper vendors.

The hang pattern. The agent loops on hold music, repeats the same prompt, or sits in silence. The call ends with no useful outcome. Less catastrophic but still failure.

The graceful escalation pattern. The agent recognizes its limits, hands off to a human queue with the full context attached, and the human picks up exactly where the agent left off — not back at hold-position-one in a new queue. This is what production-grade systems do.

Ask the vendor to show you a real example of pattern three. If they cannot, they do not have it.

5. Run a hundred of my real calls before we contract.

This is the question that separates demos from production. Any vendor unwilling to run a benchmark on a representative sample of your actual calls — before any contract is signed — is selling you their best-case scenario, not yours.

A serious vendor will accept a recorded sample of 50 to 100 calls, process them in a sandbox tenant, and return: cost per call, accuracy by extracted field, average handle time, first-call resolution rate, and a confidence-scored breakdown of where the agent wins and where it hands off. The benchmark is at no charge, the methodology is transparent, and the result is the floor of what they will commit to in a contract.
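
The deliverables listed above are straightforward aggregates over per-call records, which is part of why refusing to produce them is telling. A minimal sketch of the summary computation, with invented field names:

```python
def summarize(calls):
    """Aggregate benchmark metrics from per-call records.

    Each record is assumed to carry cost, handle time, and boolean
    resolution/handoff outcomes; field names are illustrative.
    """
    n = len(calls)
    return {
        "cost_per_call": round(sum(c["cost"] for c in calls) / n, 2),
        "avg_handle_time_s": round(sum(c["handle_time_s"] for c in calls) / n, 1),
        "first_call_resolution": round(sum(c["resolved"] for c in calls) / n, 2),
        "handoff_rate": round(sum(c["handed_off"] for c in calls) / n, 2),
    }

calls = [
    {"cost": 0.42, "handle_time_s": 310, "resolved": True, "handed_off": False},
    {"cost": 0.55, "handle_time_s": 420, "resolved": False, "handed_off": True},
    {"cost": 0.38, "handle_time_s": 280, "resolved": True, "handed_off": False},
]
report = summarize(calls)
assert report["cost_per_call"] == 0.45
assert report["handoff_rate"] == 0.33
```

Accuracy by extracted field requires ground-truth labels for the sample, which is why the benchmark needs your real calls rather than the vendor's curated ones.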

If the vendor refuses or stalls on this, the conversation is over. Not because they are bad people, but because they have something to hide about how their system performs on calls they did not pre-select.

What good answers look like

A serious vendor can answer all five questions in writing within a single business day. They can produce: a live audit log demo, a scope-and-refusal documentation page, a data flow diagram showing PHI residency, an escalation path document, and a benchmark methodology. They will accept your security questionnaire and turn it around in three to five business days, not three weeks.

Their architecture is built around minimum-necessary access. Patient data is held only at the boundary. Their audit log is queryable, hash-chained, and covers every state change. Their refusal logic is explicit and customer-editable. Their escalation paths are real, not aspirational.

The vendors who get this right are not always the loudest in the market. They are not always the best-funded. They are not always the ones with the most polished demos. But they are the ones whose deployments survive the first ninety days in production and compound from there.

What to do with this

Send this list to every voice AI vendor in your evaluation set as part of the initial RFP. Score the responses. The exercise will eliminate at least half of your candidates in the first round and reveal the remaining half's actual posture — not their marketing posture.

We at iBridge have published answers to all five of these questions on our security page and in the Business Associate Agreement we will sign on day one. We encourage every prospective customer to ask all five questions of every vendor they evaluate, including us. The cheapest insurance an RCM operation or a clinic group can buy is compliance done seriously, by a vendor that earned the trust before the contract was signed.