Create evaluators, bind them to agents, run simulation suites with AI personas, and use knowledge base eval suites to measure retrieval quality.
Shipping an agent without evaluation is like deploying code without tests. Feather’s evaluation system gives you confidence that your agents respond accurately, helpfully, and on-brand — before and after every change. You can run automated judges on every live conversation, simulate hundreds of test scenarios with AI personas, and measure retrieval quality across your knowledge bases. All results feed back into a single quality dashboard so you know exactly where to improve.
An evaluator defines a reusable quality check. Feather supports model-judge evaluators that send the conversation to an LLM with your custom scoring prompt.
curl -X POST https://api-sandbox.featherhq.com/v1/evaluators \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Response Quality",
    "kind": "model_judge",
    "format": "score",
    "severity": "medium",
    "prompt": "Rate the response quality from 0-1. Consider accuracy, helpfulness, and tone. Respond with a JSON object {\"score\": <float>, \"rationale\": \"<reason>\"}",
    "threshold": 0.7
  }'
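To make the relationship between the judge's reply and the threshold concrete, here is a minimal local sketch in Python. It is not Feather's implementation: the function name judge_verdict is hypothetical, and it assumes that a score at or above the threshold counts as a pass.

```python
import json


def judge_verdict(raw_reply: str, threshold: float = 0.7) -> dict:
    """Parse a judge reply shaped like {"score": <float>, "rationale": "<reason>"}."""
    data = json.loads(raw_reply)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0-1 range the prompt asks for")
    # Assumption: a score equal to or above the threshold passes.
    return {
        "score": score,
        "passed": score >= threshold,
        "rationale": data.get("rationale", ""),
    }


example = judge_verdict('{"score": 0.82, "rationale": "Accurate and polite."}')
print(example)
```

A check like this is useful while you iterate on a scoring prompt: if the judge model stops returning valid JSON or drifts outside the 0-1 range, you find out before the evaluator is bound to an agent.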
Set is_critical: true on the evaluator bindings that should determine your session’s headline pass/fail verdict. Non-critical evaluators still run and are visible in the results, but they don’t affect the overall session score — useful for experimental or informational checks.
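The effect of is_critical can be pictured with a small Python sketch. The aggregation rule here is an assumption (the session passes only when every critical evaluator passes), and EvaluatorResult and headline_verdict are hypothetical names, not part of the Feather API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluatorResult:
    name: str
    passed: bool
    is_critical: bool


def headline_verdict(results: List[EvaluatorResult]) -> bool:
    """Return the session verdict using only the critical evaluators."""
    critical = [r for r in results if r.is_critical]
    # Assumption: every critical evaluator must pass. Non-critical results are
    # ignored here, although they would still appear in the detailed results.
    return all(r.passed for r in critical)


results = [
    EvaluatorResult("Response Quality", passed=True, is_critical=True),
    EvaluatorResult("Experimental Tone Check", passed=False, is_critical=False),
]
print(headline_verdict(results))
```

In this sketch the failing experimental check does not change the verdict. Note that a session with no critical evaluators would pass trivially under this rule, which is one reason to mark at least one binding as critical.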
Simulation suites let you run automated conversations at scale before deploying changes. An AI persona plays the role of a customer, your agent responds, and the results are evaluated automatically — giving you a full quality report without involving real users.
1
Create a persona
Personas define how the simulated customer behaves. Be specific — the more detail you provide, the more realistic and useful the simulation.
curl -X POST https://api-sandbox.featherhq.com/v1/personas \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Frustrated Customer",
    "body": "You are a frustrated customer who had a bad experience. Be direct and assertive. You want a resolution quickly and will push back if the agent deflects."
  }'
2
Create a scenario
Scenarios define the situation: what the customer intends to accomplish and which agents are in scope.
curl -X POST https://api-sandbox.featherhq.com/v1/scenarios \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Order Issue",
    "intent": "Customer wants to cancel an order that has not shipped yet. They should be able to get a full refund.",
    "agent_ids": ["<agent_id>"]
  }'
3
Create a simulation suite
A suite groups multiple persona + scenario pairs into a single runnable batch.
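One way to plan the pairs for a suite is to cross every persona with every scenario and then prune combinations that make no sense. The Python sketch below only enumerates pairs locally. The keys persona_id and scenario_id and the placeholder IDs are hypothetical and do not describe the actual request body for creating a suite.

```python
from itertools import product

# Placeholder IDs; substitute the IDs returned when you created each resource.
persona_ids = ["<persona_id_1>", "<persona_id_2>"]
scenario_ids = ["<scenario_id_1>", "<scenario_id_2>", "<scenario_id_3>"]

# Hypothetical pair shape, used here only to make the combinations visible.
pairs = [
    {"persona_id": persona, "scenario_id": scenario}
    for persona, scenario in product(persona_ids, scenario_ids)
]

for pair in pairs:
    print(pair)
```

Two personas and three scenarios give six pairs, so the size of a full cross product grows quickly. Hand-picking the pairs that reflect real customer situations usually produces a more focused suite.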
While simulation suites test full conversation quality, knowledge base eval suites focus specifically on retrieval accuracy — verifying that your KB returns the right chunks and generates accurate answers for known questions.
pass_threshold is the minimum percentage of cases that must pass for the overall suite run to be considered successful.
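As a worked example of how pass_threshold could be applied, the Python sketch below computes a pass rate from per-case outcomes. It assumes the threshold is expressed on a 0-100 scale and that a pass rate exactly equal to the threshold succeeds. Both are assumptions, and suite_passed is a hypothetical helper.

```python
from typing import List, Tuple


def suite_passed(case_results: List[bool], pass_threshold: float) -> Tuple[float, bool]:
    """Return (pass rate in percent, whether the run meets the threshold)."""
    if not case_results:
        raise ValueError("a suite run needs at least one case")
    pass_rate = 100.0 * sum(case_results) / len(case_results)
    # Assumption: meeting the threshold exactly counts as success.
    return pass_rate, pass_rate >= pass_threshold


# Three of four cases pass, checked against a threshold of 75 percent.
rate, ok = suite_passed([True, True, True, False], pass_threshold=75)
print(rate, ok)
```

With a small suite, a single failing case moves the pass rate by a large step (25 points with four cases), so consider the number of cases when you choose a threshold.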
2
Add eval cases
Each case is a question you know the answer to. Feather will retrieve chunks, generate an answer, and evaluate it against your expected answer and success criteria.
curl -X POST \
  "https://api-sandbox.featherhq.com/v1/knowledge-base/evals/suites/<suite_id>/cases" \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I reset my password?",
    "expected_answer": "Click Forgot Password on the login screen.",
    "success_criteria": [
      "Mentions the login screen",
      "Mentions the Forgot Password link or button"
    ]
  }'
Add as many cases as you need. A good eval suite covers common questions, edge cases, and any areas where retrieval has failed before.
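If you keep eval cases in a file or script, a small helper can build one request body per case before you send them. The Python sketch below uses the three fields from the request above. The build_case helper and its validation rules are a local convention rather than Feather requirements, and the second case is invented for illustration.

```python
import json
from typing import Dict, List


def build_case(question: str, expected_answer: str, success_criteria: List[str]) -> Dict:
    """Build one eval case body with the fields shown in the request above."""
    if not question.strip():
        raise ValueError("question must not be empty")
    # Local convention, not a documented rule: require at least one criterion.
    if not success_criteria:
        raise ValueError("provide at least one success criterion")
    return {
        "question": question,
        "expected_answer": expected_answer,
        "success_criteria": list(success_criteria),
    }


cases = [
    build_case(
        "How do I reset my password?",
        "Click Forgot Password on the login screen.",
        ["Mentions the login screen", "Mentions the Forgot Password link or button"],
    ),
    # Invented edge case for illustration only.
    build_case(
        "I never received the reset email. What should I do?",
        "Check your spam folder, then request a new reset link.",
        ["Mentions checking the spam folder"],
    ),
]

# One JSON body per case, ready to send to the cases endpoint for your suite.
bodies = [json.dumps(case) for case in cases]
print(len(bodies))
```

Keeping cases in one place like this makes it easier to add a regression case whenever retrieval fails in production.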
3
Trigger a run
Run the suite against one or more knowledge bases. Pass an array of kb_ids to test multiple KBs in one run.
{
  "items": [
    {
      "id": "kbcase_01hxg7s9nv2tu0x4einp",
      "question_snapshot": "How do I reset my password?",
      "expected_answer_snapshot": "Click Forgot Password on the login screen.",
      "generated_answer": "To reset your password, click the 'Forgot Password' link on the login screen. You'll receive an email with a reset link within a few minutes.",
      "passed": true,
      "judge_reasoning": "The generated answer correctly identifies the login screen and the Forgot Password link, satisfying both success criteria.",
      "retrieved_chunks": [
        {
          "chunk_id": "chunk_01hxg7s9nv2tu0x4eabc",
          "document_id": "doc_01hxg7s9nv2tu0x4edef",
          "document_title": "Account Management",
          "score": 0.94,
          "content_preview": "If you've forgotten your password, click 'Forgot Password' on the login screen..."
        }
      ]
    }
  ],
  "next_cursor": null,
  "has_more": false
}
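The next_cursor and has_more fields in this response suggest cursor-based pagination. The Python sketch below shows a generic way to walk such pages. How the cursor is passed back to the API is not shown in this guide, so the request itself is hidden behind a hypothetical fetch_page callable and exercised with fake in-memory pages.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_items(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every item across pages until the response reports no more."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        # Stop when the server reports no further pages or gives no cursor.
        if not page.get("has_more") or page.get("next_cursor") is None:
            break
        cursor = page["next_cursor"]


# Fake pages standing in for real API responses (illustrative data only).
FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "next_cursor": "c1", "has_more": True},
    "c1": {"items": [{"id": "c"}], "next_cursor": None, "has_more": False},
}


def fake_fetch(cursor: Optional[str]) -> Dict:
    return FAKE_PAGES[cursor]


all_ids = [item["id"] for item in iter_items(fake_fetch)]
print(all_ids)
```

In real use, fetch_page would wrap the HTTP request for the results listing and return the parsed JSON body.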
Use retrieved_chunks and judge_reasoning together to diagnose failures: low chunk scores point to an embedding or chunking issue, while high chunk scores paired with a failing answer suggest a generation problem.
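This triage rule can be sketched as a small Python function over one result item. The 0.5 score cutoff is an arbitrary assumption that you should tune for your own knowledge base, and diagnose is a hypothetical helper rather than part of the Feather API.

```python
from typing import Dict


def diagnose(case: Dict, min_chunk_score: float = 0.5) -> str:
    """Classify one result item as 'ok', 'retrieval', or 'generation'."""
    if case["passed"]:
        return "ok"
    chunks = case.get("retrieved_chunks", [])
    best_score = max((chunk["score"] for chunk in chunks), default=0.0)
    # Assumption: if even the best chunk scores low, retrieval is the likely cause.
    if best_score < min_chunk_score:
        return "retrieval"
    # Relevant chunks were found but the answer still failed.
    return "generation"


# Illustrative items that use the field names from the response above.
weak_retrieval = {"passed": False, "retrieved_chunks": [{"score": 0.21}]}
weak_generation = {"passed": False, "retrieved_chunks": [{"score": 0.94}]}

print(diagnose(weak_retrieval), diagnose(weak_generation))
```

Treat the label as a starting point and read judge_reasoning for each failing case before you change chunking settings or prompts.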