Evaluate and Test Your Agents with Feather Evals

Shipping an agent without evaluation is like deploying code without tests. Feather’s evaluation system gives you confidence that your agents respond accurately, helpfully, and on-brand — before and after every change. You can run automated judges on every live conversation, simulate hundreds of test scenarios with AI personas, and measure retrieval quality across your knowledge bases. All results feed back into a single quality dashboard so you know exactly where to improve.

Create an evaluator

An evaluator defines a reusable quality check. Feather currently supports model-judge evaluators, which send the conversation to an LLM along with your custom scoring prompt.

curl -X POST https://api-sandbox.featherhq.com/v1/evaluators \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Response Quality",
    "kind": "model_judge",
    "format": "score",
    "severity": "medium",
    "prompt": "Rate the response quality from 0-1. Consider accuracy, helpfulness, and tone. Respond with a JSON object {\"score\": <float>, \"rationale\": \"<reason>\"}",
    "threshold": 0.7
  }'

Response:

{
  "id": "eval_01hxe5q7nt0rs8v2cgln",
  "name": "Response Quality",
  "kind": "model_judge",
  "format": "score",
  "severity": "medium",
  "threshold": 0.7,
  "status": "active",
  "created_at": "2024-09-01T10:00:00Z"
}

Field	Description
`kind`	`model_judge` — uses an LLM to score the response. More types coming soon.
`format`	`score` returns a float 0–1. `boolean` returns a pass/fail verdict.
`severity`	`info`, `low`, `medium`, `high`, or `critical`. Affects how failures surface in the dashboard.
`threshold`	Minimum score to be considered passing. Only used when `format` is `score`.
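To make the interaction between `format` and `threshold` concrete, here is a minimal client-side sketch in Python. It parses a judge reply shaped like the JSON object requested in the prompt above and applies the threshold. The function name `parse_judge_output` is hypothetical, and treating a score equal to the threshold as passing is an assumption based on the "minimum score" wording; this is an illustration, not Feather's internal implementation.

```python
import json


def parse_judge_output(raw: str, threshold: float) -> dict:
    """Parse a score-format judge reply and apply the pass threshold."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0-1 range")
    return {
        "score": score,
        "rationale": data.get("rationale", ""),
        # Assumption: a score exactly equal to the threshold counts as passing.
        "passed": score >= threshold,
    }


result = parse_judge_output('{"score": 0.82, "rationale": "Accurate and polite."}', 0.7)
print(result["passed"])  # True
```

A check like this can be handy when you iterate on a scoring prompt locally and want to confirm the judge's output is well-formed before creating the evaluator.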

Bind an evaluator to an agent

Bindings connect an evaluator to a specific agent. Once bound, every session for that agent automatically runs the evaluator — no extra code needed.

curl -X POST https://api-sandbox.featherhq.com/v1/evaluator-bindings \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_id": "<eval_id>",
    "scope": "agent",
    "agent_id": "<agent_id>",
    "is_critical": true
  }'

List all bindings for an agent:

curl "https://api-sandbox.featherhq.com/v1/evaluator-bindings?agent_id=<agent_id>" \
  -H "x-api-key: <your_api_key>"

Set `is_critical: true` on the evaluator bindings that should determine your session’s headline pass/fail verdict. Non-critical evaluators still run and appear in the results, but they don’t affect the overall session score, which makes them useful for experimental or informational checks.
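The split between critical and non-critical bindings can be sketched as follows. This is only one plausible reading: it assumes the headline verdict passes when every critical evaluator passes, and it assumes each result carries a `passed` flag next to `is_critical`. Neither the aggregation rule nor the result shape is confirmed by this page, and `headline_verdict` is a hypothetical helper name.

```python
from typing import Optional


def headline_verdict(results: list[dict]) -> Optional[bool]:
    """Combine per-evaluator results into one session verdict.

    Assumption: the session passes only if every critical evaluator passed.
    Returns None when no critical evaluator is bound.
    """
    critical = [r for r in results if r["is_critical"]]
    if not critical:
        return None
    return all(r["passed"] for r in critical)


results = [
    {"name": "Response Quality", "is_critical": True, "passed": True},
    {"name": "Experimental Tone Check", "is_critical": False, "passed": False},
]
print(headline_verdict(results))  # True: the failing check is non-critical
```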

Simulation suites

Simulation suites let you run automated conversations at scale before deploying changes. An AI persona plays the role of a customer, your agent responds, and the results are evaluated automatically — giving you a full quality report without involving real users.

Create a persona

Personas define how the simulated customer behaves. Be specific — the more detail you provide, the more realistic and useful the simulation.

curl -X POST https://api-sandbox.featherhq.com/v1/personas \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Frustrated Customer",
    "body": "You are a frustrated customer who had a bad experience. Be direct and assertive. You want a resolution quickly and will push back if the agent deflects."
  }'

Create a scenario

Scenarios define the situation: what the customer intends to accomplish and which agents are in scope.

curl -X POST https://api-sandbox.featherhq.com/v1/scenarios \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Order Issue",
    "intent": "Customer wants to cancel an order that has not shipped yet. They should be able to get a full refund.",
    "agent_ids": ["<agent_id>"]
  }'

Create a simulation suite

A suite groups multiple persona + scenario pairs into a single runnable batch.

curl -X POST https://api-sandbox.featherhq.com/v1/sim-suites \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Core Scenarios",
    "items": [
      {
        "persona_id": "<persona_id>",
        "scenario_id": "<scenario_id>"
      }
    ]
  }'
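When a suite should cover every persona against every scenario, the `items` array can be generated rather than written by hand. The Python sketch below builds the request body for the call above; `build_suite_payload` and the placeholder IDs are hypothetical, and pairing every persona with every scenario is a design choice of this example, not an API requirement.

```python
import json
from itertools import product


def build_suite_payload(name: str, persona_ids: list[str], scenario_ids: list[str]) -> dict:
    """Build a sim-suite body with one item per persona/scenario combination."""
    items = [
        {"persona_id": persona_id, "scenario_id": scenario_id}
        for persona_id, scenario_id in product(persona_ids, scenario_ids)
    ]
    return {"name": name, "items": items}


payload = build_suite_payload(
    "Core Scenarios",
    ["<persona_a>", "<persona_b>"],    # placeholder persona IDs
    ["<scenario_x>", "<scenario_y>"],  # placeholder scenario IDs
)
print(json.dumps(payload, indent=2))  # four items, ready to send as the request body
```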

Dispatch a suite run

Kick off all simulations in the suite against a specific agent and channel.

curl -X POST \
  "https://api-sandbox.featherhq.com/v1/sim-suites/<suite_id>/runs" \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "<agent_id>",
    "channel": "text"
  }'
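For scripted dispatches, for example from a CI job, the same request can be assembled in Python with the standard library. The sketch below only builds the request object and does not send it; the endpoint, headers, and body fields mirror the curl call above, while `build_dispatch_request` and the example IDs are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api-sandbox.featherhq.com/v1"


def build_dispatch_request(
    api_key: str, suite_id: str, agent_id: str, channel: str = "text"
) -> urllib.request.Request:
    """Build (but do not send) the POST request that dispatches a suite run."""
    body = json.dumps({"agent_id": agent_id, "channel": channel}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/sim-suites/{suite_id}/runs",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


# Placeholder credentials and IDs, for illustration only.
req = build_dispatch_request("your_api_key", "suite_123", "agent_456")
print(req.get_method(), req.full_url)
# To send it, pass the request to urllib.request.urlopen(req).
```

Keeping request construction separate from sending makes the call easy to inspect and unit-test without touching the network.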

Poll for results

Suite runs are asynchronous. Poll until `status` is `completed`, then review the status of each child run in the `runs` array.

curl "https://api-sandbox.featherhq.com/v1/sim-suites/<suite_id>/runs/<suite_run_id>" \
  -H "x-api-key: <your_api_key>"
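A polling loop for this endpoint might look like the Python sketch below. The HTTP call is injected as `fetch_run`, so the loop stays independent of any particular client. The only status this page documents is `completed`; the `running` value in the demo, the default interval, and the timeout are assumptions chosen for illustration, and `poll_suite_run` is a hypothetical helper.

```python
import time
from typing import Callable


def poll_suite_run(
    fetch_run: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    terminal_statuses: frozenset = frozenset({"completed"}),
) -> dict:
    """Call fetch_run until the suite run reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        if run.get("status") in terminal_statuses:
            return run
        # Stop if the next sleep would push us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"suite run still {run.get('status')!r} after {timeout_s}s")
        time.sleep(interval_s)


# Demo with canned responses instead of real HTTP calls.
# "running" is a hypothetical intermediate status.
canned = iter([{"status": "running"}, {"status": "completed", "runs": []}])
final = poll_suite_run(lambda: next(canned), interval_s=0.01)
print(final["status"])  # completed
```

If the API reports other terminal states, such as a failure status, add them to `terminal_statuses` so the loop does not wait for the full timeout.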

Completed run response:

{
  "id": "srun_01hxf6r8nu1st9w3dhmn",
  "status": "completed",
  "runs": [
    {
      "id": "srn_01hxf6r8nu1st9w3dhab",
      "session_id": "sess_01hxf6r8nu1st9w3dhcd",
      "persona_id": "<persona_id>",
      "scenario_id": "<scenario_id>",
      "channel": "text",
      "status": "completed",
      "started_at": "2024-09-01T10:00:05Z",
      "finished_at": "2024-09-01T10:00:42Z",
      "created_at": "2024-09-01T10:00:00Z"
    }
  ]
}
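Once a suite run has completed, a short script can condense the child runs into a status and duration per run. The Python sketch below relies only on fields shown in the response above (`id`, `status`, `started_at`, `finished_at`); `summarize_child_runs` is a hypothetical helper name.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Swap the trailing "Z" for an explicit offset so fromisoformat
    # also works on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize_child_runs(suite_run: dict) -> list[dict]:
    """Return id, status, and wall-clock duration for each child run."""
    rows = []
    for run in suite_run.get("runs", []):
        started, finished = run.get("started_at"), run.get("finished_at")
        duration = None
        if started and finished:
            duration = (parse_ts(finished) - parse_ts(started)).total_seconds()
        rows.append({"id": run["id"], "status": run["status"], "duration_s": duration})
    return rows


suite_run = {
    "id": "srun_01hxf6r8nu1st9w3dhmn",
    "status": "completed",
    "runs": [
        {
            "id": "srn_01hxf6r8nu1st9w3dhab",
            "status": "completed",
            "started_at": "2024-09-01T10:00:05Z",
            "finished_at": "2024-09-01T10:00:42Z",
        }
    ],
}
print(summarize_child_runs(suite_run))
# [{'id': 'srn_01hxf6r8nu1st9w3dhab', 'status': 'completed', 'duration_s': 37.0}]
```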

Knowledge base eval suites

While simulation suites test full conversation quality, knowledge base eval suites focus specifically on retrieval accuracy — verifying that your KB returns the right chunks and generates accurate answers for known questions.

Create a KB eval suite

curl -X POST https://api-sandbox.featherhq.com/v1/knowledge-base/evals/suites \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product FAQ Quality",
    "pass_threshold": 70
  }'

`pass_threshold` is the minimum percentage of cases that must pass for the overall suite run to be considered successful. In the example above, at least 70% of cases must pass.
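The arithmetic behind the threshold can be sketched in a few lines of Python. It assumes the pass rate is the share of passing cases expressed as a percentage and that a rate exactly equal to the threshold counts as successful; `suite_passed` is a hypothetical helper, not part of the API.

```python
def suite_passed(case_results: list[bool], pass_threshold: float) -> bool:
    """Return True when the percentage of passing cases meets the threshold."""
    if not case_results:
        raise ValueError("no case results to score")
    pass_rate = 100.0 * sum(case_results) / len(case_results)
    # Assumption: a pass rate equal to the threshold is treated as successful.
    return pass_rate >= pass_threshold


print(suite_passed([True] * 7 + [False] * 3, 70))  # True: 7 of 10 is 70%
print(suite_passed([True, True, False], 70))       # False: 2 of 3 is about 66.7%
```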

Add eval cases

Each case is a question you know the answer to. Feather will retrieve chunks, generate an answer, and evaluate it against your expected answer and success criteria.

curl -X POST \
  "https://api-sandbox.featherhq.com/v1/knowledge-base/evals/suites/<suite_id>/cases" \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I reset my password?",
    "expected_answer": "Click Forgot Password on the login screen.",
    "success_criteria": [
      "Mentions the login screen",
      "Mentions the Forgot Password link or button"
    ]
  }'

Add as many cases as you need. A good eval suite covers common questions, edge cases, and any areas where retrieval has failed before.

Trigger a run

Run the suite against one or more knowledge bases. `kb_ids` takes an array, so you can test multiple KBs in one run.

curl -X POST \
  "https://api-sandbox.featherhq.com/v1/knowledge-base/evals/suites/<suite_id>/runs" \
  -H "x-api-key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"kb_ids": ["<kb_id>"]}'

Review per-case results

Fetch the detailed results once the run completes. Each case shows you exactly what was retrieved and why it passed or failed.

curl "https://api-sandbox.featherhq.com/v1/knowledge-base/evals/runs/<run_id>/results" \
  -H "x-api-key: <your_api_key>"

Per-case results (paginated):

{
  "items": [
    {
      "id": "kbcase_01hxg7s9nv2tu0x4einp",
      "question_snapshot": "How do I reset my password?",
      "expected_answer_snapshot": "Click Forgot Password on the login screen.",
      "generated_answer": "To reset your password, click the 'Forgot Password' link on the login screen. You'll receive an email with a reset link within a few minutes.",
      "passed": true,
      "judge_reasoning": "The generated answer correctly identifies the login screen and the Forgot Password link, satisfying both success criteria.",
      "retrieved_chunks": [
        {
          "chunk_id": "chunk_01hxg7s9nv2tu0x4eabc",
          "document_id": "doc_01hxg7s9nv2tu0x4edef",
          "document_title": "Account Management",
          "score": 0.94,
          "content_preview": "If you've forgotten your password, click 'Forgot Password' on the login screen..."
        }
      ]
    }
  ],
  "next_cursor": null,
  "has_more": false
}
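Because the results are paginated, a client typically loops until `has_more` is false. The Python sketch below follows `next_cursor` from page to page. How the cursor is passed back to the API (for example, as a query parameter) is not shown on this page, so the page fetch is injected as a function; `iter_results` and the canned pages are hypothetical.

```python
from typing import Callable, Iterator, Optional


def iter_results(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every result item, following next_cursor until has_more is false."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        if not page.get("has_more"):
            return
        cursor = page.get("next_cursor")
        if cursor is None:
            # Defensive stop: has_more was true but no cursor was provided.
            return


# Demo with two canned pages keyed by the cursor that requests them.
pages = {
    None: {"items": [{"id": "case_a"}], "next_cursor": "c1", "has_more": True},
    "c1": {"items": [{"id": "case_b"}], "next_cursor": None, "has_more": False},
}
print([item["id"] for item in iter_results(lambda cursor: pages[cursor])])
# ['case_a', 'case_b']
```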

Use `retrieved_chunks` and `judge_reasoning` together to diagnose failures: low chunk scores point to an embedding or chunking issue, while high chunk scores paired with a failing answer suggest a generation problem.
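That triage rule can be applied automatically across a run's results. The Python sketch below labels each failing case as a likely retrieval or generation problem. The 0.5 score cutoff is an arbitrary assumption that should be tuned against your own data, and `triage_case` is a hypothetical helper.

```python
def triage_case(result: dict, min_chunk_score: float = 0.5) -> str:
    """Classify a per-case result as 'ok', 'retrieval', or 'generation'."""
    if result["passed"]:
        return "ok"
    scores = [chunk["score"] for chunk in result.get("retrieved_chunks", [])]
    # No chunks, or only weak matches: the retriever likely missed the source.
    if not scores or max(scores) < min_chunk_score:
        return "retrieval"
    # A strong chunk was retrieved but the answer still failed.
    return "generation"


failing = {"passed": False, "retrieved_chunks": [{"score": 0.94}]}
print(triage_case(failing))  # generation
```

Reading `judge_reasoning` for the cases flagged here remains the most direct way to confirm the diagnosis.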

Explore further

Evaluators API Reference

Full schema reference for evaluator objects, binding options, result formats, and severity levels.

Simulations & Scenarios

API reference for personas, scenarios, sim suites, and suite run objects — including all configuration options.