Testing and Debugging Agents — How to Know It Works

Episode 9 · 18 min

Introduction: “I Think It Works” Is Not Good Enough

With regular software, testing is straightforward: give a specific input, get a specific output. If 2+2 equals 4, it is correct. If not, it is wrong.

With Agents, things are different. Agent output is non-deterministic — give the same input, and you might get two different answers. How do you test something that gives a different answer every time?

In this episode, you will learn techniques for testing and debugging Agents — from logging to professional tools.

Why Is Agent Testing Hard?

1. Non-deterministic Output

The LLM may generate different text each time. You cannot use assertEqual and say “the output must be exactly this.”
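A tiny illustration of the problem, with two hypothetical model responses:

```python
# Two valid answers to "What is 2 + 2?": same meaning, different wording.
response_a = "The answer is 4."
response_b = "2 + 2 equals 4."

# Exact-match testing rejects one of them even though both are correct:
print(response_a == response_b)  # False

# Checking a property of the output accepts both:
print("4" in response_a and "4" in response_b)  # True
```

This property-based style is the basis of the behavioral tests later in this episode.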

2. Chain of Decisions

The Agent does not give a simple response — it makes a series of decisions: which tool to use? What parameters? How to interpret the result? Each one could go wrong.

3. External Service Dependencies

The Agent depends on the LLM API, external tools, and databases. Any of these could change or become unavailable.

Level 1: Log Everything (Trace Logging)

First of all, you need to see what the Agent is doing. Without logs, debugging an Agent is like fixing a car in the dark.

import json
import time
from datetime import datetime

class AgentTracer:
    """Records all Agent actions."""

    def __init__(self):
        self.traces = []
        self.current_trace = None

    def start_trace(self, user_input: str):
        """Start a new trace."""
        self.current_trace = {
            "id": f"trace_{int(time.time())}",
            "timestamp": datetime.now().isoformat(),
            "input": user_input,
            "steps": [],
            "total_tokens": 0,
            "total_cost": 0.0,
            "total_time": 0.0,
        }
        self._start_time = time.time()

    def log_step(self, step_type: str, data: dict):
        """Record a step."""
        step = {
            "type": step_type,
            "timestamp": datetime.now().isoformat(),
            "data": data,
        }
        self.current_trace["steps"].append(step)

        if "usage" in data:
            tokens = data["usage"].get("total_tokens", 0)
            self.current_trace["total_tokens"] += tokens
            # Assumes ~$0.15 per 1M tokens (gpt-4o-mini input pricing);
            # adjust the rate for your model.
            self.current_trace["total_cost"] += tokens * 0.00015 / 1000

    def end_trace(self, output: str):
        """End the trace."""
        self.current_trace["output"] = output
        self.current_trace["total_time"] = time.time() - self._start_time
        self.traces.append(self.current_trace)
        return self.current_trace

    def print_trace(self, trace: dict | None = None):
        """Pretty-print the trace."""
        t = trace or self.current_trace
        print(f"\n{'='*60}")
        print(f"Trace: {t['id']}")
        print(f"Input: {t['input']}")
        print(f"{'='*60}")

        for i, step in enumerate(t["steps"], 1):
            print(f"\n--- Step {i}: {step['type']} ---")
            for key, value in step["data"].items():
                if key == "usage":
                    continue
                val_str = str(value)[:200]
                print(f"  {key}: {val_str}")

        print(f"\n{'='*60}")
        print(f"Output: {t['output'][:200]}")
        print(f"Tokens: {t['total_tokens']}")
        print(f"Cost: ${t['total_cost']:.4f}")
        print(f"Time: {t['total_time']:.2f}s")
        print(f"{'='*60}\n")
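To make the structure concrete, here is what one completed trace might look like for a single tool-using turn (all values are illustrative):

```python
import json

# Illustrative trace for one tool-using turn (hypothetical values).
trace = {
    "id": "trace_1700000000",
    "input": "What is the weather in Tehran?",
    "steps": [
        {"type": "llm_call",
         "data": {"model": "gpt-4o-mini", "usage": {"total_tokens": 250}}},
        {"type": "tool_call",
         "data": {"tool": "get_weather", "args": {"city": "Tehran"}}},
        {"type": "tool_result",
         "data": {"result": "21°C, clear"}},
    ],
    "output": "It is 21°C and clear in Tehran.",
    "total_tokens": 250,
}

print(json.dumps(trace, indent=2, ensure_ascii=False))
```

Filtering `steps` by `type` is exactly how the behavioral tests later in this episode check which tool was called.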

Level 2: Unit Testing Tools

Tools are deterministic: specific input, specific output. That makes them easy to unit test.
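The tests below assume a `calculate` tool roughly like this sketch (one possible implementation; it walks the expression's AST instead of calling `eval`, so arbitrary code cannot run):

```python
import ast
import operator

# Whitelist of allowed operations; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> str:
    """Safely evaluate an arithmetic expression."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Operation not allowed")

    try:
        result = _eval(ast.parse(expression, mode="eval").body)
        return f"Result: {result}"
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return f"Error: {e}"

print(calculate("2 + 3"))  # Result: 5
```

Because the whitelist rejects function calls and attribute access, the `test_dangerous_input` case below passes without any string filtering.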

import pytest  # the async tests below also need the pytest-asyncio plugin

# Assuming the tools under test live in a module named `tools`:
from tools import calculate, get_weather, get_current_time

class TestCalculateTool:
    def test_basic_addition(self):
        result = calculate("2 + 3")
        assert "5" in result

    def test_complex_expression(self):
        result = calculate("(10 + 5) * 2")
        assert "30" in result

    def test_division(self):
        result = calculate("10 / 3")
        assert "3.33" in result

    def test_invalid_expression(self):
        result = calculate("hello")
        assert "Error" in result or "error" in result

    def test_dangerous_input(self):
        """Should not allow code execution."""
        result = calculate("__import__('os').system('ls')")
        assert "Error" in result or "allowed" in result


class TestWeatherTool:
    @pytest.mark.asyncio
    async def test_valid_city(self):
        result = await get_weather("Tehran")
        assert "Temperature" in result or "temp" in result

    @pytest.mark.asyncio
    async def test_invalid_city(self):
        result = await get_weather("XYZNotACity")
        assert "Error" in result or "error" in result


class TestTimeTool:
    def test_valid_timezone(self):
        result = get_current_time("Asia/Tehran")
        assert len(result) > 0
        assert "invalid" not in result.lower()

    def test_invalid_timezone(self):
        result = get_current_time("Invalid/Zone")
        assert "invalid" in result.lower() or "Invalid" in result

Level 3: Behavioral Testing

This is where it gets interesting. We cannot check exact output, but we can check behavior.

import pytest
from agent import process_message
# The AgentTracer from Level 1; module name assumed:
from tracing import AgentTracer

tracer = AgentTracer()

class TestAgentBehavior:
    """Test Agent behavior — not exact output."""

    @pytest.mark.asyncio
    async def test_uses_weather_tool_for_weather_query(self):
        """Should use weather tool when asked about weather."""
        tracer.start_trace("What is the weather in Tehran?")
        result = await process_message(
            chat_id=999,
            user_message="What is the weather in Tehran?",
            user_name="Test"
        )
        trace = tracer.end_trace(result)

        tool_steps = [s for s in trace["steps"] if s["type"] == "tool_call"]
        assert len(tool_steps) > 0
        assert tool_steps[0]["data"]["tool"] == "get_weather"

    @pytest.mark.asyncio
    async def test_responds_in_expected_language(self):
        """Response should be in the expected language."""
        result = await process_message(
            chat_id=999,
            user_message="Hello, how are you?",
            user_name="Test"
        )
        # Weak proxy check: only verifies the reply contains letters;
        # a stricter test would use a language-detection library.
        assert any(c.isalpha() for c in result)

    @pytest.mark.asyncio
    async def test_says_dont_know_for_unknown(self):
        """Should say it does not know for unanswerable questions."""
        result = await process_message(
            chat_id=999,
            user_message="What is my grandmother's phone number?",
            user_name="Test"
        )

        negative_phrases = ["don't know", "cannot", "no information", "not able"]
        has_negative = any(p in result.lower() for p in negative_phrases)
        assert has_negative, f"Should say it doesn't know: {result}"

    @pytest.mark.asyncio
    async def test_remembers_context(self):
        """Should remember previous conversation."""
        chat_id = 888

        # First message
        await process_message(
            chat_id=chat_id,
            user_message="My name is Alex",
            user_name="Test"
        )

        # Second message
        result = await process_message(
            chat_id=chat_id,
            user_message="What is my name?",
            user_name="Test"
        )

        assert "Alex" in result, f"Memory did not work: {result}"

Level 4: Golden Dataset

Create a collection of questions and expected answers. Every time you change the Agent, test it against this collection.

GOLDEN_DATASET = [
    {
        "input": "What is the weather in London?",
        "expected_tool": "get_weather",
        "expected_contains": ["London"],
        "category": "weather",
    },
    {
        "input": "What is 25 times 4?",
        "expected_tool": "calculate",
        "expected_contains": ["100"],
        "category": "math",
    },
    {
        "input": "Hello, how are you?",
        "expected_tool": None,
        "expected_contains": [],
        "category": "greeting",
    },
    {
        "input": "What time is it now?",
        "expected_tool": "get_current_time",
        "expected_contains": [":"],
        "category": "time",
    },
]


async def run_golden_tests():
    """Run golden tests."""
    results = []

    for test_case in GOLDEN_DATASET:
        tracer.start_trace(test_case["input"])
        result = await process_message(
            chat_id=777,
            user_message=test_case["input"],
            user_name="GoldenTest"
        )
        trace = tracer.end_trace(result)

        # Check tool
        tool_used = None
        tool_steps = [s for s in trace["steps"] if s["type"] == "tool_call"]
        if tool_steps:
            tool_used = tool_steps[0]["data"]["tool"]

        tool_correct = tool_used == test_case["expected_tool"]

        # Check content
        content_correct = all(
            keyword in result for keyword in test_case["expected_contains"]
        )

        passed = tool_correct and content_correct

        results.append({
            "input": test_case["input"],
            "category": test_case["category"],
            "passed": passed,
            "tool_expected": test_case["expected_tool"],
            "tool_actual": tool_used,
            "output_preview": result[:100],
        })

        status = "PASS" if passed else "FAIL"
        print(f"[{status}] {test_case['input'][:40]}")

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    print(f"\n{passed}/{total} tests passed ({passed/total*100:.0f}%)")

    return results

LangSmith — Professional Monitoring Tool

LangSmith is from the LangChain team but works with any framework. It gives you a web dashboard to see all LLM calls, debug them, and evaluate them.

# Install: pip install langsmith
# Set environment variables:
# LANGCHAIN_API_KEY=your_key
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_PROJECT=my-agent

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(name="agent_response")
def get_response(user_input: str) -> str:
    """Every call is automatically recorded in LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Helpful assistant"},
            {"role": "user", "content": user_input}
        ]
    )
    return response.choices[0].message.content

# Every time you call get_response,
# a complete trace appears in the LangSmith dashboard:
# - Input and output
# - Token usage
# - Execution time
# - Errors

Why is LangSmith useful?

  • See all traces in one dashboard
  • Filter — for example, only traces with errors
  • Shows cost per conversation
  • Build datasets and set up automated evaluations

Regression Testing — Make Sure Nothing Broke

import json

class RegressionTest:
    """Regression test — make sure a new update
    did not break anything."""

    def __init__(self, baseline_file: str):
        self.baseline_file = baseline_file

    def save_baseline(self, results: list):
        """Save current results as baseline."""
        with open(self.baseline_file, "w") as f:
            json.dump(results, f, ensure_ascii=False)

    def compare(self, new_results: list) -> dict:
        """Compare new results with baseline."""
        with open(self.baseline_file) as f:
            baseline = json.load(f)

        regressions = []
        improvements = []

        # Assumes both runs used the same dataset in the same order.
        for old, new in zip(baseline, new_results):
            if old["passed"] and not new["passed"]:
                regressions.append({
                    "input": new["input"],
                    "was": "PASS",
                    "now": "FAIL",
                })
            elif not old["passed"] and new["passed"]:
                improvements.append({
                    "input": new["input"],
                    "was": "FAIL",
                    "now": "PASS",
                })

        return {
            "regressions": regressions,
            "improvements": improvements,
            "regression_count": len(regressions),
            "improvement_count": len(improvements),
        }
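The comparison logic boils down to pairing each baseline result with its counterpart from the new run. A toy run with hand-made results (in practice they come from `run_golden_tests`):

```python
# Toy results from two runs over the same dataset (values made up).
baseline = [
    {"input": "weather question", "passed": True},
    {"input": "math question", "passed": False},
]
new_results = [
    {"input": "weather question", "passed": False},  # got worse
    {"input": "math question", "passed": True},      # got better
]

# Pairwise comparison: same dataset, same order, compared case by case.
regressions = [n["input"] for o, n in zip(baseline, new_results)
               if o["passed"] and not n["passed"]]
improvements = [n["input"] for o, n in zip(baseline, new_results)
                if not o["passed"] and n["passed"]]

print("Regressions:", regressions)    # ['weather question']
print("Improvements:", improvements)  # ['math question']
```

A non-empty regression list is usually a reason to block a deploy, even if the overall pass rate went up.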

Practical Debugging Tips

1. Always turn on verbose mode: during development you should see everything the Agent does, including every LLM call, every tool invocation, and every decision.

2. Isolate the problem: If the Agent acts incorrectly, first determine if the issue is with the LLM or the tool. Test the tool separately. If the tool is correct, the problem is with the prompt or model.

3. Lower the temperature: During testing, set temperature to 0 to make output as reproducible as possible.

4. Save real conversations: Save real user conversations (with privacy considerations) and use them to improve the Agent.

5. Run A/B tests: Run two Agent versions simultaneously and see which performs better.
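For an A/B test to be meaningful, each user should consistently see the same variant across sessions. A minimal sketch of deterministic assignment by hashing the user id (names hypothetical):

```python
import hashlib

def assign_variant(user_id: int, variants=("A", "B")) -> str:
    """Hash the user id so the same user always gets the same variant."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant(42), assign_variant(42))  # same user, same variant
```

Log the variant alongside each trace so that pass rates, costs, and latencies can be compared per variant afterwards.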

Warning: Automate Agent testing. Every time you change the prompt, update the model, or add a new tool — tests should run automatically.

Summary

  • Agent testing is hard but essential
  • Log everything — without logs, debugging is impossible
  • Unit test the tools — they are deterministic
  • Test Agent behavior, not exact output
  • Build a golden dataset and run regression tests
  • Use LangSmith or similar tools for monitoring

Next episode — the final one in the series — we build a complete Research Agent from scratch!