adaptive-testing-tools: a small Python library for Adaptive Random Testing
From one-off LLM eval scripts to a reusable ART primitive you can drop into any Python test harness.
Many evaluation scripts start as a quick loop over test cases and a counter for “how often did this work”.
After a few experiments, the script carries more and more responsibilities: input generation, sampling strategy, distance metrics, logging, metrics, seeding, and reporting.
At that point the code no longer behaves like a simple test harness. It behaves like a small framework that is hard to reuse, hard to explain, and easy to copy-paste incorrectly into the next project.
The underlying problem is that where you sample the input space and what you measure are mixed together. When you want to try a new sampling strategy, such as spreading tests out instead of clustering them, you often have to rewrite large parts of the script.
In this post I introduce adaptive-testing-tools, a small Python library that separates these concerns. It gives you a reusable Adaptive Random Testing (ART) primitive for choosing diverse test inputs, plus a few simple random generators, so your own code can focus on the behaviour under test and the metrics you care about rather than on the sampling algorithm.
In this post, I will:
Explain what the library does and who it is for.
Show how it wraps the ART loop I previously wrote by hand for a tool-calling LLM.
Outline how you can reuse the same pattern for your own prompts, agents, and APIs.
If you care about LLM reliability and you like tight, focused tools, this is meant for you.
GitHub: https://github.com/khaled-e-a/adaptive-testing-tools
PyPI: https://pypi.org/project/adaptive-testing-tools/
The shift: from ad-hoc diversity loops to a shared ART primitive
Adaptive Random Testing was introduced as an enhancement to pure random testing.
Instead of sampling test inputs uniformly and accepting whatever clustering arises by chance, ART tries to spread test cases out over the input space, under the intuition that failure regions often occupy contiguous patches rather than isolated points.
One simple and widely used ART family is Fixed-Size Candidate Set (FSCS).
The idea is algorithmically simple:
At each iteration, generate a small pool of random candidate inputs.
For each candidate, compute its distance to every previously tested input.
For each candidate, keep the minimum distance to previous tests.
Pick the candidate whose minimum distance is largest.
Execute the test on that candidate and record the result.
Repeat this for a fixed number of iterations and you get a test suite that is still random but intentionally diverse.
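For readers who prefer code to prose, here is a minimal, library-agnostic sketch of that selection step for string inputs. The names fscs_select and distance are placeholders of my own, not part of any particular package:

import random
from typing import Callable, List

def fscs_select(
    previous: List[str],
    generate: Callable[[random.Random], str],
    distance: Callable[[str, str], int],
    rng: random.Random,
    pool_size: int = 10,
) -> str:
    # Generate a small pool of random candidate inputs.
    candidates = [generate(rng) for _ in range(pool_size)]
    if not previous:
        return candidates[0]  # nothing to be far away from yet
    # Keep each candidate's minimum distance to the tests run so far,
    # then pick the candidate whose minimum distance is largest.
    return max(
        candidates,
        key=lambda c: min(distance(c, p) for p in previous),
    )

Calling fscs_select once per iteration and appending the chosen input to previous reproduces the loop described above.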
In the LLM world, recent work has started to apply diversity-based adaptive testing techniques to prompt templates and LLM-based applications, using string distances to spread test inputs.
In my own earlier article on testing tool-calling LLMs with Adaptive Random Testing, I implemented a minimal FSCS loop directly inside the evaluation script for an incident-triage agent that reads log prefixes like [AUTH] and [BILL] and is supposed to call the matching telemetry tool once, then summarize service health.
The idea worked.
The code did not feel reusable.
It mixed together:
Input generation (realistic log snippets).
The model and tool-calling interaction.
The ART loop and distance logic.
The reporting and Tool Call Accuracy metric.
adaptive-testing-tools is the extraction of that middle layer into a separate, reusable unit.
What adaptive-testing-tools provides
At its core, the library offers one main abstraction:
adaptive_random_testing(
    generate_candidate: Callable[[Random], str],
    evaluate: Callable[[str], R],
    *,
    pool_size: int = 10,
    max_iterations: int = 5,
    seed: Optional[int] = None,
    distance_fn: Callable[[str, str], int] = levenshtein_distance,
) -> List[AdaptiveSample[R]]

This function implements FSCS-style Adaptive Random Testing for string-valued inputs, with a configurable distance function that defaults to Levenshtein edit distance (using rapidfuzz when available).
It expects two callables from you:

generate_candidate(rng: Random) -> str
A pure generator that receives a random.Random instance and returns a candidate input to test.

evaluate(candidate: str) -> R
A function that runs the system under test on that candidate and returns any result you care about (a boolean, a structured object, or a tuple).
The return value is a list of AdaptiveSample[R] instances.
Each AdaptiveSample stores:
iteration: the iteration index (1-based).
candidate: the exact string that was tested.
result: whatever your evaluate function returned.
distance_to_previous: the minimum distance from this candidate to all prior candidates, or None for the first sample.
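Putting the pieces together, a minimal end-to-end call might look like the following sketch. The import path adaptive_testing_tools is my assumption; check the project README for the exact module name:

import random

# Assumption: the API is exposed from the top-level module;
# adjust the import if the README says otherwise.
from adaptive_testing_tools import adaptive_random_testing

def generate_candidate(rng: random.Random) -> str:
    # Toy generator: a subsystem tag plus a random latency figure.
    return f"[{rng.choice(['AUTH', 'BILL'])}] latency {rng.randint(10, 500)}ms"

def evaluate(candidate: str) -> bool:
    # Stand-in for the system under test: flag inputs mentioning AUTH.
    return candidate.startswith("[AUTH]")

samples = adaptive_random_testing(
    generate_candidate=generate_candidate,
    evaluate=evaluate,
    pool_size=10,
    max_iterations=5,
    seed=42,
)

for sample in samples:
    print(sample.iteration, sample.candidate, sample.result, sample.distance_to_previous)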
On top of that, there are a few simple helpers:
random_int(low, high)
random_choice(options)
random_string(length, alphabet=None)
These are small utilities designed for quickly sketching candidate generators in tests, without pulling in a heavier property-based testing framework.
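As an illustration, a quick-and-dirty generator might lean on these helpers as below. I have not verified whether they draw from the rng passed to generate_candidate or from Python's global random state, so for strictly seeded replay you may prefer calling methods on rng directly:

# Assumed import path; adjust to the project's README if it differs.
from adaptive_testing_tools import random_choice, random_int, random_string

def generate_candidate(rng):
    # rng is accepted to match the expected signature, even though the
    # helpers shown here are called without it.
    tag = random_choice(["[AUTH]", "[BILL]", "[RISK]"])
    latency = random_int(10, 900)
    trace = random_string(8)
    return f"{tag} handshake took {latency}ms, trace id {trace}"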
Design principle 1: Separate “where to test” from “what you measure”
The key design choice is that adaptive_random_testing never makes assumptions about what counts as a failure.
It only decides where to sample next, based on distances between string inputs.
All correctness criteria, metrics, and logging live in your evaluate function and in how you post-process the list of AdaptiveSample results.
This keeps the library small and lets you reuse it both for:
Traditional string-based systems (parsers, APIs, small DSLs).
Prompt templates and LLM agents, where the expensive part is running the model and inspecting outputs.
Design principle 2: “Boring” defaults, explicit configuration
The default configuration aims to be boring and predictable:
String distance defaults to Levenshtein.
A single seed argument controls deterministic replay.
pool_size and max_iterations are plain integers.
If you care about more advanced behavior (other distance metrics, custom diversity notions, or more elaborate stopping criteria), you stay in control by swapping out the distance_fn or wrapping the function in a higher-level driver.
This also keeps the library aligned with how diversity-based ART is used in the literature for string-like test domains.
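For example, swapping in a token-level distance is a one-line change at the call site, reusing the generate_candidate and evaluate callables from earlier. The token_set_distance function below is an illustrative sketch of mine, not something shipped with the library:

def token_set_distance(a: str, b: str) -> int:
    # Count tokens that appear in one string but not the other.
    return len(set(a.split()) ^ set(b.split()))

samples = adaptive_random_testing(
    generate_candidate=generate_candidate,
    evaluate=evaluate,
    pool_size=10,
    max_iterations=5,
    seed=7,
    distance_fn=token_set_distance,
)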
Process: integrating the library into the tool-calling LLM harness
In the earlier article, I introduced an incident triage harness:
Inputs are log snippets with subsystem prefixes like [AUTH], [BILL], [RISK].
A tool-calling LLM must select the matching telemetry function exactly once.
A fake telemetry backend returns structured metrics (error rate, latency, retry backlog).
The model must respond with a structured summary that includes the service tag and metrics.
Evaluation extracts tool events from the model’s response and computes Tool Call Accuracy.
The new version of the script keeps all of that logic intact.
The only change is how we generate and schedule the log inputs.
Step 1: Define a realistic candidate generator
The function random_log_line(rng: random.Random, length_range: Tuple[int, int]) -> str builds synthetic but structured log lines:
It stitches together body phrases like "error spike after {ms}ms handshake in {region}" or "retry queue length {count}k and backlog warning in {region}".
It varies numerical fields such as ms and count.
It uses the subsystem prefixes ([AUTH], [BILL], …) both inside the body and as the tag at the beginning of the line.
The result is a log snippet that looks like true incident telemetry, but is fully generated and reproducible from a seed.
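A sketch of such a generator, with illustrative phrase templates and regions rather than the exact ones from the script, might look like this:

import random
from typing import Tuple

SUBSYSTEMS = ["AUTH", "BILL", "RISK"]

def random_log_line(rng: random.Random, length_range: Tuple[int, int]) -> str:
    low, high = length_range
    tag = rng.choice(SUBSYSTEMS)
    region = rng.choice(["eu-west-1", "us-east-2", "ap-south-1"])
    bodies = [
        f"error spike after {rng.randint(50, 900)}ms handshake in {region}",
        f"retry queue length {rng.randint(1, 40)}k and backlog warning in {region}",
    ]
    line = f"[{tag}] {rng.choice(bodies)}"
    # Illustrative length handling: extend with extra clauses until the
    # lower bound is reached, then trim to the upper bound.
    while len(line) < low:
        line += f"; {rng.choice(bodies)}"
    return line[:high]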
This function is a direct fit for generate_candidate once you adapt the signature to accept a Random instance.
In the code, this happens inside run_adaptive_test:
def generate_candidate(rng: random.Random) -> str:
    return random_log_line(rng, length_range)

One line connects your realistic input model to the adaptive sampler.
Step 2: Wrap the existing model interaction as evaluate
All of the tool-calling logic, tool definitions, and the fake telemetry backend live in call_model(text_input: str) -> Tuple[str, List[ToolEvent]].
That function:
Assembles the prompt using PROMPT_TEMPLATE and TOOL_GUIDANCE.
Invokes the OpenAI client with tools configured.
Handles required tool outputs in a loop, using summarize_log to synthesize the JSON for each call.
Records (tool_name, is_correct) pairs in tool_events.
Extracts the final text reply.
This function is already a good candidate for evaluate.
In run_adaptive_test, it is used directly:
def evaluate(candidate: str) -> EvaluationResult:
    return call_model(candidate)

The return type EvaluationResult is defined as:

ToolEvent = Tuple[str, bool]
EvaluationResult = Tuple[str, List[ToolEvent]]

So each AdaptiveSample[EvaluationResult] produced by the library now carries:
The raw model response text.
A list of tool events with correctness flags.
The input candidate and its distance to previous ones.
Step 3: Call adaptive_random_testing instead of a hand-rolled FSCS loop
With generate_candidate and evaluate in place, the ART loop becomes a single call:
samples = adaptive_random_testing(
    generate_candidate=generate_candidate,
    evaluate=evaluate,
    pool_size=pool_size,
    max_iterations=max_iterations,
    seed=seed,
)

The rest of run_adaptive_test is pure post-processing:
It counts how many tool calls were correct vs total, to compute Tool selection accuracy.
It returns the samples list along with these aggregate counts.
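A minimal sketch of that post-processing, using the EvaluationResult tuple defined above (the variable names here are mine, not necessarily those in the script):

correct = 0
total = 0
for sample in samples:
    _reply_text, tool_events = sample.result
    for _tool_name, is_correct in tool_events:
        total += 1
        correct += int(is_correct)

tool_selection_accuracy = correct / total if total else 0.0
print(f"Tool selection accuracy: {tool_selection_accuracy:.1%}")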
The main() function then:
Prints a small slice of sample inputs and their tool-calling behavior.
Reports the final Tool selection accuracy as a percentage.
In other words, the library replaces only the selection strategy for which inputs to test next.
All the domain-specific logic remains in your script, where it belongs.
What actually changed in my workflow
Pulling the ART logic into adaptive-testing-tools has changed my workflow in a few concrete ways.
First, the incident triage script is clearer.
When I share or revisit it, the logic reads as:
“Here is how I model log inputs.”
“Here is how I wire up the LLM and tools.”
“Here is how I evaluate each input.”
“Here is how I ask for a spread-out sample of inputs.”
The FSCS algorithm and distance choices are no longer mixed into the middle of that story.
Second, experimentation is easier.
If I want to try:
A different string distance (e.g., token-level instead of character-level).
A larger candidate pool for more aggressive diversity.
A different test budget (more or fewer iterations).
I change function arguments rather than rewriting loops.
This is the kind of change you want when you are iterating on evaluation strategy: a small parameter tweak, not a refactor.
Third, the library is now usable outside LLM testing.
The same primitive works for:
API endpoints that accept JSON with string-heavy fields.
Small interpreters or DSLs.
Any system where “input is a string” and you care about diversity over that string space.
This aligns with broader work on Adaptive Random Testing, which has been applied to a variety of domains where failures cluster in regions of the input space.
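As a small non-LLM illustration, the same call can drive diverse inputs into an ordinary string-processing function. Everything below except adaptive_random_testing itself is made up for the example, and the import path is again an assumption:

import json
import random

from adaptive_testing_tools import adaptive_random_testing  # assumed import path

def generate_candidate(rng: random.Random) -> str:
    # Random, occasionally malformed JSON-ish payloads.
    key = "".join(rng.choice("abcxyz") for _ in range(rng.randint(1, 6)))
    value = rng.choice(["1", "true", '"txt"', "{", ""])
    return f'{{"{key}": {value}}}'

def evaluate(candidate: str) -> bool:
    # The result is simply "did the parser accept this input?".
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False

samples = adaptive_random_testing(
    generate_candidate=generate_candidate,
    evaluate=evaluate,
    pool_size=10,
    max_iterations=20,
    seed=0,
)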
Why this matters for LLM evaluation
Adaptive testing is gaining attention in LLM evaluation research because brute-force static benchmarks do not scale with the variability of model behavior and the cost of queries.
Several recent papers explore adaptive or diversity-based selection of test inputs for LLM applications, using distance metrics and feedback from previous runs to prioritize informative cases.
The tiny library here is not a full framework.
It is a building block.
It lets you:
Treat prompt templates and agents as functions f: string -> output.
Define your own correctness checks and metrics (Tool Call Accuracy, robustness to log variation, etc.).
Use ART to sample the input space more systematically without committing to a heavyweight tool.
If you are already comfortable writing Python harnesses around your LLM systems, then adaptive-testing-tools gives you a simple, inspectable way to bring in adaptive sampling while staying close to the code.
How to try this pattern on your own project
If you want to adapt this approach, here is a simple checklist:
Install the library:
pip install adaptive-testing-tools

Pick one unit of behavior to test.
For example:
A single tool-calling prompt.
A small agent loop.
A specific API or function that wraps your model.
Write a generate_candidate(rng) that models realistic inputs.
Model the structure that matters:
Tags, prefixes, and important keywords.
Varying numeric fields.
Regions where you suspect brittleness.
Wrap your system in evaluate(candidate).
Run your model or system.
Extract whatever signals you care about (correctness flags, intermediate actions, latencies).
Return them as a result object or tuple.
Call adaptive_random_testing with a small pool and budget.
Start with something like:
samples = adaptive_random_testing(
    generate_candidate,
    evaluate,
    pool_size=10,
    max_iterations=50,
    seed=1234,
)

Post-process samples into metrics and examples.
Compute aggregate metrics (accuracy, error rates, coverage-like proxies).
Print or log a handful of interesting failures and edge cases.
Optionally, persist samples to disk for reproducible analysis.
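For that last point, one simple option is a JSON Lines dump of the fields each AdaptiveSample exposes, applied to the samples list from the previous step; the file name and serialization strategy here are just an example:

import json

with open("art_samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        record = {
            "iteration": sample.iteration,
            "candidate": sample.candidate,
            "result": sample.result,
            "distance_to_previous": sample.distance_to_previous,
        }
        # default=repr keeps the dump working even if result is not JSON-serializable.
        f.write(json.dumps(record, default=repr) + "\n")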
You now have an adaptive test harness around a concrete LLM behavior, without writing any ART internals yourself.
What is next and how you can follow
Right now, adaptive-testing-tools is intentionally small.
My next steps will likely include:
Experimenting with alternative distance functions for long or structured prompts.
Adding light utilities for summarizing AdaptiveSample collections into simple reports.
Trying the same pattern on non-LLM code, to keep the library general.
I will keep sharing what works, what breaks, and what I change in my own evaluation setups.
If you are building or reviewing LLM agents and want to follow these experiments:
Try the library in one of your own test harnesses.
Note where it helps and where it feels too low-level.
Share feedback, issues, or ideas for small, composable additions.
And if you want to follow the ongoing story – from one-off scripts, to small tools, to more systematic LLM evaluation – consider subscribing so you see the next iteration.


