From 4407aefdf5767ff776117668e005183ebb477932 Mon Sep 17 00:00:00 2001
From: Alex O'Connell
Date: Sun, 21 Dec 2025 13:31:43 -0500
Subject: [PATCH] more synthesizing scenarios + clean up example formatting

---
 data/README.md                       |  20 ++
 data/{device_types.py => devices.py} |  80 ++++++
 data/generate_data.py                | 370 ++++++++++-----------------
 data/synthesize.py                   | 359 ++++++++++++++++++++++++--
 data/tools.py                        |   8 +
 5 files changed, 584 insertions(+), 253 deletions(-)
 rename data/{device_types.py => devices.py} (74%)

diff --git a/data/README.md b/data/README.md
index 6201bc0..7a90a4c 100644
--- a/data/README.md
+++ b/data/README.md
@@ -67,6 +67,26 @@ The response pile is a CSV with the following headers: `service,response,languag
 
 Generating the full dataset using the python script will print out a warning for any responses that are missing for a persona.
 
+## Synthesizing new pile data
+You can quickly append fresh examples to the CSV piles without editing them manually by running `synthesize.py`. The script talks to the configured LLM and writes the generated rows directly into the per-language pile files.
+
+Examples:
+
+```bash
+# Append 25 failed tool-call recoveries and 25 refusals in Spanish
+python3 synthesize.py --language spanish --model gpt-oss-120b --failed-tool-calls 25 --refusals 25 --concurrency 6
+
+# Generate new actions plus matching refusal samples in German
+python3 synthesize.py --language german --actions 100 --refusals 40 --model gpt-oss-120b
+```
+
+Useful flags:
+- `--failed-tool-calls`: number of `pile_of_failed_tool_calls.csv` rows to synthesize.
+- `--refusals`: number of `pile_of_refusals.csv` rows to synthesize.
+- `--actions`, `--status`, `--devices`: existing knobs for the other piles.
+
+The script automatically routes generations to the correct language-specific pile under `data/piles//`.
+
 ## Adding new Home Assistant functionality
 TODO
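
Review note: the flags documented in the README hunk could be declared with `argparse` roughly as sketched below. This is an illustration only, not the contents of `synthesize.py`. The flag names come from the README text, while `build_parser`, `requested_counts`, every default value, and the integer type of each count flag are assumptions.

```python
import argparse

# Count flags named in the README hunk; each is assumed to take an integer.
COUNT_FLAGS = ("--actions", "--status", "--devices", "--failed-tool-calls", "--refusals")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Illustrative sketch of the synthesize CLI")
    parser.add_argument("--language", default="english", help="pile language (default assumed)")
    parser.add_argument("--model", default=None, help="model name passed to the configured LLM")
    parser.add_argument("--concurrency", type=int, default=4, help="parallel requests (default assumed)")
    for flag in COUNT_FLAGS:
        # argparse turns "--failed-tool-calls" into the attribute "failed_tool_calls".
        parser.add_argument(flag, type=int, default=0, help="number of rows to synthesize")
    return parser


def requested_counts(args: argparse.Namespace) -> dict:
    """Return only the piles for which a positive row count was requested."""
    names = [flag.lstrip("-").replace("-", "_") for flag in COUNT_FLAGS]
    return {name: getattr(args, name) for name in names if getattr(args, name) > 0}


if __name__ == "__main__":
    # Parse the first example command from the README instead of sys.argv.
    example = [
        "--language", "spanish", "--model", "gpt-oss-120b",
        "--failed-tool-calls", "25", "--refusals", "25", "--concurrency", "6",
    ]
    parsed = build_parser().parse_args(example)
    print(parsed.language, requested_counts(parsed))
```

Defaulting every count to zero and filtering on positive values means a run only touches the piles that were explicitly requested, which matches how the two README examples combine flags.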