* update and polish gpqa eval
* fix typo
* Update evaluation/gpqa/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/gpqa/run_infer.py
Co-authored-by: Graham Neubig <neubig@gmail.com>
* add headless mode to all appropriate agent controller calls
* delegate set to error when in headless mode
* try to deduplicate a bit
* make headless_mode default to True and only change it to False for AgentSession
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README