Xingyao Wang
50c13aad98
[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396)
2024-10-15 21:34:52 +08:00
Robert Brennan
01ae22ef57
Rename OpenDevin to OpenHands (#3472)
* Replace OpenDevin with OpenHands
* Update CONTRIBUTING.md
* Update README.md
* Update README.md
* update poetry lock; move opendevin folder to openhands
* fix env var
* revert image references in docs
* revert permissions
* revert permissions
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-20 00:44:54 +08:00
Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly (#2491)
2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Frank Xu
48151bdbb0
[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170)
* add webarena, and revamp messaging for webarena eval
* add changes for browsergym
* update infer script
* fix unit tests
* update
* add multiple run for miniwob
* update instruction, remove personal path
* update
* add code for getting final reward, fix integration, add results
* add avg cost calculation
2024-06-06 09:01:20 +08:00