* [eval] increase timeout for swebench eval init/complete
* allow CmdRunAction to optionally block when .timeout is setted
* fix unit test for serialization
* fix unit tests for security analyzer
* fix integration tests
* add more timeout
* move filematching logic into server
* wait until ready before returning
* show loading message instead of empty
* logspam
* delint
* fix type
* add a few more default ignores
* Reduce runtime tests duration by running them across CPUs
* fix hardcoded image name
* test two cpus
* Test folder change
* Up the CPU to 4 again to test
* Change to 3 CPUs
* Down to 2
* Add param to remove all openhands containers
* Add comment
* Add reruns just in case
* Fix ordering of if
* Add Handling of Cache Prompt When Formatting Messages
* Fix Value for Cache Control
* Fix Value for Cache Control
* Update openhands/core/message.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
* Fix lint error
* Serialize Messages if Propt Caching Is Enabled
* Remove formatting message change
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
* dummy test change
* regen yml: 1st install python 3.11, then poetry
* fix caching for poetry; old entry for python was rather useless
* fix steps order (cache before poetry)
* add poetry caching to ghcr_runtime; fix fork conditions
* ghcr_runtime: more caching actions; condition fixes
* fix interim action error (order of steps)
* cache@v4 instead of v3
* fixed interim typo for 2 fork conditions
* runtime/test_env_vars: compacted multiple tests into one to reduce time
* ugh if fork condition changes again
* (feat) making prompt caching optional instead of enabled default
At present, only the Claude models support prompt caching as a experimental feature, therefore, this feature should be implemented as an optional setting rather than being enabled by default.
Signed-off-by: Yi Lin <teroincn@gmail.com>
* handle the conflict
* fix unittest mock return value
* fix lint error in whitespace
---------
Signed-off-by: Yi Lin <teroincn@gmail.com>
* feat: add SWE-bench fullset support
* fix instance image list
* update eval script and documentation
* increase timeout for remote runtime
* add push script
* handle the case when ret push is an generator
* update pbar
* set SWE-Bench default to run SWE-Bench lite
* add script to cleanup remote runtime
* fix the cases when tag is too long
* update README
* update readme for cleanup
* rename od to oh
* Update evaluation/swe_bench/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* gets API key and Runtime from env var
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
* CodeActAgent: fix message prep if prompt caching is not supported
* fix python version in regen tests workflow
* fix in conftest "mock_completion" method
* add disable_vision to LLMConfig; revert change in message parsing in llm.py
* format messages in several files for completion
* refactored message(s) formatting (llm.py); added vision_is_active()
* fix a unit test
* regenerate: added LOG_TO_FILE and FORCE_REGENERATE env flags
* try to fix path to logs folder in workflow
* llm: prevent index error
* try FORCE_USE_LLM in regenerate
* tweaks everywhere...
* fix 2 random unit test errors :(
* added FORCE_REGENERATE_TESTS=true to regenerate CLI
* fix test_lint_file_fail_typescript again
* double-quotes for env vars in workflow; llm logger set to debug
* fix typo in regenerate
* regenerate iterations now 20; applied iteration counter fix by Li
* regenerate: pass FORCE_REGENERATE flag into env
* fixes for int tests. several mock files updated.
* browsing_agent: fix response_parser.py adding ) to empty response
* test_browse_internet: fix skipif and revert obsolete mock files
* regenerate: fi bracketing for http server start/kill conditions
* disable test_browse_internet for CodeAct*Agents; mock files updated after merge
* missed to include more mock files earlier
* reverts after review feedback from Li
* forgot one
* browsing agent test, partial fixes and updated mock files
* test_browse_internet works in my WSL now!
* adapt unit test test_prompt_caching.py
* add DEBUG to regenerate workflow command
* convert regenerate workflow params to inputs
* more integration test mock files updated
* more files
* test_prompt_caching: restored test_prompt_caching_headers purpose
* file_ops: fix potential exception, like "cross device copy"; fixed mock files accordingly
* reverts/changes wrt feedback from xingyao
* updated docs and config template
* code cleanup wrt review feedback