Commit Graph

5277 Commits

Author SHA1 Message Date
Reinier van der Leer
a5de79beb6 ci(benchmark): Add nightly benchmark workflow
Added autogpts-benchmark-nightly.yml, which will run every night at 02:00 UTC with a selection of challenges.
2024-02-16 17:41:58 +01:00
Reinier van der Leer
483c01b681 lint(benchmark): Remove unnecessary pass statement in __main__.py 2024-02-16 17:27:56 +01:00
Reinier van der Leer
992b8874fc chore: Update agbenchmark dependency for agent and forge 2024-02-16 17:22:58 +01:00
Reinier van der Leer
2a55efb322 fix(benchmark): Include WebArenaSiteInfo.additional_info (e.g. credentials) in task input
Without the `additional_info`, it is impossible to get past the login page on challenges where that is necessary.
2024-02-16 17:20:44 +01:00
Reinier van der Leer
23d58a3cc0 feat(benchmark/cli): Add challenge list, challenge info subcommands
- Add `challenge list` command with options `--all`, `--names`, `--json`
   - Add `tabular` dependency
   - Add `.utils.utils.sorted_by_enum_index` function to easily sort lists by an enum value/property based on the order of the enum's definition
- Add `challenge info [name]` command with option `--json`
   - Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models
- Refactor `config` subcommand to use `pretty_print_model`
2024-02-16 15:17:11 +01:00
Reinier van der Leer
70e345b2ce refactor(benchmark): load_webarena_challenges
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter

Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
2024-02-16 15:11:48 +01:00
Reinier van der Leer
650a701317 chore: Update agbenchmark dependency for agent and forge 2024-02-15 18:19:06 +01:00
Reinier van der Leer
679339d00c feat(benchmark): Make report output folder configurable
- Make `AgentBenchmarkConfig.reports_folder` directly configurable (through `REPORTS_FOLDER` env variable). The default is still `./agbenchmark_config/reports`.
- Change all mentions of `REPORT_LOCATION` (which fulfilled the same function at some point in the past) to `REPORTS_FOLDER`.
2024-02-15 18:07:45 +01:00
Reinier van der Leer
fd5730b04a feat(agent/telemetry): Distinguish between production and dev environment based on VCS state
- Added a helper function `.app.utils.vcs_state_diverges_from_master()`. This function determines whether the relevant part of the codebase diverges from our `master`.
- Updated `.app.telemetry._setup_sentry()` to determine the default environment name using `vcs_state_diverges_from_master`.
2024-02-15 16:00:30 +01:00
Reinier van der Leer
b7f08cd0f7 feat(agent/telemetry): Enable performance tracing & update opt-in prompt accordingly 2024-02-15 14:46:36 +01:00
Reinier van der Leer
8762f7ab3d fix(forge): Make watchfiles pattern more specific to prevent unwanted (breaking) reloads
This fixes the issue of changes in artifacts triggering an application reload (which caused connection errors for in-progress requests).
2024-02-15 13:42:38 +01:00
Reinier van der Leer
a9b7b175ff fix(agent/profile_generator): Improve robustness by leveraging create_chat_completion's parse handling 2024-02-15 11:48:07 +01:00
Reinier van der Leer
52b93dd84e fix(cli/agent start): Wait for applications to finish starting before returning
- Added a helper function `wait_until_conn_ready(port)` to wait for the benchmark and agent applications to finish starting
- Improved the CLI's own logging (within the `agent start` command)
2024-02-15 11:26:26 +01:00
Reinier van der Leer
6a09a44ef7 lint(agent): Fix telemetry.py linting error & formatting 2024-02-14 23:31:35 +01:00
Toran Bruce Richards
32a627eda9 Add Privacy Policy link to telementry opt-in. 2024-02-14 16:42:34 +00:00
Reinier van der Leer
67bafa6302 fix(autogpt/llm): AssistantChatMessage.tool_calls default [] instead of None
OpenAI ChatCompletion calls fail when `tool_calls = None`. This issue came to light after 22aba6d.
2024-02-14 14:34:04 +01:00
Reinier van der Leer
6017eefb32 ci: Enable telemetry in CI runs on master 2024-02-14 12:03:54 +01:00
Reinier van der Leer
ae197fc85f feat(agent/telemetry): Distinguish between users
This allows us to get a much better sense of how many users actually experience issues, and how issue occurrence is distributed among users.
2024-02-14 11:50:45 +01:00
Reinier van der Leer
22aba6dd8a fix(agent/llm): Include bad response in parse-fix prompt in OpenAIProvider.create_chat_completion
Apparently I forgot to also append the response that caused the parse error before throwing it back to the LLM and letting it fix its mistake(s).
2024-02-14 11:20:31 +01:00
Reinier van der Leer
88bbdfc7fc ci: Pick 3 challenges to run with --mock in smoke test CI 2024-02-14 02:30:03 +01:00
Reinier van der Leer
d0c9b7c405 lint(benchmark): Remove unused imports 2024-02-14 01:34:30 +01:00
Reinier van der Leer
e7698a4610 chore(agent): Update forge and agbenchmark dependencies 2024-02-14 01:32:28 +01:00
Reinier van der Leer
ab05b7ae70 chore(forge): Update agbenchmark dependency 2024-02-14 01:27:07 +01:00
Reinier van der Leer
327fb1f916 fix(benchmark): Mock mode, python evals, --attempts flag, challenge definitions
- Fixed `--mock` mode
   - Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step.
   - Implemented mock mode in `WebArenaChallenge`

- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted

- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`

- Disabled left-over Helicone code (see 056163e)

- Fixed a couple of challenge definitions
   - WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
   - synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list

- Added some debug logging in agent_api_interface.py and challenges/builtin.py
2024-02-14 01:05:34 +01:00
Reinier van der Leer
bb7f5abc6c fix(agent/text_processing): Fix extract_information LLM response parsing
OpenAI's newest models return JSON with markdown fences around it, breaking the `json.loads` parser.

This commit adds an `extract_list_from_response` function to json_utils/utilities.py and uses this function to replace `json.loads` in `_process_text`.
2024-02-13 18:28:17 +01:00
Reinier van der Leer
393d6b97e6 feat(agent): Add Sentry integration for telemetry
* Add Sentry integration for telemetry
   - Add `sentry_sdk` dependency
   - Add setup logic and config flow using `TELEMETRY_OPT_IN` environment variable
      - Add app/telemetry.py with `setup_telemetry` helper routine
      - Call `setup_telemetry` in `cli()` in app/cli.py
      - Add `TELEMETRY_OPT_IN` to .env.template
      - Add helper function `env_file_exists` and routine `set_env_config_value` to app/utils.py
         - Add unit tests for `set_env_config_value` in test_utils.py
      - Add prompt to startup to ask whether the user wants to enable telemetry if the env variable isn't set

* Add `capture_exception` statements for LLM parsing errors and command failures
2024-02-13 18:10:52 +01:00
Reinier van der Leer
3b8d63dfb6 chore(agent): Update autogpt-forge and agbenchmark dependencies to propagate dependency updates
This also indirectly updates `python-multipart` and fixes "python-multipart vulnerable to Content-Type Header ReDoS" https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
2024-02-13 13:24:24 +01:00
Reinier van der Leer
6763196d78 chore(forge): Update agbenchmark dependency 2024-02-13 12:44:17 +01:00
Reinier van der Leer
e1da58da02 chore(forge): Update aiohttp, fastapi, and python-multipart dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/56
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/52
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/49

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/45
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/42
2024-02-13 12:38:36 +01:00
Reinier van der Leer
91cec515d4 chore(benchmark): Update python-multipart dependency to mitigate vulnerability
- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
2024-02-13 12:36:00 +01:00
Reinier van der Leer
cc585a014f chore(agent): Update aiohttp and fastapi dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/54
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/50

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/44
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/41
2024-02-13 12:30:12 +01:00
Reinier van der Leer
e641cccb42 chore(benchmark): Update aiohttp and fastapi dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/53
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/48

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/46
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/43
2024-02-13 12:21:52 +01:00
Mahdi Karami
cc73d4104b fix(forge): incorrect import 'sdk' in .actions.finish (#6822) 2024-02-13 11:02:03 +01:00
Reinier van der Leer
250552cb3d fix(agent/tests): Update test_config.py:test_initial_values 2024-02-12 13:26:47 +01:00
Reinier van der Leer
1d653973e9 feat(agent/llm): Use new OpenAI models as default SMART_LLM, FAST_LLM, and EMBEDDING_MODEL
- Change default `SMART_LLM` from `gpt-4` to `gpt-4-turbo-preview`
- Change default `FAST_LLM` from `gpt-3.5-turbo-16k` to `gpt-3.5-turbo-0125`
- Change default `EMBEDDING_MODEL` from `text-embedding-ada-002` to `text-embedding-3-small`
- Update .env.template, azure.yaml.template, and documentation accordingly
2024-02-12 13:19:37 +01:00
Reinier van der Leer
7bf9ba5502 chore(agent/llm): Update OpenAI model info
- Add `text-embedding-3-small` and `text-embedding-3-large` as `EMBEDDING_v3_S` and `EMBEDDING_v3_L` respectively
- Add `gpt-3.5-turbo-0125` as `GPT3_v4`
- Add `gpt-4-1106-vision-preview` as `GPT4_v3_VISION`
- Add GPT-4V models to info map
- Change chat model info mapping to derive info for aliases (e.g. `gpt-3.5-turbo`) from specific versions instead of the other way around
2024-02-12 13:17:20 +01:00
Reinier van der Leer
14c9773890 ci(agent): Add GIT_REVISION label to Docker builds 2024-02-12 12:31:04 +01:00
Reinier van der Leer
39fddb1214 fix(agent): Fix application of extra_request_headers in OpenAIProvider 2024-02-12 12:21:30 +01:00
Reinier van der Leer
fe0923ba6c feat(agent/web): Add browser extensions to deal with cookie walls and ads (#6778)
* Add `_sideload_chrome_extensions` subroutine to `open_page_in_browser` in web_selenium.py
   * Sideloads uBlock Origin and I Still Don't Care About Cookies, downloading them if necessary
* Add 2-second delay to `open_page_in_browser` to allow time for handling cookie walls
2024-02-02 18:30:37 +01:00
Reinier van der Leer
dfaeda7cd5 lint(agent/tests): Fix line length in test_utils.py 2024-02-02 18:29:28 +01:00
Reinier van der Leer
9b7fee673e fix(agent/tests): Update test_utils.py:test_extract_json_from_response* in accordance with 956cdc7
Commit 956cdc7 "fix(agent/json_utils): Decode as JSON rather than Python objects" broke these unit tests because they generated "JSON" by stringifying a Python object.
2024-02-02 18:21:19 +01:00
Reinier van der Leer
925269d17b lint(agent): Fix line length in docstring of EpisodicActionHistory.handle_compression 2024-02-02 17:43:42 +01:00
Fernando Navarro Páez
266fe3a3f7 fix(forge): Fix "no module named 'forge.sdk.abilities'" (#6571)
Fixes #6537
2024-02-01 11:23:35 +01:00
Reinier van der Leer
66e0c87894 feat(agent): Add history compression to increase longevity and efficiency
* Compress steps in the prompt to reduce token usage, and to increase longevity when using models with limited context windows
* Move multiple copies of step formatting code to `Episode.format` method
* Add `EpisodicActionHistory.handle_compression` method to handle compression of new steps
2024-01-31 17:51:45 +01:00
Reinier van der Leer
55433f468a feat(agent/web): Improve read_webpage information extraction abilities
* Implement `extract_information` function in `autogpt.processing.text` module. This function extracts pieces of information from a body of text based on a list of topics of interest.
* Add `topics_of_interest` and `get_raw_content` parameters to `read_webpage` commmand
   * Limit maximum content length if `get_raw_content=true` is specified
2024-01-31 15:08:08 +01:00
Reinier van der Leer
956cdc77fa fix(agent/json_utils): Decode as JSON rather than Python objects
* Replace `ast.literal_eval` with `json.loads` in `extract_dict_from_response`

This fixes a bug where boolean values could not be decoded because of their required capitalization in Python.
2024-01-31 14:15:02 +01:00
Reinier van der Leer
83a0b03523 fix(agent/prompting): Fix representation of (optional) command parameters in prompt 2024-01-31 14:10:22 +01:00
Reinier van der Leer
25b9e290a5 fix(agent/json_utils): Make extract_dict_from_response more robust
* Accommodate for both ```json and ```JSON blocks in responses
2024-01-29 15:03:09 +01:00
Reinier van der Leer
ab860981d8 feat(agent/llm): Add support for gpt-4-0125-preview
* Add `gpt-4-0125-preview` model to OpenAI model list
* Add `gpt-4-turbo-preview` alias to OpenAI model list
2024-01-29 11:22:32 +01:00
Reinier van der Leer
a0cae78ba3 feat(benchmark): Add -N, --attempts option for multiple attempts per challenge
LLMs are probabilistic systems. Reproducibility of completions is not guaranteed. It only makes sense to account for this, by running challenges multiple times to obtain a success ratio rather than a boolean success/failure result.

Changes:
- Add `-N`, `--attempts` option to CLI and `attempts_per_challenge` parameter to `main.py:run_benchmark`.
- Add dynamic `i_attempt` fixture through `pytest_generate_tests` hook in conftest.py to achieve multiple runs per challenge.
- Modify `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge.
- Refactor report_types.py, reports.py, process_report.ty to allow multiple results per challenge.
   - Calculate `success_percentage` from results of the current run, rather than all known results ever.
   - Add docstrings to a number of models in report_types.py.
   - Allow `None` as a success value, e.g. for runs that did not render any results before being cut off.
- Make SingletonReportManager thread-safe.
2024-01-22 17:16:55 +01:00