AutoGPT/benchmark/agbenchmark/utils/prompts.py
Reinier van der Leer f107ff8cf0 Set up unified pre-commit + CI w/ linting + type checking & FIX EVERYTHING (#7171)
- **FIX ALL LINT/TYPE ERRORS IN AUTOGPT, FORGE, AND BENCHMARK**

### Linting
- Clean up linter configs for `autogpt`, `forge`, and `benchmark`
- Add type checking with Pyright
- Create unified pre-commit config
- Create unified linting and type checking CI workflow

### Testing
- Synchronize CI test setups for `autogpt`, `forge`, and `benchmark`
  - Add missing `pytest-cov` to benchmark dependencies
- Mark GCS tests as slow to speed up pre-commit test runs
- Repair `forge` test suite
  - Add `AgentDB.close()` method for test DB teardown in db_test.py
  - Use actual temporary dir instead of forge/test_workspace/
- Move left-behind dependencies for moved `forge` code from `autogpt` to `forge`

### Notable type changes
- Replace uses of `ChatModelProvider` with `MultiProvider`
- Remove unnecessary exports from various `__init__.py` files
- Simplify `FileStorage.open_file` signature by removing `IOBase` from return type union
  - Implement `S3BinaryIOWrapper(BinaryIO)` type interposer for `S3FileStorage`
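
  As a rough idea of what such an interposer does (a minimal hypothetical sketch, not the actual `S3BinaryIOWrapper` implementation), it wraps the raw S3 response body so callers get a standard binary file object:

  ```python
  import io


  class StreamingBodyWrapper(io.BufferedIOBase):
      """Hypothetical sketch: adapt a read-only S3 response body, which only
      exposes .read()/.close(), so it can be returned where a standard binary
      file object is expected."""

      def __init__(self, body) -> None:
          self._body = body  # assumed to be e.g. a botocore StreamingBody

      def read(self, size: int | None = -1) -> bytes:
          # botocore's StreamingBody reads the whole body when amt is None
          amt = None if size is None or size < 0 else size
          return self._body.read(amt)

      def readable(self) -> bool:
          return True

      def close(self) -> None:
          self._body.close()
          super().close()
  ```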

- Expand overloads of `GCSFileStorage.open_file` for improved typing of read and write modes

  Had to silence type checking for the extra overloads, because (I think) Pyright is reporting a false-positive:
  https://github.com/microsoft/pyright/issues/8007
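
  The overload pattern in question looks roughly like this (a simplified sketch with an assumed signature, not the actual `GCSFileStorage` code):

  ```python
  from typing import BinaryIO, Literal, TextIO, overload


  class ExampleFileStorage:
      """Illustration of the overload pattern: the `binary` flag narrows
      the return type so callers get a correctly typed file object."""

      @overload
      def open_file(
          self, path: str, mode: Literal["r", "w"] = "r", binary: Literal[False] = False
      ) -> TextIO: ...

      @overload
      def open_file(
          self, path: str, mode: Literal["r", "w"] = "r", *, binary: Literal[True]
      ) -> BinaryIO: ...

      def open_file(self, path: str, mode: Literal["r", "w"] = "r", binary: bool = False):
          raise NotImplementedError  # backend-specific in the real implementation
  ```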

- Change `count_tokens`, `get_tokenizer`, `count_message_tokens` methods on `ModelProvider`s from class methods to instance methods

- Move `CompletionModelFunction.schema` method -> helper function `format_function_def_for_openai` in `forge.llm.providers.openai`

- Rename `ModelProvider` -> `BaseModelProvider`
- Rename `ChatModelProvider` -> `BaseChatModelProvider`
- Add type `ChatModelProvider` which is a union of all subclasses of `BaseChatModelProvider`
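
  As a rough illustration of the new scheme (the provider classes shown here are only examples; the real alias unions all `BaseChatModelProvider` subclasses in forge):

  ```python
  from typing import Union


  class BaseChatModelProvider:  # shared abstract interface (simplified)
      ...


  class OpenAIProvider(BaseChatModelProvider):
      ...


  class AnthropicProvider(BaseChatModelProvider):
      ...


  # The new alias: a union of concrete providers, usable in annotations
  # where any chat-capable provider instance is acceptable.
  ChatModelProvider = Union[OpenAIProvider, AnthropicProvider]
  ```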

### Removed rather than fixed
- Remove deprecated and broken autogpt/agbenchmark_config/benchmarks.py
- Various base classes and properties on base classes in `forge.llm.providers.schema` and `forge.models.providers`

### Fixes for other issues that came to light
- Clean up `forge.agent_protocol.api_router`, `forge.agent_protocol.database`, and `forge.agent.agent`

- Add fallback behavior to `ImageGeneratorComponent`
  - Remove test for deprecated failure behavior

- Fix `agbenchmark.challenges.builtin` challenge exclusion mechanism on Windows

- Fix `_tool_calls_compat_extract_calls` in `forge.llm.providers.openai`

- Add support for `any` (= no type specified) in `JSONSchema.typescript_type`

SCORING_MAP = {
    "percentage": (
        "assign a float score that will represent a percentage out of 100. "
        "Use decimal points to be even more accurate. "
        "0 represents the worst possible generation, "
        "while 100 represents the ideal generation"
    ),
    "scale": (
        "assign an integer score from a scale of 1-10. "
        "1 represents a really bad generation, while 10 represents an ideal generation"
    ),
    "binary": (
        "assign a binary score of either 0 or 1. "
        "0 represents a failure, while 1 represents a success"
    ),
}
REFERENCE_PROMPT = """Ignore previous directions. You are now an expert at evaluating how close machine generated responses are to human answers. You essentially act as a hyper advanced BLEU score.
In order to score the machine generated response you will {scoring}. Make sure to factor the distance to the ideal response into your thinking, deliberation, and final result regarding scoring. Return nothing but a float score.
Here is the given task for you to evaluate:
{task}
Here is the ideal response you're comparing to based on the task:
{answer}
Here is the current machine generated response to the task that you need to evaluate:
{response}
""" # noqa: E501
RUBRIC_PROMPT = """Ignore previous directions. You are now an expert at evaluating machine generated responses to given tasks.
In order to score the generated texts you will {scoring}. Make sure to factor the rubric into your thinking, deliberation, and final result regarding scoring. Return nothing but a float score.
Here is the given task for you to evaluate:
{task}
Use the below rubric to guide your thinking about scoring:
{answer}
Here is the current machine generated response to the task that you need to evaluate:
{response}
""" # noqa: E501
QUESTION_PROMPT = """Ignore previous directions. You are now an expert at evaluating machine generated responses to given tasks.
In order to score the generated texts you will {scoring}. Make sure to think about whether the generated response answers the question well in order to score accurately. Return nothing but a float score.
Here is the given task:
{task}
Here is a question that checks if the task was completed correctly:
{answer}
Here is the current machine generated response to the task that you need to evaluate:
{response}
""" # noqa: E501
FEW_SHOT_EXAMPLES = """Here are some examples of how to score a machine generated response based on the above:
{examples}
""" # noqa: E501
CUSTOM_PROMPT = """{custom}
{scoring}
"""
PROMPT_MAP = {
    "rubric": RUBRIC_PROMPT,
    "reference": REFERENCE_PROMPT,
    "question": QUESTION_PROMPT,
    "custom": CUSTOM_PROMPT,
}
END_PROMPT = """Remember to always end your response with nothing but a float score.
Float score:"""
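
For reference, a short usage sketch (illustrative only, not part of prompts.py) of how these templates appear intended to be combined: pick a template from `PROMPT_MAP`, fill in the scoring instruction from `SCORING_MAP`, and append `END_PROMPT`:

```python
# Illustrative only; the real caller lives elsewhere in agbenchmark.
scoring = SCORING_MAP["percentage"]

prompt = PROMPT_MAP["rubric"].format(
    scoring=scoring,
    task="Write a one-sentence summary of the given article.",
    answer="Rubric: the summary must be a single accurate, fluent sentence.",
    response="<machine generated response to evaluate>",
)
prompt += END_PROMPT
```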