**FIX ALL LINT/TYPE ERRORS IN AUTOGPT, FORGE, AND BENCHMARK**

### Linting
- Clean up linter configs for `autogpt`, `forge`, and `benchmark`
- Add type checking with Pyright
- Create unified pre-commit config
- Create unified linting and type checking CI workflow

### Testing
- Synchronize CI test setups for `autogpt`, `forge`, and `benchmark`
- Add missing pytest-cov to benchmark dependencies
- Mark GCS tests as slow to speed up pre-commit test runs
- Repair `forge` test suite
- Add `AgentDB.close()` method for test DB teardown in db_test.py
- Use an actual temporary dir instead of forge/test_workspace/
- Move left-behind dependencies for moved `forge` code from autogpt to forge

### Notable type changes
- Replace uses of `ChatModelProvider` with `MultiProvider`
- Remove unnecessary exports from various __init__.py files
- Simplify `FileStorage.open_file` signature by removing `IOBase` from the return type union
- Implement `S3BinaryIOWrapper(BinaryIO)` type interposer for `S3FileStorage`
- Expand overloads of `GCSFileStorage.open_file` for improved typing of read and write modes.
  Type checking had to be silenced for the extra overloads, because Pyright appears to report a false positive: https://github.com/microsoft/pyright/issues/8007
- Change `count_tokens`, `get_tokenizer`, and `count_message_tokens` methods on `ModelProvider`s from class methods to instance methods
- Move `CompletionModelFunction.schema` method -> helper function `format_function_def_for_openai` in `forge.llm.providers.openai`
- Rename `ModelProvider` -> `BaseModelProvider`
- Rename `ChatModelProvider` -> `BaseChatModelProvider`
- Add type `ChatModelProvider`, a union of all subclasses of `BaseChatModelProvider` (see the sketch below)

### Removed rather than fixed
- Remove deprecated and broken autogpt/agbenchmark_config/benchmarks.py
- Remove various base classes and properties on base classes in `forge.llm.providers.schema` and `forge.models.providers`

### Fixes for other issues that came to light
- Clean up `forge.agent_protocol.api_router`, `forge.agent_protocol.database`, and `forge.agent.agent`
- Add fallback behavior to `ImageGeneratorComponent`
- Remove test for deprecated failure behavior
- Fix `agbenchmark.challenges.builtin` challenge exclusion mechanism on Windows
- Fix `_tool_calls_compat_extract_calls` in `forge.llm.providers.openai`
- Add support for `any` (= no type specified) in `JSONSchema.typescript_type`
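The base-class rename and new union type mentioned above can be illustrated with a minimal sketch. This is not the actual forge code: the concrete subclass names (`OpenAIProvider`, `AnthropicProvider`) and the token-counting bodies are assumptions for illustration only.

```python
# Minimal sketch of the BaseChatModelProvider / ChatModelProvider pattern
# described above. Subclass names and method bodies are illustrative assumptions.
from abc import ABC, abstractmethod


class BaseChatModelProvider(ABC):
    """Abstract base class (previously named `ChatModelProvider`)."""

    @abstractmethod
    def count_tokens(self, text: str, model_name: str) -> int:
        """Now an instance method rather than a class method."""


class OpenAIProvider(BaseChatModelProvider):  # assumed subclass name
    def count_tokens(self, text: str, model_name: str) -> int:
        return len(text.split())  # placeholder token count


class AnthropicProvider(BaseChatModelProvider):  # assumed subclass name
    def count_tokens(self, text: str, model_name: str) -> int:
        return len(text.split())  # placeholder token count


# The new `ChatModelProvider` type is a union of the concrete subclasses, so
# annotations stay precise while call sites accept any supported provider.
ChatModelProvider = OpenAIProvider | AnthropicProvider
```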
SCORING_MAP = {
    "percentage": (
        "assign a float score that will represent a percentage out of 100. "
        "Use decimal points to be even more accurate. "
        "0 represents the worst possible generation, "
        "while 100 represents the ideal generation"
    ),
    "scale": (
        "assign an integer score from a scale of 1-10. "
        "1 represents a really bad generation, while 10 represents an ideal generation"
    ),
    "binary": (
        "assign a binary score of either 0 or 1. "
        "0 represents a failure, while 1 represents a success"
    ),
}


REFERENCE_PROMPT = """Ignore previous directions. You are now an expert at evaluating how close machine generated responses are to human answers. You essentially act as a hyper advanced BLEU score.
In order to score the machine generated response you will {scoring}. Make sure to factor in the distance to the ideal response into your thinking, deliberation, and final result regarding scoring. Return nothing but a float score.

Here is the given task for you to evaluate:
{task}

Here is the ideal response you're comparing to based on the task:
{answer}

Here is the current machine generated response to the task that you need to evaluate:
{response}

"""  # noqa: E501

RUBRIC_PROMPT = """Ignore previous directions. You are now an expert at evaluating machine generated responses to given tasks.
In order to score the generated texts you will {scoring}. Make sure to factor in rubric into your thinking, deliberation, and final result regarding scoring. Return nothing but a float score.

Here is the given task for you to evaluate:
{task}

Use the below rubric to guide your thinking about scoring:
{answer}

Here is the current machine generated response to the task that you need to evaluate:
{response}

"""  # noqa: E501

QUESTION_PROMPT = """Ignore previous directions. You are now an expert at evaluating machine generated responses to given tasks.
In order to score the generated texts you will {scoring}. Make sure to think about whether the generated response answers the question well in order to score accurately. Return nothing but a float score.

Here is the given task:
{task}

Here is a question that checks if the task was completed correctly:
{answer}

Here is the current machine generated response to the task that you need to evaluate:
{response}

"""  # noqa: E501

FEW_SHOT_EXAMPLES = """Here are some examples of how to score a machine generated response based on the above:
{examples}

"""  # noqa: E501

CUSTOM_PROMPT = """{custom}
{scoring}

"""

PROMPT_MAP = {
    "rubric": RUBRIC_PROMPT,
    "reference": REFERENCE_PROMPT,
    "question": QUESTION_PROMPT,
    "custom": CUSTOM_PROMPT,
}

END_PROMPT = """Remember to always end your response with nothing but a float score.
Float score:"""
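The constants above are plain `str.format` templates. As a rough usage sketch, an evaluation prompt might be assembled as shown below; the `build_eval_prompt` helper and its example arguments are hypothetical and not part of this file.

```python
# Hypothetical helper, not part of this file: composes the templates above into
# a single evaluation prompt. CUSTOM_PROMPT expects a {custom} field instead of
# {task}/{answer}/{response}, so it is left out of this sketch.
def build_eval_prompt(
    prompt_type: str,   # "rubric", "reference", or "question"
    scoring_type: str,  # "percentage", "scale", or "binary"
    task: str,
    answer: str,
    response: str,
) -> str:
    body = PROMPT_MAP[prompt_type].format(
        scoring=SCORING_MAP[scoring_type],
        task=task,
        answer=answer,
        response=response,
    )
    return body + END_PROMPT


prompt = build_eval_prompt(
    "reference",
    "percentage",
    task="Summarize the attached article in one sentence.",
    answer="A one-sentence reference summary.",
    response="The machine generated summary to be scored.",
)
```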