Commit Graph

2 Commits

Zamil Majdy
91dd9364bb fix(backend): implement retry mechanism for SmartDecisionMaker tool call validation (#11015)

This PR fixes a critical production issue where SmartDecisionMakerBlock
silently accepted tool calls with typo'd parameter names (e.g.,
'maximum_keyword_difficulty' instead of 'max_keyword_difficulty'),
causing downstream blocks to receive null values and executions to fail.

The solution implements comprehensive parameter validation with
automatic retry when the LLM provides malformed tool calls, giving the
LLM specific feedback to correct the errors.

### Changes 🏗️


**Core Validation & Retry Logic
(`backend/blocks/smart_decision_maker.py`)**
- Add tool call parameter validation against the function schema (sketched
after this list)
- Implement the retry mechanism using the existing `create_retry_decorator`
from `backend.util.retry`
- Validate provided parameters against the schema's expected properties and
required fields
- Generate specific error messages for unknown parameters (typos) and
missing required parameters
- Append the error feedback to the conversation history so the LLM can
correct itself on retry
- Use the `input_data.retry` field to configure the number of retry attempts
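
A minimal sketch of the idea, assuming a free-standing validator and a plain
retry loop; names such as `validate_tool_call_args` are illustrative, not the
PR's actual code, and the real implementation wraps retries with
`create_retry_decorator` rather than a hand-rolled loop:

```python
# Illustrative sketch only; the real code lives in
# backend/blocks/smart_decision_maker.py and may differ in names and shape.
from typing import Any


def validate_tool_call_args(
    args: dict[str, Any], schema: dict[str, Any]
) -> list[str]:
    """Return human-readable errors for a tool call; empty means valid."""
    properties = schema.get("properties", {})
    required = set(schema.get("required", []))
    errors: list[str] = []

    # Unknown parameter names usually mean the LLM typo'd a key.
    for name in args:
        if name not in properties:
            errors.append(
                f"Unknown parameter '{name}'; expected one of: "
                + ", ".join(sorted(properties))
            )

    # Every required parameter must be present.
    for name in sorted(required - set(args)):
        errors.append(f"Missing required parameter '{name}'")

    return errors


def call_tool_with_retries(llm, conversation, tool_schema, retries: int):
    """Ask the LLM for a tool call, feeding validation errors back on failure."""
    errors: list[str] = []
    for _ in range(max(1, retries)):
        response = llm.complete(conversation)  # hypothetical LLM client
        errors = validate_tool_call_args(response.tool_args, tool_schema)
        if not errors:
            return response
        # Specific feedback lets the model correct its own mistake on retry.
        conversation.append({"role": "user", "content": "\n".join(errors)})
    raise ValueError("Tool call still invalid after retries: " + "; ".join(errors))
```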

**Comprehensive Test Coverage
(`backend/blocks/test/test_smart_decision_maker.py`)**
- Add `test_smart_decision_maker_parameter_validation` with 4 comprehensive
test scenarios (a schematic sketch follows this list):
  1. Tool call with a typo'd parameter (should retry and eventually fail
with a clear error)
  2. Tool call missing a required parameter (should fail immediately with a
clear error)
  3. Valid tool call with an optional parameter missing (should succeed)
  4. Valid tool call with all parameters provided (should succeed)
- Verify the retry mechanism works correctly and respects the retry count
- Mock LLM responses for controlled testing of the validation logic
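
A schematic pytest version of those four scenarios, exercising the
hypothetical validator from the sketch above; the schema and parameter names
are assumptions, not the block's real fixtures:

```python
import pytest

# Hypothetical tool schema mirroring the typo example from the PR description.
SCHEMA = {
    "properties": {
        "max_keyword_difficulty": {"type": "integer"},
        "locale": {"type": "string"},
    },
    "required": ["max_keyword_difficulty"],
}


@pytest.mark.parametrize(
    "args, should_pass",
    [
        ({"maximum_keyword_difficulty": 50}, False),  # 1. typo'd parameter
        ({}, False),                                  # 2. missing required param
        ({"max_keyword_difficulty": 50}, True),       # 3. optional param omitted
        ({"max_keyword_difficulty": 50, "locale": "en"}, True),  # 4. all provided
    ],
)
def test_parameter_validation(args, should_pass):
    errors = validate_tool_call_args(args, SCHEMA)  # validator sketched above
    assert (not errors) == should_pass
```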

**Load Tests Documentation Update (`load-tests/README.md`)**
- Update documentation to reflect current orchestrator-based
architecture
- Remove references to deprecated `run-tests.js` and
`comprehensive-orchestrator.js`
- Streamline documentation to focus on working
`orchestrator/orchestrator.js`
- Update NPM scripts and command examples for current workflow
- Clean up outdated file references to match actual infrastructure

**Production Impact**
- **Prevents silent failures**: Tool call parameter typos now trigger
retries instead of propagating null values downstream
- **Maintains compatibility**: No breaking changes to existing
SmartDecisionMaker functionality
- **Improves reliability**: LLM receives feedback to correct parameter
errors
- **Configurable retries**: Uses existing `retry` field for user control
- **Accurate documentation**: Load-tests docs now match actual working
infrastructure

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Run existing SmartDecisionMaker tests to ensure no regressions:
`poetry run pytest backend/blocks/test/test_smart_decision_maker.py
-xvs` ✅ All 4 tests passed
- [x] Run new parameter validation test specifically: `poetry run pytest
backend/blocks/test/test_smart_decision_maker.py::test_smart_decision_maker_parameter_validation
-xvs` ✅ Passed with retry behavior confirmed
- [x] Verify retry mechanism works by checking log output for retry
attempts ✅ Confirmed in test logs
- [x] Test tool call validation with different scenarios (typos, missing
params, valid calls) ✅ All scenarios covered and working
- [x] Run code formatting and linting: `poetry run format` ✅ All
formatters passed
- [x] Verify no breaking changes to existing SmartDecisionMaker
functionality ✅ All existing tests pass
- [x] Verify load-tests documentation accuracy ✅ README now matches
actual orchestrator infrastructure

#### For configuration changes:

- [x] `.env.default` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

**Note**: No configuration changes were needed as this uses existing
retry infrastructure and block schema validation.

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-30 16:18:05 +00:00
Zamil Majdy
4c000086e6 feat(backend): implement clean k6 load testing infrastructure (#10978)
## Summary

Implement comprehensive k6 load testing infrastructure for the AutoGPT
Platform with clean file organization, unified test runner, and cloud
integration.

## Key Features

### 🗂️ Clean File Organization
- **tests/basic/**: Simple validation tests (connectivity, single
endpoints)
- **tests/api/**: Core functionality tests (API endpoints, graph
execution)
- **tests/marketplace/**: User-facing feature tests (public/library
access)
- **tests/comprehensive/**: End-to-end scenario tests (complete user
journeys)
- **orchestrator/**: Advanced test orchestration for full suites

### 🚀 Unified Test Runner
- **Single entry point**: `run-tests.js` for both local and cloud
execution (dispatch sketched after this list)
- **7 available tests**: From basic connectivity to comprehensive
platform journeys
- **Flexible execution**: Run individual tests, comma-separated lists,
or all tests
- **Auto-configuration**: Different VU/duration settings for local vs
cloud execution
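
`run-tests.js` itself is Node; purely to illustrate the local-vs-cloud
dispatch, a Python sketch might look like the following. The VU/duration
values are made up, while `k6 run` and `k6 cloud` are the real CLI
subcommands:

```python
import subprocess

# Hypothetical local profile; the real settings live in run-tests.js, and
# cloud runs take their VU/duration options from each script itself.
LOCAL_VUS, LOCAL_DURATION = "10", "30s"


def run_test(script: str, mode: str = "local") -> None:
    if mode == "cloud":
        cmd = ["k6", "cloud", script]
    else:
        cmd = ["k6", "run", "--vus", LOCAL_VUS,
               "--duration", LOCAL_DURATION, script]
    subprocess.run(cmd, check=True)


run_test("tests/api/core-api-test.js")                # local debugging run
run_test("tests/api/core-api-test.js", mode="cloud")  # cloud performance run
```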

### 🔐 Advanced Authentication
- **Pre-authenticated tokens**: 24-hour JWT tokens eliminate Supabase
rate limiting (generation sketched after this list)
- **Configurable generation**: Default 10 tokens, scalable to 150+ for
high concurrency
- **Graceful handling**: Proper auth failure detection and recovery
- **ES module compatible**: Modern JavaScript with full import/export
support
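
The repo's `generate-tokens.js` is JavaScript; as a sketch of the same idea
in Python, assuming tokens are minted by signing directly with the project's
JWT secret via PyJWT (the actual script may instead go through the Supabase
admin API):

```python
import datetime
import os
import uuid

import jwt  # PyJWT

# Assumption: direct HS256 signing with the project's JWT secret.
SECRET = os.environ["SUPABASE_JWT_SECRET"]


def mint_token(user_id: str, hours: int = 24) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": user_id,
        "role": "authenticated",  # Supabase's standard signed-in role
        "iat": now,
        "exp": now + datetime.timedelta(hours=hours),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")


# Pre-mint a pool of tokens so virtual users never hit the auth endpoint
# mid-test; scale the count up (150+) for high-concurrency runs.
tokens = [mint_token(str(uuid.uuid4())) for _ in range(10)]
```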

### ☁️ k6 Cloud Integration
- **Cloud execution**: Tests run on k6 cloud infrastructure for
consistent results
- **Real-time monitoring**: Live dashboards with performance metrics
- **URL tracking**: Automatic test result URL capture and storage
- **Sequential orchestration**: Proper delays between tests for resource
management

## Test Coverage

### Performance Validated
- **Core API**: 100 VUs successfully testing `/api/credits`,
`/api/graphs`, `/api/blocks`, `/api/executions`
- **Graph Execution**: 80 VUs for complete workflow pipeline testing
- **Marketplace**: 150 VUs for public browsing, 100 VUs for
authenticated library operations
- **Authentication**: 150+ concurrent users with pre-authenticated token
scaling

### User Journey Simulation
- **Dashboard workflows**: Credits checking, graph management, execution
monitoring
- **Marketplace browsing**: Public search, agent discovery, category
filtering
- **Library operations**: Agent adding, favoriting, forking, detailed
views
- **Complete workflows**: End-to-end platform usage with realistic user
behavior

## Technical Implementation

### ES Module Compatibility
- Full ES module support with modern JavaScript imports/exports
- Proper module execution patterns for Node.js compatibility
- Clean separation between CommonJS legacy and modern ES modules

### Error Handling & Monitoring  
- **Separate metrics**: HTTP status, authentication, JSON validation,
overall success
- **Graceful degradation**: Auth failures don't crash VUs, proper error
tracking
- **Performance thresholds**: Configurable P95/P99 latency and error
rate limits (illustrated after this list)
- **Custom counters**: Track operation types, success rates, user
journey completion
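
k6 declares these limits in each script's `options.thresholds`; conceptually
the pass/fail check reduces to something like this Python sketch, where the
500 ms and 5% limits are example values, not the suite's configured ones:

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies."""
    ordered = sorted(samples)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]


# Hypothetical run data: request latencies (ms) and per-request outcomes.
latencies = [88.0, 95.0, 120.0, 150.0, 205.0, 310.0]
successes = [True] * 59 + [False]  # one failed request out of sixty

error_rate = 1 - sum(successes) / len(successes)  # ~1.7%

assert p95(latencies) < 500.0, "p95 latency threshold exceeded"
assert error_rate < 0.05, "error rate threshold exceeded"
```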

### Infrastructure Benefits
- **Rate limit protection**: Pre-auth tokens prevent Supabase auth
bottlenecks
- **Scalable testing**: Support for 150+ concurrent users with proper
token management
- **Cloud consistency**: Tests run on dedicated k6 cloud servers for
reliable results
- **Development workflow**: Local execution for debugging, cloud for
performance validation

## Usage

### Quick Start
```bash
# Setup and verification
export SUPABASE_SERVICE_KEY="your-service-key"
node generate-tokens.js
node run-tests.js verify

# Local testing (development)
node run-tests.js run core-api-test DEV

# Cloud testing (performance)
node run-tests.js cloud all DEV
```

### NPM Scripts
```bash
npm run verify    # Quick setup check
npm test         # All tests locally  
npm run cloud    # All tests in k6 cloud
```

## Validation Results

✅ **Authentication**: 100% success rate with fresh 24-hour tokens  
✅ **File Structure**: All imports and references verified correct  
✅ **Test Execution**: All 7 tests execute successfully with proper metrics  
✅ **Cloud Integration**: k6 cloud execution working with proper credentials  
✅ **Documentation**: Complete README with usage examples and troubleshooting

## Files Changed

### Core Infrastructure
- `run-tests.js`: Unified test runner supporting local/cloud execution
- `generate-tokens.js`: ES module compatible token generation with
24-hour expiry
- `README.md`: Comprehensive documentation with updated file references

### Organized Test Structure  
- `tests/basic/connectivity-test.js`: Basic connectivity validation
- `tests/basic/single-endpoint-test.js`: Individual API endpoint testing
- `tests/api/core-api-test.js`: Core authenticated API endpoints
- `tests/api/graph-execution-test.js`: Graph workflow pipeline testing  
- `tests/marketplace/public-access-test.js`: Public marketplace browsing
- `tests/marketplace/library-access-test.js`: Authenticated
marketplace/library operations
- `tests/comprehensive/platform-journey-test.js`: Complete user journey
simulation

### Configuration
- `configs/environment.js`: Environment URLs and performance settings
- `package.json`: NPM scripts and dependencies for unified workflow

This infrastructure provides a solid foundation for continuous
performance monitoring and load testing of the AutoGPT Platform.

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Co-authored-by: Reinier van der Leer <pwuts@agpt.co>
2025-09-25 12:51:54 +07:00