refactor: split architecture.md into thematic files

2026-04-29 03:00:45 -04:00 · 2026-01-21 17:29:49 -08:00
8 changed files with 501 additions and 2 deletions
@@ -1,8 +1,18 @@
-# OpenHands Architecture
+# OpenHands

 This directory contains the core components of OpenHands.

-For an overview of the system architecture, see the [architecture documentation](https://docs.openhands.dev/usage/architecture/backend) (v0 backend architecture).
+## Documentation
+
+- **[Architecture Documentation](./architecture/README.md)** - Detailed system architecture with Mermaid diagrams covering:
+  - System Architecture Overview
+  - Conversation Startup & WebSocket Flow
+  - Authentication Flow (Keycloak)
+  - Agent Execution & LLM Flow
+  - External Integrations (GitHub, Slack, Jira, etc.)
+  - Metrics, Logs & Observability
+
+- **[External Architecture Docs](https://docs.openhands.dev/usage/architecture/backend)** - Official documentation (v0 backend architecture)

 ## Classes

@@ -0,0 +1,12 @@
+# OpenHands Architecture
+
+This document provides detailed architecture diagrams and explanations for the OpenHands system.
+
+## Documentation Sections
+
+- [System Architecture Overview](./system-architecture.md)
+- [Conversation Startup & WebSocket Flow](./conversation-startup.md)
+- [Authentication Flow](./authentication.md)
+- [Agent Execution & LLM Flow](./agent-execution.md)
+- [External Integrations](./external-integrations.md)
+- [Metrics, Logs & Observability](./observability.md)
@@ -0,0 +1,96 @@
+# Agent Execution & LLM Flow
+
+When the agent executes inside the sandbox, it makes LLM calls through LiteLLM:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant User as User (Browser)
+    participant AS as Agent Server
+    participant Agent as Agent<br/>(CodeAct)
+    participant LLM as LLM Class
+    participant Lite as LiteLLM
+    participant Proxy as LLM Proxy<br/>(llm-proxy.app.all-hands.dev)
+    participant Provider as LLM Provider<br/>(OpenAI, Anthropic, etc.)
+    participant AES as Action Execution Server
+
+    Note over User,AES: Agent Loop - LLM Call Flow
+
+    User->>AS: WebSocket: User message
+    AS->>Agent: Process message
+    Agent->>Agent: Build prompt from state
+
+    Agent->>LLM: completion(messages, tools)
+    LLM->>LLM: Apply config (model, temp, etc.)
+
+    alt Using OpenHands Provider
+        LLM->>Lite: litellm_proxy/{model}
+        Lite->>Proxy: POST /chat/completions
+        Proxy->>Proxy: Auth, rate limit, routing
+        Proxy->>Provider: Forward request
+        Provider-->>Proxy: Response
+        Proxy-->>Lite: Response
+    else Using Direct Provider
+        LLM->>Lite: {provider}/{model}
+        Lite->>Provider: Direct API call
+        Provider-->>Lite: Response
+    end
+
+    Lite-->>LLM: ModelResponse
+    LLM->>LLM: Track metrics (cost, tokens)
+    LLM-->>Agent: Parsed response
+
+    Agent->>Agent: Parse action from response
+    AS->>User: WebSocket: Action event
+
+    Note over User,AES: Action Execution
+
+    AS->>AES: HTTP: Execute action
+    AES->>AES: Run command/edit file
+    AES-->>AS: Observation
+    AS->>User: WebSocket: Observation event
+
+    Agent->>Agent: Update state
+    Note over Agent: Loop continues...
+```
+
+### LLM Components
+
+| Component | Purpose | Location |
+|-----------|---------|----------|
+| **LLM Class** | Wrapper with retries, metrics, config | `openhands/llm/llm.py` |
+| **LiteLLM** | Universal LLM API adapter | External library |
+| **LLM Proxy** | OpenHands managed proxy for billing/routing | `llm-proxy.app.all-hands.dev` |
+| **LLM Registry** | Manages multiple LLM instances | `openhands/llm/llm_registry.py` |
+
+### Model Routing
+
+```
+User selects model
+        │
+        ▼
+┌───────────────────┐
+│ Model prefix?     │
+└───────────────────┘
+        │
+        ├── openhands/claude-3-5  ──► Rewrite to litellm_proxy/claude-3-5
+        │                              Base URL: llm-proxy.app.all-hands.dev
+        │
+        ├── anthropic/claude-3-5  ──► Direct to Anthropic API
+        │                              (User's API key)
+        │
+        ├── openai/gpt-4          ──► Direct to OpenAI API
+        │                              (User's API key)
+        │
+        └── azure/gpt-4           ──► Direct to Azure OpenAI
+                                       (User's API key + endpoint)
+```
+
+### LLM Proxy Benefits
+
+When using `openhands/` prefixed models:
+- **Unified Billing**: Costs tracked through OpenHands account
+- **No API Keys Needed**: Users don't need their own provider keys
+- **Rate Limiting**: Managed quotas and throttling
+- **Model Routing**: Automatic failover and load balancing
+- **Usage Tracking**: Detailed metrics and cost analysis
@@ -0,0 +1,58 @@
+# Authentication Flow
+
+OpenHands uses Keycloak for identity management in the SaaS deployment. The authentication flow involves multiple services:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant User as User (Browser)
+    participant App as App Server
+    participant KC as Keycloak
+    participant IdP as Identity Provider<br/>(GitHub, Google, etc.)
+    participant DB as User Database
+
+    Note over User,DB: OAuth 2.0 / OIDC Authentication Flow
+
+    User->>App: Access OpenHands
+    App->>User: Redirect to Keycloak
+    User->>KC: Login request
+    KC->>User: Show login options
+    User->>KC: Select provider (e.g., GitHub)
+    KC->>IdP: OAuth redirect
+    User->>IdP: Authenticate
+    IdP-->>KC: OAuth callback + tokens
+    KC->>KC: Create/update user session
+    KC-->>User: Redirect with auth code
+    User->>App: Auth code
+    App->>KC: Exchange code for tokens
+    KC-->>App: Access token + Refresh token
+    App->>App: Create signed JWT cookie
+    App->>DB: Store/update user record
+    App-->>User: Set keycloak_auth cookie
+
+    Note over User,DB: Subsequent Requests
+
+    User->>App: Request with cookie
+    App->>App: Verify JWT signature
+    App->>KC: Validate token (if needed)
+    KC-->>App: Token valid
+    App->>App: Extract user context
+    App-->>User: Authorized response
+```
+
+### Authentication Components
+
+| Component | Purpose | Location |
+|-----------|---------|----------|
+| **Keycloak** | Identity provider, SSO, token management | External service |
+| **UserAuth** | Abstract auth interface | `openhands/server/user_auth/user_auth.py` |
+| **SaasUserAuth** | Keycloak implementation | `enterprise/server/auth/saas_user_auth.py` |
+| **JWT Service** | Token signing/verification | `openhands/app_server/services/jwt_service.py` |
+| **Auth Routes** | Login/logout endpoints | `enterprise/server/routes/auth.py` |
+
+### Token Flow
+
+1. **Keycloak Access Token**: Short-lived token for API access
+2. **Keycloak Refresh Token**: Long-lived token to obtain new access tokens
+3. **Signed JWT Cookie**: App Server's session cookie containing encrypted Keycloak tokens
+4. **Provider Tokens**: OAuth tokens for GitHub, GitLab, etc. (stored separately for git operations)
@@ -0,0 +1,68 @@
+# Conversation Startup & WebSocket Flow
+
+When a user starts a conversation, this sequence occurs:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant User as User (Browser)
+    participant App as App Server
+    participant SS as Sandbox Service
+    participant RAPI as Runtime API
+    participant Pool as Warm Pool
+    participant Sandbox as Sandbox (Container)
+    participant AS as Agent Server
+    participant AES as Action Execution Server
+
+    Note over User,AES: Phase 1: Conversation Creation
+    User->>App: POST /api/conversations
+    App->>App: Authenticate user
+    App->>SS: Create sandbox
+
+    Note over SS,Pool: Phase 2: Runtime Provisioning
+    SS->>RAPI: POST /start (image, env, config)
+    RAPI->>Pool: Check for warm runtime
+    alt Warm runtime available
+        Pool-->>RAPI: Return warm runtime
+        RAPI->>RAPI: Assign to session
+    else No warm runtime
+        RAPI->>Sandbox: Create new container
+        Sandbox->>AS: Start Agent Server
+        Sandbox->>AES: Start Action Execution Server
+        AES-->>AS: Ready
+    end
+    RAPI-->>SS: Runtime URL + session API key
+    SS-->>App: Sandbox info
+    App-->>User: Conversation ID + Sandbox URL
+
+    Note over User,AES: Phase 3: Direct WebSocket Connection
+    User->>AS: WebSocket: /sockets/events/{id}
+    AS-->>User: Connection accepted
+    AS->>User: Replay historical events
+
+    Note over User,AES: Phase 4: User Sends Message
+    User->>AS: WebSocket: SendMessageRequest
+    AS->>AS: Agent processes message
+    AS->>AS: LLM call → generate action
+
+    Note over User,AES: Phase 5: Action Execution Loop
+    loop Agent Loop
+        AS->>AES: HTTP: Execute action
+        AES->>AES: Run in sandbox
+        AES-->>AS: Observation result
+        AS->>User: WebSocket: Event update
+        AS->>AS: Update state, next action
+    end
+
+    Note over User,AES: Phase 6: Task Complete
+    AS->>User: WebSocket: AgentStateChanged (FINISHED)
+```
+
+### Key Points
+
+1. **Initial Setup via App Server**: The App Server handles authentication and coordinates with the Sandbox Service
+2. **Runtime API Provisioning**: The Sandbox Service calls the Runtime API, which checks for warm runtimes before creating new containers
+3. **Warm Pool Optimization**: Pre-warmed runtimes reduce startup latency significantly
+4. **Direct WebSocket to Sandbox**: Once created, the user's browser connects **directly** to the Agent Server inside the sandbox
+5. **App Server Not in Hot Path**: After connection, all real-time communication bypasses the App Server entirely
+6. **Agent Server Orchestrates**: The Agent Server manages the AI loop, calling the Action Execution Server for actual command execution
@@ -0,0 +1,88 @@
+# External Integrations
+
+OpenHands integrates with external services (GitHub, Slack, Jira, etc.) through webhook-based event handling:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant Ext as External Service<br/>(GitHub/Slack/Jira)
+    participant App as App Server
+    participant IntRouter as Integration Router
+    participant Manager as Integration Manager
+    participant Conv as Conversation Service
+    participant Sandbox as Sandbox
+
+    Note over Ext,Sandbox: Webhook Event Flow (e.g., GitHub Issue Created)
+
+    Ext->>App: POST /api/integration/{service}/events
+    App->>IntRouter: Route to service handler
+    IntRouter->>IntRouter: Verify signature<br/>(HMAC/signing secret)
+
+    IntRouter->>Manager: Parse event payload
+    Manager->>Manager: Extract context<br/>(repo, issue, user)
+    Manager->>Manager: Map external user → OpenHands user<br/>(via stored tokens)
+
+    Manager->>Conv: Create conversation<br/>(with issue context)
+    Conv->>Sandbox: Provision sandbox
+    Sandbox-->>Conv: Ready
+
+    Manager->>Sandbox: Start agent with task
+
+    Note over Ext,Sandbox: Agent Works on Task...
+
+    Sandbox-->>Manager: Task complete
+    Manager->>Ext: POST result<br/>(PR, comment, etc.)
+
+    Note over Ext,Sandbox: Callback Flow (Agent → External Service)
+
+    Sandbox->>App: Webhook callback<br/>/api/v1/webhooks
+    App->>Manager: Process callback
+    Manager->>Ext: Update external service
+```
+
+### Supported Integrations
+
+| Integration | Trigger Events | Agent Actions |
+|-------------|----------------|---------------|
+| **GitHub** | Issue created, PR opened, @mention | Create PR, comment, push commits |
+| **GitLab** | Issue created, MR opened | Create MR, comment, push commits |
+| **Slack** | @mention in channel | Reply in thread, create tasks |
+| **Jira** | Issue created/updated | Update ticket, add comments |
+| **Linear** | Issue created | Update status, add comments |
+
+### Integration Components
+
+| Component | Purpose | Location |
+|-----------|---------|----------|
+| **Integration Routes** | Webhook endpoints per service | `enterprise/server/routes/integration/` |
+| **Integration Managers** | Business logic per service | `enterprise/integrations/{service}/` |
+| **Token Manager** | Store/retrieve OAuth tokens | `enterprise/server/auth/token_manager.py` |
+| **Callback Processor** | Handle agent → service updates | `enterprise/integrations/{service}/*_callback_processor.py` |
+
+### Integration Authentication
+
+```
+External Service (e.g., GitHub)
+        │
+        ▼
+┌─────────────────────────────────┐
+│ GitHub App Installation         │
+│ - Webhook secret for signature  │
+│ - App private key for API calls │
+└─────────────────────────────────┘
+        │
+        ▼
+┌─────────────────────────────────┐
+│ User Account Linking            │
+│ - Keycloak user ID              │
+│ - GitHub user ID                │
+│ - Stored OAuth tokens           │
+└─────────────────────────────────┘
+        │
+        ▼
+┌─────────────────────────────────┐
+│ Agent Execution                 │
+│ - Uses linked tokens for API    │
+│ - Can push, create PRs, comment │
+└─────────────────────────────────┘
+```
@@ -0,0 +1,103 @@
+# Metrics, Logs & Observability
+
+OpenHands uses multiple systems for monitoring, analytics, and debugging:
+
+```mermaid
+flowchart LR
+    subgraph Sources["Sources"]
+        Agent["Agent Server"]
+        App["App Server"]
+        Frontend["Frontend"]
+    end
+
+    subgraph Collection["Collection"]
+        JSONLog["JSON Logs"]
+        Metrics["Metrics"]
+        PH["PostHog"]
+    end
+
+    subgraph Services["Services"]
+        DD["DataDog"]
+        PHCloud["PostHog Cloud"]
+    end
+
+    Agent --> JSONLog
+    App --> JSONLog
+    App --> PH
+    Frontend --> PH
+
+    JSONLog --> DD
+    Metrics --> DD
+    PH --> PHCloud
+```
+
+### Logging Infrastructure
+
+| Component | Format | Destination | Purpose |
+|-----------|--------|-------------|---------|
+| **Application Logs** | JSON (when `LOG_JSON=1`) | stdout → DataDog | Debugging, error tracking |
+| **Access Logs** | JSON (Uvicorn) | stdout → DataDog | Request tracing |
+| **LLM Debug Logs** | Plain text | File (optional) | LLM call debugging |
+
+### JSON Log Format
+
+When `LOG_JSON=1` is set, all logs are emitted as single-line JSON for DataDog ingestion:
+
+```json
+{
+  "message": "Conversation started",
+  "severity": "INFO",
+  "conversation_id": "abc-123",
+  "user_id": "user-456",
+  "timestamp": "2024-01-15T10:30:00Z"
+}
+```
+
+### Metrics Tracked
+
+| Metric | Tracked By | Storage | Purpose |
+|--------|------------|---------|---------|
+| **LLM Cost** | `Metrics` class | Conversation stats file | Billing, budget limits |
+| **Token Usage** | `Metrics` class | Conversation stats file | Usage analytics |
+| **Response Latency** | `Metrics` class | Conversation stats file | Performance monitoring |
+| **User Events** | PostHog | PostHog Cloud | Product analytics |
+| **Feature Flags** | PostHog | PostHog Cloud | Gradual rollouts |
+
+### PostHog Analytics
+
+PostHog is used for both product analytics and feature flags:
+
+**Frontend Events:**
+- `conversation_started`
+- `download_trajectory_button_clicked`
+- Feature flag checks
+
+**Backend Events:**
+- Experiment assignments
+- Conversion tracking
+
+### DataDog Integration
+
+Logs are ingested by DataDog through structured JSON output:
+
+1. **Log Collection**: Container stdout/stderr → DataDog Agent → DataDog Logs
+2. **APM Traces**: Distributed tracing across services (when enabled)
+3. **Dashboards**: Custom dashboards for:
+   - Error rates by service
+   - Request latency percentiles
+   - Conversation success rates
+   - LLM cost tracking
+
+### Conversation Stats Persistence
+
+Per-conversation metrics are persisted for billing and analytics:
+
+```python
+# Location: openhands/server/services/conversation_stats.py
+ConversationStats:
+  - service_to_metrics: Dict[str, Metrics]
+  - accumulated_cost: float
+  - token_usage: TokenUsage
+
+# Stored at: {file_store}/conversation_stats/{conversation_id}.pkl
+```
@@ -0,0 +1,64 @@
+# System Architecture Overview
+
+OpenHands uses a multi-tier architecture with these main components:
+
+```mermaid
+flowchart TB
+    subgraph AppServer["OpenHands App Server (Single Instance)"]
+        API["REST API<br/>(FastAPI)"]
+        Auth["Authentication"]
+        ConvMgr["Conversation<br/>Manager"]
+        SandboxSvc["Sandbox<br/>Service"]
+    end
+
+    subgraph RuntimeAPI["Runtime API (Separate Service)"]
+        RuntimeMgr["Runtime<br/>Manager"]
+        WarmPool["Warm Runtime<br/>Pool"]
+    end
+
+    subgraph Sandbox["Sandbox (Docker/K8s Container)"]
+        AS["Agent Server<br/>(openhands-agent-server)"]
+        AES["Action Execution<br/>Server"]
+        Browser["Browser<br/>Environment"]
+        FS["File System"]
+    end
+
+    User["User"] -->|"1. HTTP/REST"| API
+    API --> Auth
+    Auth --> ConvMgr
+    ConvMgr --> SandboxSvc
+
+    SandboxSvc -->|"2. POST /start"| RuntimeMgr
+    RuntimeMgr -->|"Check pool"| WarmPool
+    WarmPool -->|"Warm runtime<br/>available?"| RuntimeMgr
+    RuntimeMgr -->|"3. Provision or<br/>assign runtime"| Sandbox
+
+    User -.->|"4. WebSocket<br/>(Direct)"| AS
+
+    AS -->|"HTTP"| AES
+    AES --> Browser
+    AES --> FS
+```
+
+### Component Responsibilities
+
+| Component | Location | Instances | Purpose |
+|-----------|----------|-----------|---------|
+| **App Server** | Host | 1 per deployment | REST API, auth, conversation management |
+| **Sandbox Service** | Inside App Server | 1 | Manages sandbox lifecycle, calls Runtime API |
+| **Runtime API** | Separate service | 1 per deployment | Provisions runtimes, manages warm pool |
+| **Agent Server** | Inside sandbox | 1 per sandbox | AI agent loop, LLM calls, state management |
+| **Action Execution Server** | Inside sandbox | 1 per sandbox | Execute bash, file ops, browser actions |
+
+### Runtime API Endpoints
+
+The Runtime API manages the actual container/pod lifecycle:
+
+| Endpoint | Purpose |
+|----------|---------|
+| `POST /start` | Start a new runtime (or assign from warm pool) |
+| `POST /stop` | Stop and clean up a runtime |
+| `POST /pause` | Pause a running runtime |
+| `POST /resume` | Resume a paused runtime |
+| `GET /sessions/{id}` | Get runtime status |
+| `GET /list` | List all active runtimes |