fix(backend/db): expose /ready endpoint via POST and GET for consistency

fix(backend/db): add memory-aware readiness probe for database-manager
Add /ready endpoint that checks process memory usage against a configurable threshold (READINESS_MEMORY_LIMIT env var, default 1.5GB). When memory exceeds the threshold, the readiness probe fails, causing Kubernetes to remove the pod from Service endpoints. In-flight requests drain gracefully while the liveness probe (/health_check) continues to pass. Once the pod is fully drained, Kubernetes terminates and restarts it with fresh memory. Base AppService gets a default readiness_check() that delegates to health_check(), so other services are unaffected.
2026-04-08 03:00:28 -04:00 · 2026-03-23 23:00:32 +07:00 · 2026-03-23 22:56:27 +07:00 · 2026-03-23 22:54:43 +07:00 · 2026-03-23 22:49:48 +07:00 · 2026-03-23 22:45:23 +07:00
3 changed files with 128 additions and 68 deletions
--- a/.claude/skills/pr-test/SKILL.md
+++ b/.claude/skills/pr-test/SKILL.md
@@ -346,59 +346,54 @@ agent-browser --session-name pr-test open 'http://localhost:3000/copilot' --time
 # ... fill chat input and press Enter, wait 20-30s for response
 ```

-## Step 5: Record results
+## Step 5: Record results and take screenshots

-For each test scenario, record in `$RESULTS_DIR/test-report.md`:
+**Take a screenshot at every significant test step** — before and after interactions, on success, and on failure. Name them sequentially with descriptive names:

-```markdown
-# E2E Test Report: PR #{N} — {title}
-Date: {date}
-Branch: {branch}
-Worktree: {path}
-
-## Environment
- Docker services: [list running containers]
- API keys: OpenRouter={present/missing}, E2B={present/missing}
-
-## Test Results
-
-### Scenario 1: {name}
-**Steps:**
-1. ...
-2. ...
-**Expected:** ...
-**Actual:** ...
-**Result:** PASS / FAIL
-**Screenshot:** {filename}.png
-**Logs:** (if relevant)
-
-### Scenario 2: {name}
-...
-
-## Summary
- Total: X scenarios
- Passed: Y
- Failed: Z
- Bugs found: [list]
-```
-
-Take screenshots at each significant step:
 ```bash
 agent-browser --session-name pr-test screenshot $RESULTS_DIR/{NN}-{description}.png
+# Examples:
+# $RESULTS_DIR/01-login-page.png
+# $RESULTS_DIR/02-builder-with-block.png
+# $RESULTS_DIR/03-copilot-response.png
+# $RESULTS_DIR/04-agent-execution-result.png
+# $RESULTS_DIR/05-error-state.png
 ```

-## Step 6: Report results
+**Aim for at least one screenshot per test scenario.** More is better — screenshots are the primary evidence that tests were actually run.

-After all tests complete, output a summary to the user:
+## Step 6: Show results to user with screenshots

-1. Table of all scenarios with PASS/FAIL
-2. Screenshots of failures (read the PNG files to show them)
-3. Any bugs found with details
-4. Recommendations
+**CRITICAL: After all tests complete, you MUST show every screenshot to the user using the Read tool, with an explanation of what each screenshot shows.** This is the most important part of the test report — the user needs to visually verify the results.

-### Post test results as PR comment with screenshots
+For each screenshot:
+1. Use the `Read` tool to display the PNG file (Claude can read images)
+2. Write a 1-2 sentence explanation below it describing:
+   - What page/state is being shown
+   - What the screenshot proves (which test scenario it validates)
+   - Any notable details visible in the UI

-Upload screenshots to the PR using the GitHub Git API (no local git operations — safe for worktrees).
+Format the output like this:
+
+```markdown
+### Screenshot 1: {descriptive title}
+[Read the PNG file here]
+
+**What it shows:** {1-2 sentence explanation of what this screenshot proves}
+
+---
+```
+
+After showing all screenshots, output a summary table:
+
+| # | Scenario | Result |
+|---|----------|--------|
+| 1 | {name} | PASS/FAIL |
+| 2 | ... | ... |
+
+## Step 7: Post test report as PR comment with screenshots
+
+Upload screenshots to the PR using the GitHub Git API (no local git operations — safe for worktrees), then post a comment with inline images and per-screenshot explanations.

 ```bash
 # Upload screenshots via GitHub Git API (creates blobs, tree, commit, and ref remotely)
@@ -406,17 +401,7 @@ REPO="Significant-Gravitas/AutoGPT"
 SCREENSHOTS_BRANCH="test-screenshots/pr-${PR_NUMBER}"
 SCREENSHOTS_DIR="test-screenshots/PR-${PR_NUMBER}"

-# Step 1: Create blobs for each screenshot
-declare -a TREE_ENTRIES
-for img in $RESULTS_DIR/*.png; do
-  BASENAME=$(basename "$img")
-  B64=$(base64 < "$img")
-  BLOB_SHA=$(gh api "repos/${REPO}/git/blobs" -f content="$B64" -f encoding="base64" --jq '.sha')
-  TREE_ENTRIES+=("-f" "tree[][path]=${SCREENSHOTS_DIR}/${BASENAME}" "-f" "tree[][mode]=100644" "-f" "tree[][type]=blob" "-f" "tree[][sha]=${BLOB_SHA}")
-done
-
-# Step 2: Create a tree with all screenshot blobs
-# Build the tree JSON manually since gh api doesn't handle arrays well
+# Step 1: Create blobs for each screenshot and build tree JSON
 TREE_JSON='['
 FIRST=true
 for img in $RESULTS_DIR/*.png; do
@@ -428,42 +413,72 @@ for img in $RESULTS_DIR/*.png; do
 done
 TREE_JSON+=']'

-TREE_SHA=$(echo "$TREE_JSON" | gh api "repos/${REPO}/git/trees" --input - -f base_tree="" --jq '.sha' 2>/dev/null \
-  || echo "$TREE_JSON" | jq -c '{tree: .}' | gh api "repos/${REPO}/git/trees" --input - --jq '.sha')
-
-# Step 3: Create a commit pointing to that tree
+# Step 2: Create tree, commit, and branch ref
+TREE_SHA=$(echo "$TREE_JSON" | jq -c '{tree: .}' | gh api "repos/${REPO}/git/trees" --input - --jq '.sha')
 COMMIT_SHA=$(gh api "repos/${REPO}/git/commits" \
  -f message="test: add E2E test screenshots for PR #${PR_NUMBER}" \
  -f tree="$TREE_SHA" \
  --jq '.sha')
-
-# Step 4: Create or update the ref (branch) — no local checkout needed
 gh api "repos/${REPO}/git/refs" \
  -f ref="refs/heads/${SCREENSHOTS_BRANCH}" \
  -f sha="$COMMIT_SHA" 2>/dev/null \
  || gh api "repos/${REPO}/git/refs/heads/${SCREENSHOTS_BRANCH}" \
    -X PATCH -f sha="$COMMIT_SHA" -f force=true
+```

-# Step 5: Build image markdown and post the comment
+Then post the comment with **inline images AND explanations for each screenshot**:
+
+```bash
 REPO_URL="https://raw.githubusercontent.com/${REPO}/${SCREENSHOTS_BRANCH}"
+
+# Build image markdown — each screenshot gets a heading, image, and explanation.
+# This loop is a TEMPLATE. Before running it, you must define a SCREENSHOT_EXPLANATIONS
+# associative array mapping each basename to a 1-2 sentence explanation from Step 6.
+# Example: declare -A SCREENSHOT_EXPLANATIONS=(
+#   ["01-login-page.png"]="Shows the login page loaded successfully with SSO options visible."
+#   ["02-builder-with-block.png"]="The builder canvas displays the newly added block connected to the trigger."
+# )
+declare -A SCREENSHOT_EXPLANATIONS
+# ^^^ Populate this with real explanations from Step 6 before running the loop
+
 IMAGE_MARKDOWN=""
 for img in $RESULTS_DIR/*.png; do
  BASENAME=$(basename "$img")
-  IMAGE_MARKDOWN="$IMAGE_MARKDOWN
-![${BASENAME}](${REPO_URL}/${SCREENSHOTS_DIR}/${BASENAME})"
+  TITLE=$(echo "${BASENAME%.png}" | sed 's/^[0-9]*-//' | sed 's/-/ /g' | awk '{for(i=1;i<=NF;i++) $i=toupper(substr($i,1,1)) tolower(substr($i,2))}1')
+  EXPLANATION="${SCREENSHOT_EXPLANATIONS[$BASENAME]}"
+  IMAGE_MARKDOWN="${IMAGE_MARKDOWN}
+### ${TITLE}
+![${BASENAME}](${REPO_URL}/${SCREENSHOTS_DIR}/${BASENAME})
+${EXPLANATION}
+"
 done

-gh api "repos/${REPO}/issues/$PR_NUMBER/comments" -f body="$(cat <<EOF
+# Write comment body to file to avoid shell interpretation issues with special characters.
+# IMPORTANT: Before running this, you must replace TEST_RESULTS_TABLE below with actual
+# markdown table rows from your test results, e.g.:
+# TEST_RESULTS_TABLE="| 1 | Login flow | PASS |
+# | 2 | Copilot chat | PASS |
+# | 3 | Agent execution | FAIL |"
+COMMENT_FILE=$(mktemp)
+cat > "$COMMENT_FILE" <<INNEREOF
 ## 🧪 E2E Test Report

-$(cat $RESULTS_DIR/test-report.md)
+| # | Scenario | Result |
+|---|----------|--------|
+${TEST_RESULTS_TABLE}

-### Screenshots
 ${IMAGE_MARKDOWN}
-EOF
-)"
+INNEREOF
+
+gh api "repos/${REPO}/issues/$PR_NUMBER/comments" -F body=@"$COMMENT_FILE"
+rm -f "$COMMENT_FILE"
 ```

+**The PR comment MUST include:**
+1. A summary table of all scenarios with PASS/FAIL
+2. Every screenshot rendered inline (not just linked)
+3. A 1-2 sentence explanation below each screenshot describing what it proves
+
 This approach uses the GitHub Git API to create blobs, trees, commits, and refs entirely server-side. No local `git checkout` or `git push` — safe for worktrees and won't interfere with the PR branch.

 ## Fix mode (--fix flag)
--- a/autogpt_platform/backend/backend/data/db_manager.py
+++ b/autogpt_platform/backend/backend/data/db_manager.py
@@ -1,4 +1,6 @@
 import logging
+import os
+import resource
 from contextlib import asynccontextmanager
 from typing import TYPE_CHECKING, Callable, Concatenate, ParamSpec, TypeVar, cast

@@ -173,6 +175,42 @@ class DatabaseManager(AppService):
            logger.info(f"[{self.service_name}] ⏳ Disconnecting Database...")
            await db.disconnect()

+    def _get_memory_usage_bytes(self) -> int:
+        """Get current process memory usage in bytes (RSS).
+
+        Tries cgroup v2 first (GKE with containerd), then cgroup v1,
+        then falls back to getrusage.
+        """
+        for path in (
+            "/sys/fs/cgroup/memory.current",  # cgroup v2
+            "/sys/fs/cgroup/memory/memory.usage_in_bytes",  # cgroup v1
+        ):
+            try:
+                with open(path) as f:
+                    return int(f.read().strip())
+            except (FileNotFoundError, PermissionError):
+                continue
+        # Fallback: RSS from getrusage (macOS returns bytes, Linux returns KB)
+        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+        return rss if os.uname().sysname == "Darwin" else rss * 1024
+
+    async def readiness_check(self) -> str:
+        """Readiness probe: fails when memory exceeds threshold.
+
+        When memory is too high, Kubernetes removes the pod from Service
+        endpoints (no new traffic), letting in-flight requests drain before
+        the liveness probe eventually kills and restarts the pod.
+        """
+        # Default 1.5GB; configurable via env var
+        max_memory = int(os.environ.get("READINESS_MEMORY_LIMIT", 1_610_612_736))
+        current = self._get_memory_usage_bytes()
+        if current > max_memory:
+            raise UnhealthyServiceError(
+                f"Memory {current / 1024**2:.0f}Mi exceeds readiness limit "
+                f"{max_memory / 1024**2:.0f}Mi"
+            )
+        return await self.health_check()
+
    async def health_check(self) -> str:
        if not db.is_connected():
            raise UnhealthyServiceError("Database is not connected")
--- a/autogpt_platform/backend/backend/util/service.py
+++ b/autogpt_platform/backend/backend/util/service.py
@@ -372,6 +372,10 @@ class AppService(BaseAppService, ABC):
        """A method to check the health of the process."""
        return "OK"

+    async def readiness_check(self) -> str:
+        """Readiness probe. Override to add memory/resource checks."""
+        return await self.health_check()
+
    def run(self):
        sentry_init()
        super().run()
@@ -411,6 +415,9 @@ class AppService(BaseAppService, ABC):
        self.fastapi_app.add_api_route(
            "/health_check_async", self.health_check, methods=["POST", "GET"]
        )
+        self.fastapi_app.add_api_route(
+            "/ready", self.readiness_check, methods=["POST", "GET"]
+        )
        self.fastapi_app.add_exception_handler(
            ValueError, self._handle_internal_http_error(400)
        )
Author	SHA1	Message	Date
Zamil Majdy	c7056136d2	fix(backend/db): expose /ready endpoint via POST and GET for consistency	2026-03-23 23:00:32 +07:00
Zamil Majdy	5df20cbbf7	fix(backend/db): add memory-aware readiness probe for database-manager Add /ready endpoint that checks process memory usage against a configurable threshold (READINESS_MEMORY_LIMIT env var, default 1.5GB). When memory exceeds the threshold, the readiness probe fails, causing Kubernetes to remove the pod from Service endpoints. In-flight requests drain gracefully while the liveness probe (/health_check) continues to pass. Once the pod is fully drained, Kubernetes terminates and restarts it with fresh memory. Base AppService gets a default readiness_check() that delegates to health_check(), so other services are unaffected.	2026-03-23 22:56:27 +07:00
Zamil Majdy	c141e3cd33	fix(skills): replace test results table placeholder with variable template Add TEST_RESULTS_TABLE variable with instructions for the agent to populate it with actual test results before posting the PR comment.	2026-03-23 22:54:43 +07:00
Zamil Majdy	2927dce5c3	fix(skills): use associative array for screenshot explanations instead of placeholder Replace static SCREENSHOT_EXPLANATION placeholder with a declare -A map that the agent populates from Step 6 observations before running the loop.	2026-03-23 22:49:48 +07:00
Zamil Majdy	645eb59d60	fix(skills): unify explanation length to 1-2 sentences across Steps 6 and 7	2026-03-23 22:45:23 +07:00
Zamil Majdy	221873ccc0	fix(skills): address PR review comments on /pr-test skill - Replace HTML placeholder comments with clear instruction to insert real screenshot explanations from Step 6 observations - Make Step 6 summary table consistent with Step 7 (3 columns, no Screenshot col) - Use --body-file for gh api comment posting to avoid shell interpretation issues - Add markdown language identifier to fenced code block example	2026-03-23 22:31:37 +07:00
Zamil Majdy	26961444d9	feat(devops): improve /pr-test skill to show screenshots with explanations Update the pr-test skill to: - Take screenshots at every significant test step (not just failures) - Show every screenshot to the user via Read tool with 2-3 sentence explanations - Post PR comments with inline images and per-screenshot descriptions - Simplify the GitHub Git API upload flow (single tree build) - Add clear formatting guidance for the PR comment structure Previously screenshots were saved locally but not shown to the user or included in PR comments with context. Now every screenshot gets displayed inline with an explanation of what it proves.	2026-03-23 21:33:50 +07:00