fix(platform): reduce Sentry noise by filtering expected errors and downgrading log levels (#12430)

## Summary

Reduces Sentry error noise by ~90% by filtering out expected/transient errors and downgrading inappropriate error-level logs to warnings. Most of the top Sentry issues are not actual bugs but expected conditions (user errors, transient infra, business logic) that were incorrectly logged at ERROR level, causing them to be captured as Sentry events.

## Changes

### 1. Sentry `before_send` filter (`metrics.py`)

Added a `before_send` hook to filter known expected errors before they reach Sentry:

- **AMQP/RabbitMQ connection errors**: transient during deploys/restarts
- **User credential errors**: invalid API keys, missing auth headers (user error, not platform bug)
- **Insufficient balance**: expected business logic
- **Blocked IP access**: security check working as intended
- **Discord bot token errors**: misconfiguration, not runtime error
- **Google metadata DNS errors**: expected in non-GCP environments
- **Inactive email recipients**: expected for bounced addresses
- **Unclosed client sessions/connectors**: resource cleanup noise

### 2. Connection retry log levels (`retry.py`)

- `conn_retry` retry callback: final-failure `error` log removed, because the sync/async wrappers already log the same failure (these are infra retries, not bugs)
- `conn_retry` wrapper final failure: `error` → `warning`
- Discord alert send failure: `error` → `warning`

### 3. Block execution Sentry capture (`manager.py`)

- Skip `sentry_sdk.capture_exception()` for `ValueError` subclasses (BlockExecutionError, BlockInputError, InsufficientBalanceError, etc.); these are user-caused errors, not platform bugs. `NotFoundError` and `GraphNotFoundError` are still captured because they could indicate real platform issues
- Downgrade executor shutdown/disconnect errors to warning

### 4. Scheduler log levels (`scheduler.py`)

- Graph validation failure: `error` → `warning` (expected for old/invalid graphs)
- Unable to unschedule graph: `error` → `warning`
- Job listener failure: `error` → `warning`
- Async operation failure: `error` → `warning`

### 5. Discord system alert (`notifications.py`)

- Wrapped `discord_system_alert` endpoint with try/except to prevent unhandled exceptions (fixes AUTOGPT-SERVER-743, AUTOGPT-SERVER-7MW)

### 6. Notification system log levels (`notifications.py`)

- All batch processing errors: `error` → `warning`
- User email not found: `error` → `warning`
- Notification parsing errors: `error` → `warning`
- Email sending failures: `error` → `warning`
- Summary data gathering failure: `error` → `warning`
- Cleaned up unprofessional error messages

### 7. Cloud storage cleanup (`cloud_storage.py`)

- Cleanup error: `error` → `warning`

## Sentry Issues Addressed

### AMQP/RabbitMQ (~3.4M events total)

- [AUTOGPT-SERVER-3H2](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H2): AMQPConnector ConnectionRefusedError (1.2M events)
- [AUTOGPT-SERVER-3H3](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H3): AMQPConnectionWorkflowFailed (770K events)
- [AUTOGPT-SERVER-3H4](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H4): AMQP connection workflow failed (770K events)
- [AUTOGPT-SERVER-3H5](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H5): AMQPConnectionWorkflow reporting failure (770K events)
- [AUTOGPT-SERVER-3H7](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H7): Socket failed to connect (514K events)
- [AUTOGPT-SERVER-3H8](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H8): TCP Connection attempt failed (514K events)
- [AUTOGPT-SERVER-3H6](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-3H6): AMQPConnectionError (93K events)
- [AUTOGPT-SERVER-7SX](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7SX): Error creating transport (69K events)
- [AUTOGPT-SERVER-1TN](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-1TN): ChannelInvalidStateError (39K events)
- [AUTOGPT-SERVER-6JC](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6JC): ConnectionClosedByBroker (2K events)
- [AUTOGPT-SERVER-6RJ/6RK/6RN/6RQ/6RP/6RR](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6RJ): Various connection failures (~15K events)
- [AUTOGPT-SERVER-4A5/6RM/7XN](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-4A5): Connection close/transport errors (~540 events)

### User Credential Errors (~15K events)

- [AUTOGPT-SERVER-6S5](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6S5): Incorrect OpenAI API key (9.2K events)
- [AUTOGPT-SERVER-7W4](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7W4): Incorrect API key in AIConditionBlock (3.4K events)
- [AUTOGPT-SERVER-83Y](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-83Y): AI condition invalid key (2.3K events)
- [AUTOGPT-SERVER-7ZP](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7ZP): Perplexity missing auth header (451 events)
- [AUTOGPT-SERVER-7XK/7XM](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7XK): Anthropic invalid key (125 events)
- [AUTOGPT-SERVER-82C](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-82C): Missing auth header (27 events)
- [AUTOGPT-SERVER-721](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-721): Ideogram invalid token (165 events)

### Business Logic / Validation (~120K events)

- [AUTOGPT-SERVER-7YQ](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7YQ): Disabled block used in graph (56K events)
- [AUTOGPT-SERVER-6W3](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6W3): Graph failed validation (46K events)
- [AUTOGPT-SERVER-6W2](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6W2): Unable to unschedule graph (46K events)
- [AUTOGPT-SERVER-83X](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-83X): Blocked IP access (15K events)
- [AUTOGPT-SERVER-6K9](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6K9): Insufficient balance (4K events)

### Discord Alert Failures (~24K events)

- [AUTOGPT-SERVER-743](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-743): Discord improper token (22K events)
- [AUTOGPT-SERVER-7MW](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-7MW): Discord 403 Missing Access (1.5K events)

### Notification System (~16K events)

- [AUTOGPT-SERVER-550](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-550): Notification batch create error (8.3K events)
- [AUTOGPT-SERVER-58H](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-58H): ValidationError for NotificationEventModel (3K events)
- [AUTOGPT-SERVER-5C6](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-5C6): Get notification batch error (2.1K events)
- [AUTOGPT-SERVER-4BT](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-4BT): Notification batch create error (1.8K events)
- [AUTOGPT-SERVER-5E4](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-5E4): NotificationPreference validation (1.4K events)
- [AUTOGPT-SERVER-508](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-508): Inactive email recipients (702 events)

### Infrastructure / Transient (~20K events)

- [AUTOGPT-SERVER-6WJ](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-6WJ): Unclosed client session (13K events)
- [AUTOGPT-SERVER-745](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-745): Unclosed connector (5.8K events)
- [AUTOGPT-SERVER-4V1](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-4V1): Google metadata DNS error (2.2K events)
- [AUTOGPT-SERVER-80J](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-80J): CloudStorage DNS error (35 events)

### Executor Shutdown

- [AUTOGPT-SERVER-55J](https://significant-gravitas.sentry.io/issues/AUTOGPT-SERVER-55J): Error disconnecting run client (118 events)

## Test plan

- [x] All pre-commit hooks pass (Ruff, isort, Black, Pyright typecheck)
- [x] All changed modules import successfully
- [ ] Deploy to staging and verify Sentry event volume drops significantly
- [ ] Verify legitimate errors still appear in Sentry
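
## Reviewer note: how a `before_send` filter drops events

The `before_send` contract is simple enough to exercise without the SDK: the hook receives the event dict and a hint dict, and returning `None` drops the event while returning the event sends it. The sketch below is a reduced illustration of that contract, not the `_before_send` added in this commit; `EXPECTED_KEYWORDS`, `drop_expected`, and `hint_for` are hypothetical names, and the keyword list is deliberately shortened.

```python
from typing import Any, Optional

# Hypothetical, shortened keyword list used only for this illustration.
EXPECTED_KEYWORDS = ("insufficient balance", "incorrect api key")


def drop_expected(event: dict[str, Any], hint: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return None to drop the event, or the event itself to let it through."""
    exc_info = hint.get("exc_info")
    if exc_info:
        _, exc_value, _ = exc_info
        message = str(exc_value).lower() if exc_value else ""
        if any(keyword in message for keyword in EXPECTED_KEYWORDS):
            return None
    return event


def hint_for(exc: BaseException) -> dict[str, Any]:
    """Build a hint shaped like the one passed for a captured exception."""
    return {"exc_info": (type(exc), exc, exc.__traceback__)}


# A user-caused error is dropped; an unexpected error is kept.
dropped = drop_expected({"level": "error"}, hint_for(ValueError("Insufficient balance for user")))
kept = drop_expected({"level": "error"}, hint_for(RuntimeError("unexpected failure")))
```

Because matching is done on lowercased message substrings, a filter like this can also hide an unrelated error whose message happens to contain one of the keywords, which is why the test plan keeps "verify legitimate errors still appear" as a separate step.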
2026-04-08 03:00:28 -04:00 · 2026-03-16 17:29:01 +07:00
parent 53d58e21d3
commit c9c3d54b2b
7 changed files with 171 additions and 57 deletions
--- a/autogpt_platform/backend/backend/executor/manager.py
+++ b/autogpt_platform/backend/backend/executor/manager.py
@@ -61,7 +61,12 @@ from backend.util.decorator import (
    error_logged,
    time_measured,
 )
-from backend.util.exceptions import InsufficientBalanceError, ModerationError
+from backend.util.exceptions import (
+    GraphNotFoundError,
+    InsufficientBalanceError,
+    ModerationError,
+    NotFoundError,
+)
 from backend.util.file import clean_exec_files
 from backend.util.logging import TruncatedLogger, configure_logging
 from backend.util.metrics import DiscordChannel
@@ -375,9 +380,16 @@ async def execute_node(
            log_metadata.debug("Node produced output", **{output_name: output_data})
            yield output_name, output_data
    except Exception as ex:
-        # Capture exception WITH context still set before restoring scope
-        sentry_sdk.capture_exception(error=ex, scope=scope)
-        sentry_sdk.flush()  # Ensure it's sent before we restore scope
+        # Only capture unexpected errors to Sentry, not user-caused ones.
+        # Most ValueError subclasses here are expected (BlockExecutionError,
+        # InsufficientBalanceError, plain ValueError for auth/disabled blocks, etc.)
+        # but NotFoundError/GraphNotFoundError could indicate real platform issues.
+        is_expected = isinstance(ex, ValueError) and not isinstance(
+            ex, (NotFoundError, GraphNotFoundError)
+        )
+        if not is_expected:
+            sentry_sdk.capture_exception(error=ex, scope=scope)
+            sentry_sdk.flush()
        # Re-raise to maintain normal error flow
        raise
    finally:
@@ -1478,7 +1490,7 @@ class ExecutionProcessor:
                    alert_message, DiscordChannel.PRODUCT
                )
            except Exception as e:
-                logger.error(f"Failed to send low balance Discord alert: {e}")
+                logger.warning(f"Failed to send low balance Discord alert: {e}")


 class ExecutionManager(AppProcess):
@@ -1900,17 +1912,16 @@ class ExecutionManager(AppProcess):
            channel = client.get_channel()
            channel.connection.add_callback_threadsafe(lambda: channel.stop_consuming())

-            try:
-                thread.join(timeout=300)
-            except TimeoutError:
-                logger.error(
+            thread.join(timeout=300)
+            if thread.is_alive():
+                logger.warning(
                    f"{prefix} ⚠️ Run thread did not finish in time, forcing disconnect"
                )

            client.disconnect()
            logger.info(f"{prefix} ✅ Run client disconnected")
        except Exception as e:
-            logger.error(f"{prefix} ⚠️ Error disconnecting run client: {type(e)} {e}")
+            logger.warning(f"{prefix} ⚠️ Error disconnecting run client: {type(e)} {e}")

    def cleanup(self):
        """Override cleanup to implement graceful shutdown with active execution waiting."""
@@ -1926,7 +1937,9 @@ class ExecutionManager(AppProcess):
            )
            logger.info(f"{prefix} ✅ Exec consumer has been signaled to stop")
        except Exception as e:
-            logger.error(f"{prefix} ⚠️ Error signaling consumer to stop: {type(e)} {e}")
+            logger.warning(
+                f"{prefix} ⚠️ Error signaling consumer to stop: {type(e)} {e}"
+            )

        # Wait for active executions to complete
        if self.active_graph_runs:
@@ -1957,7 +1970,7 @@ class ExecutionManager(AppProcess):
                waited += wait_interval

            if self.active_graph_runs:
-                logger.error(
+                logger.warning(
                    f"{prefix} ⚠️ {len(self.active_graph_runs)} executions still running after {max_wait}s"
                )
            else:
@@ -1968,7 +1981,7 @@ class ExecutionManager(AppProcess):
            self.executor.shutdown(cancel_futures=True, wait=False)
            logger.info(f"{prefix} ✅ Executor shutdown completed")
        except Exception as e:
-            logger.error(f"{prefix} ⚠️ Error during executor shutdown: {type(e)} {e}")
+            logger.warning(f"{prefix} ⚠️ Error during executor shutdown: {type(e)} {e}")

        # Release remaining execution locks
        try:
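
Reviewer note on the shutdown hunk above: `threading.Thread.join(timeout=...)` returns `None` whether or not the thread finished and does not raise on timeout, so the removed `except TimeoutError` branch could not fire. Checking `is_alive()` after the join is the standard way to detect a timeout. A minimal standalone sketch of that pattern follows; `join_with_timeout` is a hypothetical helper name, not something from this codebase.

```python
import threading


def join_with_timeout(thread: threading.Thread, timeout: float) -> bool:
    """Wait up to `timeout` seconds; return True if the thread finished."""
    thread.join(timeout=timeout)  # returns None in both cases, never raises on timeout
    return not thread.is_alive()


stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)  # blocks until `stop` is set
worker.start()

# The worker is still blocked, so the join times out.
finished_in_time = join_with_timeout(worker, timeout=0.05)

# Release the worker so the example cleans up after itself.
stop.set()
finished_after_stop = join_with_timeout(worker, timeout=1.0)
```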
--- a/autogpt_platform/backend/backend/executor/scheduler.py
+++ b/autogpt_platform/backend/backend/executor/scheduler.py
@@ -94,7 +94,7 @@ SCHEDULER_OPERATION_TIMEOUT_SECONDS = 300  # 5 minutes for scheduler operations
 def job_listener(event):
    """Logs job execution outcomes for better monitoring."""
    if event.exception:
-        logger.error(
+        logger.warning(
            f"Job {event.job_id} failed: {type(event.exception).__name__}: {event.exception}"
        )
    else:
@@ -137,7 +137,7 @@ def run_async(coro, timeout: float = SCHEDULER_OPERATION_TIMEOUT_SECONDS):
    try:
        return future.result(timeout=timeout)
    except Exception as e:
-        logger.error(f"Async operation failed: {type(e).__name__}: {e}")
+        logger.warning(f"Async operation failed: {type(e).__name__}: {e}")
        raise


@@ -186,7 +186,7 @@ async def _execute_graph(**kwargs):


 async def _handle_graph_validation_error(args: "GraphExecutionJobArgs") -> None:
-    logger.error(
+    logger.warning(
        f"Scheduled Graph {args.graph_id} failed validation. Unscheduling graph"
    )
    if args.schedule_id:
@@ -196,8 +196,9 @@ async def _handle_graph_validation_error(args: "GraphExecutionJobArgs") -> None:
            user_id=args.user_id,
        )
    else:
-        logger.error(
-            f"Unable to unschedule graph: {args.graph_id} as this is an old job with no associated schedule_id please remove manually"
+        logger.warning(
+            f"Unable to unschedule graph: {args.graph_id} as this is an old job "
+            f"with no associated schedule_id please remove manually"
        )


--- a/autogpt_platform/backend/backend/notifications/notifications.py
+++ b/autogpt_platform/backend/backend/notifications/notifications.py
@@ -303,9 +303,9 @@ class NotificationManager(AppService):
                    )

                    if not oldest_message:
-                        # this should never happen
-                        logger.error(
-                            f"Batch for user {batch.user_id} and type {notification_type} has no oldest message whichshould never happen!!!!!!!!!!!!!!!!"
+                        logger.warning(
+                            f"Batch for user {batch.user_id} and type {notification_type} "
+                            f"has no oldest message — batch may have been cleared concurrently"
                        )
                        continue

@@ -318,7 +318,7 @@ class NotificationManager(AppService):
                        ).get_user_email_by_id(batch.user_id)

                        if not recipient_email:
-                            logger.error(
+                            logger.warning(
                                f"User email not found for user {batch.user_id}"
                            )
                            continue
@@ -344,7 +344,7 @@ class NotificationManager(AppService):
                        ).get_user_notification_batch(batch.user_id, notification_type)

                        if not batch_data or not batch_data.notifications:
-                            logger.error(
+                            logger.warning(
                                f"Batch data not found for user {batch.user_id}"
                            )
                            # Clear the batch
@@ -372,7 +372,7 @@ class NotificationManager(AppService):
                                    )
                                )
                            except Exception as e:
-                                logger.error(
+                                logger.warning(
                                    f"Error parsing notification event: {e=}, {db_event=}"
                                )
                                continue
@@ -415,7 +415,10 @@ class NotificationManager(AppService):
    async def discord_system_alert(
        self, content: str, channel: DiscordChannel = DiscordChannel.PLATFORM
    ):
-        await discord_send_alert(content, channel)
+        try:
+            await discord_send_alert(content, channel)
+        except Exception as e:
+            logger.warning(f"Failed to send Discord system alert: {e}")

    async def _queue_scheduled_notification(self, event: SummaryParamsEventModel):
        """Queue a scheduled notification - exposed method for other services to call"""
@@ -516,7 +519,7 @@ class NotificationManager(AppService):
                raise ValueError("Invalid event type or params")

        except Exception as e:
-            logger.error(f"Failed to gather summary data: {e}")
+            logger.warning(f"Failed to gather summary data: {e}")
            # Return sensible defaults in case of error
            if event_type == NotificationType.DAILY_SUMMARY and isinstance(
                params, DailySummaryParams
@@ -562,8 +565,9 @@ class NotificationManager(AppService):
            should_retry=False
        ).get_user_notification_oldest_message_in_batch(user_id, event_type)
        if not oldest_message:
-            logger.error(
-                f"Batch for user {user_id} and type {event_type} has no oldest message whichshould never happen!!!!!!!!!!!!!!!!"
+            logger.warning(
+                f"Batch for user {user_id} and type {event_type} "
+                f"has no oldest message — batch may have been cleared concurrently"
            )
            return False
        oldest_age = oldest_message.created_at
@@ -585,7 +589,7 @@ class NotificationManager(AppService):
                get_notif_data_type(event.type)
            ].model_validate_json(message)
        except Exception as e:
-            logger.error(f"Error parsing message due to non matching schema {e}")
+            logger.warning(f"Error parsing message due to non matching schema {e}")
            return None

    async def _process_admin_message(self, message: str) -> bool:
@@ -614,7 +618,7 @@ class NotificationManager(AppService):
                should_retry=False
            ).get_user_email_by_id(event.user_id)
            if not recipient_email:
-                logger.error(f"User email not found for user {event.user_id}")
+                logger.warning(f"User email not found for user {event.user_id}")
                return False

            should_send = await self._should_email_user_based_on_preference(
@@ -651,7 +655,7 @@ class NotificationManager(AppService):
                should_retry=False
            ).get_user_email_by_id(event.user_id)
            if not recipient_email:
-                logger.error(f"User email not found for user {event.user_id}")
+                logger.warning(f"User email not found for user {event.user_id}")
                return False

            should_send = await self._should_email_user_based_on_preference(
@@ -672,7 +676,7 @@ class NotificationManager(AppService):
                should_retry=False
            ).get_user_notification_batch(event.user_id, event.type)
            if not batch or not batch.notifications:
-                logger.error(f"Batch not found for user {event.user_id}")
+                logger.warning(f"Batch not found for user {event.user_id}")
                return False
            unsub_link = generate_unsubscribe_link(event.user_id)

@@ -745,7 +749,7 @@ class NotificationManager(AppService):
                                        f"Removed {len(chunk_ids)} sent notifications from batch"
                                    )
                                except Exception as e:
-                                    logger.error(
+                                    logger.warning(
                                        f"Failed to remove sent notifications: {e}"
                                    )
                                    # Continue anyway - better to risk duplicates than lose emails
@@ -770,7 +774,7 @@ class NotificationManager(AppService):
                        else:
                            # Message is too large even after size reduction
                            if attempt_size == 1:
-                                logger.error(
+                                logger.warning(
                                    f"Failed to send notification at index {i}: "
                                    f"Single notification exceeds email size limit "
                                    f"({len(test_message):,} chars > {MAX_EMAIL_SIZE:,} chars). "
@@ -789,7 +793,7 @@ class NotificationManager(AppService):
                                            f"Removed oversized notification {chunk_ids[0]} from batch permanently"
                                        )
                                    except Exception as e:
-                                        logger.error(
+                                        logger.warning(
                                            f"Failed to remove oversized notification: {e}"
                                        )

@@ -823,7 +827,7 @@ class NotificationManager(AppService):
                                        f"Set email verification to false for user {event.user_id}"
                                    )
                                except Exception as deactivation_error:
-                                    logger.error(
+                                    logger.warning(
                                        f"Failed to deactivate email for user {event.user_id}: "
                                        f"{deactivation_error}"
                                    )
@@ -835,7 +839,7 @@ class NotificationManager(AppService):
                                        f"Disabled all notification preferences for user {event.user_id}"
                                    )
                                except Exception as disable_error:
-                                    logger.error(
+                                    logger.warning(
                                        f"Failed to disable notification preferences: {disable_error}"
                                    )

@@ -848,7 +852,7 @@ class NotificationManager(AppService):
                                        f"Cleared ALL notification batches for user {event.user_id}"
                                    )
                                except Exception as remove_error:
-                                    logger.error(
+                                    logger.warning(
                                        f"Failed to clear batches for inactive recipient: {remove_error}"
                                    )

@@ -859,7 +863,7 @@ class NotificationManager(AppService):
                                "422" in error_message
                                or "unprocessable" in error_message
                            ):
-                                logger.error(
+                                logger.warning(
                                    f"Failed to send notification at index {i}: "
                                    f"Malformed notification data rejected by Postmark. "
                                    f"Error: {e}. Removing from batch permanently."
@@ -877,7 +881,7 @@ class NotificationManager(AppService):
                                            "Removed malformed notification from batch permanently"
                                        )
                                    except Exception as remove_error:
-                                        logger.error(
+                                        logger.warning(
                                            f"Failed to remove malformed notification: {remove_error}"
                                        )
                            # Check if it's a ValueError for size limit
@@ -885,14 +889,14 @@ class NotificationManager(AppService):
                                isinstance(e, ValueError)
                                and "too large" in error_message
                            ):
-                                logger.error(
+                                logger.warning(
                                    f"Failed to send notification at index {i}: "
                                    f"Notification size exceeds email limit. "
                                    f"Error: {e}. Skipping this notification."
                                )
                            # Other API errors
                            else:
-                                logger.error(
+                                logger.warning(
                                    f"Failed to send notification at index {i}: "
                                    f"Email API error ({error_type}): {e}. "
                                    f"Skipping this notification."
@@ -907,7 +911,9 @@ class NotificationManager(AppService):

                if not chunk_sent:
                    # Should not reach here due to single notification handling
-                    logger.error(f"Failed to send notifications starting at index {i}")
+                    logger.warning(
+                        f"Failed to send notifications starting at index {i}"
+                    )
                    failed_indices.append(i)
                    i += 1

@@ -946,7 +952,7 @@ class NotificationManager(AppService):
                should_retry=False
            ).get_user_email_by_id(event.user_id)
            if not recipient_email:
-                logger.error(f"User email not found for user {event.user_id}")
+                logger.warning(f"User email not found for user {event.user_id}")
                return False
            should_send = await self._should_email_user_based_on_preference(
                event.user_id, event.type
@@ -1007,7 +1013,10 @@ class NotificationManager(AppService):
                        # Let message.process() handle the rejection
                        pass
                    except Exception as e:
-                        logger.error(f"Error processing message in {queue_name}: {e}")
+                        logger.warning(
+                            f"Error processing message in {queue_name}: {e}",
+                            exc_info=True,
+                        )
                        # Let message.process() handle the rejection
                        raise
        except asyncio.CancelledError:
--- a/autogpt_platform/backend/backend/notifications/test_notifications.py
+++ b/autogpt_platform/backend/backend/notifications/test_notifications.py
@@ -256,9 +256,9 @@ class TestNotificationErrorHandling:
            assert 2 not in successful_indices  # Index 2 failed

            # Verify 422 error was logged
-            error_calls = [call[0][0] for call in mock_logger.error.call_args_list]
+            warning_calls = [call[0][0] for call in mock_logger.warning.call_args_list]
            assert any(
-                "422" in call or "malformed" in call.lower() for call in error_calls
+                "422" in call or "malformed" in call.lower() for call in warning_calls
            )

            # Verify all notifications were removed (4 successful + 1 malformed)
@@ -371,10 +371,10 @@ class TestNotificationErrorHandling:
            assert 3 not in successful_indices  # Index 3 was not sent

            # Verify oversized error was logged
-            error_calls = [call[0][0] for call in mock_logger.error.call_args_list]
+            warning_calls = [call[0][0] for call in mock_logger.warning.call_args_list]
            assert any(
                "exceeds email size limit" in call or "oversized" in call.lower()
-                for call in error_calls
+                for call in warning_calls
            )

    @pytest.mark.asyncio
@@ -478,10 +478,10 @@ class TestNotificationErrorHandling:
            assert 1 in failed_indices  # Index 1 failed

            # Verify generic error was logged
-            error_calls = [call[0][0] for call in mock_logger.error.call_args_list]
+            warning_calls = [call[0][0] for call in mock_logger.warning.call_args_list]
            assert any(
                "api error" in call.lower() or "skipping" in call.lower()
-                for call in error_calls
+                for call in warning_calls
            )

            # Only successful ones should be removed from batch (failed one stays for retry)
--- a/autogpt_platform/backend/backend/util/cloud_storage.py
+++ b/autogpt_platform/backend/backend/util/cloud_storage.py
@@ -613,5 +613,5 @@ async def cleanup_expired_files_async() -> int:
            )
            return deleted_count
        except Exception as e:
-            logger.error(f"[CloudStorage] Error during cloud storage cleanup: {e}")
+            logger.warning(f"[CloudStorage] Error during cloud storage cleanup: {e}")
            return 0
--- a/autogpt_platform/backend/backend/util/metrics.py
+++ b/autogpt_platform/backend/backend/util/metrics.py
@@ -10,7 +10,7 @@ from sentry_sdk.integrations.launchdarkly import LaunchDarklyIntegration
 from sentry_sdk.integrations.logging import LoggingIntegration

 from backend.util import feature_flag
-from backend.util.settings import Settings
+from backend.util.settings import BehaveAs, Settings

 settings = Settings()
 logger = logging.getLogger(__name__)
@@ -21,6 +21,95 @@ class DiscordChannel(str, Enum):
    PRODUCT = "product"  # For product alerts (low balance, zero balance, etc.)


+def _before_send(event, hint):
+    """Filter out expected/transient errors from Sentry to reduce noise."""
+    if "exc_info" in hint:
+        exc_type, exc_value, _ = hint["exc_info"]
+        exc_msg = str(exc_value).lower() if exc_value else ""
+
+        # AMQP/RabbitMQ transient connection errors — expected during deploys
+        amqp_keywords = [
+            "amqpconnection",
+            "amqpconnector",
+            "connection_forced",
+            "channelinvalidstateerror",
+            "no active transport",
+        ]
+        if any(kw in exc_msg for kw in amqp_keywords):
+            return None
+
+        # "connection refused" only for AMQP-related exceptions (not other services)
+        if "connection refused" in exc_msg:
+            exc_module = getattr(exc_type, "__module__", "") or ""
+            exc_name = getattr(exc_type, "__name__", "") or ""
+            amqp_indicators = ["aio_pika", "aiormq", "amqp", "pika", "rabbitmq"]
+            if any(
+                ind in exc_module.lower() or ind in exc_name.lower()
+                for ind in amqp_indicators
+            ) or any(kw in exc_msg for kw in ["amqp", "pika", "rabbitmq"]):
+                return None
+
+        # User-caused credential/auth errors — not platform bugs
+        user_auth_keywords = [
+            "incorrect api key",
+            "invalid x-api-key",
+            "missing authentication header",
+            "invalid api token",
+            "authentication_error",
+        ]
+        if any(kw in exc_msg for kw in user_auth_keywords):
+            return None
+
+        # Expected business logic — insufficient balance
+        if "insufficient balance" in exc_msg or "no credits left" in exc_msg:
+            return None
+
+        # Expected security check — blocked IP access
+        if "access to blocked or private ip" in exc_msg:
+            return None
+
+        # Discord bot token misconfiguration — not a platform error
+        if "improper token has been passed" in exc_msg or (
+            exc_type and exc_type.__name__ == "Forbidden" and "50001" in exc_msg
+        ):
+            return None
+
+        # Google metadata DNS errors — expected in non-GCP environments
+        if (
+            "metadata.google.internal" in exc_msg
+            and settings.config.behave_as != BehaveAs.CLOUD
+        ):
+            return None
+
+        # Inactive email recipients — expected for bounced addresses
+        if "marked as inactive" in exc_msg or "inactive addresses" in exc_msg:
+            return None
+
+    # Also filter log-based events for known noisy messages.
+    # Sentry's LoggingIntegration stores log messages under "logentry", not "message".
+    logentry = event.get("logentry") or {}
+    log_msg = (
+        logentry.get("formatted") or logentry.get("message") or event.get("message")
+    )
+    if event.get("logger") and log_msg:
+        msg = log_msg.lower()
+        noisy_patterns = [
+            "amqpconnection",
+            "connection_forced",
+            "unclosed client session",
+            "unclosed connector",
+        ]
+        if any(p in msg for p in noisy_patterns):
+            return None
+        # "connection refused" in logs only when AMQP-related context is present
+        if "connection refused" in msg and any(
+            ind in msg for ind in ("amqp", "pika", "rabbitmq", "aio_pika", "aiormq")
+        ):
+            return None
+
+    return event
+
+
 def sentry_init():
    sentry_dsn = settings.secrets.sentry_dsn
    integrations = []
@@ -35,6 +124,7 @@ def sentry_init():
        profiles_sample_rate=1.0,
        environment=f"app:{settings.config.app_env.value}-behave:{settings.config.behave_as.value}",
        _experiments={"enable_logs": True},
+        before_send=_before_send,
        integrations=[
            AsyncioIntegration(),
            LoggingIntegration(sentry_logs_level=logging.INFO),
--- a/autogpt_platform/backend/backend/util/retry.py
+++ b/autogpt_platform/backend/backend/util/retry.py
@@ -64,7 +64,7 @@ def send_rate_limited_discord_alert(
        return True

    except Exception as alert_error:
-        logger.error(f"Failed to send Discord alert: {alert_error}")
+        logger.warning(f"Failed to send Discord alert: {alert_error}")
        return False


@@ -182,7 +182,8 @@ def conn_retry(
        func_name = getattr(retry_state.fn, "__name__", "unknown")

        if retry_state.outcome.failed and retry_state.next_action is None:
-            logger.error(f"{prefix} {action_name} failed after retries: {exception}")
+            # Final failure is logged by sync_wrapper/async_wrapper — skip here to avoid duplicates
+            pass
        else:
            if attempt_number == EXCESSIVE_RETRY_THRESHOLD:
                if send_rate_limited_discord_alert(
@@ -225,7 +226,7 @@ def conn_retry(
                logger.info(f"{prefix} {action_name} completed successfully.")
                return result
            except Exception as e:
-                logger.error(f"{prefix} {action_name} failed after retries: {e}")
+                logger.warning(f"{prefix} {action_name} failed after retries: {e}")
                raise

        @wraps(func)
@@ -237,7 +238,7 @@ def conn_retry(
                logger.info(f"{prefix} {action_name} completed successfully.")
                return result
            except Exception as e:
-                logger.error(f"{prefix} {action_name} failed after retries: {e}")
+                logger.warning(f"{prefix} {action_name} failed after retries: {e}")
                raise

        return async_wrapper if is_coroutine else sync_wrapper
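
Reviewer note on why the level changes matter: Sentry's `LoggingIntegration` turns log records into events only at or above its `event_level`, which defaults to `logging.ERROR`, and `sentry_init` does not override it. A `warning` call therefore stays in the logs without creating an issue. The sketch below imitates that threshold with a plain `logging.Handler`; `ErrorCollector` and the logger name are hypothetical stand-ins and this is not how the SDK is implemented.

```python
import logging


class ErrorCollector(logging.Handler):
    """Stand-in for an error tracker: only sees records at ERROR and above."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.captured: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.captured.append(record.getMessage())


demo_logger = logging.getLogger("sentry_noise_demo")  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo records away from the root logger
collector = ErrorCollector()
demo_logger.addHandler(collector)

demo_logger.warning("RabbitMQ connect failed after retries")  # below the threshold
demo_logger.error("unexpected failure in executor")  # at the threshold
```

Only the second message ends up in `collector.captured`, which mirrors the intent of the `error` → `warning` changes: expected conditions remain visible in logs but no longer create events.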