Fix Gmail body parsing for multipart messages (#9863) (#10071)

The `GmailReadBlock._get_email_body()` method was only inspecting the top-level payload and a single `text/plain` part, causing it to return the fallback string "This email does not contain a text body." for most Gmail messages. This occurred because Gmail messages are typically wrapped in `multipart/alternative` or other multipart containers, which the original implementation couldn't handle.

This critical issue made the Gmail integration unusable for reading email body content, as virtually every real Gmail message uses multipart MIME structures.

### Changes

#### Core Implementation:
- **Replaced simple `_get_email_body()` with recursive multipart parser** that can walk through nested MIME structures
- **Added `_walk_for_body()` method** for recursive traversal of email parts with depth limiting (max 10 levels)
- **Implemented safe base64 decoding** with automatic padding correction in `_decode_base64()`
- **Added attachment body support** via `_download_attachment_body()` for emails where body content is stored as attachments

#### Email Format Support:
- **HTML to text conversion** using `html2text` library for HTML-only emails
- **Multipart/alternative handling** with preference for `text/plain` over `text/html`
- **Nested multipart structure support** (e.g., `multipart/mixed` containing `multipart/alternative`)
- **Single-part email support** (maintains backward compatibility)

#### Dependencies & Testing:
- **Added `html2text = "^2024.2.26"`** to `pyproject.toml` for HTML conversion
- **Created comprehensive unit tests** in `test/blocks/test_gmail.py` covering all email types and edge cases
- **Added error handling and graceful fallbacks** for malformed data and missing dependencies

#### Security & Performance:
- **Recursion depth limiting** prevents infinite loops on malformed email structures
- **Exception handling** ensures graceful degradation when API calls fail
- **Efficient tree traversal** with early returns for better performance

### Checklist

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:

  <details>
  <summary>Test Plan</summary>

  - **Single-part text/plain emails** - Verified correct extraction of plain text content
  - **Multipart/alternative emails** - Tested preference for plain text over HTML when both available
  - **HTML-only emails** - Confirmed HTML to text conversion works correctly
  - **Nested multipart structures** - Tested deeply nested `multipart/mixed` containing `multipart/alternative`
  - **Attachment-based body content** - Verified downloading and decoding of body stored as attachments
  - **Base64 padding edge cases** - Tested malformed base64 data with missing padding
  - **Recursion depth limits** - Confirmed protection against infinite recursion
  - **Error handling scenarios** - Tested graceful fallbacks for API failures and missing dependencies
  - **Backward compatibility** - Ensured existing functionality remains unchanged for edge cases
  - **Integration testing** - Ran standalone verification script with 100% test pass rate

  </details>

#### For configuration changes:
- [x] `.env.example` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my changes
- [x] I have included a list of my configuration changes in the PR description (under **Changes**)

  <details>
  <summary>Configuration Changes</summary>

  - Added `html2text` dependency to `pyproject.toml` - no environment or infrastructure changes required
  - No changes to ports, services, secrets, or databases
  - Fully backward compatible with existing Gmail API configuration

  </details>

---------

Co-authored-by: Toran Bruce Richards <toran.richards@gmail.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
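
The padding correction described for `_decode_base64()` can be illustrated in isolation. The following is a minimal sketch rather than the block's actual code: `decode_b64url` is a hypothetical standalone helper, and it assumes the input is UTF-8 text encoded with the URL-safe base64 alphabet, possibly with its trailing `=` characters stripped.

```python
import base64
from typing import Optional


def decode_b64url(data: Optional[str]) -> Optional[str]:
    """Decode URL-safe base64 text, restoring any stripped '=' padding.

    Hypothetical helper for illustration only. Returns None for empty or
    undecodable input instead of raising.
    """
    if not data:
        return None
    # Base64 works in 4-character groups; -len(data) % 4 is the number of
    # '=' characters needed to complete the last group (0 if already aligned).
    padded = data + "=" * (-len(data) % 4)
    try:
        return base64.urlsafe_b64decode(padded).decode("utf-8")
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses.
        return None


# "Hi" encodes to "SGk=", so "SGk" is the same data with padding removed.
print(decode_b64url("SGk"))
```

Only inputs whose byte length is not a multiple of 3 carry padding in the first place, so a padding test needs such an input to exercise this path.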
2026-04-08 03:00:28 -04:00 · 2025-07-16 01:09:02 +05:30
parent f45bf53681
commit db1f034544
4 changed files with 307 additions and 14 deletions
--- a/autogpt_platform/backend/backend/blocks/google/gmail.py
+++ b/autogpt_platform/backend/backend/blocks/google/gmail.py
@@ -194,7 +194,7 @@ class GmailReadBlock(Block):
                from_=parseaddr(headers.get("from", ""))[1],
                to=parseaddr(headers.get("to", ""))[1],
                date=headers.get("date", ""),
-                body=self._get_email_body(msg),
+                body=self._get_email_body(msg, service),
                sizeEstimate=msg["sizeEstimate"],
                attachments=attachments,
            )
@@ -202,19 +202,81 @@ class GmailReadBlock(Block):

        return email_data

-    def _get_email_body(self, msg):
-        if "parts" in msg["payload"]:
-            for part in msg["payload"]["parts"]:
-                if part["mimeType"] == "text/plain":
-                    return base64.urlsafe_b64decode(part["body"]["data"]).decode(
-                        "utf-8"
-                    )
-        elif msg["payload"]["mimeType"] == "text/plain":
-            return base64.urlsafe_b64decode(msg["payload"]["body"]["data"]).decode(
-                "utf-8"
-            )
+    def _get_email_body(self, msg, service):
+        """Extract email body content with support for multipart messages and HTML conversion."""
+        text = self._walk_for_body(msg["payload"], msg["id"], service)
+        return text or "This email does not contain a readable body."

-        return "This email does not contain a text body."
+    def _walk_for_body(self, part, msg_id, service, depth=0):
+        """Recursively walk through email parts to find readable body content."""
+        # Prevent infinite recursion by limiting depth
+        if depth > 10:
+            return None
+
+        mime_type = part.get("mimeType", "")
+        body = part.get("body", {})
+
+        # Handle text/plain content
+        if mime_type == "text/plain" and body.get("data"):
+            return self._decode_base64(body["data"])
+
+        # Handle text/html content (convert to plain text)
+        if mime_type == "text/html" and body.get("data"):
+            html_content = self._decode_base64(body["data"])
+            if html_content:
+                try:
+                    import html2text
+
+                    h = html2text.HTML2Text()
+                    h.ignore_links = False
+                    h.ignore_images = True
+                    return h.handle(html_content)
+                except ImportError:
+                    # Fallback: return raw HTML if html2text is not available
+                    return html_content
+
+        # Handle content stored as attachment
+        if body.get("attachmentId"):
+            attachment_data = self._download_attachment_body(
+                body["attachmentId"], msg_id, service
+            )
+            if attachment_data:
+                return self._decode_base64(attachment_data)
+
+        # Recursively search in parts
+        for sub_part in part.get("parts", []):
+            text = self._walk_for_body(sub_part, msg_id, service, depth + 1)
+            if text:
+                return text
+
+        return None
+
+    def _decode_base64(self, data):
+        """Safely decode base64 URL-safe data with proper padding."""
+        if not data:
+            return None
+        try:
+            # Add padding if necessary
+            missing_padding = len(data) % 4
+            if missing_padding:
+                data += "=" * (4 - missing_padding)
+            return base64.urlsafe_b64decode(data).decode("utf-8")
+        except Exception:
+            return None
+
+    def _download_attachment_body(self, attachment_id, msg_id, service):
+        """Download attachment content when email body is stored as attachment."""
+        try:
+            attachment = (
+                service.users()
+                .messages()
+                .attachments()
+                .get(userId="me", messageId=msg_id, id=attachment_id)
+                .execute()
+            )
+            return attachment.get("data")
+        except Exception:
+            return None

    def _get_attachments(self, service, message):
        attachments = []
--- a/autogpt_platform/backend/poetry.lock
+++ b/autogpt_platform/backend/poetry.lock
@@ -1904,6 +1904,17 @@ files = [
    {file = "hpack-4.1.0.tar.gz", hash = "sha256:ec5eca154f7056aa06f196a557655c5b009b382873ac8d1e66e79e87535f1dca"},
 ]

+[[package]]
+name = "html2text"
+version = "2024.2.26"
+description = "Turn HTML into equivalent Markdown-structured text."
+optional = false
+python-versions = ">=3.8"
+groups = ["main"]
+files = [
+    {file = "html2text-2024.2.26.tar.gz", hash = "sha256:05f8e367d15aaabc96415376776cdd11afd5127a77fce6e36afc60c563ca2c32"},
+]
+
 [[package]]
 name = "httpcore"
 version = "1.0.9"
@@ -6429,4 +6440,4 @@ cffi = ["cffi (>=1.11)"]
 [metadata]
 lock-version = "2.1"
 python-versions = ">=3.10,<3.13"
-content-hash = "476228d2bf59b90edc5425c462c1263cbc1f2d346f79a826ac5e7efe7823aaa6"
+content-hash = "0f3dfd7fdfb50ffd9b9a046cce0be1b9f290d4e6055ff13c2fbda4faa610ba34"
--- a/autogpt_platform/backend/pyproject.toml
+++ b/autogpt_platform/backend/pyproject.toml
@@ -28,6 +28,7 @@ google-cloud-storage = "^3.2.0"
 googlemaps = "^4.10.0"
 gravitasml = "^0.1.3"
 groq = "^0.29.0"
+html2text = "^2024.2.26"
 jinja2 = "^3.1.6"
 jsonref = "^1.1.0"
 jsonschema = "^4.22.0"
--- a/autogpt_platform/backend/test/blocks/test_gmail.py
+++ b/autogpt_platform/backend/test/blocks/test_gmail.py
@@ -0,0 +1,219 @@
+import base64
+from unittest.mock import Mock, patch
+
+import pytest
+
+from backend.blocks.google.gmail import GmailReadBlock
+
+
+class TestGmailReadBlock:
+    """Test cases for GmailReadBlock email body parsing functionality."""
+
+    def setup_method(self):
+        """Set up test fixtures."""
+        self.gmail_block = GmailReadBlock()
+        self.mock_service = Mock()
+
+    def _encode_base64(self, text: str) -> str:
+        """Helper to encode text as base64 URL-safe."""
+        return base64.urlsafe_b64encode(text.encode("utf-8")).decode("utf-8")
+
+    def test_single_part_text_plain(self):
+        """Test parsing single-part text/plain email."""
+        body_text = "This is a plain text email body."
+        msg = {
+            "id": "test_msg_1",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"data": self._encode_base64(body_text)},
+            },
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == body_text
+
+    def test_multipart_alternative_plain_and_html(self):
+        """Test parsing multipart/alternative with both plain and HTML parts."""
+        plain_text = "This is the plain text version."
+        html_text = "<html><body><p>This is the HTML version.</p></body></html>"
+
+        msg = {
+            "id": "test_msg_2",
+            "payload": {
+                "mimeType": "multipart/alternative",
+                "parts": [
+                    {
+                        "mimeType": "text/plain",
+                        "body": {"data": self._encode_base64(plain_text)},
+                    },
+                    {
+                        "mimeType": "text/html",
+                        "body": {"data": self._encode_base64(html_text)},
+                    },
+                ],
+            },
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        # Should prefer plain text over HTML
+        assert result == plain_text
+
+    def test_html_only_email(self):
+        """Test parsing HTML-only email with conversion to plain text."""
+        html_text = (
+            "<html><body><h1>Hello World</h1><p>This is HTML content.</p></body></html>"
+        )
+
+        msg = {
+            "id": "test_msg_3",
+            "payload": {
+                "mimeType": "text/html",
+                "body": {"data": self._encode_base64(html_text)},
+            },
+        }
+
+        with patch("html2text.HTML2Text") as mock_html2text:
+            mock_converter = Mock()
+            mock_converter.handle.return_value = "Hello World\n\nThis is HTML content."
+            mock_html2text.return_value = mock_converter
+
+            result = self.gmail_block._get_email_body(msg, self.mock_service)
+            assert "Hello World" in result
+            assert "This is HTML content" in result
+
+    def test_html_fallback_when_html2text_unavailable(self):
+        """Test fallback to raw HTML when html2text is not available."""
+        html_text = "<html><body><p>HTML content</p></body></html>"
+
+        msg = {
+            "id": "test_msg_4",
+            "payload": {
+                "mimeType": "text/html",
+                "body": {"data": self._encode_base64(html_text)},
+            },
+        }
+
+        with patch("html2text.HTML2Text", side_effect=ImportError):
+            result = self.gmail_block._get_email_body(msg, self.mock_service)
+            assert result == html_text
+
+    def test_nested_multipart_structure(self):
+        """Test parsing deeply nested multipart structure."""
+        plain_text = "Nested plain text content."
+
+        msg = {
+            "id": "test_msg_5",
+            "payload": {
+                "mimeType": "multipart/mixed",
+                "parts": [
+                    {
+                        "mimeType": "multipart/alternative",
+                        "parts": [
+                            {
+                                "mimeType": "text/plain",
+                                "body": {"data": self._encode_base64(plain_text)},
+                            },
+                        ],
+                    },
+                ],
+            },
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == plain_text
+
+    def test_attachment_body_content(self):
+        """Test parsing email where body is stored as attachment."""
+        attachment_data = self._encode_base64("Body content from attachment.")
+
+        msg = {
+            "id": "test_msg_6",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"attachmentId": "attachment_123"},
+            },
+        }
+
+        # Mock the attachment download
+        self.mock_service.users().messages().attachments().get().execute.return_value = {
+            "data": attachment_data
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "Body content from attachment."
+
+    def test_no_readable_body(self):
+        """Test email with no readable body content."""
+        msg = {
+            "id": "test_msg_7",
+            "payload": {
+                "mimeType": "application/octet-stream",
+                "body": {},
+            },
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "This email does not contain a readable body."
+
+    def test_base64_padding_handling(self):
+        """Test proper handling of base64 data with missing padding."""
+        # 13 bytes encode to 20 characters ending in "==", so padding is really stripped
+        text = "Test content!"
+        encoded = base64.urlsafe_b64encode(text.encode("utf-8")).decode("utf-8")
+        # Remove padding
+        encoded_no_padding = encoded.rstrip("=")
+
+        result = self.gmail_block._decode_base64(encoded_no_padding)
+        assert result == text
+
+    def test_recursion_depth_limit(self):
+        """Test that recursion depth is properly limited."""
+
+        # Create a deeply nested structure that would exceed the limit
+        def create_nested_part(depth):
+            if depth > 15:  # Exceed the limit of 10
+                return {
+                    "mimeType": "text/plain",
+                    "body": {"data": self._encode_base64("Deep content")},
+                }
+            return {
+                "mimeType": "multipart/mixed",
+                "parts": [create_nested_part(depth + 1)],
+            }
+
+        msg = {
+            "id": "test_msg_8",
+            "payload": create_nested_part(0),
+        }
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        # Should return fallback message due to depth limit
+        assert result == "This email does not contain a readable body."
+
+    def test_malformed_base64_handling(self):
+        """Test handling of malformed base64 data."""
+        result = self.gmail_block._decode_base64("invalid_base64_data!!!")
+        assert result is None
+
+    def test_empty_data_handling(self):
+        """Test handling of empty or None data."""
+        assert self.gmail_block._decode_base64("") is None
+        assert self.gmail_block._decode_base64(None) is None
+
+    def test_attachment_download_failure(self):
+        """Test handling of attachment download failure."""
+        msg = {
+            "id": "test_msg_9",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"attachmentId": "invalid_attachment"},
+            },
+        }
+
+        # Mock attachment download failure
+        self.mock_service.users().messages().attachments().get().execute.side_effect = (
+            Exception("Download failed")
+        )
+
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "This email does not contain a readable body."