Fix Gmail body parsing for multipart messages (#9863) (#10071)

The `GmailReadBlock._get_email_body()` method only inspected the
top-level payload and its direct child parts for `text/plain`, causing it
to return the fallback string "This email does not contain a text body."
for most Gmail messages. This occurred because Gmail message bodies are
typically nested inside multipart containers (for example
`multipart/mixed` wrapping `multipart/alternative`) or sent as HTML
only, which the original implementation couldn't handle.

This critical issue made the Gmail integration unusable for reading
email body content, as virtually every real Gmail message uses multipart
MIME structures.
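
To make the failure concrete, here is a minimal sketch that reproduces the previous lookup logic as a free function and runs it against a nested payload. The name `old_get_email_body` and the sample message are hypothetical; the payload only assumes the `mimeType` / `body` / `parts` shape used throughout this change.

```python
import base64


def old_get_email_body(msg: dict) -> str:
    """Mirror of the previous logic: checks only the payload and its direct parts."""
    payload = msg["payload"]
    if "parts" in payload:
        for part in payload["parts"]:
            if part["mimeType"] == "text/plain":
                return base64.urlsafe_b64decode(part["body"]["data"]).decode("utf-8")
    elif payload["mimeType"] == "text/plain":
        return base64.urlsafe_b64decode(payload["body"]["data"]).decode("utf-8")
    return "This email does not contain a text body."


# Illustrative message: the text/plain part sits two levels below the payload.
msg = {
    "id": "example",
    "payload": {
        "mimeType": "multipart/mixed",
        "parts": [
            {
                "mimeType": "multipart/alternative",
                "parts": [
                    {
                        "mimeType": "text/plain",
                        "body": {"data": base64.urlsafe_b64encode(b"Hello").decode("ascii")},
                    }
                ],
            }
        ],
    },
}

print(old_get_email_body(msg))  # This email does not contain a text body.
```

The only direct child is `multipart/alternative`, so the loop never sees the `text/plain` part and the fallback string is returned.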

### Changes 

#### Core Implementation:
- **Replaced simple `_get_email_body()` with recursive multipart
parser** that can walk through nested MIME structures
- **Added `_walk_for_body()` method** for recursive traversal of email
parts with depth limiting (parts nested more than 10 levels below the
top-level payload are skipped)
- **Implemented safe base64 decoding** with automatic padding correction
in `_decode_base64()`
- **Added attachment body support** via `_download_attachment_body()`
for emails where body content is stored as attachments
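
The padding correction can be illustrated in isolation. The following is a minimal standalone sketch rather than the method added in this PR; the function name `decode_base64url` and the sample string are hypothetical.

```python
import base64
from typing import Optional


def decode_base64url(data: Optional[str]) -> Optional[str]:
    """Decode URL-safe base64 text, tolerating stripped '=' padding."""
    if not data:
        return None
    # Base64 input must be a multiple of 4 characters long; restore the padding.
    data += "=" * (-len(data) % 4)
    try:
        return base64.urlsafe_b64decode(data).decode("utf-8")
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses.
        return None


# 13 bytes encode to 20 characters, the last two of which are '=' padding.
encoded = base64.urlsafe_b64encode(b"Hello, Gmail!").decode("ascii").rstrip("=")
print(decode_base64url(encoded))  # Hello, Gmail!
```

Catching `ValueError` instead of a bare `Exception` is a design choice of this sketch; it covers both malformed base64 and bytes that are not valid UTF-8.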

#### Email Format Support:
- **HTML to text conversion** using `html2text` library for HTML-only
emails
- **Multipart/alternative handling** that returns the first readable
part, so `text/plain` wins over `text/html` when it is listed first (the
usual ordering)
- **Nested multipart structure support** (e.g., `multipart/mixed`
containing `multipart/alternative`)
- **Single-part email support** (maintains backward compatibility)
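
The walker in this PR returns the first readable part it encounters. If a strict preference for `text/plain` regardless of part order were wanted, one option is a two-pass search. The sketch below is an alternative illustration, not the code in this PR: `b64url`, `find_part` and `pick_body` are hypothetical names, and HTML is returned unconverted to keep the example free of third-party dependencies.

```python
import base64
from typing import Optional


def b64url(text: str) -> str:
    """Encode text as URL-safe base64, the encoding used for body data here."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def find_part(part: dict, mime_type: str, depth: int = 0) -> Optional[str]:
    """Return the decoded body of the first part with the given MIME type."""
    if depth > 10:  # same cutoff as the walker in this PR
        return None
    data = part.get("body", {}).get("data")
    if part.get("mimeType") == mime_type and data:
        return base64.urlsafe_b64decode(data).decode("utf-8")
    for sub_part in part.get("parts", []):
        found = find_part(sub_part, mime_type, depth + 1)
        if found:
            return found
    return None


def pick_body(payload: dict) -> Optional[str]:
    """Search the whole tree for text/plain first, then fall back to text/html."""
    return find_part(payload, "text/plain") or find_part(payload, "text/html")


# The HTML part is deliberately listed before the plain-text part.
payload = {
    "mimeType": "multipart/mixed",
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "parts": [
                {"mimeType": "text/html", "body": {"data": b64url("<p>Hi</p>")}},
                {"mimeType": "text/plain", "body": {"data": b64url("Hi")}},
            ],
        }
    ],
}

print(pick_body(payload))  # Hi
```

The trade-off is a second traversal for HTML-only messages in exchange for a result that does not depend on part order.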

#### Dependencies & Testing:
- **Added `html2text = "^2024.2.26"`** to `pyproject.toml` for HTML
conversion
- **Created comprehensive unit tests** in `test/blocks/test_gmail.py`
covering all email types and edge cases
- **Added error handling and graceful fallbacks** for malformed data and
missing dependencies

#### Security & Performance:
- **Recursion depth limiting** bounds traversal of deeply nested or
malformed email structures
- **Exception handling** ensures graceful degradation when API calls
fail
- **Efficient tree traversal** with early returns for better performance
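
The exact boundary of the depth limit is easy to misread, so here is a small sketch of the same `depth > 10` check applied to synthetic payloads. `nest` and `leaf_reachable` are hypothetical helpers written for this illustration only.

```python
def nest(levels: int) -> dict:
    """Build a payload whose only text/plain leaf sits `levels` below the root."""
    part = {"mimeType": "text/plain", "body": {"data": "SGk="}}
    for _ in range(levels):
        part = {"mimeType": "multipart/mixed", "parts": [part]}
    return part


def leaf_reachable(part: dict, depth: int = 0, max_depth: int = 10) -> bool:
    """Return True if a text/plain leaf is found before the depth cutoff."""
    if depth > max_depth:
        return False
    if part.get("mimeType") == "text/plain" and part.get("body", {}).get("data"):
        return True
    return any(
        leaf_reachable(sub_part, depth + 1, max_depth)
        for sub_part in part.get("parts", [])
    )


print(leaf_reachable(nest(10)))  # True
print(leaf_reachable(nest(11)))  # False
```

With the root at depth 0, a leaf at depth 10 is still visited, while anything deeper is skipped.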

### Checklist 

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:

<details>
<summary>Test Plan</summary>

- **Single-part text/plain emails** - Verified correct extraction of
plain text content
- **Multipart/alternative emails** - Tested preference for plain text
over HTML when both available
- **HTML-only emails** - Confirmed HTML to text conversion works
correctly
- **Nested multipart structures** - Tested deeply nested
`multipart/mixed` containing `multipart/alternative`
- **Attachment-based body content** - Verified downloading and decoding
of body stored as attachments
- **Base64 padding edge cases** - Tested malformed base64 data with
missing padding
- **Recursion depth limits** - Confirmed protection against infinite
recursion
- **Error handling scenarios** - Tested graceful fallbacks for API
failures and missing dependencies
- **Backward compatibility** - Ensured existing functionality remains
unchanged for edge cases
- **Integration testing** - Ran standalone verification script with 100%
test pass rate

</details>
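
The attachment scenarios in the test plan rely on `unittest.mock.Mock` returning the same child mock for every call, whatever the arguments. The sketch below shows that mechanism on its own; `download` is a hypothetical stand-in for the attachment helper, and the IDs are made up.

```python
from unittest.mock import Mock


def download(service, msg_id: str, attachment_id: str):
    """Fetch attachment data through the chained client calls, or None on failure."""
    try:
        attachment = (
            service.users()
            .messages()
            .attachments()
            .get(userId="me", messageId=msg_id, id=attachment_id)
            .execute()
        )
        return attachment.get("data")
    except Exception:
        return None


service = Mock()
execute = service.users().messages().attachments().get().execute

# Configuring the chain once affects later calls made with different arguments.
execute.return_value = {"data": "SGk="}
print(download(service, "m1", "a1"))  # SGk=

# A side_effect exception simulates an API failure.
execute.side_effect = Exception("Download failed")
print(download(service, "m1", "a1"))  # None
```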

#### For configuration changes:
- [x] `.env.example` is updated or already compatible with my changes
- [x] `docker-compose.yml` is updated or already compatible with my
changes
- [x] I have included a list of my configuration changes in the PR
description (under **Changes**)

<details>
<summary>Configuration Changes</summary>

- Added `html2text` dependency to `pyproject.toml` - no environment or
infrastructure changes required
- No changes to ports, services, secrets, or databases
- Fully backward compatible with existing Gmail API configuration

</details>

---------

Co-authored-by: Toran Bruce Richards <toran.richards@gmail.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
Author: Tejas Dharani
Date: 2025-07-16 01:09:02 +05:30
Committed by GitHub
Parent: f45bf53681
Commit: db1f034544
4 changed files with 307 additions and 14 deletions

@@ -194,7 +194,7 @@ class GmailReadBlock(Block):
                 from_=parseaddr(headers.get("from", ""))[1],
                 to=parseaddr(headers.get("to", ""))[1],
                 date=headers.get("date", ""),
-                body=self._get_email_body(msg),
+                body=self._get_email_body(msg, service),
                 sizeEstimate=msg["sizeEstimate"],
                 attachments=attachments,
             )
@@ -202,19 +202,81 @@ class GmailReadBlock(Block):
         return email_data
-    def _get_email_body(self, msg):
-        if "parts" in msg["payload"]:
-            for part in msg["payload"]["parts"]:
-                if part["mimeType"] == "text/plain":
-                    return base64.urlsafe_b64decode(part["body"]["data"]).decode(
-                        "utf-8"
-                    )
-        elif msg["payload"]["mimeType"] == "text/plain":
-            return base64.urlsafe_b64decode(msg["payload"]["body"]["data"]).decode(
-                "utf-8"
-            )
+    def _get_email_body(self, msg, service):
+        """Extract email body content with support for multipart messages and HTML conversion."""
+        text = self._walk_for_body(msg["payload"], msg["id"], service)
+        return text or "This email does not contain a readable body."
-        return "This email does not contain a text body."
+    def _walk_for_body(self, part, msg_id, service, depth=0):
+        """Recursively walk through email parts to find readable body content."""
+        # Prevent infinite recursion by limiting depth
+        if depth > 10:
+            return None
+        mime_type = part.get("mimeType", "")
+        body = part.get("body", {})
+        # Handle text/plain content
+        if mime_type == "text/plain" and body.get("data"):
+            return self._decode_base64(body["data"])
+        # Handle text/html content (convert to plain text)
+        if mime_type == "text/html" and body.get("data"):
+            html_content = self._decode_base64(body["data"])
+            if html_content:
+                try:
+                    import html2text
+                    h = html2text.HTML2Text()
+                    h.ignore_links = False
+                    h.ignore_images = True
+                    return h.handle(html_content)
+                except ImportError:
+                    # Fallback: return raw HTML if html2text is not available
+                    return html_content
+        # Handle content stored as attachment
+        if body.get("attachmentId"):
+            attachment_data = self._download_attachment_body(
+                body["attachmentId"], msg_id, service
+            )
+            if attachment_data:
+                return self._decode_base64(attachment_data)
+        # Recursively search in parts
+        for sub_part in part.get("parts", []):
+            text = self._walk_for_body(sub_part, msg_id, service, depth + 1)
+            if text:
+                return text
+        return None
+    def _decode_base64(self, data):
+        """Safely decode base64 URL-safe data with proper padding."""
+        if not data:
+            return None
+        try:
+            # Add padding if necessary
+            missing_padding = len(data) % 4
+            if missing_padding:
+                data += "=" * (4 - missing_padding)
+            return base64.urlsafe_b64decode(data).decode("utf-8")
+        except Exception:
+            return None
+    def _download_attachment_body(self, attachment_id, msg_id, service):
+        """Download attachment content when email body is stored as attachment."""
+        try:
+            attachment = (
+                service.users()
+                .messages()
+                .attachments()
+                .get(userId="me", messageId=msg_id, id=attachment_id)
+                .execute()
+            )
+            return attachment.get("data")
+        except Exception:
+            return None
     def _get_attachments(self, service, message):
         attachments = []

@@ -1904,6 +1904,17 @@ files = [
     {file = "hpack-4.1.0.tar.gz", hash = "sha256:ec5eca154f7056aa06f196a557655c5b009b382873ac8d1e66e79e87535f1dca"},
 ]
+[[package]]
+name = "html2text"
+version = "2024.2.26"
+description = "Turn HTML into equivalent Markdown-structured text."
+optional = false
+python-versions = ">=3.8"
+groups = ["main"]
+files = [
+    {file = "html2text-2024.2.26.tar.gz", hash = "sha256:05f8e367d15aaabc96415376776cdd11afd5127a77fce6e36afc60c563ca2c32"},
+]
 [[package]]
 name = "httpcore"
 version = "1.0.9"
@@ -6429,4 +6440,4 @@ cffi = ["cffi (>=1.11)"]
 [metadata]
 lock-version = "2.1"
 python-versions = ">=3.10,<3.13"
-content-hash = "476228d2bf59b90edc5425c462c1263cbc1f2d346f79a826ac5e7efe7823aaa6"
+content-hash = "0f3dfd7fdfb50ffd9b9a046cce0be1b9f290d4e6055ff13c2fbda4faa610ba34"

@@ -28,6 +28,7 @@ google-cloud-storage = "^3.2.0"
 googlemaps = "^4.10.0"
 gravitasml = "^0.1.3"
 groq = "^0.29.0"
+html2text = "^2024.2.26"
 jinja2 = "^3.1.6"
 jsonref = "^1.1.0"
 jsonschema = "^4.22.0"

@@ -0,0 +1,219 @@
+import base64
+from unittest.mock import Mock, patch
+import pytest
+from backend.blocks.google.gmail import GmailReadBlock
+class TestGmailReadBlock:
+    """Test cases for GmailReadBlock email body parsing functionality."""
+    def setup_method(self):
+        """Set up test fixtures."""
+        self.gmail_block = GmailReadBlock()
+        self.mock_service = Mock()
+    def _encode_base64(self, text: str) -> str:
+        """Helper to encode text as base64 URL-safe."""
+        return base64.urlsafe_b64encode(text.encode("utf-8")).decode("utf-8")
+    def test_single_part_text_plain(self):
+        """Test parsing single-part text/plain email."""
+        body_text = "This is a plain text email body."
+        msg = {
+            "id": "test_msg_1",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"data": self._encode_base64(body_text)},
+            },
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == body_text
+    def test_multipart_alternative_plain_and_html(self):
+        """Test parsing multipart/alternative with both plain and HTML parts."""
+        plain_text = "This is the plain text version."
+        html_text = "<html><body><p>This is the HTML version.</p></body></html>"
+        msg = {
+            "id": "test_msg_2",
+            "payload": {
+                "mimeType": "multipart/alternative",
+                "parts": [
+                    {
+                        "mimeType": "text/plain",
+                        "body": {"data": self._encode_base64(plain_text)},
+                    },
+                    {
+                        "mimeType": "text/html",
+                        "body": {"data": self._encode_base64(html_text)},
+                    },
+                ],
+            },
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        # Should prefer plain text over HTML
+        assert result == plain_text
+    def test_html_only_email(self):
+        """Test parsing HTML-only email with conversion to plain text."""
+        html_text = (
+            "<html><body><h1>Hello World</h1><p>This is HTML content.</p></body></html>"
+        )
+        msg = {
+            "id": "test_msg_3",
+            "payload": {
+                "mimeType": "text/html",
+                "body": {"data": self._encode_base64(html_text)},
+            },
+        }
+        with patch("html2text.HTML2Text") as mock_html2text:
+            mock_converter = Mock()
+            mock_converter.handle.return_value = "Hello World\n\nThis is HTML content."
+            mock_html2text.return_value = mock_converter
+            result = self.gmail_block._get_email_body(msg, self.mock_service)
+            assert "Hello World" in result
+            assert "This is HTML content" in result
+    def test_html_fallback_when_html2text_unavailable(self):
+        """Test fallback to raw HTML when html2text is not available."""
+        html_text = "<html><body><p>HTML content</p></body></html>"
+        msg = {
+            "id": "test_msg_4",
+            "payload": {
+                "mimeType": "text/html",
+                "body": {"data": self._encode_base64(html_text)},
+            },
+        }
+        with patch("html2text.HTML2Text", side_effect=ImportError):
+            result = self.gmail_block._get_email_body(msg, self.mock_service)
+            assert result == html_text
+    def test_nested_multipart_structure(self):
+        """Test parsing deeply nested multipart structure."""
+        plain_text = "Nested plain text content."
+        msg = {
+            "id": "test_msg_5",
+            "payload": {
+                "mimeType": "multipart/mixed",
+                "parts": [
+                    {
+                        "mimeType": "multipart/alternative",
+                        "parts": [
+                            {
+                                "mimeType": "text/plain",
+                                "body": {"data": self._encode_base64(plain_text)},
+                            },
+                        ],
+                    },
+                ],
+            },
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == plain_text
+    def test_attachment_body_content(self):
+        """Test parsing email where body is stored as attachment."""
+        attachment_data = self._encode_base64("Body content from attachment.")
+        msg = {
+            "id": "test_msg_6",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"attachmentId": "attachment_123"},
+            },
+        }
+        # Mock the attachment download
+        self.mock_service.users().messages().attachments().get().execute.return_value = {
+            "data": attachment_data
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "Body content from attachment."
+    def test_no_readable_body(self):
+        """Test email with no readable body content."""
+        msg = {
+            "id": "test_msg_7",
+            "payload": {
+                "mimeType": "application/octet-stream",
+                "body": {},
+            },
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "This email does not contain a readable body."
+    def test_base64_padding_handling(self):
+        """Test proper handling of base64 data with missing padding."""
+        # Create base64 data with missing padding
+        text = "Test content"
+        encoded = base64.urlsafe_b64encode(text.encode("utf-8")).decode("utf-8")
+        # Remove padding
+        encoded_no_padding = encoded.rstrip("=")
+        result = self.gmail_block._decode_base64(encoded_no_padding)
+        assert result == text
+    def test_recursion_depth_limit(self):
+        """Test that recursion depth is properly limited."""
+        # Create a deeply nested structure that would exceed the limit
+        def create_nested_part(depth):
+            if depth > 15:  # Exceed the limit of 10
+                return {
+                    "mimeType": "text/plain",
+                    "body": {"data": self._encode_base64("Deep content")},
+                }
+            return {
+                "mimeType": "multipart/mixed",
+                "parts": [create_nested_part(depth + 1)],
+            }
+        msg = {
+            "id": "test_msg_8",
+            "payload": create_nested_part(0),
+        }
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        # Should return fallback message due to depth limit
+        assert result == "This email does not contain a readable body."
+    def test_malformed_base64_handling(self):
+        """Test handling of malformed base64 data."""
+        result = self.gmail_block._decode_base64("invalid_base64_data!!!")
+        assert result is None
+    def test_empty_data_handling(self):
+        """Test handling of empty or None data."""
+        assert self.gmail_block._decode_base64("") is None
+        assert self.gmail_block._decode_base64(None) is None
+    def test_attachment_download_failure(self):
+        """Test handling of attachment download failure."""
+        msg = {
+            "id": "test_msg_9",
+            "payload": {
+                "mimeType": "text/plain",
+                "body": {"attachmentId": "invalid_attachment"},
+            },
+        }
+        # Mock attachment download failure
+        self.mock_service.users().messages().attachments().get().execute.side_effect = (
+            Exception("Download failed")
+        )
+        result = self.gmail_block._get_email_body(msg, self.mock_service)
+        assert result == "This email does not contain a readable body."