mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-01-09 15:17:59 -05:00
## Problem

The YouTube transcription block failed on videos whose transcripts were available only in non-English languages. Even when a usable transcript existed in another language, the block raised a `NoTranscriptFound` error because it requested English transcripts only.

**Example video that would fail:** https://www.youtube.com/watch?v=3AMl5d2NKpQ (only has Hungarian transcripts)

**Error message:**

```
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=3AMl5d2NKpQ!
No transcripts were found for any of the requested language codes: ('en',)

For this video (3AMl5d2NKpQ) transcripts are available in the following languages:
(GENERATED)
 - hu ("Hungarian (auto-generated)")
```

## Solution

Implemented language fallback in the `TranscribeYoutubeVideoBlock.get_transcript()` method:

1. **First**, tries to fetch the English transcript (maintains backward compatibility)
2. **If English is unavailable**, lists all available transcripts and selects the first one using this priority:
   - Manually created transcripts (any language)
   - Auto-generated transcripts (any language)
3. **Only fails** if no transcripts exist at all

**Example behavior:**

```python
# Before: Video with only Hungarian transcript
get_transcript("3AMl5d2NKpQ")  # ❌ Raises NoTranscriptFound

# After: Video with only Hungarian transcript
get_transcript("3AMl5d2NKpQ")  # ✅ Returns Hungarian transcript
```

## Changes

- **Modified** `backend/blocks/youtube.py`: Added try/except logic to fall back to any available language when English is not found
- **Added** `test/blocks/test_youtube.py`: Test suite covering URL extraction, language fallback, transcript preferences, and error handling (7 tests)
- **Updated** `docs/content/platform/blocks/youtube.md`: Documented the language fallback behavior and transcript priority order

## Testing

- ✅ All 7 new unit tests pass
- ✅ Block integration test passes
- ✅ Full test suite: 621 passed, 0 failed (no regressions)
- ✅ Code formatting and linting pass

## Impact

This fix lets the YouTube transcription block work with international content while remaining backward compatible:

- ✅ Videos with a transcript in any language can now be transcribed
- ✅ English is still preferred when available
- ✅ No breaking changes to existing functionality
- ✅ Graceful degradation to available languages

Fixes #10637
Fixes https://linear.app/autogpt/issue/OPEN-2626

> [!WARNING]
>
> <details>
> <summary>Firewall rules blocked me from connecting to one or more addresses (expand for details)</summary>
>
> #### I tried to connect to the following addresses, but was blocked by firewall rules:
>
> - `www.youtube.com`
>   - Triggering command: `/home/REDACTED/.cache/pypoetry/virtualenvs/autogpt-platform-backend-Ajv4iu2i-py3.11/bin/python3` (dns block)
>
> If you need me to access, download, or install something from one of these locations, you can either:
>
> - Configure [Actions setup steps](https://gh.io/copilot/actions-setup-steps) to set up my environment, which run before the firewall is enabled
> - Add the appropriate URLs or hosts to the custom allowlist in this repository's [Copilot coding agent settings](https://github.com/Significant-Gravitas/AutoGPT/settings/copilot/coding_agent) (admins only)
>
> </details>

<!-- START COPILOT CODING AGENT SUFFIX -->

<details>

<summary>Original prompt</summary>

> Issue Title: if theres only one lanague available for transcribe youtube return that langage not an error
> Issue Description: `Could not retrieve a transcript for the video https://www.youtube.com/watch?v=3AMl5d2NKpQ! This is most likely caused by: No transcripts were found for any of the requested language codes: ('en',) For this video (3AMl5d2NKpQ) transcripts are available in the following languages: (MANUALLY CREATED) None (GENERATED) - hu ("Hungarian (auto-generated)") (TRANSLATION LANGUAGES) None If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!` you can use this video to test: [https://www.youtube.com/watch?v=3AMl5d2NKpQ\`](https://www.youtube.com/watch?v=3AMl5d2NKpQ%60)
> Fixes https://linear.app/autogpt/issue/OPEN-2626/if-theres-only-one-lanague-available-for-transcribe-youtube-return
>
>
> Comment by User :
> This thread is for an agent session with githubcopilotcodingagent.
>
> Comment by User :
> This thread is for an agent session with githubcopilotcodingagent.
>
> Comment by User :
> This comment thread is synced to a corresponding [GitHub issue](https://github.com/Significant-Gravitas/AutoGPT/issues/10637). All replies are displayed in both locations.
>
>

</details>

<!-- START COPILOT CODING AGENT TIPS -->

---

✨ Let Copilot coding agent [set things up for you](https://github.com/Significant-Gravitas/AutoGPT/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot); the coding agent works faster and does higher-quality work when set up for your repo.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ntindle <8845353+ntindle@users.noreply.github.com>
Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>
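The fallback order described under "Solution" can be sketched roughly as follows. This is an illustration, not the code in `backend/blocks/youtube.py`: the names `get_transcript_sketch` and `pick_fallback` are hypothetical, the `NoTranscriptFound` class is a local stand-in for the library exception, and the `api` argument is assumed to expose `fetch(video_id=...)` and `list(video_id)` the way the mocks in the test file below do. The two private attributes are the ones those tests set on the transcript list.

```python
from typing import Any, Optional


class NoTranscriptFound(Exception):
    """Local stand-in for the library exception of the same name (assumption)."""


def pick_fallback(transcript_list: Any) -> Optional[Any]:
    """Return the first manually created transcript, else the first generated one."""
    groups = (
        transcript_list._manually_created_transcripts,  # preferred
        transcript_list._generated_transcripts,  # used only if no manual one exists
    )
    for group in groups:
        if group:
            # Dicts preserve insertion order, so this is the first listed language.
            return next(iter(group.values()))
    return None


def get_transcript_sketch(api: Any, video_id: str) -> Any:
    """Try the default (English) fetch first, then fall back to any language."""
    try:
        return api.fetch(video_id=video_id)
    except NoTranscriptFound:
        candidate = pick_fallback(api.list(video_id))
        if candidate is None:
            raise  # nothing to fall back to, so re-raise the original error
        return candidate.fetch()
```

Passing the API object in as a parameter keeps the sketch runnable without network access; the tests in this commit get the same effect by patching `YouTubeTranscriptApi`.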
141 lines
5.8 KiB
Python
from unittest.mock import Mock, patch

import pytest
from youtube_transcript_api._errors import NoTranscriptFound
from youtube_transcript_api._transcripts import FetchedTranscript, Transcript

from backend.blocks.youtube import TranscribeYoutubeVideoBlock


class TestTranscribeYoutubeVideoBlock:
    """Test cases for TranscribeYoutubeVideoBlock language fallback functionality."""

    def setup_method(self):
        """Set up test fixtures."""
        self.youtube_block = TranscribeYoutubeVideoBlock()

    def test_extract_video_id_standard_url(self):
        """Test extracting video ID from standard YouTube URL."""
        url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
        video_id = self.youtube_block.extract_video_id(url)
        assert video_id == "dQw4w9WgXcQ"

    def test_extract_video_id_short_url(self):
        """Test extracting video ID from shortened youtu.be URL."""
        url = "https://youtu.be/dQw4w9WgXcQ"
        video_id = self.youtube_block.extract_video_id(url)
        assert video_id == "dQw4w9WgXcQ"

    def test_extract_video_id_embed_url(self):
        """Test extracting video ID from embed URL."""
        url = "https://www.youtube.com/embed/dQw4w9WgXcQ"
        video_id = self.youtube_block.extract_video_id(url)
        assert video_id == "dQw4w9WgXcQ"

    @patch("backend.blocks.youtube.YouTubeTranscriptApi")
    def test_get_transcript_english_available(self, mock_api_class):
        """Test getting transcript when English is available."""
        # Setup mock
        mock_api = Mock()
        mock_api_class.return_value = mock_api
        mock_transcript = Mock(spec=FetchedTranscript)
        mock_api.fetch.return_value = mock_transcript

        # Execute
        result = TranscribeYoutubeVideoBlock.get_transcript("test_video_id")

        # Assert
        assert result == mock_transcript
        mock_api.fetch.assert_called_once_with(video_id="test_video_id")
        mock_api.list.assert_not_called()

    @patch("backend.blocks.youtube.YouTubeTranscriptApi")
    def test_get_transcript_fallback_to_first_available(self, mock_api_class):
        """Test fallback to first available language when English is not available."""
        # Setup mock
        mock_api = Mock()
        mock_api_class.return_value = mock_api

        # Create mock transcript list with Hungarian transcript
        mock_transcript_list = Mock()
        mock_transcript_hu = Mock(spec=Transcript)
        mock_fetched_transcript = Mock(spec=FetchedTranscript)
        mock_transcript_hu.fetch.return_value = mock_fetched_transcript

        # Set up the transcript list to have manually created transcripts empty
        # and generated transcripts with Hungarian
        mock_transcript_list._manually_created_transcripts = {}
        mock_transcript_list._generated_transcripts = {"hu": mock_transcript_hu}

        # Mock API to raise NoTranscriptFound for English, then return list
        mock_api.fetch.side_effect = NoTranscriptFound(
            "test_video_id", ("en",), mock_transcript_list
        )
        mock_api.list.return_value = mock_transcript_list

        # Execute
        result = TranscribeYoutubeVideoBlock.get_transcript("test_video_id")

        # Assert
        assert result == mock_fetched_transcript
        mock_api.fetch.assert_called_once_with(video_id="test_video_id")
        mock_api.list.assert_called_once_with("test_video_id")
        mock_transcript_hu.fetch.assert_called_once()

    @patch("backend.blocks.youtube.YouTubeTranscriptApi")
    def test_get_transcript_prefers_manually_created(self, mock_api_class):
        """Test that manually created transcripts are preferred over generated ones."""
        # Setup mock
        mock_api = Mock()
        mock_api_class.return_value = mock_api

        # Create mock transcript list with both manual and generated transcripts
        mock_transcript_list = Mock()
        mock_transcript_manual = Mock(spec=Transcript)
        mock_transcript_generated = Mock(spec=Transcript)
        mock_fetched_manual = Mock(spec=FetchedTranscript)
        mock_transcript_manual.fetch.return_value = mock_fetched_manual

        # Set up the transcript list
        mock_transcript_list._manually_created_transcripts = {
            "es": mock_transcript_manual
        }
        mock_transcript_list._generated_transcripts = {"hu": mock_transcript_generated}

        # Mock API to raise NoTranscriptFound for English
        mock_api.fetch.side_effect = NoTranscriptFound(
            "test_video_id", ("en",), mock_transcript_list
        )
        mock_api.list.return_value = mock_transcript_list

        # Execute
        result = TranscribeYoutubeVideoBlock.get_transcript("test_video_id")

        # Assert - should use manually created transcript first
        assert result == mock_fetched_manual
        mock_transcript_manual.fetch.assert_called_once()
        mock_transcript_generated.fetch.assert_not_called()

    @patch("backend.blocks.youtube.YouTubeTranscriptApi")
    def test_get_transcript_no_transcripts_available(self, mock_api_class):
        """Test that exception is re-raised when no transcripts are available at all."""
        # Setup mock
        mock_api = Mock()
        mock_api_class.return_value = mock_api

        # Create mock transcript list with no transcripts
        mock_transcript_list = Mock()
        mock_transcript_list._manually_created_transcripts = {}
        mock_transcript_list._generated_transcripts = {}

        # Mock API to raise NoTranscriptFound
        original_exception = NoTranscriptFound(
            "test_video_id", ("en",), mock_transcript_list
        )
        mock_api.fetch.side_effect = original_exception
        mock_api.list.return_value = mock_transcript_list

        # Execute and assert exception is raised
        with pytest.raises(NoTranscriptFound):
            TranscribeYoutubeVideoBlock.get_transcript("test_video_id")