Adaptive Chunking for VideoRAG Pipelines with a Newly Gathered Bilingual Educational Dataset
Abstract
This paper introduces an adaptive chunking methodology and a newly gathered bilingual educational dataset for Retrieval-Augmented Generation (RAG) in video question answering (VideoQA). Our adaptive chunking technique, powered by CLIP embeddings and SSIM scores, segments educational videos into semantically coherent chunks by identifying meaningful transitions in the video content. This methodology streamlines the processing of slide-based lectures, ensuring efficient integration of visual and textual modalities for downstream RAG tasks. To support this work, we gathered a bilingual dataset of Persian and English mid- to long-duration academic videos, curated to reflect diverse topics, teaching styles, and multilingual content. Each video is enriched with synthetic question-answer pairs designed to challenge pure large language models (LLMs) and underscore the necessity of retrieval-augmented systems. Our evaluation compares the CLIP-SSIM-based chunking approach against conventional video slicing methods and demonstrates significant improvements across RAGAS metrics, including Answer Relevance, Context Relevance, and Faithfulness. Furthermore, our findings show that the multimodal image-text retrieval scenario achieves the best overall performance, emphasizing the importance of integrating complementary modalities. This research establishes a robust framework for video RAG pipelines, expanding the capabilities of multimodal AI systems for educational content analysis and retrieval.
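To make the chunking idea concrete, the following Python sketch illustrates one plausible realization of CLIP-plus-SSIM boundary detection: frames are sampled at a fixed interval, consecutive frames are compared with both CLIP cosine similarity and SSIM, and a new chunk starts when the combined similarity drops below a threshold. The sampling rate, CLIP backbone ("clip-ViT-B-32" via sentence-transformers), weighting, and threshold are illustrative assumptions, not the exact configuration used in the paper.

```python
import cv2
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from skimage.metrics import structural_similarity

# Assumed CLIP backbone; the paper's exact model may differ.
clip_model = SentenceTransformer("clip-ViT-B-32")

def sample_frames(video_path: str, every_n_sec: float = 1.0):
    """Yield (timestamp_sec, BGR frame) pairs sampled every `every_n_sec` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * every_n_sec)))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / fps, frame
        idx += 1
    cap.release()

def adaptive_chunk_boundaries(video_path: str,
                              clip_weight: float = 0.5,
                              threshold: float = 0.80):
    """Return timestamps (seconds) where a new semantic chunk likely starts.

    Both `clip_weight` and `threshold` are hypothetical values chosen for
    illustration; in practice they would be tuned on held-out lecture videos.
    """
    boundaries = [0.0]
    prev_emb, prev_gray = None, None
    for t, frame in sample_frames(video_path):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        emb = clip_model.encode(rgb)
        if prev_emb is not None:
            # Semantic similarity between consecutive frames (CLIP cosine).
            clip_sim = float(np.dot(emb, prev_emb) /
                             (np.linalg.norm(emb) * np.linalg.norm(prev_emb)))
            # Low-level structural similarity between consecutive grayscale frames.
            ssim_score = structural_similarity(prev_gray, gray)
            combined = clip_weight * clip_sim + (1.0 - clip_weight) * ssim_score
            if combined < threshold:  # sharp drop -> likely slide transition
                boundaries.append(t)
        prev_emb, prev_gray = emb, gray
    return boundaries
```

The returned boundary timestamps can then delimit the video segments whose transcripts and keyframes are indexed for retrieval in the downstream RAG pipeline.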