Traditional OCR systems typically use a two-stage process that first detects the layout of a PDF — dividing it into text, tables, and images — and then recognizes and converts these elements into plain text. With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.
However, this paradigm shift raises an important question:
> If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
We build a practical vectorless, vision-based question-answering implementation for long documents, without relying on OCR. Specifically, we adopt a vectorless, reasoning-based retrieval layer and the multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.
This challenges a foundational assumption in document AI: that text must first be extracted before it can be understood. Traditional RAG pipelines depend on OCR for text recognition, chunk the extracted text, embed those chunks into vectors, and retrieve by similarity.
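For concreteness, here is a minimal sketch of that conventional pipeline, under simplifying assumptions: `pytesseract` stands in for the OCR stage, chunking is a naive fixed-width split, and retrieval is plain cosine similarity over OpenAI embeddings. The model name and chunk size are illustrative, not prescriptive.

```python
# Minimal sketch of a conventional OCR-based RAG pipeline (illustrative only).
from typing import List
import numpy as np
import pytesseract                     # OCR engine binding
from openai import OpenAI

client = OpenAI()

def embed(texts: List[str]) -> np.ndarray:
    """Embed text chunks (embedding model name is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(page_images) -> tuple:
    """OCR every page image, split into fixed-size chunks, and embed each chunk."""
    chunks = []
    for page_image in page_images:
        text = pytesseract.image_to_string(page_image)   # layout and tables flatten here
        chunks += [text[i:i + 1000] for i in range(0, len(text), 1000)]
    return chunks, embed(chunks)

def retrieve(query: str, chunks: List[str], vectors: np.ndarray, k: int = 5) -> List[str]:
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed([query])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

Every stage in this chain (OCR, chunking, embedding) is a lossy transformation applied before the model ever sees the question.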
Each transformation step introduces error: tables fragment, spatial relationships dissolve, annotations separate from their anchors. Vectorless Vision RAG collapses this multi-stage process into just two steps: reasoning-based page retrieval, then visual interpretation. The VLM sees the document as it was meant to be read — a complete visual artifact with intact structure, typography, and spatial semantics.
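A sketch of that two-step flow might look like the following, with assumptions stated up front: `page_summaries` is a short textual outline of the document used only for page selection, `page_images` maps page numbers to rasterized PNG bytes, and the prompts are deliberately simplified. This shows the shape of reasoning-based retrieval followed by visual interpretation, not the exact implementation.

```python
# Minimal sketch of the two-step vectorless Vision RAG flow (illustrative only).
import base64
import json
from openai import OpenAI

client = OpenAI()

def select_pages(query: str, page_summaries: str, k: int = 3) -> list:
    """Step 1: reasoning-based retrieval — the model picks relevant page numbers."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Document outline:\n{page_summaries}\n\n"
                       f"Question: {query}\n"
                       f"Return a JSON list of the {k} most relevant page numbers.",
        }],
    )
    return json.loads(resp.choices[0].message.content)

def answer_from_pages(query: str, page_images: dict, pages: list) -> str:
    """Step 2: visual interpretation — GPT-4.1 reads the raw page images directly."""
    content = [{"type": "text", "text": query}]
    for p in pages:
        b64 = base64.b64encode(page_images[p]).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

Note that the retrieval step returns page numbers chosen by the model's reasoning over the outline rather than by vector distance, and the answer step consumes the untouched page images.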
The implication isn't that OCR or embeddings are obsolete; it's that preprocessing pipelines should justify their complexity. When the final model can consume the original representation directly, intermediate transformations become architectural overhead rather than enabling infrastructure: a relic of a text-first paradigm in a world moving toward reasoning-native, vectorless document understanding.