The embedding model (bge-m3 in this case) has a maximum sequence length of 8192 tokens. rlama tries to embed the whole book at once, but Ollama can only fit the first few pages into the embedding request.
Then, at retrieval time, it returns the entire document instead of the relevant passage (because there is no chunking), but truncates it to the first 1000 characters, i.e. the first half-page of the table of contents.
As a result, when queried, the model says: "There is no direct mention of the Buddha in the provided documents." (The word Buddha appears 44,121 times in the documents I indexed.)
A better solution (and, as far as I can tell, what every other RAG system does) is to split each document into chunks that actually fit within the embedding model's context window, and then retrieve those chunks, ideally with metadata about which part of the document each one comes from.
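For illustration, here is a minimal sketch of that kind of chunking. It is not rlama's code; the names `Chunk` and `chunk_document`, the default sizes, and the `page_starts` input (character offsets where each page begins, assumed to come from whatever extracts text from the PDF/EPUB) are all hypothetical. It counts characters as a rough stand-in for tokens; a real implementation would more likely count tokens with the model's tokenizer and prefer splitting on paragraph or sentence boundaries.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # which document the chunk came from
    index: int   # position of the chunk within the document
    start: int   # character offset where the chunk starts
    end: int     # character offset where the chunk ends (exclusive)
    page: int    # 1-based page on which the chunk starts
    text: str


def chunk_document(doc_id: str, text: str, page_starts: list[int],
                   chunk_size: int = 2000, overlap: int = 200) -> list[Chunk]:
    """Split text into overlapping windows of at most chunk_size characters.

    page_starts is a sorted list of character offsets where each page begins
    (hypothetical input, e.g. produced by the text extractor).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    chunks: list[Chunk] = []
    step = chunk_size - overlap  # how far each window advances
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Number of pages that begin at or before `start` = 1-based page number.
        page = max(1, bisect_right(page_starts, start))
        chunks.append(Chunk(doc_id, len(chunks), start, end, page, text[start:end]))
        if end == len(text):
            break
        start += step
    return chunks
```

Each chunk is then embedded separately, so every part of the book gets a vector instead of only the first few pages, and the overlap keeps a passage that straddles a boundary intact in at least one chunk.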
---
I'd also recommend showing the search results to the user (I think a vector search engine on its own is already an extremely useful feature, even without the AI summary / question answering), and changing the prompt so the model cites its sources (e.g. based on chunk metadata such as page number).
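A rough sketch of both suggestions, assuming chunk embeddings have already been computed (e.g. by the embedding model via Ollama; that step is left out here). `IndexedChunk`, `search`, and `build_prompt` are hypothetical names, and the prompt wording is only an example. The ranked results can be shown to the user as-is, and the same results are numbered in the prompt so the model can cite them as `[n]`.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    doc_id: str
    page: int
    text: str
    embedding: list[float]  # assumed precomputed by the embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_embedding: list[float], index: list[IndexedChunk],
           k: int = 5) -> list[tuple[float, IndexedChunk]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine(query_embedding, c.embedding), c) for c in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, results: list[tuple[float, IndexedChunk]]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    lines = ["Answer the question using only the sources below. "
             "Cite sources as [n] after each claim.", ""]
    for n, (_, chunk) in enumerate(results, start=1):
        lines.append(f"[{n}] {chunk.doc_id}, page {chunk.page}:")
        lines.append(chunk.text)
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Because the numbered sources carry the document name and page, the user can check each cited claim against the original text, and the search results alone are useful even when the generated answer is not.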