New highlights added January 23, 2024 at 8:45 PM

  • Chunking documents sounds trivial. However, the quality of chunking affects the retrieval process in many ways, and in particular the embeddings of each chunk, which in turn affect the similarity and matching of chunks to user queries. There are two ways of chunking: heuristics-based (using punctuation, end of paragraph, etc.) and semantic chunking (using the semantics in the text to inform the start and end of a chunk). Further research should explore the trade-offs between these methods and their effects on critical downstream processes like embedding and similarity matching. (A sketch contrasting the two chunking approaches follows this list.)
  • From the case studies we identified a set of failure points, presented below. The following section addresses the research question What are the failure points that occur when engineering a RAG system? (A short sketch of the top-K retrieval and consolidation steps behind FP2 and FP3 also follows this list.)
    FP1 Missing Content: The first failure case is asking a question that cannot be answered from the available documents. In the happy case the RAG system responds with something like “Sorry, I don’t know”. However, for questions that are related to the content but have no answer, the system can be fooled into giving a response.
    FP2 Missed the Top Ranked Documents: The answer to the question is in a document, but it did not rank highly enough to be returned to the user. In theory, all documents are ranked and used in the next steps. However, in practice only the top K documents are returned, where K is a value selected based on performance.
    FP3 Not in Context - Consolidation Strategy Limitations: Documents with the answer were retrieved from the database but did not make it into the context for generating an answer. This occurs when many documents are returned from the database and a consolidation process takes place to retrieve the answer.
    FP4 Not Extracted: Here the answer is present in the context, but the large language model fails to extract the correct answer. Typically, this occurs when there is too much noise or contradicting information in the context.
    FP5 Wrong Format: The question involves extracting information in a certain format, such as a table or list, and the large language model ignores the instruction.
    FP6 Incorrect Specificity: The answer is returned in the response but is not specific enough, or is too specific, to address the user’s need. This occurs when the RAG system designers have a desired outcome for a given question, such as teachers for students; in this case, specific educational content should be provided with answers, not just the answer. Incorrect specificity also occurs when users are not sure how to ask a question and are too general.
    FP7 Incomplete: Incomplete answers are not incorrect but miss some of the information, even though that information was in the context and available for extraction. An example is a question such as “What are the key points covered in documents A, B and C?”, where a better approach is to ask these questions separately.
  • Software engineering best practices are still emerging for RAG systems. Software testing and test case generation are among the areas needing refinement. RAG systems require application-specific questions and answers that are often unavailable when indexing unstructured documents. Emerging work has considered using LLMs to generate questions from multiple documents [4]. How to generate realistic, domain-relevant questions and answers remains an open problem. (A sketch of LLM-driven question generation for test cases follows below.)
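
The two chunking approaches from the first highlight can be contrasted with a minimal sketch. This is illustrative only, not the paper's method: the function names, the character budget, the cosine threshold, and the embed() callable are assumptions standing in for whatever splitter and embedding model a real pipeline uses.

```python
# Minimal sketch: chunk_by_heuristics splits on paragraph breaks, while
# chunk_by_semantics groups consecutive sentences whose embeddings stay similar.
# embed() is a placeholder for any sentence-embedding model.
from typing import Callable, List
import re

def chunk_by_heuristics(text: str, max_chars: int = 1000) -> List[str]:
    """Split on blank lines (paragraphs), merging small paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if len(current) + len(p) <= max_chars:
            current = f"{current}\n\n{p}".strip()
        else:
            if current:
                chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks

def chunk_by_semantics(sentences: List[str],
                       embed: Callable[[str], List[float]],
                       threshold: float = 0.7) -> List[str]:
    """Start a new chunk whenever a sentence's embedding drifts from the previous one."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for s in sentences[1:]:
        vec = embed(s)
        if cosine(prev_vec, vec) >= threshold:
            current.append(s)          # semantically close: keep in the same chunk
        else:
            chunks.append(" ".join(current))
            current = [s]              # semantic break: start a new chunk
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

Either output feeds the same embedding and indexing steps; the point of the comparison is that the chunk boundaries, not the indexing code, determine what a query can later match.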
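
FP2 and FP3 both stem from mechanical cut-offs in the retrieval pipeline rather than from the model itself. The sketch below makes that concrete; it is a hedged illustration, not the engineering of any particular system. The value of K, the token budget, and the whitespace-based count_tokens() stand-in are all assumptions.

```python
# Hedged sketch of the cut-offs behind FP2 and FP3: a fixed top-K ranking cut
# (FP2) and a context-length budget applied while consolidating retrieved
# chunks into the prompt (FP3).
from typing import List, Tuple

def retrieve_top_k(scored_chunks: List[Tuple[float, str]], k: int = 4) -> List[str]:
    """FP2: only the k highest-scoring chunks survive; rank k+1 is silently lost."""
    ranked = sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def consolidate_context(chunks: List[str], max_tokens: int = 2048) -> str:
    """FP3: chunks are packed until the token budget is hit; later chunks are dropped."""
    def count_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    context, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break                  # a chunk holding the answer may be discarded here
        context.append(chunk)
        used += cost
    return "\n\n".join(context)
```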
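
For the last highlight, a rough sketch of LLM-driven test case generation under stated assumptions: ask_llm is a placeholder for whichever chat or completion client is in use, and the prompt wording and the "Q:"/"A:" output convention are assumptions of this sketch, not the approach of the cited work [4].

```python
# Hedged sketch: drafting application-specific question/answer pairs from
# indexed documents with an LLM, to seed a RAG evaluation set.
from typing import Callable, Dict, List

PROMPT = (
    "Read the document below and write {n} question/answer pairs that a real "
    "user of this system might ask. Format each pair as 'Q: ...' then 'A: ...'.\n\n"
    "Document:\n{document}"
)

def generate_test_cases(documents: List[str],
                        ask_llm: Callable[[str], str],
                        pairs_per_doc: int = 3) -> List[Dict[str, str]]:
    """Produce question/answer/source records for a RAG evaluation set."""
    cases: List[Dict[str, str]] = []
    for doc in documents:
        reply = ask_llm(PROMPT.format(n=pairs_per_doc, document=doc))
        question = None
        for line in reply.splitlines():
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:") and question:
                cases.append({"question": question,
                              "answer": line[2:].strip(),
                              "source": doc})
                question = None
    # Generated pairs still need human review for realism and domain relevance,
    # which is exactly the open problem the highlight points to.
    return cases
```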