Endigest AI Core Summary
This post builds a RAG-based code knowledge assistant, compares three chunking strategies for it, and evaluates them with MLflow.
• Naive fixed-size character chunking treats code as plain text and ignores syntax boundaries, so functions get split mid-body and semantic context is lost
• LangChain's RecursiveCharacterTextSplitter uses language-specific separators such as \nclass and \ndef to prefer logical split points, but still enforces a strict size limit
• AST-based chunking with Tree-sitter parses code into a syntax tree, splits at semantic boundaries (functions, classes), and prepends a metadata header showing the file path and class/function hierarchy
• The evaluation used MLflow's GenAI evaluation framework with 46 test questions across categories that include pinpointing specific values, retrieving definitions, and comparing app implementations
• Three LLM judges (RetrievalSufficiency, RetrievalGroundedness, and a custom answer_correctness metric) scored the results so the chunking strategies could be compared fairly
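The difference between the first and third strategies above can be illustrated with a minimal sketch. It is not the article's implementation: it uses Python's standard-library ast module as a stand-in for Tree-sitter, and the sample source, the path app/greeter.py, the function names, and the header format are all hypothetical choices made for illustration.

```python
import ast

# Hypothetical sample file content, used only for this illustration.
SAMPLE = '''class Greeter:
    def hello(self, name):
        return "Hello, " + name

    def bye(self, name):
        return "Bye, " + name


def main():
    print(Greeter().hello("world"))
'''


def naive_chunks(text, size):
    """Fixed-size character windows; boundaries ignore syntax entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ast_chunks(source, path):
    """One chunk per function or method, each prefixed with a metadata header.

    Uses the stdlib ast module instead of Tree-sitter (an assumption made to
    keep the sketch dependency-free). Requires Python 3.8+ for end_lineno.
    """
    lines = source.splitlines()
    chunks = []

    def visit(node, parents):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Descend into the class and record it in the hierarchy.
                visit(child, parents + [child.name])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                body = "\n".join(lines[child.lineno - 1:child.end_lineno])
                scope = " > ".join(parents + [child.name])
                # Hypothetical header format: file path plus class/function hierarchy.
                header = "# file: " + path + " | scope: " + scope
                chunks.append(header + "\n" + body)

    visit(ast.parse(source), [])
    return chunks


if __name__ == "__main__":
    print("Naive chunks (60 characters each):")
    for chunk in naive_chunks(SAMPLE, 60):
        print(repr(chunk))

    print("\nAST-based chunks:")
    for chunk in ast_chunks(SAMPLE, "app/greeter.py"):
        print(chunk, end="\n\n")
```

In this sketch the naive splitter cuts wherever the character count runs out, while the AST-based splitter always emits whole functions and carries the enclosing class name in the header, which is the kind of context a retriever can match against. A real pipeline would also need a size cap for very large functions, which this sketch leaves out.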
This summary was automatically generated by AI based on the original article and may not be fully accurate.