Building a Knowledge Assistant over Code

2026-03-23

1 min read

Tags:

Engineering


This post explores building a RAG-based code knowledge assistant using three chunking strategies and evaluating them with MLflow.

• Naive fixed-size character chunking treats code as plain text and ignores syntax boundaries, so functions get split mid-body and lose their semantic context.
• LangChain's RecursiveCharacterTextSplitter uses language-specific separators such as \nclass and \ndef to prefer logical split points, but it still enforces a strict size limit, so a definition larger than the limit is split anyway.
• AST-based chunking with Tree-sitter parses code into a syntax tree, splits at semantic boundaries (functions, classes), and prepends a metadata header showing the file path and the class/function hierarchy.
• MLflow's GenAI evaluation framework was used to compare the strategies on 46 test questions, in categories including pinpointing specific values, retrieving definitions, and comparing app implementations.
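
To make the contrast between the first and third strategies concrete, here is a minimal sketch. It is not the post's implementation: it uses Python's built-in `ast` module as a stand-in for Tree-sitter so that it runs without extra dependencies, and the sample source, the `demo/greeter.py` path, the header format, and both function names are hypothetical.

```python
import ast

# Hypothetical sample file used only for illustration.
SAMPLE = '''class Greeter:
    def hello(self, name):
        return "Hello, " + name

def add(a, b):
    return a + b
'''


def fixed_size_chunks(text, size):
    """Naive chunking: cut every `size` characters, ignoring syntax."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ast_chunks(source, path):
    """Emit one chunk per function or method, with a metadata header.

    Assumption: Python's stdlib `ast` replaces Tree-sitter here, so this
    only handles Python source. Only definitions at module level or
    directly inside classes are collected; nested functions stay inside
    their parent's chunk.
    """
    tree = ast.parse(source)
    chunks = []

    def visit(node, parents):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                scope = " > ".join(parents + [child.name])
                body = ast.get_source_segment(source, child)
                # Header records where the chunk came from (illustrative format).
                chunks.append(f"# file: {path}\n# scope: {scope}\n{body}")
            elif isinstance(child, ast.ClassDef):
                # Descend into classes so methods carry their class name.
                visit(child, parents + [child.name])

    visit(tree, [])
    return chunks


if __name__ == "__main__":
    print("--- fixed-size chunks ---")
    for chunk in fixed_size_chunks(SAMPLE, 40):
        print(repr(chunk))
    print("--- AST chunks ---")
    for chunk in ast_chunks(SAMPLE, "demo/greeter.py"):
        print(chunk)
        print()
```

With a 40-character limit, the fixed-size splitter cuts through function signatures, while the AST-based version returns each function whole, along with the file and scope it belongs to. A retriever can then match on that header as well as on the code itself.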
