State-of-the-art codebase question-answering using automated codebase evals

June 24, 2024

Introduction

Systematically evaluating the helpfulness of AI coding assistants on large and complex codebases is challenging because those assistants must be able to answer questions about many codebase-specific concepts. Morph Labs has built a system that automatically generates evaluation sets for individual codebases. The technology underlying our autogenerated evaluations can be combined with self-teaching to create high-quality synthetic codebase fine-tuning datasets. Prominent AI companies, including Together AI, use automated codebase evaluations and synthetic fine-tuning data. Morph and Together AI are collaborating on research aimed at improving the performance of RAG systems for codebases by leveraging automated codebase evaluations and high-quality synthetic data for codebase fine-tuning; you can read more in their latest blog post.

Results

To validate the quality of answers generated by our Morph Code API, we derived a separate test set of ~70 questions from resolved GitHub issues in five prominent open-source codebases. We collected blinded human judgments of the answers from the Morph Code API, as well as from a number of commercially available AI coding assistants. The answers from the Morph Code API attain state-of-the-art quality, as measured by the average human-judged helpfulness rating. Notably, our eval-optimized system far outperforms an otherwise similar unoptimized system configured without guidance from our automatically generated codebase-specific eval sets.

System                           Score (blind human)
Morph Code API, eval-optimized   4.50
Cursor with GPT-4                4.33
Cursor with GPT-3.5              3.83
Morph Code API, base             3.22
Cody                             3.00
Codeium (base model)             2.33
Greptile                         2.17
GPT-4 without retrieval          1.50

Average score on a 1-5 scale; higher is better.

Methods

Automatic benchmark generation

We construct a benchmark for a repository in several stages. We use static analysis and automated reasoning over the code graph to determine highly relevant definitions and to synthesize missing specifications. Based on the output of this pipeline, we generate and filter thousands of evaluation questions, each of which requires a set of expected identifiers from the codebase to be mentioned in the answer. Upon manual inspection, ~80% of the questions in the autogenerated evaluation sets genuinely require the expected identifiers for a satisfactory answer (i.e., for ~80% of the questions, our automatically determined answer is relevant, necessary, and correct).
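For concreteness, each autogenerated item pairs a question with the identifiers a satisfactory answer must mention. The sketch below shows one plausible shape for such an item; the field names and the example identifiers are our own illustrative assumptions, not the exact schema.

```python
# Illustrative shape of an autogenerated evaluation item. The EvalQuestion fields
# and the example identifiers are hypothetical, for exposition only.
from dataclasses import dataclass

@dataclass
class EvalQuestion:
    question: str                    # natural-language question about the codebase
    expected_identifiers: list[str]  # identifiers a satisfactory answer must mention

example = EvalQuestion(
    question="Which class chunks source files before they are indexed for retrieval?",
    expected_identifiers=["FileChunker", "chunk_file"],  # hypothetical identifiers
)
```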

At evaluation time, a system's score is the fraction of questions for which the system's response contains the expected identifiers (the system's "final-answer recall" score).
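As a rough sketch, the metric can be computed as below, assuming "contains the expected identifiers" means that every expected identifier appears verbatim in the response; the function name and the exact matching rule are our assumptions, not the production implementation.

```python
# Rough sketch of the final-answer recall score, assuming "contains the expected
# identifiers" means every expected identifier appears verbatim in the response.
def final_answer_recall(expected: list[list[str]], responses: list[str]) -> float:
    """Fraction of questions whose response mentions all of its expected identifiers."""
    hits = sum(
        all(identifier in response for identifier in identifiers)
        for identifiers, response in zip(expected, responses)
    )
    return hits / len(responses) if responses else 0.0
```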

Human-rated helpfulness evaluation

We collect human judgments to estimate the helpfulness of systems' responses on a sample of GitHub issues from each repository.

We sample an initial set of issues randomly from the already-resolved issues in each repository. We then inspect each issue for whether a specific solution is evident in the record (e.g. the issue has a linked PR that fixes X method to do Y thing). If a specific solution is evident, we record a natural language summary of it as the reference solution for the issue. If no solution is evident (e.g. the issue was closed without being solved and the original question was not answered), the issue is discarded from our evaluation set.
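As a sketch of the resulting records, each retained issue can be stored alongside its reference solution, and issues without one are dropped; the field names and skip rule below are illustrative assumptions rather than our exact format.

```python
# Illustrative record for a resolved GitHub issue used in the human-rated evaluation.
# The field names and the skip rule are assumptions for exposition, not the exact format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IssueEvalItem:
    issue_url: str
    question: str                      # the issue, posed as a question to each system
    reference_solution: Optional[str]  # natural-language summary of the evident fix, if any

def keep_for_eval(item: IssueEvalItem) -> bool:
    """Discard issues with no evident solution (e.g. closed without being solved)."""
    return item.reference_solution is not None
```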

To evaluate a system's response to a given question, we compare it against the recorded reference solution and rate it on a 5-point scale based on how much insight into the solution the response demonstrates.

Discussion

Human judgment is an essential north star for evaluating AI systems. Cheaper benchmarks are useful to the extent that they are predictive of the results of a fuller human evaluation. We developed and validated an automatic benchmark generator that not only identifies higher-performing codebase question-answering systems, but also enables systems optimized on the benchmarks' scores to attain higher average human-rated helpfulness.

Because our autogenerated benchmarks are codebase-specific, they can be used to measure the effect of any design choice in an end-to-end RAG system over any repository. For example, our benchmarks can detect the impact of poor chunking for a niche programming language, or the effect of omitting certain metadata in a retrieval index. Autogenerated benchmarks are ideal for providing quick feedback on each of the many detailed design decisions that a developer of a codebase question answering system needs to make as they iterate. As codebase question answering systems become more sophisticated, automatic evaluation systems like ours will enable scalable oversight of AI coding systems while addressing the high cost of creating and manually evaluating benchmarks of increasingly challenging questions over diverse codebases.
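As a hedged sketch of this kind of quick feedback loop, two pipeline configurations (say, two chunking strategies) can be scored on the same autogenerated benchmark and compared by final-answer recall; run_system and the benchmark item fields below are hypothetical placeholders for your own pipeline, not a published API.

```python
# Hedged sketch: A/B-test a design decision (e.g. a chunking strategy) by scoring each
# configuration's final-answer recall on the same autogenerated benchmark.
# run_system and the benchmark item fields are hypothetical placeholders.
def compare_configs(benchmark, configs, run_system):
    """Return each configuration's final-answer recall on the shared benchmark."""
    scores = {}
    for name, config in configs.items():
        hits = 0
        for item in benchmark:
            response = run_system(item.question, config)
            if all(identifier in response for identifier in item.expected_identifiers):
                hits += 1
        scores[name] = hits / len(benchmark)
    return scores
```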

To use automated codebase evaluations and synthetic fine-tuning data for developing your own coding assistants, or to evaluate which assistant fits your use case best, reach out to jesse@morph.so. If you’re excited about working on the frontier of AI programming assistants, reach out to jobs@morph.so.