State-of-the-art codebase question-answering using automated codebase evals

June 24, 2024

Introduction

Systematically evaluating the helpfulness of AI coding assistants on large and complex codebases is challenging because those assistants must be able to answer questions about many codebase-specific concepts. Morph Labs has built a system that automatically generates evaluation sets for individual codebases. The technology underlying our autogenerated evaluations can be combined with self-teaching to create high-quality synthetic codebase fine-tuning datasets. Prominent AI companies, including Together AI, use automated codebase evaluations and synthetic fine-tuning data. Morph and Together AI are collaborating on research aimed at improving the performance of RAG systems for codebases by leveraging automated codebase evaluations and high-quality synthetic data for codebase fine-tuning; you can read more in their latest blog post.

Results

To validate the quality of answers generated by our Morph Code API, we derived a separate test set of ~70 questions from resolved GitHub issues in five prominent open-source codebases. We collected blinded human judgments of the answers from the Morph Code API, as well as from a number of commercially available AI coding assistants. The answers from the Morph Code API attain state-of-the-art quality, as measured by the average human-judged helpfulness rating. Notably, our eval-optimized system far outperforms an otherwise similar unoptimized system configured without guidance from our automatically generated codebase-specific eval sets.

System                           Score (blind human)
Morph Code API, eval-optimized   4.50
Cursor with GPT-4                4.33
Cursor with GPT-3.5              3.83
Morph Code API, base             3.22
Cody                             3.00
Codeium (base model)             2.33
Greptile                         2.17
GPT-4 without retrieval          1.50

Average score on a 1-5 scale; higher is better.

Methods

Automatic benchmark generation

We construct a benchmark for a repository in several stages. We use static analysis and automated reasoning over the code graph to determine highly relevant definitions and to synthesize missing specifications. Based on the output of this pipeline, we generate and filter thousands of evaluation questions, each of which requires a set of expected identifiers from the codebase to be mentioned in the answer. Upon manual inspection, ~80% of the questions in the autogenerated evaluation sets genuinely require the expected identifiers for a satisfactory answer (i.e., for ~80% of the questions, our automatically determined answer is relevant, necessary, and correct).
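For concreteness, each autogenerated item pairs a question with the identifiers a satisfactory answer must mention. The sketch below shows one plausible shape for such an item; the field names and the example identifiers are our own illustrative assumptions, not the exact schema.

```python
# Illustrative shape of an autogenerated evaluation item. The EvalQuestion fields
# and the example identifiers are hypothetical, for exposition only.
from dataclasses import dataclass

@dataclass
class EvalQuestion:
    question: str                    # natural-language question about the codebase
    expected_identifiers: list[str]  # identifiers a satisfactory answer must mention

example = EvalQuestion(
    question="Which class chunks source files before they are indexed for retrieval?",
    expected_identifiers=["FileChunker", "chunk_file"],  # hypothetical identifiers
)
```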

At evaluation time, a system's score is the fraction of questions for which the system's response contains the expected identifiers (the system's "final-answer recall" score).
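As a rough sketch, the metric can be computed as below, assuming "contains the expected identifiers" means that every expected identifier appears verbatim in the response; the function name and the exact matching rule are our assumptions, not the production implementation.

```python
# Rough sketch of the final-answer recall score, assuming "contains the expected
# identifiers" means every expected identifier appears verbatim in the response.
def final_answer_recall(expected: list[list[str]], responses: list[str]) -> float:
    """Fraction of questions whose response mentions all of its expected identifiers."""
    hits = sum(
        all(identifier in response for identifier in identifiers)
        for identifiers, response in zip(expected, responses)
    )
    return hits / len(responses) if responses else 0.0
```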

Human-rated helpfulness evaluation

We collect human judgments to estimate the helpfulness of systems' responses on a sample of GitHub issues from each repository.

We sample an initial set of issues randomly from the already-resolved issues in each repository. We then inspect each issue for whether a specific solution is evident in the record (e.g. the issue has a linked PR that fixes X method to do Y thing). If a specific solution is evident, we record a natural language summary of it as the reference solution for the issue. If no solution is evident (e.g. the issue was closed without being solved and the original question was not answered), the issue is discarded from our evaluation set.
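As a sketch of the resulting records, each retained issue can be stored alongside its reference solution, and issues without one are dropped; the field names and skip rule below are illustrative assumptions rather than our exact format.

```python
# Illustrative record for a resolved GitHub issue used in the human-rated evaluation.
# The field names and the skip rule are assumptions for exposition, not the exact format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IssueEvalItem:
    issue_url: str
    question: str                      # the issue, posed as a question to each system
    reference_solution: Optional[str]  # natural-language summary of the evident fix, if any

def keep_for_eval(item: IssueEvalItem) -> bool:
    """Discard issues with no evident solution (e.g. closed without being solved)."""
    return item.reference_solution is not None
```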

To evaluate a system's response to a given question, we compare it against the recorded reference solution and rate it on a 5-point scale based on how much insight into the solution the response demonstrates.

Discussion

Human judgment is an essential north star for evaluating AI systems. Cheaper benchmarks are useful to the extent that they are predictive of the results of a fuller human evaluation. We developed and validated an automatic benchmark generator that not only identifies higher-performing codebase question-answering systems, but also enables systems optimized on the benchmarks' scores to attain higher average human-rated helpfulness.

Because our autogenerated benchmarks are codebase-specific, they can be used to measure the effect of any design choice in an end-to-end RAG system over any repository. For example, our benchmarks can detect the impact of poor chunking for a niche programming language, or the effect of omitting certain metadata in a retrieval index. Autogenerated benchmarks are ideal for providing quick feedback on each of the many detailed design decisions that a developer of a codebase question answering system needs to make as they iterate. As codebase question answering systems become more sophisticated, automatic evaluation systems like ours will enable scalable oversight of AI coding systems while addressing the high cost of creating and manually evaluating benchmarks of increasingly challenging questions over diverse codebases.
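As a hedged sketch of this kind of quick feedback loop, two pipeline configurations (say, two chunking strategies) can be scored on the same autogenerated benchmark and compared by final-answer recall; run_system and the benchmark item fields below are hypothetical placeholders for your own pipeline, not a published API.

```python
# Hedged sketch: A/B-test a design decision (e.g. a chunking strategy) by scoring each
# configuration's final-answer recall on the same autogenerated benchmark.
# run_system and the benchmark item fields are hypothetical placeholders.
def compare_configs(benchmark, configs, run_system):
    """Return each configuration's final-answer recall on the shared benchmark."""
    scores = {}
    for name, config in configs.items():
        hits = 0
        for item in benchmark:
            response = run_system(item.question, config)
            if all(identifier in response for identifier in item.expected_identifiers):
                hits += 1
        scores[name] = hits / len(benchmark)
    return scores
```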

To use automated codebase evaluations and synthetic fine-tuning data for developing your own coding assistants, or to evaluate which assistant fits your use case best, reach out to jesse@morph.so. If you’re excited about working on the frontier of AI programming assistants, reach out to jobs@morph.so.