Language model self-teaching for domain adaptation



tl;dr We have developed a proprietary method called self-teaching that allows chat language models to learn new knowledge with better multi-step reasoning over that knowledge and significantly better preservation of previous capabilities than typical finetuning methods. Self-taught models forget less off-topic information, remember more on-topic information, and are better at complex reasoning over learned facts without looking at the source material. Message us at hello@morph.so to work with us on deploying smarter chat LLMs for your domain-specific use-case.

Introduction

Chat language models provide a powerful natural language interface to the knowledge and capabilities internalized from their training data. Many applications, however, require these models to reason over knowledge that is not in the training data, and continuously updating language models with new, domain-specific knowledge without losing prior knowledge is challenging. Existing solutions include placing the new knowledge into the context of a long-context-adapted model, embedding-based retrieval-augmented generation (RAG), which prioritizes semantically relevant knowledge within a limited context, and finetuning the model on the new knowledge. Each of these methods can face severe limitations: long-context models can become “lost in the middle”; RAG can struggle with multi-document reasoning and with distribution shift against a stationary embedder; and finetuned models can catastrophically forget previously acquired knowledge and capabilities. These concerns are especially relevant for mathematical reasoning and code generation, which require precise reasoning over large amounts of long-tailed expertise.

We bypass these limitations with a new proprietary wake-sleep algorithm called self-teaching, which can be viewed as a form of test-time training with self-generated synthetic data. Self-teaching allows us to robustly bootstrap new knowledge into a chat language model. On a challenging multi-hop question-answering benchmark (MiniMuSiQue, described below), we observe that, compared to strong finetuning and off-the-shelf retrieval and long-context baselines, self-taught models forget less off-topic information, remember more on-topic information, and reason more reliably over the learned facts without access to the source documents.

The MiniMuSiQue benchmark

We describe MiniMuSiQue, a benchmark for closed-book multi-hop question-answering derived from the MuSiQue multi-hop question-answering dataset. It consists of MiniMuSiQue-hard (questions that GPT-4 can answer but GPT-3.5 cannot, and on which performance degrades significantly if the first pivot document is removed) and MiniMuSiQue-easy (a larger set of convoluted off-distribution single-hop question-answer pairs).

Multi-hop questions, which require a series of interconnected reasoning steps over multiple documents, have historically been a particularly challenging form of question for language models. However, creating multi-hop questions that truly necessitate knowledge-based reasoning is itself challenging. For instance, early benchmarks like HotpotQA were found to be largely solvable through shortcuts. Constructing questions and corresponding contexts that avoid such shortcuts, and verifying their effectiveness, requires a comprehensive dataset development process. The MuSiQue dataset addresses many weaknesses of prior work and contains difficult multi-hop questions that are less susceptible to shortcuts.

For our experiments, we refined the MuSiQue dataset further to focus on questions that demand complex multi-hop reasoning, selecting questions which (1) GPT-4 could answer but GPT-3.5 could not, and which (2) were not answerable without the context relevant to the first reasoning step (the "first-hop pivot document"). Specifically, we sampled 768 random examples from the MuSiQue training set and ranked them by a combined score of difficulty (measured by the difference in ROUGE-L recall between GPT-4 and GPT-3.5) and necessity of multi-hop reasoning (measured by the change in ROUGE-L recall when the first-hop pivot document was removed). We refer to the top-ranked 128 examples as MiniMuSiQue, and obtain MiniMuSiQue-hard by associating with each example its original difficult MuSiQue multi-hop question-answer pair. To additionally test off-distribution single-hop factual recall, we synthesized convoluted off-distribution single-hop question-answer pairs for up to five entities per document in MiniMuSiQue, yielding the much larger single-hop dataset MiniMuSiQue-easy. Each MiniMuSiQue example consists of twenty documents sampled from different Wikipedia articles, to which we associate one hard MuSiQue multi-hop reasoning question for MiniMuSiQue-hard and many single-hop questions for MiniMuSiQue-easy.
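To make the selection criterion concrete, here is a minimal sketch of the ranking step, assuming precomputed ROUGE-L recalls for each candidate; the names below and the equal weighting of the two terms are illustrative assumptions rather than details of our pipeline.

```python
# Illustrative ranking of candidate MuSiQue examples (names are hypothetical).
from dataclasses import dataclass

@dataclass
class Candidate:
    example_id: str
    recall_gpt4: float           # ROUGE-L recall of GPT-4's open-book answer
    recall_gpt35: float          # ROUGE-L recall of GPT-3.5's open-book answer
    recall_gpt4_no_pivot: float  # GPT-4 recall with the first-hop pivot document removed

def rank_score(c: Candidate) -> float:
    # Difficulty: GPT-4 succeeds where GPT-3.5 does not.
    difficulty = c.recall_gpt4 - c.recall_gpt35
    # Multi-hop necessity: performance should drop when the first-hop pivot is removed.
    necessity = c.recall_gpt4 - c.recall_gpt4_no_pivot
    return difficulty + necessity  # equal weighting is an assumption

def select_minimusique(candidates: list[Candidate], k: int = 128) -> list[Candidate]:
    # Keep the top-k candidates by combined score.
    return sorted(candidates, key=rank_score, reverse=True)[:k]
```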

We train separate copies of Llama 2 7B Chat with self-teaching versus finetuning on the documents belonging to each example. Within a single example, we apply self-teaching to each document independently and finetune on the documents jointly, so as to avoid learning explicit shortcuts. For finetuning, we train LoRA adapters. This choice is motivated by the observation that the low-rank information bottleneck regularizes training and LoRA-finetuned models tend to be more robust than fully finetuned models; moreover, MiniMuSiQue is not large enough for LoRA to become capacity-constrained.
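For reference, the LoRA finetuning baseline can be sketched as follows with Hugging Face transformers and peft; the target modules and hyperparameters shown are illustrative assumptions, not our exact settings.

```python
# Minimal LoRA finetuning setup (illustrative hyperparameters, not our exact settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                       # low-rank bottleneck that regularizes training
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The adapter is then trained with a standard causal language modeling loss
# on the documents of each example (jointly, in the finetuning baseline).
```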

Results

We test ROUGE-L recall for on-domain question-answering (i.e. closed-book QA on the example the model was trained on), averaging results across MiniMuSiQue.
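Concretely, the closed-book evaluation can be sketched as below, assuming the rouge-score package; the helper names are ours, and prompting details (e.g. the Llama 2 chat template) are omitted.

```python
# Closed-book QA scoring with ROUGE-L recall (simplified; prompt formatting omitted).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_recall(reference_answer: str, model_answer: str) -> float:
    # rouge_score expects (target, prediction); recall measures how much of the
    # reference answer is recovered in the model's output.
    return scorer.score(reference_answer, model_answer)["rougeL"].recall

def evaluate_closed_book(model_answer_fn, questions) -> float:
    """questions: iterable of (question, reference_answer) pairs.
    model_answer_fn: maps a question string to the model's answer,
    with no source documents placed in context."""
    scores = [rouge_l_recall(ref, model_answer_fn(q)) for q, ref in questions]
    return sum(scores) / len(scores)
```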

Self Teaching   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 32.4%
Finetuning      : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24.3%
Base            : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.1%
Closed-book ROUGE-L recall on MiniMuSiQue-hard

We see that self-teaching attains the best performance for closed-book question-answering on MiniMuSiQue-hard, with a 33% improvement over finetuning.

Self Teaching   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48.7%
Finetuning      : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 42.0%
Base            : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 37.0%
Closed-book ROUGE-L recall on MiniMuSiQue-easy

On the easier distribution of long single-hop questions from MiniMuSiQue-easy, we see that self-teaching still attains the best closed-book recall, with a 16% improvement over finetuning.

Next, we measure the effect of jointly self-teaching versus jointly finetuning on all 640 documents belonging to MiniMuSiQue-32, a subset of 32 examples from MiniMuSiQue. We compare the performance of these jointly trained models on this subset to that of the individually trained models above, as well as to a 128K-context-adapted version of Llama 2 (Yarn Llama 2 128K with a five-shot prompt) with all ~100K document tokens for the 32 examples placed into context.

Self Teaching (joint)   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 26.0%
Long Context            : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 22.1%
Finetuning (joint)      : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19.7%
Base                    : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.2%
Closed-book (except for long context) ROUGE-L recall on MiniMuSiQue-32-hard

We note that the jointly trained self-taught model performs better, on average, than the individually self-taught models on this subset of MiniMuSiQue. The self-taught model is also the only variant whose closed-book question-answering performance exceeds the open-book question-answering performance of the 128K long-context model, outperforming it by 17.6%.

We also measure the effect of retrieval augmentation on the models' question-answering ability on MiniMuSiQue-32-hard. We retrieve the top twenty most relevant documents using OpenAI embeddings.
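A minimal sketch of this retrieval step, assuming the OpenAI Python SDK and cosine similarity over document embeddings (the specific embedding model named below is an assumption), looks like this:

```python
# Top-k document retrieval with OpenAI embeddings (embedding model name is an assumption).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def retrieve_top_k(question: str, documents: list[str], k: int = 20) -> list[str]:
    doc_vecs = embed(documents)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and each document.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top = np.argsort(-sims)[:k]
    return [documents[i] for i in top]
```

The retrieved documents are then placed into the model's context before asking the MiniMuSiQue-32-hard question.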

Self Teaching (joint + RAG) : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 37.2%
Finetuning (joint + RAG)    : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23.7%
Base + RAG                  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 29.1%
Retrieval-augmented ROUGE-L recall on MiniMuSiQue-32-hard

We see that, compared to finetuning, self-teaching preserves the model's ability to use its context and synergizes much better with RAG, outperforming the base model with RAG by 27.8%. We see a similar pattern when measuring off-domain closed-book question-answering (i.e. question-answering on examples whose documents were not trained on in any way) with the models trained individually on the examples from MiniMuSiQue-hard.

Self Teaching               : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.4%
Finetuning                  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.5%
Base                        : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.1%
Off-domain closed-book ROUGE-L recall on MiniMuSiQue-hard

We see that finetuning degrades off-domain performance below the base model, while self-teaching even slightly improves upon it: for off-domain closed-book multi-hop question-answering, self-teaching improves upon the base model by 25% and upon finetuning by 52%.

Finally, as a broader check of robustness, we also evaluate these models on the Massive Multitask Language Understanding (MMLU) validation set.
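As a rough illustration (our exact evaluation harness is not shown here), a zero-shot pass@1 check on the MMLU validation split can be sketched as follows; the prompt format and answer parsing are simplified assumptions.

```python
# Zero-shot MMLU evaluation sketch (prompt format and answer parsing are simplified assumptions).
import re
from datasets import load_dataset

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_question(example) -> str:
    choices = "\n".join(f"{l}. {c}" for l, c in zip(CHOICE_LETTERS, example["choices"]))
    return f"{example['question']}\n{choices}\nAnswer with the letter of the correct choice."

def mmlu_zero_shot_accuracy(generate_fn) -> float:
    """generate_fn: maps a prompt string to the chat model's reply."""
    data = load_dataset("cais/mmlu", "all", split="validation")
    correct = 0
    for ex in data:
        reply = generate_fn(format_question(ex))
        match = re.search(r"\b([ABCD])\b", reply)
        predicted = match.group(1) if match else None
        if predicted == CHOICE_LETTERS[ex["answer"]]:
            correct += 1
    return correct / len(data)
```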

Self Teaching (joint)       : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 42.7%
Finetuning (joint)          : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 39.9%
Base                        : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 44.2%
Zero-shot pass@1 on MMLU-valid

We see that even as the self-taught model substantially exceeds the finetuned model's closed-book recall, its MMLU score drops by only 1.5 points relative to the base model, whereas the finetuned model's score drops by 4.3 points.

All this evidence indicates that self-teaching is robust and effective, and presents a Pareto improvement over finetuning for bootstrapping knowledge into a model for downstream question-answering, even when the questions are off-distribution or require reasoning across multiple documents from memory. The experiments discussed so far use the well-studied but relatively weak (by current standards) Llama 2 Chat 7B, and we have observed that performance only improves when using stronger chat language models for self-teaching. Message us at hello@morph.so if you are interested in partnering with us on developing more robust chat assistants for your domain-specific use-case.

What's next

At Morph Labs, we're building an AI mathematician and eventually a personal AI software engineer for everyone. Compositional reasoning over large exogenous knowledge bases is critical for building AI systems for mathematics and software engineering. Self-teaching is one step on this journey; a version of self-teaching has already been used to create Morph Prover v0 7B, the first open-source chat assistant for Lean developers. Towards our mission, we plan to develop other methods of bootstrapping reasoning capabilities for language models, especially in combination with formal verification tools that can guarantee the semantics of generated code, and to deploy this technology as coding assistants for formal mathematics and eventually general software engineering. If you are excited by the product, research, and engineering work for getting there, and if you believe, as we do, that the future of mathematics is deeply linked with the future of software, find us at jobs@morph.so.