An Initial Thought Piece on the “Optimization Target” in LLM Alignment Research


Chenghao Yang

University of Chicago

[firstname]@uchicago.edu

Thanks to Ari Holtzman, Allyson Ettinger, Ruiqi Zhong, Shi Feng, and Yifei Li, as well as colleagues from UChicago & TTIC, for the insightful discussions. Names are not listed in any particular order.

Thanks to the Simons Institute for organizing the LLM & Transformers workshop and releasing the talk recordings, and to all the great presenters for sharing their valuable insights.

[Initial version: 10/12/2023, Last updated: 05/13/2024]

When we work on traditional NLP tasks, the “Optimization Target” seems quite obvious: improving accuracy, efficiency, etc. But when we want to achieve better human-AI “Alignment”, the “Optimization Target” becomes much vaguer. In fact, after I started working on Alignment Research, perhaps the most frequent question I was asked was: what do you think is the “Optimization Target” we can optimize for in Alignment? A related question is: should we be optimistic about achieving that “Optimization Target”, if it exists? In this blog, I try to summarize what I have learned so far and offer my thoughts on these questions, including:

  1. [Well-defined Subset of the Alignment Research]

    For those new to the concept of “Alignment Research”, let's start with a simpler scenario: in situations where humans can easily provide feedback, what methods are commonly used to ensure AI systems act in ways that align with our expectations and values? I will start with single-objective optimization (with a focus on math/logic “reasoning” ability, which people may be most interested in) and then discuss how we move toward multi-objective settings. In the LLM literature, this part of my blog mainly corresponds to the “instruction tuning” process, in which we adapt a language model trained almost purely on next word prediction into a more capable multi-task solver.

  2. [Under-explored, Challenging Remaining Problems of the Alignment Research]

    Life does not always come with simple problems, so we also need to deal with the challenging parts: in tasks where humans are unable to give feedback, due to scalability, diversity of values, etc., what are some typical strategies we can use to achieve better human-LM alignment? This is the main challenge for current alignment research, and several strategies have been proposed to tackle it.

  3. [Attitudes: Cautiously Optimistic]

    Given the limited set of well-defined problems and the far more challenging under-explored ones we have discussed so far, it is clear that there is no uniformly operational “optimization target” that you can simply tune your model toward. Such targets only become clear once you narrow down the problem scope enough, just as in any other scientific research. Should we be depressed about this? What attitude should we take? How do we ensure we are making progress? In this part, using the online commercial deployment of AI technologies (which has been practiced for many years) as an analogy, I will explain why we can be optimistic about the future of alignment research, and why, even if the goal is not universally clear, any trial-and-error would constitute meaningful progress.


Well-Defined “Optimization Target”: The Story Starting with Next Token Prediction

If we only want an instruction-following “helpful” agent that can help solve exam questions, or help propose science hypotheses as in AI4Science (for Physics, Medicine, Biology, etc.), then the goal can usually be simplified as hill-climbing on a benchmark of exam questions. Comparing LLaMA-2, Baichuan-2, and Tulu-2, I think the key is to carefully prepare a high-quality, diverse set of datasets (Math/Code/…) and then do pretraining and instruction fine-tuning. In all these training stages, “next word prediction” is commonly used as the “optimization target”, and a typical decoder-only language model is applied. From an optimization point of view, this does not stray far from the original Transformer paper. (Perhaps architecture engineering as in Mistral-7B would also help, but that may still need further exploration.)
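To make this concrete, here is a minimal sketch of that next-word-prediction objective (PyTorch-style; the function and variable names are my own, not from any of the papers above): the model predicts token t+1 from tokens 1..t, and we minimize the cross-entropy between its predictions and the actual next tokens.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Standard next-word-prediction (causal LM) loss.

    token_ids: LongTensor of shape (batch, seq_len), already tokenized.
    model:     any decoder-only LM returning logits of shape
               (batch, seq_len - 1, vocab_size) for the shifted inputs.
    """
    # Predict token t+1 from tokens <= t: shift inputs and targets by one.
    inputs = token_ids[:, :-1]    # tokens 1 .. T-1
    targets = token_ids[:, 1:]    # tokens 2 .. T

    logits = model(inputs)

    # Cross-entropy over the vocabulary, averaged over all positions.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    return loss
```

Instruction tuning typically reuses exactly this loss, often with the prompt tokens masked out so that only the response tokens contribute; the “optimization target” itself does not change between pretraining and fine-tuning.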

Why Is “Next Word Prediction” So Powerful?

It may look surprising that “next word prediction”, as such a simple “optimization target”, can already get us this far. But that target is actually more profound than we may imagine. One popular explanation is that it can be viewed as “compression”: better compression, better intelligence. Imagine an algorithm small enough to fit in just a kilobyte (KB) of code, yet with the power to generate millions of random numbers. This is a far more efficient form of compression than trying to store all those numbers in memory or even compressing them into a zip file. In essence, the algorithm's ability to generate a vast amount of data from a tiny space represents a smarter way to compress information, much like how language models efficiently process and generate complex information. For this part, I recommend watching Ilya’s talk, where he explained that by doing compression via “next word prediction”, the model is searching for an optimal program that solves tasks reasonably well. Generalization then follows, because Turing-complete programs can solve every task, if you know how to program. There are already some mechanistic interpretability works trying to establish the connection between Transformers and programs (Thinking like Transformers, Transformer Programs).
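To make the compression analogy concrete, here is a toy sketch (plain Python, with names of my own choosing): a pseudorandom generator whose code is well under a kilobyte, yet which can reproduce millions of numbers exactly from a single seed. Storing the short program plus the seed compresses that data far better than storing the numbers themselves.

```python
def lcg(seed, n):
    """Tiny linear congruential generator: a few lines of code that can
    regenerate millions of numbers exactly from a single seed."""
    x = seed
    out = []
    for _ in range(n):
        x = (1103515245 * x + 12345) % 2**31
        out.append(x)
    return out

# The "compressed" representation is just (seed, n): a handful of bytes.
# Re-running the program reproduces the full sequence exactly.
million_numbers = lcg(seed=42, n=1_000_000)
assert million_numbers == lcg(seed=42, n=1_000_000)
```

A language model plays an analogous role: the better it predicts the next word, the fewer bits are needed to encode a corpus with it (its cross-entropy is the bits-per-token cost under an arithmetic coder), so lower next-word-prediction loss literally means better compression.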