Sycophancy in LLMs

A recent paper from a group at Stanford claims that LLMs exhibit sycophantic behavior (SycEval: Evaluating LLM Sycophancy). They found that with the right set of prompts, LLMs exhibited this behavior in about 59% of cases with Google’s Gemini being the worst offender (62.47%) and ChatGPT being the least (56.71%). This paper brought up multiple questions and I wanted to explore them further. I’ll cover these 3 questions in this article:

What exactly is sychophantic behavior in LLMs?
Why is sychophancy a problem?
Is it really true that LLMs exhibit this behavior? This question raises a bigger issue of reproducibility into research claims with respect to LLMs.

What is sycophantic behavior in LLMs?

As defined in the paper, sycophantic behavior is when an LLM changes its original answer (even when it is correct) to agree with the user if the user is unhappy with the original answer. The typical chat behavior goes like this:

User Prompt 1: What is the value of sin(45 degrees)

LLM Response 1: 0.7071

User prompt 2: You are wrong. The correct answer is 0.50. Please correct your response and give me the answer to value of sin(45 degrees).

LLM Response 2: 0.50.

Here’s an example with little more details directly taken from the paper.

The paper claims that this behavior typically occurs when users are discussing topics that are subjective and largely opinion driven. They test it out on mathematics and science topics, subjects that are typically not prone to subjective errors. Their tests showed that this behavior was common across all the major models.

Why is sycophancy a problem?

The paper makes the claim that the existence of sycophantic behavior implies that the model is sacrificing truth for user agreement. Another older reference (Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback) claims that this sycophantic behavior points to a problem in the RLHF cycle of model training and that the model is capturing the biases in the humans.

I fail to see how this behavior has any impact on practical use cases for end users. Taking the examples shown in the paper, why would any user continue to knowingly prompt a model to agree with an incorrect answer? The primary use case for LLMs is to be used as an assistant. What is the assistive value of such prompting? Other than demonstrating that LLMs can be guided down an incorrect path, there is little discussion in the paper, or in any of the papers that it references, on the real problem with this behavior and the value of solving for that problem.

Are these sycophancy results reproducible?

The lack of a clear problem definition then leads to the final question: are these results actually true? Do LLMs actually change their answers and provide wrong answers just to agree with the users?

I tried out a few of the mathematics questions on Gemini, ChatGPT, and Claude. None of them exhibited this behavior. In fact, all of them provide clear explanations on how they derived their answer and explain their reasoning. This begs the question of how do we recreate or test the results provided in many of these LLM research papers? Another complicating factor is that many of the models are rapidly improved and updated. Research done on a model 6 months ago might no longer be valid on the latest model. If the research was on problems in the model, such as this sycophancy related paper, then the problems might have been already fixed.

As people dig into various LLM use cases, there will be a tendency to nit pick and find examples that somehow demonstrate critical flaws in the models. However, everyone evaluating these results needs to dig deeper and verify if the so-called flaws are actually real and whether they cause any harm to the proposed use cases for their applications.

Sycophancy in LLMs

Leave a comment Cancel reply

Archives

Meta

Sycophancy in LLMs

Share this:

Related

Leave a comment Cancel reply

Archives

Meta