Asked to classify a list of examples, LLMs are sensitive to their order

Which makes them harder to rely on.

Published: January 10, 2024

After asking an LLM to do something to each item in a list, I found that the LLM’s behavior varied based on the order of the items in the list. This was a surprise! For me, like for many software developers and data scientists, working with LLMs feels uncanny. They’re unlike ordinary software, which is usually deterministic: give it the same input and you get the same result.

I was asking a model to classify a few items in a list according to the same rubric, when I realized that modifying one item changed the classification of another, unmodified item. The takeaway is that LLMs can be unreliable, giving different answers to what might seem like the same question. One way to mitigate the issue might be to query about each item independently. Either way, this is a case where LLMs behave differently from the popular mental model of a not-too-bright intern.

(I doubt I’m the first to notice this phenomenon. If you’re aware of others who have highlighted it, I’d be happy to know.)

Here’s the code that shows it.

import llm
import random
import ast
from time import sleep
import re

Here is a series of tweet-style snippets that might be about a hockey game. We’ll ask the LLM to tell us which are really about hockey. Some of these are pretty ambiguous, which is important: the point is to show that the ambiguous snippets are treated inconsistently.


snippets = [("Stormy the Ice Hog is so cute!!", False), # he's the mascot!
            ("I love to drink Bud Light at the PNC Center in Raleigh, it tastes great.", False), # the Canes play at the PNC Center.
            ("Ron Francis chose the wrong players to take advantage of that power play.", True), # Francis used to be the captain, and later the manager.
            ("I can't believe Carolina missed that penalty shot, Rod Brind'Amour needs to have his players PRACTICE", True), # Brind'Amour used to be a star player, and is now the coach.
            ("I can't believe Carolina missed that free throw, Hubert Davis needs to have his players PRACTICE", False), # this is about basketball!
            ("Andrei Svechnikov scores at 3:23 in the third period against Arizona.", True), # Svechnikov is a player.
            ("Another Carolina Hurricanes goal, whoo!! Whoo-whoo!", True),
            ("The Hurricanes have a power play. Can they score on the Coyotes in the next two minutes?", True),
            ("Flooding and wind from Hurricane Floyd did a lot of damage to eastern NC -- and even to the RBC Center in Raleigh.", False), # the PNC Center used to be called the RBC Centre. But I'm not sure if it was built during Floyd (and I doubt Floyd damaged it, but I dunno.)
            ("Uh oh, with the Canes' forward Aho in the box, I'm not liking the way the Canadiens' line looks.", True)
             ]

correct_answers = [i for i, (snippet, val) in enumerate(snippets) if val]

(In real life, few-shot classification – where the prompt includes a few examples, along with their correct classifications – would probably work better.)
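For what it's worth, here is a rough sketch of what a few-shot version of the prompt could look like. I didn't run the experiment this way; `build_few_shot_prefix` and the wording of the instructions are made up for illustration, and the two labeled examples are borrowed from the snippet list above. In a real test, examples used in the prompt should be left out of the list being classified.

```python
def build_few_shot_prefix(labeled_examples):
    """Build a prompt prefix that shows labeled examples before the task.

    labeled_examples: list of (text, is_about_game) pairs.
    This helper is hypothetical, not part of the experiment in this post.
    """
    parts = [
        "Here are some tweets, labeled by whether they refer to what's "
        "happening in an ongoing Carolina Hurricanes hockey game."
    ]
    for text, is_about_game in labeled_examples:
        label = "yes" if is_about_game else "no"
        parts.append(f'Tweet: "{text}"\nAbout the game: {label}')
    parts.append(
        "Now do the same for the numbered tweets below. Reply with ONLY a "
        "Python-style list of line numbers of the items that relate to the game."
    )
    return "\n\n".join(parts)


# two examples taken from the snippet list, one positive and one negative
few_shot_examples = [
    ("Another Carolina Hurricanes goal, whoo!! Whoo-whoo!", True),
    ("I can't believe Carolina missed that free throw, Hubert Davis needs to have his players PRACTICE", False),
]
few_shot_prefix = build_few_shot_prefix(few_shot_examples)
print(few_shot_prefix)
```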

prompt_prefix = """
Of these tweets, which are referring to what's happening in an ongoing Carolina Hurricanes hockey game? 
Reply with ONLY a Python-style list of line numbers of the items that relate to the game.
""".strip().replace("\n", " ")
print(f"indexes of correct answers: {correct_answers}")
indexes of correct answers: [2, 3, 5, 6, 7, 9]

MODEL = "mistral-tiny"
NUM_ITERATIONS = 3
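SLEEP_TIME = 0

# raw string, so that \[ and \d reach the regex engine without Python warning about invalid escape sequences
BRACKET_FINDER_RE = r"\[[\d, ]+\]"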


def get_results(model_name, model_key_alias, snippets=snippets, prompt_prefix=prompt_prefix, sleep_time=SLEEP_TIME):
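    # shuffle the positions: false_true_mapping maps the (0-based) position a
    # snippet is shown at in the prompt to its real index in `snippets`
    item_indexes = list(range(len(snippets)))
    random.shuffle(item_indexes)
    false_true_mapping = {false_index: true_index for true_index, false_index in enumerate(item_indexes)}
    # number the snippets 1..N in the order they'll be shown
    shuffled_snippets = [f"{fake_index+1}. {snippets[true_index][0]}" for fake_index, true_index in sorted(false_true_mapping.items())]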

    # generate a prompt with snippets in random order
    prompt = prompt_prefix + "\n\n" + "\n\n".join(shuffled_snippets)

    # send it to `llm` and get the return value
    model = llm.get_model(model_name)
    model.key = llm.get_key(None, key_alias=model_key_alias)
    response = model.prompt(prompt)
    sleep(sleep_time)

    # parse the output and match up the output to the snippets
    match = re.search(BRACKET_FINDER_RE, response.text())
    try:
        if match is None:
            raise ValueError("no bracketed list in the response")
        responses_by_shuffled_index = ast.literal_eval(match[0])
        # a line number outside 1..N raises KeyError here
        responses_by_real_index = sorted(false_true_mapping[i - 1] for i in responses_by_shuffled_index)
    except (ValueError, SyntaxError, KeyError):
        print("ERROR, invalid response")
        print(response.text())
        print("")
        return None
    return responses_by_real_index

def compare_results(model_name, model_key_alias, num_iterations=NUM_ITERATIONS, prompt_prefix=prompt_prefix, sleep_time=SLEEP_TIME):
    responses = []
    # use the num_iterations argument (not the global) and pass sleep_time through
    for _ in range(num_iterations):
        resp = get_results(model_name, model_key_alias, prompt_prefix=prompt_prefix, sleep_time=sleep_time)
        # None means the response couldn't be parsed
        if resp is not None:
            responses.append(tuple(resp))
        print(resp)
    if len(set(responses)) > 1:
        print("those results varied!")
compare_results("mistral-tiny", "mistral")

# we don't really care if the model always gets some of the answers wrong. We care that the results aren't stable with respect to the ordering.
[3, 5, 6, 7]
[2, 5, 7, 9]
[5, 6, 7, 9]
those results varied!
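The index bookkeeping in `get_results` is the kind of thing that's easy to get backwards, so here is a small offline check of the same idea that needs no model. It's a simplified restatement, not the code that produced the results here: `shuffle_with_mapping` and `to_original_indexes` are names I'm introducing for illustration, and the "model" is just a filter on the shown lines.

```python
import random


def shuffle_with_mapping(texts, rng):
    """Return numbered lines in shuffled order, plus a map from the
    0-based shown position to the original index."""
    order = list(range(len(texts)))
    rng.shuffle(order)
    # order[k] is the original index of the text shown at position k
    shown_to_original = dict(enumerate(order))
    numbered = [f"{k + 1}. {texts[orig]}" for k, orig in enumerate(order)]
    return numbered, shown_to_original


def to_original_indexes(shown_numbers, shown_to_original):
    """Translate 1-based line numbers from a reply back to original indexes."""
    return sorted(shown_to_original[n - 1] for n in shown_numbers)


texts = ["a", "b", "c", "d"]
numbered, mapping = shuffle_with_mapping(texts, random.Random(0))

# pretend the model picked the lines showing "b" and "d", wherever they landed
picked = [k + 1 for k, line in enumerate(numbered) if line.endswith(("b", "d"))]
assert to_original_indexes(picked, mapping) == [1, 3]
```

Whatever the shuffle does, the picks should translate back to original indexes 1 and 3. If they do, any variation in the real results comes from the model and not from the unshuffling.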
compare_results("mistral-small", "mistral")
[3, 5, 7]
[5, 6, 7, 9]
[5, 6, 7, 9]
those results varied!
compare_results("mistral-medium", "mistral")
[3, 5, 6, 7, 9]
[2, 5, 6, 7, 9]
[2, 3, 5, 6, 7, 9]
those results varied!

This pattern isn’t limited to Mistral’s models. It occurs even with GPT-3.5.

compare_results("gpt-3.5-turbo", "openai", sleep_time=20)
[2, 3, 5, 6, 7, 9]
[5, 6, 7]
[2, 3, 5, 6, 7, 9]
those results varied!
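To see which items are doing the flipping, one option is to tally how often each index gets picked across runs. This is just a sketch over the lists printed above, copied in by hand; `unstable_items` is a helper name I'm making up here, and it assumes no run lists the same index twice.

```python
from collections import Counter

# the runs printed above, copied by hand
runs_by_model = {
    "mistral-tiny": [[3, 5, 6, 7], [2, 5, 7, 9], [5, 6, 7, 9]],
    "mistral-small": [[3, 5, 7], [5, 6, 7, 9], [5, 6, 7, 9]],
    "mistral-medium": [[3, 5, 6, 7, 9], [2, 5, 6, 7, 9], [2, 3, 5, 6, 7, 9]],
    "gpt-3.5-turbo": [[2, 3, 5, 6, 7, 9], [5, 6, 7], [2, 3, 5, 6, 7, 9]],
}


def unstable_items(runs):
    """Indexes picked in at least one run but not in every run."""
    counts = Counter(index for run in runs for index in run)
    return sorted(index for index, n in counts.items() if n < len(runs))


for model_name, runs in runs_by_model.items():
    print(model_name, unstable_items(runs))
# mistral-tiny [2, 3, 6, 9]
# mistral-small [3, 6, 9]
# mistral-medium [2, 3]
# gpt-3.5-turbo [2, 3, 9]
```

Items 5 and 7 are picked in every run by every model; the flips land on 2, 3, 6 and 9. With only three runs per model this is a tally, not a measurement.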

What if we tell the model to consider each example independently?

It doesn’t really help.

prompt_prefix_independent = """
Of these tweets, which are referring to what's happening in an ongoing Carolina Hurricanes hockey game? 
Reply with ONLY a Python-style list of line numbers of the items that relate to the game.
Consider each tweet independently.
""".strip().replace("\n", " ")

compare_results("mistral-small", "mistral", prompt_prefix=prompt_prefix_independent)
[5, 6, 7, 9]
[2, 5, 6, 7, 9]
[5, 6, 7]
those results varied!
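The mitigation I mentioned at the top, asking about each item in its own query, could look something like the sketch below. I haven't run this against the models above, so I can't say how much it helps. It removes the ordering by construction, but sampling randomness can still make answers differ between runs, and it costs one call per item. `classify_independently`, the yes/no prompt wording, and `fake_ask` are all hypothetical; `ask` is any function that takes a prompt string and returns the reply text.

```python
def classify_independently(texts, ask):
    """Ask about each text in a separate prompt and return the indexes
    of the texts the model says are about the game."""
    question = (
        "Is this tweet referring to what's happening in an ongoing "
        "Carolina Hurricanes hockey game? Reply with ONLY yes or no."
    )
    selected = []
    for index, text in enumerate(texts):
        reply = ask(f"{question}\n\n{text}")
        # treat anything that starts with "yes" as a positive answer
        if reply.strip().lower().startswith("yes"):
            selected.append(index)
    return selected


# stand-in for a real model so the sketch runs offline (hypothetical)
def fake_ask(prompt):
    return "Yes" if "power play" in prompt else "no"


demo_texts = [
    "Stormy the Ice Hog is so cute!!",
    "The Hurricanes have a power play. Can they score on the Coyotes in the next two minutes?",
]
print(classify_independently(demo_texts, fake_ask))
```

With the `llm` library used above, `ask` could be something like `lambda p: model.prompt(p).text()`, where `model` comes from `llm.get_model(...)` as in `get_results`.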