LLM Evaluation Metrics: Core

💡 1. Intro to LLM Evaluation Core Metrics

Evaluating LLMs is challenging due to the complexity of language understanding and generation. Different use cases require different evaluation approaches, making it essential to choose the right metrics.

💡 2. Generation & Fluency Metrics

These metrics assess how well an LLM generates fluent and coherent text (logical, consistent, and easy to understand).

  • Perplexity: Measures how well a model predicts the next word in a sequence.

  • BLEU: Checks how closely the generated text matches a reference output (exact n-gram overlap).

  • METEOR: Improves on BLEU by considering synonyms, stemming, and recall for closer alignment with human judgment.

  • Diversity Metrics: Evaluate the variety of outputs, ensuring they aren’t repetitive and exhibit rich language.

💡 3. Precision & Structure-Based Metrics

These metrics focus on how precise and structurally aligned the generated text is with reference data (ground truth examples, human-written summaries, or benchmark datasets).

  • ROUGE (N, L, S): Evaluates n-gram and sentence overlap between the generated text and reference summaries, measuring content similarity.

  • BERTScore: Assesses semantic similarity by comparing embeddings of words rather than exact matches, providing a deeper understanding of content quality.


1. Intro to LLM Evaluation Core Metrics

Evaluating LLMs is challenging because metrics vary depending on the task and use case. Unlike traditional software, where performance can be measured objectively, LLMs generate text, which introduces a level of subjectivity in evaluation.

🧠 Mental Model : Evaluating LLMs as Essay Assessment

Think of evaluating an LLM as similar to a teacher assessing an essay.

One evaluator may prioritize fluency and coherence, while another might focus on precision and accuracy. This makes it difficult to measure performance consistently across different tasks.

Before LLMs, evaluation metrics like precision and recall were widely used for tasks like ranking search results. However, since LLMs generate responses instead of returning lists, these traditional metrics become less effective for evaluating model performance.

To address this, researchers have developed a variety of specialized metrics to evaluate the quality of text generation. These include metrics like Perplexity, ROUGE, BLEU, and Exact Match (EM). Each metric focuses on different aspects of LLM performance, from evaluating word overlap to assessing the overall fluency and coherence.

The table below provides an overview of the key evaluation metrics used across different LLM tasks. In the following sections, we will dive deeper into these core metrics and explore when and how to use them effectively.

Figure: Overview of evaluation metrics used for assessing LLM performance across different tasks.

2. Generation & Fluency Metrics

Generation & Fluency Metrics are essential for evaluating how well an LLM generates text that is both accurate and coherent. These metrics assess various aspects of the generated text, such as its predictability, fluency, and diversity.

🧠 Mental Model : Measuring a Robot's Story Fluency

Imagine a robot that writes stories. We want to know if the stories are good. To give the robot’s story a good score, we look at a few things:

Predictability: Does the story make sense? Can you guess what might happen next, like a good story should? It’s like knowing what comes next in your favorite song.

Fluency: Does the story read smoothly, like someone talking nicely? Are the words easy to understand and flow well together?

Diversity: Are the stories different and interesting, or does the robot keep telling the same story over and over? We want lots of different and fun stories!

Below are the metrics commonly used to measure these characteristics.

2.1. Perplexity

Perplexity measures how confident an LLM is when predicting text. It comes from probability theory and is commonly used in language modeling. A lower perplexity indicates the model is more confident in its predictions, suggesting that the generated text closely follows typical language patterns. On the other hand, a higher perplexity suggests more uncertainty, meaning the model is struggling to predict the next word accurately.

🧠 Mental Model : Perplexity as GPS Navigation System

Think of Perplexity like a navigation system. The reference text is the destination, and the generated text is the route your navigation system suggests. Perplexity measures how confused the system is in finding the most efficient route to the destination.

A low perplexity means the system is confident in the route it’s taking (fewer alternatives), while a high perplexity means the system is unsure and considering many possible routes (more confusion).

For example, if your destination is “Paris,” and the system suggests “Route A” with minimal detours, the perplexity is low. If the system suggests “Route B” with many alternative routes or uncertain turns, the perplexity is higher.

Perplexity ≈ 1: The model is very confident in its predictions, meaning it almost always knows the next word. This is ideal but rare. Highly Confident

Perplexity between 10-100: The model has moderate confidence, making good predictions most of the time. This range is typical for well-trained models. Moderate Confidence

Perplexity > 1000: The model is highly uncertain, struggling to predict the next word accurately. This indicates poor performance. Highly Uncertain

📗 Example : Perplexity calculations

Reference text: “The cat sat on the mat”

Generated text: “The cat is sitting on a mat”

Assume the following probability for each word, obtained from a trained language model:

Reference probability:

P("The") = 0.1, P("cat") = 0.2, P("sat") = 0.15, P("on") = 0.2, P("the") = 0.1, P("mat") = 0.25

Generated probability:

P("The") = 0.1, P("cat") = 0.2, P("is") = 0.05, P("sitting") = 0.05, P("on") = 0.2, P("a") = 0.1, P("mat") = 0.25

The likelihood of the generated text is calculated as the product of individual word probabilities:

Likelihood = P("The") * P("cat") * P("is") * P("sitting") * P("on") * P("a") * P("mat")

We calculate:

Likelihood = 0.1 * 0.2 * 0.05 * 0.05 * 0.2 * 0.1 * 0.25 = 0.00000025

Perplexity is the inverse probability of the test set normalized by the number of words:

Perplexity = exp(-1/N * sum(log(P(w_i))))

For a text length of 7 words, we calculate:

Perplexity = exp(-1/7 * log(0.00000025)) ≈ 8.77

Thus, the perplexity of the generated text is approximately 8.77, indicating some uncertainty in the model’s predictions.
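
For reference, here is a minimal Python sketch of this calculation, using the illustrative word probabilities above:

  import math

  # Illustrative word probabilities for the generated text (from the example above)
  probs = [0.1, 0.2, 0.05, 0.05, 0.2, 0.1, 0.25]

  # Perplexity = exp(-1/N * sum(log P(w_i)))
  log_likelihood = sum(math.log(p) for p in probs)
  perplexity = math.exp(-log_likelihood / len(probs))
  print(f"Perplexity: {perplexity:.2f}")  # ≈ 8.77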

Why Perplexity is Useful

Perplexity is useful for evaluating language model training. A model with low perplexity on a dataset has learned patterns well. However, if perplexity is too low, it might mean the model is overfitting, memorizing instead of generalizing.
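
In practice, perplexity is usually computed from a model's average token-level loss rather than by hand. Below is a hedged sketch using the Hugging Face transformers library with GPT-2 (assuming transformers and torch are installed; the model choice is purely illustrative):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
  with torch.no_grad():
      # The loss is the average negative log-likelihood per token
      loss = model(**inputs, labels=inputs["input_ids"]).loss
  print(f"Perplexity: {torch.exp(loss).item():.2f}")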

⚠️ Limitations of Perplexity

Perplexity doesn’t measure output quality. A model might generate fluent but incorrect text and still have low perplexity. That’s why it is often combined with other metrics like ROUGE and BLEU.

2.2. BLEU (Bilingual Evaluation Understudy)

BLEU measures how similar the generated text is to a reference text by comparing n-grams (sequences of words). It is widely used in machine translation and summarization to evaluate accuracy, where a higher BLEU score indicates better similarity. BLEU works best for structured tasks like translation, where exact phrasing matters more than meaning.

🧠 Mental Model : BLEU Score as Treasure Hunt

Think of BLEU like a treasure hunt. The reference text is the treasure map, and the generated text is the path you’ve taken during the hunt. BLEU measures how closely the generated path matches the treasure map by checking if you’ve taken the right steps (matching words) and if the steps were in the correct order. The more steps that match, and the more in the right sequence, the higher the score, just like how finding the right treasures at the right points along the map leads to a better result.

Interpreting BLEU scores:

BLEU Score ≈ 1.0: The generated text is nearly identical to the reference text. This is rare but ideal. Highly Similar

BLEU Score 0.5 - 0.9: The generated text is fairly similar to the reference but may contain slight variations. Moderate Similarity

BLEU Score < 0.5: The generated text differs significantly from the reference, indicating poor accuracy. Low Similarity

Simplified version of the BLEU formula:

BLEU = BP * exp( (1/N) * Σ log(p_n) ), where p_n is the n-gram precision and BP is the brevity penalty

Brevity Penalty: This lowers the score when the generated text is shorter than the reference, so that overly short translations don’t score well. For example:

import math
reference = "The cat sat on the mat".split()  # 6 words
generated = "The cat sat".split()             # 3 words, every one found in the reference
precision = 1.0                               # perfect word precision
brevity_penalty = math.exp(1 - len(reference) / len(generated))  # exp(1 - 6/3) ≈ 0.37
bleu = brevity_penalty * precision            # high precision, but penalized for brevity

Average Precision of N-grams: This looks at how many words (or pairs of words) in the generated text match the reference. More matches mean a better score.

📗 Example : BLEU calculations

Reference text: “The cat sat on the mat”

Generated text: “The cat is sitting on a mat”

Reference unigrams:

{"The", "cat", "sat", "on", "the", "mat"}

Generated unigrams:

{"The", "cat", "is", "sitting", "on", "a", "mat"}

Matching unigrams:

{"The", "cat", "on", "mat"}

We calculate the precision for unigrams as:

Precision = (matching unigrams) / (generated unigrams) = 4 / 7 = 0.57

Assuming a brevity penalty (BP) of 1 (since the generated text is not shorter than the reference), we calculate:

BLEU = 1 * exp(log(0.57)) = 0.57

Thus, the BLEU score is 0.57, indicating a moderate overlap with the reference text.
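
A minimal Python sketch of this unigram-only calculation (real BLEU averages over several n-gram orders, so treat this as illustrative):

  import math
  from collections import Counter

  reference = "The cat sat on the mat".split()
  generated = "The cat is sitting on a mat".split()

  ref_counts = Counter(reference)
  gen_counts = Counter(generated)

  # Clipped unigram matches: a generated word counts at most as often as it appears in the reference
  matches = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
  precision = matches / len(generated)  # 4 / 7 ≈ 0.57

  # Brevity penalty is 1 when the generated text is at least as long as the reference
  bp = 1.0 if len(generated) >= len(reference) else math.exp(1 - len(reference) / len(generated))
  print(f"BLEU (unigram only): {bp * precision:.2f}")  # ≈ 0.57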

Why BLEU is Useful

BLEU is useful for quick and automatic evaluation of machine translation, ensuring the model follows the expected structure. It is fast and scalable for comparing large datasets.

⚠️ Limitations of BLEU

BLEU struggles with paraphrased but correct responses, as it focuses only on word overlap. It does not consider meaning or context, which can lead to misleading evaluations.

2.3. METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR improves upon BLEU by considering synonyms, stemming, and word order. This makes it better at evaluating semantic similarity, meaning it can recognize different ways of saying the same thing.

High METEOR Score: The generated text conveys a meaning very close to the reference text, even with different wording. High Semantic Match

Moderate METEOR Score: Some words or structures differ, but the core meaning is preserved. Moderate Semantic Match

Low METEOR Score: The generated text is significantly different in meaning or structure. Low Semantic Match

🧠 Mental Model : METEOR metric as Recipe Matching

Think of METEOR like matching recipes. The reference text is the original recipe, and the generated text is a version of the recipe that might have slightly different ingredients or instructions. METEOR looks at how closely the generated text matches the reference in terms of meaning, synonyms, and even the order of words, just like you might follow a recipe but substitute ingredients or change the steps while keeping the overall dish the same.

📗 Example

Original recipe: “You need 2 cups of flour and 1 cup of sugar.”

Generated recipe: “For the dough, use 2 cups of flour and a cup of sugar.”

Even though the word order is slightly different and “use” replaces “need,” METEOR understands that the meaning is the same, so it gives a higher score.
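
For automated scoring, NLTK ships a METEOR implementation. A hedged sketch (assuming nltk is installed and the WordNet data has been downloaded via nltk.download('wordnet'); recent NLTK versions expect pre-tokenized input):

  from nltk.translate.meteor_score import meteor_score

  reference = "You need 2 cups of flour and 1 cup of sugar .".split()
  generated = "For the dough , use 2 cups of flour and a cup of sugar .".split()

  # METEOR aligns words via exact matches, stems, and WordNet synonyms
  score = meteor_score([reference], generated)
  print(f"METEOR: {score:.2f}")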

Why METEOR is Useful

METEOR is useful for more flexible text evaluation, as it accounts for synonyms and grammatical variations. This makes it a better choice for assessing human-like text quality.

⚠️ Limitations of METEOR

Unlike BLEU, METEOR is language-dependent and requires additional tools for different languages. It can also be slower to compute, making it less practical for large-scale evaluations.

2.4. Diversity Metrics

Diversity Metrics evaluate the variety of generated outputs, ensuring responses are not repetitive and exhibit richness in language.

Diversity score = (unique words or n-grams) / (total words or n-grams)

High Diversity Score (0.8 - 1.0): The generated text uses a wide range of vocabulary and sentence structures, avoiding repetition. High Linguistic Variety

Example: “The sunset painted the sky in hues of crimson, tangerine, and violet, casting long shadows over the quiet streets.”

Moderate Diversity Score (0.4 - 0.7): Some repetition is present, but the text still shows a reasonable level of variation. Moderate Linguistic Variety.

Example: “The sunset was beautiful, with shades of orange and pink. The streets were quiet as the sun went down.”

Low Diversity Score (0.0 - 0.3): The generated text is highly repetitive or lacks variety in word choice and structure. Low Linguistic Variety.

Example: “The sunset was nice. The sunset was colorful. The sunset was in the sky. The sunset was beautiful.”

Type-Token Ratio (TTR)

TTR = (number of unique words) / (total number of words)

High TTR (0.8 - 1.0): Rich vocabulary with minimal repetition.

Example: “The cat observed the fluttering butterfly, then leaped gracefully onto the windowsill to bask in the golden sunlight.”

Moderate TTR (0.4 - 0.7): Some words are repeated, but the text still maintains a reasonable level of variety.

Example: “The cat sat on the mat, then walked to the door. The mat was soft, and the cat enjoyed sitting on it.”

Low TTR (0.0 - 0.3): The text contains frequent repetition, with limited vocabulary variety.

Example: “The cat, the cat, the cat sat. The cat sat on the mat. The mat, the mat, the mat.”
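
A minimal sketch of the TTR calculation on the examples above, using naive whitespace tokenization (a real implementation would tokenize more carefully):

  def type_token_ratio(text: str) -> float:
      tokens = text.lower().replace(",", "").replace(".", "").split()
      return len(set(tokens)) / len(tokens)

  repetitive = "The cat, the cat, the cat sat. The cat sat on the mat. The mat, the mat, the mat."
  varied = "The cat observed the fluttering butterfly, then leaped gracefully onto the windowsill to bask in the golden sunlight."

  print(f"Repetitive text TTR: {type_token_ratio(repetitive):.2f}")  # ≈ 0.26
  print(f"Varied text TTR: {type_token_ratio(varied):.2f}")          # ≈ 0.83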

Distinct-N

Distinct-N = (number of unique n-grams) / (total number of n-grams)

High Distinct-1 & Distinct-2: The text uses a wide range of unique words and phrase combinations, indicating high lexical and phrase diversity.

Example: “The gentle breeze rustled through the autumn leaves as birds chirped melodiously in the distance.”

Moderate Distinct-1 & Distinct-2: Some words and phrases are repeated, but the text still maintains a reasonable level of variety.

Example: “The wind blew through the trees. The leaves swayed in the breeze as the birds chirped.”

Low Distinct-1 & Distinct-2: The text contains frequent repetition, leading to low lexical and phrase diversity.

Example: “Hello hello hello. The wind, the wind, the wind.”
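
A similar sketch for Distinct-N, counting the fraction of n-grams that are unique:

  def distinct_n(text: str, n: int) -> float:
      tokens = text.lower().replace(",", "").replace(".", "").split()
      ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
      return len(set(ngrams)) / len(ngrams)

  text = "Hello hello hello. The wind, the wind, the wind."
  print(f"Distinct-1: {distinct_n(text, 1):.2f}")  # ≈ 0.33
  print(f"Distinct-2: {distinct_n(text, 2):.2f}")  # ≈ 0.50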

Why Diversity Metrics are Useful

Diversity Metrics help ensure that AI-generated responses remain engaging, avoiding repetitive or overly formulaic outputs. This is especially important in creative applications like storytelling and dialogue generation.

⚠️ Limitations of Diversity Metrics

While promoting variation, high diversity can sometimes reduce coherence and relevance. Additionally, measuring diversity effectively requires balancing it with semantic consistency.

3. Precision & Structure-Based Metrics

Precision & Structure-Based Metrics help evaluate how well a generated text preserves key information and meaning. These metrics compare the output to a reference text, assessing aspects like precision, recall, and semantic similarity.

🧠 Mental Model : Imagine a robot summarizing a book

We want to check if it captures the most important points without adding unnecessary details. To score the summary well, we look at a few things:

Precision: Does the robot include only the right details without extra fluff? Think of it like a chef picking the best ingredients for a dish. Relevant

Recall: Does the summary capture all the key points, or does it leave out important information? It’s like making sure you don’t forget key scenes when summarizing a movie. Comprehensive

Semantic Similarity: Does the generated text match the meaning of the original, even if phrased differently? Imagine explaining a concept in your own words while keeping the core idea intact. Meaningful

To measure these characteristics, we use structured metrics that compare the generated text to a reference. Two commonly used metrics for this are ROUGE and BERTScore.

3.1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures how well a generated text matches a reference text. It is commonly used in summarization and text generation tasks. ROUGE scores evaluate overlap between generated and human-written texts, helping assess content quality.

🧠 Mental Model : Shopping List Matching

Think of ROUGE like a shopping list. The words in the reference text are items on a shopping list, and the generated text is what you actually bought. ROUGE looks at how many of the items you bought match the items on the list, focusing on exact matches.

📗 Example

Reference shopping list: “I need to buy apples, bananas, and milk.”

Generated: “I bought apples, milk, and bread.”

Even though you bought a different item (bread), ROUGE focuses on the exact matches (“apples” and “milk”). The more items that match, the higher the ROUGE score.

ROUGE-N: Counting Matching Words

ROUGE-N = (matching n-grams) / (n-grams in the reference)

ROUGE-N measures the overlap of N-grams (sequences of N words) between the generated and reference texts. Higher scores indicate better content similarity. An n-gram is a sequence of N words. For example, in the sentence “The cat sat”, the bigrams (2-grams) are “The cat” and “cat sat”.

ROUGE-1: Measures overlap of unigrams (single words). Useful for evaluating lexical similarity. Basic Word Overlap

ROUGE-2: Measures overlap of bigrams (two-word sequences), capturing more contextual accuracy. Phrase-Level Similarity

📗 Example : ROUGE Calculations

Reference text: “The cat sat on the mat”

Generated text: “The cat is sitting on a mat”

We compute their ROUGE-2 score.

Reference bigrams:

{"The cat", "cat sat", "sat on", "on the", "the mat"}

Generated bigrams:

{"The cat", "cat is", "is sitting", "sitting on", "on a", "a mat"}

The matching bigram is:

{"The cat"}

Using the formula:

ROUGE-2 score = (matching bigrams) / (reference bigrams)

We calculate:

ROUGE-2 score = 1 / 5 = 0.20 (20%)

Thus, the ROUGE-2 score is 20%, indicating partial phrase-level overlap with the reference text.
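
A minimal Python sketch of this ROUGE-2 (recall-oriented) calculation:

  def bigrams(tokens):
      return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

  reference = "The cat sat on the mat".split()
  generated = "The cat is sitting on a mat".split()

  matching = bigrams(reference) & bigrams(generated)
  rouge_2 = len(matching) / len(bigrams(reference))
  print(f"ROUGE-2: {rouge_2:.2f}")  # 1 / 5 = 0.20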

ROUGE-S: Skip-Bigram Overlap

ROUGE-S = (matching skip-bigrams) / (skip-bigrams in the reference)

ROUGE-S (ROUGE-Skip) accounts for skip-bigrams—pairs of words appearing in the same order but not necessarily adjacent. It helps evaluate semantic similarity even if the exact word sequence differs.

Higher ROUGE-S: Suggests strong similarity in meaning, even with slight word rearrangements. Flexible Similarity Matching

Lower ROUGE-S: Indicates significant structural differences, reducing semantic alignment. Weak Content Overlap

📗 Example : Calculations

Reference text: “The cat sat on the mat”

Generated text: “The cat is sitting on a mat”

We compute their skip-bigrams with a maximum skip distance of 2.

Reference skip-bigrams:

("The cat", "cat sat", "sat on", "on the", "the mat", "The sat", "cat on", "sat the", "on mat")

Generated skip-bigrams:

(“The cat”, “cat is”, “is sitting”, “sitting on”, “on a”, “a mat”, “The is”, “cat sitting”, “is on”, “sitting a”, “on mat”)

The matching skip-bigrams are those present in both sets: (“The cat”, “on mat”)

Using the formula:

ROUGE-S score = (matching skip-bigrams) / (reference skip-bigrams)

We calculate:

ROUGE-S score = 2 / 9 = 0.22 (22%)

Thus, the generated text retains 22% of the reference skip-bigrams, indicating a partial similarity.
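
A minimal sketch of this skip-bigram calculation, using the same maximum skip distance of 2:

  def skip_bigrams(tokens, max_skip=2):
      pairs = set()
      for i in range(len(tokens)):
          # Pair each word with the words up to max_skip positions ahead of it
          for j in range(i + 1, min(i + max_skip + 1, len(tokens))):
              pairs.add((tokens[i], tokens[j]))
      return pairs

  reference = "The cat sat on the mat".split()
  generated = "The cat is sitting on a mat".split()

  matching = skip_bigrams(reference) & skip_bigrams(generated)
  rouge_s = len(matching) / len(skip_bigrams(reference))
  print(f"ROUGE-S: {rouge_s:.2f}")  # 2 / 9 ≈ 0.22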

ROUGE-L: Longest Common Subsequence

ROUGE-L = (LCS length) / (length of the reference text)

ROUGE-L measures the longest common subsequence (LCS) between the generated and reference texts. Unlike ROUGE-N, it considers word order and structure.

Higher ROUGE-L: Indicates better structural alignment, meaning the generated text follows a similar word order as the reference. Strong Structural Match

Lower ROUGE-L: Suggests significant deviations in sentence structure, reducing fluency. Weak Structural Match

📗 Example : ROUGE-L calculations

Reference text: “The cat sat on the mat”

Generated text: “The cat is sitting on a mat”

We calculate the Longest Common Subsequence (LCS) between the reference and generated texts.

Reference text (tokens):

["The", "cat", "sat", "on", "the", "mat"]

Generated text (tokens):

["The", "cat", "is", "sitting", "on", "a", "mat"]

The Longest Common Subsequence (LCS) is the longest sequence of words that appear in both the reference and generated texts in the same order. In this case, the LCS is:

["The", "cat", "on", "mat"]

Now, we compute the ROUGE-L score using the formula:

ROUGE-L = (LCS Length) / (Length of Reference Text)

The LCS length is 4 (since “The”, “cat”, “on”, and “mat” are the longest common subsequence), and the reference text has 6 words.

We calculate:

ROUGE-L = 4 / 6 = 0.67 (67%)

Thus, the ROUGE-L score is 67%, meaning the longest common subsequence covers 67% of the reference text.
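
A minimal sketch of this ROUGE-L calculation, using a standard dynamic-programming LCS:

  def lcs_length(a, b):
      # dp[i][j] = length of the LCS of a[:i] and b[:j]
      dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              if a[i - 1] == b[j - 1]:
                  dp[i][j] = dp[i - 1][j - 1] + 1
              else:
                  dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
      return dp[-1][-1]

  reference = "The cat sat on the mat".split()
  generated = "The cat is sitting on a mat".split()

  rouge_l = lcs_length(reference, generated) / len(reference)
  print(f"ROUGE-L: {rouge_l:.2f}")  # 4 / 6 ≈ 0.67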

⚠️ Limitations of ROUGE Metric

One key issue is that ROUGE heavily relies on n-gram overlap, which can overlook the semantic meaning of the generated content. Texts with similar meaning but different wordings may receive lower scores, even if they are highly relevant.

Additionally, ROUGE does not account for fluency or grammar, meaning a generated text with poor sentence structure but high n-gram overlap could still receive a high score. This leads to a potential misrepresentation of the actual quality of the text.

Furthermore, ROUGE metrics like ROUGE-N and ROUGE-L may struggle with tasks that require creativity or paraphrasing, as they are primarily based on surface-level word matching. They may not accurately measure how well the generated text captures the underlying meaning, especially in cases where synonyms or rephrased structures are used. For example, a paraphrased sentence with different wording but the same meaning may receive a low ROUGE score, despite being a valid and high-quality generation.
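
In practice, ROUGE variants are rarely computed by hand. A hedged sketch using the rouge-score package (assuming it is installed via pip install rouge-score; exact numbers depend on its tokenization and stemming settings):

  from rouge_score import rouge_scorer

  scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
  scores = scorer.score("The cat sat on the mat",       # reference
                        "The cat is sitting on a mat")  # generated
  for name, result in scores.items():
      print(f"{name}: precision={result.precision:.2f}, recall={result.recall:.2f}, f1={result.fmeasure:.2f}")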

3.2. BERTScore

BERTScore measures how similar two texts are using word embeddings. Unlike ROUGE, which checks exact word matches, BERTScore captures meaning by comparing words in a semantic (embedding) space.

How BERTScore Works

Imagine each word is a point in space. BERTScore finds how close these points are between the generated and reference text. Closer points mean higher similarity.

BERTScore (simplified) ≈ (1/N) * Σ cosine_similarity(generated token embedding, reference token embedding)

High BERTScore: The generated text is semantically close to the reference, even if words are different. Strong Meaning Match Example: Reference: “The cat is sitting on the mat.” Generated: “A cat is resting on the carpet.” BERTScore: 0.92

Medium BERTScore: The generated text has some semantic overlap with the reference, but not a perfect match. Partial Meaning Match Example: Reference: “The cat is sitting on the mat.” Generated: “A dog is lying on the rug.” BERTScore: 0.78

Low BERTScore: The generated text has different meaning, even if some words overlap. Weak Meaning Match Example: Reference: “The cat is sitting on the mat.” Generated: “The weather is rainy and cold.” BERTScore: 0.45

🧠 Mental Model : BERTScore and Puzzle Pieces

Imagine BERTScore as putting together a puzzle. Each word in the generated text is like a puzzle piece, and each word in the reference text is another puzzle piece. BERTScore looks at how closely the puzzle pieces (words) fit together based on their shape (meaning), even if the words themselves aren’t the same.

📗 Example

Reference text: "The dog is happy." Generated: "The puppy is joyful."

Even though the words are different, BERTScore checks if the shapes of the pieces match - and because “dog” and “puppy,” or “happy” and “joyful” are similar in meaning, the pieces fit together well!

Why BERTScore is Useful

BERTScore is great for evaluating summaries and translations, since it looks beyond exact words and focuses on meaning. However, it requires pre-trained models, making it more complex than ROUGE.

📗 Example : BERTScore Calculation

Reference text: “The quick brown fox jumps over the lazy dog”

Generated text: “A fast brown fox leaps over the lazy dog”

We calculate BERTScore using pre-trained BERT embeddings for each token in the reference and generated texts. BERTScore measures the cosine similarity between token embeddings to determine the similarity between two texts.

Reference text (tokens):

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Generated text (tokens):

["A", "fast", "brown", "fox", "leaps", "over", "the", "lazy", "dog"]

For each token in the reference and generated texts, we retrieve the corresponding embeddings from BERT and compute the cosine similarity between the embeddings. We then aggregate these scores to get the final BERTScore.

Here’s a simple visualization of the cosine similarity for a few token pairs:

  Cosine similarity("quick", "fast") = 0.88
  Cosine similarity("jumps", "leaps") = 0.92
  Cosine similarity("the", "the") = 1.00
  Cosine similarity("lazy", "lazy") = 1.00
  Cosine similarity("dog", "dog") = 1.00

We then average the similarity across the matched token pairs to approximate the overall BERTScore. In this simplified form:

BERTScore = 1 / N * Σ cosine_similarity

In this case, the average similarity across the five token pairs shown above is:

BERTScore = (0.88 + 0.92 + 1.00 + 1.00 + 1.00) / 5 = 0.96

Thus, the BERTScore is 0.96, indicating a high similarity between the reference and generated texts based on BERT embeddings.
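
A hedged sketch using the bert-score package (assuming it is installed via pip install bert-score; it downloads a pre-trained model on first use, and its precision/recall/F1 matching is more involved than the simplified average above, so the numbers will differ slightly):

  from bert_score import score

  references = ["The quick brown fox jumps over the lazy dog"]
  candidates = ["A fast brown fox leaps over the lazy dog"]

  # Returns precision, recall, and F1 tensors, one entry per candidate/reference pair
  P, R, F1 = score(candidates, references, lang="en")
  print(f"BERTScore F1: {F1[0].item():.2f}")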

⚠️ Limitations of BERTScore

Not perfect for all text types: BERTScore might struggle with evaluating texts where the meaning is very abstract or figurative, like poetry or humor.

Relies on BERT embeddings: BERTScore uses pre-trained models (like BERT), which might not capture every subtlety of meaning, especially for specialized or niche topics.

Context matters: BERTScore might not always account for the context of word usage. For example, “bank” can mean a financial institution or the side of a river, and the score may not distinguish between the two without more context.

4. Comparing Different Metrics

METEOR vs BERTScore

Both METEOR and BERTScore evaluate semantic similarity, but differ in approach:

Matching approach: METEOR uses synonym matching, stemming, and word order for surface-level matches. BERTScore uses contextualized embeddings for deeper semantic understanding.

Handling abstract meaning: METEOR works with straightforward content but struggles with figurative language. BERTScore better captures meaning but has limitations with creative content.

Computation complexity: METEOR is simpler and less resource-intensive. BERTScore requires pre-trained models, making it slower but more accurate.

Context understanding: METEOR considers word order but misses contextual meanings. BERTScore better understands context and multiple word meanings.

ROUGE vs BLEU

Both ROUGE and BLEU are evaluation metrics for text generation, but they approach measurement differently:

Measurement focus: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) prioritizes recall, measuring how much of the reference text appears in the generated text. BLEU (Bilingual Evaluation Understudy) emphasizes precision, measuring how much of the generated text appears in the reference.

Application domain: ROUGE was designed for summarization tasks, making it better for evaluating condensed content. BLEU was created for machine translation and works best when evaluating texts of similar length to their references.

N-gram handling: ROUGE offers variants (ROUGE-N, ROUGE-L, ROUGE-S) to capture different linguistic aspects. BLEU uses a weighted average of n-gram precisions with a brevity penalty to prevent very short outputs from scoring highly.

Limitations: ROUGE may reward verbose outputs that include reference content among filler. BLEU may penalize valid paraphrases that use different wording than references. Neither fully captures semantic meaning or fluency.