LLM Evaluation Metrics: Extractive & Generative Question Answering

💡 1. Question Answering (QA) Evaluation Specifics

Unlike general text generation, QA evaluation needs precision in extractive QA and coherence in generative QA.

💡 2. Extractive QA Metrics

  • Exact Match (EM): Ensures word-for-word accuracy by checking if the answer matches the reference exactly.

  • F1-score & Token-Level F1: Give partial credit when a response overlaps with the reference but is not an exact match.

💡 3. Generative QA Metrics

  • ROUGE: Measures n-gram overlap, evaluating how similar generated responses are to reference answers.

  • QAG-FactEval: Checks factual accuracy to ensure generated QA responses are not misleading.


1. Why Is QA Evaluation Different?

Evaluating Question Answering (QA) models is not the same as evaluating general text generation. In QA, the correctness of the answer is critical, but how we measure correctness depends on the type of QA system being used.

Two main types of QA models:

Extractive QA – The model identifies and extracts a specific span of text from the given context.

📗 Example : Extractive QA

Context: “The Eiffel Tower is located in Paris, France.”

Question: “Where is the Eiffel Tower?”

  • Expected: “Paris, France”
  • Incorrect: “France” – Not specific enough

Generative QA – The model generates an answer in its own words, based on its understanding of the context.

📗 Example : Generative QA

Context: “The Eiffel Tower, a global landmark, stands in the heart of Paris, France.”

Question: “Where is the Eiffel Tower?”

  • Expected: “The Eiffel Tower is located in Paris, France, and is one of the most famous landmarks in the world.”
  • Incorrect: “The Eiffel Tower is in Germany.” – Factually incorrect

Challenges in QA Evaluation

The key challenge in QA evaluation is that different QA models require different metrics to measure their accuracy and usefulness.

🧠 Mental Model : Precision vs. Understanding

Think of extractive and generative QA as two different ways of answering a quiz:

Extractive QA is like a fill-in-the-blank test where only an exact answer is correct.

Generative QA is like an essay question where multiple answers can be valid as long as they are factually correct and well-formed.

Extractive QA requires exact correctness. A single wrong word can change the meaning of an answer.

Generative QA needs to be evaluated for coherence, fluency, and factual accuracy rather than exact word match.

⚖️ Pros and Cons of Extractive vs Generative QA

Extractive QA is more reliable and works well when answers are directly available in the source text.

It may fail if the phrasing of the question or answer is slightly different from the source text.

Generative QA is flexible and can create more natural-sounding answers.

It risks generating hallucinated or incorrect answers that are not based on the source text.

Because of these differences, a single metric cannot evaluate all QA models. This is why we use different evaluation methods for extractive and generative QA.

2. Exact Matching & Extractive QA Metrics

Extractive QA models find and return exact spans of text from a passage. To evaluate their performance, we need strict accuracy-based metrics that check how well the extracted answer matches the correct answer.

1. Exact Match (EM)

Exact Match (EM) is the simplest way to evaluate extractive QA. It checks if the model’s answer is word-for-word identical to the reference answer.

If the predicted answer matches the reference answer exactly , EM = 1 (100% correct).

If there is any difference (even small), EM = 0 (incorrect).

📗 Example : Exact Match in QA

Context: “The capital of France is Paris.”

Question: “What is the capital of France?”

Case 1: Perfect Match

  • Expected: “Paris”
  • Predicted: “Paris”
  • EM Score: 1 (Correct)

Case 2: Slight Difference

  • Expected: “Paris”
  • Predicted: “Paris, France”
  • EM Score: 0 (Incorrect) – Extra words make it a mismatch
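
The two cases above can be reproduced with a few lines of Python. This is a minimal sketch: the normalization step (lowercasing, dropping punctuation and articles) follows common SQuAD-style practice and is an assumption here; stricter setups simply compare the raw strings.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace.
    (SQuAD-style normalization; stricter setups may skip this step.)"""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    """EM = 1 only if the normalized strings are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(reference))

print(exact_match("Paris", "Paris"))          # 1 – Case 1: perfect match
print(exact_match("Paris, France", "Paris"))  # 0 – Case 2: extra words cause a mismatch
```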

⚖️ Pros and Cons of EM

Simple, easy to compute, and reliable when exact answers are needed.

Works well in tasks where precision and exact matching are critical.

Too strict - small differences like punctuation or synonyms can lead to false negatives.

Doesn’t allow for partial correctness, which can be a limitation in real-world answers.

2. F1-Score & Token-Level F1

Since EM is too strict, the F1-score offers an alternative: it gives partial credit when the model’s answer is close to, but not exactly the same as, the reference. This is useful when evaluating models that generate text, ensuring that even partially correct answers receive some credit.

Token-Level F1 is a specific type of F1-score that operates at the word/token level, measuring how much the predicted answer overlaps with the reference answer. In the description below, F1-score refers to Token-Level F1, as we use it to evaluate LLM performance.

🧠 Mental Model : Token-Level F1 vs. Exact Match

Think of different evaluation methods as different ways of grading a fill-in-the-blank test:

Exact Match (EM) is like a binary pass/fail system – if your answer is even slightly different, you get zero points.

Token-Level F1 is like giving partial credit for each correct word, even if some extra or missing words are present.

To understand how F1-score is calculated, we need to break it down into its two key components: Precision and Recall. These metrics help measure the quality of a model’s predictions by evaluating how many words it gets right and how much of the correct answer is captured.

Precision measures how many words in the predicted answer are correct.

Recall measures how much of the reference answer is covered.

Token-Level F1-score – A way to measure accuracy by comparing words one by one, balancing how many correct words were found (Precision) and how much of the expected answer was covered (Recall).

How to calculate F1 score

Precision = Correct Words in Prediction / Total Words in Prediction

Recall = Correct Words in Prediction / Total Words in Reference Answer

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

📗 Example : Token-Level F1

Context: “The capital of France is Paris.”

Question: “What is the capital of France?”

Case 1: Partial Match

  • Expected: “Paris”
  • Predicted: “Paris, France”
  • Precision: 1/2 (50%) – Only “Paris” is correct
  • Recall: 1/1 (100%) – All of “Paris” is covered
  • F1-score: 0.67

Case 2: Exact Match

  • Expected: “Paris”
  • Predicted: “Paris”
  • Precision: 1/1 (100%)
  • Recall: 1/1 (100%)
  • F1-score: 1.0
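
Here is a small sketch of Token-Level F1 that reproduces both cases. For simplicity it uses a set-based token overlap; official SQuAD-style scripts count repeated tokens with a multiset and also strip articles, so treat this as an illustration rather than the reference implementation.

```python
import string

def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace (an illustrative choice)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return text.split()

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)  # correct words / words predicted
    recall = len(common) / len(ref_tokens)      # correct words / words in reference
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("Paris, France", "Paris"), 2))  # 0.67 – Case 1: partial match
print(token_f1("Paris", "Paris"))                    # 1.0 – Case 2: exact match
```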

⚖️ Pros and Cons of Token-Level F1

More flexible than EM, allows partial correctness, and better for evaluating real-world answers.

Balances precision and recall, making it a more reliable measure in imbalanced datasets.

Doesn’t consider meaning, only word overlap – so answers with correct intent but different wording may score low.

Can be misleading if the dataset is highly imbalanced or contains many irrelevant answers.

3. Metrics for Generative QA

Evaluating generative QA models is different from extractive QA because answers are free-form text rather than exact spans. This means evaluation must consider:

Relevance – Does the answer correctly respond to the question?

Fluency – Is the response natural and well-formed?

Factual correctness – Is the information accurate?

1. ROUGE for QA (Brief Recap)

ROUGE measures n-gram overlap between generated and reference answers. While useful, it has one major limitation in QA: it does not check factual correctness.

📗 Example : ROUGE Score vs. Factual Accuracy

Reference Answer: “The Eiffel Tower is in Paris, France.”

Generated Answer: “The Eiffel Tower is in Berlin, Germany.”

  • ROUGE Score: High – Because the sentence structure is similar
  • Issue: ❌ The answer is factually incorrect
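
A toy ROUGE-1 (unigram overlap) computation makes the problem concrete. This is not the full ROUGE family (real evaluations typically use a dedicated library), but it shows how a factually wrong answer can still score highly on overlap alone.

```python
import string

def rouge1_f1(candidate: str, reference: str) -> float:
    """Hand-rolled ROUGE-1 (unigram overlap F1), for illustration only."""
    def tokens(text):
        text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
        return text.split()
    cand, ref = tokens(candidate), tokens(reference)
    overlap = len(set(cand) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The Eiffel Tower is in Paris, France."
generated = "The Eiffel Tower is in Berlin, Germany."
print(round(rouge1_f1(generated, reference), 2))  # ~0.71 – high overlap, wrong facts
```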

To fix this, QAG-FactEval can be used.

2. QAG-FactEval: Checking Factual Correctness

One of the biggest problems in generative QA is that models can produce hallucinated (made-up) facts. QAG-FactEval solves this by verifying generated answers against the original passage.

How QAG-FactEval Works

First, the generated answer is split into fact units (individual factual claims). This is usually done using sentence parsing and entity extraction.

Each fact unit is compared with the reference passage to check its correctness.

A confidence score is assigned based on the number of correct vs. incorrect facts.

🧠 Mental Model : Fact Checking Like a Detective

Imagine a detective verifying a witness statement against a crime scene report.

Step 1: Break the statement into key facts – The detective extracts important details (who, what, where, when).

Step 2: Cross-check with evidence – Each fact is compared with official records to see if it matches.

Step 3: Assign credibility – If all facts match, the statement is fully reliable. If some are wrong, the credibility drops.

QAG-FactEval works the same way: it breaks answers into factual claims, checks them against the source, and assigns a confidence score.

📗 Example : Fact Checking in QA

Context: “The Eiffel Tower was built in 1889 and is located in Paris, France.”

Generated Answer 1 (Correct)

  • Answer: “The Eiffel Tower was built in 1889 and is in Paris.”
  • Fact Check: All facts are correct
  • Score: 1.0 (Perfect Accuracy)

Generated Answer 2 (Partially Incorrect)

  • Answer: “The Eiffel Tower was built in 1890 and is in Paris.”
  • ⚠️ Fact Check: "Built in 1890" ❌ (Incorrect year)
  • Score: 0.5 (Partially Correct)

Generated Answer 3 (Completely Wrong)

  • Answer: “The Eiffel Tower is in Germany.”
  • Fact Check: "Eiffel Tower in Germany" ❌ (Completely wrong fact)
  • Score: 0.0 (Completely Incorrect)
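
The scoring above can be captured with a skeleton like the one below. The aggregation (supported fact units divided by total fact units) follows the steps described earlier; the hard-coded fact units and the toy_verifier are placeholders for illustration – a real QAG-FactEval pipeline relies on question generation, NLI models, or an LLM judge for the verification step.

```python
def factual_consistency(fact_units: list[str], is_supported) -> float:
    """Score = supported fact units / total fact units."""
    if not fact_units:
        return 0.0
    return sum(1 for fact in fact_units if is_supported(fact)) / len(fact_units)

context = "The Eiffel Tower was built in 1889 and is located in Paris, France."

def toy_verifier(fact: str) -> bool:
    """Naive stand-in for real verification: every capitalized word or number
    in the fact must literally appear in the context."""
    details = [w for w in fact.replace(",", "").split() if w[0].isupper() or w.isdigit()]
    return all(d in context for d in details)

# Hypothetical fact units extracted from Generated Answer 2:
facts_answer_2 = ["The Eiffel Tower was built in 1890",
                  "The Eiffel Tower is in Paris"]

print(factual_consistency(facts_answer_2, toy_verifier))  # 0.5 – one of two facts is wrong
```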

Because generative QA must balance fluency, relevance, and factual correctness, QAG-FactEval is essential for evaluating real-world QA models.

⚖️ Pros and Cons of QAG-FactEval

Helps to ensure the information is correct, improves the reliability of answers, and builds trust in QA models.

Makes it easier to detect misleading or false information in generated answers.

Requires supporting source text and can be expensive to compute, especially for large datasets.

Relies on accurate and relevant sources, which can sometimes be difficult to find or verify.