📊 Extraction Metrics Guide

Understanding quality metrics for PDF-to-Excel semantic extraction

🔀 N-Gram Precision Metrics

1-Gram Precision (Unigram)

Percentage of individual words in the extracted value that appear in the original PDF text.
⚡ Why It Matters: Measures vocabulary accuracy. A high 1-gram score means the LLM used the correct words from the source document. Low scores indicate the LLM introduced entirely new vocabulary not present in the original PDF.
Example:
Original (PDF): "The quick brown fox jumps over the lazy dog"
Extracted: "The quick brown fox jumps over the dog"
1-gram precision: 7/8 = 87.5% (missing "lazy")
Score Range  Interpretation
90-100%  ✅ Excellent - All or nearly all words match the source
75-89%   ⚠️ Good - Most words match; minor omissions or additions
50-74%   ❌ Fair - Significant word-level changes; possible paraphrasing
<50%     ❌ Poor - Major vocabulary differences; possible hallucination
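
To make the arithmetic above concrete, here is a minimal Python sketch of unigram precision. It assumes a plain lowercase whitespace tokenizer, and the function name and BLEU-style count clipping are illustrative choices, not the extraction system's actual code.

  # Minimal sketch of 1-gram (unigram) precision; assumes whitespace tokenization.
  from collections import Counter

  def unigram_precision(extracted: str, original: str) -> float:
      # Fraction of extracted tokens that also occur in the original (counts clipped, BLEU-style).
      ext_tokens = extracted.lower().split()
      if not ext_tokens:
          return 0.0
      orig_counts = Counter(original.lower().split())
      ext_counts = Counter(ext_tokens)
      matches = sum(min(count, orig_counts[tok]) for tok, count in ext_counts.items())
      return matches / len(ext_tokens)

  original = "The quick brown fox jumps over the lazy dog"
  extracted = "The quick brown fox leaps over the lazy dog"
  print(round(unigram_precision(extracted, original), 3))  # 0.889 -> only "leaps" has no match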

2-Gram Precision (Bigram)

Percentage of consecutive word-pairs in the extracted value that appear in the original PDF text.
⚡ Why It Matters: Measures word order and phrasing accuracy. While 1-gram might be high (all words present), 2-gram drops if words are reordered or separated. Detects subtle paraphrasing where the same words are used but in different sequences.
Example:
Original (PDF): "The quick brown fox"
Extracted: "The brown quick fox"
1-gram precision: 100% (all 4 words present)
2-gram precision: 0/3 = 0% (none of the extracted pairs "The brown", "brown quick", "quick fox" appears in the source; see the sketch after the score table below)
Score Range  Interpretation
90-100%  ✅ Excellent - Word sequences preserved perfectly
75-89%   ⚠️ Good - Mostly correct sequences; minor reordering
50-74%   ❌ Fair - Significant reordering detected
<50%     ❌ Poor - Text heavily rearranged or rewritten
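
The reordering effect is easiest to see in code. Below is a sketch of a generic n-gram precision helper; changing n to 3 or 4 reproduces the behaviour described in the next two subsections as well. The helper names and whitespace tokenizer are assumptions for illustration.

  # Sketch of n-gram precision for any n; whitespace tokenization assumed.
  def ngrams(tokens, n):
      # All sequences of n consecutive tokens.
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  def ngram_precision(extracted, original, n):
      ext = ngrams(extracted.lower().split(), n)
      orig = set(ngrams(original.lower().split(), n))
      return sum(1 for g in ext if g in orig) / len(ext) if ext else 0.0

  original = "The quick brown fox"
  extracted = "The brown quick fox"
  print(ngram_precision(extracted, original, 1))  # 1.0 -> every word is present
  print(ngram_precision(extracted, original, 2))  # 0.0 -> no word pair survives the reordering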

3-Gram Precision (Trigram)

Percentage of three-word sequences in the extracted value that appear in the original PDF text.
⚡ Why It Matters: Measures phrase and clause-level retention. This is where you detect if the LLM preserved the original phrasing style and grammar. Low 3-gram scores with high 1/2-gram scores indicate the LLM rearranged larger chunks of text.
Example:
Original (PDF): "The board of directors meets quarterly"
Extracted: "The board meets quarterly"
3-gram precision: 0/2 = 0% (neither "The board meets" nor "board meets quarterly" appears in the source)
Score Range  Interpretation
85-100%  ✅ Excellent - Phrases preserved exactly
70-84%   ⚠️ Good - Minor phrase modifications
50-69%   ❌ Fair - Phrases partially altered
<50%     ❌ Poor - Phrases significantly rewritten

4-Gram Precision (Quadgram)

Percentage of four-word sequences in the extracted value that appear in the original PDF text.
⚡ Why It Matters: Most important for assignment compliance. Measures sentence-level and idiom retention. A high 4-gram score indicates the LLM retained the exact original phrasing, grammar, and sentence structure, fulfilling the "retain exact original wording" requirement.
Example:
Original (PDF): "Please submit your assignment by Friday morning"
Extracted: "Please submit your assignment by Friday morning"
4-gram precision: 100% (all quadgrams match exactly)
Score Range  Interpretation
90-100%  ✅ EXCELLENT MATCH - Original wording perfectly preserved
75-89%   ⚠️ GOOD MATCH - Original intent retained; minor rewording
<75%     ❌ SOMEWHAT MATCH - Significant rewording detected

βš™οΈ How Metrics Are Calculated

N-Gram Extraction Process & BLEU Proxy

  1. Tokenization: Original PDF text and extracted value are split into individual words and punctuation marks (tokens).
  2. N-Gram Generation: For each value of n (1, 2, 3, 4), all possible sequences of n consecutive tokens are extracted.
    Example: "The quick brown" generates:
    1-grams: [The, quick, brown]
    2-grams: [(The, quick), (quick, brown)]
    3-grams: [(The, quick, brown)]
  3. Overlap Calculation: Count how many n-grams from the extracted value appear in the original PDF text.
  4. Precision Computation: Precision = (Matching n-grams) / (Total n-grams in extracted value), computed separately for n = 1..4.
  5. BLEU Proxy: When no gold reference exists, the system computes a BLEU-like proxy as the arithmetic mean of the 1-gram through 4-gram precisions (a runnable sketch of the whole pipeline follows this list):
    BLEU_proxy = (p1 + p2 + p3 + p4) / 4
    This proxy is simple, interpretable, and stable for the short extracted values typical of form extraction.
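
The five steps above condense into a few lines of Python. This is a sketch under the stated assumptions (a regex word/punctuation tokenizer, an unweighted arithmetic mean); the function names are illustrative, not the system's actual API.

  import re

  def tokenize(text):
      # Step 1: split into word and punctuation tokens.
      return re.findall(r"\w+|[^\w\s]", text.lower())

  def ngrams(tokens, n):
      # Step 2: all sequences of n consecutive tokens.
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  def precisions(extracted, original):
      # Steps 3-4: overlap count and precision for n = 1..4.
      ext_tok, orig_tok = tokenize(extracted), tokenize(original)
      result = {}
      for n in range(1, 5):
          ext_ngrams = ngrams(ext_tok, n)
          orig_ngrams = set(ngrams(orig_tok, n))
          matches = sum(1 for g in ext_ngrams if g in orig_ngrams)
          result[n] = matches / len(ext_ngrams) if ext_ngrams else 0.0
      return result

  def bleu_proxy(extracted, original):
      # Step 5: arithmetic mean of the four precisions.
      return sum(precisions(extracted, original).values()) / 4

  print(bleu_proxy("Please submit your assignment by Friday morning",
                   "Please submit your assignment by Friday morning"))  # 1.0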

Why N-Grams Matter for This Task

  • 1-grams: Detect if the LLM used completely foreign vocabulary not in the PDF.
  • 2-grams: Detect if words are reordered or the sequence is altered.
  • 3-grams: Detect if larger phrases are compressed or rewritten.
  • 4-grams: Detect if the overall sentence structure and idiom are preserved (most critical for assignments).

✍️ Assignment Compliance Guide

Requirement: "Retain exact original wording, sentence structure, and phrasing"

What the assignment is asking for:

  • Extract information from the PDF as-is, without paraphrasing or rewriting.
  • Preserve the original author's voice, style, and phrasing.
  • Maintain sentence structure and grammatical choices.
  • Do not compress, simplify, or improve upon the original text.

How to use metrics to verify compliance:

Metric Result  Match Quality  Action
BLEU_proxy ≥ 70%   ✅ EXCELLENT MATCH  Submit with confidence
BLEU_proxy 50-69%  ⚠️ GOOD MATCH       Review for minor rewording; accept if the rubric allows
BLEU_proxy < 50%   ❌ SOMEWHAT MATCH   Manually correct or reject; likely paraphrased
Example Calculation:
p1 = 80%   p2 = 60%   p3 = 40%   p4 = 20%
BLEU_proxy = (0.80 + 0.60 + 0.40 + 0.20) / 4 = 0.50 (50%)
This example lands exactly at the bottom of the GOOD MATCH band (50-69%); any proxy below 50% would be classified as SOMEWHAT MATCH. A sketch of the threshold mapping follows.
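
The mapping can be written as a small helper. The cutoffs and label strings come from the table above; everything else is illustrative.

  def match_label(bleu_proxy):
      # Cutoffs taken from the compliance table above.
      if bleu_proxy >= 0.70:
          return "EXCELLENT MATCH"
      if bleu_proxy >= 0.50:
          return "GOOD MATCH"
      return "SOMEWHAT MATCH"

  p1, p2, p3, p4 = 0.80, 0.60, 0.40, 0.20
  proxy = (p1 + p2 + p3 + p4) / 4
  print(f"{proxy:.2f} -> {match_label(proxy)}")  # 0.50 -> GOOD MATCH (boundary case)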

Red Flags to Watch For:

  • 1-gram < 80%: LLM introduced new vocabulary. Possible hallucination.
  • High 1-gram, Low 4-gram: Same words used but reordered/compressed. Likely paraphrasing.
  • Multiple PARAPHRASED fields: LLM is not suitable for this task; consider rule-based extraction.
  • Values not in PDF: LLM hallucinated; definitely reject.

💡 Tips & Best Practices

For Best Results:

  • Use digital PDFs: Scanned PDFs may have OCR errors, leading to mismatches even if extraction is correct.
  • Review SOMEWHAT MATCH fields immediately: Don't submit extractions with low BLEU_proxy scores without manual review.
  • Check context: If a field has high 1-gram but low 4-gram, read the top "missing n-grams" to see what changed.
  • Trust the BLEU_proxy score: For assignment compliance, focus on the BLEU_proxy as the primary indicator of wording preservation.
  • Manual override: If the extraction is functionally correct but has low n-gram scores due to necessary abbreviation, you can manually approve it.

Understanding Missing & Extra N-Grams:

Missing N-Grams: Sequences present in the original PDF but absent from the extraction. Usually indicates:

  • Words/phrases were omitted (compression)
  • Synonyms were used (paraphrasing)
  • Word order was changed (restructuring)

Extra N-Grams: Sequences in the extraction not found in the original PDF (both sets can be computed as simple set differences; see the sketch after this list). Usually indicates:

  • LLM added new content (hallucination or context blending)
  • Different phrasing was used (paraphrasing)
  • Formatting or punctuation was altered (usually minor)

📚 Reference

Summary of All Metrics

Metric What It Measures Ideal Range Relevance to Assignment
1-Gram  Word-level vocabulary match  ≥ 85%  Detects hallucination
2-Gram  Word-pair sequence match  ≥ 80%  Detects word reordering
3-Gram  Phrase-level match  ≥ 80%  Detects phrase rewording
4-Gram  Sentence structure & idiom match  ≥ 90%  PRIMARY COMPLIANCE INDICATOR

Common Questions

Q: Why is 4-gram the most important metric?
A: Because your assignment specifically requires "exact original wording, sentence structure, and phrasing." 4-gram precision directly measures whether sentence structure and multi-word phrases are preserved, making it the best single indicator of compliance.

Q: What if 1-gram is high but 4-gram is low?
A: This means the LLM used all the right words but rearranged them or rewrote the sentences. This is still a form of paraphrasing and does NOT meet the requirement. Reject or manually fix.

Q: Can I use extractions with "GOOD MATCH" status?
A: Only if your assignment allows minor rewording or if you manually verify the changes are acceptable. If the rubric demands exact wording, prefer "EXCELLENT MATCH".

Q: Why do I see 0% on some n-grams?
A: This typically means the value was significantly rewritten or is entirely different from the PDF. The LLM may have hallucinated or misunderstood the field. Reject and investigate.