📊 Extraction Metrics Guide

🔤 N-Gram Precision Metrics

1-Gram Precision (Unigram)

Percentage of individual words in the extracted value that appear in the original PDF text.

⚡ Why It Matters: Measures vocabulary accuracy. A high 1-gram score means the LLM used the correct words from the source document. Low scores indicate the LLM introduced entirely new vocabulary not present in the original PDF.

Example:
Original (PDF): "The quick brown fox jumps over the lazy dog"
Extracted: "The quick brown fox jumps over the dog"
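1-gram precision: 8/8 = 100% (every extracted word appears in the original; omitting "lazy" lowers recall, not precision)
If the extraction had instead read "The quick brown fox leaps over the dog", precision would be 7/8 = 87.5% ("leaps" is not in the original).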

Score Range	Interpretation
90-100%	✅ Excellent - All or nearly all words match the source
75-89%	⚠️ Good - Most words match; minor omissions or additions
50-74%	❌ Fair - Significant word-level changes; possible paraphrasing
<50%	❌ Poor - Major vocabulary differences; possible hallucination

2-Gram Precision (Bigram)

Percentage of consecutive word-pairs in the extracted value that appear in the original PDF text.

⚡ Why It Matters: Measures word order and phrasing accuracy. While 1-gram might be high (all words present), 2-gram drops if words are reordered or separated. Detects subtle paraphrasing where the same words are used but in different sequences.

Example:
Original (PDF): "The quick brown fox"
Extracted: "The brown quick fox"
1-gram precision: 100% (all 4 words present)
2-gram precision: 0% (none of the extracted pairs "The brown", "brown quick", "quick fox" appears in the original, whose pairs are "The quick", "quick brown", "brown fox")

Score Range	Interpretation
90-100%	✅ Excellent - Word sequences preserved perfectly
75-89%	⚠️ Good - Mostly correct sequences; minor reordering
50-74%	❌ Fair - Significant reordering detected
<50%	❌ Poor - Text heavily rearranged or rewritten

3-Gram Precision (Trigram)

Percentage of three-word sequences in the extracted value that appear in the original PDF text.

⚡ Why It Matters: Measures phrase and clause-level retention. This is where you detect if the LLM preserved the original phrasing style and grammar. Low 3-gram scores with high 1/2-gram scores indicate the LLM rearranged larger chunks of text.

Example:
Original (PDF): "The board of directors meets quarterly"
Extracted: "The board meets quarterly"
3-gram precision: 0% (neither extracted triplet, "The board meets" or "board meets quarterly", appears in the original, even though every individual word does)

Score Range	Interpretation
85-100%	✅ Excellent - Phrases preserved exactly
70-84%	⚠️ Good - Minor phrase modifications
50-69%	❌ Fair - Phrases partially altered
<50%	❌ Poor - Phrases significantly rewritten

4-Gram Precision (Quadgram)

Percentage of four-word sequences in the extracted value that appear in the original PDF text.

⚡ Why It Matters: Most important for assignment compliance. Measures sentence-level and idiom retention. A high 4-gram score indicates the LLM retained the exact original phrasing, grammar, and sentence structure—fulfilling the "retain exact original wording" requirement.

Example:
Original (PDF): "Please submit your assignment by Friday morning"
Extracted: "Please submit your assignment by Friday morning"
4-gram precision: 100% (all quadgrams match exactly)

Score Range	Interpretation
90-100%	✅ EXCELLENT MATCH - Original wording perfectly preserved
75-89%	⚠️ GOOD MATCH - Original intent retained; minor rewording
<75%	❌ SOMEWHAT MATCH - Significant rewording detected

⚙️ How Metrics Are Calculated

N-Gram Extraction Process & BLEU Proxy

Tokenization: Original PDF text and extracted value are split into individual words and punctuation marks (tokens).
N-Gram Generation: For each value of n (1, 2, 3, 4), all possible sequences of n consecutive tokens are extracted.
Example: "The quick brown" generates:
1-grams: [The, quick, brown]
2-grams: [(The, quick), (quick, brown)]
3-grams: [(The, quick, brown)]
Overlap Calculation: Count how many n-grams from the extracted value appear in the original PDF text.
Precision Computation: Precision = (Matching n-grams) / (Total n-grams in extracted value) — computed separately for n=1..4.
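BLEU Proxy: When no gold reference exists, the system computes a BLEU-like proxy as the arithmetic mean of the 1-gram through 4-gram precisions:
BLEU_proxy = (p1 + p2 + p3 + p4) / 4
This proxy is simple and interpretable. Standard BLEU takes a geometric mean, which drops to zero whenever any single precision is zero; the arithmetic mean avoids that and stays stable for the short extracted values typical in form extraction.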
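
The process above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function names, the regex tokenizer, case-sensitive matching, and the choice to score 0 when the extracted value has fewer than n tokens are all assumptions.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (case-sensitive; an assumption)."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(source, extracted, n):
    """Share of the extracted value's n-grams that also occur in the source."""
    source_set = set(ngrams(tokenize(source), n))
    candidate = ngrams(tokenize(extracted), n)
    if not candidate:
        return 0.0  # assumption: values shorter than n tokens score 0
    matches = sum(1 for gram in candidate if gram in source_set)
    return matches / len(candidate)

def bleu_proxy(source, extracted):
    """Arithmetic mean of the 1-gram through 4-gram precisions."""
    return sum(ngram_precision(source, extracted, n) for n in (1, 2, 3, 4)) / 4

# Hypothetical field where one word was changed and one dropped.
source = "Please submit your assignment by Friday morning"
extracted = "Please submit the assignment by Friday"
for n in (1, 2, 3, 4):
    # Precisions are 5/6, 3/5, 1/4 and 0/3 for n = 1..4.
    print(n, ngram_precision(source, extracted, n))
print("BLEU_proxy", bleu_proxy(source, extracted))
```

A single substituted word ("the" for "your") breaks every n-gram that spans it, which is why the precision falls faster as n grows.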

Why N-Grams Matter for This Task

1-grams: Detect if the LLM used completely foreign vocabulary not in the PDF.
2-grams: Detect if words are reordered or the sequence is altered.
3-grams: Detect if larger phrases are compressed or rewritten.
4-grams: Detect if the overall sentence structure and idiom are preserved (most critical for assignments).

✍️ Assignment Compliance Guide

Requirement: "Retain exact original wording, sentence structure, and phrasing"

What the assignment is asking for:

Extract information from the PDF as-is, without paraphrasing or rewriting.
Preserve the original author's voice, style, and phrasing.
Maintain sentence structure and grammatical choices.
Do not compress, simplify, or improve upon the original text.

How to use metrics to verify compliance:

Metric Result	Match Quality	Action
BLEU_proxy ≥ 70%	✅ EXCELLENT MATCH	Submit with confidence
BLEU_proxy 50–69%	⚠️ GOOD MATCH	Review for minor rewording; accept if rubric allows
BLEU_proxy < 50%	❌ SOMEWHAT MATCH	Manual correction or reject; likely paraphrased

Example Calculation:

p1 = 80%   p2 = 60%   p3 = 40%   p4 = 20%

BLEU_proxy = (0.80 + 0.60 + 0.40 + 0.20) / 4 = 0.50 (50%)

This example would be classified as GOOD MATCH (BLEU_proxy in the 50–69% band), at the very bottom of that range. Any lower and it would drop to SOMEWHAT MATCH.
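
Applying the BLEU_proxy thresholds from the table in code might look like the sketch below. The function name is hypothetical, and the rounding step is an assumption added here: a score that is exactly 50% on paper can land a hair under 0.50 in floating-point arithmetic, which would otherwise flip the label at the boundary. Scores between 69% and 70% are treated as GOOD MATCH, which is also an assumption, since the table does not cover that gap.

```python
def match_quality(bleu_proxy):
    """Map a BLEU_proxy value in [0, 1] to a match-quality label."""
    # Assumption of this sketch: round away floating-point noise so that a
    # score sitting exactly on a threshold is classified consistently.
    score = round(bleu_proxy, 6)
    if score >= 0.70:
        return "EXCELLENT MATCH"
    if score >= 0.50:
        return "GOOD MATCH"
    return "SOMEWHAT MATCH"

# Values from the example calculation: p1..p4 = 80%, 60%, 40%, 20%.
precisions = [0.80, 0.60, 0.40, 0.20]
proxy = sum(precisions) / len(precisions)
print(match_quality(proxy))  # GOOD MATCH
```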

Red Flags to Watch For:

1-gram < 80%: LLM introduced new vocabulary. Possible hallucination.
High 1-gram, Low 4-gram: Same words used but reordered/compressed. Likely paraphrasing.
Multiple fields flagged as paraphrased (SOMEWHAT MATCH): LLM is not suitable for this task; consider rule-based extraction.
Values not in PDF: LLM hallucinated; definitely reject.
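
The per-field checks in this list could be automated along these lines. This is only a sketch: the function name is hypothetical, and the cut-offs for a "high" 1-gram (at least 80%) and a "low" 4-gram (below 50%) are assumptions, since the list above fixes only the 80% vocabulary threshold. A 1-gram precision of zero is used here as a stand-in for "value not in PDF".

```python
def red_flags(p1, p4, high=0.80, low=0.50):
    """Return the red flags raised by 1-gram and 4-gram precision (both 0..1)."""
    flags = []
    if p1 == 0.0:
        # No extracted word occurs in the source at all.
        flags.append("value not in PDF: reject")
    elif p1 < 0.80:
        flags.append("new vocabulary: possible hallucination")
    if p1 >= high and p4 < low:
        # 'high' and 'low' are assumed cut-offs, not values from the guide.
        flags.append("high 1-gram, low 4-gram: likely paraphrasing")
    return flags

print(red_flags(p1=1.0, p4=0.0))
print(red_flags(p1=0.75, p4=0.20))
```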

💡 Tips & Best Practices

For Best Results:

Use digital PDFs: Scanned PDFs may have OCR errors, leading to mismatches even if extraction is correct.
Review SOMEWHAT MATCH fields immediately: Don't submit extractions with low BLEU_proxy scores without manual review.
Check context: If a field has high 1-gram but low 4-gram, read the top "missing n-grams" to see what changed.
Start with the BLEU_proxy score: It determines the match status for each field. For assignment compliance, confirm it against 4-gram precision, the most direct measure of wording preservation.
Manual override: If the extraction is functionally correct but has low n-gram scores due to necessary abbreviation, you can manually approve it.

Understanding Missing & Extra N-Grams:

Missing N-Grams: Sequences present in the original PDF but absent from the extraction. Usually indicates:

Words/phrases were omitted (compression)
Synonyms were used (paraphrasing)
Word order was changed (restructuring)

Extra N-Grams: Sequences in the extraction not found in the original PDF. Usually indicates:

LLM added new content (hallucination or context blending)
Different phrasing was used (paraphrasing)
Formatting or punctuation was altered (usually minor)
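
Missing and extra n-grams amount to two set differences. The sketch below assumes whitespace tokenization and a hypothetical `ngram_diff` helper, and it compares against the source passage for the field rather than the whole PDF (otherwise nearly every n-gram in the document would count as missing).

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_diff(source, extracted, n):
    """Return (missing, extra) n-grams, each in order of first appearance."""
    src = ngrams(source.split(), n)
    ext = ngrams(extracted.split(), n)
    src_set, ext_set = set(src), set(ext)
    # dict.fromkeys removes duplicates while keeping the original order.
    missing = [g for g in dict.fromkeys(src) if g not in ext_set]
    extra = [g for g in dict.fromkeys(ext) if g not in src_set]
    return missing, extra

missing, extra = ngram_diff(
    "The board of directors meets quarterly",
    "The board meets quarterly",
    n=2,
)
print(missing)  # [('board', 'of'), ('of', 'directors'), ('directors', 'meets')]
print(extra)    # [('board', 'meets')]
```

Here the three missing pairs all surround the dropped words "of directors", and the single extra pair is the seam where the text was joined back together, which is the typical signature of compression.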

📚 Reference

Summary of All Metrics

Metric	What It Measures	Ideal Range	Relevance to Assignment
1-Gram	Word-level vocabulary match	≥ 85%	Detects hallucination
2-Gram	Word-pair sequence match	≥ 80%	Detects word reordering
3-Gram	Phrase-level match	≥ 80%	Detects phrase rewording
4-Gram	Sentence structure & idiom match	≥ 90%	PRIMARY COMPLIANCE INDICATOR

Common Questions

Q: Why is 4-gram the most important metric?
A: Because your assignment specifically requires "exact original wording, sentence structure, and phrasing." 4-gram precision directly measures whether sentence structure and multi-word phrases are preserved, making it the best single indicator of compliance.

Q: What if 1-gram is high but 4-gram is low?
A: This means the LLM used all the right words but rearranged them or rewrote the sentences. This is still a form of paraphrasing and does NOT meet the requirement. Reject or manually fix.

Q: Can I use extractions with "GOOD MATCH" status?
A: Only if your assignment allows minor rewording or if you manually verify the changes are acceptable. If the rubric demands exact wording, prefer "EXCELLENT MATCH".

Q: Why do I see 0% on some n-grams?
A: This typically means the value was significantly rewritten or is entirely different from the PDF. The LLM may have hallucinated or misunderstood the field. Reject and investigate.