RedactionBench

  1. A10 Networks, Inc.

* Corresponding author: sbrynjolfson@a10networks.com

Abstract

LLMs are increasingly being applied to sensitive domains that require redacting personally identifiable information (PII) before processing. While redacting PII has become a de facto data-cleaning prerequisite, existing benchmarks conflate the mechanics of extraction with the semantics of privacy. A phone number in a public directory is not equivalent to one in a medical record. Whether a given piece of information constitutes a violation depends heavily on who holds it, why, and in what context, which fundamentally differentiates the redaction task from simple entity recognition. Grounded in this principle of contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, with a majority seeded from real-world sources. RedactionBench also introduces a novel character-level redaction metric called R-Score that treats semantically similar redactions equally and nullifies the impact of shallow formatting choices (e.g., redacting a phone number as "(***) ***-****" vs. "**************"). Extensive evaluations across Named-Entity Recognition (NER) models, entity-extraction Small Language Models (SLMs), and frontier LLMs equipped with agentic tools (Claude Opus, OpenAI GPT) demonstrate that contextual redaction remains an unsolved problem. Results from our human evaluation (85 participants) on RedactionBench reveal a stark dichotomy in privacy perception: annotators show consensus with our target labels for mandatory redactions (89.4%) and safe text preservations (94.1%), but not for contextual redactions (47.7%). This variance demonstrates the subjective nature of contextual privacy and motivates our evaluation metric R-Score, which decouples contextual ambiguity from strict redaction precision. We compare 35 models across model families on RedactionBench and report their PII redaction performance.
Finally, we release RedactionBench publicly to establish a baseline for future privacy-preserving redaction systems. We hope this benchmark inspires a shift towards efficient model design and standardized evaluations for text redaction.
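
To make the formatting-invariance idea concrete, the following is a minimal sketch of a character-level comparison. It is not the R-Score definition, which this page does not spell out. It assumes length-preserving masking with `*`, assumes that only alphanumeric characters count toward the score, and uses a plain character-level F1. The function names (`masked_positions`, `gold_positions`, `char_f1`) and the example sentence are hypothetical.

```python
from typing import List, Set, Tuple


def masked_positions(original: str, redacted: str, mask_char: str = "*") -> Set[int]:
    """Indices of alphanumeric characters in `original` that `redacted` hides.

    Assumption of this sketch: masking is length-preserving, so position i in
    `redacted` corresponds to position i in `original`.
    """
    if len(original) != len(redacted):
        raise ValueError("this sketch assumes length-preserving masking")
    return {
        i
        for i, (o, r) in enumerate(zip(original, redacted))
        if o.isalnum() and r == mask_char
    }


def gold_positions(original: str, spans: List[Tuple[int, int]]) -> Set[int]:
    """Alphanumeric indices covered by gold (start, end) spans, end exclusive."""
    return {i for start, end in spans for i in range(start, end) if original[i].isalnum()}


def char_f1(pred: Set[int], gold: Set[int]) -> float:
    """Plain character-level F1 between predicted and gold masked positions."""
    if not pred and not gold:
        return 1.0  # nothing to redact and nothing redacted
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two formatting styles for the same redaction.
text = "Call (555) 123-4567 now"
styled = "Call (***) ***-**** now"
blunt = "Call ************** now"
gold = gold_positions(text, [(5, 19)])  # span covering "(555) 123-4567"

print(masked_positions(text, styled) == masked_positions(text, blunt))  # True
print(char_f1(masked_positions(text, styled), gold))  # 1.0
print(char_f1(masked_positions(text, blunt), gold))   # 1.0
```

Because punctuation and whitespace are ignored, both masking styles hide the same ten digits and receive the same score under this sketch. A partial redaction that leaves the area code visible would lose recall.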

Evaluations & Tables

Figure: Pareto plot of mean R-Score on RedactionBench against model size.
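
For readers unfamiliar with Pareto plots, a model lies on the Pareto front here when no other model is both no larger and higher scoring. The sketch below shows one common way to compute such a front. The function name `pareto_front` and the (name, size, score) tuples are hypothetical placeholders, not values from the benchmark, since model sizes are not listed on this page.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of points on the (smaller size, higher score) Pareto front.

    Each point is (name, size, score). Points are swept in order of increasing
    size, and a point is kept only if it strictly beats the best score seen so
    far among points that are no larger.
    """
    front: List[str] = []
    best_score = float("-inf")
    # Sort by size ascending; on equal size, the higher score comes first.
    for name, _size, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            front.append(name)
            best_score = score
    return front


# Placeholder inputs for illustration only (not benchmark data).
example = [
    ("model-a", 0.3e9, 0.40),
    ("model-b", 4e9, 0.36),   # larger than model-a but scores lower: dominated
    ("model-c", 7e9, 0.45),
    ("model-d", 400e9, 0.59),
]
print(pareto_front(example))  # ['model-a', 'model-c', 'model-d']
```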

Overall Model Leaderboard

| Rank | Model | Family | Mean R-Score |
|---|---|---|---|
| 1 | claude-opus-4-6 | Frontier LLMs | 0.714 |
| 2 | gpt-5.4 | Frontier LLMs | 0.659 |
| 3 | Qwen/Qwen3.5-397B-A17B | Frontier LLMs | 0.592 |
| 4 | openai/privacy-filter | OpenAI Privacy Filter | 0.578 |
| 5 | zai-org/GLM-5.1 | Frontier LLMs | 0.562 |
| 6 | gretel_gliner_bi_large_v1_0 | GLiNER | 0.472 |
| 7 | OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1 | DeBERTa-v3 | 0.459 |
| 8 | B2NER-InternLM2.5 | B2NER | 0.447 |
| 9 | nvidia_gliner_pii | GLiNER | 0.421 |
| 10 | B2NER-InternLM2.5-7B | B2NER | 0.402 |
| 11 | jakobhuss/pii-extractor-gemma-3-270m-it | SLMs / Extractors | 0.401 |
| 12 | hydroxai_pii_masker | DeBERTa-v3 | 0.382 |
| 13 | eternisai/Anonymizer-4B | SLMs / Extractors | 0.362 |
| 14 | E3-JSI/gliner-multi-pii-domains-v1 | GLiNER | 0.350 |
| 15 | iiiorg/piiranha-v1-detect-personal-information | RoBERTa / Other | 0.343 |
| 16 | distil-labs/Distil-PII-Llama-3.2-3B-Instruct | SLMs / Extractors | 0.337 |
| 17 | urchade/gliner_multi_pii-v1 | GLiNER | 0.329 |
| 18 | numind/NuExtract-2.0-2B | SLMs / Extractors | 0.317 |
| 19 | Universal-NER/UniNER-7B-all | SLMs / Extractors | 0.316 |
| 20 | numind/NuExtract-1.5-tiny | SLMs / Extractors | 0.308 |
| 21 | knowledgator/gliner-pii-base-v1.0 | GLiNER | 0.304 |
| 22 | numind/NuExtract-2.0-4B | SLMs / Extractors | 0.293 |
| 23 | ai4privacy/llama-english-anonymiser-openpii | ModernBERT (BIO) | 0.270 |
| 24 | h2oai/deberta_finetuned_pii † | DeBERTa-v3 | 0.250 |
| 25 | lakshyakh93/deberta_finetuned_pii † | DeBERTa-v3 | 0.250 |
| 26 | hivetrace/gliner-guard-uniencoder | GLiNER | 0.235 |
| 27 | hivetrace/gliner-guard-biencoder | GLiNER | 0.221 |
| 28 | Isotonic/distilbert_finetuned_ai4privacy_v2 | DistilBERT (BIO) | 0.216 |
| 29 | ai4privacy/llama-multilingual-categorical-anonymiser-openpii | ModernBERT (BIO) | 0.213 |
| 30 | urchade/gliner_multi-v2.1 | GLiNER | 0.209 |
| 31 | tanaos/tanaos-text-anonymizer-v1 | RoBERTa / Other | 0.202 |
| 32 | deepaksiloka/PII-Detection-V2.1 | DistilBERT (BIO) | 0.196 |
| 33 | distil-labs/Distil-PII-Llama-3.2-1B-Instruct | SLMs / Extractors | 0.175 |
| 34 | OpenPipe/PII-Redact-General | SLMs / Extractors | 0.113 |
| 35 | distil-labs/Distil-PII-gemma-3-270m-it | SLMs / Extractors | 0.095 |

† h2oai/deberta_finetuned_pii and lakshyakh93/deberta_finetuned_pii are mirrored uploads of the same checkpoint and produce identical scores.

BibTeX

@misc{brynjolfsson2026redactionbench,
  title         = {RedactionBench},
  author        = {Brynjólfsson, Sean and Jayakrishnan, Shashvat and Sali, Esha and Purwar, Diptanshu and Aggarwal, Madhav},
  year          = {2026},
  eprint        = {2606.18782},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  doi           = {10.48550/arXiv.2606.18782},
  url           = {https://arxiv.org/abs/2606.18782}
}