TL;DR: Two complementary, pivotal papers from early 2026 challenge how we think about AI context. The arXiv study shows that prose context files often hurt performance and inflate cost. The FAF paper provides the structured fix. Together they signal: ditch the junk drawer, embrace standards.

In the fast-evolving world of AI-assisted coding, context is king—or so we thought. The arXiv paper “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” (arXiv:2602.11988) delivers a wake-up call, revealing how popular context files like AGENTS.md often hinder rather than help. My Zenodo-archived work, “Format-Driven AI Context Architecture: The .faf Standard for Persistent Project Understanding” (DOI: 10.5281/zenodo.18251362), builds on this by proposing FAF—a user-owned, IANA-registered standard that eliminates bloat while ensuring persistent, portable understanding.

These papers aren't rivals — they're complementary. The arXiv study spotlights the problems, FAF provides the fix. Facts first, logic follows.

The Wake-Up Call: The arXiv Paper

Published February 12, 2026, this paper by researchers at ETH Zurich's Secure, Reliable, and Intelligent Systems Lab rigorously tests a widespread practice: using repository-level context files (e.g., AGENTS.md, CLAUDE.md) to guide coding agents.

Key Methodology

The authors introduce AGENTbench, a novel benchmark curating real-world issues from repositories that ship developer-committed context files. They complement it with SWE-bench Lite (established tasks from popular repos). Evaluations span multiple coding agents, with several LLMs used to generate the context files under test.

  • Experiments: Compare agent performance with/without context files. LLM-generated files use varied prompts/models; developer files are real-world samples.
  • Metrics: Task success rates, inference costs (tokens), behavioral traces (exploration, testing depth).
  • Scale: Tested on 100+ repos, multiple LLMs—robust and reproducible.
“Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.”

Core Findings

  • −3% (LLM-generated): performance drop from machine-written context files
  • +4% (human-written): marginal gain from developer-authored files
  • +20% (cost spike): inference cost increase from broader exploration
  • Bloat (root cause): unnecessary requirements make tasks harder

Limitations: Focuses on coding agents; doesn't test structured formats beyond MD. Strengths: Real-world data, agent trace analysis—a foundational critique that flips assumptions.

It validates what developers already feel — Theo Browne (t3.gg) dissected this paper publicly, spotlighting how AI crushes structured data like package.json but struggles with prose bloat. This sets the stage for better solutions.

Why Prose Is the Problem

Beyond bloat, Markdown context files are subjective, unvalidated, and potentially misleading. Agents follow them literally, even when they're wrong. The arXiv data bears this out: prose context is a liability in disguise. FAF replaces prose with structure. That's not a style choice; it's a common-sense safety decision.
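To make that concrete, here is the same convention expressed both ways. This is an illustrative sketch: the field names under `conventions` are hypothetical, invented for the example; only the `context` section itself appears in the published .faf schema outline.

```yaml
# Prose (AGENTS.md style), open to misreading:
#   "We use pnpm here. Please don't run npm install. Unit tests live
#    under tests/unit and should pass before you commit."
#
# The same guidance as structured .faf context. The keys under
# conventions below are illustrative, not taken from the spec:
context:
  conventions:
    package_manager: pnpm        # one unambiguous value, no prose to parse
    test_dir: tests/unit         # agents read a path, not a sentence
    pre_commit: unit tests must pass
```

A validator can score discrete fields like these; it has no reliable way to score a paragraph.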

The Solution Blueprint: The FAF Paper

My Zenodo paper (published January 15, 2026, on CERN's open platform) addresses the sovereignty gap in AI context. It proposes FAF as a portable, user-controlled format—think “package.json for your project's context.”

Key Methodology

Drawing from 27,000+ ecosystem downloads and 1,051+ tests, the paper validates FAF across platforms (Claude, OpenAI Codex, Google Gemini). It includes schema specs, performance benchmarks (e.g., 220x faster binary loading), and GDPR alignment analysis.

  • Structure: YAML-based with sections for mission, tech_stack, key_files, context (architecture, conventions).
  • Tools: faf-cli (41 commands), MCP server integration (Anthropic-approved merge #2759), Rust SDK.
  • Validation: Cross-platform scores (9.0-9.5/10), IANA registration (application/vnd.faf+yaml, Oct 2025).
  • Binary Companion: .fafb for efficient loading—32-byte header, priority sections for token budgets.
“We present FAF... an IANA-registered format that enables user-controlled, portable AI context. Like Solid pods for personal data, FAF files give users sovereignty over their AI context.”

Core Contributions

  • 🔒 User Ownership: local files invert vendor control; transparent, portable, no cloud dependency.
  • Performance: 91% token reclaim; files stay under 2KB. A scoring system ensures AI-readiness.
  • 🛡 Privacy: aligns with GDPR (access, portability, erasure); enables offline workflows.
  • 🌍 Ecosystem: 8 SDKs, Anthropic MCP merge (PR #2759), enterprise features.

FAF isn't just theory; it's live infrastructure enhancing AI tools today.

The Structure

Human-readable YAML as the single source of truth, branching to AI-specific outputs:

.faf (Project DNA):

  faf_version "2.5.0"
  project {name, mission}
  tech_stack {languages, frameworks}
  key_files [{path, purpose}]
  context {architecture, conventions}
  outputs {claude_md, agents_md, cursorrules, gemini_md}

Compiled outputs: CLAUDE.md, AGENTS.md, .cursorrules, GEMINI.md

One source. Four native formats. Zero drift.
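Filled in, a minimal .faf following that skeleton might look like the sketch below. The top-level sections mirror the outline above; the project values and the generated file paths under `outputs` are illustrative assumptions, not taken from the spec.

```yaml
faf_version: "2.5.0"

project:
  name: acme-api                 # illustrative project, not a real repo
  mission: REST API for the Acme storefront

tech_stack:
  languages: [typescript]
  frameworks: [fastify]

key_files:
  - path: src/server.ts
    purpose: HTTP entry point and route registration

context:
  architecture: modular monolith, one module per domain
  conventions: strict ESLint, conventional commits

outputs:                         # compiled, AI-specific expressions
  claude_md: CLAUDE.md
  agents_md: AGENTS.md
  cursorrules: .cursorrules
  gemini_md: GEMINI.md
```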

Head-to-Head: How FAF Elevates the arXiv Findings

The arXiv paper exposes the pitfalls of unstructured MD files. FAF directly addresses them. Here's how they complete each other:

Traditional MD Flow

  Repo → bloated MD (prose, subjective, unvalidated) → agent explores broadly (+20% cost, over-testing) → −3% performance (task harder, cost higher)

vs

FAF Standard Flow

  Repo → .faf Project DNA (structured, scored, validated) → lean outputs (agent focuses on essentials) → persistent context (minimal, portable, fast)
Aspect by aspect, arXiv critique vs FAF solution:

  • Bloat. arXiv: MD files add noise, confusing agents with generic docs. FAF: structured YAML fields enforce minimal, essential context; the compiler trims fluff and scoring flags low quality.
  • Success rates. arXiv: −3% to +4%, marginal at best; over-exploration hurts. FAF: priority loading fits token windows and focuses agents on essentials (empirically 6.7x faster responses).
  • Costs. arXiv: +20% inference from broader traces. FAF: binary .fafb parses 220x faster; minimal size (<1K tokens) cuts overhead.
  • Human vs LLM. arXiv: humans have a slight edge, but files are still risky. FAF: lean outputs auto-generated from .faf project DNA; human-verified timestamps ensure accountability.
  • Portability. arXiv: vendor-siloed (e.g., .cursorrules only works in Cursor). FAF: IANA-standard, cross-AI (Claude/Grok/Gemini); user-owned like Solid pods, no lock-in.
  • Exploration. arXiv: encourages over-testing, lowering success. FAF: contextual primitives guide without mandating; minimal requirements by design.

The arXiv authors urge that "human-written context files should describe only minimal requirements." FAF delivers exactly that: one .faf (Project DNA for any AI), multiple expressions (CLAUDE.md, AGENTS.md, .cursorrules, .windsurf); a playbook, not a rulebook.

Note: arXiv metrics measure coding agent task resolution rates. FAF metrics measure AI-readiness: persistent project context scoring as defined in Anthropic's MCP ecosystem (merge #2759). Complementary benchmarks, not direct comparisons.

The Future of AI Context

The arXiv paper is a foundational critique — credit to the authors for AGENTbench and their data-driven debunking. My FAF paper is the next chapter, enhancing those insights into infrastructure. Together, they signal: ditch the junk drawer, embrace standards.

Try FAF

npm i faf-cli
faf init
faf bi-sync --all

Cite these papers. Build on them. Make every repo AI-ready.

Questions? DM @wolfe_jam.

References

  1. Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv:2602.11988.
  2. Wolfe Harrison, J. (2026). Format-Driven AI Context Architecture: The .faf Standard for Persistent Project Understanding. DOI: 10.5281/zenodo.18251362.

Further Reading