TL;DR: ETH Zurich's SRI Lab just published its second paper in two weeks. The first showed that prose AI context hurts performance. The second shows that semantic drift and context loss corrupt multilingual benchmarks. Same lab. Same disease. Same cure: structure over prose.

Previously: The Bloat Problem

In Part I, we covered the first paper from Professor Martin Vechev's Secure, Reliable, and Intelligent Systems Lab at ETH Zurich: “Evaluating AGENTS.md” (arXiv:2602.11988). The findings were damning — prose context files reduce performance by 3%, inflate costs by 20%, and encourage agents to over-explore.

That paper answered the question: does giving AI more prose context help?

The answer was no.

The Second Paper: Recovered in Translation

Two weeks later, the same lab publishes “Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets” (arXiv:2602.22207). Authors: Hanna Yukhymenko, Anton Alexandrov, and Martin Vechev.

Different patient. Same disease.

“The reliability of multilingual Large Language Model evaluation is currently compromised by the inconsistent quality of translated benchmarks.”

The paper identifies that when benchmarks are translated from one language to another, two things happen:

💨 Semantic Drift

Meaning shifts during translation. The question asks something subtly different in the target language. The benchmark measures the wrong thing.

🕳 Context Loss

Nuance evaporates. Cultural references, technical precision, task structure — all degraded. The evaluation produces misleading metrics.

Their solution: a structured pipeline that preserves “original task structure and linguistic nuances during localization.” They validated across eight languages with measurably better results.

The Pattern: Drift Is Universal

Step back and look at what the same lab found in two weeks:

|            | Paper 1: AGENTS.md             | Paper 2: Benchmarks            |
|------------|--------------------------------|--------------------------------|
| Domain     | AI coding context              | Multilingual evaluation        |
| Input      | Prose MD files                 | English benchmarks             |
| Disease    | Bloat, noise, over-exploration | Semantic drift, context loss   |
| Symptom    | -3% performance, +20% cost     | Misleading performance metrics |
| Root cause | Unstructured prose             | Unstructured translation       |
| Fix        | Structure over prose           | Structure over prose           |

The fix is the same both times: replace loose prose with structured pipelines that preserve meaning.

This isn't a coincidence. It's a law.

Semantic Drift Doesn't Care What It Corrupts

Translate a benchmark — meaning drifts. Write a CLAUDE.md by hand — meaning drifts. Copy context between sessions — meaning drifts. Tell an LLM to generate your project context — meaning drifts and performance drops. Every time humans or LLMs convert structured knowledge into prose, information degrades. The vector is the same: lossy conversion from structure to text.

Three Papers. One Roadmap.

The picture is now complete. Two papers diagnose. One prescribes.

Diagnosis 1: Evaluating AGENTS.md

Gloaguen, Mündler, Müller, Raychev & Vechev

“Prose context files reduce task success rates while increasing cost by 20%.”

arXiv:2602.11988

Diagnosis 2: Recovered in Translation

Yukhymenko, Alexandrov & Vechev

“Semantic drift and context loss produce misleading performance metrics.”

arXiv:2602.22207

Prescription: Format-Driven AI Context Architecture

Wolfe Harrison, J.

“IANA-registered structured format that enables user-controlled, portable AI context. 91% token reclaim. Zero drift.”

DOI: 10.5281/zenodo.18251362

How FAF Eliminates Drift

The “Recovered in Translation” paper solves benchmark drift with a structured pipeline. FAF solves AI context drift with a structured format. The engineering is parallel:

📋 Schema, Not Prose

YAML fields are unambiguous. tech_stack.languages: [TypeScript] can't drift into “we mostly use TS but sometimes JS.”
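As a minimal sketch of what such a schema-first file might look like: only tech_stack.languages appears in this article, so every other field name here is an illustrative assumption, not the real .faf schema.

```yaml
# Illustrative sketch only. Apart from tech_stack.languages,
# these field names are assumptions, not the actual .faf schema.
project:
  name: demo-app
  goal: Ship a web dashboard
tech_stack:
  languages: [TypeScript]   # a list, not a sentence: nothing to drift
  framework: React
```

A field either holds a value or it doesn't; there is no prose for meaning to hide in.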

🔄 One Source, Many Outputs

Define once in .faf, generate CLAUDE.md, AGENTS.md, .cursorrules, GEMINI.md. No manual translation between formats. No drift between copies.
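The one-source, many-outputs idea can be sketched in a few lines of Python. This is an illustration of deterministic generation, not FAF's actual implementation; the field names and output template are assumptions.

```python
# Sketch: deterministic generation of multiple AI context files from one
# structured source. Illustrative only; the field names and template are
# assumptions, not FAF's real schema or output format.

context = {
    "project": {"name": "demo-app", "goal": "Ship a web dashboard"},
    "tech_stack": {"languages": ["TypeScript"], "framework": "React"},
}

def render(target: str) -> str:
    """Render one target format from the single structured source."""
    p, t = context["project"], context["tech_stack"]
    body = (
        f"# {p['name']}\n"
        f"Goal: {p['goal']}\n"
        f"Languages: {', '.join(t['languages'])}\n"
        f"Framework: {t['framework']}\n"
    )
    return f"<!-- generated for {target}; do not hand-edit -->\n{body}"

# Every output derives from the same dict, so the copies cannot drift:
# the bodies are identical by construction.
outputs = {name: render(name) for name in ("CLAUDE.md", "AGENTS.md")}
assert outputs["CLAUDE.md"].splitlines()[1:] == outputs["AGENTS.md"].splitlines()[1:]
```

Regenerating is cheap and idempotent, so the generated files are disposable artifacts, like compiled output.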

📊 Scored, Not Hoped

AI-Readiness scoring catches degradation. If context quality drops, you know immediately. Benchmarks should be this honest.
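A toy version of such a score: the percentage of required fields that are present and non-empty. This is a hypothetical scoring rule for illustration, not FAF's actual AI-Readiness formula.

```python
# Sketch: a completeness score over required context fields.
# Hypothetical scoring rule, not FAF's actual AI-Readiness algorithm.

REQUIRED = ["project.name", "project.goal", "tech_stack.languages"]

def get(ctx: dict, dotted: str):
    """Walk a dotted path like 'project.name'; None if any hop is missing."""
    cur = ctx
    for key in dotted.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur

def readiness(ctx: dict) -> int:
    """Percentage of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED if get(ctx, field))
    return round(100 * filled / len(REQUIRED))

full = {"project": {"name": "demo", "goal": "ship"},
        "tech_stack": {"languages": ["TypeScript"]}}
degraded = {"project": {"name": "demo"}}  # goal and languages lost

assert readiness(full) == 100
assert readiness(degraded) < readiness(full)  # degradation shows up as a number
```

The point is that degradation becomes a number you can watch, not a vibe you discover later.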

🔒 User-Owned, Not Vendor-Locked

Local files. IANA-registered standard. No cloud dependency. The user controls their context the way they control their source code.

The Translation Parallel

Think about what faf bi-sync --all actually does:

Manual “Translation”

  • Project knowledge
  • Human writes CLAUDE.md (subjective, incomplete)
  • Copy-paste to AGENTS.md (drift begins)
  • Semantic drift: each file tells a different story

vs

FAF Pipeline

  • Project knowledge
  • .faf (structured YAML): single source of truth
  • faf bi-sync --all: deterministic generation
  • Zero drift: every format tells the same story

The parallel to benchmark translation is exact. The ETH Zurich team's solution — a structured pipeline that preserves task structure and nuance — is the same architecture FAF uses for AI context. One source. Deterministic outputs. No lossy prose conversion.

Two Labs, One Direction

The evidence base for structured AI context just doubled. In two weeks, the same ETH Zurich lab has shown, in two separate domains, that:

  • Prose degrades information
  • Semantic drift is measurable and harmful
  • Context loss produces misleading results
  • Structured pipelines fix all three

FAF has been shipping this fix since October 2025. IANA registered. Anthropic approved (MCP merge #2759). 33,500+ ecosystem downloads across npm, PyPI, and crates.io. The architecture isn't theoretical — it's infrastructure.

Note on benchmarks: The ETH Zurich papers measure coding agent task resolution (Paper 1) and multilingual benchmark reliability (Paper 2). FAF measures AI-Readiness — persistent project context scoring as defined in Anthropic's MCP ecosystem. Three complementary measurement systems, one shared enemy: drift.

Try FAF

npm i -g faf-cli
faf init
faf bi-sync --all

One source. Every AI format. Zero drift.

Read Part I: Beyond the Bloat for the full AGENTS.md analysis.

Questions? DM @wolfe_jam.

References

  1. Yukhymenko, H., Alexandrov, A., & Vechev, M. (2026). Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets. arXiv:2602.22207.
  2. Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv:2602.11988.
  3. Wolfe Harrison, J. (2026). Format-Driven AI Context Architecture: The .faf Standard for Persistent Project Understanding. DOI: 10.5281/zenodo.18251362.

Further Reading