TL;DR: ETH Zurich's SRI Lab just published its second paper in two weeks. The first showed that prose AI context hurts performance. The second shows that semantic drift and context loss corrupt multilingual benchmarks. Same lab. Same disease. Same cure: structure over prose.
Previously: The Bloat Problem
In Part I, we covered the first paper from Professor Martin Vechev's Secure, Reliable, and Intelligent Systems Lab at ETH Zurich: “Evaluating AGENTS.md” (arXiv:2602.11988). The findings were damning — prose context files reduce performance by 3%, inflate costs by 20%, and encourage agents to over-explore.
That paper answered the question: does giving AI more prose context help?
The answer was no.
The Second Paper: Recovered in Translation
Two weeks later, the same lab published “Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets” (arXiv:2602.22207). Authors: Hanna Yukhymenko, Anton Alexandrov, and Martin Vechev.
Different patient. Same disease.
“The reliability of multilingual Large Language Model evaluation is currently compromised by the inconsistent quality of translated benchmarks.”
The paper identifies that when benchmarks are translated from one language to another, two things happen:
Semantic Drift
Meaning shifts during translation. The question asks something subtly different in the target language. The benchmark measures the wrong thing.
Context Loss
Nuance evaporates. Cultural references, technical precision, task structure — all degraded. The evaluation produces misleading metrics.
Their solution: a structured pipeline that preserves “original task structure and linguistic nuances during localization.” They validated across eight languages with measurably better results.
The Pattern: Drift Is Universal
Step back and look at what the same lab found in two weeks:
| | Paper 1: AGENTS.md | Paper 2: Benchmarks |
|---|---|---|
| Domain | AI coding context | Multilingual evaluation |
| Input | Prose MD files | English benchmarks |
| Disease | Bloat, noise, over-exploration | Semantic drift, context loss |
| Symptom | -3% performance, +20% cost | Misleading performance metrics |
| Root cause | Unstructured prose | Unstructured translation |
| Fix | Structure over prose | Structure over prose |
The fix is the same both times: replace loose prose with structured pipelines that preserve meaning.
This isn't a coincidence. It's a pattern.
Semantic Drift Doesn't Care What It Corrupts
Translate a benchmark — meaning drifts. Write a CLAUDE.md by hand — meaning drifts. Copy context between sessions — meaning drifts. Tell an LLM to generate your project context — meaning drifts and performance drops. Every time humans or LLMs convert structured knowledge into prose, information degrades. The vector is the same: lossy conversion from structure to text.
Three Papers. One Roadmap.
The picture is now complete. Two papers diagnose. One prescribes.
Evaluating AGENTS.md
“Prose context files reduce task success rates while increasing cost by 20%.”
arXiv:2602.11988
Recovered in Translation
“Semantic drift and context loss produce misleading performance metrics.”
arXiv:2602.22207
Format-Driven AI Context Architecture
“IANA-registered structured format that enables user-controlled, portable AI context. 91% token reclaim. Zero drift.”
DOI: 10.5281/zenodo.18251362
How FAF Eliminates Drift
The “Recovered in Translation” paper solves benchmark drift with a structured pipeline. FAF solves AI context drift with a structured format. The engineering is parallel:
Schema, Not Prose
YAML fields are unambiguous. `tech_stack.languages: [TypeScript]` can't drift into “we mostly use TS but sometimes JS.”
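As a minimal sketch of what that looks like in practice: only `tech_stack.languages` comes from the example above; the surrounding field names are illustrative assumptions, not the actual `.faf` schema.

```yaml
# Illustrative .faf-style fragment. Only tech_stack.languages is taken
# from the text; the other fields are assumed for the sketch.
project:
  name: my-app
tech_stack:
  languages: [TypeScript]
  package_manager: npm
```

A field either holds `[TypeScript]` or it doesn't; there is no prose hedge for the value to drift into.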
One Source, Many Outputs
Define once in .faf, generate CLAUDE.md, AGENTS.md, .cursorrules, GEMINI.md. No manual translation between formats. No drift between copies.
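The generation step can be sketched in a few lines. This is a hypothetical illustration of the one-source-many-outputs idea, not the actual faf-cli implementation; the field names and the render template are assumptions.

```python
# Hypothetical sketch: regenerate every AI context file from one
# structured source, so the copies can never drift apart.
# Field names and output format are illustrative, not the real faf-cli.

CONTEXT = {
    "project": "my-app",
    "languages": ["TypeScript"],
    "commands": {"test": "npm test", "build": "npm run build"},
}

def render(context: dict) -> str:
    """Render the structured context into a prose context file."""
    lines = [f"# {context['project']}"]
    lines.append("Languages: " + ", ".join(context["languages"]))
    for name, cmd in context["commands"].items():
        lines.append(f"- {name}: `{cmd}`")
    return "\n".join(lines)

# One source, many targets: each file is a deterministic function of
# the same dict, so editing the dict updates every output at once.
for target in ["CLAUDE.md", "AGENTS.md", "GEMINI.md"]:
    print(f"--- {target} ---")
    print(render(CONTEXT))
```

Because every output is derived rather than hand-copied, a change to the source propagates everywhere in one step.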
Scored, Not Hoped
AI-Readiness scoring catches degradation. If context quality drops, you know immediately. Benchmarks should be this honest.
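The scoring idea can be illustrated with a toy metric: the fraction of expected fields that are present and non-empty. The real AI-Readiness scoring is more involved; this sketch, with assumed field names, only shows how structured context makes degradation measurable.

```python
# Hypothetical sketch of a context-quality score: percentage of
# required fields that are filled in. Field names are assumptions,
# not the actual FAF AI-Readiness metric.

REQUIRED_FIELDS = ["project", "languages", "commands", "description"]

def readiness_score(context: dict) -> int:
    """Return the percentage of required fields that are non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if context.get(f))
    return round(100 * filled / len(REQUIRED_FIELDS))

full = {"project": "my-app", "languages": ["TypeScript"],
        "commands": {"test": "npm test"}, "description": "demo"}
degraded = {"project": "my-app", "languages": []}

print(readiness_score(full))      # complete context scores 100
print(readiness_score(degraded))  # missing fields drop the score to 25
```

A prose context file offers no equivalent signal: you only discover the degradation when the agent misbehaves.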
User-Owned, Not Vendor-Locked
Local files. IANA-registered standard. No cloud dependency. The user controls their context the way they control their source code.
The Translation Parallel
Think about what `faf bi-sync --all` actually does:
Manual “Translation”
Hand-edit CLAUDE.md, AGENTS.md, and .cursorrules separately; every copy drifts a little further with each change.
FAF Pipeline
Edit the single .faf source and regenerate every output deterministically; nothing is hand-copied, so nothing drifts.
The parallel to benchmark translation is exact. The ETH Zurich team's solution — a structured pipeline that preserves task structure and nuance — is the same architecture FAF uses for AI context. One source. Deterministic outputs. No lossy prose conversion.
Two Labs, One Direction
The evidence base for structured AI context just doubled. In two weeks, the same ETH Zurich lab has confirmed, in two different domains, that:
- Prose degrades information
- Semantic drift is measurable and harmful
- Context loss produces misleading results
- Structured pipelines fix all three
FAF has been shipping this fix since October 2025. IANA registered. Anthropic approved (MCP merge #2759). 33,500+ ecosystem downloads across npm, PyPI, and crates.io. The architecture isn't theoretical — it's infrastructure.
Try FAF
```shell
npm i -g faf-cli
faf init
faf bi-sync --all
```

One source. Every AI format. Zero drift.
Read Part I: Beyond the Bloat for the full AGENTS.md analysis.
Questions? DM @wolfe_jam.
References
- Yukhymenko, H., Alexandrov, A., & Vechev, M. (2026). Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets. arXiv:2602.22207.
- Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv:2602.11988.
- Wolfe Harrison, J. (2026). Format-Driven AI Context Architecture: The .faf Standard for Persistent Project Understanding. DOI: 10.5281/zenodo.18251362.
Further Reading
- Beyond the Bloat (Part I) — Full analysis of the AGENTS.md paper
- Official IANA Media Type Registration for FAF
- FAF on GitHub