Knows: Agent-Native Structured Research Representations
Pith reviewed 2026-05-10 06:46 UTC · model grok-4.3
The pith
A lightweight YAML sidecar lets small LLM agents extract accurate research details from papers with far less computation than reading PDFs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Knows supplies a thin YAML sidecar (KnowsRecord) that coexists with any PDF and binds machine-readable claims, evidence, provenance and verifiable relations to the paper, letting LLM agents consume task-relevant content directly without the cost and instability of full-document inference.
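The review does not reproduce the Knows schema, but the general shape of the proposal is easy to picture. The sketch below is a guess at what a KnowsRecord might contain; every field name is an illustrative assumption, not the published specification (the real schema lives at knows.academy).

```python
# Hypothetical sketch of a KnowsRecord sidecar. Field names are invented for
# illustration only and are not the official Knows schema.
import yaml  # PyYAML

SIDECAR_YAML = """
paper:
  title: "Example Paper Title"
  doi: "10.0000/example"
claims:
  - id: c1
    text: "Weak models improve on comprehension tasks when given structured input."
    evidence:
      - kind: table
        locator: "Table 2"
provenance:
  created_by: author-team
  created: 2026-01-15
relations:
  - subject: c1
    predicate: supported_by
    object: "Table 2"
"""

record = yaml.safe_load(SIDECAR_YAML)
print(record["claims"][0]["text"])  # agents read structured fields directly
```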
What carries the argument
The KnowsRecord, a YAML sidecar validated by a deterministic schema linter, which structures the paper's content for direct agent consumption while leaving the original PDF unchanged.
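As a rough illustration of how a deterministic lint could work, the sketch below validates a sidecar against a JSON-Schema-style definition and reports violations in a stable order. The schema subset and function are assumptions for illustration; the actual Knows linter and schema are published with the specification and will differ in detail.

```python
# Minimal sketch of a deterministic schema lint for a YAML sidecar, assuming a
# JSON-Schema-style definition. Illustrative only; not the official Knows tool.
import yaml
from jsonschema import Draft202012Validator

KNOWS_SCHEMA_SUBSET = {
    "type": "object",
    "required": ["paper", "claims"],
    "properties": {
        "paper": {"type": "object", "required": ["title"]},
        "claims": {
            "type": "array",
            "items": {"type": "object", "required": ["id", "text"]},
        },
    },
}

def lint_sidecar(path: str) -> list[str]:
    """Return schema violations for one sidecar file, in a stable sorted order."""
    with open(path, encoding="utf-8") as f:
        record = yaml.safe_load(f)
    validator = Draft202012Validator(KNOWS_SCHEMA_SUBSET)
    return sorted("/".join(map(str, e.path)) + ": " + e.message
                  for e in validator.iter_errors(record))
```

Sorting the error messages is what makes the check deterministic: the same sidecar always produces the same lint output, independent of validator iteration order.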
If this is right
- Weak models can handle research comprehension tasks at accuracy levels previously requiring much larger models.
- Agent workflows that process many papers become cheaper and more stable because token counts fall sharply.
- Research can be distributed in dual form: human-readable PDF plus machine-readable sidecar without changing publication practices.
- Hybrid agent pipelines can fall back to the PDF only when the sidecar lacks needed detail (a minimal sketch of this pattern follows below).
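The fallback pattern referenced in the last item could look roughly like this; the routing heuristic and helper names are assumptions made for illustration, not the paper's pipeline, and the PDF path is a placeholder for a full-document LLM call.

```python
# Sketch of a hybrid sidecar-first pipeline with PDF fallback. All names and
# the lookup heuristic are hypothetical; nothing here is the paper's API.
def answer_from_sidecar(question: str, sidecar: dict) -> str | None:
    """Return a recorded claim if it appears to cover the question, else None."""
    key = question.lower().rstrip("?").split()[-1]
    for claim in sidecar.get("claims", []):
        if key in claim.get("text", "").lower():
            return claim["text"]
    return None  # the sidecar lacks the needed detail

def answer_from_pdf(question: str, pdf_text: str) -> str:
    """Placeholder for expensive full-document LLM inference."""
    raise NotImplementedError("call an LLM over the full PDF text here")

def hybrid_answer(question: str, sidecar: dict, pdf_text: str) -> str:
    hit = answer_from_sidecar(question, sidecar)
    if hit is not None:
        return hit  # cheap path: structured fields only
    return answer_from_pdf(question, pdf_text)  # expensive fallback to the PDF
```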
Where Pith is reading between the lines
- If sidecar creation tools improve, the format could become the default machine interface for new papers.
- The same sidecar approach might apply to other long documents such as patents or technical reports.
- Automated generation of sidecars from existing PDFs could create a large public dataset for training better research agents.
Load-bearing premise
That accurate structured sidecar content can be created at scale for arbitrary papers without errors or omissions that would mislead agents on real tasks.
What would settle it
Generate sidecars for a fresh set of papers, then measure whether agent accuracy on questions requiring details not explicitly listed in the sidecar drops below the PDF baseline.
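One way to operationalize that test, sketched with hypothetical field names (none of this comes from the paper): score each question under both conditions, restrict to the questions whose required detail was never encoded in the sidecar, and compare accuracies.

```python
# Sketch of the settling experiment under stated assumptions: `rows` is a
# hypothetical list of per-question records with a "condition" field
# ("sidecar" or "pdf"), an "in_sidecar" flag marking whether the needed detail
# was encoded, and a boolean "correct".
def accuracy(rows: list[dict], condition: str) -> float:
    subset = [r for r in rows if r["condition"] == condition and not r["in_sidecar"]]
    return sum(r["correct"] for r in subset) / max(len(subset), 1)

def sidecar_shortfall(rows: list[dict]) -> float:
    """Positive values mean sidecar accuracy drops below the PDF baseline on
    questions whose answers were never encoded in the sidecar."""
    return accuracy(rows, "pdf") - accuracy(rows, "sidecar")
```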
Original abstract
Research artifacts are distributed primarily as reader-oriented documents like PDFs. This creates a bottleneck for increasingly agent-assisted and agent-native research workflows, in which LLM agents need to infer fine-grained, task-relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requiring no changes to the publication itself, and validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF-only, sidecar-only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 percentage points) when reading sidecar instead of PDF, while consuming 29–86% fewer input tokens; an LLM-as-judge re-scoring confirms that weak-model sidecar accuracy (75–77%) approaches stronger-model PDF accuracy (78–83%). Beyond this controlled evaluation, a community sidecar hub at https://knows.academy/ has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption-ready at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Knows, a lightweight YAML sidecar (KnowsRecord) specification that attaches structured claims, evidence, provenance, and verifiable relations to existing research PDFs without modifying the original artifact. It evaluates the format on 140 comprehension questions drawn from 20 papers across 14 disciplines, comparing PDF-only, sidecar-only, and hybrid inputs across six LLM agents of varying sizes. The central empirical result is that weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 pp) when using sidecars instead of PDFs while consuming 29–86% fewer tokens; an LLM-as-judge re-scoring shows weak-model sidecar performance approaching stronger-model PDF performance. The paper also reports a community hub at knows.academy that has indexed over 10,000 publications.
Significance. If sidecar creation can be shown to be accurate and scalable without systematic omissions or factual drift, the approach would meaningfully lower the cost and instability of agent-native research workflows. The reported token reductions and accuracy lifts for small models are practically relevant, and the observed community adoption supplies independent evidence of format viability. However, the evaluation's dependence on unaudited sidecar fidelity limits the strength of the claims until that assumption is tested.
major comments (2)
- [Evaluation] Evaluation section: the paper provides no description of how the KnowsRecords for the 20-paper, 140-question test set were generated (manual, LLM-assisted, or hybrid), nor any inter-annotator agreement, omission-rate, or factual-drift measurements against the source PDFs. Because the headline accuracy gains (+29 to +42 pp) and token savings are measured only under the assumption that these sidecars faithfully encode all task-relevant content, the absence of such validation is load-bearing for the central empirical claim.
- [Community Hub / Discussion] Community hub paragraph: the report of >10k indexed publications is presented as evidence of adoption-readiness, yet no sampling, accuracy audit, or downstream-task verification of the community-contributed KnowsRecords is supplied. Without such data it is impossible to assess whether the format maintains fidelity when applied at scale to arbitrary papers.
minor comments (2)
- [Abstract] The abstract states that an LLM-as-judge re-scoring was performed but does not specify the judge model, prompt template, or agreement metric with human judgments; adding these details would improve reproducibility.
- [Methods] The YAML schema is said to be validated by a deterministic linter, but the paper does not indicate whether the linter was run on the evaluation sidecars or only on the schema definition itself.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. The comments highlight important aspects of reproducibility and scalability that strengthen the manuscript. We address each major comment below, indicating where revisions will be made to incorporate additional methodological details and caveats.
Point-by-point responses
- Referee: Evaluation section: the paper provides no description of how the KnowsRecords for the 20-paper, 140-question test set were generated (manual, LLM-assisted, or hybrid), nor any inter-annotator agreement, omission-rate, or factual-drift measurements against the source PDFs. Because the headline accuracy gains (+29 to +42 pp) and token savings are measured only under the assumption that these sidecars faithfully encode all task-relevant content, the absence of such validation is load-bearing for the central empirical claim.
  Authors: We agree that explicit documentation of sidecar creation is necessary to support the empirical claims. The 20 KnowsRecords were generated manually by the authors through direct extraction of claims, evidence, and relations from the source PDFs using the Knows specification; no LLM assistance was used for the test set. We will add a dedicated subsection to the Evaluation section describing the extraction protocol, including how completeness was targeted and cross-checked within the team. We did not compute formal inter-annotator agreement metrics because creation was performed by a small expert team with internal review rather than independent annotators. We will explicitly note this limitation and its implications for the results. Factual drift was minimized by restricting content to verbatim or near-verbatim extractions without summarization or external inference; we will add a short statement to this effect. Revision: yes.
- Referee: Community hub paragraph: the report of >10k indexed publications is presented as evidence of adoption-readiness, yet no sampling, accuracy audit, or downstream-task verification of the community-contributed KnowsRecords is supplied. Without such data it is impossible to assess whether the format maintains fidelity when applied at scale to arbitrary papers.
  Authors: We accept that the community adoption figure alone does not constitute a fidelity audit. The >10,000 records are contributed by users through the public hub and validated only by the deterministic schema linter; no systematic sampling or downstream-task verification has been performed by the authors. We will revise the relevant paragraph in the Discussion to present the hub statistics strictly as evidence of format interest and uptake rather than proven scalability of content quality. We will also add an explicit statement that large-scale fidelity audits remain future work and invite community contributions toward that goal. Revision: partial.
Circularity Check
No significant circularity; claims rest on independent empirical evaluation
Full rationale
The paper presents an empirical study measuring LLM accuracy and token usage across PDF-only, sidecar-only, and hybrid conditions on 140 questions from 20 papers. No mathematical derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The deterministic schema linter validates format only and does not enter the performance measurements. Community adoption at knows.academy is cited as external evidence of format uptake, not as justification for the accuracy deltas. The evaluation therefore rests on direct empirical measurement rather than on circular self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM agents can extract task-relevant information more reliably from structured YAML than from raw PDF text.
invented entities (1)
- KnowsRecord: no independent evidence
Reference graph
Works this paper leans on
- [1] Stanford University and BroadAI, "Agents4Science: The first open conference of AI agents, co-led by Stanford and BroadAI," https://agents4science.stanford.edu/, 2025. Conference where AI agents are the primary authors, reviewers, and presenters of research contributions.
- [2] P. Groth, A. Gibson, and J. Velterop, "The anatomy of a nanopublication," Information Services and Use, vol. 30, no. 1-2, pp. 51-56, 2010.
- [3] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D'Souza, G. Kismihók, M. Stocker, and S. Auer, "Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge," in Proceedings of the 10th International Conference on Knowledge Capture (K-CAP). ACM, 2019, pp. 243-246.
- [4] J. Miao, J. R. Davis, Y. Zhang, J. K. Pritchard, and J. Zou, "Paper2Agent: Reimagining research papers as interactive and reliable AI agents," arXiv preprint arXiv:2509.06917, 2025. https://arxiv.org/abs/2509.06917
- [5] R. Pugliese, G. Kourousias, F. Venier, and G. Garlatti Costa, "Agentic publications: An LLM-driven framework for interactive scientific publishing, supplementing traditional papers with AI-powered knowledge systems," arXiv preprint arXiv:2505.13246, 2025. https://arxiv.org/abs/2505.13246
- [6] S. Schmidgall and M. Moor, "AgentRxiv: Towards collaborative autonomous research," arXiv preprint arXiv:2503.18102, 2025. https://arxiv.org/abs/2503.18102
- [7] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, "The AI Scientist: Towards fully automated open-ended scientific discovery," arXiv preprint arXiv:2408.06292, 2024.
- [8] NovelSeek Team, "NovelSeek: When agent becomes the scientist — building closed-loop system from hypothesis to verification," arXiv preprint arXiv:2505.16938, 2025.
- [9] J. Yuan, X. Yan, B. Shi et al., "Dolphin: Closed-loop open-ended auto-research through thinking, practice, and feedback," arXiv preprint arXiv:2501.03916, 2025.
- [10] Allen Institute for AI, "CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation," https://github.com/allenai/codescientist, 2024. GitHub repository.
- [11] Kishony Lab, "Data-to-paper: AI-driven research and documentation," https://github.com/Technion-Kishony-lab/data-to-paper, 2024. GitHub repository.
- [12] EvoScientist Team, "EvoScientist: Evolutionary scientific discovery platform," https://github.com/EvoScientist/EvoScientist, 2025. GitHub repository.
- [13] Baby-AIGS Team, "Toward automated scientific discovery: A survey on artificial intelligence generated science," arXiv preprint arXiv:2411.11910, 2024.
- [14] InternScience Team, "ResearchClawBench: A benchmark for autonomous research agents," https://github.com/InternScience/ResearchClawBench, 2025. GitHub repository.
- [15] Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu, "SoK: Agentic skills – beyond tool use in LLM agents," 2026. https://arxiv.org/abs/2602.20867
- [16] S. Chen, Q. Wang, G. Yu, X. Wang, and L. Zhu, "Clawed and dangerous: Can we trust open agentic systems?" 2026. https://arxiv.org/abs/2603.26221