pith. sign in

arxiv: 2605.17281 · v1 · pith:IYJKHMX7new · submitted 2026-05-17 · 💻 cs.SE · cs.AI

ContractBench: Can LLM Agents Preserve Observation Contracts?

Pith reviewed 2026-05-19 22:56 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agentsobservation contractstool useAPI compliancebenchmarkvalidity failuresintegrity failuresscaling behavior
0
0 comments X p. Extension
pith:IYJKHMX7 Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{IYJKHMX7}

Prints a linked pith:IYJKHMX7 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

LLM agents must preserve observation contracts like tokens and presigned URLs, yet current models routinely fail at this separate capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ContractBench to test whether tool-augmented LLM agents keep observation contracts intact, meaning they respect time limits and exact byte content of API outputs such as session tokens or OAuth parameters. It evaluates 38 models across 33 dual-axis tasks that separately check for validity failures (using expired artifacts) and integrity failures (corrupting bytes during processing). No model reaches 80 percent success, with the highest score at 77.8 percent, and the results show that general tool-use skill does not ensure contract compliance. Certain scaling patterns and post-training steps can even reduce performance, while labeling failures from real API rules can serve as an in-context signal to improve results.

Core claim

Observation contract compliance is an emergent, regression-prone capability that is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. ContractBench measures this through 33 tasks that probe validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes), with deterministic evaluation via a virtual clock and SHA-256 hashes. Across 38 models, none exceed 80 percent accuracy and some families show sharp cliffs or non-monotonic drops tied to agentic post-training.

What carries the argument

ContractBench, a benchmark of 33 dual-axis tasks that separately test temporal validity and byte-level integrity of observation contracts using a virtual clock and hash verification.

If this is right

  • General tool-use training is insufficient and targeted restraint training is needed to avoid using artifacts after expiry or altering their content.
  • Within-model-family scaling can produce sudden jumps in compliance once a size threshold allows mid-trajectory checking rather than immediate action.
  • Agentic post-training that emphasizes helpfulness can introduce sycophancy-driven regressions that lower contract adherence.
  • Feeding the failure taxonomy back as an in-context reward signal raises performance on previously failed cases by several points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real deployments may need explicit contract-tracking layers outside the model to catch validity and integrity errors the LLM misses.
  • The observed within-family cliffs suggest that future scaling studies should measure contract compliance as a distinct axis alongside tool competence.
  • Extending the benchmark to multi-step agent trajectories with chained contracts could reveal whether current failures compound over longer sessions.

Load-bearing premise

The 33 dual-axis tasks and their failure labels drawn from real-world API specifications sufficiently capture the observation-contract compliance problem that arises in deployed tool-augmented agents.

What would settle it

A production agent that scores above 80 percent on ContractBench yet frequently uses expired presigned URLs or corrupted tokens in live API calls would show the benchmark does not track the real capability.

Figures

Figures reproduced from arXiv: 2605.17281 by Arkaprava De, Hanwen Xing, Hao Chen, Jicheng Wang, Yifeng He, Zili Wang.

Figure 1
Figure 1. Figure 1: CONTRACTBENCH is the first deterministic benchmark for LLM-agent observation-contract compliance, spanning 33 tasks across two orthogonal axes (temporal validity and byte-level integrity) and eight real-world API contract domains. 2 Observation Contracts Production agents repeatedly hit the same class of failure: a tool returns an artifact (an intermediate value such as a presigned URL, a signed token, or … view at source ↗
Figure 2
Figure 2. Figure 2: Two orthogonal predicates define an ob￾servation contract. Integrity (SHA-256(payload) = expected) and validity (tfetch ≤tissue+τ ) must both hold (Definition 2.2); the four quadrants instantiate Propo￾sition 1. Quadrant area encodes CONTRACTBENCH’s task distribution. Two-level taxonomy. We organize tasks using a two-level taxonomy, as depicted in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Within-family scaling on CONTRACTBENCH: at fixed parameter count, the Base→Instruct gap (post-training delta) is family-specific, with Qwen 3.5 reaching 56.6 % already at 9 B-Instruct. 4.3 Post-Training Regression: A V-Shape in GPT-5 In Finding 3 we observe that contract compliance emerges from post-training. Holding the pretrained base fixed, we further investigate the converse question: how does scaling … view at source ↗
Figure 4
Figure 4. Figure 4: A V-shape performance within the GPT-5 family with shared pretrained base: a structured, [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: reproduces the four files that make up a single CONTRACTBENCH task, using harbor/tasks/presigned-url-download/ as a representative example. The prose summary is in [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ). The variance is not measurement noise: in every variable cell, the variance comes from genuine stochasticity in the agent’s strategy under tight TTL or rate-limit pressure, not from validator non-determinism (the virtual clock + SHA-256 trace hash hold all else fixed) [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-task pass-rate heatmap: frontier proprietary models × 33 tasks. The Hybrid block reveals the universally-hard tasks (e.g., multi-turn-recall is red across every row). 24 [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-task pass-rate heatmap: open-source SOTA × 33 tasks. Same color scale and family grouping as Appendix [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Capability ladder of failure-mode profiles, ordered by SR. The dominant failure type shifts [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Full per-model failure-label profile (stacked bars, all 15 complete- or near-complete [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗
read the original abstract

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ContractBench, a benchmark of 33 dual-axis tasks that evaluate LLM agents on preserving observation contracts (artifacts such as presigned URLs, tokens, and OAuth parameters whose later use is constrained by temporal validity and byte integrity). Using deterministic evaluation with a virtual clock and SHA-256 checks, the authors test 38 models and report that no model exceeds 80% compliance (Claude-Opus-4.6 leads at 77.8%), identify within-family scaling cliffs (e.g., Qwen 3.5) and non-monotonic behavior (GPT-5 family), and show that their failure taxonomy can serve as an in-context reward yielding +7.1 pp improvement.

Significance. If the 33 tasks validly isolate observation-contract compliance from general agent robustness issues, the work would usefully document a gap in current frontier models for realistic tool-augmented deployment. The deterministic, programmatic evaluation protocol (virtual clock plus hash verification) and the demonstration that the taxonomy functions as an actionable reward signal are concrete strengths that support reproducibility and potential follow-on work.

major comments (3)
  1. [§3] §3 (Benchmark Construction): The claim that the 33 tasks cleanly probe validity and integrity failures rests on the assertion that failure labels are drawn from real-world API specifications, yet the manuscript provides no coverage analysis, inter-annotator validation, or explicit check that the dual-axis design separates contract-specific errors from confounding factors such as multi-turn state tracking or general planning failures.
  2. [§4.2] §4.2 (Prompting and Contract Conveyance): The evaluation protocol is described as deterministic, but the paper does not report the exact prompting templates used to communicate contract constraints (expiry times, integrity requirements) to the agents; without this, it is unclear whether observed failures reflect inability to preserve contracts or simply failure to receive or parse the constraints in the first place.
  3. [§5.3] §5.3 (Non-monotonic Scaling Claim): The attribution of GPT-5 family regression to sycophancy-driven erosion of compliance is load-bearing for the 'regression-prone' characterization, yet the manuscript supplies only aggregate scores rather than paired before/after examples or ablation showing that post-training specifically increases contract violations while preserving other tool-use metrics.
minor comments (2)
  1. [Figures/Tables] Figure 2 and Table 4: axis labels and legend entries use inconsistent abbreviations for model families; expanding them would improve readability.
  2. [§2] §2 (Related Work): The discussion of prior agent benchmarks could more explicitly contrast ContractBench's focus on observation contracts with existing tool-use suites that emphasize planning or API selection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The claim that the 33 tasks cleanly probe validity and integrity failures rests on the assertion that failure labels are drawn from real-world API specifications, yet the manuscript provides no coverage analysis, inter-annotator validation, or explicit check that the dual-axis design separates contract-specific errors from confounding factors such as multi-turn state tracking or general planning failures.

    Authors: We acknowledge the value of additional documentation on task construction. The 33 tasks were derived by enumerating common observation-contract patterns from public API specifications (e.g., AWS S3 presigned URLs, OAuth 2.0 state parameters, and time-limited session tokens). In the revision we will add a dedicated subsection to §3 that (i) lists the specific API sources and the coverage they provide across validity and integrity axes, (ii) describes the programmatic generation process that enforces the dual-axis separation, and (iii) reports results on a set of control tasks without contract constraints to demonstrate that general planning or state-tracking failures do not dominate the measured outcomes. Because the failure labels are assigned by direct reference to the cited specifications rather than by human annotation, inter-annotator agreement statistics are not applicable; we will explicitly note this design choice. revision: yes

  2. Referee: [§4.2] §4.2 (Prompting and Contract Conveyance): The evaluation protocol is described as deterministic, but the paper does not report the exact prompting templates used to communicate contract constraints (expiry times, integrity requirements) to the agents; without this, it is unclear whether observed failures reflect inability to preserve contracts or simply failure to receive or parse the constraints in the first place.

    Authors: We agree that the precise wording used to convey constraints is necessary for full reproducibility. Although the downstream evaluation (virtual clock and SHA-256 verification) is deterministic, the prompt-level communication of expiry and integrity rules can influence agent behavior. In the revised manuscript we will include the complete prompting templates in an appendix, showing the exact phrasing for temporal constraints (“use this artifact before timestamp T”) and integrity constraints (“do not alter any byte of the provided value”). This addition will allow readers to confirm that the constraints were explicitly and consistently stated. revision: yes

  3. Referee: [§5.3] §5.3 (Non-monotonic Scaling Claim): The attribution of GPT-5 family regression to sycophancy-driven erosion of compliance is load-bearing for the 'regression-prone' characterization, yet the manuscript supplies only aggregate scores rather than paired before/after examples or ablation showing that post-training specifically increases contract violations while preserving other tool-use metrics.

    Authors: The non-monotonic pattern is directly visible in the per-variant aggregate scores we report. We interpret the regression as consistent with sycophancy effects documented in the post-training literature for agentic models. While we cannot release proprietary intermediate checkpoints for a controlled ablation, we will add representative failure traces from the GPT-5 family evaluations to §5.3, illustrating the specific contract violations that increase after the post-training stage. These qualitative examples, together with the quantitative drop, provide concrete support for the regression-prone characterization. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

full rationale

This paper is a pure empirical benchmark study that constructs 33 tasks from real-world API specifications and measures model compliance via deterministic programmatic checks (virtual clock for validity, SHA-256 for integrity). The reported scores and four findings are obtained by direct execution on 38 models; no equations, fitted parameters, self-referential predictions, or derivations are present that could reduce to the inputs by construction. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes, rendering the evaluation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark without introducing new free parameters, axioms, or postulated entities; it relies on standard practices in LLM evaluation and real-world API specifications.

pith-pipeline@v0.9.0 · 5864 in / 1086 out tokens · 30510 ms · 2026-05-19T22:56:39.462455+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    2026 , eprint =

    Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation , author =. 2026 , eprint =

  2. [2]

    2026 , eprint =

    Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception , author =. 2026 , eprint =

  3. [3]

    and Van Horn, Earl C

    Dennis, Jack B. and Van Horn, Earl C. , title =. Commun. ACM , month = mar, pages =. 1966 , issue_date =. doi:10.1145/365230.365252 , abstract =

  4. [4]

    2024 , url =

    Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan , booktitle =. 2024 , url =

  5. [5]

    AgentBench: Evaluating

    Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang , booktitle =. AgentBench: Evalua...

  6. [6]

    Transactions on Machine Learning Research , issn =

    Inverse Scaling: When Bigger Isn't Better , author =. Transactions on Machine Learning Research , issn =. 2023 , url =

  7. [7]

    Transactions of the Association for Computational Linguistics , volume =

    Lost in the Middle: How Language Models Use Long Contexts , author =. Transactions of the Association for Computational Linguistics , volume =. 2024 , address =. doi:10.1162/tacl_a_00638 , pages =

  8. [8]

    Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Lauren Hong and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun , booktitle =. Tool. 2024 , url =

  9. [9]

    Thirty-seventh Conference on Neural Information Processing Systems , year =

    Toolformer: Language Models Can Teach Themselves to Use Tools , author =. Thirty-seventh Conference on Neural Information Processing Systems , year =

  10. [10]

    2025 , eprint =

    Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination , author =. 2025 , eprint =

  11. [11]

    Transactions on Machine Learning Research , issn =

    Emergent Abilities of Large Language Models , author =. Transactions on Machine Learning Research , issn =. 2022 , url =

  12. [12]

    2026 , eprint =

    Real-Time Deadlines Reveal Temporal Awareness Failures in LLM Strategic Dialogues , author =. 2026 , eprint =

  13. [13]

    2024 , booktitle =

    Xie, Jian and Zhang, Kai and Chen, Jiangjie and Zhu, Tinghui and Lou, Renze and Tian, Yuandong and Xiao, Yanghua and Su, Yu , title =. 2024 , booktitle =

  14. [14]

    2023 , url =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , url =

  15. [15]

    2024 , eprint =

    NATURAL PLAN: Benchmarking LLMs on Natural Language Planning , author =. 2024 , eprint =

  16. [16]

    2025 , eprint =

    Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning , author =. 2025 , eprint =

  17. [17]

    The Twelfth International Conference on Learning Representations , year =

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author =. The Twelfth International Conference on Learning Representations , year =

  18. [18]

    2025 , note =

    Measuring what Matters: Construct Validity in Large Language Model Benchmarks , author =. 2025 , note =. 2511.04703 , archiveprefix =

  19. [19]

    2025 , booktitle =

    Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity , author =. 2025 , booktitle =. 2503.10694 , archiveprefix =

  20. [20]

    2025 , url =

    Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik R Narasimhan , booktitle =. 2025 , url =

  21. [21]

    The Fourteenth International Conference on Learning Representations , year =

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author =. The Fourteenth International Conference on Learning Representations , year =

  22. [22]

    2026 , eprint =

    ClawBench: Can AI Agents Complete Everyday Online Tasks? , author =. 2026 , eprint =

  23. [23]

    Security of AI Agents , year =

    He, Yifeng and Wang, Ethan and Rong, Yuyang and Cheng, Zifei and Chen, Hao , booktitle =. Security of AI Agents , year =

  24. [24]

    and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , title =

    Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  25. [25]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , year =

    Shinn, Noah and Cassano, Federico and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , year =

  26. [26]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , year =

    Madaan, Aman and Tandon, Niket and Gupta, Prakhar and Hallinan, Skyler and Gao, Luyu and Wiegreffe, Sarah and Alon, Uri and Dziri, Nouha and Prabhumoye, Shrimai and Yang, Yiming and Gupta, Shashank and Majumder, Bodhisattwa Prasad and Hermann, Katherine and Welleck, Sean and Yazdanbakhsh, Amir and Clark, Peter , title =. Proceedings of the 37th Internatio...

  27. [27]

    and Hardt, D

    Jones, M. and Hardt, D. , title =. 2012 , month = oct, url =

  28. [28]

    and Fielding, R

    Nottingham, M. and Fielding, R. , title =. 2012 , month = apr, url =

  29. [29]

    and Nottingham, M

    Fielding, R. and Nottingham, M. and Reschke, J. , title =. 2022 , month = jun, url =

  30. [30]

    and Reschke, J

    Fielding, R. and Reschke, J. , title =. 2014 , month = jun, url =

  31. [31]

    , title =

    Hardt, D. , title =. 2012 , month = oct, url =

  32. [32]

    and Fielding, R

    Berners-Lee, T. and Fielding, R. and Masinter, L. , title =. 2005 , month = jan, url =

  33. [33]

    and Richer, J

    Backman, A. and Richer, J. and Sporny, M. , title =. 2024 , month = feb, url =

  34. [34]

    and Pardue, L

    Polli, R. and Pardue, L. , title =. 2024 , month = feb, url =

  35. [35]

    2015 , month = aug, url =

    Secure Hash Standard (. 2015 , month = aug, url =

  36. [36]

    Proceedings of the 1987

    Garcia-Molina, Hector and Salem, Kenneth , title =. Proceedings of the 1987. 1987 , pages =. doi:10.1145/38713.38742 , url =

  37. [37]

    2020 , url =

    Krakovna, Victoria and Uesato, Jonathan and Mikulik, Vladimir and Rahtz, Matthew and Everitt, Tom and Kumar, Ramana and Kenton, Zac and Leike, Jan and Legg, Shane , title =. 2020 , url =

  38. [38]

    2025 , month = nov, url =

  39. [39]

    2025 , month = nov, url =

    Introducing. 2025 , month = nov, url =

  40. [40]

    2026 , month = apr, url =

    Introducing. 2026 , month = apr, url =

  41. [41]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li and Wenbo Chen and Yimin Liu and Shenghan Zheng and Xiaokun Chen and Yifeng He and Yubo Li and Bingran You and Haotian Shen and Jiankai Sun and Shuyi Wang and Binxu Li and Qunhong Zeng and Di Wang and Xuandong Zhao and Yuanli Wang and Roey Ben Chaim and Zonglin Di and Yipeng Gao and Junwei He and Yizhuo He and Liqiang Jing and Luyang Kong and X...

  42. [42]

    2025 , url =

    Claude. 2025 , url =

  43. [43]

    2026 , url =

    Update to. 2026 , url =

  44. [44]

    InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23)

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with. 2023 , publisher =. doi:10.1145/3600006.3613165 , url =