{"total":27,"items":[{"citing_arxiv_id":"2606.00235","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence","primary_cat":"physics.soc-ph","submitted_at":"2026-05-29T18:10:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Introduces phenomenological model R_eff = β(1-ρ)(1-τ)(1-γρτ) for coordination under AGI decision velocity, with phase transition and proposed randomized trial.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":216,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10528","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics","primary_cat":"cond-mat.stat-mech","submitted_at":"2026-05-11T13:13:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Many MAS architectures (debate, self-consistency, com- mittee voting) implicitly assume that agreement among multiple agents provides stronger evidence for correct- ness than a single query. However, when all agents are copies of the same model, they share the same intrin- sic biases, encoded in their weights [14, 15] and shaped by alignment training such as RLHF [16, 17]. A well- documented manifestation of these training-induced bi- ases is sycophancy [18], the tendency of preference-tuned models to align responses with user-suggested or context- suggested positions; in our framework, this and related label preferences appear as a nonzero effective field. If these shared biases dominate over genuine inter-agent"},{"citing_arxiv_id":"2605.07172","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization","primary_cat":"cs.CL","submitted_at":"2026-05-08T03:07:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"are provided in Appendix G. Semantic improvement vectors in hidden space. During DPO training, for each preference pair we select an intermediate layer l (e.g., −4 from the final layer) and compute mean-pooled hidden states hch, hrj ∈R d for the chosen and rejected responses. After layer normalization we define the semantic improvement vector ∆h=LN(h ch)−LN(h rj),(10) which encodes how the hidden representation must change to turn a rejected answer into a chosen one for the same prompt. TPO loss and dynamic weighting.Because the sentence-embedding space Rds and model hidden space Rd are not aligned a priori, we introduce a small trainable projection P∈R d×ds and map topic vectors as ¯uti =P u ti.(11) For a batch of sizeB, the TPO loss is"},{"citing_arxiv_id":"2605.04542","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-06T06:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01311","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice","primary_cat":"cs.LG","submitted_at":"2026-05-02T07:55:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.","context_count":1,"top_context_role":"method","top_context_polarity":"extend","context_text":"interpolation to randomized outcomes. 3.3 CVCI-Residual.CVCI-Residual extends the CVCI pooling rule of Yang et al. (2025) by first residualizing around the OBS baseline (replacing the regression target Y with Y−f OBS(z)) and then running aλ-pooled fit in the proxy spaceψ(z): (bwres λ ,bbres λ ) =arg min w,b n (1−λ)L EXP(w,b;ψ,Y−f OBS) +λL OBS(w,b;ψ,Y−f OBS) +α∥w∥ 2 2 o , (5) yieldingbgres(z) = ( bwres λ )⊤ψ(z) +bbres λ and the final prediction clip[0,1] fOBS(z) +bgres(z) \u0001 . It targets settings where OBS already captures the hard part of the reward surface and EXP is mainly needed to calibrate a simpler discrepancy. When residual-CVCI hyperparameter CV scores tie within numerical tolerance, the tie is broken toward the more regularized"},{"citing_arxiv_id":"2604.24176","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Explanation Quality Assessment as Ranking with Listwise Rewards","primary_cat":"cs.AI","submitted_at":"2026-04-27T08:35:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the citation threshold that directly address identified gap areas; the citation threshold was not applied to 2025 work due to insufficient publication time. We identify six core architectural components (Fig. 2), each introducing distinct security concerns: (1) Foundation Model.A large pretrained language model [19]-[21] that serves as the reasoning engine [22]- [25]. Safety properties acquired through RLHF [26]-[28] or Constitutional AI [29] are necessary but insufficient for agentic deployment, as they are trained on single-turn interaction data [30]. (2) Planning and Reasoning Module.Translates high- level goals into action sequences via architectures including ReAct [5], chain-of-thought [31], and tree-of-thought search. Multi-step planning creates a surface for cumulative goal drift:"},{"citing_arxiv_id":"2604.20140","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-22T03:08:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13803","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:38:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11554","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale","primary_cat":"cs.CL","submitted_at":"2026-04-13T14:42:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06788","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Perception to Autonomous Computational Modeling: A Multi-Agent Approach","primary_cat":"cs.CE","submitted_at":"2026-04-08T07:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Thisprompt-levelrefinement is lightweight, interpretable, and reversible. Should future deployments find that prompt/memory rewrites saturate against a persistent weakness, the architecturepermitsa second tier in which the accumulated engineer-supplied correction set is used for supervised fine-tuning ofpLM or preference-based optimisation in the spirit of RLHF [62, 63] or direct preference optimisation (DPO) [64]; that escalation is a design option, not a step executed in this paper, and we do not equate the prompt-level operatorFused here with RLHF or DPO. 26 Figure 8: Two-tier self-improving feedback loop.Prompt-level refinement(olive dashed): engi- neer corrections are distilled into abstract patterns updating agent definitions and shared mem- ory, lightweight, interpretable, and reversible."},{"citing_arxiv_id":"2510.04265","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation","primary_cat":"cs.AI","submitted_at":"2025-10-05T16:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20265","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Failure Modes of Maximum Entropy RLHF","primary_cat":"cs.LG","submitted_at":"2025-09-24T15:52:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06701","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks","primary_cat":"cs.LG","submitted_at":"2025-09-08T13:55:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.16771","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention","primary_cat":"cs.SE","submitted_at":"2025-08-22T20:08:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.08125","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-06-09T18:27:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.18719","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-05-24T14:42:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.07283","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2024-10-09T11:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.21787","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","primary_cat":"cs.LG","submitted_at":"2024-07-31T17:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"1 Common Verification Methods Don't Always Scale with the Sample Budget Of the five tasks we evaluate, only GSM8K and MATH lack tools for automatically verifying solutions. We test three simple and commonly used verification approaches on their ability to identify correct solutions from these datasets: 1. Majority Vote: We pick the most common final answer [60]. 2. Reward Model + Best-of-N: We use a reward model [ 17] to score each solution, and pick the answer from the highest-scoring sample. 3. Reward Model + Majority Vote: We calculate a majority vote where each sample is weighted by its reward model score. We reuse the collections of 10,000 samples that we generated with Llama-3-8B-Instruct and Llama-3-70B- Instruct in Section 2. We use ArmoRM-Llama3-8B-v0."},{"citing_arxiv_id":"2406.06592","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improve Mathematical Reasoning in Language Models by Automated Process Supervision","primary_cat":"cs.CL","submitted_at":"2024-06-05T19:25:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"question-answer pairs available, an ORM can be trained by sampling outputs from a policy model (e.g., a pretrained or fine-tuned LLM) using the questions and obtaining the correctness labels by comparing these outputs with the golden answers. In contrast, a PRM is trained to predict the correctness of each intermediate step𝑥𝑡 in the solution. Formally, 𝑝𝑡 = PRM( [𝑞, 𝑥1:𝑡−1], 𝑥𝑡), where 𝑥1:𝑖 = [𝑥1, . . . , 𝑥 𝑖] represents the first𝑖 steps in the solution. This provides more precise and fine-grained feedback than ORMs, as it identifies the exact location of errors. Process supervision has also been shown to mitigate incorrect reasoning in the domain of mathematical problem solving. Despite these advantages, obtaining the intermediate signal for each"},{"citing_arxiv_id":"2403.19647","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","primary_cat":"cs.LG","submitted_at":"2024-03-28T17:56:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study, Variable generalization performance of a deep learn- ing model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11):e1002683, November 2018. ISSN 1549-1676. Jingzhao Zhang, Aditya Krishna Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, and Suvrit Sra. Coping with label shift via distributionally robust optimisation. In International Con- ference on Learning Representations, 2021, Coping with Label Shift via Distributionally Robust Optimisation. Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Re."},{"citing_arxiv_id":"2309.16797","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","primary_cat":"cs.CL","submitted_at":"2023-09-28T19:01:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.03958","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simple synthetic data reduces sycophancy in large language models","primary_cat":"cs.CL","submitted_at":"2023-08-07T23:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"used for STEM datasets are also from Chung et al. (2022), which was taken from Lewkowycz et al. (2022). Here, we report model performance on the \"validation\" set for each task in MMLU for Flan-PaLM models and variants with synthetic-data intervention after tuning for 1k steps. These results are shown in Table 7, Table 8, Table 9, Table 10, Table 11, and Table 12. Table 7: MMLU [:10] 5-shot individual task performance. MMLU AbstractAlgebra Anatomy AstronomyBusinessEthics ClinicalKnowledgeCollegeBiology CollegeChemistryCollegeComp. Sci.CollegeMath CollegeMedicine Model Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT Direct CoT 8B Flan-PaLM 36.4 9.1 42.9 35.7 43.8 43.8 36.4 45.5 44."},{"citing_arxiv_id":"2306.12001","ref_index":129,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Overview of Catastrophic AI Risks","primary_cat":"cs.CY","submitted_at":"2023-06-21T03:35:06+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"players must form alliances at least initially, but winning strategies often involve backstabbing allies later on. As such, CICERO learned to deceive other players, for example by omitting information about its plans when talking to supposed allies. A different example of an AI learning to deceive comes from researchers who were training a robot arm to grasp a ball [129]. The robot's performance was assessed by one camera watching its movements. However, the AI learned that it could simply place the robotic hand between the camera lens and the ball, essentially \"tricking\" the camera into believing it had grasped the ball when it had not. Thus, the AI exploited the fact that there were limitations in our oversight over its actions."},{"citing_arxiv_id":"2212.03827","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discovering Latent Knowledge in Language Models Without Supervision","primary_cat":"cs.CL","submitted_at":"2022-12-07T18:17:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":215,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}