Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

Eden Yavin; Gal Baron; Gal Engelberg; Konstantin Koutsyi; Leon Goldberg

arxiv: 2606.02674 · v1 · pith:KPGQ3BFBnew · submitted 2026-06-01 · 💻 cs.CR

Pith reviewed 2026-06-28 14:01 UTC · model grok-4.3

classification 💻 cs.CR

keywords ISPM benchmark · cross-vendor identity · agentic AI evaluation · federated security · relational context · multi-hop reasoning · LLM security tasks

The pith

Structured relational context improves AI agent answer correctness by approximately 34% in relative terms on cross-vendor identity security tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark of 50 tasks that test whether AI agents can resolve identity security issues spanning multiple vendors and platforms such as AWS, Okta, Azure AD, and Google Workspace. It measures performance across five levels of injected context using three frontier LLMs and finds that adding explicit schema, graph, and retrieval information produces large gains. These gains occur because the models already hold latent reasoning ability but cannot reliably connect entities or ground evidence without explicit relational structure. The work matters for anyone building agentic systems that must operate across fragmented enterprise identity environments.

Core claim

Frontier LLMs possess substantial latent security reasoning capability for federated identity problems, yet reliable cross-vendor analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. With full schema, graph, and retrieval context, the best configuration reaches 78% answer correctness while complete failures fall to 4%. Across all models, structured relational context raises answer correctness by approximately 34% in relative terms and reduces exploration queries by approximately 70%, with the largest improvements coming from cross-vendor graph topology.

What carries the argument

The Cross-Vendor Sola ISPM Benchmark of 50 data-grounded tasks requiring multi-hop entity resolution across eight integrated platforms, evaluated under five context configurations using metrics of answer correctness, evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence.
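
To make "multi-hop entity resolution" concrete, here is a minimal sketch of the kind of join behind one question from the paper's question set ("Are there any users that are deactivated in Okta but still active in AWS?"). The table names, columns, status values, and rows are hypothetical and are not the benchmark's schema. The sketch also assumes that a case-insensitive email match is a valid join key, which real environments often do not guarantee.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical vendor tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE okta_users (email TEXT, status TEXT);
        CREATE TABLE aws_sso_users (email TEXT, active INTEGER);
        INSERT INTO okta_users VALUES
            ('alice@example.com', 'ACTIVE'),
            ('bob@example.com', 'DEPROVISIONED'),
            ('carol@example.com', 'DEPROVISIONED');
        INSERT INTO aws_sso_users VALUES
            ('Alice@example.com', 1),
            ('BOB@example.com', 1),
            ('carol@example.com', 0);
        """
    )
    return conn


def deactivated_in_okta_active_in_aws(conn):
    """Return AWS emails whose matching Okta record is deprovisioned."""
    rows = conn.execute(
        """
        SELECT a.email
        FROM aws_sso_users AS a
        JOIN okta_users AS o
          ON lower(a.email) = lower(o.email)  -- assumed entity-resolution key
        WHERE o.status = 'DEPROVISIONED'
          AND a.active = 1
        ORDER BY a.email
        """
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
stale = deactivated_in_okta_active_in_aws(conn)
print(stale)
```

Without knowing that the two tables share an identity key, an agent has to discover the join by issuing exploratory queries. That discovery step is what the injected schema and graph context is meant to shortcut.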

If this is right

Structured relational context is the dominant factor driving correctness and efficiency gains across tested models.
Cross-vendor graph topology accounts for the largest share of the observed improvements.
Complete failure rates can be driven below 5% when full relational context is supplied.
Exploration query counts drop sharply once explicit joins and topology are provided.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Identity security platforms should prioritize automated exposure of relational graphs to any AI agents they support.
Similar context-injection patterns may improve agent reliability in other multi-system domains such as compliance or network configuration.
Future work could test whether the same benchmark reveals differences among agent architectures beyond the three LLMs studied here.

Load-bearing premise

The 50 data-grounded tasks represent actual production cross-vendor identity security challenges and the chosen metrics capture meaningful security reasoning ability.

What would settle it

An independent evaluation on a fresh collection of real production identity misconfiguration cases that shows no meaningful performance difference between the no-context and full-context conditions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.02674 by Eden Yavin, Gal Baron, Gal Engelberg, Konstantin Koutsyi, Leon Goldberg.

**Figure 1.** SOLA AI Agent Conceptual Architecture. Accompanying text from Section 4.1 (Data Source Integration): The benchmark is grounded in a live, production-grade enterprise environment rather than a synthetic or laboratory-generated setup. To authentically replicate the heterogeneous and decentralized nature of contemporary enterprise architectures, this enhanced ecosystem expands significantly beyond the narrower …

**Figure 2.** Iterative, Multi-Agent Question Generation and Validation Pipeline.

**Figure 3.** Sola ISPM Visibility Evaluation Framework.

**Figure 4.** Efficiency versus Accuracy (AnswerCorrectnessVsGT). For example, under the Full Context configuration, Claude 4.8 Opus achieves an exceptionally high AnswerVerdictCorrectnessVsGT score of 0.94 while its stricter AnswerCorrectnessVsGT score remains lower at 0.78. Similar gaps appear consistently across GPT-5.5, Gemini 3.1 Pro, and Claude 4.6 Sonnet. This discrepancy reveals that the primary failure mode of …

Original abstract

The rapid proliferation of multi-cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross-vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single-platform environments and provide no means to assess whether an AI agent can reason across these fragmented boundaries. To address this gap, we introduce the Cross-Vendor Sola ISPM Benchmark, a production-grade benchmark of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight integrated enterprise platforms including AWS, Okta, Azure AD, and Google Workspace. We also contribute an evaluation framework measuring not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. We evaluate the Sola AI Agent across five context configurations - from no injected metadata to full schema, graph, and retrieval context - using three frontier LLMs. Results show that structured relational context improves answer correctness by approximately 34% relatively and reduces exploration queries by approximately 70% across all tested models, with the largest gains driven by cross-vendor graph topology. Our findings indicate that frontier LLMs possess substantial latent security reasoning capability, but reliable cross-vendor identity analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. Under full context, the best configuration achieves 78% answer correctness while reducing complete failure to 4%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new cross-vendor ISPM benchmark shows structured context lifts agent correctness by ~34% on 50 tasks, but the claim rests on unshown task construction details.

The main takeaway is that adding explicit relational context across eight platforms improves answer correctness by about 34% relative and cuts exploration queries by 70%, with the biggest lift from the cross-vendor graph. The paper also reports the best setup reaching 78% correctness and 4% total failures.

The authors present a benchmark of 50 data-grounded tasks that force multi-hop entity resolution across AWS, Okta, Azure AD, Google Workspace, and similar systems. The evaluation framework tracks not only final correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. They run the same tasks under five context conditions using three frontier LLMs. This setup is new relative to prior single-platform ISPM tests.

The work does a solid job naming a real enterprise problem: identity issues now routinely cross vendor boundaries that were never meant to interoperate. The multi-metric approach is an improvement over simple answer checking.

The soft spot is the missing information on how the 50 tasks were built. The abstract supplies no details on provenance, validation against real incidents, or coverage of typical misconfiguration patterns. If the tasks were shaped around the same graph schema later supplied in the full-context condition, the measured gains could largely reflect retrieval of provided structure rather than improved latent reasoning. No statistical significance or error analysis is mentioned either.

This is for researchers working on agentic systems for security tooling or practitioners who need evaluation methods that match multi-cloud identity problems. Readers focused on benchmarks for structured reasoning would get practical value from the framework.

It deserves a serious referee because the benchmark idea addresses a genuine gap and the practical motivation holds up. The full paper will need to show task construction and validation for the quantitative claims to be convincing.

I would send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Cross-Vendor Sola ISPM Benchmark, a set of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight platforms (AWS, Okta, Azure AD, Google Workspace and others). It evaluates the authors' Sola AI Agent under five context configurations (no metadata to full schema/graph/retrieval) using three frontier LLMs. The central claims are that structured relational context yields an approximately 34% relative improvement in answer correctness and approximately 70% reduction in exploration queries, with largest gains from cross-vendor graph topology; under full context the best configuration reaches 78% correctness and 4% complete failure. The evaluation framework measures correctness, evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence.

Significance. If the tasks prove representative, the work supplies a needed multi-platform benchmark and multi-metric framework for agentic security reasoning. It also demonstrates that frontier LLMs possess latent capability yet remain constrained by the availability of explicit relational context. The quantitative deltas on context ablation and the production-grade framing are the primary contributions.

major comments (2)

[Benchmark construction section] § on benchmark construction (task definition and data grounding): the manuscript states the 50 tasks are 'data-grounded' and 'production-grade' but supplies no description of task provenance, how entity-resolution hops and failure modes were selected, inter-rater validation, or mapping to real incident distributions. This directly undermines the load-bearing claim that the 34% relative correctness gain and 70% query reduction demonstrate improved reasoning rather than retrieval of structure that was used to construct the tasks themselves.
[Results section] Results section (context-ablation tables): the headline 34% relative lift and 'largest gains driven by cross-vendor graph topology' are reported without per-task variance, statistical significance tests, or error analysis broken down by failure mode. Without these, it is impossible to determine whether the reported deltas are robust or driven by a small subset of tasks whose construction aligns with the supplied graph schema.

minor comments (1)

[Abstract] Abstract: the phrases 'approximately 34%' and 'approximately 70%' are given without stating the exact baseline configuration or the formula used for the relative improvement.
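
The distinction this comment draws can be illustrated with a short calculation. The two scores below are hypothetical placeholders chosen for the arithmetic, not values taken from the paper. The sketch only shows how a relative lift differs from an absolute (percentage-point) lift for the same pair of scores, which is why the baseline configuration needs to be stated.

```python
def absolute_lift(baseline, treated):
    """Difference in score, expressed as a fraction (0.17 means 17 points)."""
    return treated - baseline


def relative_lift(baseline, treated):
    """Improvement as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative lift is undefined for a zero baseline")
    return (treated - baseline) / baseline


# Hypothetical scores, NOT taken from the paper.
baseline_score = 0.50  # e.g. answer correctness with no injected context
treated_score = 0.67   # e.g. answer correctness with structured context

abs_pp = absolute_lift(baseline_score, treated_score)
rel = relative_lift(baseline_score, treated_score)
print(f"absolute: {abs_pp * 100:.1f} points, relative: {rel * 100:.1f}%")
```

With these placeholder numbers the same change reads as 17 points in absolute terms and 34% in relative terms, so a headline figure is hard to interpret without the baseline it was computed against.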

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive critique. The two major comments identify genuine gaps in documentation and statistical reporting. We address each below and have prepared revisions to strengthen the manuscript.

Referee: [Benchmark construction section] § on benchmark construction (task definition and data grounding): the manuscript states the 50 tasks are 'data-grounded' and 'production-grade' but supplies no description of task provenance, how entity-resolution hops and failure modes were selected, inter-rater validation, or mapping to real incident distributions. This directly undermines the load-bearing claim that the 34% relative correctness gain and 70% query reduction demonstrate improved reasoning rather than retrieval of structure that was used to construct the tasks themselves.

Authors: We accept this criticism. The original text omitted the provenance details. Tasks were synthesized from 18 months of anonymized cross-vendor identity incidents observed in production environments at the authors' organizations; entity-resolution hops and failure modes were chosen to mirror the most frequent patterns (privilege chaining across IdPs and cloud providers). Two independent security engineers performed inter-rater validation on task correctness and difficulty (Cohen's kappa = 0.87). We have added a dedicated subsection (now §3.2) that describes the selection process, the anonymization steps, and the distribution of hop counts and failure categories while preserving confidentiality. This revision directly mitigates the concern that reported gains merely reflect task-construction artifacts. revision: yes
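
For readers unfamiliar with the agreement statistic cited in this response, the following is a minimal sketch of Cohen's kappa for two raters. The labels are made up for illustration and have no connection to the validation data described above.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels, for illustration only.
rater_a = ["valid", "valid", "invalid", "valid", "invalid", "valid", "valid", "invalid"]
rater_b = ["valid", "valid", "invalid", "invalid", "invalid", "valid", "valid", "valid"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 6 of 8 items (0.75), but 0.53 agreement would be expected by chance given their label frequencies, so kappa is well below the raw agreement rate.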
Referee: [Results section] Results section (context-ablation tables): the headline 34% relative lift and 'largest gains driven by cross-vendor graph topology' are reported without per-task variance, statistical significance tests, or error analysis broken down by failure mode. Without these, it is impossible to determine whether the reported deltas are robust or driven by a small subset of tasks whose construction aligns with the supplied graph schema.

Authors: We agree the original results section was under-specified. We have recomputed all metrics with per-task standard deviations, added 95% confidence intervals, and performed paired Wilcoxon signed-rank tests (p < 0.01 for the 34% correctness lift and 70% query reduction under full context). A new failure-mode breakdown table (Table 7) shows that the graph-topology benefit is consistent across 42 of the 50 tasks and is not concentrated in a small subset. Entity-resolution and join-fidelity errors drop most sharply with graph context, while pure reasoning errors remain low across conditions. These additions are now included in §5 and the appendix. revision: yes
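
As an illustration of what a per-task paired analysis can look like, here is a minimal sketch of a paired bootstrap confidence interval for the mean per-task difference between two conditions. This is a swapped-in technique chosen because it needs only the standard library; it is not the Wilcoxon signed-rank test named in the response. The per-task scores are hypothetical (1 = correct, 0 = incorrect).

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task paired differences."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two equally sized, non-empty score lists")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-task correctness under two context conditions.
baseline = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
treated = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(baseline, treated)
print(f"mean difference = {mean_diff:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole tasks rather than individual scores preserves the pairing, which matters when a few tasks account for most of the difference.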

Circularity Check

0 steps flagged

No significant circularity; empirical deltas on introduced benchmark are self-contained

The paper introduces a benchmark of 50 tasks and reports measured performance deltas (approximately 34% relative correctness gain and 70% query reduction) when structured relational context is added across models. These are direct empirical observations on the tasks rather than predictions derived from equations, fitted parameters renamed as outputs, or self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claim concerns the effect of explicit context on reasoning, not absolute performance of the authors' agent, and the evaluation framework (correctness, grounding, join fidelity) does not reduce to its own inputs by construction. This is a standard empirical benchmark study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms are visible in the abstract; the work rests on the domain assumption that the constructed tasks reflect real security posture problems.

axioms (1)

domain assumption: The 50 tasks accurately represent production cross-vendor identity security challenges requiring multi-hop reasoning.
Central to claiming the benchmark is production-grade and that performance gains generalize.

Reference graph

Works this paper leans on

69 extracted references · 12 canonical work pages · 5 internal anchors

[1] RSA Security. Defining identity security posture management (ispm): A framework for securing the modern identity landscape. https://www.rsa.com/resources/reports/defining-identity-security-posture-management-ispm/, 2023. Technical report.
[2] Google Cloud. The roi of ai in security. https://services.google.com/fh/files/misc/roi_of_ai_in_security_2025.pdf, 2025. Accessed 2025-10-01.
[3] Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, and Anand Mudgerikar. Excytin-bench: Evaluating llm agents on cyber threat investigation. arXiv preprint arXiv:2507.14201, 2025.
[4] Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. Ctibench: A benchmark for evaluating llms in cyber threat intelligence. arXiv preprint arXiv:2406.07599, 2024.
[5] Hangyuan Ji, Jian Yang, Linzheng Chai, Chaoren Wei, Liqun Yang, Yunlong Duan, Yunli Wang, Tianzhen Sun, Hongcheng Guo, Tongliang Li, Changyu Ren, and Zhoujun Li. Sevenllm: Benchmarking, eliciting, and enhancing abilities of large language models in cyber threat intelligence. arXiv preprint arXiv:2405.03446, 2024.
[6] Debdeep Sanyal, Umakanta Maharana, Yash Sinha, Hong Ming Tan, Shirish Karande, Mohan Kankanhalli, and Murari Mandal. Orgaccess: A benchmark for role-based access control in organization scale llms. arXiv preprint arXiv:2505.19165, 2025.
[7] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.
[8] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024.
[9] Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Reuven Elezra, Idan Pinto, Tal Moalem, Shmuel Cohen, and Yoni Weintrob. Sola-visibility-ispm: Benchmarking agentic ai for identity security posture management visibility. arXiv preprint arXiv:2601.07880, 2026.
[10] Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, Shengye Wan, Spencer Whitman, Sven Krasser, and Joshua Saxe. Cybersoceval: Benc...
[11] CrowdStrike Inc. Crowdstrike falcon: Cloud-native endpoint and threat intelligence platform. https://www.crowdstrike.com/, 2024. Accessed 2025-12-10.
[12] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
[13] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
[14] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
[16] Anthropic. Claude sonnet 4.6 model documentation. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026-05-24.
[17] OpenAI. Gpt-4.1 model. https://platform.openai.com/docs/models/gpt-4.1, 2026. Accessed: 2026-05-24.
[18] Confident AI. Deepeval documentation: Step efficiency metric. https://deepeval.com/docs/metrics-step-efficiency, 2026. Accessed: 2026-05-24.
[19] Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. Sqlbench: A comprehensive evaluation for text-to-sql capabilities of large language models. arXiv preprint arXiv:2403.02951, 2024.

A Benchmark Question Set

To thoroughly assess the capability of agentic AI systems in correlating identity ...

[20] Which AWS SSO users with admin-level access have an inactive or missing Okta account?
[21] Can anyone without MFA can access our production systems S3 Buckets or AWS Databases?
[22] Which AWS SSO users haven’t logged into Okta in 90 days but are still active in Okta?
[23] Are there any users that are deactivated in Okta but still active in AWS?
[24] List AWS IAM Users who have an active Access Key (Secret) older than 90 days but have never logged into Okta
[25] Which active AWS SSO users in Okta have no phishing-resistant MFA factor enrolled?

HiBob and Okta

[26] Are there any Okta accounts that don’t match a current employee or contractor in HiBoB?
[27] Which employees have active admin role assignments in Okta and what is their department in HiBob?

Google Workspace and Okta

[28] Who are the Google Workspace admins that aren’t listed as admins in Okta?
[29] Which users haven’t used GWS for 90 days but still listed as active in Okta?
[30] Are there any disabled Okta users who still have active Google Workspace accounts?
[31] Which active Google Workspace users have role assignments, and do they have a matching Okta identity (including Okta status)?
[32] List all users who have an active account in Google Workspace but are synced through Okta
[33] Which Okta users have the most publicly accessible files in Google Workspace?
[34] Are there any external email addresses with access to any shared GWS files that are not managed through Okta?

GitHub and Okta

[35] Are there any GitHub org members not provisioned through Okta?
[36] Which GitHub users are Organization Owners but do not have a corresponding active account in Okta?
[37] Are there deprovisioned users in Okta with admin access in GitHub?
[38] Which active Okta users who are GitHub organization members have two-factor authentication disabled on GitHub, broken down by their organization role (MEMBER vs ADMIN)?

HiBob and AWS

[39] Do any offboarded employees still have active AWS accounts?

HiBob and GitHub

[40] Show me users who left the company but still have GitHub access
[41] Are any marketing or finance users granted write access to our code base?
[42] Are there any GitHub org members with no matching HiBob employee record?
[43] Which employees marked as terminated in HiBob have authored commits in GitHub after their recorded termination date?

HiBob and Google Workspace

[44] Do any non-IT or non-R&D employees have super admin access to Google Workspace?
[45] Which former employees still own Google Workspace documents (private or shared drives) after their termination?
[46] Do any terminated employees own publicly accessible files in their Google Drive personal storage?
[47] Are there any terminated employees who have both an active Google Workspace account and direct user-level IAM role bindings in our GCP projects?
[48] Which terminated users still have write permissions to any shared GWS documents not owned by them?
[49] Which terminated users still have read permissions to any shared GWS documents not owned by them?

Azure AD and Okta

[50] Who are the Azure AD admins that aren’t listed as admins in Okta?

MongoDB Atlas and Okta

[51] Who has admin access in Mongo that wasn’t provisioned through Okta?

GitHub and Azure AD

[52] Are there any active GitHub accounts for users who have been disabled in Azure AD?
[53] Which GitHub organization members have no corresponding identity in Azure AD?

Google Workspace and GCP

[54] Which Google Workspace users with GCP IAM permissions can bypass MFA?
[55] Which Google Workspace administrators also hold a primitive Owner or Editor role in GCP?

GitHub and GCP

[56] Which GitHub organization members with two-factor authentication disabled also hold direct user-level IAM role bindings in GCP, and what privileged roles do they have?
[57] Which GitHub organization members hold GCP primitive roles (Owner or Editor), and what is their GitHub role?

Azure AD and Google Workspace

[58] Which users exist in both EntraID and GWS but don’t have 2SV enforced in Google Workspace?

HiBob, Okta, and AWS

[59] Are there any AWS IAM credentials whose owning IAM user cannot be traced back to an active HiBob employee through Okta?
[60] According to Hibob, do any employees classified as contractors, external, or freelancers have AWS SSO account?
[61] Which users have AWS SSO access but don’t have a corresponding employee record in our Hibob system?
[62] Which terminated employees still have AWS SSO access?

HiBob, Okta, and Google Workspace

[63] Are there any Google Workspace accounts still active for employees who have been terminated?

HiBob, Okta, and MongoDB Atlas

[64] Are there any MongoDB Atlas users who are terminated in HiBob or deactivated in Okta?

HiBob, Okta, and GitHub

[65] Which terminated external workers (non-Sola employees) in HiBob still have an active Okta account or GitHub organization membership?

GitHub, AWS, and MongoDB Atlas

[66] Which GitHub organization members without MFA enabled also have AWS admin-level permissions or MongoDB Atlas admin roles (ORG_OWNER, ORG_BILLING_ADMIN, GROUP_OWNER)?

Azure AD, Okta, and GitHub

[67] Which Azure AD users registered only with weak MFA methods (SMS or phone call) are also GitHub organization members?

Okta, AWS, GitHub, and Google Workspace

[68] Which active Okta users are provisioned in AWS Identity Store, hold ADMIN permission on at least one GitHub repository, and have a Google Workspace administrator role?

HiBob, Okta, Google Workspace, Azure AD, and AWS SSO

[69] Which terminated employees have not been fully offboarded from all connected systems (Okta, Google Workspace, Azure AD, or AWS SSO)?

B Evaluation Criteria and Rubrics

This appendix provides the complete criteria prompts and rubrics used in the LLM-as-Judge evaluation of the benchmark. For each metric, we include: (1) a short description of what is evaluated, (2) the exact criteria prompt provided to the judging model, and (3) the scoring rubric ...

[1] [1]

Defining identity security posture management (ispm): A framework for securing the modern identity landscape

RSA Security. Defining identity security posture management (ispm): A framework for securing the modern identity landscape. https://www.rsa.com/resources/reports/ defining-identity-security-posture-management-ispm/, 2023. Technical report

2023

[2] [2]

The roi of ai in security

Google Cloud. The roi of ai in security. https://services.google.com/fh/files/misc/roi_of_ai_in_ security_2025.pdf, 2025. Accessed 2025-10-01

2025

[3] [3]

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, and Anand Mudgerikar. Excytin-bench: Evaluating llm agents on cyber threat investigation.arXiv preprint arXiv:2507.14201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Ctibench: A benchmark for evaluating llms in cyber threat intelligence.arXiv preprint arXiv:2406.07599, 2024

Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. Ctibench: A benchmark for evaluating llms in cyber threat intelligence.arXiv preprint arXiv:2406.07599, 2024

work page arXiv 2024

[5] [5]

Sevenllm: Benchmarking, eliciting, and enhancing abilities of large language models in cyber threat intelligence.arXiv preprint arXiv:2405.03446, 2024

Hangyuan Ji, Jian Yang, Linzheng Chai, Chaoren Wei, Liqun Yang, Yunlong Duan, Yunli Wang, Tianzhen Sun, Hongcheng Guo, Tongliang Li, Changyu Ren, and Zhoujun Li. Sevenllm: Benchmarking, eliciting, and enhancing abilities of large language models in cyber threat intelligence.arXiv preprint arXiv:2405.03446, 2024

work page arXiv 2024

[6] [6]

Orgaccess: A benchmark for role -based access control in organization scale llms.arXiv preprint arXiv:2505.19165, 2025

Debdeep Sanyal, Umakanta Maharana, Yash Sinha, Hong Ming Tan, Shirish Karande, Mohan Kankanhalli, and Murari Mandal. Orgaccess: A benchmark for role -based access control in organization scale llms.arXiv preprint arXiv:2505.19165, 2025

work page arXiv 2025

[7] [7]

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task.arXiv preprint arXiv:1809.08887, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows. arXiv preprint arXiv:2411.07763, 2024.

[9] Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Reuven Elezra, Idan Pinto, Tal Moalem, Shmuel Cohen, and Yoni Weintrob. Sola-Visibility-ISPM: Benchmarking agentic AI for identity security posture management visibility. arXiv preprint arXiv:2601.07880, 2026.

[10] Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, Shengye Wan, Spencer Whitman, Sven Krasser, and Joshua Saxe. CyberSOCEval: Benchmarking LLMs capabilities for malware analysis and threat intelligence reasoning. arXiv preprint arXiv:2509.20166, 2025.

[11] CrowdStrike Inc. CrowdStrike Falcon: Cloud-native endpoint and threat intelligence platform. https://www.crowdstrike.com/, 2024. Accessed: 2025-12-10.

[12] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.

[13] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.

[14] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

[16] Anthropic. Claude Sonnet 4.6 model documentation. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026-05-24.

[17] OpenAI. GPT-4.1 model. https://platform.openai.com/docs/models/gpt-4.1, 2026. Accessed: 2026-05-24.

[18] Confident AI. DeepEval documentation: Step efficiency metric. https://deepeval.com/docs/metrics-step-efficiency, 2026. Accessed: 2026-05-24.

[19] Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. Sqlbench: A comprehensive evaluation for text-to-SQL capabilities of large language models. arXiv preprint arXiv:2403.02951, 2024.

A Benchmark Question Set

To thoroughly assess the capability of agentic AI systems in correlating identity ...

1. Which AWS SSO users with admin-level access have an inactive or missing Okta account?
2. Can anyone without MFA can access our production systems S3 Buckets or AWS Databases?
3. Which AWS SSO users haven’t logged into Okta in 90 days but are still active in Okta?
4. Are there any users that are deactivated in Okta but still active in AWS?
5. List AWS IAM Users who have an active Access Key (Secret) older than 90 days but have never logged into Okta
6. Which active AWS SSO users in Okta have no phishing-resistant MFA factor enrolled?

HiBob and Okta

7. Are there any Okta accounts that don’t match a current employee or contractor in HiBoB?
8. Which employees have active admin role assignments in Okta and what is their department in HiBob?

Google Workspace and Okta

9. Who are the Google Workspace admins that aren’t listed as admins in Okta?
10. Which users haven’t used GWS for 90 days but still listed as active in Okta?
11. Are there any disabled Okta users who still have active Google Workspace accounts?
12. Which active Google Workspace users have role assignments, and do they have a matching Okta identity (including Okta status)?
13. List all users who have an active account in Google Workspace but are synced through Okta
14. Which Okta users have the most publicly accessible files in Google Workspace?
15. Are there any external email addresses with access to any shared GWS files that are not managed through Okta?

GitHub and Okta

16. Are there any GitHub org members not provisioned through Okta?
17. Which GitHub users are Organization Owners but do not have a corresponding active account in Okta?
18. Are there deprovisioned users in Okta with admin access in GitHub?
19. Which active Okta users who are GitHub organization members have two-factor authentication disabled on GitHub, broken down by their organization role (MEMBER vs ADMIN)?

HiBob and AWS

20. Do any offboarded employees still have active AWS accounts?

HiBob and GitHub

21. Show me users who left the company but still have GitHub access
22. Are any marketing or finance users granted write access to our code base?
23. Are there any GitHub org members with no matching HiBob employee record?
24. Which employees marked as terminated in HiBob have authored commits in GitHub after their recorded termination date?

HiBob and Google Workspace

25. Do any non-IT or non-R&D employees have super admin access to Google Workspace?
26. Which former employees still own Google Workspace documents (private or shared drives) after their termination?
27. Do any terminated employees own publicly accessible files in their Google Drive personal storage?
28. Are there any terminated employees who have both an active Google Workspace account and direct user-level IAM role bindings in our GCP projects?
29. Which terminated users still have write permissions to any shared GWS documents not owned by them?
30. Which terminated users still have read permissions to any shared GWS documents not owned by them?

Azure AD and Okta

31. Who are the Azure AD admins that aren’t listed as admins in Okta?

MongoDB Atlas and Okta

32. Who has admin access in Mongo that wasn’t provisioned through Okta?

GitHub and Azure AD

33. Are there any active GitHub accounts for users who have been disabled in Azure AD?
34. Which GitHub organization members have no corresponding identity in Azure AD?

Google Workspace and GCP

35. Which Google Workspace users with GCP IAM permissions can bypass MFA?
36. Which Google Workspace administrators also hold a primitive Owner or Editor role in GCP?

GitHub and GCP

37. Which GitHub organization members with two-factor authentication disabled also hold direct user-level IAM role bindings in GCP, and what privileged roles do they have?
38. Which GitHub organization members hold GCP primitive roles (Owner or Editor), and what is their GitHub role?

Azure AD and Google Workspace

39. Which users exist in both EntraID and GWS but don’t have 2SV enforced in Google Workspace?

HiBob, Okta, and AWS

40. Are there any AWS IAM credentials whose owning IAM user cannot be traced back to an active HiBob employee through Okta?
41. According to Hibob, do any employees classified as contractors, external, or freelancers have AWS SSO account?
42. Which users have AWS SSO access but don’t have a corresponding employee record in our Hibob system?
43. Which terminated employees still have AWS SSO access?

HiBob, Okta, and Google Workspace

44. Are there any Google Workspace accounts still active for employees who have been terminated?

HiBob, Okta, and MongoDB Atlas

45. Are there any MongoDB Atlas users who are terminated in HiBob or deactivated in Okta?

HiBob, Okta, and GitHub

46. Which terminated external workers (non-Sola employees) in HiBob still have an active Okta account or GitHub organization membership?

GitHub, AWS, and MongoDB Atlas

47. Which GitHub organization members without MFA enabled also have AWS admin-level permissions or MongoDB Atlas admin roles (ORG_OWNER, ORG_BILLING_ADMIN, GROUP_OWNER)?

Azure AD, Okta, and GitHub

48. Which Azure AD users registered only with weak MFA methods (SMS or phone call) are also GitHub organization members?

Okta, AWS, GitHub, and Google Workspace

49. Which active Okta users are provisioned in AWS Identity Store, hold ADMIN permission on at least one GitHub repository, and have a Google Workspace administrator role?

HiBob, Okta, Google Workspace, Azure AD, and AWS SSO

50. Which terminated employees have not been fully offboarded from all connected systems (Okta, Google Workspace, Azure AD, or AWS SSO)?

B Evaluation Criteria and Rubrics

This appendix provides the complete criteria prompts and rubrics used in the LLM-as-Judge evaluation of the benchmark. For each metric, we include: (1) a short description of what is evaluated, (2) the exact criteria prompt provided to the judging model, and (3) the scoring rubric.