cs.SI
Social and Information Networks
Covers the design, analysis, and modeling of social and information networks, including their applications for on-line information access, communication, and interaction, and their roles as datasets in the exploration of questions in these and other domains, including connections to the social and biological sciences. Analysis and modeling of such networks includes topics in ACM Subject classes F.2, G.2, G.3, H.2, and I.2; applications in computing include topics in H.3, H.4, and H.5; and applications at the interface of computing and other disciplines include topics in J.1--J.7. Papers on computer communication systems and network protocols (e.g. TCP/IP) are generally a closer fit to the Networking and Internet Architecture (cs.NI) category.
NPAP (Network Partitioning and Aggregation Package) is an open-source Python library for reducing the spatial complexity of network graphs. Built on NetworkX, it provides an accessible standalone package designed to be readily integrated with other software and frameworks. Instead of treating the spatial reduction process as a single action, NPAP explicitly splits it into two distinct steps: partitioning, which assigns vertices (nodes) to groups (clusters), and aggregation, which reduces the network based on a given assignment. NPAP's strategy pattern architecture allows users to employ and register custom partitioning and aggregation strategies seamlessly without modifying the core code. Currently, NPAP provides 13 different partitioning strategies and two pre-defined aggregation profiles. Although initially developed with a focus on power systems, its architecture is general-purpose and applicable to any network graph.
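For orientation, here is a minimal sketch of the two-step split described above, written against plain NetworkX rather than NPAP's actual API (the function names below are illustrative, not NPAP's):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def partition_by_community(G):
    # Step 1 (partitioning): assign each node to a cluster id.
    communities = greedy_modularity_communities(G)
    return {node: cid for cid, nodes in enumerate(communities) for node in nodes}

def aggregate(G, assignment):
    # Step 2 (aggregation): one node per cluster, inter-cluster weights summed.
    H = nx.Graph()
    H.add_nodes_from(set(assignment.values()))
    for u, v, data in G.edges(data=True):
        cu, cv = assignment[u], assignment[v]
        if cu == cv:
            continue  # intra-cluster edges vanish under aggregation
        w = data.get("weight", 1.0)
        if H.has_edge(cu, cv):
            H[cu][cv]["weight"] += w
        else:
            H.add_edge(cu, cv, weight=w)
    return H

G = nx.karate_club_graph()
reduced = aggregate(G, partition_by_community(G))
print(reduced.number_of_nodes(), reduced.number_of_edges())

Because the assignment is an explicit intermediate object, any partitioning strategy can be swapped in without touching the aggregation step, which is the point of the two-step design.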
With the introduction of large-scale network data, including population-scale social networks, techniques for privacy-aware sharing of network data become increasingly important. While existing $k$-anonymity approaches can model different attacker scenarios, they typically assume that attacker knowledge exactly matches the published network structure. We argue that exact knowledge is often unrealistic and introduce $\phi$-$k$-anonymity, a fuzzy variant of $k$-anonymity in which parameter $\phi$ captures the level of uncertainty in attacker knowledge. Across a benchmark of $39$ real-world networks, a realistic level of uncertainty ($\phi=5\%$) renders, on average, $64\%$ of previously unique nodes anonymous. To further enhance anonymity, we apply anonymization algorithms under a $5\%$ edge modification budget. While full anonymization is often unattainable under exact $k$-anonymity, with low uncertainty ($\phi=10\%$) our newly proposed Greedy algorithm anonymizes over $99\%$ of the nodes. Uncertainty also enables effective anonymization in dense synthetic graphs that are otherwise difficult to anonymize. Additionally, data utility in terms of structural properties and performance on network analysis tasks is well preserved, with most metrics changing less than $5\%$. Overall, our findings suggest that modest uncertainty assumptions yield high levels of anonymity and utility, motivating further research on uncertainty-aware privacy guarantees for network data.
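The abstract does not pin down which structural property the attacker knows; as an illustration only, here is a degree-based reading in which $\phi$ is a relative tolerance on the attacker's degree knowledge, so a node is anonymous if at least $k$ nodes (itself included) have degrees within $\pm\phi$ of its own.

import networkx as nx

def phi_k_anonymous_fraction(G, k=2, phi=0.05):
    # Fraction of nodes matched, within relative tolerance phi, by >= k nodes.
    degrees = [d for _, d in G.degree()]
    anonymous = 0
    for d in degrees:
        lo, hi = d * (1 - phi), d * (1 + phi)
        if sum(lo <= e <= hi for e in degrees) >= k:
            anonymous += 1
    return anonymous / len(degrees)

G = nx.barabasi_albert_graph(1000, 3, seed=1)
print(phi_k_anonymous_fraction(G, k=2, phi=0.05))

Setting phi=0 recovers ordinary degree-based $k$-anonymity, which is the sense in which the fuzzy variant strictly relaxes the exact one.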
Public digital conversation around major sporting events takes place within a hybrid system in which journalists and the media compete with new intermediaries, including influencers, to gain greater visibility and engage with audiences. This study analyses the Qatar 2022 World Cup as a case of high informational intensity and public opinion monitoring. To that end, social network analysis was applied to X/Twitter using the hashtag #Qatar2022, analysing 1,343 high-engagement accounts, including those of journalists, media and influencers, alongside a random sample of 5,000 users. The findings indicate that journalists are under-represented in the user population as a whole, but significantly over-represented among the highest-engagement accounts, and they maintain stable visibility. The media, by contrast, attract a lower average level of attention and tend to achieve only sporadic peaks of impact. Accordingly, journalistic authority on social media is observed less as dominance in terms of participation volume and more as the capacity to occupy reference positions when public attention is being shaped during the event.
Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a 12-LLM capability evaluation across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods -- pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development.
We consider the problem of assessing a group of nodes in a network. Our focus is on vitality indices -- a natural class of centrality measures that evaluate the importance of a node by examining the impact of its removal on the network. We conduct a comprehensive analysis of group vitality indices. Specifically, we show that every vitality index admits a unique extension to groups, which can be defined using a group variant of the Shapley value recently proposed in the literature. We also provide an axiomatization of the entire class, along with two specific group vitality indices that satisfy additional normalization conditions. Furthermore, we study the computational properties of all vitality indices, as well as Group Attachment Centrality.
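The vitality template makes the group extension concrete: for a graph functional $f$, the vitality of a group $S$ is $f(G) - f(G - S)$. A minimal sketch, using the number of connected node pairs as an example functional:

import networkx as nx

def connected_pairs(G):
    # Example functional f: number of node pairs joined by some path.
    return sum(len(c) * (len(c) - 1) // 2 for c in nx.connected_components(G))

def group_vitality(G, S):
    # Vitality of group S: impact of removing S on f.
    H = G.copy()
    H.remove_nodes_from(S)
    return connected_pairs(G) - connected_pairs(H)

G = nx.karate_club_graph()
print(group_vitality(G, {0}))      # vitality of a single node
print(group_vitality(G, {0, 33}))  # vitality of the two-hub group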
In this paper, we study emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms that are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic corrections that are not directly reducible to surface-level prompts. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-\underline{CO}nsistency yet sharp inter-\underline{CO}ntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness; over 90\% of outputs revert to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. Observing that simple weight amplification of COCO neurons yields only marginal gains, we propose two training-free, lightweight editing strategies: Local Enhancement (LE-COCO) and Networked Enhancement (NE-COCO). Comprehensive evaluations show that our methods bolster robustness against adversarial jailbreaks and achieve strong performance on open-ended safety benchmarks, while preserving foundational generative proficiency. While this study primarily addresses social stereotypes, the COCO mechanism holds significant potential for diverse domains like hallucination detection, offering valuable insights toward the development of self-evolving AI agents.
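A schematic of the neuron-selection criterion (a d-prime-like statistic of our own for illustration, not the paper's exact contrastive causal estimator): favor units with low variance within each response type and large mean separation across types.

import numpy as np

def coco_score(acts_stereo, acts_unbiased, eps=1e-8):
    # acts_*: (n_samples, n_neurons) activations under each response type.
    contrast = np.abs(acts_stereo.mean(axis=0) - acts_unbiased.mean(axis=0))
    spread = acts_stereo.std(axis=0) + acts_unbiased.std(axis=0)
    return contrast / (spread + eps)  # high = consistent within, contrastive across

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 512))
b = rng.normal(size=(200, 512))
b[:, :5] += 3.0                      # plant 5 discriminative units
print(np.sort(np.argsort(coco_score(a, b))[-5:]))  # recovers indices 0..4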
Multilayer networks are widely used across biology to represent systems in which complex networks vary across space, time, or interaction types. However, interactive visualization tools remain limited. We present MiRA (Multilayer Interactive Rendering Application), a browser-based, installation-free web application for visualizing biological multilayer networks. MiRA offers seven complementary visualization modes and interactive features that enable researchers to visually navigate the high complexity of multilayer networks for research and education.
Empirical networked systems are often only partially observed: sampling frames, crawling policies, privacy constraints, and temporal gaps can leave actors and edges unobserved. This complicates robustness and sensitivity analysis because many graph-learning pipelines implicitly treat the observed node set as exhaustive. Link prediction and graph completion repair structure among known vertices, whereas full-graph generators synthesize new graphs rather than extending an observed one as a fixed backbone. We study the complementary task of controlled node insertion: generating plausible new actors and attaching them to an existing graph while preserving interpretable global topology.
We introduce the Astro Generative Network (AGN), a variational graph autoencoder that samples latent vectors to decode node features and then integrates new vertices through similarity-based attachment to the observed backbone. We distinguish the recommended configuration, AGN, from AGN-original, a diagnostic baseline that permits generated-generated edges. Across three synthetic regimes, AGN-original forms dense generated-generated subgraphs that artificially inflate clustering and density. Disabling those edges removes this artifact while preserving degree and path-length behavior. In our experiments, AGN keeps clustering and modularity changes modest relative to pre-insertion values, while novelty diagnostics show non-trivial separation from existing nodes without claiming domain-grounded identities.
Our contribution is methodological: a reproducible insertion protocol and evaluation lens for incomplete network science and engineering.
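A minimal sketch of the insertion step in the recommended configuration (the decoder below is a stand-in, not the paper's trained VGAE): sample a latent, decode node features, then attach only to observed nodes by cosine similarity, so no generated-generated edges can form.

import numpy as np

def insert_node(X_obs, decode, rng, k=3, latent_dim=16):
    # X_obs: (n, d) observed node features; decode: latent -> feature vector.
    x_new = decode(rng.normal(size=latent_dim))
    sims = X_obs @ x_new / (np.linalg.norm(X_obs, axis=1)
                            * np.linalg.norm(x_new) + 1e-12)
    neighbors = np.argsort(sims)[-k:]  # top-k most similar observed nodes
    return x_new, neighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
decode = lambda z: rng.normal(size=8)  # hypothetical decoder stand-in
x_new, nbrs = insert_node(X, decode, rng)
print(nbrs)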
As artificial intelligence increasingly mediates public discourse, it becomes important to understand how human-AI collectives shape opinion formation, deliberation, and democratic outcomes. We present a novel experimental method for studying opinion dynamics in hybrid human-AI social networks. Participants, human or AI, were embedded in $5\times5$ grid lattice networks and iteratively asked to select and revise statements on a given polarizing topic over eight rounds. We compared three conditions: human-only, AI-only, and hybrid networks with equal proportions of human and AI participants. Hybrid human-AI networks achieved the lowest final polarization while, in contrast, human-only networks exhibited higher polarization with lower neighbor agreement. We also ran additional experiments varying Large Language Model (LLM) prompt framing to explore whether instruction design might influence convergence patterns. Although these early findings are preliminary and cannot yet support broad generalizations, they highlight the potential value of experimental social networks for understanding opinion dynamics in human-AI hybrid societies.
While there is widespread agreement that markets for ecosystem services (MES) have transformed conservation, it is less clear whether they have transformed the practice of environmental science to meet market needs for stable commodities. We examine this further through the case of blue carbon. Putting marine ecosystems on the MES agenda, blue carbon makes the case to incorporate coastal wetlands into carbon markets to finance their conservation and mitigate climate change. Using a mixed methods approach combining bibliometric and Natural Language Processing (NLP) analyses of peer-reviewed coastal wetland literature, a qualitative review of highly cited publications, and semi-structured interviews with blue carbon scientists, we argue that blue carbon has reshaped coastal wetland science: broadly toward strategic science on conservation and restoration to address global climate change, and specifically by reframing biogeochemistry to meet market demands for stable carbon commodities. Measured by number and share of papers, the blue carbon field is growing quickly and outpacing other research. Its papers are disproportionately represented among the most cited coastal wetland publications, suggesting a growth in both scientific production and authority. Our results show the emergence of blue carbon has redirected biogeochemical research on coastal wetlands from dynamic cycles and processes to stored and preserved carbon, aligning research with the stable carbon commodities needed for carbon markets. This strongly suggests that coastal wetland scientists, although skeptical of carbon offsets, are shifting their work to align with the needs of market frameworks in well-intentioned efforts to promote coastal wetland conservation. If this continues, alternatives to market-based policies are unlikely to emerge, making scientists' own doubts about their effectiveness a serious cause for concern.
Consumption Drives Production (CDP) on social platforms aims to deliver interpretable incentive signals for creator ecosystem building and resource utilization improvement, which strongly relies on attribution. In large-scale and complex recommendation systems, the absence of accurate labels together with unobserved confounding renders backdoor adjustments alone insufficient for reliable attribution. To address these problems, we propose Adversarial Learning Mediator based Multi-Touch Attribution (ALM-MTA), an extensible causal framework that leverages front-door identification with an adversarially learned mediator: a proxy trained to distill outcome information to strengthen the causal pathway from treatment to outcome and eliminate shortcut leakage. We then introduce contrastive learning that conditions front-door marginalization on high-match consumption-upload pairs to ensure positivity in large treatment spaces. To assess causality from non-RCT logs, we also incorporate a non-personalized bucketed protocol, estimating grouped uplift and computing AUUC over treatment clusters. Finally, we evaluate ALM-MTA using a real-world recommendation system with 400 million DAU and 30 billion samples. ALM-MTA increases DAU by 0.04% and daily active creators by 0.6%, with unit exposure efficiency increased by 670%. On causal utility, ALM-MTA achieves higher grouped AUUC than the SOTA in every propensity bucket, with a maximum gain of 0.070. In terms of accuracy, ALM-MTA improves upload AUC by 40% compared to SOTA. These results demonstrate that front-door deconfounding with adversarial mediator learning provides accurate, personalized, and operationally efficient attribution for creator ecosystem optimization.
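For reference, the front-door adjustment identity underlying this design, with the adversarially learned mediator playing the role of $m$ (generic notation, not the paper's):

\[
P\bigl(y \mid \mathrm{do}(t)\bigr) \;=\; \sum_{m} P(m \mid t) \sum_{t'} P\bigl(y \mid m, t'\bigr)\, P(t'),
\]

which remains identifiable under unobserved treatment-outcome confounding provided the mediator fully transmits the treatment's effect; the adversarial training described above aims to make the learned proxy satisfy exactly that role.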
Graph neural networks (GNNs) aim to learn representations in a lower-dimensional space for downstream tasks while preserving the topological structures. In recent years, the attention mechanism, which has proven highly effective in natural language processing and computer vision, has been introduced into GNNs to adaptively select discriminative features and automatically filter noisy information. To the best of our knowledge, due to the fast-paced advances in this domain, a systematic overview of attention-based GNNs is still missing. To fill this gap, this paper aims to provide a comprehensive survey on recent advances in attention-based GNNs. Firstly, we propose a novel two-level taxonomy for attention-based GNNs from the perspectives of developmental history and architecture. Specifically, the upper level reveals the three developmental stages of attention-based GNNs, including graph recurrent attention networks, graph attention networks, and graph transformers. The lower level focuses on various typical architectures of each stage. Secondly, we review these attention-based methods following the proposed taxonomy in detail and summarize the advantages and disadvantages of various models. A model characteristics table is also provided for a more comprehensive comparison. Thirdly, we share our thoughts on some open issues and future directions of attention-based GNNs. We hope this survey will provide researchers with an up-to-date reference regarding applications of attention-based GNNs. In addition, to cope with the rapid development in this field, we intend to share the relevant latest papers as an open resource at https://github.com/sunxiaobei/awesome-attention-based-gnns.
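For reference, the attention coefficient of the canonical graph attention network, the middle stage of this taxonomy:

\[
e_{ij} = \mathrm{LeakyReLU}\bigl(\mathbf{a}^{\top}[\mathbf{W}h_i \,\Vert\, \mathbf{W}h_j]\bigr),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},
\]

so each node aggregates its neighbors' features weighted by the learned $\alpha_{ij}$, which is how noisy neighbors are down-weighted.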
YouTube is central to contemporary mass media. However, the official YouTube API does not provide access to the full set of creators or creator metadata on the platform. This lack of basic visibility into the YouTube ecosystem hinders understanding of the platform's creator economy. Researchers currently have no easy, transparent, or replicable way to construct large-scale datasets of YouTube creators and their audiences over time. This makes it challenging to study vital social questions, such as how changes to the YouTube recommendation algorithm shape creator incentives and by extension the mass media on the platform. We address this gap with TubeCensus, a large-scale longitudinal dataset of YouTube creators and subscriber counts, constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. This approach is transparent and replicable and does not require interaction with the YouTube API, whose output can change over time. We validate the coverage of TubeCensus against prior estimates of YouTube's size and find that our resource includes creators responsible for at least 30-36% of all YouTube content. We also find that TubeCensus provides good coverage of prominent creators. To support future research, we hide the substantial complexities of the YouTube identifier system and Internet Archive capture system by distributing our dataset via an easy-to-use pip package. Finally, we use our resource to complete basic exploratory analysis of YouTube channel content and the mechanisms associated with YouTube channel growth.
Legal disputes unfold through sequences of filings in which parties update their positions and may settle at any stage. Most computational studies of legal prediction, however, focus on adjudicated outcomes and treat cases as static objects observed only at the end of litigation. Here we develop a temporally structured framework for predicting outcomes in civil litigation using 835,190 court filings between 1996 and 2022. We represent each case as a sequence of documents and model litigation as a three-outcome process: plaintiff win, plaintiff loss, or settlement. Documents are encoded using structured legal features, text embeddings, and information about judges and law firms, and a classifier estimates outcome probabilities at each stage of the case. The model achieves class-specific AUC values between 0.74 and 0.81, and reaches up to 97% accuracy for high-confidence plaintiff-win predictions. To study heterogeneity in predictability, we define case complexity as the entropy of the predicted outcome distribution. Richer factual and relational information improves prediction primarily in low-complexity cases, whereas its marginal contribution declines as complexity increases, suggesting that some disputes remain difficult not because information is missing, but because outcomes are less determinate. Consistent with this interpretation, complexity increases over the course of litigation, indicating that additional filings can amplify uncertainty rather than resolve it. Settlement rates follow an inverted U-shape with respect to complexity, peaking at intermediate levels of predictive uncertainty and declining at both low and high levels of complexity. These findings suggest that predictive uncertainty is not merely model error, but an empirical signal of legal complexity, litigation dynamics, and the conditions under which disputes are resolved through adjudication or settlement.
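The complexity measure used above, written out: with $\hat{p}_o$ the model's predicted probability of each outcome at a given stage,

\[
H(\text{case}) \;=\; -\sum_{o \,\in\, \{\text{win},\, \text{loss},\, \text{settle}\}} \hat{p}_o \log \hat{p}_o ,
\]

so maximal complexity corresponds to a uniform predictive distribution over the three outcomes and zero complexity to a confident point prediction.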
Online communities are a global phenomenon, but assessing their actual geographical spread requires accurate and scalable measurement. We propose and evaluate methods that infer the time zone of online communities solely from their temporal activity patterns, requiring nothing beyond hourly activity counts. Grounding our approach in the well-established finding that posting rhythms encode circadian structure, we compare time-domain and frequency-domain methods against a parsimonious heuristic: that activity reaches its minimum around 4 a.m. local time. On Reddit, we show that the best-performing method is accurate to a sub-30-minute resolution, and that fewer than a thousand comments are sufficient to reach peak performance. Similarly, our heuristic almost matches the accuracy of more complex methods, recovering the correct time zone within a one-hour margin on average. This simple method correlates significantly with the actual distribution of Reddit's geographical spread; we validate its generalizability across communities organized around diverse cultural phenomena, from sports to finance, and apply it at scale to characterize the geographic evolution of Reddit from its founding to the present. Our method is portable across platforms and requires no user disclosure, making it a practical baseline for any study that must account for the geographic structure of online behavior.
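The 4 a.m. heuristic fits in a few lines: find the quietest hour of the UTC-binned activity histogram and assume it corresponds to 4 a.m. local time.

import numpy as np

def infer_utc_offset(hourly_counts_utc):
    # hourly_counts_utc: length-24 activity counts binned by UTC hour.
    quietest = int(np.argmin(hourly_counts_utc))
    offset = (4 - quietest) % 24       # local quiet hour should be 4 a.m.
    return offset - 24 if offset > 12 else offset  # map into (-12, +12]

counts = np.full(24, 100)
counts[9] = 1                          # quietest at 09:00 UTC
print(infer_utc_offset(counts))        # -5: 09:00 UTC is 4 a.m. at UTC-5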
Despite Facebook's central role in American civic life, a clear, evidence-based understanding of users' long-term information environments has remained elusive, hindering assessments of the platform's societal impact. This study addresses that gap by analyzing a unique decade-long dataset, constructed by collecting the full list of public pages and groups followed by over 1,100 American users. This approach allows us to examine the potential information exposure of these users by analyzing hundreds of millions of posts from 2012 to 2023. We find that political content constitutes a modest 18% of a user's potential information diet, which is predominantly composed of lifestyle and entertainment topics. This aggregate view, however, masks a deeply stratified reality: we uncover significant and persistent disparities in the volume and ideological leaning of political content across age, gender, and racial lines. Furthermore, we quantify the porous boundaries between content categories, showing how political discourse frequently permeates non-political spaces. Leveraging the dataset's longitudinal nature, we also assess the impact of major platform interventions. We find that Meta's 2018 "Meaningful Social Interactions" update dramatically increased the share of political content by contracting the visibility of non-political posts. By providing a granular, decade-long map of potential information exposure, our study offers one of the first representative, longitudinal pictures drawn from platform-independent data. Our findings underscore the critical need for researchers to measure exposure, not merely engagement, and to account for the significant volume of political content that circulates in non-political spaces.
In this article we present a new centrality measure called ksi-centrality. We show that ksi-centrality distinguishes real networks from random ones, similar to degree centrality: the ksi-centrality distribution is right-skewed for real networks and centered for random Erdos-Renyi networks, and has a linear pattern with a heavy tail on a log plot. Furthermore, the ksi-centrality distribution is centered for models simulating real networks: Barabasi-Albert, Watts-Strogatz, and Boccaletti-Hwang-Latora. Thus, this centrality distribution is an additional and independent property with respect to scale-freeness. We also introduce a normalized version of ksi-centrality and show that it is related to algebraic connectivity and the Cheeger constant of a network. Moreover, the average value of this normalized centrality is in bijective correspondence with the relative number of edges that a new node connects to others in the Barabasi-Albert preferential attachment model, thus answering the question of how to choose the parameter $m$ to model a given real-world network.
Chat communication is often fast-paced, creating the expectation of quick replies. While the timing of exchanges is known to foster closeness and enjoyment, it remains largely unexplored whether chat partners with strong ties reciprocate each other's response times. Using 3.4 million messages from 889 chats across 97 donations of anonymous WhatsApp and Instagram chats, we analyzed response times, their balance between chat partners, and its stability over time. To our knowledge, this is the first study to examine response speed as an expression of reciprocity, bridging a key aspect of online communication with a fundamental principle of social interactions. We found that around 70% of WhatsApp and 44% of Instagram messages were answered within five minutes, confirming the fast pace of instant messaging. Overall, the response speed between chat partners was similar. The response speed similarity was evident both in the overall response-time distributions of chat partners assessed with Jensen-Shannon distance and in the steep regression slopes (0.786 for WhatsApp and 0.796 for Instagram) linking one person's probability of responding within five minutes to the partner's corresponding probability. Importantly, the dispersion of response time similarity over months showed that this balance persists over time. Our results position response time balance as a marker of reciprocity in computer-mediated communication, offering a new way to quantitatively study this fundamental principle of social interaction. We suggest using response speed balance as a complementary metric in the analysis of relationship dynamics, such as the strengthening or weakening of social ties.
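A minimal version of the distributional comparison used here: the Jensen-Shannon distance between two partners' binned response-time distributions.

import numpy as np
from scipy.spatial.distance import jensenshannon

def response_time_distance(times_a, times_b, bins):
    # times_*: response times in seconds; bins: shared histogram edges.
    p, _ = np.histogram(times_a, bins=bins, density=True)
    q, _ = np.histogram(times_b, bins=bins, density=True)
    return jensenshannon(p, q)  # 0 = identical rhythms; larger = less similar

bins = np.geomspace(1, 86_400, 30)  # log-spaced, 1 second to 1 day
rng = np.random.default_rng(0)
a = rng.exponential(120, size=500)  # partner A, mean ~2 min
b = rng.exponential(150, size=500)  # partner B, mean ~2.5 min
print(response_time_distance(a, b, bins))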
Source localization is a representative inverse inference task in information propagation, aiming to identify the source node or node set that triggers the propagation results based on the observed information. A primary challenge is quantifying the inherent uncertainty between observed outcomes and potential sources. Although deep generative models have partially mitigated this issue, most existing approaches primarily focus on uncertainty induced by network topology, attempting to learn a direct mapping from propagation outcomes to sources based on network structure, while overlooking the additional uncertainty stemming from the highly stochastic nature of the propagation process. To address this limitation, we propose a Propagation Dynamics aware framework for Source Localization (PDSL), a novel method that integrates a deep generative model with propagation dynamics to approximate the source distribution and explicitly mitigate uncertainty arising from diffusion stochasticity. Moreover, we employ Graph Neural Ordinary Differential Equations to model the continuous dynamics of diffusion processes without relying on a predefined diffusion mechanism. Additionally, a matching mechanism is designed to extract relevant data blocks that enhance source generation reliability. Comprehensive experiments on both synthetic and real-world diffusion datasets demonstrate the superior performance of the proposed framework across diverse application scenarios.
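Schematically, the Graph Neural ODE component models diffusion as continuous dynamics (with $f_{\theta}$ a learned graph network rather than a predefined mechanism, and notation generic rather than the paper's):

\[
\frac{d\mathbf{H}(t)}{dt} = f_{\theta}\bigl(\mathbf{H}(t), G\bigr),
\qquad
\mathbf{H}(t_1) = \mathbf{H}(t_0) + \int_{t_0}^{t_1} f_{\theta}\bigl(\mathbf{H}(t), G\bigr)\, dt,
\]

so node states evolve smoothly in time rather than in discrete propagation rounds.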
Community detection in directed graphs is challenging because edge asymmetry induces non-reversible diffusion, direction-dependent accessibility, and distinct source and target roles. This paper develops a Green-based cosine geometry for directed community detection and for expanding a disjoint partition into an overlapping cover. The key observation is that hitting-time information is natural for directed graphs, but raw hitting-time vectors are not well suited for cosine comparison: they contain a source-independent stationary baseline, whereas cosine similarity is not translation-invariant. We therefore replace raw hitting-time profiles by centered Green profiles of the directed random walk and use the diffusive part of the truncated Green profile, excluding the time-zero self-spike. To account for asymmetry, we concatenate the Green profile of the original walk with the corresponding profile on the edge-reversed graph, yielding forward-backward Green coordinates.
The framework gives two algorithms. Di-Green-FB-cosine-KMeans clusters vertices in the Green cosine space to obtain a disjoint directed partition. Di-Green-FB-Cosine Overlap expands an initial partition into an overlapping cover using a community-adaptive cosine rule. The initial partition can be supplied by any disjoint method; in the main pipeline it is produced by Di-Green-FB-cosine-KMeans. Experiments on synthetic directed benchmarks show that the proposed geometry improves over raw hitting-time cosine variants and is competitive with directed spectral and flow-based baselines. Real-network experiments, evaluated by directed modularity as an internal quality measure, indicate that the same geometry produces coherent directed partitions. Synthetic overlap experiments further show that the method recovers additional memberships effectively, especially in moderately and weakly separated directed networks.
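One way to write what the abstract describes, assuming $P$ is the walk's transition matrix and $\pi$ its stationary distribution: the centered, truncated Green profile of source $i$ is

\[
g_i \;=\; \sum_{t=1}^{T} \bigl( e_i^{\top} P^{t} - \pi^{\top} \bigr),
\]

where starting the sum at $t = 1$ drops the time-zero self-spike and subtracting $\pi^{\top}$ removes the source-independent baseline that breaks cosine comparison; the forward-backward coordinates are then the concatenation $[\, g_i \,\Vert\, g_i^{\mathrm{rev}} \,]$, with $g_i^{\mathrm{rev}}$ computed on the edge-reversed graph.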
The overreaches of mainstream social media platforms have been extensively reported and studied. For activist communities, these platforms pose risks of surveillance, censorship, or erasure. Decentralized social networks (DSNs) serve as alternative online spaces that appear to prioritize values such as user privacy, free speech, and community control. However, the decentralized ecosystem is vast and complex, making it difficult for communities to understand how to best use these platforms for their organizing aims. We aim to fill this gap by proposing a conceptual framework for navigating the DSN landscape that defines core activist community needs -- minimal overhead, community building and reach, on- and off-line safety, and operational sustainability -- and links them to concrete platform affordances such as resource efficiency, interoperability, and data ownership. We apply the framework to (1) evaluate and compare the sociotechnical tradeoffs of two contemporary DSNs (Mastodon and Bluesky), (2) understand broader community configurations that emerge across different DSN infrastructures and their implications for collective action, and (3) explore how two distinct activist communities facing infrastructural and political constraints might use the framework to find platforms that align with their needs. We conclude by reflecting on the theoretical promises of DSNs and the structural conditions that shape and constrain participation across them.
Hypergraph partitioning is a fundamental optimization problem with applications in data management and other domains involving higher-order relations. In this paper, we study balanced hypergraph partitioning from the perspective of quantum optimization. We formalize balanced $k$-way hypergraph partitioning with general hyperedge cut functions, and derive corresponding binary optimization formulations targeted at quantum optimization methods in both the two-way and multi-way settings. Our discussion highlights which cut functions admit Quadratic Unconstrained Binary Optimization (QUBO) encodings and which instead lead to higher-order binary objectives or rational forms. As a preliminary empirical validation, we focus on balanced two-way partitioning with the all-or-nothing cut on 3-uniform hypergraphs, where a direct QUBO is available, and evaluate simulated Quantum Approximate Optimization Algorithm (QAOA) and Simulated Annealing (SA) on small instances against exact solutions. The results show that the formulation is effective on small hypergraphs and that the balance-penalty weight plays a critical role in trading off cut quality and balance.
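As a sketch of one such encoding (not necessarily the paper's exact formulation): with spins $s_i \in \{\pm 1\}$ (equivalently $x_i = (1+s_i)/2$ in QUBO form), a 3-uniform hyperedge $\{i,j,k\}$ satisfies $s_i s_j + s_j s_k + s_i s_k = 3$ when all three endpoints agree and $-1$ otherwise, so the balanced two-way all-or-nothing objective becomes

\[
\min_{s \in \{\pm 1\}^{n}} \;
\sum_{\{i,j,k\} \in E} \frac{3 - s_i s_j - s_j s_k - s_i s_k}{4}
\; + \; \lambda \Bigl( \sum_{i=1}^{n} s_i \Bigr)^{2},
\]

where the penalty weight $\lambda$ is exactly the cut-quality versus balance trade-off the experiments probe.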
Models for cross-sectional network data have become increasingly well-developed in recent decades, and are widely used. This has led to a growing interest in the connection between such cross-sectional models and the behavioral processes from which the corresponding networks were presumably generated. Here, we build on prior work in this area to present a behavioral micro-foundation for cross-sectional network models, based on a continuous time stochastic choice mechanism, that can accommodate highly general classes of cases (including agents who are not themselves in the network, and multilateral edge control). As we show, the equilibrium behavior of this process under appropriate conditions can be expressed in exponential family form, allowing estimation of individual preferences using existing methods; the graph potential separates naturally into a preference-based term reflecting agent utilities, and an entropic term reflecting the rules of tie formation. We illustrate our approach via an analysis of friendship in a professional organization, and modeling of phase transitions in the structure of small groups.
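Schematically, the equilibrium form described above is

\[
P(G = g) \;=\; \frac{\exp\bigl(q(g)\bigr)}{\sum_{g'} \exp\bigl(q(g')\bigr)},
\qquad
q(g) \;=\; \underbrace{U(g)}_{\text{agent preferences}} \;+\; \underbrace{S(g)}_{\text{entropic term}}
\]

(notation schematic, not the paper's), which is the exponential family shape that lets existing estimation machinery for cross-sectional network models recover the preference parameters inside $U$.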
Accurate prediction of physician referral links is essential for optimizing care coordination and reducing fragmentation in healthcare delivery. However, existing computational methods, ranging from triadic closure heuristics to graph neural networks, fail to capture the intrinsic properties of physician referral networks, including sparsity, disassortative degree mixing, and hub-dominated topology. Here, we propose H3, a healthcare three-hop index that addresses these limitations by modeling indirect referral pathways through intermediate physicians, with degree-based normalization and a redundancy penalty to mitigate hub-mediated noise. Using Medicare Physician Shared Patient Patterns data, we evaluate H3 under two complementary prediction regimes: within-period prediction, which assesses recovery of contemporaneous referral links under sparse conditions, and cross-period prediction, which tests robustness to temporal shift as referral windows expand. Across both regimes, H3 consistently outperforms classical heuristics and deep learning-based baselines. Unlike black-box neural network approaches, H3 produces fully decomposable predictions traceable to specific intermediary physicians, offering a transparent and deployable solution for referral network completion.
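The abstract names the ingredients (three-hop paths, degree normalization, a redundancy penalty) without giving the formula; purely for orientation, an index of that general shape looks like

\[
\mathrm{score}(u, v) \;=\; \sum_{u \to a \to b \to v} \frac{1}{\sqrt{d_a d_b}} \;-\; \gamma \cdot \mathrm{red}(u, v),
\]

where the sum runs over intermediate physicians $a, b$, the degrees $d_a, d_b$ damp hub-mediated paths, and $\mathrm{red}(u, v)$ penalizes redundant overlapping pathways; the paper's exact normalization and penalty may differ. The decomposability claim follows from this shape, since each summand is attributable to a specific pair of intermediaries.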
Human societies continuously transform scattered information into collective judgments and coordinated action, whether through markets discovering prices, governments allocating resources, communities enforcing norms, or science converging on reliable claims. Importantly, the computational difficulty of collective decision-making, particularly the time and communication required to reach solutions, imposes fundamental constraints on social organization. While theoretical computer science offers formal tools for analyzing such problems, for instance, by analyzing resource requirements, including time and memory, surprisingly, there is no domain of social science that focuses on the nature of computation in the human world. This perspective argues that we now have the opportunity to deploy these computational frameworks to study human social organization, opening research directions at the intersection of computer science and social science. We highlight core social phenomena that can be framed as computational, including (i) distributed consensus and coordinated action, (ii) societal restructuring with scale, (iii) hierarchical and modular structure, and (iv) externalized memory systems. We identify several concepts from theoretical computer science that may provide insight into these phenomena, especially emphasizing more recently developed approaches beyond the paradigm of Turing Machines and worst-case computational complexity.
This short paper explores trends in extremist Facebook data from July 2023 to June 2024. We examined engagement, sentiment, and topics within Facebook groups categorized as anti-Israel/Semitic, anti-Palestine/Muslim, and anti-both, mapping these trends against five major events related to the recent Israel-Hamas conflict. Our findings support the hypothesis that shifts in trends correspond with these key events, showing varying patterns across different group categories. We observed decreased activity proportion in anti-both groups and increased activity proportion in the two one-sided hate groups at the conflict's onset. This pattern reversed after the Israeli troop withdrawal from Khan Yunis, Gaza. During the conflict, negative content proportion surged, and neutral content proportion fell in all three group categories. Anti-Palestine/Muslim groups' discourses shifted from religious to social media activism and political/protest around the time the war began, while anti-Israel/Semitic groups moved from political/protest to religious topics a couple of weeks before the war.
Political news on social media rarely circulates in isolation: audiences actively engage, react, and clash. Whether these interactions reflect agreement or conflict may depend on the ideological discrepancy between publishers and the news content they share. This study investigates this relationship using Facebook posts linking to political news during a Brazilian presidential election. We analyze five dimensions of engagement: ideological discrepancy between publishers and content, emotional responses, audience consensus, toxicity in posts, and content topics. Our results show that ideological discrepancy is associated with differences in engagement, exhibiting a nonlinear pattern: consensus declines under conditions of very high ideological mismatch and, in our data, also under very high alignment, while toxicity increases primarily under extreme mismatch. A statistical model indicates that emotional valence, toxicity, and ideological discrepancy are the factors most strongly associated with consensus. Among highly partisan publishers, higher toxicity is associated with increased audience consensus, suggesting that hostile discourse may co-occur with in-group agreement in strongly ideological contexts. Overall, these findings highlight how ideological discrepancy, emotional reactions, and interaction dynamics are associated with consensus and polarization in online political engagement.
We place geo-targeted advertisements on Facebook to encourage users to fill out an online survey, following a process known as river sampling. We discovered a large number and variety of users also came to our survey through snowball sampling, including shared social media posts and other word-of-mouth referral methods. In this article, we analyze the differences between the respondents from river and snowball sampling. We present evidence that the respondents obtained by snowball sampling are more likely to complete the survey and contain a higher fraction of new users and women than those obtained by river sampling. Additionally, the evidence indicates that users from snowball sampling give shorter responses and take less time on the survey than users from river sampling. We hope these findings provide insight for other researchers who incorporate social media strategies when fielding surveys.
While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi-domain heterogeneous graphs (MDHGs) remains a formidable challenge due to cross-type feature shifts and intra-domain relation gaps. Existing global feature alignment methods (PCA or SVD) enforce a shared feature space blindly, which distorts type-specific semantics and disrupts original topologies, inevitably leading to "Type Collapse" and "Relation Confusion". To address these fundamental limitations, we propose Decoupled relation Subspace Alignment (DRSA), a novel, plug-and-play relation-driven alignment framework. DRSA fundamentally shifts the paradigm by decoupling feature semantics from relation structures. Specifically, it introduces a dual-relation subspace projection mechanism to coordinate cross-type interactions within a shared low-rank relation subspace explicitly. Furthermore, a feature-structure decoupled representation is designed to decompose aligned features into a semantic projection component and a structural residual term, adaptively absorbing intra-domain variations. Optimized via a stable alternating minimization strategy based on Block Coordinate Descent, DRSA constructs a well-calibrated, structure-aware latent space. Extensive experiments on multiple real-world benchmark datasets demonstrate that DRSA can be seamlessly integrated as a universal preprocessing module, significantly and consistently enhancing the cross-domain and few-shot knowledge transfer capabilities of state-of-the-art GFMs. The code is available at: https://github.com/zhengziyu77/DSRA.
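Schematically (symbols illustrative, not the paper's notation), the decoupled representation described above can be pictured as

\[
X_t \;\approx\; \underbrace{W_t Z}_{\text{semantic projection}} \;+\; \underbrace{R_t}_{\text{structural residual}},
\qquad \mathrm{rank}(Z) \ll d,
\]

where features of each node type $t$ are coordinated through the shared low-rank relation subspace spanned by $Z$, while $R_t$ adaptively absorbs intra-domain variation; alternating minimization over $(W_t, Z, R_t)$ corresponds to the Block Coordinate Descent scheme mentioned.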
The rapid diversification of social media platforms and the increasing restrictions on official APIs have significantly complicated cross-platform analysis. Researchers are often forced to rely on heterogeneous datasets obtained through web scraping and historical archives; however, these datasets often lack structural consistency. Prior to conducting cross-platform social media analyses, one needs to answer three critical questions: (1) What makes platforms different and similar? (2) How were the datasets collected? (3) How can we align the datasets of different platforms to conduct fair analyses? To address these questions, we introduce the Social Media Data Toolkit (SMDT), a comprehensive Python framework designed for the standardization, anonymization, and enrichment of social network datasets. SMDT unifies diverse data structures into a generic schema comprising Communities, Accounts, Posts, Actions, and Entities to facilitate multi-platform research. The framework features a configurable anonymization module to secure Personally Identifiable Information (PII) and an extendable enrichment layer that integrates Large Language Models (LLMs) and network analysis tools for downstream tasks such as stance detection and toxicity scoring, without requiring a separate codebase for each dataset. We demonstrate the versatility of SMDT through four case studies spanning from textual analysis of content to network analysis across platforms. To support reproducible social media research, SMDT is released as an open-source tool featuring detailed documentation and practical guides for researchers at any skill level. It can be accessed at github.com/ViralLab/SMDT and varollab.com/SMDT.
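The five-entity schema, sketched as dataclasses (field choices are our illustration; consult the toolkit's documentation for the real definitions):

from dataclasses import dataclass, field

@dataclass
class Community:
    community_id: str
    platform: str
    name: str

@dataclass
class Account:
    account_id: str                                 # anonymized identifier
    platform: str
    communities: list = field(default_factory=list)

@dataclass
class Post:
    post_id: str
    author_id: str
    text: str
    entities: list = field(default_factory=list)    # hashtags, URLs, mentions

@dataclass
class Action:
    actor_id: str
    target_post_id: str
    kind: str                                       # e.g. "like", "share", "reply"

Mapping each platform's raw export onto one shared shape like this is what lets the same enrichment and analysis code run unchanged across datasets.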
News consumption behavior is shaped by the coupling between temporal dynamics and content selection. This study proposes a multi-scale temporal-content framework and validates it on two large real-world news datasets, MIND and Adressa. Results reveal hierarchical temporal patterns. At the macroscale, Fourier modeling identifies clear circadian rhythms; at the mesoscale, session intervals follow a power-law distribution with $\alpha \approx 1$; and at the microscale, within-session action counts and inter-action intervals follow exponential distributions with $\lambda \approx 0.3$ and $\lambda \approx 0.02$, respectively. Content analysis shows that clicks are mainly driven by historical interests, while this dependence weakens as content diversity increases. Temporal-content coupling further indicates that users' historical interests dominate active time periods in shaping behavior. Preference groups also differ: timeliness and entertainment-oriented users click more frequently and rely more on historical interests, whereas diversified users click less and are more sensitive to content diversity.
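As fitted forms, the three scales read (parameter units follow the underlying measurements, which the abstract does not state):

\[
P(\tau_{\text{session}}) \propto \tau^{-\alpha}, \;\; \alpha \approx 1;
\qquad
P(n_{\text{actions}}) \propto e^{-\lambda_1 n}, \;\; \lambda_1 \approx 0.3;
\qquad
P(\Delta t_{\text{action}}) \propto e^{-\lambda_2 \Delta t}, \;\; \lambda_2 \approx 0.02 .
\]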
Recommendation algorithms have become the dominant mechanism for information distribution on digital platforms, profoundly shaping personalized information consumption environments. However, gender bias, as a significant form of algorithmic discrimination, may cause users to experience unequal exposure within different political information environments. Taking YouTube as a case, we conduct a controlled social-bot field experiment, where male-coded and female-coded profiles are constructed. We track the exposure and click patterns of these bots to analyze their recommendation trajectories. We analyze the distribution of recommended content from two dimensions: allocative bias and structural bias. First, we find statistically significant differences in allocative bias across male-coded and female-coded profiles, particularly in terms of issue distribution, ideological orientation, and political entities. Second, we observe structural bias in the political information environments, characterized by distinct clustering patterns. Additionally, time-series analysis shows that exposure pathways continue to be shaped over time by both communities detected in the co-occurrence network and individual profile-level dynamics. Finally, we construct a simple collaborative-filtering model that reproduces the observed gender bias. We argue that gender bias in recommendation systems is reflected not only in the allocation of political content, but also in how community structures shape these environments, reinforcing societal inequalities and highlighting the need for algorithmic fairness.
The rapid proliferation of harmful and emotionally damaging content on social media platforms has intensified concerns regarding societal harm. While content moderation efforts primarily focus on detecting and removing harmful posts, less attention has been given to mitigating harm through stylistic text transformation while preserving semantic meaning. In this paper, we propose a writing-assistance framework that can reduce societal harm by transforming aggressive, toxic, or emotionally harmful comments into softer, more neutral stylistic forms inspired by Notepad AI, a simple AI writing assistant. Rather than censoring or suppressing speech, we apply controlled stylistic modifications to preserve core informational content while reducing emotional intensity and identity-based attacks. We introduce an Emotion Drift Index (EDI) metric to systematically quantify emotional change and to evaluate how effectively stylistic rewriting reduces harmful interactions in online environments.
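The EDI is introduced above without a formula; one plausible instantiation, flagged as our illustration rather than the paper's definition, measures drift as the distance between emotion-score vectors before and after rewriting:

import numpy as np

def emotion_drift_index(original, rewritten, emotion_scores):
    # emotion_scores: hypothetical classifier returning a probability
    # vector over emotion categories for a text.
    p = np.asarray(emotion_scores(original))
    q = np.asarray(emotion_scores(rewritten))
    return float(np.abs(p - q).sum() / 2)  # total-variation distance in [0, 1]

A drift of 0 means the rewrite left the emotional profile untouched; values near 1 indicate the rewrite replaced the emotional content entirely.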
Fostering coordinated pro-environmental behaviors at scale is a key challenge for climate mitigation. Individual actions only generate meaningful impact when they diffuse widely and become socially coordinated, yet monitoring such processes remains difficult with traditional survey-based tools alone.
In this study, we examine whether large-scale online climate discourse is associated with differences in offline pro-environmental behavior across European regions. We combine geolocated Twitter data from the Climate Change Twitter Dataset (2017-2019) with survey-based measures from the 2019 Special Eurobarometer, focusing on the regional density of climate-related tweets and the average number of self-reported pro-environmental actions.
We find a strong positive association between tweet density and pro-environmental behavior that remains robust to socio-economic controls, alternative spatial aggregations, and a wide range of robustness checks. To move beyond aggregate volume, we further decompose online discourse using Natural Language Processing tools that capture distinct social dimensions. While knowledge exchange shows no clear relationship with offline behavior, the prevalence of activism- and social support-related expressions is negatively associated with pro-environmental actions.
Overall, our results suggest that online climate discourse can serve as an informative, attention-related signal of regional differences in pro-environmental behavior, but that different forms of online engagement relate to offline action in markedly different ways. More broadly, the study highlights the potential of integrating large-scale digital traces with survey data to investigate collective behavior in socio-environmental systems, while remaining explicitly observational in scope.
The theory of planned behavior (TPB) is one of the most influential frameworks in social psychology, stating that a person's behavior is driven by intention, which is primarily shaped by attitude, subjective norms, and perceived behavioral control. Despite its strong empirical support, TPB remains a static conceptual framework without explicit mathematical formulations that capture the temporal evolution of its components. To address this gap, we develop a dynamic agent-based modeling framework that integrates the core principles of TPB with a behavior-to-attitude feedback mechanism. Specifically, we define behaviors based on their feedback effects on attitude and examine when the population undergoes collective transitions by either adopting a beneficial behavior or rejecting a harmful one. Results from our model demonstrate that collective transitions can be effectively controlled by adjusting two key behavioral parameters that reflect agents' attitude influence and decision rationality. These findings provide quantitative insights on TPB, highlighting the key factors that drive collective behavioral transitions and the need for further socio-psychological case studies.
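A minimal agent step in the spirit of this framework (the weights, the logistic choice, and the feedback rule are illustrative, not the paper's calibration):

import numpy as np

def step(attitude, norm, control, w=(0.5, 0.3, 0.2), beta=4.0, eta=0.1, rng=None):
    # TPB core: intention from attitude, subjective norms, perceived control.
    if rng is None:
        rng = np.random.default_rng()
    intention = w[0] * attitude + w[1] * norm + w[2] * control
    # beta plays the role of decision rationality: higher = more deterministic.
    p_act = 1.0 / (1.0 + np.exp(-beta * (intention - 0.5)))
    acted = rng.random() < p_act
    if acted:
        # Behavior-to-attitude feedback: acting reinforces the attitude.
        attitude = attitude + eta * (1.0 - attitude)
    return attitude, acted

Iterating this step over a population of interacting agents, with norms computed from neighbors' behavior, is what produces the collective transitions studied above.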
This paper studies how online discussion shapes and assesses political violence across different settings, particularly how moral evaluation, as a social perception, varies across institutional contexts. We take France and the United States as case studies, both democracies, and three incidents of political violence: the 2020 killing of Samuel Paty in France, the 2025 shooting of Charlie Kirk in the United States, and the 2026 murder of Quentin Deranque in France. Using publicly available posts on Instagram and Facebook, we use GPT-4o-mini for zero-shot classification and social network analysis. Our research demonstrates clear cross-national differences in how moral values are perceived, the emotional intensity expressed, the framing of institutions, and the structure of semantic networks. In France, the discourse tends to focus on the victim's civic role rather than their political affiliation, whilst in the U.S., the conversation is more ideologically divided, with moral judgments frequently reflecting partisan lines. By comparing the two French cases -- a civic victim (Paty) versus the politically-affiliated victim (Deranque) -- we find evidence consistent with the \textit{civic floor hypothesis}, which demonstrates France's institutional framework upholds a cross-partisan civic baseline regardless of the victim's political ties. We conclude by analyzing the implications of computational social perception for multilingual NLP and by exploring moral judgment in cross-national digital political discourse.
Generating realistic synthetic citation, patent, or component dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose the practice of reversing directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel methodological approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models suffer from overfitting by memorising planted community statistics which lead to their failing to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS achieves competitive results against the best-performing baselines while using up to four orders of magnitude fewer parameters and providing a clean framework for explaining and predicting a network's future growth.
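One simple way to induce the citation-like flow described above in a static generator's output: pick any node ordering (e.g., treat the node index as publication order) and orient every edge from the newer endpoint to the older one, which guarantees acyclicity while keeping the planted community structure intact. A sketch:

import networkx as nx

# Static generator output with planted communities...
G = nx.stochastic_block_model([50, 50], [[0.10, 0.01], [0.01, 0.10]], seed=7)
# ...oriented into a citation-like DAG: newer node cites older node.
D = nx.DiGraph((max(u, v), min(u, v)) for u, v in G.edges())
assert nx.is_directed_acyclic_graph(D)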
Quantitative analysis of the kinematic chain in sports motion is essential for performance evaluation and injury prevention. Conventional methods such as the kinematic-sequence (KS) and continuous relative phase (CRP) are confined to adjacent joint pairs and lack a unified framework for whole-body coordination, while segmental power-flow analysis requires force plates and inertial parameters that restrict it to laboratory environments. We apply Complex Hilbert Principal Component Analysis (CHPCA) separately to each motion phase (backswing and downswing) on markerless 3D pose estimation data, extracting the dominant whole-body phase pattern as a single complex eigenvector. The pipeline further includes a fully automatic signal-based phase segmentation (no priors on strike count or rest location) and an extension to 1,079 body-surface mesh vertices, so that the kinematic chain is represented as a continuous phase field across the body. On 14 hammer-striking trials of a single subject, the framework reveals (i) a trunk-anchored global phase architecture, (ii) a functional asymmetry between preparation and execution phases quantified by Mode-1 contribution (45.5% vs. 70.5%) and inter-trial Spearman consistency (0.38 vs. 0.58), and (iii) a consistent reorganisation across both skeletal joints and mesh vertices ($p < 10^{-10}$ on 1,079 vertices). As a methodological consistency check, pairwise phase differences from the Mode-1 eigenvector are compared against CRP on all 190 joint pairs by a permutation test ($\rho = 0.473$, $p = 0.0005$). A correspondence analysis between Mode-1 amplitude and kinetic-energy mobilisation variance further shows a strong positive correlation in the downswing ($\rho \approx 0.71$ on both skeleton and mesh) and no correlation in the backswing, indicating that the proposed framework bridges kinematic and kinetic descriptions of coordination through phase structure.
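The CHPCA core is compact enough to sketch (this omits the paper's automatic phase segmentation and mesh extension): form analytic signals with the Hilbert transform, then take the leading eigenvector of the complex covariance; the phases of its entries place each channel in the shared cycle.

import numpy as np
from scipy.signal import hilbert

def chpca_mode1(X):
    # X: (time, channels) real kinematic signals, e.g. joint coordinates.
    Z = hilbert(X - X.mean(axis=0), axis=0)   # complex analytic signals
    C = Z.conj().T @ Z / len(Z)               # Hermitian covariance
    w, V = np.linalg.eigh(C)                  # eigenvalues ascending
    mode1 = V[:, -1]                          # dominant whole-body pattern
    contribution = w[-1] / w.sum()            # Mode-1 contribution ratio
    return np.angle(mode1), contribution      # per-channel phase + share

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
X = np.stack([np.sin(t + lag) for lag in np.linspace(0, 1, 10)], axis=1)
phases, share = chpca_mode1(X + 0.05 * rng.normal(size=(400, 10)))
print(round(share, 3), np.round(phases, 2))   # high share; phases track the lags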
Climate misinformation continues to erode support for climate action, a challenge that is especially acute in the Global South, where high climate vulnerability intersects with development pressures. In rapidly evolving digital ecosystems, misinformation adapts to platform incentives, shifting from overt rejection of climate science toward more subtle narratives that contest proposed solutions. This study integrates large-scale platform data with qualitative content analysis to examine how information systems shape contemporary climate discourse. Using a dataset of 226,775 climate-related YouTube videos from Brazil (2019-2025), we identify two dominant misinformation strategies: traditional denial that disputes scientific evidence and an emerging "new denial" that accepts climate change while undermining mitigation and adaptation policies. We find a pronounced transition to solution-focused narratives that target renewable energy, climate governance, and environmental advocates. New denial content is produced by a wider array of actors, attracts higher engagement, and employs more sophisticated persuasive techniques. These patterns disproportionately affect regions already facing structural inequities, raise broader concerns about platform accountability in unequal information environments, and suggest the need for governance approaches capable of addressing new denial, a rapidly adapting form of harmful content that often evades existing moderation policies.
Make America Healthy Again (MAHA) is a health-related campaign slogan proposed by Robert F. Kennedy Jr. and later incorporated into the political coalition of President Trump. While #MAHA quickly circulated beyond the campaign itself and became a prominent hashtag for public discussion, it remains unclear whether this public discourse reflected, reshaped, or diverged from the stated agenda of the MAHA campaign. This study presents a large-scale, cross-platform analysis of early #MAHA public discourse between September 2024 and January 2025, using the framework of Agenda-Melding Theory. Drawing on 41,819 #MAHA-related posts, this study combines structural topic modeling, interrupted time-series analysis, and AI-assisted data annotation to examine the thematic structure and temporal dynamics. The most prominent finding is the substantial disconnect between #MAHA public discourse and the stated MAHA agenda: 81.3% of posts did not engage any of the MAHA campaign's five stated priorities. There were also pronounced cross-platform differences, with online platforms clustering into three broad discourse environments: (a) grassroots partisan-support spaces, (b) informational sources, and (c) health-focused spaces. #MAHA functioned less as a unified campaign agenda than as a symbolic frame interpreted differently across platforms. More broadly, this study provides useful empirical insight into how campaign slogans are reinterpreted and how public agendas are formed, amplified, and transformed in fragmented digital environments.
In recent years, e-commerce platforms have become one of the most prominent examples of large-scale interaction networks, where understanding influence dynamics among users, products, and digital entities is essential for applications such as online marketing, recommendation systems, and customer behavior analysis. A key challenge in these platforms is that interactions are often uncertain, noisy, and inferred from implicit signals rather than explicitly defined relationships. This uncertainty cannot be effectively captured using deterministic network models...
14 of 17 pre-registered tests across biology and other systems confirm the ordering; an independent C. elegans check yields r = 0.777.
Hub importance scores in multilayer networks persist more strongly between functionally similar layers than dissimilar ones. We call this the Functional Proximity Law and test it across 17 pre-registered experiments: 12 canonical domains (9 confirmed, 3 denied; molecular biology, neuroscience, computer systems, ecology, linguistics) plus 5 external validations on independently-authored datasets. Eight canonical domains reach p < 0.05 individually; the directional inequality holds in all 9 confirmed. Three DENIED domains reveal named structural boundary conditions that narrow the law's scope. A fully external validation on the C. elegans connectome -- where both data and layer definitions are independent of the authors -- yields r = 0.777 (p = 0.004). Binomial probability of 14/17 pre-registered confirmations by chance: p ~ 0.006. The law is falsifiable, makes testable directional predictions, and identifies the structural conditions under which it fails.
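The quoted binomial figure can be checked directly; assuming a fair-coin null of p = 0.5 per pre-registered test (our reading of the abstract), the upper tail for 14 or more confirmations out of 17 is:

```python
# Checking the reported tail probability: 14 or more confirmations out of
# 17 pre-registered tests under a fair-coin null (p = 0.5 per test is our
# assumed null, as implied by the abstract).
from math import comb

p = sum(comb(17, k) for k in range(14, 18)) / 2**17
print(f"p = {p:.4f}")   # p = 0.0064, consistent with the reported p ~ 0.006
```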
The concept of homophily is pervasive in online social media. While many empirical studies have relied on external sociodemographic traits to investigate it, significantly less is known about homophily at the cognitive level, that is, at the level of shared opinions or values. For such "value homophily", in this paper we study interval-based patterns of opinion homophily from a bounded confidence perspective. We consider three heterogeneous datasets from Reddit and Twitter covering polarizing issues, with user opinions quantified via sentiment analysis and fact-checking, and analyze the interaction networks formed by weaker (reply-based) and stronger (follow-based) social ties. Our findings show that users' interaction neighborhoods are significantly more concentrated in opinion space than expected by chance, with tie strength and issue polarization further amplifying this effect. Moreover, users often exhibit asymmetric tolerance ranges, with asymmetry typically directed toward locally mainstream positions rather than more radical or opposing ones. These findings support a bounded confidence interpretation of online value homophily.
Large language models (LLMs) frequently produce \emph{detail hallucinations} when processing long regulatory documents, including subtle errors in threshold values, units, scopes, obligation levels, and conditions that preserve surface plausibility while corrupting safety-critical parameters. We formalize this phenomenon through a fine-grained \emph{Detail Error Taxonomy} of five error types and introduce \textbf{DetailBench}, a benchmark built from 172 real regulatory documents and 150 synthetic documents spanning three jurisdictions, with human-annotated detail-level ground truth comprising 13,000 preference pairs. We propose \textbf{DetailDPO}, a targeted preference optimization framework that constructs contrastive pairs differing in exactly one detail dimension, concentrating DPO gradient signal on detail-bearing tokens. We provide theoretical analysis showing why \emph{minimal detail perturbation} pairs yield gradient concentration under mild assumptions. Experiments on the Qwen2.5 family (7B, 14B, 72B) and Llama-3.1-8B across three context-length tiers (8K--64K tokens) show that DetailDPO reduces the Detail Error Rate by 42--61\% relative to baselines, with consistent gains across all five error types and cross-domain transfer to financial and medical documents.
Many processes related to status, power, and influence within social networks have been modeled using forced linear diffusion models; examples include the highly successful Friedkin-Johnsen model of social influence, the status/power scores of Katz and Bonacich, and the widely used network autocorrelation model. While a basic assumption of such models is that the impact of one individual on another through any given path falls exponentially with path length, the total impact of the first individual on the second involves contributions from walks of all lengths; thus, while total impact is expected to decline with network distance, the relationship is not trivial. Here, we provide an approximate solution for the total impact of one node on another as a function of network distance, showing that the total impact is given to first order by a product of eigenvector centrality scores together with an expression in terms of the graph spectrum (eigenvalues of the adjacency matrix) that falls exponentially with distance. We also show how this solution can be refined using higher-order eigenvectors of the adjacency matrix. A numerical study on interpersonal networks drawn from educational settings verifies an average exponential decline in impact strength under the linear diffusion model, and shows that the first-order eigenvector approximation can often be a good proxy for total impact as obtained from the exact solution. This suggests a simple model that can be used to approximate total impact for social influence or status processes in a range of settings.
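As a rough illustration of the result, the sketch below compares the exact total-impact matrix of a forced linear diffusion, (I − αA)^{-1}, against a rank-one eigenvector proxy; the proxy is a simplification of the paper's first-order expression, and the graph and decay rate are arbitrary choices of ours:

```python
# Exact total impact of a forced linear diffusion vs. a rank-one
# eigenvector proxy (a simplification of the paper's first-order term).
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
lam, V = np.linalg.eigh(A)
lam1, v1 = lam[-1], np.abs(V[:, -1])             # leading eigenpair (Perron vector)
alpha = 0.5 / lam1                               # ensures the walk sum converges

T = np.linalg.inv(np.eye(len(A)) - alpha * A)    # total impact: walks of all lengths
proxy = np.outer(v1, v1) / (1 - alpha * lam1)    # product of eigenvector centralities

i, j = np.triu_indices(len(A), k=1)
print(np.corrcoef(np.log(T[i, j]), np.log(proxy[i, j]))[0, 1])
```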
Psychiatric disorders have traditionally been conceptualized as latent conditions producing observable symptoms, but recent studies suggest that psychopathology may emerge from symptom interactions. Psychometric network models capture these relations by focusing on pairwise associations but overlook higher-order dependencies arising among groups of variables. These dependencies may reflect synergistic mechanisms, where joint symptom configurations convey more information than pairwise relations, or redundancy, where information overlaps. We introduce an information-theoretic multiplex hypergraph framework to identify and compare higher-order interactions in eating disorders data, across diagnostic groups (e.g., anorexia nervosa). Higher-order structures are quantified using $\Omega$-information, a measure that captures the balance between redundancy and synergy. To address the combinatorial growth of candidate subsets, the multiple-testing burden, and estimation instability, we propose a structured pipeline comprising: (i) targeted candidate selection based on dyadic network topology and theory-driven subscale information; (ii) a three-stage inferential procedure combining null-model testing with bootstrap robustness assessment; and (iii) the construction and analysis of diagnosis-layered, synergistic and redundant multiplex hypergraphs. Results highlight how synergy captures the emergent, higher-order organization of diagnoses, revealing both a stable transdiagnostic core and diagnosis-specific ways in which these domains combine. By contrast, redundancy is confined to eating and body-image related content, marking reinforcement rather than broader symptom integration.
Algorithmic systems increasingly function as epistemic infrastructures that govern the conditions of interpretative access and social belief. Yet, mainstream auditing strategies operationalize fairness primarily in predictive terms - error rates, calibration, or group-level parity - leaving epistemic harms under-theorized and under-measured. We propose a quantitative framework for evaluating forms of epistemic injustice in algorithmic environments. First, we introduce a deficit-based template that models epistemic injustices as gaps between ideal and realized conditions across features such as credibility, uptake, and epistemic agency. We map these deficits to concrete stages of algorithmic mediation, showing how epistemic injustice can persist even when standard fairness constraints are satisfied. Drawing on distributive fairness indices, we distinguish two evaluation stances: resource inequality, where indices are applied to distributions of epistemic goods directly, and capability/rights inequity, where indices are applied to output-induced epistemic opportunity. We provide an epistemic translation of canonical indices, illustrating how they diagnose complementary signatures of unfairness - such as exclusionary tails and hierarchical concentration - and support longitudinal auditing under iterative deployment. We also provide a simulation study of a recommender-mediated opinion dynamics setting, showing how the proposed indices capture the evolution of epistemic unfairness under repeated platform interventions. The result is a measurement framework that makes the epistemic dimension of algorithmic harms explicit for system design and evaluation.
We present MediaGraph, a network-theoretic framework for analyzing reporting preferences in news media through entity co-occurrence networks. Using articles from four Indian news sources, two mainstream (The Times of India and The Indian Express) and two fringe outlets (dna and firstpost), we construct source-specific co-occurrence networks around the 2020-21 and 2024 Farmers Protests. We analyze these networks along three network-theoretic axes: centrality, community structure, and co-occurrence link predictability. Link predictability is a novel metric we propose that quantifies the consistency of entity associations over time using a GraphSAGE-based model. Our results reveal significant differences in reporting preferences across sources for the same event, and a consistent under-representation of farmer leaders across sources. By shifting the focus from textual signals to relational structures, our approach offers a scalable, label-independent perspective on media analysis and introduces link predictability as a complementary measure of reporting behavior.
When users strongly prefer similar opinions, a bit of structural similarity keeps networks connected and opinions moderate.
Recommendation algorithms, used in online social networks, shape interactions between users. In particular, link-recommendation algorithms suggest new connections and affect how individuals interact and exchange information. These algorithms' efficacy relies on key mechanisms governing the creation of social ties, such as triadic closure and homophily. The first is achieved through structural similarity and represents a heightened chance of recommending users to one another given mutual friends; the second is related to opinion similarity and conveys an increased chance of recommending a connection given similar individual characteristics. These two mechanisms jointly shape the evolution of social networks and behaviors unfolding over them. Their combined effect on the co-evolution of opinion and structure dynamics remains, however, poorly understood. Here, we study how social networks and opinions co-evolve given the joint effect of rewiring based on opinion and structural similarity. We show that both similarity metrics lead to polarized states, but differ in how they impact network fragmentation and opinion diversity. While strongly relying on opinion similarity leads to a higher variation of opinion, rewiring via network similarity leads to a larger number of (dis)connected components, resulting in fragmented networks that lean towards one of the signed opinions. Under strong homophilic settings, introducing a weak dependence on structural similarity prevents network fragmentation and favors moderate opinions. This work can inform the design of new recommender algorithms that explicitly account for interacting social and recommendation mechanisms, with the potential to foster moderate opinion coexistence even in inherently polarizing settings.
Continuous proximity scores from node-hyperedge resource flows lift performance on link prediction, node ranking, and community detection.
Hypergraphs serve as an effective tool widely adopted to characterize higher-order interactions in complex systems. The most intuitive and commonly used mathematical instrument for representing a hypergraph is the incidence matrix, in which each entry is binary, indicating whether the corresponding node belongs to the corresponding hyperedge. Although the incidence matrix has become a foundational tool for hypergraph analysis and mining, we argue that its binary nature is insufficient to accurately capture the complexity of node-hyperedge relationships arising from the fact that different hyperedges can contain vastly different numbers of nodes. Accordingly, based on the resource allocation process on hypergraphs, we propose a continuous-valued matrix to quantify the proximity between nodes and hyperedges. To verify the effectiveness of the proposed proximity matrix, we investigate three important tasks in hypergraph mining: link prediction, vital nodes identification, and community detection. Experimental results on numerous real-world hypergraphs show that simply designed algorithms centered on the proximity matrix significantly outperform benchmark algorithms across these three tasks.
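A plausible reading of such a construction is sketched below: resource spreads from nodes to their hyperedges and back, with degree and hyperedge-size normalization producing continuous node-hyperedge scores. The exact propagation rule is the paper's; this round-trip variant is ours for illustration:

```python
# Illustrative resource-allocation proximity between nodes and hyperedges:
# resource flows node -> hyperedge -> node -> hyperedge with degree and
# size normalization. The paper's exact propagation rule may differ.
import numpy as np

B = np.array([[1, 1, 0],        # incidence matrix: rows are nodes,
              [1, 0, 1],        # columns are hyperedges, entries binary
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

node_deg = B.sum(axis=1, keepdims=True)   # number of hyperedges per node
edge_size = B.sum(axis=0, keepdims=True)  # number of nodes per hyperedge

W_ne = B / node_deg                       # each node splits resource over its edges
W_en = (B / edge_size).T                  # each hyperedge splits over its members
P = W_ne @ W_en @ W_ne                    # continuous node-hyperedge proximity
print(np.round(P, 3))                     # rows still sum to 1, but entries are graded
```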
A hypergraph is called uniform when every hyperedge contains the same number of vertices; otherwise, it is called non-uniform. In the real world, many systems give rise to non-uniform hypergraphs, such as email networks and co-authorship networks. A uniform hypergraph has a natural one-to-one correspondence with its adjacency tensor. In 2019, Benson proposed the eigenvector centrality of uniform hypergraphs via its adjacency tensor. In this paper, we define an adjacency tensor for hypergraphs and propose the eigenvector centrality for hypergraphs. When the hypergraph is uniform, our proposed eigenvector centrality reduces to Benson's. When each edge of the uniform hypergraph contains exactly two vertices, our proposed centrality reduces to the eigenvector centrality of graphs. We conducted experiments on several real-world hypergraph datasets. The results show that, compared to traditional centrality measures, the proposed measure offers a distinct perspective on vertex importance and effectively identifies important vertices.
Network community detection is usually treated as an unsupervised learning problem. Given a network, the aim is to partition it using some general-purpose algorithm. In this paper we instead treat community detection as a hypothesis testing problem. Given a network, we examine the evidence for specific community structure in the observed network compared to a null model. To do this we define an appropriate test statistic, analogous to a z-score, and several null models derived from maximising entropy under different constraints in the canonical ensemble. We demonstrate the application of this method on real and synthetic data and contrast our method to Bayesian approaches based on the stochastic block model. We demonstrate that this method gives definitive answers to concrete questions, which can be more useful to analysts than the output of a generic algorithm.
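The flavor of the approach can be sketched as follows: score an observed partition by a z-like statistic against a degree-preserving null. The paper derives its null models analytically from maximum-entropy canonical ensembles; the empirical edge-swap null below is a common stand-in, not the authors' method:

```python
# Sketch of a z-score-style test for community structure: observed
# modularity vs. a degree-preserving rewiring null (an empirical
# stand-in for the paper's analytic maximum-entropy ensembles).
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
parts = [{n for n in G if G.nodes[n]["club"] == c}
         for c in ("Mr. Hi", "Officer")]
q_obs = nx.community.modularity(G, parts)

null = []
for seed in range(200):                    # modularity under degree-preserving swaps
    H = G.copy()
    nx.double_edge_swap(H, nswap=5 * H.number_of_edges(),
                        max_tries=10**5, seed=seed)
    null.append(nx.community.modularity(H, parts))

z = (q_obs - np.mean(null)) / np.std(null)
print(f"modularity = {q_obs:.3f}, z = {z:.1f}")
```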
During major political events, social media platforms encounter increased systemic risks. However, it is still unclear if and how they adjust their moderation practices in response. The Digital Services Act Transparency Database provides, for the first time, an opportunity to systematically examine content moderation at scale, allowing researchers and policymakers to evaluate platforms' compliance and effectiveness, especially at high-stakes times. Here we analyze 1.58 billion self-reported moderation actions by the eight largest social media platforms in Europe over an eight-month period surrounding the 2024 European Parliament elections. We found that platforms did not exhibit meaningful signs of adaptation in moderation strategies as their self-reported enforcement patterns did not change significantly around the elections. This raises questions about whether platforms made any concrete adjustments, or whether the structure of the database may have masked them. On top of that, we reveal that initial concerns regarding platforms' transparency and accountability still persist one year after the launch of the Transparency Database. Our findings highlight the limits of current self-regulatory approaches and point to the need for stronger enforcement and better data access mechanisms to ensure that online platforms meet their responsibilities in protecting the democratic processes.
Social learning networks (SLNs) are graphical representations that capture student interactions within educational settings (e.g., a classroom), with nodes representing students and edges denoting interactions. Accurately predicting future interactions in these networks (i.e., link prediction) is crucial for enabling effective collaborative learning, supporting timely instructional interventions, and informing the design of effective group-based learning activities. However, traditional link prediction approaches are typically tuned to general online social networks (OSNs), often overlooking the complex, non-Euclidean, and dynamically evolving structure of SLNs, thus limiting their effectiveness in educational settings. In this work, we propose a graph neural network (GNN) framework that jointly considers the temporal evolution within classrooms and spatial aggregation across classrooms to perform link prediction in SLNs. Specifically, we analyze link prediction performance of GNNs over the SLNs of four distinct classrooms across their (i) temporal evolutions (varying time instances), (ii) spatial aggregations (joint SLN analysis), and (iii) varying spatial aggregations at varying temporal evolutions throughout the course. Our results indicate statistically significant performance improvements in the prediction of future links as the courses progress temporally. Aggregating SLNs from multiple classrooms generally enhances model performance as well, especially in sparser datasets. Moreover, we find that jointly leveraging both the temporal evolution and spatial aggregation of SLNs significantly outperforms conventional baseline approaches that analyze classrooms in isolation. Our findings demonstrate the efficacy of educationally meaningful link predictions, with direct implications for early-course decision-making and scalable learning analytics in and across classroom settings.
The optimal number of new products to explore together rises with their potential but does not depend on how likely each is to sell individually.
We study online learning for new products on a platform that makes capacity-constrained assortment decisions on which products to offer. For a newly listed product, its quality is initially unknown, and quality information propagates through social learning: when a customer purchases a new product and leaves a review, its quality is revealed to both the platform and future customers. Since reviews require purchases, the platform must feature new products in the assortment ("explore") to generate reviews to learn about new products. Such exploration is costly because customer demand for new products is lower than for incumbent products. We characterize the optimal assortments for exploration to minimize regret, addressing two questions. (1) Should the platform offer a new product alone or alongside incumbent products? The former maximizes the purchase probability of the new product but yields lower short-term revenue. Despite the lower purchase probability, we show it is always optimal to pair the new product with the top incumbent products. (2) With multiple new products, should the platform explore them simultaneously or one at a time? We show that the optimal number of new products to explore simultaneously has a simple threshold structure: it increases with the "potential" of the new products and, surprisingly, does not depend on their individual purchase probabilities. We also show that two canonical bandit algorithms, UCB and Thompson Sampling, both fail in this setting for opposite reasons: UCB over-explores while Thompson Sampling under-explores. Our results provide structural insights on how platforms should learn about new products through assortment decisions.
This paper examines Web3 ecosystems not merely as markets for digital assets, but as networked social spaces where economic transactions give rise to enduring social ties, shared narratives, and collective identities. Leveraging large-scale data mining of fused on-chain blockchain transactions and off-chain social media activity, we analyze over one hundred NFT collections to uncover how different forms of participation structure community formation in decentralized environments. Using network analysis, we identify distinct ecosystem roles, such as long-term holders, active traders, and short-term speculators, and demonstrate how each produces markedly different network topologies, levels of cohesion, and pathways for influence. We complement this structural analysis with discourse analysis of social media engagement, revealing how narrative production, visibility, and sustained interaction persist even as transactional activity declines. Our findings show that communities centered on holding behavior evolve from transactional networks into socially embedded ecosystems characterized by dense ties, decentralized influence, and ongoing cultural participation, while trader- and speculator-dominated networks remain fragmented and transactional. By linking network structure with discursive dynamics, this study provides a sociotechnical framework for understanding how value, identity, and inequality are negotiated in Web3 spaces. The approach offers a scalable method for detecting patterns of inclusion, exclusion, and representational imbalance, advancing network-based research on digital communities beyond purely economic or technical accounts.
Community structure is prevalent in real-world networks, with empirical studies revealing heterogeneous distributions where a few dominant majority communities coexist with many smaller groups. These small-scale groups, which we term minority communities, are critical for understanding network organization but pose significant challenges for detection. Here, we investigate the detectability of minority communities from a theoretical perspective using the Stochastic Block Model. We identify three distinct phases of community detection: the detectable phase, where overall community structure is recoverable but minority communities are merged into majority groups; the distinguishable phase, where minority communities form a coherent group separate from the majority but remain unresolved internally; and the resolvable phase, where each minority community is fully distinguishable. These phases correspond to phase transitions at the Kesten-Stigum threshold and two additional thresholds determined by the eigenvalue structure of the signal matrix, which we derive explicitly. Furthermore, we demonstrate that spectral clustering with the Bethe Hessian exhibits significantly weaker detection performance for minority communities compared to belief propagation, revealing a specific limitation of spectral methods in identifying fine-grained community structure despite their capability to detect macroscopic structures down to the theoretical limit.
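For context, the classical Kesten-Stigum condition for the symmetric SBM with q equal-size groups is shown below; the paper's two additional thresholds for minority communities generalize this eigenvalue condition and are not reproduced here.

```latex
% Background: the Kesten-Stigum detectability condition for the symmetric
% SBM with q equal-size groups (the paper's further thresholds for
% minority communities generalize this eigenvalue condition).
\[
  \lvert c_{\mathrm{in}} - c_{\mathrm{out}} \rvert > q \sqrt{c},
  \qquad
  c = \frac{c_{\mathrm{in}} + (q - 1)\, c_{\mathrm{out}}}{q},
\]
% where $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ are the expected within-
% and between-group degrees; below this threshold no algorithm can detect
% the planted partition better than chance.
```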
Dynamic signed networks (DSNs) are common in online platforms, where time-stamped positive and negative relations evolve over time. A core task in DSNs is dynamic edge prediction, which forecasts future relations by jointly modeling edge existence and polarity (positive, negative, or non-existent). However, existing dynamic signed network embedding (DSNE) methods often entangle positive and negative signals within a shared temporal state and rely on node-specific temporal trajectories, which can obscure polarity-asymmetric dynamics and harm inductive generalization, especially under cold-start evaluation. We study an inductive setting where each test edge contains at least one endpoint node held out from training, while its interactions prior to the prediction time are available as historical evidence. The model must therefore infer representations for unseen nodes solely from such limited history. We propose IDP-DSN, an Inductive Dual-Polarity framework for Dynamic Signed Networks. IDP-DSN maintains sign-selective memories to model positive and negative temporal dynamics separately, performs history-only neighborhood inference for unseen nodes (instead of learned node-wise trajectories), and enforces polarity-wise static--dynamic disentanglement via an orthogonality regularizer. Experiments on BitcoinAlpha, BitcoinOTC, Wiki-RfA, and Epinions demonstrate consistent improvements over the strongest baselines, achieving relative Macro-F1 gains of 16.8/23.4%, 16.9/24%, 30.1/25.5%, and 18.7/28.9% in the transductive/inductive settings, respectively. These results highlight the effectiveness of IDP-DSN on DSNs, particularly under inductive cold-start evaluation for dynamic signed edge prediction.
Social simulation is essential for understanding collective human behavior by modeling how individual interactions give rise to large-scale social dynamics. Recent advances in large language models (LLMs) have enabled multi-agent frameworks with human-like reasoning and communication capabilities. However, existing LLM-based simulations treat social networks as fixed communication scaffolds, failing to leverage the structural signals that shape behavioral convergence and heterogeneous influence in real-world systems, which often leads to inefficient and unrealistic dynamics. To address this challenge, we propose TopoSim, a unified topology-aware social simulation framework that explicitly integrates structural reasoning into agent interactions along two complementary dimensions. First, TopoSim aligns agents with similar structural roles and interaction contexts into shared backbone units, enabling coordinated updates that reduce redundant computation while preserving emergent social dynamics. Second, TopoSim models social influence as a structure-induced signal, introducing heterogeneous interaction patterns grounded in network topology rather than uniform influence assumptions. Extensive experiments across three social simulation frameworks and diverse datasets demonstrate that TopoSim achieves comparable or improved simulation fidelity while reducing token consumption by 50-90%. Moreover, our approach more accurately reproduces key structural phenomena observed in real-world social systems and exhibits strong generalization and scalability.
Climate hazards in Hawai'i are increasing in both frequency and severity, with varying impacts on vulnerable communities. This paper presents the Community Census and Spatial Visualization Index (CCSVI), a web-based geospatial visualization platform that integrates climate hazard data with socioeconomic and infrastructural datasets. This system enables users to explore the correlation between environmental risks and social vulnerability through interactive mapping and layered data visualizations. Social vulnerability and climate hazard data are commonly collected separately, leaving the datasets disjointed and difficult to combine or analyze directly; this fragmentation also makes the data hard for non-expert users to interpret. Additionally, many existing tools focus on only one of these data types, limiting their interactivity. CCSVI addresses the lack of accessible, unified, and interactive systems for analyzing the relationship between climate hazards and social vulnerabilities across the state of Hawai'i. The platform assists decision-makers, researchers, and community members in identifying at-risk populations, improving disaster preparedness, and informing climate adaptation strategies.
Online political hostility is pervasive, yet it remains unclear how toxicity varies across campaign issues and political ideology, and what psychosocial signals and framing accompany toxic expression online. In this work, we present a large-scale analysis of discourse on X (Twitter) during the five weeks surrounding the 2024 U.S. presidential election. We categorize posts into 10 major campaign issues, estimate the ideology of posts using a human-in-the-loop LLM-assisted annotation process, detect harmful content with an LLM-based toxicity detection model, and then examine the psychological drivers of toxic content. We use these annotated data to examine how harmful content varies across campaign issues and ideologies, as well as how emotional tone and moral framing shape toxicity in election discussions. Our results show issue heterogeneity in both the prevalence and intensity of toxicity. Identity-related issues displayed the highest toxicity intensity. As for specific harm categories, harassment was most prevalent and intense across most of the issues, while hate concentrated in identity-centered debates. Partisan posts contained more harmful content than neutral posts, and ideological asymmetries in toxicity varied by issue. In terms of psycholinguistic dimensions, we found that toxic discourse is dominated by high-arousal negative emotions. Left- and right-leaning posts often exhibit similar emotional profiles within the same issue domain, suggesting emotional mirroring. Partisan groups frequently rely on overlapping moral foundations, while issue context strongly shapes which moral foundations become most salient. These findings provide a fine-grained account of toxic political discourse on social media and highlight that online political toxicity is highly context-dependent, underscoring the need for issue-sensitive approaches to measuring and mitigating it.
Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet it remains poorly understood which of their biases in these tasks are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (\textit{general}, \textit{popular}, \textit{engaging}, \textit{informative}, \textit{controversial}, \textit{neutral}). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.
Influence maximization (IM) is a fundamental problem in complex network analysis, with a wide range of real-world applications. To date, existing approaches to influential node identification in IM have predominantly relied on standard graphs, failing to capture higher-order intrinsic interactions embedded in many real-world systems. Hypergraphs can be employed to better capture higher-order interactions. However, using hypergraphs may lead to an excessively large search space and increased complexity in modeling cascading dynamics, making it challenging to accurately identify influential nodes. Therefore, in this study, we propose a new hypergraph-modeled IM method, based on the Discrete Particle Swarm Optimization algorithm and the threshold model. In the proposed method, a particle (i.e., a candidate solution) represents the selection information of seed nodes, and the fitness function is designed to accurately and efficiently evaluate the influence of seed nodes via a two-layer local influence approximation. We also propose a degree-based initialization strategy to improve the quality of initial solutions and develop rules for updating particles' velocity and position, incorporated with a local search to drive particles toward better solutions. Experimental results demonstrate that the proposed method outperforms baseline methods on both synthetic and real-world hypergraphs. In addition, ablation studies validate the effectiveness of both the local search and the initialization strategies.
Home eviction poses a significant threat to housing stability, a critical determinant of health. This study examines the relationship between eviction and health and substance use within the unhoused population of King County, Washington. Using a sample of 1,106 individuals experiencing homelessness, we employed a quasi-experimental design to compare the health outcomes of those who have experienced eviction with those who have not. Our findings reveal eviction is associated with an 8.3 percentage point increase (SE = 0.039) in the likelihood of reporting poor general health and a 9.5 percentage point increase (SE = 0.032) in substance use disorder. No significant effect was found for mental health outcomes. While these results highlight the severe health risks linked to eviction, further research with more precise estimates is necessary to better understand long-term effects. These findings contribute to the growing evidence of how home eviction undermines the well-being of vulnerable populations.
Recommender systems on social media increasingly mediate how users encounter mental health content, yet it remains unclear whether they distinguish help-seeking from distress expression. We conduct a controlled 7-day audit of TikTok's "For You" page using 30 fresh accounts and LLM-guided agents that vary initial search framing (distress- vs. help-initiated) and interaction strategy (engaged, avoidant, passive). Across 8,727 recommended videos, interaction behavior dominates exposure outcomes: engagement rapidly saturates feeds with mental health content (~45% of daily recommendations), while avoidance and passive viewing reduce but do not eliminate exposure (~11-20%). Search framing mainly shifts composition rather than volume--help-initiated searches yield more potentially supportive material, yet potentially harmful content persists at low but non-zero levels, including content in the Suicide/Self-Harm category. These findings suggest limited sensitivity to user intent signals in TikTok's recommendations and motivate context-aware safeguards for sensitive topics.
We investigate platform-native citation farming on ResearchGate by analyzing almost 3000 papers uploaded by five suspected boosting-service provider accounts. From the uploaded papers and associated metadata, we construct both paper-level and author-level citation networks. We introduce an interpretable structural signal for coordinated boosting, equal references groups: clusters of papers with equal reference lists. We find that many papers from our collection exhibit this motif, that is, they disproportionately cite a small set of authors, consistent with coordinated or automated boosting rather than independent scholarly practice. Finally, we show that for some authors in our dataset a substantial share of their citations can be attributed to these suspicious groups. A different citation network was used to validate the rareness of such motifs in legitimate scientific work.
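The motif itself is simple to operationalize: fingerprint each paper by its reference list and group papers whose lists coincide. The sketch below uses hypothetical paper identifiers:

```python
# Operationalizing "equal references groups": fingerprint each paper by
# its reference list and cluster identical lists. Identifiers are
# hypothetical.
from collections import defaultdict

papers = {
    "paper_A": ["cited_w1", "cited_w2", "cited_w3"],
    "paper_B": ["cited_w3", "cited_w1", "cited_w2"],   # same set, different order
    "paper_C": ["cited_w4", "cited_w5"],
}

groups = defaultdict(list)
for paper, refs in papers.items():
    groups[frozenset(refs)].append(paper)              # order-insensitive fingerprint

equal_ref_groups = [g for g in groups.values() if len(g) > 1]
print(equal_ref_groups)                                # [['paper_A', 'paper_B']]
```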
Open-source large language models have made platforms such as Hugging Face central hubs for decentralized AI innovation. Yet these ecosystems are shaped not only by collaboration, but also by competition for priority and community attention. Drawing on Hill and Stein's Race-to-the-Bottom framework, this study extends the logic of project potential, maturation, competition, and quality from scientific production to open-source LLM ecosystems, where prominent base models attract concentrated derivative entry under rapid and highly visible platform feedback. Using a large-scale sample of derivative models on Hugging Face, we find that later releases and more crowded competitive environments are both associated with weaker community recognition, even after accounting for differences in model and ecosystem prominence. These findings suggest that competition for priority remains an important organizing force in open-source LLM ecosystems, shaping which derivative innovations receive community recognition.
Hazard-model analysis of 36,000 judge-years shows ties to the president raise promotion odds far more than performance or network position.
Judicial promotions shape the composition of higher courts, yet their determinants remain poorly understood. This paper examines promotion from U.S. District Courts to Courts of Appeals using a discrete-time hazard framework that models annual promotion probability. Using a judge-year panel covering over 36,000 observations from 1930 to the present, we incorporate career timing, political alignment, elite credentials, and judicial performance measures. Promotion probabilities follow a life-cycle pattern and are strongly influenced by political alignment between judges and presidents ($\beta$ = 2.12, p < 0.001). Elite credentials and productivity increase promotion likelihood, while higher reversal rates reduce it. Citation network centrality exhibits a meaningful association ($\beta$ = 0.230, p = 0.025) that operates independently of elite credentials. Promotion outcomes reflect a dynamic process shaped by timing, politics, elite networks, and performance signals, with political considerations dominating but not eclipsing judicial behavior.
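A discrete-time hazard framework of this kind amounts to a binary regression on person-period rows, one row per judge-year with the outcome indicating promotion in that year. The sketch below uses a small synthetic panel with hypothetical column names, not the paper's data or specification:

```python
# A discrete-time hazard model as logistic regression on judge-year rows.
# The synthetic panel and column names are hypothetical, not the paper's.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "tenure": rng.integers(1, 30, n),              # years since appointment
    "aligned_president": rng.integers(0, 2, n),    # same-party president this year
    "reversal_rate": rng.uniform(0, 0.2, n),
})
logit_p = -5 + 2.1 * df.aligned_president - 3 * df.reversal_rate
df["promoted"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.logit("promoted ~ tenure + I(tenure**2) + aligned_president"
                  " + reversal_rate", data=df).fit(disp=0)
print(model.params)   # coefficients are log-odds of promotion in a given year
```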
A conditional player count below which queues and matches fail can mark the shift to abandoned virtual spaces under limited updates or fixed service horizons.
Online multiplayer games are population-dependent systems whose playability depends on the continued presence of an active player base. We propose a formal framework for reasoning about viability collapse in such systems under explicit scope conditions. The framework introduces a conditional Critical Mass Threshold $\Phi$, below which queue times, match quality, or role balance render a game operationally non-viable under a fixed operational profile; an uninhabited runtime taxonomy spanning pre-launch and post-decline states; and a Nostalgia Inversion Point $\psi$, at which cultural memory exceeds active participation. We model post-peak decline using a threshold-sensitive hazard model and show how games in the modeled class can cross below viability under finite official-service horizons or bounded novelty under continuing exposure. Case studies based on public concurrent-player data are used illustratively rather than as formal validation. The contribution of the paper is not a universal law, but a formal vocabulary, a collapse model, and an empirical agenda for studying online game decline, preservation risk, and uninhabited virtual worlds.
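One way to read the threshold-sensitive hazard is sketched below: per-period collapse risk stays near a baseline while the active population is above $\Phi$ and rises steeply once it falls below. The logistic form and all constants are our illustrative choices, not the paper's model:

```python
# Illustrative threshold-sensitive hazard: collapse risk per period stays
# near a baseline above the critical mass Phi and rises steeply below it.
# The logistic form and constants are our choices, not the paper's model.
import numpy as np

def collapse_hazard(players, phi, base=0.01, steepness=5.0):
    return base + (1 - base) / (1 + np.exp(steepness * (players / phi - 1)))

pop = np.array([50_000, 10_000, 5_000, 2_500, 1_000])   # concurrent players
print(np.round(collapse_hazard(pop, phi=5_000), 3))     # 0.01 0.017 0.505 0.925 0.982
```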
The rapid growth of open-source large language models (LLMs) has created a complex ecosystem of model inheritance and reuse. However, existing research has focused mainly on descriptive analyses of lineage evolution, with limited attention to identifying which models play a disruptive role in shaping subsequent development. Using metadata from 2,556,240 models on Hugging Face, this study reconstructs a large-scale lineage network and introduces the Model Disruption Index (MDI) to distinguish between models that reinforce existing technological trajectories and those that become new bases for later development. The results show that most models in the open-source LLM community are consolidative rather than disruptive, reflecting a highly concentrated and path-dependent evolutionary structure. Further analyses suggest that disruptive positions are more likely to emerge among large-scale models and through finetuning strategies. Overall, this study provides a new perspective for identifying disruptive models and understanding uneven technological development in open-source LLM ecosystems.
Large Language Models (LLMs) have demonstrated an unprecedented ability to simulate human-like social behaviors, making them useful tools for simulating complex social systems. However, it remains unclear to what extent these simulations can be trusted to accurately capture key social mechanisms, particularly in highly unbalanced contexts involving minority groups. This paper uses a network generation model with controlled homophily and class sizes to examine how LLM agents behave collectively in multi-round debates. Our findings highlight a particular directional susceptibility that we term \textit{agreement drift}, in which agents are more likely to shift toward specific positions on the opinion scale. Overall, our findings highlight the need to disentangle structural effects from model biases before treating LLM populations as behavioral proxies for human groups.
Community Notes is X's crowdsourced fact-checking program: contributors write short notes that add context to potentially misleading posts, and other contributors rate whether those notes are helpful. Its algorithm uses a matrix factorization model to separate ideology from note quality, so notes are surfaced only when they receive support across ideological lines. After ideology is accounted for, however, the model gives all raters equal influence on quality estimates. This slows consensus formation and leaves the quality estimate vulnerable to noisy or strategic raters. We propose Quality-Sensitive Matrix Factorization (QSMF), which uses a per-rater quality-sensitivity parameter \(\hat\rho_i\) estimated jointly with all other parameters. This connects QSMF to peer prediction: without external ground truth, it gives more influence to raters whose ideology-adjusted ratings are more consistent with the note-quality estimates learned from all the ratings.
We evaluate QSMF on 45M ratings over 365K notes from the six months before the 2024 U.S. presidential election. Split-half tests confirm that quality sensitivity is a stable, empirically recoverable rater trait. In evaluation on high-traffic notes, QSMF requires 26--40\% fewer ratings to match the baseline's accuracy. In semi-synthetic coordinated attacks on notes of opposing ideology, QSMF substantially reduces the displacement of the estimated quality of targeted notes relative to the baseline. In synthetic data with known ground truth, \(\hat\rho_i\) separates good from bad raters with an AUC above 0.94, and achieves much lower error in recovering true note qualities in the presence of bad raters. These gains come from a single additional scalar parameter per rater, with no external ground truth and no manual moderation.
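A minimal sketch of where the per-rater sensitivity enters, relative to a Community Notes-style factorization in which a per-note intercept serves as the quality estimate: the baseline fixes every rater's sensitivity at 1, while QSMF learns it jointly. The parameterization and update rule below are illustrative, not the authors' implementation:

```python
# Community Notes-style factorization with a per-rater quality
# sensitivity: rating ~ mu + b_rater + rho_rater * quality_note +
# f_rater . f_note. The baseline fixes rho = 1; QSMF learns it jointly.
import numpy as np

rng = np.random.default_rng(0)
n_raters, n_notes, dim = 100, 40, 1
mu = 0.0
b_rater = rng.normal(0, 0.1, n_raters)
quality = rng.normal(0, 0.1, n_notes)            # per-note quality estimate
f_rater = rng.normal(0, 0.1, (n_raters, dim))    # ideology factors
f_note = rng.normal(0, 0.1, (n_notes, dim))
rho = np.ones(n_raters)                          # quality sensitivity per rater

def predict(u, n):
    return mu + b_rater[u] + rho[u] * quality[n] + f_rater[u] @ f_note[n]

def sgd_step(u, n, r, lr=0.05):
    global mu
    err = predict(u, n) - r                      # gradients before any update
    g_q, g_rho = err * rho[u], err * quality[n]
    g_fr, g_fn = err * f_note[n], err * f_rater[u]
    mu -= lr * err
    b_rater[u] -= lr * err
    quality[n] -= lr * g_q
    rho[u] -= lr * g_rho
    f_rater[u] -= lr * g_fr
    f_note[n] -= lr * g_fn

sgd_step(u=3, n=7, r=1.0)
print(predict(3, 7))
```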
Coordinated campaigns on social media play a critical role in shaping crisis information environments, particularly during the onset of conflicts when uncertainty is high and verified information is scarce. We study the interplay between coordinated campaigns and information integrity through a case study of the 2023 Israel-Hamas War on Twitter (X). We analyze 4.5 million tweets and employ established coordination detection methods to identify 11 coordinated groups involving 541 accounts. We characterize these groups through a multimodal analysis that includes topics, account amplification, toxicity, emotional tone, visual themes, and misleading claims. Our analysis reveals that coordinated campaigns rely predominantly on low-complexity tactics, such as retweet amplification and copy-paste diffusion, and promote distinct narratives consistent with a fragmented manipulation landscape, without centralized control. Widely amplified misleading claims concentrate within just three of the identified coordinated groups; the remaining groups primarily engage in advocacy, religious solidarity, or humanitarian mobilization. Claim-level integrity, toxicity, and emotional signals are mutually uncorrelated: no single behavioral signal is a reliable proxy for the others. Targeting the most prolific spreaders of misleading content for moderation would be effective in reducing such content. However, targeting prolific amplifiers in general would not achieve the same mitigation effect. These findings suggest that evaluating coordination structures jointly with their specific content footprints is needed to effectively prioritize moderation interventions.
Online platforms where volunteers answer each other's questions are important sources of knowledge, yet participation is declining. We ran a pre-registered experiment on Stack Overflow, one of the largest Q&A communities for software development (N = 22,856), randomly assigning newly posted questions to receive an anonymous upvote. Within four weeks, treated users were 6.3% more likely to ask another question and 12.9% more likely to answer someone else's question. A second upvote produced no additional effect. The effect on answering was larger, more persistent, and still significant at twelve weeks. Next, we examine how much of these effects are due to algorithmic amplification, since upvotes also raise a question's rank and visibility. Algorithmic amplification is not important for the effect on asking additional questions, but it matters a lot for the effect on answering other questions. The increase in visibility increases the probability that another user provides an answer, and that experience appears to shift the poster toward broader community participation.
Agent model shows spurious beliefs and group animosity emerge without real conflicts or evidence.
Our belief systems are shaped by social processes, such as observations and influence, and by cognitive processes, such as the drive for internal coherence. These processes steer how individual beliefs evolve and become connected. The resulting belief networks contain both causal and associative links, including spurious ones, such as stereotypes. Here, we develop an agent-based model of belief networks that demonstrates how two basic mechanisms -- social interaction and a drive for internal coherence -- can give rise to such stereotypes without any underlying reality. We further demonstrate how stereotypes, when coupled with shared group identity, can give rise to affective polarization, even in the absence of ideological conflicts.
Unlike the more observable phenomenon of group opinion reinforcement, self-censorship online has received comparatively less attention. Our goal in this work is to dissect the phenomenon of self-censorship and to examine the implications of restrained expression for participation in public discourse, particularly in polarized contexts. We explore how social media users express their opinions online through analyses of 390 survey responses and 20 semi-structured interviews using a mixed-methods approach. We ask social media users about the differences between their publicly shared opinions and privately held beliefs, highlighting the influence of contextual factors on self-expression. Our findings show that self-censorship is associated with community context; social media users embedded within larger audiences, with lower posting frequency and perceived support, are less likely to express their opinions, and those who do speak often adjust their expressed views to align with perceived group norms. The study complements the rich literature on echo chambers and opinion reinforcement on social media platforms, highlighting the silence within the noise and its potential consequences for public discourse, which have become increasingly pertinent in an era where online platforms are pivotal to social and political narratives.
Many bipartite networks exhibit hierarchical community structure, but existing community detection methods are not well-suited for detecting hierarchy. They also do not effectively handle weighted bipartite networks. In this work, we introduce a novel modularity-based objective function, called the generalized bipartite modularity density, $Q_{bg}$, specifically designed for hierarchical community detection in bipartite systems. The framework incorporates a tunable resolution parameter that enables systematic exploration of community structure across multiple scales. It leverages resolution-limit behavior in bipartite networks as a tool to uncover hierarchical organization without projecting the network or altering its intrinsic bipartite topology. We evaluate the method using a hierarchical synthetic bipartite benchmark and apply it to two empirical networks. In all cases, $Q_{bg}$ recovers established mesoscale structure while revealing additional hierarchical and fine-scale organization beyond that detected by conventional bipartite approaches. These results establish $Q_{bg}$ as a flexible, interpretable, and resolution-aware framework for hierarchical community detection in bipartite networks.
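For orientation, the standard starting point that such objectives extend is Barber's bipartite modularity with a resolution parameter; the sketch below implements only that baseline, since the density normalization defining $Q_{bg}$ is the paper's contribution and is not reproduced:

```python
# Baseline only: Barber's bipartite modularity with resolution gamma.
# The density normalization defining Q_bg is the paper's contribution
# and is not reproduced here.
import numpy as np

def bipartite_modularity(B, row_labels, col_labels, gamma=1.0):
    """B: weighted biadjacency matrix; labels give each row/column node's
    community. Q = (1/m) * sum_ij (B_ij - gamma * k_i * d_j / m) [g_i == g_j]."""
    m = B.sum()
    k = B.sum(axis=1)                                  # row-node strengths
    d = B.sum(axis=0)                                  # column-node strengths
    same = row_labels[:, None] == col_labels[None, :]  # co-membership mask
    return float(((B - gamma * np.outer(k, d) / m) * same).sum() / m)

B = np.array([[3., 1., 0.],
              [2., 0., 0.],
              [0., 0., 4.]])
print(bipartite_modularity(B, np.array([0, 0, 1]), np.array([0, 0, 1])))  # 0.48
```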
Decomposing hypergraphs is a key task in hypergraph analysis with broad applications in community detection, pattern discovery, and task scheduling. Existing approaches such as $k$-core and neighbor-$k$-core rely on vertex degree constraints, which often fail to capture true density variations induced by multi-way interactions and may lead to sparse or uneven decomposition layers. To address these issues, we propose a novel \((k,\delta)\)-dense subhypergraph model for decomposing hypergraphs based on integer density values. Here, $k$ represents the density level of a subhypergraph, while \(\delta\) sets the upper limit for each hyperedge's contribution to density, allowing fine-grained control over density distribution across layers. Computing such dense subhypergraphs is algorithmically challenging, as it requires identifying an egalitarian orientation under bounded hyperedge contributions, for which a naive approach incurs a worst-case complexity of up to $O(2^{m\delta})$. To enable efficient computation, we develop a fair-stable-based algorithm that reduces the complexity of mining a single $(k,\delta)$-dense subhypergraph from $O(m^{2}\delta^{2})$ to $O(nm\delta)$. Building on this result, we further design a divide-and-conquer decomposition framework that improves the overall complexity of full density decomposition from $O(nm\delta \cdot d^E_{\max} \cdot k_{\max})$ to $O(nm\delta \cdot d^E_{\max} \cdot \log k_{\max})$. Experiments on nine real-world hypergraph datasets demonstrate that our approach produces more continuous and less redundant decomposition hierarchies than existing baselines, while maintaining strong computational efficiency. Case studies further illustrate the practical utility of our model by uncovering cohesive and interpretable community structures.
Social media and online review platforms have become valuable sources for studying how people express opinions, report experiences, and respond to events across space. This work presents a practical guide to using user-generated social data for geospatial research on public opinion, human behavior, and place-based experience. It shows the promise of using these data as a form of passive, distributed, and human-centered sensing that complements traditional surveys and sensor systems. Methodologically, the chapter outlines a general workflow that includes platform-aware data collection, information extraction, geospatial anchoring, and statistical modeling. It also discusses how advances in large language models (LLMs) strengthen the ability to extract structured information from noisy and unstructured content. Four case studies illustrate this framework: COVID-19 vaccine acceptance, earthquake damage assessment, airport service quality, and accessibility in urban environments. Across these cases, social media data are shown to support timely measurement of public attitudes, rapid approximation of geographically distributed impacts, and fine-grained understanding of place-based experiences.