pith. machine review for the scientific record.

arxiv: 2604.14473 · v1 · submitted 2026-04-15 · 💻 cs.AI

Recognition: unknown

Response-Aware User Memory Selection for LLM Personalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM personalization · memory selection · mutual information · response utility · user memory · inference optimization · personalized generation

The pith

Selecting LLM user memory by mutual information with model outputs yields more human-aligned and higher-quality personalization than similarity-based selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often personalize by pulling a subset of user memory into the prompt at inference time. Existing selection methods score memory items mainly by semantic similarity to the current query. The paper proposes RUMS, which instead scores subsets by the mutual information they share with the model's actual output distribution. This criterion identifies memory that most reduces uncertainty in the generated response. The resulting selections match human preferences more closely than prior approaches and than models hundreds of times larger, improve response quality, and cut computational cost by up to 95 percent.
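The similarity-based baseline the paper argues against can be pictured in a few lines: embed the query and each memory item, then keep the top-k nearest items by cosine similarity. This is an illustrative sketch, not the paper's code; the embedding vectors and `k` are placeholders.

```python
import numpy as np

def similarity_select(query_vec, memory_vecs, k=3):
    """Baseline: rank memory items by cosine similarity to the query embedding
    and return the indices of the top-k most similar items."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per memory item
    return np.argsort(scores)[::-1][:k]  # indices, most similar first
```

The point of RUMS is that this ranking ignores how each item actually shifts the model's response distribution.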

Core claim

RUMS selects user memory items by measuring the mutual information between a subset of memory and the model's outputs, identifying items that reduce response uncertainty and sharpen predictions beyond semantic similarity. This information-theoretic foundation enables more principled user memory selection that aligns more closely with human selection compared to state-of-the-art methods and models 400 times larger. Memory items selected using RUMS also produce better response quality while incurring up to 95 percent lower computational cost.

What carries the argument

RUMS (Response-Utility optimization for Memory Selection), which estimates mutual information between candidate memory subsets and the LLM output distribution to rank and retain the most informative items.

Load-bearing premise

Mutual information between memory subsets and LLM outputs can be reliably estimated at inference time, and selecting higher values directly produces better and more human-aligned responses.
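If this premise holds, the selection signal reduces to the drop in predictive entropy when a memory subset enters the context. A minimal sketch of that quantity, framed from the abstract's description rather than the paper's actual implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 treated as 0
    return float(-(p * np.log(p)).sum())

def utility(p_without_memory, p_with_memory):
    """Utility of a memory subset: how much it reduces the entropy of the
    model's output distribution, i.e. H(Y | x) - H(Y | x, M)."""
    return entropy(p_without_memory) - entropy(p_with_memory)
```

A subset that sharpens a uniform next-token distribution into a peaked one scores high; an irrelevant subset leaves entropy unchanged and scores near zero.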

What would settle it

A head-to-head human evaluation in which responses generated with RUMS-selected memory receive no higher quality ratings than those generated with similarity-selected memory.

Figures

Figures reproduced from arXiv: 2604.14473 by Chan Young Park, Jennifer Neville, Jillian Fisher.

Figure 1. Overview of RUMS. RUMS-Utility quantifies how much memory subsets reduce response uncertainty (training phase), while RUMS-Models efficiently selects informative items at inference. Minimizing predictive entropy is approximately equivalent to maximizing expected user utility; this guarantee assumes the model's distribution aligns with human preferences.
Figure 2. Results for H1 analysis.
Figure 3. Cost analysis.
Figure 4. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, differing by type of model used for entropy calculation.
Figure 5. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, differing by size of model used for entropy calculation.
Figure 6. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, calculated using models of different families.
Figure 7. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, differing by decoding type.
Figure 8. Average entropy by the index of the token being generated.
Figure 9. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, using different numbers of tokens.
Figure 10. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, differing by the number of Monte Carlo samples averaged over.
Figure 11. Distribution of maximum utility scores for a personalized (blue) and non-personalized (orange) dataset, differing by the number of users averaged over.
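Several of these figures report Jensen–Shannon divergence between the utility-score distributions of personalized and non-personalized datasets (e.g. 0.50 and 0.34 in the captions). A sketch of that statistic on binned scores; the binning and the natural-log base are assumptions, not taken from the paper:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete
    distributions, e.g. histograms of utility scores."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)                          # mixture distribution
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)     # bounded by ln 2
```

A larger divergence means the utility score separates personalized from non-personalized queries more cleanly, which is how these ablations compare hyperparameter settings.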
Figure 12. RUMS-Binary: distribution of average utility scores for included and excluded items.
Figure 14. RUMS-Binary: difference in average utility score (included minus excluded items).
Figure 16. Synthetic test: agreement rate vs. threshold, with optimal threshold marked.
Figure 18. Synthetic test: F1 vs. threshold, with optimal threshold marked.
Figure 20. Real-world test: agreement rate vs. threshold, with optimal threshold marked.
Figure 22. Real-world test: F1 vs. threshold, with optimal threshold marked.
Figure 24. Optimal threshold calculated for each of n = 10 users based on utility scores, showing per-user thresholds (solid blue line), the mean threshold (dotted red line), and ±1 standard deviation (gray shaded area).
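The threshold sweeps in these figures amount to a simple search: for each candidate cutoff on the utility score, binarize the selection and measure agreement or F1 against human include/exclude labels. A sketch of the F1 variant; the data and sweep granularity are illustrative, not the paper's:

```python
import numpy as np

def best_threshold(scores, labels):
    """Sweep thresholds over utility scores; return the cutoff that maximizes
    F1 against binary human include/exclude labels, plus that F1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):              # each observed score is a candidate cutoff
        pred = (scores >= t).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```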
Figure 25. Estimated monthly cost of additional prompt tokens (1M queries per day, 30 days, GPT-4 input pricing at $0.01 per 1K tokens).
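Figure 25's estimate follows directly from its stated assumptions. For example, with an illustrative 500 extra prompt tokens per query (the per-query token count here is our assumption, not a number from the figure):

```python
def monthly_token_cost(extra_tokens_per_query, queries_per_day=1_000_000,
                       days=30, usd_per_1k_tokens=0.01):
    """Monthly cost of extra prompt tokens, using Figure 25's pricing
    assumptions: 1M queries/day, 30 days, $0.01 per 1K input tokens."""
    total_tokens = extra_tokens_per_query * queries_per_day * days
    return total_tokens / 1000 * usd_per_1k_tokens

print(monthly_token_cost(500))  # → 150000.0, i.e. $150K per month
```

This is the arithmetic behind the paper's case that trimming memory items has direct deployment value.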
Figures 26–75. Distributions of individual memory items in the synthetic user profiles, covering name, gender, ethnicity, race, languages spoken and preferred, disabilities, occupation, political affiliation, education level, hobbies and interests, favorite sports, music, books, movies, TV shows, foods, dietary restrictions, relationship status, number of children, pet ownership, travel history and preferences, social media platforms, tech-savviness, communication and feedback preferences, work schedule, industry, income and financial situation, current projects, long-term aspirations, health and fitness goals, exercise activities, environmental consciousness, volunteer activities, current challenges, personality traits, personal values, cultural background, religious beliefs, recent life events, technology usage, frequency of and reasons for using the service, tone preferences, and country.
read the original abstract

A common approach to personalization in large language models (LLMs) is to incorporate a subset of the user memory into the prompt at inference time to guide the model's generation. Existing methods select these subsets primarily using similarity between user memory items and input queries, ignoring how features actually affect the model's response distribution. We propose Response-Utility optimization for Memory Selection (RUMS), a novel method that selects user memory items by measuring the mutual information between a subset of memory and the model's outputs, identifying items that reduce response uncertainty and sharpen predictions beyond semantic similarity. We demonstrate that this information-theoretic foundation enables more principled user memory selection that aligns more closely with human selection compared to state-of-the-art methods, and models $400\times$ larger. Additionally, we show that memory items selected using RUMS result in better response quality compared to existing approaches, while having up to $95\%$ reduction in computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Response-Utility optimization for Memory Selection (RUMS), a method for LLM personalization that selects subsets of user memory by maximizing mutual information between the memory items and the model's output distribution. This is positioned as superior to semantic similarity baselines because it directly targets reduction in response uncertainty. The authors claim improved alignment with human memory selections, higher response quality, and up to 95% computational cost reduction, while outperforming models 400x larger.

Significance. If the mutual-information objective can be approximated efficiently and shown to preserve ordering that correlates with actual response quality and human preference, the work would supply a more principled, information-theoretic alternative to heuristic memory selection in personalized LLMs. The claimed cost savings and cross-model scaling results would be of practical interest for deployment.

major comments (2)
  1. [Abstract and §3 (Method)] The central claim rests on estimating I(M;Y) where Y is the LLM's token-level output distribution. The abstract asserts both human alignment and 95% cost reduction, yet provides no description of the proxy (single forward pass, top-k entropy, gradient surrogate, etc.) used to make this tractable at inference time. Without an explicit approximation and a validation that it preserves the claimed information-theoretic ordering, the superiority over semantic similarity cannot be evaluated.
  2. [Experiments section] Table 2 (or equivalent results table) reports gains in alignment and quality, but the manuscript does not state the number of independent generations per candidate subset, the Monte-Carlo sample size for MI estimation, or any statistical significance tests. These details are load-bearing for the claim that RUMS measurably reduces response uncertainty beyond baselines.
minor comments (2)
  1. [Abstract] The abstract phrase 'models $400×$ larger' is unclear; it should be rephrased to 'outperforms models 400× larger' or similar for readability.
  2. [§3] Notation for the memory subset M and response Y should be introduced consistently in the method section and reused in equations; current usage mixes 'subset of memory' and 'memory items' without a single definition.
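The referee's first major comment names candidate proxies (single forward pass, top-k entropy, Monte Carlo sampling). Purely as a sketch of what the referee is asking the authors to pin down, one plausible shape is a Monte Carlo entropy estimate over sampled continuations; `sample_fn` is a hypothetical callable standing in for one sampled model response under a given memory subset:

```python
import math
from collections import Counter

def mc_entropy(sample_fn, n_samples=20):
    """Monte Carlo estimate of response entropy: draw short sampled
    continuations and compute the entropy of their empirical distribution.
    Lower entropy under a memory subset means that subset sharpens the
    response distribution more."""
    counts = Counter(sample_fn() for _ in range(n_samples))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Whether an estimator of this kind preserves the mutual-information ordering over subsets is exactly the validation the report asks for.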

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity and completeness where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §3 (Method)] The central claim rests on estimating I(M;Y) where Y is the LLM's token-level output distribution. The abstract asserts both human alignment and 95% cost reduction, yet provides no description of the proxy (single forward pass, top-k entropy, gradient surrogate, etc.) used to make this tractable at inference time. Without an explicit approximation and a validation that it preserves the claimed information-theoretic ordering, the superiority over semantic similarity cannot be evaluated.

    Authors: We thank the referee for highlighting this point. We agree that the original manuscript did not provide sufficient detail on the practical approximation for estimating I(M;Y) at inference time. We have revised Section 3 to explicitly describe the proxy: a Monte Carlo approximation that samples from the top-k tokens of the LLM output distribution using a single forward pass per memory subset. We have also added a validation analysis (new subsection in §3 and appendix) demonstrating that this approximation preserves the relative ordering of memory subsets with high fidelity to more expensive exact computations on smaller models. These changes directly support the superiority claims over semantic similarity while clarifying the source of the reported cost reductions. revision: yes

  2. Referee: [Experiments section] Table 2 (or equivalent results table) reports gains in alignment and quality, but the manuscript does not state the number of independent generations per candidate subset, the Monte-Carlo sample size for MI estimation, or any statistical significance tests. These details are load-bearing for the claim that RUMS measurably reduces response uncertainty beyond baselines.

    Authors: We agree that these details are essential for evaluating the claims. We have revised the Experiments section to explicitly state that we performed 5 independent generations per candidate subset when measuring response quality, used a Monte-Carlo sample size of 20 for MI estimation, and applied paired t-tests for statistical significance (all reported p-values < 0.05). These specifications and the corresponding test results have been added to the text describing Table 2, confirming that the observed gains in alignment and quality are statistically reliable. revision: yes
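The paired t-test the rebuttal cites can be sketched without a statistics library; the matched scores below are illustrative, and converting the t statistic to a p-value additionally needs a t-distribution CDF (e.g. `scipy.stats.t.sf` with n − 1 degrees of freedom):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired test on matched per-query quality scores
    (e.g. RUMS vs. a similarity baseline on the same queries)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Pairing per query matters here: it removes query-difficulty variance that an unpaired test would absorb into the error term.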

Circularity Check

0 steps flagged

No circularity: RUMS introduces independent MI-based selection objective

full rationale

The paper defines RUMS as a new optimization that selects memory subsets via mutual information with the LLM output distribution, explicitly contrasting it with semantic similarity baselines. No equations, derivations, or claims reduce this objective to a fitted parameter, self-citation chain, or input by construction. The information-theoretic criterion is presented as a first-principles proposal whose practical estimation and superiority are evaluated separately; the central claim therefore remains self-contained and does not collapse into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that mutual information between memory and outputs is both computable and predictive of response quality; no explicit free parameters or invented entities are described.

axioms (1)
  • domain assumption Mutual information between memory subsets and LLM output distributions can be practically estimated and used for selection
    This is the core mechanism enabling the claimed superiority over semantic similarity.
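One minimal way to make this assumption concrete is an entropy-reduction proxy over top-k token probabilities; the function names and the exact form below are illustrative assumptions, not the paper's estimator:

```python
import math

def topk_entropy(probs, k=20):
    """Shannon entropy (nats) of the renormalized top-k slice
    of a next-token distribution."""
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    return -sum((p / z) * math.log(p / z) for p in top if p > 0)

def mi_proxy(p_with_memory, p_without_memory, k=20):
    """Illustrative proxy for I(M; Y) as an entropy reduction,
    H(Y) - H(Y | M), each term estimated on the top-k token mass
    of one forward pass. Positive values mean the memory subset
    sharpens the output distribution."""
    return topk_entropy(p_without_memory, k) - topk_entropy(p_with_memory, k)
```

A memory subset that collapses a flat next-token distribution onto a few tokens scores high under such a proxy, which is the sense in which selection by MI is "practically estimable."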

pith-pipeline@v0.9.0 · 5447 in / 1192 out tokens · 34543 ms · 2026-05-10T12:36:16.764029+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Are LLM Inferences Acceptable? User Reactions and Control Preferences for Inferred Personal Information

    cs.HC 2026-05 unverdicted novelty 7.0

    Users show curiosity over concern toward LLM inferences of personal information, with acceptability depending on context, alignment with expectations, and who uses the inferences rather than just the content.

Reference graph

Works this paper leans on

10 extracted references · 4 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Estimating conditional mutual information for dynamic feature selection

    Gadgil, S., Covert, I., and Lee, S.-I. Estimating conditional mutual information for dynamic feature selection. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/pdf?id=Oju2Qu9jvn

  2. [2]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

  3. [3]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=XPZIaotutsD

  4. [4]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Renard Lavaud, L., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., and El Sayed, W. Mistral 7B, 2023.

  5. [5]

    A survey of personalized large language models: Progress and future directions

    Li, Y., Liang, D., Zhan... A survey of personalized large language models: Progress and future directions.

  6. [6]

    Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

    Rajeev, M. A., Ramamurthy, R., Trivedi, P., Yadav, V., Bamgbose, O., Madhusudhan, S. T., Zou, J., and Rajani, N. Cats confuse reasoning LLM: Query agnostic adversarial triggers for reasoning models. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=VrEPiN5WhM

  7. [7]

    WildChat: 1M ChatGPT Interaction Logs in the Wild

    Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations (ICLR 2024), Spotlight Poster.

  8. [8]

    Appendix A: Additional Experimentation (excerpt from the reviewed paper)

    A. Additional Experimentation: In this section, we present additional experiments to supplement our main results. A.1 Robustness Analysis of Utility Scores: In H1, we hypothesized that RUMS-Utility, the maximum utility scores, can reliably distinguish between user inpu...

  9. [9]

    Sequential Conditional Interaction Gap (excerpt from the reviewed paper)

    Sequential Conditional Interaction Gap: given an ordering of memory items, we define δ_i = [U(S_i) − U(S_{i−1})] − U({i}), where S_i is the prefix subset. This measures how much the conditional contribution of each memory item deviates from independence. Table 5 summarizes the measured combinatorial gaps across both synthetic and real-world datasets. We find the f...

  10. [10]

    Example model response excerpt (language-learning tips)

    **Interactive Activities**: Engage in activities like grammar charades or sentence-building races with friends or family. This adds a social element to learning. 3. **Threshold Learning**: Set small, achievable goals and reward yourself when you reach them. This creates a sense of accomplishment and makes the process more enjoyable. 4. **Join a Language G...
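The sequential conditional interaction gap quoted in excerpt 9 above, δ_i = [U(S_i) − U(S_{i−1})] − U({i}), is mechanical to compute given any subset-utility function; this is an illustrative sketch, not the paper's code, and `U` is an assumed black box:

```python
def interaction_gaps(ordering, U):
    """Sequential conditional interaction gap from the excerpt:
    delta_i = [U(S_i) - U(S_{i-1})] - U({i}), where S_i is the
    prefix of the ordering up to item i. A zero gap means the
    item's conditional contribution equals its standalone utility
    (independence); negative gaps indicate redundancy."""
    gaps = []
    for i in range(len(ordering)):
        prefix, prev = ordering[:i + 1], ordering[:i]
        gaps.append((U(prefix) - U(prev)) - U([ordering[i]]))
    return gaps
```

For an additive utility the gaps are identically zero, which is the independence baseline the measured combinatorial gaps are compared against.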