Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

Balaji Krishnamurthy; Rajiv Ratn Shah; Varun Khurana; Vashu Chauhan; Vijval Ekbote; Yaman Kumar Singla

arxiv: 2606.08800 · v1 · pith:VAOT6D3Unew · submitted 2026-06-07 · 💻 cs.AI

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

Varun Khurana , Vijval Ekbote , Vashu Chauhan , Yaman Kumar Singla , Rajiv Ratn Shah , Balaji Krishnamurthy This is my paper

Pith reviewed 2026-06-27 18:31 UTC · model grok-4.3

classification 💻 cs.AI

keywords feature engineeringself-evolving treesexpert alignmentinterpretable machine learningbrand classificationautomated feature generationcontent moderation

0 comments

The pith

FEST uses self-evolving trees to generate expert-aligned features from raw text and images, outperforming baselines by 4.2 points on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FEST to solve the problem of turning unstructured data into interpretable features that match expert criteria in high-stakes domains like brand compliance and content moderation. It does this through dual-stream generation of semantic and deterministic features, followed by semantic deduplication and iterative evolution guided by trees. Experiments show FEST wins on most classifier-task pairs, covers a majority of expert-designed features, and gains further accuracy when given expert guidelines as seeds. The authors also release the BrandGuide dataset to support evaluation of expert alignment in feature engineering.

Core claim

FEST discovers auditable features from raw text and images by combining dual-stream feature generation, semantic deduplication, and tree-guided iterative evolution, leading in 17 of 20 classifier-task combinations with a mean gain of 4.2 percentage points, achieving 60-80% coverage of expert-designed brand features, and improving accuracy by 6-12 points when seeded with expert guidelines.

What carries the argument

Self-evolving trees that guide iterative feature refinement by evaluating and evolving candidates from semantic and deterministic generation streams with deduplication.

If this is right

FEST outperforms the strongest baseline in 17 of 20 classifier-task combinations across three domains.
It recovers 60-80% of expert-designed brand features at strict semantic alignment thresholds.
Seeding with expert guidelines turns qualitative criteria into operational features and raises accuracy by 6-12 points.
The released BrandGuide dataset pairs expert features with over one million assets across 2683 brands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to clinical or legal text where expert guidelines are also available in written form.
If the evolution loop scales, it might reduce the need for hand-crafted features in regulated industries.
Testing on image-only or video data would check whether the dual-stream design generalizes beyond the current text-heavy tasks.

Load-bearing premise

That LLM-as-judge scores and human expert ratings provide an unbiased and reliable measure of true alignment with the qualitative criteria used by practitioners.

What would settle it

A direct comparison of FEST outputs against a held-out collection of expert features on new data, scored by different experts without using the LLM judge or the original raters.

Figures

Figures reproduced from arXiv: 2606.08800 by Balaji Krishnamurthy, Rajiv Ratn Shah, Varun Khurana, Vashu Chauhan, Vijval Ekbote, Yaman Kumar Singla.

**Figure 1.** Figure 1: One iteration of FEST on LG brand voice feature discovery. Given paired on-brand and off-brand samples, FEST executes two complementary streams: (i) a semantic (SE) stream, where an LLM generates pairwise discriminative hypotheses (e.g., “inclusive language,” “emotional appeal”), followed by semantic clustering and deduplication; and (ii) a deterministic (DE) stream producing executable Python functions (e… view at source ↗

**Figure 2.** Figure 2: LLM-as-judge expert coverage (%) at thresholds 5–7 for three brands (Adobe, LG, Porsche). FEST (solid) maintains stable coverage across all thresholds, confirming strong semantic matches (7–9/10). Felix (dashed) appears comparable at threshold 5 but collapses at stricter thresholds, reaching 0% for Porsche at ≥7. Full per-brand sweep in Appendix N. Classification accuracy alone is insufficient for high-sta… view at source ↗

**Figure 3.** Figure 3: Expert feature refinement accuracy (averaged across DT, LR, RF, LLM classifiers) for text brand classification. A: expert features as-is, B: FEST features filtered to those aligned with experts (LLM-judge ≥ 7), C: full Expert+FEST. The staircase pattern (A→B→C) confirms that refinement and augmentation each provide independent gains of 6–12 pp. Per-classifier breakdown in Table 5 (Appendix). Experiment… view at source ↗

**Figure 4.** Figure 4: FEST refines and augments expert features (EY imagery). From seeds Fseed, FEST produces unchanged, refined (F ′ seed), and augmented (Fnew) features [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FEST adds a self-evolving dual-stream pipeline and the BrandGuide dataset, but the reported gains and alignment numbers rest on unvalidated LLM judges plus missing ablations and stats.

read the letter

FEST combines semantic and deterministic feature streams with tree-guided evolution and deduplication to turn expert guidelines into operational features from text and images. It also ships BrandGuide, a dataset that pairs expert-designed features with over a million assets across thousands of brands.

The dataset release is the clearest practical step forward. Anyone working on alignment between automated features and qualitative expert criteria now has a concrete benchmark they can use or extend. The pipeline itself stitches together existing pieces in a way that directly targets the gap between raw content and auditable rules.

The performance claims are harder to assess. The abstract reports wins on 17 of 20 classifier-task pairs and 60-80% coverage via LLM-as-judge, backed by a human study. Yet there are no error bars, no statistical tests, and no ablation that isolates what the self-evolution step actually contributes versus simply generating more features.

The bigger issue is the reliance on LLM-as-judge and human ratings for expert alignment. The paper gives no evidence that these proxies track the same criteria the original brand experts used when they wrote the guidelines. Without a validation step comparing judge scores to direct expert pairwise judgments, the coverage numbers and downstream accuracy lifts could be artifacts of the evaluation setup rather than evidence of true alignment.

This work is aimed at people building interpretable models in brand compliance, content moderation, or similar domains where features must be traceable to documentation. Readers who need a new dataset for testing alignment methods will find something usable here.

The combination of a released dataset and a concrete pipeline is enough to send it to peer review, though the methods and evaluation sections will need tighter controls and validation experiments before the central claims can be taken as settled.

Referee Report

3 major / 2 minor

Summary. The paper introduces FEST (Feature Engineering with Self-evolving Trees), which combines dual-stream (semantic and deterministic) feature generation, semantic deduplication, and tree-guided iterative evolution to produce auditable, expert-aligned features from unstructured text and images. It claims FEST outperforms baselines in 17 of 20 classifier-task combinations across brand classification, content authenticity, and stress detection (mean +4.2 pp gain over strongest baseline across five classifiers), achieves 60-80% coverage of expert-designed features per LLM-as-judge at strict thresholds (corroborated by human ratings on relevance/clarity/actionability), and yields 6-12 pp accuracy gains when seeded with guidelines. The work releases BrandGuide, pairing expert features with 1M+ assets across 2,683 brands, to support evaluation of expert alignment in automated feature engineering.

Significance. If the performance and alignment results hold after proper validation, the work would offer a concrete route to interpretable ML in high-stakes domains by turning qualitative expert criteria into operational features. The BrandGuide release is a clear strength, as it supplies the first public benchmark linking expert-designed features to large-scale assets and enables reproducible study of alignment; this directly addresses a gap in the field. The self-evolution mechanism and dual-stream design are presented as ways to avoid purely data-driven or purely manual approaches.

major comments (3)

[Abstract and §5] Abstract and §5 (results): The headline claim of leading in 17/20 combinations with +4.2 pp mean gain supplies no error bars, no description of statistical testing, no ablation isolating the self-evolution component, and no details on hyperparameter selection or post-hoc exclusions. These omissions make the central performance claim difficult to evaluate for robustness.
[§4.3] §4.3 (LLM-as-judge evaluation) and human study: The 60-80% coverage figures and human ratings on relevance/actionability are load-bearing for both the alignment claim and the downstream accuracy gains, yet no inter-rater reliability, prompt-validation experiment, or direct comparison against expert pairwise judgments on the same features is reported. If the judge operationalizes 'semantic alignment' differently from the original BrandGuide experts, the coverage numbers become an unvalidated proxy rather than evidence of true alignment.
[§3 and §5.2] §3 (method) and §5.2 (seeded experiments): The reported 6-12 pp gains when seeding with expert guidelines rest on the assumption that the generated features are genuinely refining the qualitative criteria rather than adding volume or lexical overlap; no control experiment (e.g., random or non-evolved features of equal count) is described to isolate this effect.

minor comments (2)

[§3] Notation for the dual-stream generation and tree evolution steps could be clarified with a single running example across figures.
[§6] The BrandGuide dataset description would benefit from explicit statistics on feature density per brand and asset type distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of statistical rigor, evaluation validation, and experimental controls that we will address to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (results): The headline claim of leading in 17/20 combinations with +4.2 pp mean gain supplies no error bars, no description of statistical testing, no ablation isolating the self-evolution component, and no details on hyperparameter selection or post-hoc exclusions. These omissions make the central performance claim difficult to evaluate for robustness.

Authors: We agree that the results presentation requires greater statistical detail. In the revised manuscript we will report error bars (standard deviation across multiple runs), describe the statistical testing procedure with p-values, add an ablation isolating the self-evolution component, and document hyperparameter selection together with any post-hoc decisions. revision: yes
Referee: [§4.3] §4.3 (LLM-as-judge evaluation) and human study: The 60-80% coverage figures and human ratings on relevance/actionability are load-bearing for both the alignment claim and the downstream accuracy gains, yet no inter-rater reliability, prompt-validation experiment, or direct comparison against expert pairwise judgments on the same features is reported. If the judge operationalizes 'semantic alignment' differently from the original BrandGuide experts, the coverage numbers become an unvalidated proxy rather than evidence of true alignment.

Authors: We acknowledge the value of additional validation. The revision will include inter-rater reliability statistics for the human study and a prompt-validation experiment. A direct pairwise comparison with the original BrandGuide experts would require new annotation effort outside the current scope; we will therefore discuss this limitation explicitly while retaining the existing human ratings on relevance, clarity, and actionability as corroborating evidence. revision: partial
Referee: [§3 and §5.2] §3 (method) and §5.2 (seeded experiments): The reported 6-12 pp gains when seeding with expert guidelines rest on the assumption that the generated features are genuinely refining the qualitative criteria rather than adding volume or lexical overlap; no control experiment (e.g., random or non-evolved features of equal count) is described to isolate this effect.

Authors: We agree that a control is needed to isolate the refinement effect. The revised experiments will add a control condition that compares the seeded, evolved features against an equal number of randomly generated or non-evolved features, thereby demonstrating that accuracy gains arise from iterative refinement rather than feature volume or lexical overlap. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance and coverage metrics are external evaluations

full rationale

The paper reports empirical results (17/20 wins, +4.2 pp mean gain, 60-80% LLM-as-judge coverage) on held-out tasks and a released dataset (BrandGuide) using standard classifier accuracy and semantic alignment proxies. No equations, fitted parameters, or self-citations are shown reducing these quantities to the method's own inputs by construction. The derivation chain consists of a described algorithm (dual-stream generation, deduplication, tree evolution) evaluated against independent baselines and expert references; the metrics are not redefined in terms of the same fitted values. This is the common case of a self-contained empirical paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the unstated premise that the LLM judge and human raters constitute valid proxies for expert alignment, plus whatever internal parameters control the semantic deduplication threshold and tree evolution depth; no explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.1-grok · 5850 in / 1451 out tokens · 19516 ms · 2026-06-27T18:31:24.005903+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 10 canonical work pages · 4 internal anchors

[1]

LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Nikhil Abhyankar, Parshin Shojaee, and Chandan K Reddy. Llm-fe: Automated feature engi- neering for tabular data with llms as evolutionary optimizers.arXiv preprint arXiv:2503.14434, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Ay¸ segül Acar, Naci Büyükda˘g, Burak Türten, Ersin Diker, and Gülsüm Çalı¸ sır. The role of brand identity, brand lifestyle congruence, and brand satisfaction on repurchase intention: a multi-group structural equation model.Humanities and Social Sciences Communications, 11 (1):1–13, 2024

2024
[3]

Adobe GenStudio for Performance Marketing: Brand Compliance

Adobe. Adobe GenStudio for Performance Marketing: Brand Compliance. https://busine ss.adobe.com/products/genstudio/performance-marketing/brand-compliance. html, 2024. Accessed: 2026-05-10

2024
[4]

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization.arXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[5]

Embedding domain-specific knowledge from llms into the feature engineering pipeline.arXiv preprint arXiv:2503.21155, 2025

João Eduardo Batista. Embedding domain-specific knowledge from llms into the feature engineering pipeline.arXiv preprint arXiv:2503.21155, 2025

work page arXiv 2025
[6]

Canva Brand Kit

Canva. Canva Brand Kit. https://www.canva.com/en_in/pro/brand-kit/ , 2024. Accessed: 2026-05-10

2024
[7]

Match.com ad criticised for suggesting red hair and freckles ’imperfections’

Elena Cresci. Match.com ad criticised for suggesting red hair and freckles ’imperfections’. The Guardian, April 2016. URL https://www.theguardian.com/media/2016/apr/11/matc hcom-ad-criticised-for-suggesting-red-hair-and-freckles-imperfections

2016
[8]

The delta learning hypothesis: Preference tuning on weak data can yield strong gains.arXiv preprint arXiv:2507.06187, 2025

Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. The delta learning hypothesis: Preference tuning on weak data can yield strong gains.arXiv preprint arXiv:2507.06187, 2025

work page arXiv 2025
[9]

Large language models can automatically engineer features for few-shot tabular learning.arXiv preprint arXiv:2404.09491, 2024

Sungwon Han, Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Large language models can automatically engineer features for few-shot tabular learning.arXiv preprint arXiv:2404.09491, 2024

work page arXiv 2024
[10]

The autofeat python library for automated feature engineering and selection

Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 111–120. Springer, 2019

2019
[11]

macmillan, 2011

Daniel Kahneman.Thinking, fast and slow. macmillan, 2011

2011
[12]

Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003

Kevin Lane Keller. Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003

2003
[13]

Measuring and improving engagement of text-to-image generation models

Varun Khurana, Yaman Singla, Jayakumar Subramanian, Changyou Chen, Rajiv Ratn Shah, Zhiqiang Xu, and Balaji Krishnamurthy. Measuring and improving engagement of text-to-image generation models. InInternational Conference on Learning Representations, volume 2025, pages 38273–38304, 2025

2025
[14]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). InInternational conference on machine learning, pages 2668–2677. PMLR, 2018

2018
[15]

Ferg-llm: Feature en- gineering by reason generation large language models

Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, and Kyunam Lee. Ferg-llm: Feature en- gineering by reason generation large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 4211–4228, 2025

2025
[16]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational conference on machine learning, pages 5338–5348. PMLR, 2020

2020
[17]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. 10 Adobe Media & Data Science Research

2017
[18]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023
[19]

Felix: Automatic and interpretable feature engineering using llms

Simon Malberg, Edoardo Mosca, and Georg Groh. Felix: Automatic and interpretable feature engineering using llms. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 230–246. Springer, 2024

2024
[20]

Brand consistency—the competitive advantage and how to achieve it

Marq (formerly Lucidpress). Brand consistency—the competitive advantage and how to achieve it. Blog post, originally published 2018-2019, 2024. URL https://www.marq.com/blo g/brand-consistency-competitive-advantage/ . Survey of 400+ brand management experts

2018
[21]

Label-free concept bottleneck models

Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models.arXiv preprint arXiv:2304.06129, 2023

work page arXiv 2023
[22]

GPT-4o mini: Advancing cost-efficient intelligence

OpenAI. GPT-4o mini: Advancing cost-efficient intelligence. OpenAI Blog, 2024. URL https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ . Accessed: January 2026

2024
[23]

why should i trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016

2016
[24]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019

2019
[25]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.arXiv preprint arXiv:1911.08731, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[26]

A bayesian approach to filtering junk e-mail

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A bayesian approach to filtering junk e-mail. InLearning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105. Madison, Wisconsin, 1998

1998
[27]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017

2017
[28]

Elsevier, 2012

Edward Shortliffe.Computer-based medical consultations: MYCIN, volume 2. Elsevier, 2012

2012
[29]

Dreaddit: A reddit dataset for stress analysis in social media

Elsbeth Turcan and Kathleen McKeown. Dreaddit: A reddit dataset for stress analysis in social media. InProceedings of the tenth international workshop on health text mining and information analysis (LOUHI 2019), pages 97–107, 2019

2019
[30]

An exploration of pattern mining with chatgpt

Michael Weiss. An exploration of pattern mining with chatgpt. InProceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices, pages 1–11, 2024

2024
[31]

Openfe: Automated feature generation with expert-level performance

Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: Automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023

2023
[32]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023
[34]

Hypothesis generation with large language models

Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. InProceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 117–139, 2024. 11 Adobe Media & Data Science Research A Related Work

2024
[35]

Manual Feature Engineering.Early AI and expert systems relied on manually crafted, human- interpretable features and rules. Classic examples include the MYCIN medical diagnosis system, which encoded clinical knowledge as explicit features and logical rules [28], and early spam filters that used hand-designed counts and lexical features to achieve practica...
[36]

Classical Automated Feature Engineering.Classical automated feature engineering systems construct large candidate pools via predefined transformations and then select or prune useful features. Representative systems (e.g., AutoFeat and OpenFE) enumerate combinations, nonlinear transforms, or symbolic expressions and rely on statistical selection, regulari...
[37]

Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically

LLM-Based Feature Engineering.LLMs have recently been used as feature proposers in both structured and unstructured settings. Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically. [9] propose FeatLLM, an in-context prompting approach that elicits interpretable rule-styl...
[38]

CBMs require predefined concept sets with labeled annotations, limiting scalability

Concept Bottleneck Models.Concept Bottleneck Models (CBMs) pursue a complementary goal: mapping raw inputs to human-interpretable intermediate concepts before prediction [ 16]. CBMs require predefined concept sets with labeled annotations, limiting scalability. Label-Free CBMs address this by using foundation models to automatically generate concept sets ...
[39]

uses inclusive language,

Inherent Interpretability vs. Post-hoc Explanation.Rudin [24] argues that high-stakes do- mains should use inherently interpretable models rather than explaining black boxes post hoc, since post-hoc methods (LIME, SHAP, GradCAM) produce approximate attributions that may not faith- fully reflect model reasoning [23, 17, 27]. FEST aligns with this principle...
[40]

Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks

Iterative Self-Refinement.FEST’s generate-evaluate-refine loop connects to the broader paradigm of iterative self-improvement in LLM systems. Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks. FEST applies an analogous loop to feature engineering: LLMs genera...
[41]

[33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs

LLM-as-Judge Evaluation.Zheng et al. [33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs. FEST adopts this paradigm for expert coverage evaluation: GPT-4o judges whether discovered features capture the same brand dimensions as expert guidelines, providing interpretable 0–...
[42]

Comparison with FEST:While prior work demonstrates the promise of LLMs for feature discovery across both unstructured and structured settings, FEST departs from these approaches along several dimensions. First, FEST targets raw multimodal observational data (text, images, and tabular inputs) by prompting LLMs to generateinterpretablefeature candidates dir...
[43]

1” (present) and “0

Connection to Robust Optimization.FEST shares a conceptual affinity with Invariant Risk Minimization (IRM) [4] and Group DRO [25]: both seek features stable across environments rather than spuriously correlated with labels. FEST’s iterative pruning of low-importance features echoes invariance-seeking, as features that do not generalize across batches are ...

work page arXiv 2074
[44]

All tech overlays should be set in EY Yellow to create a clear brand connection

Unchanged Expert Features( Fseed): features retained verbatim because they are already precise and actionable (e.g., “All tech overlays should be set in EY Yellow to create a clear brand connection”)
[45]

Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery

Refined Expert Features( F ′ seed): ambiguous guidelines transformed into precise operational definitions. For example, “Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery. . . ” was refined to “Emphasize realistic, grounded photography over abstract or cartoonish visuals for relatable professional depictions, featuring...
[46]

Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles

Augmented Features( Fnew): novel features discovered by FEST that extend beyond documented guidelines, surfacing implicit domain knowledge (e.g., “Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles” for EY images). L Variance Analysis We run 3 independent seeds o...
[47]

FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts

Cluster summarization: Felix selects the feature closest to each cluster centroid, preserving idiosyncratic LLM phrasing. FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts
[48]

uses positive language

Iterative pruning: Felix is one-shot, so all features survive regardless of quality. FEST’s tree- guided importance scores identify and prune generic features (e.g., “uses positive language”) that lack discriminative power. Since expert features are inherently discriminative (experts select what distinguishes classes), iterative selection for discriminati...
[49]

Zero-tolerance for hate speech,

Multi-stage language refinement: FEST edits feature language at discovery, during cluster summarization, and during bank merging across iterations. Each consolidation step distills toward more precise, canonical formulations, while Felix’s single-pass centroid-pick cannot improve feature framing based on feedback from data. P Uncovered Guidelines Audit An...
[50]

Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains

Same-LLM baselines: Zero-Shot and Few-Shot baselines use the identical LLM on the same con- tent. Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains
[51]

Contamination-implausible datasets: FEST achieves strong results on Dreaddit (Reddit posts about stress) and GPT-generated content detection. Reddit posts are unlikely to be memorized 22 Adobe Media & Data Science Research in their task-specific labels, and GPT-generated content detection requires distinguishing model outputs from human writing, not recal...
[52]

uses emotional storytelling,

Feature interpretability: FEST’s features are expressed as natural-language descriptions or short executable functions and can be inspected for face validity. The discovered features (e.g., “uses emotional storytelling,” “sentence length variance”) are domain-meaningful, not artifacts of memorization. We acknowledge this as a limitation inherent to all cl...
[53]

Initial collection yielded 3,466 candidate entries spanning 1963–2025

Data Acquisition: We systematically collected brand guidelines from the web, extracting rich metadata including publication year, geographic region, language, and sector tags. Initial collection yielded 3,466 candidate entries spanning 1963–2025. 25 Adobe Media & Data Science Research

1963
[54]

Temporal Filtering: To ensure contemporary relevance and consistency in design conventions, we filtered entries to the 2014–2025 timeframe, yielding 2,683 brands that reflect modern digital-first brand systems

2014
[55]

Guideline Extraction: Each document undergoes structured parsing to extract design specifi- cations as text including color codes (HEX, RGB, CMYK, Pantone), typography hierarchies (primary/secondary typefaces, weights, sizes), logo clearance rules (minimum sizes, spacing requirements), and usage constraints (approved/prohibited applications)
[56]

We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints

Visual Asset Retrieval: For each brand, we retrieve imagery through web search using brand name and relevant keywords. We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints. This process yielded approximately1Mbrand images and textual descriptions across all entries
[57]

Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand

Manual Verification: Each stage incorporates human review to ensure annotation accuracy, filter malformed entries, and validate guideline-image alignment. Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand. V .2 Quality Assurance To maintain dataset quality, we ...

2014
[58]

+ 0.0722*c[2] L1, L2 = lum(dominant), lum( background) lo, hi = min(L1, L2), max(L1, L2) return (hi + 0.05) / (lo + 0.05) 30 Adobe Media & Data Science Research Feature Type Code Count of distinct visual elements via HSV color segmentation (food variety and abun- dance). DE def extract_feature(image_path: str): if not image_path or not isinstance(image_pa...

work page doi:10.18653/v1/d19-6213

[1] [1]

LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Nikhil Abhyankar, Parshin Shojaee, and Chandan K Reddy. Llm-fe: Automated feature engi- neering for tabular data with llms as evolutionary optimizers.arXiv preprint arXiv:2503.14434, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Ay¸ segül Acar, Naci Büyükda˘g, Burak Türten, Ersin Diker, and Gülsüm Çalı¸ sır. The role of brand identity, brand lifestyle congruence, and brand satisfaction on repurchase intention: a multi-group structural equation model.Humanities and Social Sciences Communications, 11 (1):1–13, 2024

2024

[3] [3]

Adobe GenStudio for Performance Marketing: Brand Compliance

Adobe. Adobe GenStudio for Performance Marketing: Brand Compliance. https://busine ss.adobe.com/products/genstudio/performance-marketing/brand-compliance. html, 2024. Accessed: 2026-05-10

2024

[4] [4]

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization.arXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[5] [5]

Embedding domain-specific knowledge from llms into the feature engineering pipeline.arXiv preprint arXiv:2503.21155, 2025

João Eduardo Batista. Embedding domain-specific knowledge from llms into the feature engineering pipeline.arXiv preprint arXiv:2503.21155, 2025

work page arXiv 2025

[6] [6]

Canva Brand Kit

Canva. Canva Brand Kit. https://www.canva.com/en_in/pro/brand-kit/ , 2024. Accessed: 2026-05-10

2024

[7] [7]

Match.com ad criticised for suggesting red hair and freckles ’imperfections’

Elena Cresci. Match.com ad criticised for suggesting red hair and freckles ’imperfections’. The Guardian, April 2016. URL https://www.theguardian.com/media/2016/apr/11/matc hcom-ad-criticised-for-suggesting-red-hair-and-freckles-imperfections

2016

[8] [8]

The delta learning hypothesis: Preference tuning on weak data can yield strong gains.arXiv preprint arXiv:2507.06187, 2025

Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. The delta learning hypothesis: Preference tuning on weak data can yield strong gains.arXiv preprint arXiv:2507.06187, 2025

work page arXiv 2025

[9] [9]

Large language models can automatically engineer features for few-shot tabular learning.arXiv preprint arXiv:2404.09491, 2024

Sungwon Han, Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Large language models can automatically engineer features for few-shot tabular learning.arXiv preprint arXiv:2404.09491, 2024

work page arXiv 2024

[10] [10]

The autofeat python library for automated feature engineering and selection

Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 111–120. Springer, 2019

2019

[11] [11]

macmillan, 2011

Daniel Kahneman.Thinking, fast and slow. macmillan, 2011

2011

[12] [12]

Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003

Kevin Lane Keller. Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003

2003

[13] [13]

Measuring and improving engagement of text-to-image generation models

Varun Khurana, Yaman Singla, Jayakumar Subramanian, Changyou Chen, Rajiv Ratn Shah, Zhiqiang Xu, and Balaji Krishnamurthy. Measuring and improving engagement of text-to-image generation models. InInternational Conference on Learning Representations, volume 2025, pages 38273–38304, 2025

2025

[14] [14]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). InInternational conference on machine learning, pages 2668–2677. PMLR, 2018

2018

[15] [15]

Ferg-llm: Feature en- gineering by reason generation large language models

Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, and Kyunam Lee. Ferg-llm: Feature en- gineering by reason generation large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 4211–4228, 2025

2025

[16] [16]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational conference on machine learning, pages 5338–5348. PMLR, 2020

2020

[17] [17]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. 10 Adobe Media & Data Science Research

2017

[18] [18]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023

[19] [19]

Felix: Automatic and interpretable feature engineering using llms

Simon Malberg, Edoardo Mosca, and Georg Groh. Felix: Automatic and interpretable feature engineering using llms. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 230–246. Springer, 2024

2024

[20] [20]

Brand consistency—the competitive advantage and how to achieve it

Marq (formerly Lucidpress). Brand consistency—the competitive advantage and how to achieve it. Blog post, originally published 2018-2019, 2024. URL https://www.marq.com/blo g/brand-consistency-competitive-advantage/ . Survey of 400+ brand management experts

2018

[21] [21]

Label-free concept bottleneck models

Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models.arXiv preprint arXiv:2304.06129, 2023

work page arXiv 2023

[22] [22]

GPT-4o mini: Advancing cost-efficient intelligence

OpenAI. GPT-4o mini: Advancing cost-efficient intelligence. OpenAI Blog, 2024. URL https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ . Accessed: January 2026

2024

[23] [23]

why should i trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016

2016

[24] [24]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019

2019

[25] [25]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.arXiv preprint arXiv:1911.08731, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[26] [26]

A bayesian approach to filtering junk e-mail

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A bayesian approach to filtering junk e-mail. InLearning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105. Madison, Wisconsin, 1998

1998

[27] [27]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017

2017

[28] [28]

Elsevier, 2012

Edward Shortliffe.Computer-based medical consultations: MYCIN, volume 2. Elsevier, 2012

2012

[29] [29]

Dreaddit: A reddit dataset for stress analysis in social media

Elsbeth Turcan and Kathleen McKeown. Dreaddit: A reddit dataset for stress analysis in social media. InProceedings of the tenth international workshop on health text mining and information analysis (LOUHI 2019), pages 97–107, 2019

2019

[30] [30]

An exploration of pattern mining with chatgpt

Michael Weiss. An exploration of pattern mining with chatgpt. InProceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices, pages 1–11, 2024

2024

[31] [31]

Openfe: Automated feature generation with expert-level performance

Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: Automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023

2023

[32] [32]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023

[34] [34]

Hypothesis generation with large language models

Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. InProceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 117–139, 2024. 11 Adobe Media & Data Science Research A Related Work

2024

[35] [35]

Manual Feature Engineering.Early AI and expert systems relied on manually crafted, human- interpretable features and rules. Classic examples include the MYCIN medical diagnosis system, which encoded clinical knowledge as explicit features and logical rules [28], and early spam filters that used hand-designed counts and lexical features to achieve practica...

[36] [36]

Classical Automated Feature Engineering.Classical automated feature engineering systems construct large candidate pools via predefined transformations and then select or prune useful features. Representative systems (e.g., AutoFeat and OpenFE) enumerate combinations, nonlinear transforms, or symbolic expressions and rely on statistical selection, regulari...

[37] [37]

Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically

LLM-Based Feature Engineering.LLMs have recently been used as feature proposers in both structured and unstructured settings. Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically. [9] propose FeatLLM, an in-context prompting approach that elicits interpretable rule-styl...

[38] [38]

CBMs require predefined concept sets with labeled annotations, limiting scalability

Concept Bottleneck Models.Concept Bottleneck Models (CBMs) pursue a complementary goal: mapping raw inputs to human-interpretable intermediate concepts before prediction [ 16]. CBMs require predefined concept sets with labeled annotations, limiting scalability. Label-Free CBMs address this by using foundation models to automatically generate concept sets ...

[39] [39]

uses inclusive language,

Inherent Interpretability vs. Post-hoc Explanation.Rudin [24] argues that high-stakes do- mains should use inherently interpretable models rather than explaining black boxes post hoc, since post-hoc methods (LIME, SHAP, GradCAM) produce approximate attributions that may not faith- fully reflect model reasoning [23, 17, 27]. FEST aligns with this principle...

[40] [40]

Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks

Iterative Self-Refinement.FEST’s generate-evaluate-refine loop connects to the broader paradigm of iterative self-improvement in LLM systems. Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks. FEST applies an analogous loop to feature engineering: LLMs genera...

[41] [41]

[33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs

LLM-as-Judge Evaluation.Zheng et al. [33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs. FEST adopts this paradigm for expert coverage evaluation: GPT-4o judges whether discovered features capture the same brand dimensions as expert guidelines, providing interpretable 0–...

[42] [42]

Comparison with FEST:While prior work demonstrates the promise of LLMs for feature discovery across both unstructured and structured settings, FEST departs from these approaches along several dimensions. First, FEST targets raw multimodal observational data (text, images, and tabular inputs) by prompting LLMs to generateinterpretablefeature candidates dir...

[43] [43]

1” (present) and “0

Connection to Robust Optimization.FEST shares a conceptual affinity with Invariant Risk Minimization (IRM) [4] and Group DRO [25]: both seek features stable across environments rather than spuriously correlated with labels. FEST’s iterative pruning of low-importance features echoes invariance-seeking, as features that do not generalize across batches are ...

work page arXiv 2074

[44] [44]

All tech overlays should be set in EY Yellow to create a clear brand connection

Unchanged Expert Features( Fseed): features retained verbatim because they are already precise and actionable (e.g., “All tech overlays should be set in EY Yellow to create a clear brand connection”)

[45] [45]

Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery

Refined Expert Features( F ′ seed): ambiguous guidelines transformed into precise operational definitions. For example, “Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery. . . ” was refined to “Emphasize realistic, grounded photography over abstract or cartoonish visuals for relatable professional depictions, featuring...

[46] [46]

Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles

Augmented Features( Fnew): novel features discovered by FEST that extend beyond documented guidelines, surfacing implicit domain knowledge (e.g., “Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles” for EY images). L Variance Analysis We run 3 independent seeds o...

[47] [47]

FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts

Cluster summarization: Felix selects the feature closest to each cluster centroid, preserving idiosyncratic LLM phrasing. FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts

[48] [48]

uses positive language

Iterative pruning: Felix is one-shot, so all features survive regardless of quality. FEST’s tree- guided importance scores identify and prune generic features (e.g., “uses positive language”) that lack discriminative power. Since expert features are inherently discriminative (experts select what distinguishes classes), iterative selection for discriminati...

[49] [49]

Zero-tolerance for hate speech,

Multi-stage language refinement: FEST edits feature language at discovery, during cluster summarization, and during bank merging across iterations. Each consolidation step distills toward more precise, canonical formulations, while Felix’s single-pass centroid-pick cannot improve feature framing based on feedback from data. P Uncovered Guidelines Audit An...

[50] [50]

Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains

Same-LLM baselines: Zero-Shot and Few-Shot baselines use the identical LLM on the same con- tent. Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains

[51] [51]

Contamination-implausible datasets: FEST achieves strong results on Dreaddit (Reddit posts about stress) and GPT-generated content detection. Reddit posts are unlikely to be memorized 22 Adobe Media & Data Science Research in their task-specific labels, and GPT-generated content detection requires distinguishing model outputs from human writing, not recal...

[52] [52]

uses emotional storytelling,

Feature interpretability: FEST’s features are expressed as natural-language descriptions or short executable functions and can be inspected for face validity. The discovered features (e.g., “uses emotional storytelling,” “sentence length variance”) are domain-meaningful, not artifacts of memorization. We acknowledge this as a limitation inherent to all cl...

[53] [53]

Initial collection yielded 3,466 candidate entries spanning 1963–2025

Data Acquisition: We systematically collected brand guidelines from the web, extracting rich metadata including publication year, geographic region, language, and sector tags. Initial collection yielded 3,466 candidate entries spanning 1963–2025. 25 Adobe Media & Data Science Research

1963

[54] [54]

Temporal Filtering: To ensure contemporary relevance and consistency in design conventions, we filtered entries to the 2014–2025 timeframe, yielding 2,683 brands that reflect modern digital-first brand systems

2014

[55] [55]

Guideline Extraction: Each document undergoes structured parsing to extract design specifi- cations as text including color codes (HEX, RGB, CMYK, Pantone), typography hierarchies (primary/secondary typefaces, weights, sizes), logo clearance rules (minimum sizes, spacing requirements), and usage constraints (approved/prohibited applications)

[56] [56]

We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints

Visual Asset Retrieval: For each brand, we retrieve imagery through web search using brand name and relevant keywords. We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints. This process yielded approximately1Mbrand images and textual descriptions across all entries

[57] [57]

Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand

Manual Verification: Each stage incorporates human review to ensure annotation accuracy, filter malformed entries, and validate guideline-image alignment. Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand. V .2 Quality Assurance To maintain dataset quality, we ...

2014

[58] [58]

+ 0.0722*c[2] L1, L2 = lum(dominant), lum( background) lo, hi = min(L1, L2), max(L1, L2) return (hi + 0.05) / (lo + 0.05) 30 Adobe Media & Data Science Research Feature Type Code Count of distinct visual elements via HSV color segmentation (food variety and abun- dance). DE def extract_feature(image_path: str): if not image_path or not isinstance(image_pa...

work page doi:10.18653/v1/d19-6213