Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution
Pith reviewed 2026-06-27 18:31 UTC · model grok-4.3
The pith
FEST uses self-evolving trees to generate expert-aligned features from raw text and images, outperforming baselines by 4.2 points on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FEST discovers auditable features from raw text and images by combining dual-stream feature generation, semantic deduplication, and tree-guided iterative evolution, leading in 17 of 20 classifier-task combinations with a mean gain of 4.2 percentage points, achieving 60-80% coverage of expert-designed brand features, and improving accuracy by 6-12 points when seeded with expert guidelines.
What carries the argument
Self-evolving trees that guide iterative feature refinement by evaluating and evolving candidates from semantic and deterministic generation streams with deduplication.
If this is right
- FEST outperforms the strongest baseline in 17 of 20 classifier-task combinations across three domains.
- It recovers 60-80% of expert-designed brand features at strict semantic alignment thresholds.
- Seeding with expert guidelines turns qualitative criteria into operational features and raises accuracy by 6-12 points.
- The released BrandGuide dataset pairs expert features with over one million assets across 2683 brands.
Where Pith is reading between the lines
- The method could extend to clinical or legal text where expert guidelines are also available in written form.
- If the evolution loop scales, it might reduce the need for hand-crafted features in regulated industries.
- Testing on image-only or video data would check whether the dual-stream design generalizes beyond the current text-heavy tasks.
Load-bearing premise
That LLM-as-judge scores and human expert ratings provide an unbiased and reliable measure of true alignment with the qualitative criteria used by practitioners.
What would settle it
A direct comparison of FEST outputs against a held-out collection of expert features on new data, scored by different experts without using the LLM judge or the original raters.
Figures
read the original abstract
In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FEST (Feature Engineering with Self-evolving Trees), which combines dual-stream (semantic and deterministic) feature generation, semantic deduplication, and tree-guided iterative evolution to produce auditable, expert-aligned features from unstructured text and images. It claims FEST outperforms baselines in 17 of 20 classifier-task combinations across brand classification, content authenticity, and stress detection (mean +4.2 pp gain over strongest baseline across five classifiers), achieves 60-80% coverage of expert-designed features per LLM-as-judge at strict thresholds (corroborated by human ratings on relevance/clarity/actionability), and yields 6-12 pp accuracy gains when seeded with guidelines. The work releases BrandGuide, pairing expert features with 1M+ assets across 2,683 brands, to support evaluation of expert alignment in automated feature engineering.
Significance. If the performance and alignment results hold after proper validation, the work would offer a concrete route to interpretable ML in high-stakes domains by turning qualitative expert criteria into operational features. The BrandGuide release is a clear strength, as it supplies the first public benchmark linking expert-designed features to large-scale assets and enables reproducible study of alignment; this directly addresses a gap in the field. The self-evolution mechanism and dual-stream design are presented as ways to avoid purely data-driven or purely manual approaches.
major comments (3)
- [Abstract and §5] Abstract and §5 (results): The headline claim of leading in 17/20 combinations with +4.2 pp mean gain supplies no error bars, no description of statistical testing, no ablation isolating the self-evolution component, and no details on hyperparameter selection or post-hoc exclusions. These omissions make the central performance claim difficult to evaluate for robustness.
- [§4.3] §4.3 (LLM-as-judge evaluation) and human study: The 60-80% coverage figures and human ratings on relevance/actionability are load-bearing for both the alignment claim and the downstream accuracy gains, yet no inter-rater reliability, prompt-validation experiment, or direct comparison against expert pairwise judgments on the same features is reported. If the judge operationalizes 'semantic alignment' differently from the original BrandGuide experts, the coverage numbers become an unvalidated proxy rather than evidence of true alignment.
- [§3 and §5.2] §3 (method) and §5.2 (seeded experiments): The reported 6-12 pp gains when seeding with expert guidelines rest on the assumption that the generated features are genuinely refining the qualitative criteria rather than adding volume or lexical overlap; no control experiment (e.g., random or non-evolved features of equal count) is described to isolate this effect.
minor comments (2)
- [§3] Notation for the dual-stream generation and tree evolution steps could be clarified with a single running example across figures.
- [§6] The BrandGuide dataset description would benefit from explicit statistics on feature density per brand and asset type distribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of statistical rigor, evaluation validation, and experimental controls that we will address to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (results): The headline claim of leading in 17/20 combinations with +4.2 pp mean gain supplies no error bars, no description of statistical testing, no ablation isolating the self-evolution component, and no details on hyperparameter selection or post-hoc exclusions. These omissions make the central performance claim difficult to evaluate for robustness.
Authors: We agree that the results presentation requires greater statistical detail. In the revised manuscript we will report error bars (standard deviation across multiple runs), describe the statistical testing procedure with p-values, add an ablation isolating the self-evolution component, and document hyperparameter selection together with any post-hoc decisions. revision: yes
-
Referee: [§4.3] §4.3 (LLM-as-judge evaluation) and human study: The 60-80% coverage figures and human ratings on relevance/actionability are load-bearing for both the alignment claim and the downstream accuracy gains, yet no inter-rater reliability, prompt-validation experiment, or direct comparison against expert pairwise judgments on the same features is reported. If the judge operationalizes 'semantic alignment' differently from the original BrandGuide experts, the coverage numbers become an unvalidated proxy rather than evidence of true alignment.
Authors: We acknowledge the value of additional validation. The revision will include inter-rater reliability statistics for the human study and a prompt-validation experiment. A direct pairwise comparison with the original BrandGuide experts would require new annotation effort outside the current scope; we will therefore discuss this limitation explicitly while retaining the existing human ratings on relevance, clarity, and actionability as corroborating evidence. revision: partial
-
Referee: [§3 and §5.2] §3 (method) and §5.2 (seeded experiments): The reported 6-12 pp gains when seeding with expert guidelines rest on the assumption that the generated features are genuinely refining the qualitative criteria rather than adding volume or lexical overlap; no control experiment (e.g., random or non-evolved features of equal count) is described to isolate this effect.
Authors: We agree that a control is needed to isolate the refinement effect. The revised experiments will add a control condition that compares the seeded, evolved features against an equal number of randomly generated or non-evolved features, thereby demonstrating that accuracy gains arise from iterative refinement rather than feature volume or lexical overlap. revision: yes
Circularity Check
No significant circularity; performance and coverage metrics are external evaluations
full rationale
The paper reports empirical results (17/20 wins, +4.2 pp mean gain, 60-80% LLM-as-judge coverage) on held-out tasks and a released dataset (BrandGuide) using standard classifier accuracy and semantic alignment proxies. No equations, fitted parameters, or self-citations are shown reducing these quantities to the method's own inputs by construction. The derivation chain consists of a described algorithm (dual-stream generation, deduplication, tree evolution) evaluated against independent baselines and expert references; the metrics are not redefined in terms of the same fitted values. This is the common case of a self-contained empirical paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
Nikhil Abhyankar, Parshin Shojaee, and Chandan K Reddy. Llm-fe: Automated feature engi- neering for tabular data with llms as evolutionary optimizers.arXiv preprint arXiv:2503.14434, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Ay¸ segül Acar, Naci Büyükda˘g, Burak Türten, Ersin Diker, and Gülsüm Çalı¸ sır. The role of brand identity, brand lifestyle congruence, and brand satisfaction on repurchase intention: a multi-group structural equation model.Humanities and Social Sciences Communications, 11 (1):1–13, 2024
2024
-
[3]
Adobe GenStudio for Performance Marketing: Brand Compliance
Adobe. Adobe GenStudio for Performance Marketing: Brand Compliance. https://busine ss.adobe.com/products/genstudio/performance-marketing/brand-compliance. html, 2024. Accessed: 2026-05-10
2024
-
[4]
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization.arXiv preprint arXiv:1907.02893, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[5]
João Eduardo Batista. Embedding domain-specific knowledge from llms into the feature engineering pipeline.arXiv preprint arXiv:2503.21155, 2025
-
[6]
Canva Brand Kit
Canva. Canva Brand Kit. https://www.canva.com/en_in/pro/brand-kit/ , 2024. Accessed: 2026-05-10
2024
-
[7]
Match.com ad criticised for suggesting red hair and freckles ’imperfections’
Elena Cresci. Match.com ad criticised for suggesting red hair and freckles ’imperfections’. The Guardian, April 2016. URL https://www.theguardian.com/media/2016/apr/11/matc hcom-ad-criticised-for-suggesting-red-hair-and-freckles-imperfections
2016
-
[8]
Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. The delta learning hypothesis: Preference tuning on weak data can yield strong gains.arXiv preprint arXiv:2507.06187, 2025
-
[9]
Sungwon Han, Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Large language models can automatically engineer features for few-shot tabular learning.arXiv preprint arXiv:2404.09491, 2024
-
[10]
The autofeat python library for automated feature engineering and selection
Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 111–120. Springer, 2019
2019
-
[11]
macmillan, 2011
Daniel Kahneman.Thinking, fast and slow. macmillan, 2011
2011
-
[12]
Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003
Kevin Lane Keller. Brand synthesis: The multidimensionality of brand knowledge.Journal of consumer research, 29(4):595–600, 2003
2003
-
[13]
Measuring and improving engagement of text-to-image generation models
Varun Khurana, Yaman Singla, Jayakumar Subramanian, Changyou Chen, Rajiv Ratn Shah, Zhiqiang Xu, and Balaji Krishnamurthy. Measuring and improving engagement of text-to-image generation models. InInternational Conference on Learning Representations, volume 2025, pages 38273–38304, 2025
2025
-
[14]
Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). InInternational conference on machine learning, pages 2668–2677. PMLR, 2018
2018
-
[15]
Ferg-llm: Feature en- gineering by reason generation large language models
Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, and Kyunam Lee. Ferg-llm: Feature en- gineering by reason generation large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 4211–4228, 2025
2025
-
[16]
Concept bottleneck models
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational conference on machine learning, pages 5338–5348. PMLR, 2020
2020
-
[17]
A unified approach to interpreting model predictions
Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. 10 Adobe Media & Data Science Research
2017
-
[18]
Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023
2023
-
[19]
Felix: Automatic and interpretable feature engineering using llms
Simon Malberg, Edoardo Mosca, and Georg Groh. Felix: Automatic and interpretable feature engineering using llms. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 230–246. Springer, 2024
2024
-
[20]
Brand consistency—the competitive advantage and how to achieve it
Marq (formerly Lucidpress). Brand consistency—the competitive advantage and how to achieve it. Blog post, originally published 2018-2019, 2024. URL https://www.marq.com/blo g/brand-consistency-competitive-advantage/ . Survey of 400+ brand management experts
2018
-
[21]
Label-free concept bottleneck models
Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models.arXiv preprint arXiv:2304.06129, 2023
-
[22]
GPT-4o mini: Advancing cost-efficient intelligence
OpenAI. GPT-4o mini: Advancing cost-efficient intelligence. OpenAI Blog, 2024. URL https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ . Accessed: January 2026
2024
-
[23]
why should i trust you?
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016
2016
-
[24]
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019
Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature machine intelligence, 1(5):206–215, 2019
2019
-
[25]
Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.arXiv preprint arXiv:1911.08731, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[26]
A bayesian approach to filtering junk e-mail
Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A bayesian approach to filtering junk e-mail. InLearning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105. Madison, Wisconsin, 1998
1998
-
[27]
Grad-cam: Visual explanations from deep networks via gradient-based localization
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017
2017
-
[28]
Elsevier, 2012
Edward Shortliffe.Computer-based medical consultations: MYCIN, volume 2. Elsevier, 2012
2012
-
[29]
Dreaddit: A reddit dataset for stress analysis in social media
Elsbeth Turcan and Kathleen McKeown. Dreaddit: A reddit dataset for stress analysis in social media. InProceedings of the tenth international workshop on health text mining and information analysis (LOUHI 2019), pages 97–107, 2019
2019
-
[30]
An exploration of pattern mining with chatgpt
Michael Weiss. An exploration of pattern mining with chatgpt. InProceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices, pages 1–11, 2024
2024
-
[31]
Openfe: Automated feature generation with expert-level performance
Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: Automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023
2023
-
[32]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023
2023
-
[34]
Hypothesis generation with large language models
Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. InProceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 117–139, 2024. 11 Adobe Media & Data Science Research A Related Work
2024
-
[35]
Manual Feature Engineering.Early AI and expert systems relied on manually crafted, human- interpretable features and rules. Classic examples include the MYCIN medical diagnosis system, which encoded clinical knowledge as explicit features and logical rules [28], and early spam filters that used hand-designed counts and lexical features to achieve practica...
-
[36]
Classical Automated Feature Engineering.Classical automated feature engineering systems construct large candidate pools via predefined transformations and then select or prune useful features. Representative systems (e.g., AutoFeat and OpenFE) enumerate combinations, nonlinear transforms, or symbolic expressions and rely on statistical selection, regulari...
-
[37]
Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically
LLM-Based Feature Engineering.LLMs have recently been used as feature proposers in both structured and unstructured settings. Structured data.LLMs have been applied to structured and tabular settings where feature transforma- tions can be expressed programmatically. [9] propose FeatLLM, an in-context prompting approach that elicits interpretable rule-styl...
-
[38]
CBMs require predefined concept sets with labeled annotations, limiting scalability
Concept Bottleneck Models.Concept Bottleneck Models (CBMs) pursue a complementary goal: mapping raw inputs to human-interpretable intermediate concepts before prediction [ 16]. CBMs require predefined concept sets with labeled annotations, limiting scalability. Label-Free CBMs address this by using foundation models to automatically generate concept sets ...
-
[39]
uses inclusive language,
Inherent Interpretability vs. Post-hoc Explanation.Rudin [24] argues that high-stakes do- mains should use inherently interpretable models rather than explaining black boxes post hoc, since post-hoc methods (LIME, SHAP, GradCAM) produce approximate attributions that may not faith- fully reflect model reasoning [23, 17, 27]. FEST aligns with this principle...
-
[40]
Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks
Iterative Self-Refinement.FEST’s generate-evaluate-refine loop connects to the broader paradigm of iterative self-improvement in LLM systems. Self-Refine [18] demonstrated that LLMs can generate output, critique it, and iteratively improve, achieving 20% average gains across diverse tasks. FEST applies an analogous loop to feature engineering: LLMs genera...
-
[41]
[33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs
LLM-as-Judge Evaluation.Zheng et al. [33] established that strong LLMs can approximate human judgment with >80% agreement, enabling scalable evaluation of open-ended outputs. FEST adopts this paradigm for expert coverage evaluation: GPT-4o judges whether discovered features capture the same brand dimensions as expert guidelines, providing interpretable 0–...
-
[42]
Comparison with FEST:While prior work demonstrates the promise of LLMs for feature discovery across both unstructured and structured settings, FEST departs from these approaches along several dimensions. First, FEST targets raw multimodal observational data (text, images, and tabular inputs) by prompting LLMs to generateinterpretablefeature candidates dir...
-
[43]
Connection to Robust Optimization.FEST shares a conceptual affinity with Invariant Risk Minimization (IRM) [4] and Group DRO [25]: both seek features stable across environments rather than spuriously correlated with labels. FEST’s iterative pruning of low-importance features echoes invariance-seeking, as features that do not generalize across batches are ...
-
[44]
All tech overlays should be set in EY Yellow to create a clear brand connection
Unchanged Expert Features( Fseed): features retained verbatim because they are already precise and actionable (e.g., “All tech overlays should be set in EY Yellow to create a clear brand connection”)
-
[45]
Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery
Refined Expert Features( F ′ seed): ambiguous guidelines transformed into precise operational definitions. For example, “Avoid implausible and fantastical imagery. Avoid unoriginal or unsophisticated imagery. . . ” was refined to “Emphasize realistic, grounded photography over abstract or cartoonish visuals for relatable professional depictions, featuring...
-
[46]
Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles
Augmented Features( Fnew): novel features discovered by FEST that extend beyond documented guidelines, surfacing implicit domain knowledge (e.g., “Feature mid shots for balanced com- position and personal connection, maintaining consistent eye-level perspective while avoiding extreme angles” for EY images). L Variance Analysis We run 3 independent seeds o...
-
[47]
FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts
Cluster summarization: Felix selects the feature closest to each cluster centroid, preserving idiosyncratic LLM phrasing. FEST summarizes clusters into canonical descriptions, producing language that converges toward how experts naturally express concepts
-
[48]
uses positive language
Iterative pruning: Felix is one-shot, so all features survive regardless of quality. FEST’s tree- guided importance scores identify and prune generic features (e.g., “uses positive language”) that lack discriminative power. Since expert features are inherently discriminative (experts select what distinguishes classes), iterative selection for discriminati...
-
[49]
Zero-tolerance for hate speech,
Multi-stage language refinement: FEST edits feature language at discovery, during cluster summarization, and during bank merging across iterations. Each consolidation step distills toward more precise, canonical formulations, while Felix’s single-pass centroid-pick cannot improve feature framing based on feedback from data. P Uncovered Guidelines Audit An...
-
[50]
Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains
Same-LLM baselines: Zero-Shot and Few-Shot baselines use the identical LLM on the same con- tent. Any memorization benefit applies equally to all methods, yet FEST consistently outperforms these baselines, indicating methodological gains
-
[51]
Contamination-implausible datasets: FEST achieves strong results on Dreaddit (Reddit posts about stress) and GPT-generated content detection. Reddit posts are unlikely to be memorized 22 Adobe Media & Data Science Research in their task-specific labels, and GPT-generated content detection requires distinguishing model outputs from human writing, not recal...
-
[52]
uses emotional storytelling,
Feature interpretability: FEST’s features are expressed as natural-language descriptions or short executable functions and can be inspected for face validity. The discovered features (e.g., “uses emotional storytelling,” “sentence length variance”) are domain-meaningful, not artifacts of memorization. We acknowledge this as a limitation inherent to all cl...
-
[53]
Initial collection yielded 3,466 candidate entries spanning 1963–2025
Data Acquisition: We systematically collected brand guidelines from the web, extracting rich metadata including publication year, geographic region, language, and sector tags. Initial collection yielded 3,466 candidate entries spanning 1963–2025. 25 Adobe Media & Data Science Research
1963
-
[54]
Temporal Filtering: To ensure contemporary relevance and consistency in design conventions, we filtered entries to the 2014–2025 timeframe, yielding 2,683 brands that reflect modern digital-first brand systems
2014
-
[55]
Guideline Extraction: Each document undergoes structured parsing to extract design specifi- cations as text including color codes (HEX, RGB, CMYK, Pantone), typography hierarchies (primary/secondary typefaces, weights, sizes), logo clearance rules (minimum sizes, spacing requirements), and usage constraints (approved/prohibited applications)
-
[56]
We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints
Visual Asset Retrieval: For each brand, we retrieve imagery through web search using brand name and relevant keywords. We collect real-world logo applications, color implementations, marketing collateral, and brand touchpoints. This process yielded approximately1Mbrand images and textual descriptions across all entries
-
[57]
Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand
Manual Verification: Each stage incorporates human review to ensure annotation accuracy, filter malformed entries, and validate guideline-image alignment. Annotators verified that ex- tracted specifications match source documents and that retrieved images accurately represent the corresponding brand. V .2 Quality Assurance To maintain dataset quality, we ...
2014
-
[58]
+ 0.0722*c[2] L1, L2 = lum(dominant), lum( background) lo, hi = min(L1, L2), max(L1, L2) return (hi + 0.05) / (lo + 0.05) 30 Adobe Media & Data Science Research Feature Type Code Count of distinct visual elements via HSV color segmentation (food variety and abun- dance). DE def extract_feature(image_path: str): if not image_path or not isinstance(image_pa...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.