pith. machine review for the scientific record.

arxiv: 2605.13484 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · stat.ME

Recognition: 2 theorem links · Lean Theorem

Discovery of Hidden Miscalibration Regimes

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ME
keywords calibration · miscalibration regimes · large language models · local calibration · kernel smoothing · representation learning · model reliability

The pith

Calibration errors in LLMs depend on input type and can be found without predefined groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard calibration checks, which look only at confidence scores, miss how models fail differently on different inputs. It introduces a way to learn a special view of the inputs and smooth the errors within that view to reveal local over- or underconfidence. This matters because it lets us fix calibration where global methods fall short, as shown on real LLM tasks. A reader would care if they want models whose reliability can be trusted on specific kinds of questions rather than on average.

Core claim

The authors formalise the discovery of hidden miscalibration regimes by defining a miscalibration field, estimated via a learned calibration-aware representation and kernel smoothing in that geometry. They report that input-dependent calibration heterogeneity is prevalent across benchmarks, and that local corrections built from these fields reduce error more effectively than global confidence-based methods in miscalibrated regions.

What carries the argument

The miscalibration field: a learned representation of the input space combined with kernel smoothing to estimate signed local miscalibration without predefined data slices.
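
As a concrete, entirely illustrative rendering of this machinery, the field can be sketched as a Nadaraya–Watson smoother over signed residuals in a given representation. The function name, kernel, and default values below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def miscalibration_field(phi_q, phi_tr, conf, correct, sigma=0.5, lam=1e-3):
    """Nadaraya-Watson estimate of signed local miscalibration
    (confidence minus correctness) at each query point, smoothed in a
    representation space. Positive values flag local overconfidence,
    negative values local underconfidence. `lam` is a small mass
    regulariser that shrinks the estimate toward 0 in sparse regions.
    Hypothetical sketch: the paper's estimator may differ in detail."""
    # Pairwise squared distances in the (given) learned geometry
    d2 = ((phi_q[:, None, :] - phi_tr[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma**2))   # Gaussian kernel weights
    resid = conf - correct               # signed pointwise residuals
    return (w @ resid) / (w.sum(axis=1) + lam)
```

On synthetic clusters with known over- and underconfidence, an estimator of this shape recovers the sign and rough magnitude of each regime's residual without being told where the regimes are.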

If this is right

  • Local corrections using the field reduce calibration error in regions where models are systematically miscalibrated.
  • These fields outperform isotonic regression and temperature scaling in those specific regions.
  • Input-dependent heterogeneity appears consistently across four benchmarks and twelve LLMs.
  • The approach works without access to predefined data slices or additional supervision.
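
The local-correction step in the first bullet amounts to shifting each confidence by its estimated signed miscalibration. A minimal sketch, assuming a field estimate is already in hand (the paper's exact correction rule may differ):

```python
import numpy as np

def apply_local_correction(conf, delta_hat, eps=1e-6):
    """Shift each confidence by its estimated signed local
    miscalibration and clip back into (0, 1). Where the field says
    overconfident (delta_hat > 0) confidence goes down; where it says
    underconfident (delta_hat < 0) confidence goes up. Illustrative
    sketch only, not the authors' implementation."""
    return np.clip(conf - delta_hat, eps, 1.0 - eps)
```

For example, `apply_local_correction(np.array([0.9, 0.3]), np.array([0.4, -0.4]))` pulls an overconfident 0.9 down toward 0.5 and an underconfident 0.3 up toward 0.7 — a global temperature could not move the two in opposite directions.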

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed models could use input-based clustering to apply different corrections dynamically.
  • The method might extend to identifying reliability issues in non-LLM models like classifiers in other domains.
  • New benchmarks could test calibration not just globally but across discovered regimes.

Load-bearing premise

A calibration-aware representation of the inputs exists such that kernel smoothing in it accurately recovers the signed local miscalibration.

What would settle it

If local confidence corrections based on the discovered fields failed to reduce calibration error more than global methods such as temperature scaling in the identified miscalibrated regions, the approach's utility would be called into question.

Figures

Figures reproduced from arXiv: 2605.13484 by Katarzyna Kobalczyk, Mihaela van der Schaar.

Figure 1: Hidden miscalibration regimes. Left: Aggregating predictions solely by confidence scores produces a reliability diagram that suggests near-perfect calibration and low calibration error (smECE). Right: The learned representation map ϕ induces a geometry revealing the hidden regions of overconfidence (blue) and underconfidence (red). Learning a miscalibration field. To formalise this discovery problem, we de… view at source ↗
Figure 2: (a, b) Clustered regimes show that global reliability can hide opposing local errors; (c) … view at source ↗
Figure 3: Proxy ranking diagnostics. Left: Spearman correlation vs. regret. Right: oracle spread vs. Spearman correlation. Proxy-based hyperparameter selection. The previous evaluations use the known field δ(x) only for reporting test recovery. In practice, however, δ(x) is unobserved, so hyperparameters such as the bandwidth σ and mass-regularisation strength λ must be chosen from the observed triples (xi, fi, … view at source ↗
Figure 4: Discovered calibration regimes on HH-RLHF and Qwen3-8B. Although the aggregate reliability diagram suggests only moderate miscalibration, conditioning on δ̂ϕ (last two panes) reveals substantially stronger and opposing calibration errors across input regions. The discovered over- and underconfident subsets expose structure that is not captured by the original dataset slices alone… view at source ↗
Figure 5: The learned fields exhibit structured variability (high-heterogeneity points in 5a). Sign and … view at source ↗
Figure 6: Improvement in smECE relative to the base model, aggregated across LLMs. Positive values indicate improved calibration (lower smECE). Confidence-based methods (ISO and TS) improve on the global scale (all data). Our method consistently reduces calibration error across all miscalibrated regions (over and under) while not degrading performance on well-calibrated data (good). Actionability: local correction. … view at source ↗
Figure 7: Relationship between miscalibration heterogeneity, calibration gains, and model scale. When does local calibration help? We also examine when input-dependent calibration provides the largest advantage. Figure 7a shows that the benefit of our method over ISO on the worst-calibrated slice increases with the standard deviation of the learned miscalibration field. This relationship supports the central claim o… view at source ↗
Figure 8: Global calibration appears accurate due to cancellation. The learned field reveals cluster… view at source ↗
Figure 9: Recovering hidden calibration regimes for a continuous miscalibration field. view at source ↗
Figure 10: Interaction between kernel bandwidth σ and mass regularisation λ on sinusoidal miscalibration fields. Test Corr(δ̂ϕ(X), δ(X)) across a grid of (σ, λ), with best settings per dataset highlighted. Small bandwidths offer higher spatial resolution but become unstable when λ is too small due to low neighbourhood mass. Moderate regularisation stabilises performance, while large σ is bias-dominated and largely … view at source ↗
Figure 11: Proxy ranking diagnostics. Left: Spearman correlation vs. regret. Right: oracle spread vs. Spearman correlation. view at source ↗
Figure 12: Pointwise residual regression is more sensitive to regime complexity and noise. view at source ↗
Figure 13: Miscalibration map. Mean and standard deviation of the learned miscalibration fields δ̂ϕ. Annotated with model size. view at source ↗
Figure 15: Post-hoc interpretation of the learned calibration geometry on MMLU-Pro with Qwen3-8B. (a) UMAP of raw LLM embeddings, colored by subject category (left) and by δ̂ϕ (right). (b) UMAP of the learned calibration representation ϕ(x) with the same colorings. Raw embeddings exhibit structure partly aligned with subject categories, whereas the learned representation separates examples primarily according to th… view at source ↗
Figure 16: Detailed results for HH-RLHF. view at source ↗
Figure 17: Detailed results for MedMCQA. view at source ↗
Figure 18: Detailed results for MMLU. view at source ↗
Figure 19: Detailed results for MMLU-Pro. view at source ↗
read the original abstract

Calibration is commonly evaluated by comparing model confidence with its empirical correctness, implicitly treating reliability as a function of the confidence score alone. However, this view can hide substantial structure: models may be systematically overconfident on some kinds of inputs and underconfident on others, causing global reliability diagnostics to obscure localised calibration failures. To address this, we formulate the problem of discovering hidden miscalibration regimes without assuming access to predefined data slices. We define the corresponding miscalibration field and propose a diagnostic framework for estimating it. Our approach learns a calibration-aware representation of the input space and estimates signed local miscalibration by kernel smoothing in the learned geometry. Across four real-world LLM benchmarks and twelve LLMs, we find that input-dependent calibration heterogeneity is prevalent. We further show that the discovered fields are actionable: they support local confidence correction and reduce calibration error in systematically miscalibrated regions where confidence-based methods such as isotonic regression and temperature scaling are less effective.
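
The abstract's central point, that global diagnostics can hide opposing local errors through cancellation, is easy to reproduce with a toy binned-ECE computation (a simple stand-in for the paper's smECE; all data synthetic):

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            total += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return total

# Two hypothetical regimes sharing the same confidence level:
# regime A is underconfident (90% correct at 0.7), regime B is
# overconfident (50% correct at 0.7). Globally the errors cancel.
conf = np.full(1000, 0.7)
correct = np.r_[np.ones(450), np.zeros(50),    # regime A: 90% correct
                np.ones(250), np.zeros(250)]   # regime B: 50% correct
```

Here `ece(conf, correct)` is essentially zero, while each regime on its own carries a 0.2 calibration gap — exactly the cancellation the learned field is designed to expose.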

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that input-dependent calibration heterogeneity is prevalent in LLMs and can be discovered without predefined data slices by learning a calibration-aware representation of the input space, then estimating a signed miscalibration field via kernel smoothing in that geometry. It reports empirical evidence of this heterogeneity across four real-world LLM benchmarks and twelve models, and shows that the discovered fields support local confidence correction that reduces calibration error more effectively than global methods such as isotonic regression and temperature scaling in systematically miscalibrated regions.

Significance. If the learned representation plus kernel smoothing reliably recovers genuine signed local miscalibration rather than method artifacts, the work would meaningfully advance calibration diagnostics beyond confidence-only or global approaches, with direct implications for improving LLM reliability in heterogeneous input regimes. The scale of the empirical evaluation across multiple benchmarks and models strengthens the case for prevalence and actionability, provided the recovery step is validated.

major comments (3)
  1. [§3] Diagnostic framework: The central claim that kernel smoothing in the learned calibration-aware representation recovers true signed local miscalibration lacks ground-truth validation or predefined slices; the only supporting evidence is superior performance of local correction over global baselines on the same benchmarks, which leaves open the possibility that the representation merely enables better residual fitting rather than identifying genuine input-dependent structure.
  2. [Results] Empirical evaluation: The reported reductions in calibration error for local correction are presented without confidence intervals, statistical significance tests, or ablation on the representation learning and kernel bandwidth choices, making it difficult to assess whether the actionability advantage is robust or sensitive to hyperparameter settings.
  3. [Definition] Miscalibration field: The framework treats the field as recoverable from unlabeled inputs via the learned geometry, but without an explicit consistency check or cross-validation against held-out local correctness estimates, the prevalence conclusion rests on an unverified recovery assumption.
minor comments (2)
  1. [Method] The abstract and method sections should explicitly list the kernel family, bandwidth selection procedure, and representation learning architecture (including any hyperparameters) to enable reproducibility.
  2. [Figures] Figure captions for the discovered fields should include the specific benchmark, model, and smoothing parameters used to generate each visualization.
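
The uncertainty quantification requested in major comment 2 could be supplied by a paired percentile bootstrap over examples. A sketch, using the absolute confidence-accuracy gap as a deliberately simple stand-in for the paper's calibration metric:

```python
import numpy as np

def paired_bootstrap_ci(stat, conf_a, conf_b, y, n_boot=2000,
                        alpha=0.05, seed=0):
    """Percentile-bootstrap CI for stat(conf_a, y) - stat(conf_b, y),
    resampling examples jointly so the comparison stays paired.
    `stat` is any calibration-error functional; smECE would be
    substituted in practice."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample with replacement
        diffs[i] = stat(conf_a[idx], y[idx]) - stat(conf_b[idx], y[idx])
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# Stand-in metric: gap between mean confidence and empirical accuracy.
gap = lambda c, y: abs(c.mean() - y.mean())
```

If the interval for (global-method error minus local-method error) stays above zero, the claimed advantage survives resampling noise; reporting such intervals per region would address the referee's robustness concern directly.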

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our validation approach and indicate revisions that will strengthen the empirical support for the framework.

read point-by-point responses
  1. Referee: [§3] Diagnostic framework: The central claim that kernel smoothing in the learned calibration-aware representation recovers true signed local miscalibration lacks ground-truth validation or predefined slices; the only supporting evidence is superior performance of local correction over global baselines on the same benchmarks, which leaves open the possibility that the representation merely enables better residual fitting rather than identifying genuine input-dependent structure.

    Authors: We agree that direct ground-truth validation is challenging in a discovery setting without predefined slices, which is precisely the motivation for the framework. The calibration-aware representation is explicitly optimized using observed correctness signals to induce a geometry in which kernel smoothing recovers the signed field; the local correction experiments then demonstrate that applying corrections according to this field yields larger error reductions precisely in the identified regions, an outcome that would not be expected from generic residual fitting. To further address the concern, we will add synthetic experiments with known ground-truth miscalibration fields in the revised manuscript and expand §3 with a discussion of why the observed structure is unlikely to be an artifact of the fitting procedure alone. revision: yes

  2. Referee: [Results] Empirical evaluation: The reported reductions in calibration error for local correction are presented without confidence intervals, statistical significance tests, or ablation on the representation learning and kernel bandwidth choices, making it difficult to assess whether the actionability advantage is robust or sensitive to hyperparameter settings.

    Authors: We accept this criticism. The revised manuscript will include bootstrap confidence intervals for all reported calibration-error reductions, paired statistical significance tests between local and global correction methods, and systematic ablations on both the representation-learning objective and kernel bandwidth, with results reported across the full set of benchmarks and models. revision: yes

  3. Referee: [Definition] Miscalibration field: The framework treats the field as recoverable from unlabeled inputs via the learned geometry, but without an explicit consistency check or cross-validation against held-out local correctness estimates, the prevalence conclusion rests on an unverified recovery assumption.

    Authors: The field is estimated from the learned geometry on held-out inputs after the representation is trained on calibration data. To make the recovery assumption explicit and verifiable, we will add a cross-validation procedure that splits the labeled data, estimates the field on one subset, and measures agreement with local correctness estimates on the complementary held-out subset; the corresponding correlation metrics and consistency results will be reported in the revised results section. revision: yes
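
The cross-validation promised in response 3 can be prototyped as a split-half agreement test: estimate the field from two disjoint halves of the labelled data and check rank agreement at shared query points. A sketch with a synthetic one-dimensional field, not the authors' procedure:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties, which holds for continuous inputs)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

def split_half_field_agreement(phi, resid, query, sigma=0.1, seed=0):
    """Kernel-smooth the signed residuals from each half of the data
    separately and return rank agreement of the two field estimates
    at common query points. High agreement supports the claim that
    the recovered field is signal rather than fitting noise."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(resid))
    fields = []
    for h in (idx[: len(idx) // 2], idx[len(idx) // 2 :]):
        d2 = ((query[:, None, :] - phi[h][None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2.0 * sigma**2))
        fields.append((w @ resid[h]) / (w.sum(axis=1) + 1e-3))
    return spearman(fields[0], fields[1])
```

A genuine input-dependent regime yields high agreement between halves, while a field that merely overfits residual noise would decorrelate.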

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via external empirical validation

full rationale

The paper defines a miscalibration field and a diagnostic framework that learns a calibration-aware input representation followed by kernel smoothing to estimate signed local miscalibration. These steps are presented as a proposed method rather than a reduction of outputs to fitted inputs or self-citations by construction. Central claims of prevalence and actionability rest on results across four external LLM benchmarks and twelve models, compared against independent baselines such as isotonic regression and temperature scaling. No load-bearing step equates a prediction to its own defining fit or renames a known pattern; the approach is validated outside its own parameters.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework introduces the miscalibration field as a new modeling object and relies on unspecified hyperparameters for representation learning and kernel smoothing; the geometry assumption is domain-level rather than derived.

free parameters (2)
  • kernel bandwidth / smoothing parameter
    Controls locality of miscalibration estimation; value must be chosen or tuned but is not reported in the abstract
  • representation learning hyperparameters
    Parameters governing the calibration-aware embedding of inputs; required for the geometry used by smoothing
axioms (1)
  • domain assumption: A calibration-aware representation of the input space exists in which kernel smoothing recovers signed local miscalibration
    Invoked as the basis for the diagnostic framework
invented entities (1)
  • miscalibration field (no independent evidence)
    purpose: To represent signed local miscalibration as a continuous function over the input space
    New modeling construct introduced to capture hidden regimes
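
Both free parameters in the ledger can at least be tuned without the unobserved true field, by held-out residual prediction error — a crude, fully observable proxy in the spirit of the paper's proxy-based selection, not its actual criterion:

```python
import numpy as np

def select_hyperparams(phi, resid, sigmas, lams, seed=0):
    """Grid-search (sigma, lambda) for a Gaussian-kernel smoother by
    squared error of held-out residual prediction. The true field
    delta(x) is never needed: only observed residuals are used.
    Illustrative sketch, not the paper's selection rule."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(resid))
    tr, va = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    d2 = ((phi[va][:, None, :] - phi[tr][None, :, :]) ** 2).sum(-1)
    best, best_err = None, np.inf
    for s in sigmas:
        w = np.exp(-d2 / (2.0 * s**2))
        for lam in lams:
            pred = (w @ resid[tr]) / (w.sum(axis=1) + lam)
            err = np.mean((pred - resid[va]) ** 2)
            if err < best_err:
                best, best_err = (s, lam), err
    return best
```

On a structured field, such a proxy reliably rejects grossly oversmoothed bandwidths, mirroring the bias-dominated large-σ regime the paper's Figure 10 describes.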

pith-pipeline@v0.9.0 · 5465 in / 1463 out tokens · 68055 ms · 2026-05-14T20:42:31.626023+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. …
  2. [2] J. Blasiok and P. Nakkiran. Smooth ECE: Principled reliability diagrams via kernel smoothing. In The Twelfth International Conference on Learning Representations, 2024.
  3. [3] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, June 2017.
  4. [4] Y. Chung, T. Kraska, N. Polyzotis, K. H. Tae, and S. E. Whang. Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach. IEEE Transactions on Knowledge & Data Engineering, 32(12):2284–2296, Dec. 2020.
  5. [5] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
  6. [6] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32(1):12–22, 1983.
  7. [7] S. Eyuboglu, M. Varma, K. K. Saab, J.-B. Delbrouck, C. Lee-Messer, J. Dunnmon, J. Zou, and C. Re. Domino: Discovering systematic errors with cross-modal embeddings. In International Conference on Learning Representations, 2022.
  8. [8] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017.
  9. [9] D. Hansen, S. Devic, P. Nakkiran, and V. Sharan. When is multicalibration post-processing necessary? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  10. [10] B. He, L. Yin, H. Zhen, S. Liu, H. Wu, X. Zhang, M. Yuan, and C. Ma. Preserving LLM capabilities through calibration data curation: From analysis to optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  11. [11] U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1939–1948. PMLR, 10–15 Jul 2018.
  12. [12] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  13. [13] W. Hua, L. Jin, L. Song, H. Mi, Y. Zhang, and D. Yu. Discover, explain, improve: An automatic slice detection benchmark for natural language processing. Transactions of the Association for Computational Linguistics, 11:1537–1552, 2023.
  14. [14] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, page 604–613, New York, NY, USA, 1998. Association for Computing Machinery.
  15. [15] S. Kapoor, N. Gruver, M. Roberts, A. Pal, S. Dooley, M. Goldblum, and A. Wilson. Calibration-tuning: Teaching large language models to know what they don't know. In R. Vázquez, H. Celikkanat, D. Ulmer, J. Tiedemann, S. Swayamdipta, W. Aziz, B. Plank, J. Baan, and M.-C. de Marneffe, editors, Proceedings of the 1st Workshop on Uncertainty-Aware NLP (Uncerta…
  16. [16] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. In C. H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pages 43:1–43:23, Dagstuhl, Germany, 2017. Schloss Dagstuhl – Leibniz-Zen…
  17. [17] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
  18. [18] P. Nakkiran, A. Bradley, A. Golinski, E. Ndiaye, M. Kirchhof, and S. Williamson. Trained on tokens, calibrated on concepts: The emergence of semantic calibration in LLMs. In The Fourteenth International Conference on Learning Representations, 2026.
  19. [19] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
  20. [20] J. Otey, L. Biester, and S. R. Wilson. Representing and clustering errors in offensive language detection. In A. Ebrahimi, S. Haider, E. Liu, S. Haider, M. Leonor Pacheco, and S. Wein, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: …
  21. [21] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In G. Flores, G. H. Chen, T. Pollard, J. C. Ho, and T. Naumann, editors, Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, …
  22. [22] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
  23. [23] S. Sagadeeva and M. Boehm. SliceLine: Fast, linear-algebra-based slice finding for ML model debugging. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21, page 2290–2299, New York, NY, USA, 2021. Association for Computing Machinery.
  24. [24] C. J. Stone. Consistent Nonparametric Regression. The Annals of Statistics, 5(4):595–620, 1977.
  25. [25] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
  26. [26] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 26(4):359–372, 1964.
  27. [27] J. Xiao, B. Hou, Z. Wang, R. Jin, Q. Long, W. J. Su, and L. Shen. Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach. In Forty-second International Conference on Machine Learning, 2025.
  28. [28] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, page 609–616, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
  29. [29] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, page 694–699, New York, NY, USA, 2002. Association for Computing Machinery.
  30. [30] M. Zhang, P. Injer, Y. Wald, and E. Creager. Active slice discovery in large language models. In NeurIPS 2025 Workshop: Reliable ML from Unreliable Data, 2025.
  31. [31] C. Zhu, B. Xu, Q. Wang, Y. Zhang, and Z. Mao. On the calibration of large language models and alignment. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9778–9795, Singapore, Dec. 2023. Association for Computational Linguistics.