Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
Pith reviewed 2026-06-26 21:03 UTC · model grok-4.3
The pith
One accredited BSc computer science program covers roughly half of the knowledge units in both the CS2013 and CS2023 guidelines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps from differences that reflect the standard's evolution.
What carries the argument
A human-in-the-loop pipeline that represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition.
If this is right
- Coverage levels have remained nearly constant over ten years.
- Competency articulation reaches about 88 percent of covered units under both guidelines.
- Cognitive depth delivery drops from 95 percent under CS2013 to 76 percent under CS2023.
- Certain areas such as parallel and distributed computing stay uncovered against both guidelines and ABET criteria.
- The pipeline can be reused for other programs and later guideline versions.
Where Pith is reading between the lines
- Other programs could run the same mapping process on their own curricula to track alignment as standards change.
- Aggregating results across multiple programs might reveal which gaps are common rather than institution-specific.
- The finding that a small sentence model outperformed a long-context model on this task suggests retriever choice needs case-by-case testing.
Load-bearing premise
Human raters applying the explicit coverage definition produce reliable and reproducible mappings between courses and knowledge units.
What would settle it
A third independent rater producing substantially different coverage percentages for the same program and guidelines would show the reported alignments are not reproducible.
Original abstract
Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program's coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen's kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard's evolution. The instrument is reusable and available from the authors on request.
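To make the fusion step named in the abstract concrete, here is a minimal sketch of reciprocal rank fusion. It is not the authors' implementation: the knowledge-unit IDs, the three input rankings, and the constant k=60 (the value used by Cormack et al.) are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of candidate IDs into one ranking.

    Each candidate scores sum(1 / (k + rank)) over the lists that
    contain it, with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical output of three retrievers for one course description.
retriever_rankings = [
    ["KU-A", "KU-B", "KU-C"],
    ["KU-B", "KU-A", "KU-D"],
    ["KU-B", "KU-C", "KU-A"],
]
fused = reciprocal_rank_fusion(retriever_rankings)
print(fused)
```

In a retrieve-then-confirm design, a fused list like this would only propose candidates; the human rater still decides which of them count as covered.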
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a human-in-the-loop retrieve-then-confirm pipeline that maps a single accredited CS program's courses to the knowledge units of CS2013 and CS2023, reporting 49.7% coverage of CS2023 units and 50.9% of CS2013 units, ~88% competency articulation for covered units under both, and recommended cognitive depth met for 76% of present units under CS2023 versus 95% under CS2013; the longitudinal comparison attributes the depth gap to raised expectations in the newer guideline while identifying persistent uncovered areas such as parallel computing.
Significance. If the mappings prove stable, the work supplies a reusable, benchmarked instrument for longitudinal curriculum alignment that separates program structure from guideline evolution and is offered to other programs; the retriever benchmarking (showing a small sentence model outperforming a long-context model) and explicit coverage definition are concrete contributions to empirical curriculum research.
major comments (1)
- [validation and inter-rater agreement paragraph] The central percentages (49.7%, 50.9%, 76%, 95%) are simple ratios of binary human confirmations performed after retrieval; the reported Cohen's kappa values of 0.64 (CS2023) and 0.69 (CS2013) indicate only moderate agreement between two raters, yet the manuscript provides no sensitivity analysis, per-unit disagreement counts, third-rater adjudication, or raw match data to demonstrate that modest shifts in a few dozen judgments would not move the headline figures by several points.
minor comments (2)
- [results] No error bars or confidence intervals accompany the reported percentages despite the reliance on human judgment.
- [methods] Full exclusion criteria for non-matches and the complete list of candidate matches before human confirmation are not supplied.
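One possible way to attach an interval to a coverage percentage, as the first minor comment asks, is a Wilson score interval for a binomial proportion. The sketch below is an assumption-laden illustration: the counts (50 covered units out of 100) are hypothetical, since the page does not give the denominators, and the interval treats units as independent trials, so it does not capture rater disagreement.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts, not taken from the paper.
low, high = wilson_interval(successes=50, total=100)
print(f"coverage 50.0%, interval {low:.1%} to {high:.1%}")
```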
Simulated Author's Rebuttal
We thank the referee for the constructive comment on validation. We address the concern about inter-rater agreement and robustness below.
Referee: The central percentages (49.7%, 50.9%, 76%, 95%) are simple ratios of binary human confirmations performed after retrieval; the reported Cohen's kappa values of 0.64 (CS2023) and 0.69 (CS2013) indicate only moderate agreement between two raters, yet the manuscript provides no sensitivity analysis, per-unit disagreement counts, third-rater adjudication, or raw match data to demonstrate that modest shifts in a few dozen judgments would not move the headline figures by several points.
Authors: We agree that the moderate kappa values indicate a need for additional evidence of robustness. In the revision we will add a sensitivity analysis that identifies all units with rater disagreement, simulates alternative resolutions of those disagreements, and reports the resulting range for each headline percentage. We will also include a table of per-unit disagreement counts. The raw (anonymized) match data will be released as supplementary material. A third rater was not employed in the original study owing to resource constraints; the sensitivity analysis will serve as the primary demonstration that modest changes in a small number of judgments do not materially alter the reported figures.
Revision: yes
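The sensitivity analysis promised in the rebuttal could take roughly the following shape. This is a sketch under stated assumptions, not the authors' procedure: the unit IDs and both raters' judgments are invented, and the function name `coverage_range` is hypothetical. It bounds coverage by resolving every disagreement first as "not covered" and then as "covered".

```python
def coverage_range(rater_a, rater_b):
    """Return (low, high, disagreements) for two raters' binary judgments.

    rater_a and rater_b map each knowledge-unit ID to True (covered)
    or False (not covered) and must rate the same set of units.
    """
    if not rater_a or rater_a.keys() != rater_b.keys():
        raise ValueError("raters must judge the same non-empty set of units")
    total = len(rater_a)
    agreed_covered = sum(1 for u in rater_a if rater_a[u] and rater_b[u])
    disagreements = sorted(u for u in rater_a if rater_a[u] != rater_b[u])
    low = agreed_covered / total  # every disagreement resolved as not covered
    high = (agreed_covered + len(disagreements)) / total  # resolved as covered
    return low, high, disagreements


# Toy data: ten hypothetical units, two of which the raters dispute.
units = [f"KU-{i}" for i in range(10)]
judgments_a = {u: i < 5 for i, u in enumerate(units)}
judgments_b = {u: i < 4 or i == 5 for i, u in enumerate(units)}
low, high, disputed = coverage_range(judgments_a, judgments_b)
print(f"coverage between {low:.0%} and {high:.0%}; disputed: {disputed}")
```

The width of the resulting range is a direct answer to the referee's question of how far the headline figures could move if the disputed judgments were resolved differently.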
Circularity Check
No significant circularity; empirical counts from retrieval-plus-human mapping
The paper's central results are direct ratios obtained by applying an explicit coverage definition to human-confirmed matches between a program's course corpus and the guideline knowledge units. No equations, fitted parameters, or self-citations are invoked to generate the headline percentages (49.7%, 50.9%, 88%, 76%, 95%); the pipeline is a measurement procedure whose outputs are the counted confirmations. Cohen's kappa values quantify rater agreement on the final binary judgments but do not enter the derivation of the fractions themselves. The work therefore contains no self-definitional, fitted-input, or self-citation-load-bearing steps.
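The separation described above, where the coverage fraction is a plain count and kappa is computed alongside it, can be illustrated with a short sketch. The ten binary judgments per rater are toy values, not data from the paper, and the helper name `cohens_kappa` is an assumption for this example.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary (0/1) judgments on the same items."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("both raters must judge the same non-empty item list")
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    # Agreement expected by chance from each rater's marginal rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy judgments: 1 = unit covered, 0 = not covered.
primary = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
second = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

coverage = sum(primary) / len(primary)  # headline ratio uses counts only
kappa = cohens_kappa(primary, second)   # agreement is reported separately
print(f"coverage={coverage:.1%}, kappa={kappa:.2f}")
```

Here the coverage ratio comes from one rater's confirmations alone, while kappa only describes how closely the second rater agrees with them.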
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: human judgments performed under the explicit coverage definition accurately reflect true alignment between courses and knowledge units.