pith. sign in

arxiv: 2506.15787 · v6 · submitted 2025-06-18 · 💻 cs.AI · cs.CL· cs.LG

SLR: Automated Synthesis for Scalable Logical Reasoning

Pith reviewed 2026-05-19 08:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords logical reasoningcurriculum learningLLM trainingautomated synthesisinductive reasoningverifiable rewardsbenchmark constructionscalability
0
0 comments X p. Extension

The pith

SLR automatically synthesizes logical reasoning tasks with built-in validation programs to enable scalable curriculum training for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SLR as a method that takes a task specification and generates an instruction prompt, an executable validation program, and a hidden ground-truth rule for inductive reasoning problems. This setup creates SLR-Bench, a collection of 19,000 prompts divided into twenty levels that gradually raise relational, arithmetic, and recursive demands. Curriculum training on these tasks doubles the accuracy of Llama-3-8B on the benchmark, bringing it level with Gemini-Flash-Thinking while using far less computation, and the gains carry over to other established reasoning tests.

Core claim

SLR is an end-to-end automated framework that, given a user's task specification, produces an instruction prompt for an inductive reasoning task, a validation program that scores model outputs against a latent ground-truth rule, and the rule itself. The framework requires no human annotations and supplies precise control over difficulty. Using it, the authors build SLR-Bench with 19k prompts in 20 curriculum stages. Curriculum learning on SLR data doubles Llama-3-8B accuracy on the benchmark to match Gemini-Flash-Thinking at a fraction of the cost, and the resulting capabilities transfer to a broad set of standard reasoning benchmarks.

What carries the argument

SLR synthesis engine that generates a prompt, an executable validator, and a latent ground-truth rule from a task specification to supply verifiable reward signals for logical reasoning training.

If this is right

  • Models can be trained on progressively harder reasoning problems without any manual data labeling.
  • Smaller models reach performance parity with high-compute reasoning models at substantially reduced inference cost.
  • Reasoning improvements from the curriculum transfer to multiple established benchmarks beyond SLR-Bench.
  • Current LLMs can output syntactically correct rules yet still fail logical inference, and this gap can be measured at scale.
  • Test-time reasoning costs exceeding hundreds of dollars per thousand prompts can be replaced by cheaper training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis approach could be applied to generate training data for mathematical proof construction or code verification tasks.
  • If validators remain reliable at higher complexity, the method could support iterative self-improvement loops in which models generate and validate their own next training tasks.
  • The twenty curriculum levels may expose precise failure modes, such as breakdowns at particular recursion depths or arithmetic relations, that current models share.
  • Generalization to other benchmarks suggests the learned reasoning routines are not narrowly tied to the synthesized task format.

Load-bearing premise

The automatically created validation programs must correctly identify whether a model's output rule matches the hidden ground-truth rule on every example without introducing false positives or negatives that corrupt the training signal.

What would settle it

Run the same curriculum training on Llama-3-8B but replace SLR validators with random or noisy scoring; if accuracy gains disappear or fail to generalize to other benchmarks, the claim that verifiable synthesis drives the improvement is false.

Figures

Figures reproduced from arXiv: 2506.15787 by Ahmad Omar, Antonia W\"ust, Felix Friedrich, Hikaru Shindo, Kristian Kersting, Lukas Helff, Patrick Schramowski, Rupert Mitchell, Tim Woydt, Wolfgang Stammer.

Figure 1
Figure 1. Figure 1: Overview of the SLR Framework, including task specification, automated task synthesis, training, and evaluation. Left (blue): Language defines vocabulary and grammar, Task Config specifies configuration parameters for the synthesis. Middle (green): The task synthesizer automatically generates ground-truth rules, validation programs, and instruction prompts. Right (purple): Training LLMs on logic tasks via … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of SLR-BENCH: The benchmark curriculum spans from basic to hard tasks with increas￾ing logical and combinatorial complexity (bars, right y-axis). As logical complexity increases, Model perfor￾mance (red, left y-axis) declines, highlighting current LLMs’ limitations on more challenging reasoning tasks. H given B, providing a continuous measure of progress even if the task is not perfectly solved. P… view at source ↗
Figure 4
Figure 4. Figure 4: Illustrative prompt and ground-truth rule generated by SLR (Level 1, SLR-BENCH). Language (L): 5 predicates, 1 car variable per train. Task configuration (Θ): κ = (1, 1) (one positive and one negative example); Bπ = mirror; Rlen = 1; Rsample = uniform. The prompt provides the full ILP instance, including background B, positive/negative examples (E+, E−), and natural-language instructions for the learning t… view at source ↗
Figure 5
Figure 5. Figure 5: Compute–Performance Trade-Off. Rea￾soning LLMs achieve higher accuracy than base LLMs but require more compute. While more complex tasks typically demand more completion tokens, increased compute does not always translate to higher accuracy. While larger models like Llama-3-70B and GPT￾4.5-prev generally outperform their smaller coun￾terparts, returns diminish as improvements are in￾creasingly modest (see … view at source ↗
Figure 6
Figure 6. Figure 6: Step-by-step example of the automatic ILP task synthesis process in [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
read the original abstract

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SLR, an automated end-to-end framework that, given a task specification, synthesizes an instruction prompt for an inductive reasoning task, an executable validation program to provide verifiable rewards, and the latent ground-truth rule. Using this, the authors construct SLR-Bench comprising 19k prompts across 20 curriculum levels of progressively increasing relational, arithmetic, and recursive complexity. Large-scale evaluations show that current LLMs produce syntactically valid but often logically incorrect rules, that reasoning models incur high test-time costs, and that curriculum learning with SLR doubles Llama-3-8B accuracy on SLR-Bench to reach parity with Gemini-Flash-Thinking at lower cost while generalizing to other established benchmarks.

Significance. If the synthesized validation programs are reliable, the work offers a scalable, annotation-free pipeline for both benchmarking and training logical reasoning in LLMs. This could substantially reduce reliance on expensive test-time compute while improving reasoning performance, with the reported generalization suggesting utility beyond the new benchmark.

major comments (2)
  1. [Abstract and experimental results] Abstract and experimental results: The central claim that curriculum learning doubles Llama-3-8B accuracy on SLR-Bench (and reaches parity with Gemini-Flash-Thinking) depends on the auto-generated validation programs supplying a faithful binary reward signal that passes only when the model's output rule exactly matches the latent ground-truth rule. No sample audit, formal verification, error bounds, or comparison against human-written oracles is reported to quantify false-positive or false-negative rates in the checkers. Any systematic bias in these programs would distort the reward signal and artifactually inflate the curriculum gains.
  2. [SLR-Bench construction] SLR-Bench construction: The abstract states that SLR offers precise control over task difficulty and organizes prompts into 20 curriculum levels, yet provides no details on how difficulty levels were validated, how examples were generated, or whether the validation programs were manually inspected for correctness. This information is required to support the claim that the curriculum progressively increases in relational, arithmetic, and recursive complexity.
minor comments (1)
  1. [Abstract] The cost figure of exceeding $300 for 1,000 prompts would be clearer with an explicit breakdown of the components included in the calculation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and experimental results: The central claim that curriculum learning doubles Llama-3-8B accuracy on SLR-Bench (and reaches parity with Gemini-Flash-Thinking) depends on the auto-generated validation programs supplying a faithful binary reward signal that passes only when the model's output rule exactly matches the latent ground-truth rule. No sample audit, formal verification, error bounds, or comparison against human-written oracles is reported to quantify false-positive or false-negative rates in the checkers. Any systematic bias in these programs would distort the reward signal and artifactually inflate the curriculum gains.

    Authors: We agree that a quantitative assessment of the validation programs' reliability is essential to support the reported gains. The programs are synthesized directly from the latent ground-truth rule and task specification, so that the checker executes the model's output rule on held-out inputs and verifies behavioral equivalence. While the submitted manuscript did not include an explicit audit or error bounds, we will add a new subsection that reports the results of a manual audit performed on a random sample of 200 validation programs (with inter-annotator agreement), together with the observed false-positive and false-negative rates and the formal verification steps used during synthesis. These additions will bound any potential bias in the reward signal and clarify that the curriculum improvements are not artifacts of checker errors. revision: yes

  2. Referee: [SLR-Bench construction] SLR-Bench construction: The abstract states that SLR offers precise control over task difficulty and organizes prompts into 20 curriculum levels, yet provides no details on how difficulty levels were validated, how examples were generated, or whether the validation programs were manually inspected for correctness. This information is required to support the claim that the curriculum progressively increases in relational, arithmetic, and recursive complexity.

    Authors: We concur that explicit documentation of the difficulty-control mechanism and validation steps is required. Task difficulty is modulated by parameters in the specification that govern the number and arity of relations, the depth of arithmetic expressions, and the recursion limit. Prompts and validation programs are generated programmatically from these parameters. In the revised manuscript we will expand the SLR-Bench construction section with (i) the precise parameter ranges used for each of the 20 levels, (ii) the procedure used to confirm monotonic increase in complexity (including both syntactic metrics and pilot human difficulty ratings), and (iii) confirmation that a development subset of validation programs received manual inspection for correctness before the full benchmark was generated. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims on synthesized benchmark

full rationale

The paper describes an automated end-to-end synthesis pipeline for logical reasoning tasks and reports direct empirical measurements of LLM accuracy on the resulting SLR-Bench under curriculum learning. No mathematical derivations, equations, fitted parameters, or self-citations that reduce the central claims to internal definitions are present. The reported gains (doubling Llama-3-8B accuracy and parity with Gemini-Flash-Thinking) are measured outcomes on externally generated prompts rather than quantities forced by construction from the synthesis process itself. The evaluation is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that executable validation programs can be automatically generated to faithfully check inductive rules; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Executable programs can be synthesized that correctly validate whether a model's output rule matches the latent ground-truth rule for every example.
    This assumption is required for the verifiable-reward component of the framework and is stated as part of the automated synthesis process.
invented entities (1)
  • SLR synthesis pipeline no independent evidence
    purpose: Automatically produce instruction prompts, validation programs, and latent rules for inductive reasoning tasks
    The pipeline is the central new artifact introduced by the paper.

pith-pipeline@v0.9.0 · 5773 in / 1371 out tokens · 41942 ms · 2026-05-19T08:43:20.823345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ActivationReasoning: Logical Reasoning in Latent Activation Spaces

    cs.LG 2025-10 unverdicted novelty 6.0

    ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning. Preprint, arXiv:2501.12948. Quentin Delfosse, Jannis Blüml, Fabian Tatai, Théo Vincent, Bjarne Gregori, Elisabeth Dillies, Jan Pe- ters, Constantin Rothkopf, and Kristian Kersting

  2. [2]

    Preprint, arXiv:2505.21731

    Deep reinforcement learning agents are not even close to human intelligence. Preprint, arXiv:2505.21731. Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, and Aixin Sun. 2025. Cothink: Token-efficient reason- ing via instruct models guiding reasoning models. Preprint, arXiv:2505.22017. Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Tho...

  3. [3]

    Preprint, arXiv:2406.04770

    Wildbench: Benchmarking llms with chal- lenging tasks from real users in the wild. Preprint, arXiv:2406.04770. Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. 2023. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, ...

  4. [4]

    In Proceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand

    Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand. Association for Computa- tional Linguistics. Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan G...

  5. [5]

    1 eastbound(Train) :- 2 has_car(Train, Car), 3 car_color(Car, red), 4 car_len(Car, short)

    Conjunction with Existential Quantification: There exists a red short car There is at least one car that is both short and red. 1 eastbound(Train) :- 2 has_car(Train, Car), 3 car_color(Car, red), 4 car_len(Car, short)

  6. [6]

    At least one car is either white or yellow

    Disjunction: Some car is white or yellow. At least one car is either white or yellow. 1 eastbound(Train) :- 2 has_car(Train, Car), 3 (car_color(Car, white) ; car_color( Car, yellow))

  7. [7]

    1 eastbound(Train) :- 2 \+ (has_car(Train, Car), car_color( Car, red))

    Negation: The train does not contain any red cars No car on the train is red. 1 eastbound(Train) :- 2 \+ (has_car(Train, Car), car_color( Car, red))

  8. [8]

    1 eastbound(Train) :- 2 has_car(Train, CarA), 3 has_car(Train, CarB), 4 CarA \= CarB, 5 car_color(CarA, Color1), 6 car_color(CarB, Color2), 7 Color1 \= Color2

    Inequality/Distinctness: Two cars must have different colors There are at least two cars on the train with different colors. 1 eastbound(Train) :- 2 has_car(Train, CarA), 3 has_car(Train, CarB), 4 CarA \= CarB, 5 car_color(CarA, Color1), 6 car_color(CarB, Color2), 7 Color1 \= Color2

  9. [9]

    Aggregation/Counting: There are more green cars than yellow cars The train contains more green cars than yellow cars. 1 eastbound(Train) :- 2 findall(Car, (has_car(Train, Car), car_color(Car, green)), Greens), 3 findall(Car, (has_car(Train, Car), car_color(Car, yellow)), Yellows), 4 length(Greens, G), 5 length(Yellows, Y), 6 G > Y. 15 Table 5: Model Prici...

  10. [10]

    Mutual Exclusion: Only one car is yellow; all others are not yellow There is exactly one yellow car; all others are not yellow. 1 eastbound(Train) :- 2 findall(Car, (has_car(Train, Car), car_color(Car, yellow)), [YellowCar ]), 3 forall( 4 (has_car(Train, Car), Car \= YellowCar), 5 (car_color(Car, NotYellow), 6 NotYellow \= yellow) 7 )

  11. [11]

    1 eastbound(Train) :- 2 findall(Color, (has_car(Train, Car), car_color(Car, Color)), Colors), 3 sort(Colors, UniqueColors), 4 length(Colors, N), 5 length(UniqueColors, N)

    Uniqueness: No two cars have the same color All cars have unique colors. 1 eastbound(Train) :- 2 findall(Color, (has_car(Train, Car), car_color(Car, Color)), Colors), 3 sort(Colors, UniqueColors), 4 length(Colors, N), 5 length(UniqueColors, N)

  12. [12]

    1 eastbound(Train) :- 2 findall(Car, has_car(Train, Car), Cars), 3 length(Cars, 2)

    No-Other/Uniqueness: Only two cars in the train Only two cars are present in the train. 1 eastbound(Train) :- 2 findall(Car, has_car(Train, Car), Cars), 3 length(Cars, 2)

  13. [13]

    1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall( Car, full)), 4 car_len(Car, long) 5 )

    Universal Quantification: Every full-wall car is long All cars with a full wall must be long. 1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall( Car, full)), 4 car_len(Car, long) 5 )

  14. [14]

    1 eastbound(Train) :- 2 forall( 16 3 (has_car(Train, Car), car_len( Car, long)), 4 (car_color(Car, Color), (Color = red ; Color = blue)) 5 )

    Conditional Implication: All long cars are either red or blue Every long car is either red or blue. 1 eastbound(Train) :- 2 forall( 16 3 (has_car(Train, Car), car_len( Car, long)), 4 (car_color(Car, Color), (Color = red ; Color = blue)) 5 )

  15. [15]

    1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), car_len( Car, long)), 4 (car_color(Car, Color), (Color = red ; Color = blue)) 5 )

    Conditional Aggregation: All long cars are either red or blue Every long car is either red or blue. 1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), car_len( Car, long)), 4 (car_color(Car, Color), (Color = red ; Color = blue)) 5 )

  16. [16]

    1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall(Car , full)), 4 car_color(Car, white) 5 )

    Pattern Matching: All full-wall cars are white Every full-wall car is white. 1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall(Car , full)), 4 car_color(Car, white) 5 )

  17. [17]

    Symmetry: Two cars are neighbors with same color CarA and CarB are neighbors on the train and have the same color. 1 eastbound(Train) :- 2 has_car(Train, CarA), 3 has_car(Train, CarB), 4 CarA \= CarB, 5 car_num(CarA, N1), 6 car_num(CarB, N2), 7 (N2 =:= N1 + 1 ; N2 =:= N1 - 1), 8 car_color(CarA, Color), 9 car_color(CarB, Color)

  18. [18]

    1 eastbound(Train) :- 2 findall(Car, (has_car(Train, Car), car_color(Car, yellow), car_len(Car, short)), L), 3 length(L, 2)

    Combinatorial Group: Exactly two short yellow cars There are exactly two yellow cars, and both are short. 1 eastbound(Train) :- 2 findall(Car, (has_car(Train, Car), car_color(Car, yellow), car_len(Car, short)), L), 3 length(L, 2)

  19. [19]

    1 eastbound([Car|Cars]) :- 2 car_len(Car, long) 3 ; 4 eastbound(Cars)

    Recursion: At least one long car in the train The train has at least one long car. 1 eastbound([Car|Cars]) :- 2 car_len(Car, long) 3 ; 4 eastbound(Cars)

  20. [20]

    Existence of a Structure (Sublist Pattern Matching) Exists three cars in sequence: Num, Num+1, Num+2, matching pattern. 1 eastbound(Train) :- 2 has_car(Train, Car1), car_num(Car1, N), 3 car_len(Car1, short), 4 N2 is N+1, N3 is N+2, 5 has_car(Train, Car2), car_num(Car2, N2), car_len(Car2, long), 6 has_car(Train, Car3), car_num(Car3, N3), car_len(Car3, short)

  21. [21]

    Min/Max and Extremal Values A short car followed by a long car followed by a short car, anywhere in the train. 1 eastbound(Train) :- 2 findall(N, (has_car(Train, Car), car_num(Car, N)), Numbers), 3 max_list(Numbers, Max), 4 has_car(Train, LastCar), 5 car_num(LastCar, Max), 6 car_color(LastCar, white)

  22. [22]

    1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall(Car , full)), 4 (car_num(Car, N), N =< 3) 5 )

    Subset/Superset Constraints All full-wall cars are among the first three cars. 1 eastbound(Train) :- 2 forall( 3 (has_car(Train, Car), has_wall(Car , full)), 4 (car_num(Car, N), N =< 3) 5 )

  23. [23]

    Projection/Aggregation Over Multiple Properties All pairs of cars have different (color, length) tu- ples. 1 eastbound(Train) :- 2 has_car(Train, CarA), has_car(Train, CarB), CarA \= CarB, 3 car_color(CarA, ColA), car_len(CarA, LenA), 4 car_color(CarB, ColB), car_len(CarB, LenB), 5 (ColA \= ColB ; LenA \= LenB)

  24. [24]

    1 eastbound(Train) :- 2 findall(Color, (has_car(Train, Car), car_color(Car, Color)), Colors), 3 sort(Colors, UniqueColors), 4 length(Colors, N), length( UniqueColors, N)

    All-Different on Multiple Attributes Enforce all car colors are different, AND all car numbers are different (car numbers are unique by assumption, but see structure). 1 eastbound(Train) :- 2 findall(Color, (has_car(Train, Car), car_color(Car, Color)), Colors), 3 sort(Colors, UniqueColors), 4 length(Colors, N), length( UniqueColors, N). 17 F First-Order L...

  25. [25]

    Rule Synthesis: The RULE GENERATOR produces a latent ground-truth rule R⋆: is_red_train(T) :- has_car(T, C), car_color(C, red)

  26. [26]

    car_color(c1, red).‘ • Assign Label: Query q1 = is_red_train(t1)

    Background Synthesis (Loop): Iteration 1 (finds a positive example): • Sample Background (b1): ‘has_car(t1, c1). car_color(c1, red).‘ • Assign Label: Query q1 = is_red_train(t1). Entailment b1 ∪ R⋆ |= q1 holds. Result: (1, q1). • Accept/Reject: |E+| < κpos, sample is accepted. B ← b1, E+ ← {q1}. Iteration 2 (finds a negative example): • Sample Background ...

  27. [27]

    Latent Ground-Truth Rule (R⋆): is_red_train(T) :- has_car(T, C), car_color(C, red)

  28. [28]

    car_color(c1, red)

    Validation Program (B, E +, E−): has_car(t1, c1). car_color(c1, red). has_car(t2, c2). car_color(c2, blue). is_red_train(t1)

  29. [29]

    is_red_train(T) :-

    Instruction Prompt (example formats): (a) Prolog-style Prompt: % Given the following background knowledge: has_car(t1, c1). car_color(c1, red). has_car(t2, c2). car_color(c2, blue). is_red_train(t1). % Your task is to find a rule "is_red_train(T) :-" that solves the bk. (b) Natural Language Prompt: % Given the following background knowledge: Train t1 has ...