Transformer models still break on a specific class of language: negation and constraint logic. This includes prohibitions, exclusions, exceptions, nested "not", and rule interactions. These failures show up in safety, agents, multi-step reasoning, and instruction-following. They persist even as models scale.

Aikronus Labs is building a system that targets this weakness directly. The goal is to make transformer behavior stable under negation and constraint-heavy inputs, especially across longer reasoning chains where baseline models drift.

Development is research-driven and engineering-led: theory, proof-of-concept, system design, MVP. The project operates in stealth. Internal mechanisms are intentionally withheld.

Status:
Theory validated · PoC completed · MVP in progress
Patent pending

AI and Negation

Let's say a child is allergic to peanuts. The child must not get peanuts.

1) The Constraint Fails on AI

"Don't give the child peanuts — the child is allergic."

AI "sees" give + peanuts and can still decide to give peanuts: the "don't" carries too little weight against the co-occurring content words.

2) The Representation Gets Messed Up (Data/Learning Effect)

The dataset contains sentences like:

"The child is allergic to peanuts — don't give peanuts."

So during training it still learns the co-occurrence pattern: child + allergic + give + peanuts

3) Thinking With Negation (Human-Style Inference)

A human can infer like this: if this child is eating peanuts, then the child is not allergic to peanuts.

Models usually don't do this reliably, because they don't keep the negation operator stable enough to support these kinds of inferences.

4) Negation in Code

if not is_admin: grant_access()

One flipped negation and the guard does the opposite of what was intended: access is granted to every non-admin.
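A runnable version of the snippet makes the failure concrete. The function and flag names below are illustrative, not from any real codebase:

```python
def buggy_check(is_admin: bool) -> str:
    # Inverted negation: grants access precisely to non-admins.
    if not is_admin:
        return "ACCESS GRANTED"
    return "ACCESS DENIED"

def correct_check(is_admin: bool) -> str:
    # Intended rule: only admins get access.
    if is_admin:
        return "ACCESS GRANTED"
    return "ACCESS DENIED"

print(buggy_check(is_admin=False))    # ACCESS GRANTED -- the bug
print(correct_check(is_admin=False))  # ACCESS DENIED
```

The same single-token flip that breaks this guard is the pattern the project studies in natural language.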

This project investigates why transformers fail under negation-heavy and constraint-heavy language, and what those failures imply about how models represent rules over time.

The research treats these breakdowns as structural behavior rather than prompt artifacts. The goal is not benchmark chasing. It is isolating failure modes under controlled pressure and designing a system that addresses them.

Focus Areas

  • Constraint interaction: exceptions, overrides, priority ordering
  • Negation composition: layered, nested, and reintroduced constraints
  • Persistence: whether constraints survive multi-step reasoning
  • Sensitivity: behavior shifts under small wording changes

Working Research Stance

Scaling improves surface ability but does not reliably eliminate constraint drift. The hypothesis is that certain operator patterns, especially negation, introduce instability that compounds with depth.

Research Status
Core questions identified · multiple directions tested · recurring failure modes mapped · experiments ongoing

The project has progressed from theory into a functioning system under active development.

This is a new system for transformers designed to prioritize operator stability in NLP, especially negation and constraint logic.

Design Priorities

  • Stable behavior when rules interact
  • Consistency across long reasoning sequences
  • Reduced brittleness to phrasing variation in constraint-focused inputs

Internal architecture details remain intentionally abstracted.

Current Capabilities

  • Stable negation handling across basic, compound, and nested constructions
  • Consistent behavior under high temperature (T=1.2) where baselines degrade
  • Resistance to salience overload: constraints hold even when surrounding content pulls toward violation
  • Reliable De Morgan-style reasoning where small rephrasing breaks baselines
  • Negation-based inference (reasoning with negation, not just obeying it)

Current Limitations

  • Reasoning with negation not yet perfected
  • Reasoning in negation, harder than reasoning with it (resolved 05/04/2026)
  • Small frame, expensive to scale (resolved; should now be cheap to scale)
  • New reasoning patterns require additional SFT work to align (in progress)

Internal Status
Core system operational · MVP in progress · design under active review

This section presents early, narrow results focused on one core failure mode in transformers: basic negation stability ("non-X", "not X", exclusions).

1) Basic Negation: "Non-Expired"

Prompt:
I only accept non-expired items.
This item is expired.
Do I accept it?
Baseline:
Yes. Since it is expired, you should accept it.
Aikronus System:
No. An expired item is not non-expired, so it does not meet the rule.

2) Multiple Negations

Prompt:
The child doesn't run at night, but runs in the morning. The child only eats apples. The child is not inside the house. Questions: Does the child eat non-apple? Does the child run at night? Is the child inside the house?
Baseline:
The child does not eat non-apple. The child runs at night. The child is not inside the house.
Aikronus System:
No, the child does not eat non-apple. No, the child does not run at night. No, the child is not inside the house.

15 runs · Aikronus 15/15 · Baseline 7/15 (T=0.7)
At T=1.2 · Aikronus 13/15 · Baseline 4/15
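Scoring runs like these can be mechanized with a crude polarity check per sub-answer. This is an illustrative scorer of ours, not the project's actual harness; it reproduces the judgments for the two transcripts above:

```python
import re

NEG = re.compile(r"\b(no|not|doesn't|does not|isn't|is not)\b", re.IGNORECASE)

def is_negative(sentence: str) -> bool:
    """Crude polarity check: does the sentence contain a negation marker?"""
    return bool(NEG.search(sentence))

def score(answers, expected_negative):
    """Count sub-answers whose polarity matches the expected polarity."""
    return sum(is_negative(a) == neg for a, neg in zip(answers, expected_negative))

# Transcripts from this example; all three correct answers are negative.
baseline = [
    "The child does not eat non-apple.",
    "The child runs at night.",            # the baseline's polarity error
    "The child is not inside the house.",
]
aikronus = [
    "No, the child does not eat non-apple.",
    "No, the child does not run at night.",
    "No, the child is not inside the house.",
]
expected = [True, True, True]

print(score(baseline, expected))  # 2
print(score(aikronus, expected))  # 3
```

A surface check like this only grades polarity; the harder question, which the project targets, is whether the model gets the polarity right in the first place.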

3) Salience Overload

Prompt:
A person entered the room. Another person followed. A third person joined them. Every person was happy. The person at the front waved. Each person sat down. Now: no person is in the hallway. How many people are in the hallway?
Baseline (examples):
"There are 3 people in the hallway."
"There is 1 person in the hallway."
"The number of people in the hallway is 20."
Aikronus System:
0.

15 runs (T=0.7) · Aikronus 15/15 · Baseline ~2/15
Greedy · Aikronus: 0 · Baseline: "3 people"

4.1) Double Negation — Pink Elephant

Prompt:
Rule: no pink and no elephant allowed in the box.
The box has: a pink pen and a gray elephant.
Is this allowed? Answer only YES or NO.
Baseline:
NO
Aikronus System:
No.

Aikronus 15/15 · Baseline 13/15
Baseline performs well here as expected. This example sets up the contrast for 4.2, where a small change in how the rule is phrased flips the result.

4.2) De Morgan — Negating the Compound

Prompt:
Rule: no pink elephant allowed in the box (items that are both pink and elephant are banned).
The box has: a pink pen and a gray elephant.
Is this allowed? Answer only YES or NO.
Baseline:
Given the rule that "no pink elephant allowed in the box," and the box contains a pink pen and a gray elephant, this arrangement is not allowed because it includes both "pink" and "elephant" items.
Aikronus System:
Yes.

Almost the same prompt, but the entire compound is negated.
Aikronus 15/15 · Baseline 4/15
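The contrast between 4.1 and 4.2 is exactly De Morgan territory, and it can be pinned down in a few lines of Python (a sketch of the logic only, not of any model internals):

```python
# Each item in the box is (is_pink, is_elephant).
items = [(True, False),   # pink pen
         (False, True)]   # gray elephant

# Rule 4.1: "no pink and no elephant" bans anything pink OR anything elephant.
rule_41 = all(not pink and not elephant for pink, elephant in items)

# Rule 4.2: "no pink elephant" bans only items that are BOTH pink and elephant.
rule_42 = all(not (pink and elephant) for pink, elephant in items)

print(rule_41)  # False: the pink pen and the gray elephant each violate it
print(rule_42)  # True: neither item is a pink elephant
```

By De Morgan, `not (pink and elephant)` equals `not pink or not elephant`, which is a strictly weaker rule than `not pink and not elephant`; the baseline treats the two phrasings as the same rule.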

5) Reasoning With Negation

Prompt:
If the robot cannot move, it sends an alert. If an alert is sent, no person enters the room. The robot cannot move. Does a person enter the room? Explain step by step.
Baseline:
Yes, a person enters the room. The robot sends an alert, which means no person enters the room. However, the robot cannot move, so it cannot send an alert again.
Aikronus System:
The robot cannot move. If it cannot move, it sends an alert. If an alert is sent, no person enters the room. Therefore, no person enters. Answer: No.

15 runs · Aikronus 13/15 · Baseline 2/15 (T=0.7)

6) Reasoning in Negation (AIKON Alpha 0.6B, Early Results)

Q:
Can I use the drive-thru on foot?
AIKON Alpha 0.6B:
Not on ground, because the lane is designed for vehicle driving and not for walking. The missing part is a space for pedestrians.
Q:
Can I take a screenshot on a computer that is turned off?
AIKON Alpha 0.6B:
Not properly, because a screenshot depends on the computer being able to save a copy of the image. If the hardware is off, there is no usable record. The storage device is missing.

Early outputs, not yet polished. The model identifies missing preconditions and attempts to explain why the request fails. Reasoning is directionally correct but language is still rough.

7) Negation in Code — WIP

Work in progress

Bonus: AIKON Alpha 0.6B

Q:
"He is not tall. Is he necessarily short?"
AIKON Alpha 0.6B:
"No, he is not necessarily short. He could be taller or shorter, just not tall."

AIKON understood there is a middle ground.

Q:
"Not every doctor agreed with the diagnosis. Does that mean most doctors disagreed?"
AIKON Alpha 0.6B:
"No. 'Not every' means at least one doctor disagreed, but many could have agreed. The negation specifies the universal quantifier, not the individual."

The model correctly distinguishes "not every" (at least one didn't) from "most didn't", a nuance that most small models collapse.

AIKON consistently resolves these patterns where Qwen3 0.6B gives inconsistent or wrong answers.

Work in progress. Training is underway.

Live Demo

Access to the AIKON Alpha demo is available by invitation only.

Why This Matters

Negation is a core building block of rules: do not do X, exclude Y, only if not Z. When transformer models handle negation inconsistently, systems built on top of them become harder to control. This is especially true as instructions get longer, constraints interact, or tasks become agent-like.

Directional Implications (Early and Provisional)

  • More predictable behavior in workflows where exclusions and prohibitions matter
  • Less reliance on workarounds and prompt tricks to enforce "do not", "exclude", or "only" logic
  • Efficiency gains: stable constraint handling may enable smaller, lighter, faster, and cheaper models
  • Reduced hallucination: if negation is handled correctly, negated statements no longer poison what the model learns from the data
  • Better high-temperature behavior: improved constraint stability at higher temperature, allowing more creative and diverse reasoning
  • Broader relevance beyond text, wherever constraints must persist across steps (agents, multimodal generation, robotics)
  • Applicable to any domain where rules must not be broken: healthcare, legal, finance, safety-critical systems
  • Potential for creative and lateral reasoning: stable negation may enable domain flipping, exploring what something is not in order to discover what it could be

Cost Considerations

  • Currently requires roughly 2-3x the compute of standard training, possibly more. Early signs suggest this can be reduced significantly, but that is not yet confirmed
  • As an experimental system, early-stage mistakes increase upfront costs further
  • Standard curated data used by other models is not ideal for this system; different data strategies are needed
  • State-of-the-art fine-tuning, overfitting mitigation, and RL methods are not ideal either; additional or different approaches are needed, and time and experimentation will be necessary
  • New reasoning patterns require additional SFT work to align

Note: This section reflects a working view and will evolve as evaluation expands.

Version 2 — 0.6B

Parameters: 0.6B
Status: Pretraining complete, simple SFT working, complex SFT in progress

Logs

  • Pretraining complete.
  • Simple SFT working.
  • Model is coherent and shows understanding of negation.
  • Complex SFT in progress; thinking data inspired by the Qwen 0.6B format.

Version 1 — 142.1M (Failed)

The hypothesis was that the architecture and research would make a much smaller reasoning-capable model possible. At this scale the model may simply be too small, or only ultra-optimized models of this size perform well, and we cannot yet compare against those.

Type: Language Model (trained from scratch)
Parameters: 142.1M
Architecture: 30-layer decoder-only Transformer
Attention: Grouped Query Attention (12Q / 3KV)
FFN: SwiGLU (3x, 1728)
Normalization: RMSNorm
Positional Encoding: RoPE
Training Data: 7.5B tokens
Context Window: 1,024 tokens
Vocab: 32K BPE
Precision: BF16

SFT Training Method: Break-to-Find (150M Model)

This approach was used for the 150M model. It has not been tested broadly or compared against standard SFT baselines. The working assumption was that, at this scale, structural tokens need gradual introduction rather than being introduced cold.

Stage 1: Pretraining Exposure (Steps 1-9,000)

Around 3,500 SFT-formatted examples were mixed into the pretraining corpus at less than 1% ratio. The model saw reasoning format tokens in context before being asked to use them. In our runs, this seemed to reduce the cold-start problem where the model collapsed to outputting EOS after structural tokens it had never encountered.

Stage 2: Annealing Phase (Steps 9,000-11,450)

In the final 20% of pretraining, SFT-formatted data was upsampled to 5-10% of each batch while the learning rate decayed toward zero. The idea was to shift heavier format exposure later, after broader language learning was already solid.

Stage 3: Dedicated SFT

Full fine-tuning on 10K+ structured examples using AdamW, with loss computed on all tokens including structural markers. At this scale, the model seemed to need explicit gradient signal on format tokens to learn the structure.

Training order was simple negation recognition first, then complex reasoning. This seemed to help stability in our runs.
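The three stages can be summarized as a data-mixing schedule. The sketch below uses the ratios described above; the exact values (0.01, 0.075, a peak LR of 3e-4, linear decay) are illustrative placeholders, not stated project numbers:

```python
TOTAL_STEPS = 11_450
ANNEAL_START = 9_000   # final ~20% of pretraining

def sft_mix_ratio(step: int) -> float:
    """Fraction of each pretraining batch drawn from SFT-formatted data."""
    if step <= ANNEAL_START:
        return 0.01    # Stage 1: light exposure, under 1% of the corpus
    return 0.075       # Stage 2: upsampled into the 5-10% range

def learning_rate(step: int, peak: float = 3e-4) -> float:
    """Peak LR held through Stage 1, then linear decay to zero in Stage 2.
    The peak value is a placeholder; the source does not state it."""
    if step <= ANNEAL_START:
        return peak
    return peak * (TOTAL_STEPS - step) / (TOTAL_STEPS - ANNEAL_START)

# Stage 3 (dedicated SFT) then fine-tunes on 10K+ structured examples,
# with the loss computed on ALL tokens, including structural markers.
```

The point of the schedule is ordering: heavy format exposure lands late, after general language learning, and only then does dedicated SFT apply gradient to the structure itself.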

Why This Order (Based on 150M Runs)

  • SFT without pretraining exposure: the model output EOS after structural tokens and collapsed. Response: Stage 1, mixing SFT format into pretraining.
  • Uniform SFT mixing throughout: appeared to spend too much capacity on format learning early. Response: Stage 2, concentrating it in the annealing phase.
  • Masking structural tokens: the model never got gradient on format and could not learn the structure. Response: Stage 3, including all tokens in the loss.
  • Complex reasoning before simple: the model failed on basics, an unstable foundation. Response: training simple negation first, then layering complexity.

Logs

  • Sequence length set to 1,024. Negation examples are short, so a longer context offers no benefit for the proof of concept, and it is safer on VRAM.
  • Switching from 3:1 to 4:1 GQA improved val_bpb significantly (3.92 → 2.88), suggesting the extra KV capacity was helpful at nearly the same cost.
  • FFN 3x (1728) instead of 2.67x gives a small additional gain (-0.012).
  • 142.1M parameters. Close to the 150M target; the difference comes from the 3x FFN (1728) being slightly smaller than the original plan.
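For context on the GQA numbers in these logs, the sketch below shows the inference-time motivation for grouping KV heads. The head dimension of 48 is our inference (d_model 576 from the SwiGLU 3x = 1728 spec, divided by 12 query heads), not a stated figure:

```python
LAYERS, SEQ_LEN, BYTES = 30, 1024, 2     # 30 layers, 1,024-token context, BF16
HEAD_DIM = 48                            # assumption: d_model 576 / 12 heads

def kv_cache_bytes(n_kv_heads: int) -> int:
    # K and V tensors cached per layer for a full context window
    return 2 * LAYERS * SEQ_LEN * n_kv_heads * HEAD_DIM * BYTES

mha = kv_cache_bytes(12)   # full multi-head attention (12 KV heads)
gqa = kv_cache_bytes(3)    # this model's GQA (12Q / 3KV)
print(gqa / mha)           # 0.25: a 4x smaller KV cache at inference
```

The cache scales linearly with KV heads, so the Q:KV ratio is the lever the logs are tuning against val_bpb.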

Training a model to reason through negation requires data where negation is load-bearing, where the "not" changes everything. I needed negation-dense, logically structured data, so I chose and cleaned sources from philosophy, law, logic, and science: traditions where reasoning means arguing, where every claim faces an objection and must survive or fall.

Classical Dialectical Sources

  • Babylonian Talmud (Sefaria) · Jewish legal dialectic: Sugya-style reasoning (challenge, objection, resolution). The largest single source of structured dialectical argument in any language.
  • Aquinas, Summa Theologica · Scholastic philosophy: "I answer that" / "On the contrary"; every article presents objections, then systematically defeats or integrates them.
  • Ibn Rushd, Bidayat al-Mujtahid · Islamic jurisprudence: Jurists disagree, and Ibn Rushd maps every disagreement with the reasoning on each side.
  • Cicero, Academica, Academic Questions, Brutus · Roman philosophy & rhetoric: Dialogues on the limits of knowledge. Cicero argues both sides and lets the reader decide.
  • Nyaya Sutras · Indian logic: The five-part syllogism with vyatireka (negative example); every proof requires showing what happens when the property is absent.
  • Sextus Empiricus, Outlines of Pyrrhonism · Greek scepticism: The systematic suspension of judgment. Every claim meets an equal counter-claim.
  • Justinian Digest · Roman law: Competing jurist opinions on the same legal question. Centuries of case-based negation reasoning.
  • Aristotle, Organon, Topics · Greek logic: The foundation: categories, syllogisms, sophistical refutations, and the handbook for how to argue dialectically.
  • Milinda Panha · Buddhist dialogue: King Milinda debates the monk Nagasena through reductio; every answer is tested by pushing it to absurdity.
  • Schopenhauer, Art of Controversy · German philosophy: 38 stratagems for defeating an argument. A manual of negation techniques.
  • Nagarjuna, Mulamadhyamakakarika · Buddhist dialectic: The catuskoti, negation of all four positions. If you think something exists, Nagarjuna negates it; if you think it doesn't exist, he negates that too.
  • Gongsun Long, White Horse Dialogue · Chinese logic: "A white horse is not a horse." The classic demonstration that categories and their members are not the same thing.
  • Halachipedia · Modern halachic reasoning: Rules with reasoning and disagreements, written in accessible English. Where rabbis disagree, both sides are given.

Modern Reasoning Sources

  • Args.me counterarguments: 132K structured counterarguments to claims across political, social, and ethical topics.
  • Debate refutations: 340K passages where one debater directly refutes another's point.
  • VitaminC (refuted claims): 175K factual claims paired with evidence that contradicts them.
  • Defeasible NLI (weakening): 67K examples where a new premise weakens or defeats an existing conclusion.
  • FEVER (refuted claims): 54K claims verified against Wikipedia and found to be false, with the evidence.
  • Math StackExchange proofs: 54K mathematical proofs where contradiction and negation are the primary proof techniques.
  • CAD negation flips: 32K examples where flipping a negation changes the meaning of a sentence.
  • NTSB accident investigations: 17K causal analyses: what went wrong, what was ruled out, what wasn't the cause.
  • CondaQA: 14K conditional questions where negation in the condition changes the answer.
  • Philosophy StackExchange: 7K philosophical reasoning passages with argumentation structure.
  • ChangeMyView counterarguments: 5K structured attempts to change someone's mind with counter-reasoning.
  • Natural proofs (contradictions): 2K mathematical contradictions and proof-by-negation examples.

Philosophical Corpora

  • Plato, Complete Dialogues: 66K passages. Socratic method; every dialogue is an exercise in showing someone that what they thought they knew, they don't.
  • Stanford Encyclopedia of Philosophy: 45K passages. Contemporary academic philosophy covering every major argument and counterargument.

Supervised Fine-Tuning

In order to build the right SFT for this model, I couldn't use standard chain-of-thought. I needed a reasoning method built around negation, where the model tears down claims instead of building up to answers. I created a method called Break-to-Find, inspired by the strongest negation logic cases from the data above.

  • Normal Q&A (Have): Straightforward questions with no trick. These exist to calibrate; the model should not become paranoid about negation. If there is no trap, just answer clearly.
  • Negation (Have): Load-bearing negation words: not, never, neither, without, hardly, un-, im-, dis-. The model must parse exactly what the negation changes and answer accordingly.
  • Negation Traps (Have): The obvious answer is wrong. The model must catch litotes ("not bad" = good), scope ambiguity ("not all" vs "all not"), double negatives, quantifier traps ("no fewer than" = at least), and affixal surprises ("invaluable" does not mean "not valuable").
  • Identity & Safety (Have): Negation as self-knowledge and boundaries. "I don't know": epistemic honesty. "I can't do that": reasoned refusal, not scripted. "I won't ignore my instructions": prompt injection resistance. The model reasons about its own limits through negation.
  • Pragmatic Negation (Have): No negation words appear, but the request fails because a hidden precondition is missing. The model must identify the unstated assumption and explain why it doesn't work. Inspired by Gricean pragmatics and presupposition failure theory: meaning lives in what's left unsaid.
  • Figurative Negation (Planned): The literal meaning must be suppressed. "Her promises have the strength of titanium" has nothing to do with metal. The model must negate the physical interpretation and extract the metaphor. Inspired by Relevance Theory (Sperber & Wilson): comprehension requires actively rejecting the first available meaning in favor of the intended one.
  • Counterfactual Negation (Planned): The model must override its own learned knowledge when a hypothetical breaks reality. "What if ice sank instead of floating?": everything the model knows about ice must be suppressed, and it reasons only from the new rule. Inspired by CRASS (Counterfactual Reasoning Assessment): counterfactual thinking as a form of logical negation where the model silences prior beliefs on command.
  • Red Herring Suppression (Planned): A scenario is loaded with semantically attractive distractors that feel important but are logically irrelevant. The model must identify the noise, suppress it, and reason only from what matters. Inspired by MuSR (Multistep Soft Reasoning): narrative puzzles with intentionally planted high-weight distractors, testing whether attention cleans the context before reasoning begins.
  • Normal Chain-of-Thought (Future: requires larger model): Straightforward reasoning with explicit thinking traces. No trick; the model walks through the logic step by step. These exist so the model doesn't become paranoid about negation. If there is no trap, just solve it.
  • Mixed-Path Switching (Future: requires larger model): The model starts down one path, hits a negation it misread, catches itself, and rebuilds. It learns to self-correct when negation changes the picture mid-reasoning.
  • Dialectical Resolution (Future: requires larger model): Two sides argue. The model tries to break both positions and reports what survives. Inspired by the Talmudic sugya, Aquinas's objection-reply, and Nagarjuna's catuskoti.

Future Data

Training Data (SFT)

  • FigQA (11,914 examples): Figurative language understanding. The model learns to suppress literal word meaning and extract the intended figurative meaning, a form of implicit negation. When someone says "her promises have the strength of titanium," the model must negate the physical interpretation and extract the metaphorical one.
  • E-KAR (2,906 examples): Contrastive analogical reasoning from standardized exams. Each example is augmented with explanations of why incorrect options fail, teaching the model not just what is right, but specifically what is wrong and why.

Evaluation Benchmarks

  • BRAINTEASER (1,119 riddles): Lateral thinking puzzles designed to exploit statistical bias. The obvious answer is always wrong. Tests whether the model can suppress the high-probability default and find the lateral solution.
  • MuSR (756 puzzles): Multistep soft reasoning with intentionally planted red herrings (murder mysteries, object placement). Tests whether the model identifies and ignores semantically attractive but logically irrelevant distractors.
  • CRASS (274 pairs): Counterfactual reasoning. Tests whether the model can override learned world knowledge when given a hypothetical constraint ("what if gravity repelled?"). Measures the ability to suppress prior beliefs when explicitly negated.
  • IFEval (541 prompts): Negative constraint following. Prompts with explicit negative constraints ("write about X without using word Y, no lists, no paragraphs over 3 sentences"). Tests enforcement of multiple simultaneous "don't" rules.
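Negative-constraint checks of the IFEval kind are mechanically verifiable. This is a toy checker of our own to illustrate the idea, not the real IFEval scorer:

```python
import re

def check_negative_constraints(text, banned_words=(), max_sentences_per_par=3,
                               allow_lists=False):
    """Toy verifier for 'don't' rules: banned words, no lists, short paragraphs."""
    violations = []
    for w in banned_words:
        if re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE):
            violations.append(f"used banned word: {w}")
    # A line starting with a bullet or "1." counts as a list item.
    if not allow_lists and re.search(r"^\s*([-*•]|\d+\.)\s", text, re.MULTILINE):
        violations.append("contains a list")
    for par in filter(None, (p.strip() for p in text.split("\n\n"))):
        if len(re.findall(r"[.!?]+", par)) > max_sentences_per_par:
            violations.append("paragraph over sentence limit")
    return violations

print(check_negative_constraints("Cats are great.\n\n- bullet",
                                 banned_words=["dogs"]))
# ['contains a list']
```

Each rule is a prohibition, so a model that drifts on negation fails these checks even when its prose is otherwise fluent.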

Roadmap (Future Versions)

  • CCoT (Contrastive Chain-of-Thought): A training methodology where the model learns from both correct and incorrect reasoning paths side by side. Planned for larger model variants where internal reasoning traces become feasible.
  • Sci-Reasoning (3,819 papers): Cross-domain scientific synthesis. Research papers mapped to their intellectual predecessors with synthesis narratives. Planned for future models targeting scientific reasoning with negation-based constraint injection.