The Discipline Patch
A small corrective fine-tune turns failed evals into behavior training: patch the discipline, not the facts.
My model knew MITRE ATT&CK cold and still flunked the job. The fix wasn’t more data. It was the model’s own failures.
I call it a “Discipline Patch.”
It’s a small thing: 72 training rows and about eighteen minutes of retraining. It took a 27B model from flunking a strict analyst eval to passing all 71 held-out questions, and it changed how I think about fixing a model that misbehaves. It also only exists because my first two theories about the problem were wrong, so this piece is mostly the story of being wrong twice.
Some context first. I built discipline-patch, a QLoRA fine-tune of an open 27B model on the public ATT&CK corpus, to help with investigation work. I went in expecting the model’s knowledge of the framework to be the bottleneck, and it wasn’t. A base model already knows the framework well: ask it about T1059 and you’ll get a perfectly reasonable explanation of Command and Scripting Interpreter. That knowledge is already baked in, so the framework was the easy part. What I actually had to teach was discipline.
By discipline I mean the things a good analyst does almost without thinking. When you map an observation to a technique, you also say what evidence you’d still want before you’re sure. You stay careful about the line between what the data shows and what it doesn’t, so you don’t reach for an actor, an attribution, or a severity call the logs can’t support. And when someone asks you about a technique ID that doesn’t exist, you say so, instead of inventing a confident-sounding answer. None of that is knowledge. It’s behavior, and it’s the whole reason a model like this is useful rather than just impressive.
Here’s what that difference looks like in practice. The prompt is deliberately a trap: an everyday network event that looks mappable but isn’t.
Prompt: A DHCP server shows a student laptop received an IP address at 8:02 AM. Can this alone be mapped to a MITRE ATT&CK technique?
A generic model tends to reach for a mapping anyway:
This could map to Discovery (System Network Configuration Discovery), since the device obtained network configuration information.
Confident, plausible, and wrong. A DHCP lease is ordinary network activity; on its own it’s nothing. Mapping it to an adversary technique is exactly the kind of overreach that floods a SOC queue with noise and teaches analysts to stop trusting the tool. After the discipline tuning, the model holds the line instead:
No. A DHCP lease by itself shouldn’t be mapped to ATT&CK. It only shows that a device joined or renewed on the network. A real mapping would need more: suspicious process execution, command-line activity, unusual DNS queries, authentication anomalies, signs of lateral movement, an endpoint alert, or a hit on known-malicious infrastructure. Treat this as context, not as attack evidence.
Both versions know the framework equally well. The difference is entirely behavior.
Getting there was not a straight line. My first adapter learned the mappings and still flunked the job, scoring 1 out of 12 on a strict probe, barely ahead of the base model’s 0. My first theory was truncation: the answers were thin, so maybe they were getting cut off. I reran the eval with bigger and bigger token budgets, 128 up to 2048. The failures didn’t move. Room wasn’t the problem.
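As a rough sketch of that truncation check (not the repo's actual harness): run the same eval at several token budgets and see whether the pass count moves. The `run_eval` callable is hypothetical, and the intermediate budgets are an assumption, since only the 128 and 2048 endpoints are stated above.

```python
from typing import Callable, Dict, Sequence, Tuple


def sweep_token_budgets(
    run_eval: Callable[[int], int],
    budgets: Sequence[int] = (128, 256, 512, 1024, 2048),  # assumed steps
) -> Tuple[Dict[int, int], bool]:
    """Run the eval once per token budget and report whether room helps.

    run_eval is a hypothetical callable that takes a max-new-tokens budget
    and returns how many probe questions passed at that budget.
    """
    passes = {budget: run_eval(budget) for budget in budgets}
    # If truncation were the cause, the largest budget should beat the smallest.
    improves_with_room = passes[max(budgets)] > passes[min(budgets)]
    return passes, improves_with_room
```

A flat result across budgets, as described above, rules truncation out without touching the training data.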
My second instinct was the one most of us have on reflex: add more training data. Before acting on it, I sat down and actually read the failing answers, and that was the clue. The ATT&CK mappings in them were correct. What was missing was the same three things every time: no evidence reasoning, no boundary language, no structure. A model that’s missing knowledge gets the mapping wrong. A model that gets the mapping right and skips the same habits every time isn’t ignorant. It’s undisciplined. And more data was never going to fix that, because more data teaches facts, and facts weren’t failing.
The fix turned out to be sitting in the eval itself. Because the eval scores every answer component by component, each failed row already named the habit that was missing. So I rewrote the failed rows into 72 corrective examples showing how an analyst should have answered them: evidence cited, boundary held, fake IDs refused. One short retrain later, the same strict probe went 12 for 12, and the full held-out set passed 71 of 71. The part I liked best is that the answers got shorter and more structured, not longer. I’ll be honest about the edges: those numbers are for the v1 technique-explainer set, and procedure-level disambiguation is still a hard problem I haven’t fully solved. But somewhere in there the evaluation quietly stopped being a report card and became the engine of improvement, which is the eighth lever earning its keep.
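To make the component-by-component idea concrete, here is a minimal sketch of a scorer whose failures name the missing habit and produce a stub for a corrective row. The component names, the marker phrases, and the row fields are all assumptions for illustration; the real eval's rules are not shown in this essay.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker phrases; a real eval would use stricter checks.
EVIDENCE_MARKERS = ("would need", "additional evidence", "to confirm")
BOUNDARY_MARKERS = ("by itself", "on its own", "does not show")


@dataclass
class Score:
    mapping_ok: bool
    evidence_ok: bool
    boundary_ok: bool

    def missing(self) -> list:
        # Names of the components that failed, in field order.
        return [name for name, ok in vars(self).items() if not ok]


def score_answer(answer: str, expected_id: str) -> Score:
    text = answer.lower()
    return Score(
        mapping_ok=expected_id.lower() in text,
        evidence_ok=any(marker in text for marker in EVIDENCE_MARKERS),
        boundary_ok=any(marker in text for marker in BOUNDARY_MARKERS),
    )


def corrective_stub(prompt: str, answer: str, expected_id: str) -> Optional[dict]:
    """Return a to-be-rewritten training row for a failed answer, else None."""
    missing = score_answer(answer, expected_id).missing()
    if not missing:
        return None
    return {
        "prompt": prompt,
        "failed_answer": answer,
        "missing_habits": missing,   # tells the rewriter what to add
        "corrected_answer": None,    # filled in by hand, not by this code
    }
```

The point of the stub is that the rewrite step starts from a named gap rather than a bare pass/fail bit; writing the corrected answer is still a human job in this sketch.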
The funny part is that I’d been running this exact loop for months without recognizing it. In my agentic systems, every generate pass runs through a verify step, and when verification fails, the failure explanation gets fed back into the next generation attempt. That’s the same move. The only difference is where the correction lands. At inference time, the lesson lives in the context window and evaporates when the session ends. The patch writes the same lesson into the weights, where it sticks. Spend the failure in context and you fix one answer. Bank it in the weights and you fix the answerer.
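The inference-time version of that loop can be sketched in a few lines. This is an illustration under assumptions, not the author's agent code: `generate` and `verify` are hypothetical callables, and the feedback wording is made up.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_verification(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical model call
    verify: Callable[[str], Optional[str]],  # returns None on pass, else a reason
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry generation, feeding each failure explanation into the next attempt."""
    context = prompt
    failures: List[str] = []
    for _ in range(max_attempts):
        answer = generate(context)
        problem = verify(answer)
        if problem is None:
            return answer, failures
        failures.append(problem)
        # The lesson lives only in this context string and is gone afterwards.
        context = (
            f"{prompt}\n\nPrevious attempt failed verification: {problem}\n"
            "Revise the answer accordingly."
        )
    return None, failures
```

The returned `failures` list is the part worth keeping: logged across sessions, it is the raw material a weight-level patch would be built from.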
If you want the broader lineage, the closest named ancestors are hard-example mining and the DAgger family of corrective imitation learning. It’s also the mirror image of rejection sampling: rejection sampling keeps a model’s successes and trains on those, while this keeps the failures and trains on corrected versions of them. I call it a discipline patch because that’s what it does. It patches behavior, not knowledge.
There’s a broader design principle hiding in that distinction, and it’s the part I suspect reaches furthest beyond security. The discipline (the formatting, the boundaries, the refusal reflex, the shape of the reasoning) is slow to change and worth baking in, so it belongs in the adapter. The facts don’t. Which technique is trending, what came in yesterday, the specifics of one procedure: all of that is fast-moving, and it belongs in retrieval, not in weights I’d have to retrain just to stay current. Fine-tuning on yesterday’s facts is really just a way of baking in tomorrow’s staleness. Train the behavior, and retrieve the knowledge.
A couple of quieter choices kept the whole thing honest rather than a matter of taste. The training data is generated deterministically from the official ATT&CK source, and it deliberately includes fake-technique-ID rows as first-class hard negatives, so refusal is something the model is actually trained to do rather than something I’m hoping it picked up. And every row has to clear a deterministic quality gate before it’s allowed anywhere near training. That’s the Determinism Ladder applied to a data pipeline: make deterministic whatever you reasonably can, and save the model for the part that genuinely needs judgment.
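Here is a minimal sketch of what deterministic generation with hard negatives and a quality gate could look like. The row schema, the refusal wording, and the gate rules are assumptions, and `T9999` is used as an ID assumed not to exist; the repo's actual recipe may differ.

```python
import re

# Matches technique and sub-technique IDs such as T1059 or T1059.001.
ID_PATTERN = re.compile(r"T\d{4}(?:\.\d{3})?")


def build_rows(techniques: dict, fake_ids: list) -> list:
    """Build training rows in a fixed order so reruns give identical output."""
    rows = []
    for tid in sorted(techniques):
        rows.append({
            "prompt": f"What is {tid}?",
            "answer": f"{tid} is {techniques[tid]}.",
            "kind": "real",
        })
    for fid in sorted(fake_ids):
        if fid in techniques:
            raise ValueError(f"{fid} is a real technique, not a hard negative")
        rows.append({
            "prompt": f"What is {fid}?",
            "answer": f"{fid} is not a technique ID in the ATT&CK data available to me.",
            "kind": "hard_negative",
        })
    return rows


def passes_gate(row: dict, techniques: dict) -> bool:
    """Deterministic check every row must clear before training."""
    mentioned = set(ID_PATTERN.findall(row["answer"]))
    known = set(techniques)
    if row["kind"] == "real":
        # Real rows must cite at least one ID, and only known ones.
        return bool(mentioned) and mentioned <= known
    # Hard negatives must refuse and must not lean on any real technique.
    return "not a technique ID" in row["answer"] and not (mentioned & known)
```

Nothing here calls a model: the same inputs always produce the same rows, and a row that cites an unknown ID never reaches training.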
In the end, the adapter isn’t really the point. The loop is the point: a repeatable way to turn a model’s own mistakes into the exact examples that fix them. The code, the data recipe, and the eval suite are all in the repo, built entirely on public MITRE ATT&CK® data.
Axioms applied in this essay
This article tested 5 of the StoneyTECH engineering axioms. Each verdict is the result of applying that axiom in this specific argument.
- #2 Push work down toward determinism (held)
The patch moves repeated analyst discipline from per-answer prompting into adapter behavior.
- #5 Never trust 'running' without sentinels (held)
The eval suite becomes the improvement engine instead of a passive scorecard.
- #11 Cite or be silent (held)
The article links to the public repo and the public MITRE ATT&CK data basis.
- #14 Two cheaper alternatives first (held)
The piece rejects more data after cheaper diagnosis shows the failure is behavioral.
- #16 Don't comment without building. Don't curate without proving. (held)
The public repo, eval story, and article form a reusable proof path.
