You Can Start Building LLM Skills Before You Know the Whole Shape

The first time you stare at a backlog of a few hundred small repeated violations, all of them roughly the same shape, the natural thought is that this should be automated. Some validation rule trips on a missing attribute, or the same handful of conventions are wrong in dozens of templates, and the work is too tedious to do by hand and too varied to do with a single regex. So your brain reaches for the word automation, and from there it tries to assemble a system.

That impulse is fine. The problem is that it tends to grow before anything useful exists. You start sketching a resolver: it should scan the repo nightly, file tickets, classify by severity, propose fixes, run tests, request review, merge low-risk changes, and report back. By the time you finish describing the shape, you have a roadmap and no skill. The pile of work has not changed and the part of your day where you actually fix things is still there, untouched.

I have been on enough of those whiteboards to know the trick is to start much smaller than the diagram wants you to. Specifically, start with a single useful behavior, run it against real work, and let the thing become a skill by accretion.

Automation is too broad a word

Before getting into the building part, it helps to acknowledge that the word automation hides several different things. When somebody says “we need to automate this,” they could mean any of: a scheduled scan, a triggered intake, a triage classifier, a fix suggester, a PR opener, a test runner, a review commenter, an automerger, a monitor. Each of those is a separate decision about who or what is allowed to start, propose, approve, and ship. Saying “automated” without picking one of those points is how you end up arguing about a system instead of building anything.

For our purposes, naming the part is enough. The skill you are about to build is not the whole pipeline; it is one named behavior, sitting at one point in that chain, doing one understandable thing.

A skill begins as one useful behavior

Take the backlog of validation violations. The temptation is to write a skill that detects, fixes, verifies, and batches across the entire repo. Resist that for a beat. Pick the smallest behavior that would be useful to you today. Something like: given a single template file that contains one of these violations, produce a proposed edit that satisfies the rule and explain why.

That is not glamorous. It is also not the resolver. It is a fix proposer for one rule on one file. But it is enough to be tested against a real example, and it is small enough that you can read its output and immediately know whether it did the right thing. The skill becomes useful the moment you can run it on a file you actually have to fix.

What goes into the skill at this stage is also small. A short description of the rule the skill is responsible for, written as plain instructions the model can follow. A scope statement that says which file types it operates on. A procedure that says, roughly: read the file, find the violation, propose the minimum edit that satisfies the rule, return both the edit and a one-line reason. That is it.
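Written down as data, that first version might look something like the sketch below. Everything named here is an assumption for illustration: the `Skill` class, its field names, and the alt-text rule are hypothetical stand-ins, not a real skill format. The only point is that the whole thing fits on one screen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """One named behavior: a rule, a scope, and a short procedure (hypothetical shape)."""
    name: str
    rule: str                        # the single rule this skill is responsible for
    file_suffixes: tuple[str, ...]   # scope: which file types it operates on
    procedure: tuple[str, ...]       # ordered steps the model should follow

    def in_scope(self, filename: str) -> bool:
        # str.endswith accepts a tuple of suffixes.
        return filename.endswith(self.file_suffixes)

    def render_prompt(self, filename: str, content: str) -> str:
        """Assemble the text handed to the model for one file."""
        if not self.in_scope(filename):
            raise ValueError(f"{filename} is outside this skill's scope")
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(self.procedure, 1))
        return (
            f"Rule: {self.rule}\n"
            f"Scope: files ending in {', '.join(self.file_suffixes)}\n"
            f"Procedure:\n{steps}\n\n"
            f"File: {filename}\n{content}"
        )


# Hypothetical example: one rule, one file type, four steps.
ALT_TEXT_FIX = Skill(
    name="propose-alt-fix",
    rule="Every <img> element must have a non-empty alt attribute.",
    file_suffixes=(".html",),
    procedure=(
        "Read the file.",
        "Find the violation.",
        "Propose the minimum edit that satisfies the rule.",
        "Return both the edit and a one-line reason.",
    ),
)
```

Calling `ALT_TEXT_FIX.render_prompt("card.html", text)` gives you the full prompt for one file, and a file outside the scope is refused before the model ever sees it.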

The hardest part is keeping yourself from adding more right now. There will be a moment when you want to also have it handle a related rule, or detect violations across the whole directory, or open a pull request. Save those. The skill is not asking for them yet because it has not been used.

Use is what teaches the skill its shape

The first few real runs are where the actual design happens. You point the skill at a file you know is broken, read what it produces, and notice the gap between what you wanted and what it returned. Maybe it fixes the violation but rewrites adjacent code that you did not want touched. Maybe the explanation it generates is fine the first time and three sentences too long the second time. Maybe it confidently solves the wrong rule because something in your phrasing implied a related one.

These are not failures of the model so much as gaps in the skill. Each of them tells you something concrete: tighten the scope statement so the model does not roam beyond the violation, constrain the output format so the explanation cannot expand into an essay, narrow the rule definition so the model stops solving its cousin. The next version of the skill is shaped by what the previous version got wrong on real files. That is very different from designing it on a whiteboard.
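Some of those gaps can also be caught mechanically before you read the output. The sketch below is one way to do it, under assumptions of my own: the `Proposal` shape, the 160-character limit, and the idea that you already know which line holds the violation are all hypothetical. It uses the standard library's `difflib` to see which original lines an edit touched.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    original: str   # file content before the edit
    edited: str     # file content after the proposed edit
    reason: str     # the model's explanation


def touched_lines(original: str, edited: str) -> set[int]:
    """Return 1-based line numbers of the original that the edit changed."""
    a, b = original.splitlines(), edited.splitlines()
    touched: set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A pure insertion has i1 == i2; attribute it to the line where it lands.
        touched.update(range(i1 + 1, max(i2, i1 + 1) + 1))
    return touched


def check_proposal(p: Proposal, violation_line: int, max_reason_chars: int = 160) -> list[str]:
    """Return a list of problems; an empty list means the proposal stayed in bounds."""
    problems = []
    if p.original == p.edited:
        problems.append("edit changes nothing")
    stray = touched_lines(p.original, p.edited) - {violation_line}
    if stray:
        problems.append(f"edit touches lines outside the violation: {sorted(stray)}")
    reason = p.reason.strip()
    if "\n" in reason or len(reason) > max_reason_chars:
        problems.append("reason is not a single short line")
    return problems
```

A check like this does not replace reading the output. It only tells you quickly when the model roamed or rambled, which is usually the cue to go edit the scope or format sentence in the skill.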

This is also where the model’s habits become visible. It will cheerfully produce plausible work that is not actually correct, especially when the context you hand it is broader than the task needs. It will fix things you did not ask it to fix, because somewhere in the context you gave it permission to. Treat these as signals, not personality flaws. Each one is a hint about which sentence in the skill should be edited next.

The skill grows by accretion

Once the small fix proposer is reliable on single files, you have a base to extend. The next layer is whatever real use asks for. Maybe the obvious next move is having the skill scan a directory, list the files that contain the violation, and propose edits one at a time. Maybe it is producing a short markdown checklist of files for you to walk through manually, because you trust the detection but not yet the fix. Maybe it is a stopping condition, like pausing for confirmation after a small batch of changes so you can spot drift before it spreads across the repo.
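Each of those three layers is small enough to sketch. The version below is an assumption-heavy illustration, not a recommended design: `missing_alt` is a deliberately crude stand-in detector, and the function names, the batch size, and the `confirm` callback are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Sequence


def missing_alt(text: str) -> bool:
    """Crude, hypothetical detector: an <img> tag with no alt attribute anywhere in the file."""
    return "<img" in text and "alt=" not in text


def find_violations(root: Path, suffix: str, has_violation: Callable[[str], bool]) -> list[Path]:
    """Layer 1: list the files under root that contain the violation."""
    return sorted(
        p for p in root.rglob(f"*{suffix}")
        if p.is_file() and has_violation(p.read_text(encoding="utf-8"))
    )


def checklist(paths: Sequence[Path], root: Path) -> str:
    """Layer 2: a markdown checklist to walk through by hand."""
    return "\n".join(f"- [ ] {p.relative_to(root).as_posix()}" for p in paths)


def run_in_batches(
    paths: Sequence,
    apply_fix: Callable,
    confirm: Callable[[int], bool],
    batch_size: int = 5,
) -> int:
    """Layer 3: apply fixes, pausing for confirmation after each batch.

    Returns how many items were processed before finishing or being stopped.
    """
    done = 0
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            apply_fix(path)
            done += 1
        # Only ask when there is more work left; stop if the person says no.
        if start + batch_size < len(paths) and not confirm(done):
            break
    return done
```

In interactive use, `confirm` could be as simple as `lambda done: input(f"{done} files changed, continue? [y/N] ").lower() == "y"`. You do not need all three layers at once. Add whichever one the work asks for first.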

Whatever the next layer is, it is added because the work asked for it, not because a diagram demanded it. The skill grows the way a tool you actually use grows: a new procedure here, a new boundary there, an output format that turned out to be more useful than the one you started with. After a few weeks of this, the skill looks more like a workflow and less like a single behavior, but it got there honestly.

This is the part the up-front design approach almost always misses. The right shape of a skill is the shape that emerges after it has been used against real problems. You cannot write that shape from a whiteboard; you can only write the first useful behavior and then keep paying attention.

Human checkpoints are part of the design

It is easy to read all of this and assume the goal is full autonomy. It is not. A useful LLM skill is allowed to be as small as a function that suggests, prepares, classifies, edits, or generates reviewable work. Knowing where the human still starts, checks, approves, or extends the workflow is a feature of the skill, not a limitation of it.

For the validation example, that might mean the skill proposes the fix and a person applies it; or the skill opens a pull request and a person reviews it; or the skill writes a checklist and a person works through it. None of those positions is lesser. They are decisions about where judgment lives. The skill does not have to do everything to be worth building, and naming what it does not do is part of how you keep it honest.

This is where the automation taxonomy from earlier earns its keep. You can say plainly: this skill proposes; that one drafts; the other one batches. Each name is also a boundary.
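If it helps to see a name acting as a boundary, here is a minimal sketch. The mapping from each name to the actions it permits is entirely my assumption for illustration; your own skills will draw those lines differently. Note that merging appears in no role, so that step stays with a person.

```python
from enum import Enum, auto


class Action(Enum):
    READ_FILE = auto()
    PROPOSE_EDIT = auto()
    WRITE_FILE = auto()
    OPEN_PULL_REQUEST = auto()
    MERGE = auto()


class Role(Enum):
    PROPOSES = "proposes"
    DRAFTS = "drafts"
    BATCHES = "batches"


# Hypothetical boundaries: what each named skill is allowed to do.
ALLOWED: dict[Role, frozenset[Action]] = {
    Role.PROPOSES: frozenset({Action.READ_FILE, Action.PROPOSE_EDIT}),
    Role.DRAFTS: frozenset({
        Action.READ_FILE, Action.PROPOSE_EDIT,
        Action.WRITE_FILE, Action.OPEN_PULL_REQUEST,
    }),
    Role.BATCHES: frozenset({Action.READ_FILE, Action.PROPOSE_EDIT, Action.WRITE_FILE}),
}


def require(role: Role, action: Action) -> None:
    """Raise if a skill with this role tries an action outside its boundary."""
    if action not in ALLOWED[role]:
        raise PermissionError(
            f"a skill that {role.value} may not {action.name.lower()}; "
            "that step belongs to a person"
        )
```

The table is short on purpose. When a skill needs a new action, adding it to `ALLOWED` is an explicit decision rather than something that crept in.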

Start now, let it earn its shape

The version of this work I keep recommending is the one where you stop drawing the system and write the first useful behavior tonight. Pick one rule, one file type, one small expected output, and one example you can test against. Run it. Read what it gave you. Adjust the skill where it drifted. Run it again. Keep the source material narrow. Add the next behavior only when the work asks for it.

The skill that comes out of this is not the one you would have designed up front, and that is the point. It fits the work because the work shaped it. There will still be a place for the larger picture later, when you have enough of these small skills that connecting them starts to look obvious. By then you will have the experience to tell which connections are real and which ones are diagrams.

For now, the useful move is the smaller one. The whole shape is allowed to come later. And congratulations on making your first skill!

LLM Disclosure: This post was written with AI assistance for research synthesis and drafting.