FIELD NOTES · 2026-05-29

Design tokens are the missing layer in AI coding.

AI coding tools are good at components. Ask for a card, a form, a settings page, and you get working markup in seconds. But watch what happens to the values inside: a hex picked here, a 13px there, a one-off radius on this card and a different one on the next. The model writes literal numbers straight into each file, with nothing shared between them. There is no token layer, so every screen improvises its own values, and the whole thing drifts the moment you have more than one screen.

What the model actually emits

A design token is a named decision (--color-fg-muted, --space-3, --radius-md) referenced everywhere instead of repeated. It is the layer that keeps a hundred components agreeing on the same gray, the same step in the spacing scale, the same corner. It is also exactly the layer an AI coding tool skips.
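
To make "named decision" concrete, here is a minimal sketch of a token layer as plain data, rendered to CSS custom properties. The three names come from the paragraph above. The values, and the helpers to_css and use, are assumptions made up for illustration, not part of any particular tool.

```python
# Hypothetical token values; only the three names come from the text.
TOKENS = {
    "color-fg-muted": "#6b7280",
    "space-3": "12px",   # assumed: third step on a 4px scale
    "radius-md": "8px",  # assumed value
}


def to_css(tokens: dict[str, str]) -> str:
    """Render the tokens as CSS custom properties on :root."""
    lines = [f"  --{name}: {value};" for name, value in tokens.items()]
    return ":root {\n" + "\n".join(lines) + "\n}"


def use(name: str) -> str:
    """Return a var() reference, refusing names that were never defined."""
    if name not in TOKENS:
        raise KeyError(f"unknown token: {name}")
    return f"var(--{name})"


# A component references the decisions by name instead of repeating values.
CARD_CSS = (
    f".card {{ color: {use('color-fg-muted')}; "
    f"padding: {use('space-3')}; "
    f"border-radius: {use('radius-md')}; }}"
)

print(to_css(TOKENS))
print(CARD_CSS)
```

The component never carries a value of its own. Changing the gray means editing one entry in TOKENS, and a misspelled name fails loudly instead of quietly becoming a new literal.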

The reason is structural. The model generates one component at a time, and for each one it picks the most probable literal value in context. Probable is not the same as consistent. The gray it reaches for in a header is close to, but not the same as, the gray in a sidebar two prompts later. The padding on a button is a plausible number, and so is the slightly different number on the input beside it. Each file is locally reasonable; nothing is globally aligned.

The output isn't wrong file by file. It is inconsistent file to file. Without a shared layer to point at, the model re-decides every value from scratch, and re-deciding is where the drift comes from.

Improvised values drift; tokens hold

Put the two side by side and the difference is concrete. On the left is what generation tends to produce: a literal value, freshly chosen, in each file. On the right is the same decision expressed once, as a token every component consumes.

layer	improvised, per file	one token, shared
color	`#6b7280` here, `#6e7681` there	`--color-fg-muted`
spacing	`13px`, then `15px`, then `14px`	`--space-3` on a 4px step
type	`15px/1.45` reset on each block	`--text-body` from one scale
radius	`6px` on one card, `10px` on the next	`--radius-md`
motion	a fresh `220ms` guess each time	`--ease-quick`

The left column is not ugly. Any single row would pass review. But three screens in, the muted grays no longer match, the spacing has four values where it should have one step, and the radii read as an accident. Nobody chose the inconsistency. It accumulated, because there was no layer to make the choice once.
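
The accumulation is easy to see once the values are pulled out of their files and set next to each other. The sketch below assumes three hypothetical generated stylesheets whose literals echo the left column of the table, and counts the distinct values per property. The file names and the simple declaration regex are illustrative assumptions.

```python
import re
from collections import defaultdict

# Hypothetical generated files; the literals echo the table's left column.
FILES = {
    "header.css": ".header { color: #6b7280; padding: 13px; border-radius: 6px; }",
    "sidebar.css": ".sidebar { color: #6e7681; padding: 15px; border-radius: 10px; }",
    "card.css": ".card { color: #6b7280; padding: 14px; border-radius: 6px; }",
}

# Matches simple "property: value" declarations; not a full CSS parser.
DECLARATION = re.compile(r"([a-z-]+)\s*:\s*([^;}]+)")


def distinct_values(files: dict[str, str]) -> dict[str, set[str]]:
    """Collect every distinct literal used for each property across files."""
    seen: dict[str, set[str]] = defaultdict(set)
    for css in files.values():
        for prop, value in DECLARATION.findall(css):
            seen[prop].add(value.strip())
    return dict(seen)


for prop, values in distinct_values(FILES).items():
    print(f"{prop}: {len(values)} distinct -> {sorted(values)}")
```

Each file on its own looks fine. Only the cross-file view shows two grays, three paddings, and two radii where a token layer would allow one of each.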

The fix is a layer the generated code must consume

You cannot prompt your way out of this by asking for more consistency. The model will still re-decide each value. The fix is to give it a layer to reference instead of a blank to fill: a set of tokens defined up front, with the instruction that components read from them and never inline a raw value. Generation against tokens is a different shape of problem: the model picks which token applies, not what number to invent. The same gray flows through every component because they all point at the same name.

That only works if the tokens are real and complete before the first component is written. A half-defined layer (three colors and no spacing scale) invites the model to improvise the gaps, and the drift returns through the back door. The layer has to cover the decisions that actually vary: a color ramp, a spacing scale, a type scale, a radius set, and a motion set. Define those once and the output stays consistent, and stays rebrandable, because changing the brand means editing one layer rather than hunting literals across forty files.
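
One way to enforce "complete before the first component" is a coverage check that runs before any generation starts. The sketch below is an assumption about how such a check might look: it maps the five categories named above to token-name prefixes (borrowed from the example names in this article) and reports which categories have no tokens at all.

```python
# Categories from the text, keyed to name prefixes. The prefixes follow the
# example token names in this article; the mapping itself is an assumption.
REQUIRED = {
    "color ramp": "color-",
    "spacing scale": "space-",
    "type scale": "text-",
    "radius set": "radius-",
    "motion set": "ease-",
}


def missing_categories(token_names: list[str]) -> list[str]:
    """Return the categories that no defined token covers."""
    return [
        category
        for category, prefix in REQUIRED.items()
        if not any(name.startswith(prefix) for name in token_names)
    ]


# The half-defined layer described above: three colors and nothing else.
half_defined = ["color-fg", "color-fg-muted", "color-bg"]
print(missing_categories(half_defined))
```

A non-empty result is the signal to stop and finish the layer, since every missing category is a gap the model would otherwise fill with fresh literals.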

Where ux-skill produces the layer

This is the gap ux-skill fills. Instead of leaving the token layer to chance, it compiles a project brief into a concrete token set, deterministically, with no model in the loop. A brief (surface, industry, tone) resolves through a fixed pipeline into a recommended design system: a style, a palette, a type pairing, and a spacing scale driven by the density of the surface.

The recommender outputs a token set. Style plus palette plus a type pair plus a density-driven spacing base come back as named values, not prose: the layer the generated code is meant to consume.
/ux-system proposes a starter system. For a project with no tokens yet, it stands up the full layer (color ramp, spacing, type scale, radius, motion) so components have something real to reference from the first file.
The linter flags off-token literals. A regex pass reads the output before you commit and catches the tells of improvisation: a raw hex where a variable belongs, a magic spacing number outside the scale, a one-off radius. Each finding names the line and the fix.
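
The idea of a regex pass for off-token literals can be sketched in a few lines. This is not ux-skill's linter. It is a minimal illustration under stated assumptions: a 4px spacing step, token definitions recognizable as lines starting with "--", and two deliberately naive patterns that would need hardening for real stylesheets.

```python
import re

HEX = re.compile(r"#[0-9a-fA-F]{3,8}\b")  # naive: raw hex colors
PX = re.compile(r"\b(\d+)px\b")           # naive: integer pixel values
SPACING_STEP = 4                          # assumed scale step, in px


def lint(css: str) -> list[tuple[int, str]]:
    """Return (line number, message) for each literal that bypasses tokens."""
    findings = []
    for lineno, line in enumerate(css.splitlines(), start=1):
        if line.lstrip().startswith("--"):
            continue  # token definitions are where literals belong
        for match in HEX.finditer(line):
            findings.append((lineno, f"raw hex {match.group()}: use a color token"))
        for match in PX.finditer(line):
            if int(match.group(1)) % SPACING_STEP != 0:
                findings.append(
                    (lineno, f"{match.group()} is off the {SPACING_STEP}px step: use a spacing token")
                )
    return findings


SAMPLE = """\
:root {
  --color-fg-muted: #6b7280;
  --space-3: 12px;
}
.card {
  color: #6e7681;
  padding: 13px;
  margin: var(--space-3);
}
"""

for lineno, message in lint(SAMPLE):
    print(f"line {lineno}: {message}")
```

On the sample, the two definitions inside :root pass, the var() reference passes, and the inlined hex and the 13px padding are each reported with their line number.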

Run the same brief next week, on another machine, and the token set comes back identical. That is the point of taking the model out of the loop: the layer that everything else depends on stops being a guess.
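
A claim like "comes back identical" is cheap to verify. The sketch below assumes the token set can be represented as a JSON-serializable dict (the actual output format of ux-skill is not shown here) and fingerprints it with a canonical serialization, so two runs can be compared by a single hash.

```python
import hashlib
import json


def fingerprint(token_set: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the result."""
    canonical = json.dumps(token_set, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical output from two runs of the same brief; same content,
# different key order.
run_monday = {"color-fg-muted": "#6b7280", "space-3": "12px"}
run_friday = {"space-3": "12px", "color-fg-muted": "#6b7280"}

print(fingerprint(run_monday) == fingerprint(run_friday))
```

Storing the fingerprint alongside the brief gives a one-line regression check: if the hash changes while the brief has not, something nondeterministic has entered the pipeline.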

Token-driven versus improvised

The contrast is the whole argument. Improvised output is a pile of locally plausible numbers that never agree across files, so the surface drifts and a rebrand is a search-and-replace through the codebase. Token-driven output routes every component through one named layer, so the surface stays consistent by construction and a rebrand is an edit to that layer alone. Same components, same AI coding tool. The difference is whether there is a layer underneath for the generated code to point at.

pip install uxskill
# then, in your AI coding tool:
# /ux-system     : propose a starter token layer for a project with none
# /ux-recommend  : brief in, deterministic token set out
