Launch · v3.0.0 · 2026-05-28

ux-skill v3.0: we shipped The Brain. Brand specs are training data now, not templates.

v2 picked from the catalogue. v3 distills from it. Same 160 brand specs, completely different role. The recommender used to score 1,243 entries and return the highest-similarity match. The synthesizer now reads those entries as vocabulary and compiles a novel design language from a brief. Every call returns a system that did not exist before the call.

By the ux-skill team · 12 min read · MIT-licensed open source



The shift in one paragraph

v2 was a recommender. You gave it a brief and it returned the closest match from a curated catalogue: 84 styles, 176 palettes, 160 brand specs, 70 type pairs. The output was always a row from the database. v3 keeps the entire catalogue but flips its purpose: those 1,243 entries are now the vocabulary the engine distills from. A brief no longer pulls a row. It gets compiled into seven axis values, and those values synthesize fresh tokens. The catalogue teaches the engine what airy-and-corporate looks like; the engine generates a new airy-and-corporate system that has never shipped before. Same data. Completely different role.

What broke in v2 that v3 fixes

Three real bugs and one design hole. The bugs were not loud. They were the quiet kind where the engine kept returning plausible output and we kept shipping it. We only caught them once we sat down to write the upgrade spec and an external review forced us to look honestly.

1. Pseudo-determinism through filesystem-order tie-breaks

Several sorts inside the recommender pipeline scored candidates on similarity and broke ties by whatever order Python's os.listdir returned. That order is documented as arbitrary: it depends on the filesystem, not on Python. On some of our machines it happened to look alphabetical; on others it followed the directory's internal layout. On a fresh clone vs. an aged clone, the tie-break could flip. We told ourselves the engine was deterministic. It was deterministic up to a tie. After a tie, it was whatever the operating system happened to remember.

v3 hardens every sort with an explicit alphabetical-on-brand-id tie-breaker. Same brief, same axes, same output across machines, filesystems, and Python versions. We have a test that runs the synthesizer 200 times across three temp directories and asserts byte-identical output.
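A minimal sketch of that tie-break, assuming each candidate carries a similarity score and a brand id. The Candidate type, the rank function, and the brand ids other than stripe are hypothetical names chosen for illustration, not the engine's real ones.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    brand_id: str  # hypothetical field name
    score: float   # similarity score, higher is better


def rank(candidates):
    """Order by score descending, then by brand_id ascending.

    The secondary key makes the result independent of input order,
    so it no longer matters how the directory listing was ordered.
    """
    return sorted(candidates, key=lambda c: (-c.score, c.brand_id))


# The same members in two different orders, as two filesystems might list them.
listing_a = [
    Candidate("stripe", 0.91),
    Candidate("adyen", 0.91),  # illustrative id, tied with stripe
    Candidate("wise", 0.80),   # illustrative id
]
listing_b = list(reversed(listing_a))

ranked_a = rank(listing_a)
ranked_b = rank(listing_b)
```

Both listings rank identically because the tie between the two 0.91 candidates is settled by the id, not by arrival order.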

2. Axis collisions resolved by accident

In v2, when a brief had two tags that pulled in opposite directions (dense and corporate, for example) the recommender's scorer would weight one of them harder by some hidden coefficient, and the output would lean toward whichever tag the coefficient happened to favor. There was no documented rule. We could explain the result after the fact. We could not predict it before.

v3 lifts this into an axis interaction matrix. Every documented conflict has an explicit resolution and a rationale. Density and formality fight over spacing: density wins for dense briefs (Bloomberg-school), formality wins for airy briefs (luxury). The matrix is tested. It is readable. It is checked in.

3. The evaluator was grading the synthesizer's own homework

This was the most embarrassing one. We caught it because ChatGPT flagged it in an external review of the v2.1 spec and called it the self-referential drift problem. The flow looked clean: the synthesizer compiled a system from the brief, then score_tone_match evaluated how well the system matched the brief's tone. The trouble was that score_tone_match derived axis values from the brief using the same logic as the synthesizer, then compared the synth's axes to its own derived axes. Both sides of the comparison ran through the same function. Of course it scored well. The synthesizer was grading itself with its own rubric.

v3 decouples them. score_tone_match now compares synth output to the brief's raw tone tags (the strings warm, bold, minimal, etc.) through an independent mapping that never touches the synthesizer's axis logic. The two paths cannot cheat off each other anymore. We owe that observation to the external review, and we are saying so on the record.
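One way such a decoupled check could look is sketched here. The function name score_tone_match comes from the text above, but its signature, the TONE_EXPECTATIONS table, and every threshold in it are assumptions made for illustration. The point is only that the table is hand-written and never calls the synthesizer's brief-to-axes logic.

```python
# Hypothetical mapping from raw tone tags to the axis range each tag should
# land in. Written by hand; it never calls the synthesizer.
TONE_EXPECTATIONS = {
    "warm": ("warmth", 0.6, 1.0),
    "bold": ("contrast", 0.6, 1.0),
    "minimal": ("density", 0.0, 0.4),
}


def score_tone_match(tone_tags, synth_axes):
    """Return the fraction of recognised tags whose axis falls in range."""
    checks = [TONE_EXPECTATIONS[t] for t in tone_tags if t in TONE_EXPECTATIONS]
    if not checks:
        return 0.0
    hits = sum(1 for axis, low, high in checks if low <= synth_axes[axis] <= high)
    return hits / len(checks)


# A system that is loud but cool: "bold" is satisfied, "warm" is not.
axes = {"warmth": 0.32, "contrast": 0.72, "density": 0.58}
result = score_tone_match(["bold", "warm"], axes)
```

Because the expectations live in their own table, a bug in the synthesizer's axis derivation shows up as a low score instead of being mirrored on both sides of the comparison.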

4. The decisions log was write-only, never consumed

v2 wrote a .ux/decisions.jsonl entry after every recommend, but nothing read it back. It was a log file for our own debugging, not a feedback signal. The recommender ran identically on call 1 and call 1,000. There was no learning. The repo had a memory and the engine ignored it.

v3 closes that loop. The ledger has a locked schema (_v: 1), the recommender re-ranks candidates by past wins in the same (industry, ui_type) bucket, and only decisions that scored well and the user actually shipped count toward the re-rank. The brain has eyes on its own history now. The section "The brain learns" below is the full walkthrough.

The 7-axis synthesizer

At the center of v3 is a deterministic function that maps a brief to seven numeric axes, and those seven axes to a complete design system. The axes are not magic. Each one is a normalized scalar with a documented range, and each one points at a specific token bundle. The point of running on axes (rather than picking a row from the catalogue) is that any combination is reachable, including combinations that no real brand in the catalogue uses. The catalogue defines the shape of the space; the synthesizer can land anywhere inside it.

Axis	What it measures	Maps to
warmth	Color temperature of the palette: cool charcoal to warm sienna	palette hue family, accent saturation curve
contrast	Visual loudness: quiet/balanced/loud	type scale ratio (1.200 / 1.250 / 1.333), shadow depth, accent intensity
density	Information packing per viewport: airy to dense	spacing scale base (4/8/12/16px), line-height multipliers
geometry	Edge personality: sharp/balanced/soft	radius scale (2/8/18px), corner topology, border weights
formality	Tone register: playful/balanced/corporate	type weight curves, tracking, caption-to-hero spread
motion	Animation budget: restrained/balanced/cinematic	duration ramps (120/240/420ms), easing family, scroll behavior
type_personality	Typographic voice: humanist/neutral/geometric/expressive	display face family, weight ladder, italic strategy
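To make the "Maps to" column concrete, here is a small sketch of how an axis value could select one of the three-level token values listed in the table. It assumes axes are normalized to [0, 1] and that the levels occupy equal-width bands. Both are assumptions for this example, and bucket is a hypothetical helper.

```python
def bucket(value, levels):
    """Map a normalized axis value in [0, 1] onto one of the given levels.

    Equal-width bands are an assumption made for this sketch.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("axis values are expected in [0, 1]")
    index = min(int(value * len(levels)), len(levels) - 1)
    return levels[index]


# Level values taken from the table above.
TYPE_SCALE_RATIO = (1.200, 1.250, 1.333)  # quiet / balanced / loud
RADIUS_PX = (2, 8, 18)                    # sharp / balanced / soft
DURATION_MS = (120, 240, 420)             # restrained / balanced / cinematic

axes = {"contrast": 0.72, "geometry": 0.42, "motion": 0.10}
tokens = {
    "type_scale_ratio": bucket(axes["contrast"], TYPE_SCALE_RATIO),
    "radius_px": bucket(axes["geometry"], RADIUS_PX),
    "duration_ms": bucket(axes["motion"], DURATION_MS),
}
```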

The Python entry point is small enough to fit in a single example. You pass a brief, you get back a synthesized system with the axis values, the chosen palette, type configuration, spacing scale, radius scale, and motion presets attached.

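# from a brief to a system, in one call
# Assumption: Brief is importable from the same module as synthesize.
from engine.synthesizer import Brief, synthesize

brief = Brief(
    industry="fintech-payments",
    tone=["bold", "serious"],
    audience=["merchants", "developers"],
)

# Named `system` rather than `sys` so it does not shadow the standard-library sys module.
system = synthesize(brief)
# system.mode == "pure_synthesis"
# system.axes == {"warmth": 0.32, "contrast": 0.72, "density": 0.58, ...}
# system.palette, system.type, system.spacing, system.radius, system.motion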

Same brief in, same axes out, same tokens out. Always. The synthesizer has no random seed because there is no randomness to seed.

Three modes, one decision logic

The synthesizer auto-dispatches between three modes based on what the brief contains. There is no flag to pick a mode by hand for most uses; the brief shape does it. The logic is intentionally simple: presence of a reference brand and a strict flag decides the path.

strict_brand · fastest

100% the named brand

reference_brands=["stripe"] with strict=True returns the Stripe spec verbatim. No interpretation, no mixing, no synthesis. Useful when the brand exists, the spec is authoritative, and the brief just wants the tokens. This is the path you take when you have a Figma file and a documented system and you need code that does not improvise.

brand_anchor · 70 / 30

Anchored to the brand, axis-adapted for the brief

reference_brands=["stripe"] without strict=True returns 70% Stripe tokens fused with 30% axis-derived deltas from four sibling brands picked by axis proximity. The output stays unmistakably Stripe but bends toward the brief's tone: a more playful Stripe for a consumer feature, a more formal Stripe for an enterprise dashboard. The anchor brand survives every conflict.
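One possible reading of the 70 / 30 split for a single numeric token is a weighted blend of the anchor's value with the mean of the sibling values. This is an interpretation, not the engine's documented formula. blend_token and the pixel values are hypothetical.

```python
ANCHOR_WEIGHT = 0.7  # the "70" in 70 / 30


def blend_token(anchor_value, sibling_values, anchor_weight=ANCHOR_WEIGHT):
    """Blend one numeric token: the anchor's share plus the siblings' mean."""
    if not sibling_values:
        return anchor_value
    sibling_mean = sum(sibling_values) / len(sibling_values)
    return anchor_weight * anchor_value + (1 - anchor_weight) * sibling_mean


# Illustrative numbers only: an anchor radius of 8px and four sibling radii.
radius = blend_token(8, [4, 12, 16, 16])  # roughly 9.2
```

With the anchor carrying the larger weight, the blended value stays closer to the anchor than to the siblings, which matches the "anchor survives" behavior described above.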

pure_synthesis · infinite space

No brand named: novel language every call

No reference_brands in the brief. The synthesizer scores the seven axes from the brief, finds the eight nearest exemplars in the catalogue by axis distance, distills shared structure across them, and emits a fresh palette + type + spacing + radius + motion bundle. Different briefs land in different regions of the same space; every output is internally consistent and identifiably its own.
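The "eight nearest exemplars by axis distance" step can be sketched as a plain nearest-neighbour lookup. Euclidean distance, the catalogue layout (id to axis values), and the brand-a/b/c ids are assumptions for illustration. The text does not say which metric the engine uses.

```python
import math

AXES = ("warmth", "contrast", "density", "geometry",
        "formality", "motion", "type_personality")


def axis_distance(a, b):
    """Euclidean distance over the seven axes (an assumed metric)."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in AXES))


def nearest_exemplars(target, catalogue, k=8):
    """Return the ids of the k catalogue entries closest to target.

    Ties on distance are broken by id, so the result does not depend
    on the order in which the catalogue was loaded.
    """
    ranked = sorted(
        catalogue,
        key=lambda entry_id: (axis_distance(target, catalogue[entry_id]), entry_id),
    )
    return ranked[:k]


def flat(value):
    """Helper for the demo: the same value on every axis."""
    return {axis: value for axis in AXES}


target = flat(0.5)
catalogue = {"brand-c": flat(1.0), "brand-b": flat(0.25), "brand-a": flat(0.75)}
picked = nearest_exemplars(target, catalogue, k=2)
```

In the demo, brand-a and brand-b are exactly equidistant from the target, so the id decides their order.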

From the CLI the three paths look like this:

# strict: 100% the brand
$ uxskill synthesize --brand stripe --strict

# anchor: 70/30
$ uxskill synthesize --brand stripe

# pure synthesis: no brand
$ uxskill synthesize --industry fintech-payments --tone bold

One synthesizer, one decision logic, three output personalities. The user does not switch implementations; the brief routes itself.
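The routing rule described in this section fits in a few lines. pick_mode is a hypothetical name. Treating strict=True with no reference brand as pure synthesis is an assumption, since the text does not cover that case.

```python
def pick_mode(reference_brands=None, strict=False):
    """Choose a synthesis mode from the shape of the brief."""
    if not reference_brands:
        # Assumption: strict has no effect when no brand is named.
        return "pure_synthesis"
    return "strict_brand" if strict else "brand_anchor"


modes = [
    pick_mode(["stripe"], strict=True),
    pick_mode(["stripe"]),
    pick_mode(),
]
```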

The axis interaction matrix

Axes pull on the same tokens. Density wants tight spacing; formality wants generous spacing for a luxury register. Geometry wants sharp corners for editorial; formality wants gentle corners for a financial dashboard. In v2 these conflicts resolved by whichever coefficient happened to be larger in the scorer. In v3 every documented conflict has a stated outcome and a school of design it points at, named, so you can read it and disagree if you want.

Four representative cases, all checked in as test fixtures:

dense + corporate

Spacing scale base: 4px

Density wins. The Bloomberg-school answer. Information packing is the highest virtue when the brief asks for dense data dashboards in a corporate register: formality bows to legibility-at-density.

airy + corporate

Spacing scale base: 12px

Formality wins. The luxury-finance answer. Airy + corporate is the brief for premium banking, private wealth, executive dashboards: the room around each element is the message of seriousness.

sharp + corporate

Radius scale: 2px

Geometry wins. The NYT-school answer. Sharp + corporate is the editorial brief: right angles and 2px micro-radii read as institutional, considered, broadsheet.

soft + playful

Radius scale: 18px

Both axes align. The Glossier-school answer. Soft geometry plus a playful register collapses onto generous rounding: 18px is the threshold where rectangles start reading as pebbles.

Why this matters: in v2 the same brief could land on 4px or 12px depending on which coefficient won the scorer that month. In v3, dense + corporate is always 4px. airy + corporate is always 12px. The rules are readable, testable, and arguable. If you disagree with a resolution, you can file an issue against the matrix and the discussion is about taste, not about a hidden constant.
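As a sketch, the four cases above can be written as a lookup table that carries the outcome and its rationale together. The Resolution type, the key format, and the resolve helper are hypothetical. The checked-in matrix may be structured differently.

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    token: str
    value_px: int
    winner: str   # which axis wins, or "aligned" when they agree
    school: str   # the named school of design the rule points at


# The four representative cases from this section.
INTERACTION_MATRIX = {
    ("dense", "corporate"): Resolution("spacing_base", 4, "density", "Bloomberg"),
    ("airy", "corporate"): Resolution("spacing_base", 12, "formality", "luxury finance"),
    ("sharp", "corporate"): Resolution("radius", 2, "geometry", "NYT"),
    ("soft", "playful"): Resolution("radius", 18, "aligned", "Glossier"),
}


def resolve(first, second):
    """Look up a documented conflict; fail loudly if it is not documented."""
    try:
        return INTERACTION_MATRIX[(first, second)]
    except KeyError:
        raise KeyError(f"no documented resolution for {first} + {second}") from None


dense_corporate = resolve("dense", "corporate")
airy_corporate = resolve("airy", "corporate")
```

An undocumented pair raises instead of falling back to a default, which keeps every outcome traceable to a named rule.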

The brain learns

This is the part of v3 that is genuinely new for the project: the engine now learns from your local decisions. The mechanism is small, deterministic, and entirely offline. There is no telemetry, no account, no cloud sync. Your install learns from your install. Different repos build different brains.

The ledger

Every /ux-recommend and /ux-synthesize call writes a single line to .ux/decisions.jsonl. The schema is locked at _v: 1:

{
  "_v": 1,
  "ts": "2026-05-28T14:22:09Z",
  "frame": { "industry": "fintech-payments", "ui_type": "dashboard", ... },
  "system": { "style_id": "editorial-calm-dark", "palette_id": "charcoal-amber", ... },
  "axes": { "warmth": 0.31, "contrast": 0.72, ... },
  "lint_score": 88,
  "user_accepted": true
}
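A minimal sketch of writing and reading such a ledger as JSONL, one JSON object per line. append_decision and read_decisions are hypothetical helpers. The entry uses only the fields shown above, and the elided fields are left out.

```python
import io
import json


def append_decision(stream, entry):
    """Write one decision as a single JSON line."""
    if entry.get("_v") != 1:
        raise ValueError("this sketch only handles schema _v: 1")
    stream.write(json.dumps(entry, sort_keys=True) + "\n")


def read_decisions(stream):
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory stream stands in for .ux/decisions.jsonl in this demo.
ledger = io.StringIO()
append_decision(ledger, {
    "_v": 1,
    "ts": "2026-05-28T14:22:09Z",
    "frame": {"industry": "fintech-payments", "ui_type": "dashboard"},
    "system": {"style_id": "editorial-calm-dark", "palette_id": "charcoal-amber"},
    "axes": {"warmth": 0.31, "contrast": 0.72},
    "lint_score": 88,
    "user_accepted": True,
})
ledger.seek(0)
entries = read_decisions(ledger)
```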

The re-rank

On the next call, the recommender groups past entries by (industry, ui_type). Each candidate it considers gets a +5 bump if it matches a prior in the same bucket. Only priors with lint_score >= 80 AND user_accepted = true count; we do not learn from rejected output. The bucket needs at least three qualifying priors before any re-rank kicks in. Below that, the engine runs cold-start and behaves identically to a fresh install.
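The rule above can be sketched as a filter followed by a bumped sort. Matching candidates to priors by style_id is an assumption, as are the function names and the candidate ids other than editorial-calm-dark. The thresholds (80, three priors, +5) are the ones stated in the text.

```python
import json
from collections import Counter

MIN_LINT = 80
MIN_PRIORS = 3
BUMP = 5


def qualifying_priors(ledger_lines, industry, ui_type):
    """Count accepted, well-scored style ids within one bucket."""
    wins = Counter()
    for line in ledger_lines:
        entry = json.loads(line)
        frame = entry["frame"]
        if (frame["industry"], frame["ui_type"]) != (industry, ui_type):
            continue
        if entry["lint_score"] >= MIN_LINT and entry["user_accepted"] is True:
            wins[entry["system"]["style_id"]] += 1
    return wins


def rerank(scored, wins):
    """scored is a list of (candidate_id, base_score) pairs."""
    cold_start = sum(wins.values()) < MIN_PRIORS

    def sort_key(item):
        candidate_id, score = item
        bump = BUMP if (not cold_start and candidate_id in wins) else 0
        return (-(score + bump), candidate_id)  # id breaks remaining ties

    return [candidate_id for candidate_id, _ in sorted(scored, key=sort_key)]


def _entry(style_id, lint_score, accepted):
    return json.dumps({
        "_v": 1,
        "frame": {"industry": "fintech-payments", "ui_type": "dashboard"},
        "system": {"style_id": style_id},
        "lint_score": lint_score,
        "user_accepted": accepted,
    })


ledger = [_entry("editorial-calm-dark", 88, True)] * 3 + [_entry("loud-neon", 95, False)]
scored = [("alpha-grid", 90), ("editorial-calm-dark", 87), ("loud-neon", 89)]

warm = rerank(scored, qualifying_priors(ledger, "fintech-payments", "dashboard"))
cold = rerank(scored, qualifying_priors(ledger[:2], "fintech-payments", "dashboard"))
```

With three qualifying priors the accepted style moves ahead of a slightly higher-scoring candidate. With only two, the bucket stays in cold-start and the base scores decide.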

The guarantees

The guarantees we make about this are narrow and worth saying out loud. Determinism is preserved. Same brief plus same ledger always produces the same output. The +5 bump is applied before any tie-break, and the alphabetical-on-brand-id rule still wins ties below it. Cold-start is safe. A new repo behaves like every other new repo. The learning is local. Your decisions never leave your machine.

You can inspect what your install has learned at any time:

$ uxskill stats --html
[OK] Wrote .ux/stats.html
[OK] Open in browser to see your install's learned priors

The HTML dashboard shows the count of decisions per bucket, the average lint score, the most-recurring palettes, and the axis distributions. It is the visible proof of self-learning: you can see your install's taste profile thicken over time as you keep shipping. None of this is on a server. It is in your repo, in plain JSONL plus a generated HTML view.

/ux-evolve auto-loop

The new command in v3 is /ux-evolve. Until now, the polish loop was manual: run lint, read the findings, fix them, re-lint, repeat. Evolve closes that loop. You hand it an artifact and a target score; it runs lint, applies six idempotent polish passes, re-lints, and either stops at the target, stops on plateau, or stops at the five-round cap.

The shape of a round:

Lint: 145 deterministic regex rules across A11y, content, quality, typography. Returns a score from 0 to 100 and a finding list.
Polish: six passes that fix the highest-severity findings first. Idempotent: re-running a pass on already-polished output is a no-op.
Re-lint: score the polished artifact.
Decide: if the new score ≥ 90, stop. If the delta from the last round is below 5 (a plateau), stop. If we are at round 5, stop. Otherwise loop.

The quality gate is firm at 65. If the final score is below 65, the engine refuses to ship the artifact unless --force is passed. The reasoning: a sub-65 output is recognizable AI slop, and we do not let our own engine emit slop quietly. The user can override the gate, but the override is explicit and shows up in the decisions ledger.
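The round logic and the gate can be sketched as a short loop. The lint and polish callables are supplied by the caller and their signatures are assumptions. evolve and can_ship are hypothetical names. The constants are the ones stated above.

```python
PLATEAU_DELTA = 5
MAX_ROUNDS = 5
QUALITY_GATE = 65


def evolve(artifact, lint, polish, target=90):
    """Run lint/polish rounds; return (artifact, score, rounds, reason).

    lint(artifact) returns a 0-100 score and polish(artifact) returns a
    new artifact. Both signatures are assumptions for this sketch.
    """
    score = lint(artifact)
    rounds = 0
    while score < target and rounds < MAX_ROUNDS:
        artifact = polish(artifact)
        new_score = lint(artifact)
        rounds += 1
        plateau = new_score - score < PLATEAU_DELTA
        score = new_score
        if plateau and score < target:
            return artifact, score, rounds, "plateau"
    reason = "target" if score >= target else "round_cap"
    return artifact, score, rounds, reason


def can_ship(score, force=False):
    """Refuse anything below the gate unless explicitly forced."""
    return score >= QUALITY_GATE or force


# Demo: the artifact is just an index into a scripted score trace.
trace = [72, 81, 86, 91]
result = evolve(0, lint=lambda i: trace[i], polish=lambda i: i + 1)
```

The scripted trace reaches the target on the third round, with no round gaining fewer than 5 points, so the plateau rule never fires.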

A concrete run: a fintech dashboard scaffold scored 72 on the first lint, mostly missing focus states, a couple of low-contrast labels, an Inter-at-display-size finding. Round 1 polished focus states up to 81. Round 2 fixed contrast to 86. Round 3 swapped Inter for the brief's actual display face and rebalanced caption sizing, reaching 91. Target reached, loop terminated. Three rounds, no rejections, shipped.

What we did not build (and why)

The v2.1 maximalist spec proposed several things that did not make it into v3. Saying which ones, and why, is the only honest way to talk about scope.

No LLM in the loop

The maximalist proposal had an LLM-judged subjective aesthetic axis: "ask a model whether this feels editorial." We rejected it on principle. The entire point of the engine is determinism: same brief, same output, no surprise tomorrow morning. An LLM in the synthesizer would make every call non-reproducible. v3 calls zero LLMs. The synthesizer is pure Python over JSON; the linter is regex; the evaluator is rule-based.

No multi-candidate genetic mutation

The proposal also had a genetic-algorithm step: emit N candidates per call, mutate them, score them, return the fittest. We tried it in a branch. The output got worse, not better, because mutation introduced noise the polish loop then had to undo. v3 ships single-artifact synthesis with a polish loop. One candidate. Six idempotent passes. Better results, smaller code surface.

No renamed commands

The proposal renamed /ux-recommend to /generate:ui and /ux-polish to /mutate:ui. We kept all 22 existing commands at their existing names and added /ux-evolve as the 23rd. We did not want a v2-to-v3 install to break anybody's muscle memory or their shell aliases.

No "burn the catalogue for infinite space"

The most maximalist version of the proposal said the catalogue was a liability: that the synthesizer could reach the entire design space on its own and the 1,243 entries should go. We kept every entry. The catalogue is what teaches the synthesizer the shape of the space. Without exemplars the axes have no anchors, and the output gets unmoored. v3 reads the catalogue as vocabulary; v3 needs the vocabulary.

These are taste calls. People with different priors will disagree, and that is fine. We owe it to the project to say what we did not do, and why, on the record.

Numbers at v3

Metric	Count
Commands	23
MCP tools	18
Entries	1,243
Anti-patterns	145
Brand specs	160
IDEs	17
Locales	17
Tests pass	223

Twenty-two slash commands kept at their existing names, plus /ux-evolve as the 23rd. Fifteen MCP tools kept, plus three new ones (ux_synthesize, ux_decisions_query, ux_decisions_stats) for agents that want to drive the synthesizer or query what the brain has learned. The catalogue and the linter are untouched. The 17-IDE installer is untouched. The 17 localized homepages and READMEs are unchanged. v3 is additive at the surface and rewritten at the core.

v2 was a recommender that picked from a catalogue. v3 is a compiler that distills from one. The same data is doing the opposite job.

Install in 60 seconds

v3 ships through the same three paths as v2: pip, npm, and the IDE plugin marketplaces. The init step auto-detects your IDE and writes the right config file in the right place. The synthesize step takes a brief or an industry + tone pair and returns a complete design system to .ux/last-system.json in the current repo.

$ pip install uxskill
$ uxskill init                     # auto-detects your IDE
$ uxskill synthesize --industry fintech-payments --tone bold

[OK] Mode: pure_synthesis
[OK] Axes: warmth=0.31, contrast=0.74, density=0.58, geometry=0.42, ...
[OK] Wrote .ux/last-system.json (palette, type, spacing, radius, motion)
[OK] Wrote .ux/decisions.jsonl (1 entry, schema _v: 1)

Same install path on every supported IDE: Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini CLI, Codex, Kiro, Cline, Continue, Aider, Zed, JetBrains AI, Pieces, Tabby, Tabnine, CodeWhisperer, Roo Cline. The MCP stdio server is the single source of truth for all of them.

Honest scope

v3 is a foundation, not a finished destination.

The 7-axis synthesizer is the work we have published evidence for. Axes 8 and 9 (saturation_strategy, surface_depth) are sketched but not shipped; they would deepen the dark / light split and the gradient strategy. The decisions ledger is consumed only by the recommender re-rank today; we expect the evaluator and the lint scorer to start reading it in v3.1. The maximalist v2.1 proposal we rejected is not gone; some of those ideas will land later, when we can prove they help.

If you find a brief that produces unsatisfying output, or a conflict resolution in the matrix that disagrees with your taste, file an issue. The matrix is checked in, the tests are checked in, the brain is in your repo. The arguments now happen on a substrate you can read.

What changes for you

If you were on v2: nothing breaks. The 22 commands you already use behave the same way. The lint runs at the same speed. The MCP server exposes all 15 existing tools at their old names, plus three new ones, for 18 in total. If you upgrade and never call /ux-synthesize or /ux-evolve, your daily workflow does not move at all.

If you were waiting for the engine to generate rather than recommend: that is what v3 is. The synthesizer is the new center; /ux-recommend still works and now re-ranks from your ledger; /ux-evolve closes the polish loop. Brand specs are training data now, not templates. The same 1,243 entries are doing a completely different job.

We named v3 The Brain because that is what got built: a constrained generative design compiler with a closed feedback loop. It is fully offline. It calls zero LLMs. Same brief, same axes, same output. Different briefs, different regions, infinite outputs. Your install learns from your repo. Different repos build different brains. The whole thing is MIT, sub-300ms on a normal machine, and pip-installable.

Run it on a brief you care about. See what it returns. File an issue if the output disagrees with your taste: the conflict goes into the matrix, the rule gets named, and the next install behaves better. That is how the brain compounds.