Offline generative-UI tool with Ollama streaming, sandboxed iframe preview, and a QLoRA fine-tuning pipeline

The generative UI workflow I wanted is simple: type a description of a component, watch it render immediately, refine the description, watch the update. The problem with cloud-based versions of this workflow is the interruption model. You type, you wait for a round-trip to a remote API, you see the result. The latency is short enough that you don't notice it consciously, but it's just long enough to break the feeling that you're sculpting output directly. You're placing orders, not iterating.
With a local model and streaming output, the experience is different in kind. Tokens arrive continuously from the moment generation starts. If the preview updates token-by-token, you watch the component build in real time — you can see it heading somewhere wrong and stop early, or see it nail the layout immediately and feel confirmation without waiting for completion. That feedback loop is worth more than the quality ceiling of a cloud model, at least for the structure-heavy early work of getting a component's layout right.
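As a rough sketch of what consuming that stream can look like: Ollama's generate endpoint returns newline-delimited JSON, one object per line, each carrying a text fragment in its response field and a done flag on the last object. The post does not say what language TinkerUI is written in, so Python is used here purely for illustration. The function names, the model tag, and the default localhost URL are assumptions, not details from the tool itself.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_fragments(ndjson_lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in ndjson_lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between objects
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object signals the end of generation


def stream_generate(
    prompt: str,
    model: str = "qwen2.5-coder:1.5b",  # assumed model tag
    url: str = "http://localhost:11434/api/generate",  # assumed default address
) -> Iterator[str]:
    """POST a prompt to a local Ollama server and yield fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The response object can be iterated line by line.
        yield from iter_fragments(resp)
```

Keeping the parsing separate from the network call means the fragment logic can be exercised with canned lines, without a running server.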
TinkerUI is the local version of that workflow. Split pane: chat on the left, live rendered preview on the right. No API key, no rate limits, no cost per generation, no data leaving the machine. I built the MVP in 84 minutes because the core idea is not complex. The complexity lives in one specific place: the streaming extractor.
Taking a token stream that contains HTML and updating a live preview without broken intermediate renders is harder than it sounds.
Tokens arrive in fragments. A model generating <div class="flex items-center"> does not emit that as a single token. It might emit <div, then class, then ="flex, then items-center, then ">. Or it splits differently depending on the tokenizer and the specific model weights. The naive approach — take every partial output and set it directly as the iframe's srcdoc — produces a preview that flickers between valid HTML, broken HTML, and empty states, depending on where the token boundary falls at each frame.
The extractor I built handles four distinct partial-output shapes: mid-tag (the < is there but the tag name is not complete), mid-attribute-name, mid-attribute-value, and complete-but-unclosed (valid opening tags with no closing counterpart yet). For each shape, the extractor decides whether to attempt a render with the current partial or buffer until the next token arrives. Complete tags always render. Partial tags that would produce a parse error are held in the buffer. Unclosed tags render because the browser's HTML parser handles them gracefully with implicit closing.
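The decision rule above can be sketched as a small scanner. This is not TinkerUI's actual extractor, only a simplified illustration in Python: the Shape names, classify, and renderable are hypothetical, the void-element list is partial, and the scan deliberately ignores comments, script bodies, and a bare < in text content.

```python
from enum import Enum
from typing import Tuple

VOID = {"br", "hr", "img", "input", "meta", "link"}  # partial list, for illustration


class Shape(Enum):
    CLOSED = "complete and closed"
    UNCLOSED = "complete-but-unclosed"
    MID_TAG = "mid-tag"
    MID_ATTR_NAME = "mid-attribute-name"  # also covers "between attributes"
    MID_ATTR_VALUE = "mid-attribute-value"


def classify(buffer: str) -> Tuple[Shape, int]:
    """Return (shape of the buffer's tail, length of the renderable prefix)."""
    i, n, depth = 0, len(buffer), 0
    while i < n:
        if buffer[i] != "<":
            i += 1
            continue
        start = i  # a tag opens here; everything before it is safe to render
        i += 1
        while i < n and not buffer[i].isspace() and buffer[i] != ">":
            i += 1
        if i == n:
            return Shape.MID_TAG, start
        name = buffer[start + 1:i].lower()
        shape, quote = Shape.MID_ATTR_NAME, None
        while i < n:
            ch = buffer[i]
            if quote:
                if ch == quote:  # closing quote ends the attribute value
                    quote, shape = None, Shape.MID_ATTR_NAME
            elif ch in "\"'":
                quote, shape = ch, Shape.MID_ATTR_VALUE
            elif ch == "=":
                shape = Shape.MID_ATTR_VALUE
            elif ch.isspace():
                shape = Shape.MID_ATTR_NAME
            elif ch == ">":
                break
            i += 1
        if i == n:
            return shape, start  # the tag never closed: buffer from `start` on
        self_closing = buffer[i - 1] == "/"
        if name.startswith("/"):
            depth -= 1
        elif name not in VOID and not self_closing and not name.startswith("!"):
            depth += 1
        i += 1  # step past ">"
    return (Shape.UNCLOSED if depth > 0 else Shape.CLOSED), n


def renderable(buffer: str) -> str:
    """The part of the buffer that can be handed to the preview right now."""
    _, end = classify(buffer)
    return buffer[:end]
```

A driver loop would append each incoming fragment to the buffer, call renderable, and update the preview only when the returned prefix changes. Unclosed elements pass through untouched, relying on the browser's implicit closing as described above.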
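The srcdoc iframe with Tailwind CDN is the right container: it isolates untrusted generated HTML from the application DOM entirely, and loading Tailwind via CDN means any Tailwind class in the generated output works without configuration on my side. The sandbox attribute on the iframe restricts what the generated output can do. With no allow-scripts token it blocks script execution outright; a script-based Tailwind CDN build needs that token to run, and in that setup it is the omission of allow-same-origin that keeps generated scripts away from the application's DOM and storage.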
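For illustration, here is one way the preview element could be assembled as a string. The function name, the document skeleton, and the CDN URL (Tailwind's script-based Play CDN) are assumptions; the post does not state which Tailwind build or which sandbox tokens TinkerUI uses, so the sandbox value is left as a parameter.

```python
import html

TAILWIND_CDN = "https://cdn.tailwindcss.com"  # assumed: Tailwind's Play CDN script


def preview_iframe(fragment: str, sandbox: str = "") -> str:
    """Build an <iframe> element whose srcdoc holds the generated fragment.

    An empty sandbox value applies every restriction, which also stops a
    script-based Tailwind build from running. Passing "allow-scripts"
    (without "allow-same-origin") lets that script run while the frame
    stays in an opaque origin, cut off from the parent page.
    """
    document = (
        "<!doctype html><html><head>"
        f'<script src="{TAILWIND_CDN}"></script>'
        f"</head><body>{fragment}</body></html>"
    )
    # srcdoc is an attribute value, so quotes, ampersands, and angle
    # brackets are escaped before embedding.
    srcdoc = html.escape(document, quote=True)
    return f'<iframe sandbox="{sandbox}" srcdoc="{srcdoc}"></iframe>'
```

Escaping the whole document once keeps generated quotes from terminating the attribute early, which matters because the fragment is untrusted model output.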
The base Ollama model, Qwen 2.5 Coder 1.5B, generates functional HTML, but its defaults aren't tuned to any particular component style or design vocabulary. The output is correct but generic. The QLoRA pipeline addresses that.
I assembled a 340-example dataset of component prompts and their corresponding Tailwind HTML outputs, ran fuzzy deduplication to remove near-duplicates, and fine-tuned Qwen 2.5 Coder 1.5B using Unsloth, PEFT, and TRL. Training ran on Apple Silicon in float32 precision for MPS compatibility, and the resulting model was exported to GGUF format for Ollama.
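The fuzzy deduplication step could look something like the sketch below. The post does not describe the method actually used, so this is an assumption throughout: it uses Python's standard-library difflib, hypothetical prompt and completion field names, and an arbitrary 0.9 similarity threshold. Pairwise comparison is quadratic, which is fine at a few hundred examples.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(text.lower().split())


def dedupe(
    examples: List[Dict[str, str]], threshold: float = 0.9
) -> List[Dict[str, str]]:
    """Keep the first of any group of examples whose text is nearly identical."""
    kept: List[Dict[str, str]] = []
    kept_keys: List[str] = []
    for ex in examples:
        key = normalize(ex["prompt"] + " " + ex["completion"])
        is_duplicate = any(
            # autojunk=False turns off a heuristic that can skew ratios
            # on longer strings such as full HTML snippets.
            SequenceMatcher(None, key, seen, autojunk=False).ratio() >= threshold
            for seen in kept_keys
        )
        if not is_duplicate:
            kept.append(ex)
            kept_keys.append(key)
    return kept
```

Comparing prompt and completion together means two examples only collapse when both the request and the generated markup are close, which avoids dropping distinct components that happen to share a prompt phrasing.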
What this enables: you can train the model on your own component examples, in your own design system vocabulary, with your own layout patterns. A model fine-tuned on 340 examples of how you specifically structure cards and navigation and forms will produce output that fits your work much more tightly than the generic base model. The pipeline is reusable — swap in a different dataset, run the training loop, export a new GGUF, reload Ollama.
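The last step of that loop, getting a freshly exported GGUF back into Ollama, can be sketched as follows. A Modelfile whose FROM line points at a local GGUF file, registered with ollama create, is standard Ollama usage; the file path, model name, and helper functions here are hypothetical, and a real Modelfile may also need a prompt template or parameters depending on the model.

```python
import subprocess
from pathlib import Path
from typing import List


def modelfile_text(gguf_path: str) -> str:
    """Minimal Modelfile contents: load weights from a local GGUF file."""
    return f"FROM {gguf_path}\n"


def create_command(name: str, modelfile: Path) -> List[str]:
    """The argv that registers the model under `name` with Ollama."""
    return ["ollama", "create", name, "-f", str(modelfile)]


def reload_model(name: str, gguf_path: str, workdir: str = ".") -> None:
    """Write the Modelfile and register the model (requires the ollama CLI)."""
    modelfile = Path(workdir) / "Modelfile"
    modelfile.write_text(modelfile_text(gguf_path), encoding="utf-8")
    subprocess.run(create_command(name, modelfile), check=True)
```

A call such as reload_model("tinkerui-custom", "./tinkerui-custom.gguf") would then make the new weights selectable by name, with both arguments being placeholder values.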
I want to be clear about the framing. 84 minutes is the MVP sprint: the streaming chat interface, the extractor, the sandboxed iframe, all running end-to-end. The fine-tuning pipeline is not an 84-minute project; it took additional evenings to assemble the dataset, validate the training loop on Apple Silicon, and confirm the GGUF export loaded correctly in Ollama. The 84-minute headline is honest for the core application. The pipeline is the part that gives the tool a higher ceiling than a sprint-built prototype usually has.