Vucense

WebGPU Tutorial 2026: Run LLMs in the Browser with Transformers.js

🟡Intermediate

Accelerate browser AI with WebGPU and Transformers.js 3. Covers GPU compute in the browser, on-device LLM inference, performance vs WebAssembly, and sovereign browser AI with zero cloud APIs.


Key Takeaways

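  • WebGPU is production-ready in 2026: Chrome 113+, Firefox 130+, and Safari 17.4+ all implement it, although on some versions and platforms it sits behind a flag. Feature-detect with navigator.gpu !== undefined rather than relying on version numbers.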
  • Transformers.js 3 is the easiest path: Hugging Face’s library handles model download, the ONNX runtime, and WebGPU acceleration in a few lines of JavaScript.
  • Models are cached in the browser: Transformers.js stores downloaded files in Cache Storage (the Cache API). The first load downloads the model; subsequent sessions reuse the cache at zero repeat bandwidth cost.
  • Small models only for browsers: 1–3B models work well. 7B+ models may exceed device VRAM or take too long to download.

Introduction

Direct Answer: How do I run an LLM in the browser with WebGPU and Transformers.js in 2026?

Import Transformers.js: import { pipeline, TextStreamer } from "@huggingface/transformers". Create a text-generation pipeline: const pipe = await pipeline("text-generation", "Qwen/Qwen3-1.7B-ONNX", { device: "webgpu" }). Generate text: const result = await pipe("Hello, world!", { max_new_tokens: 100 }). The model (~1.2GB) downloads from Hugging Face on first load and is stored in the browser's Cache Storage, so subsequent page loads skip the download. For streaming output, pass { streamer: new TextStreamer(pipe.tokenizer, { callback_function: (text) => console.log(text) }) } as a generation option. Check WebGPU support first: if (!navigator.gpu) { /* fall back to WASM */ }.


Part 1: Check WebGPU Support

// check-webgpu.js — run in browser console
async function checkWebGPU() {
    if (!navigator.gpu) {
        console.log("WebGPU not supported — try Chrome 113+ or Firefox 130+")
        return null
    }

    const adapter = await navigator.gpu.requestAdapter()
    if (!adapter) {
        console.log("No GPU adapter found — may need to enable WebGPU flag")
        return null
    }

    const device = await adapter.requestDevice()
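    // adapter.info replaced the deprecated requestAdapterInfo() method;
    // keep the old call only as a fallback for older browser builds
    const info = adapter.info ?? (await adapter.requestAdapterInfo?.()) ?? {}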

    console.log("WebGPU available!")
    console.log("GPU:", info.vendor, info.architecture)
    console.log("Limits:", {
        maxBufferSize: device.limits.maxBufferSize,
        maxComputeWorkgroupsPerDimension: device.limits.maxComputeWorkgroupsPerDimension
    })

    return device
}

const device = await checkWebGPU()

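Expected output (Chrome + RTX 4090; the exact vendor and architecture strings vary by browser and driver). Because requestDevice() is called without requiredLimits, device.limits reports the WebGPU default limits rather than the hardware maximum:

WebGPU available!
GPU: NVIDIA Corporation Ada Lovelace
Limits: { maxBufferSize: 268435456, maxComputeWorkgroupsPerDimension: 65535 }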

Part 2: Basic LLM Inference with Transformers.js

<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sovereign Browser AI</title>
</head>
<body>
    <div id="status">Loading model...</div>
    <textarea id="prompt" rows="4" cols="60">Explain what WebGPU is in 2 sentences.</textarea>
    <button id="generate" disabled>Generate</button>
    <pre id="output"></pre>

    <script type="module">
        import { pipeline, TextStreamer } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3/dist/transformers.min.js"

        const statusEl = document.getElementById("status")
        const outputEl = document.getElementById("output")
        const generateBtn = document.getElementById("generate")
        const promptEl = document.getElementById("prompt")

        // Load model: downloads once, then is served from the browser's Cache Storage
        statusEl.textContent = "Downloading model (first run only)..."

        const pipe = await pipeline(
            "text-generation",
            "Qwen/Qwen3-1.7B-ONNX",   // 1.2GB, fits in most device VRAM
            {
                device: "webgpu",        // GPU acceleration
                dtype: "q4",             // 4-bit quantisation for smaller footprint
                progress_callback: (progress) => {
                    if (progress.status === "progress") {
                        const pct = Math.round(progress.loaded / progress.total * 100)
                        statusEl.textContent = `Downloading: ${pct}% (${(progress.loaded/1e6).toFixed(0)}MB)`
                    }
                }
            }
        )

        statusEl.textContent = "Model ready — running on your GPU"
        generateBtn.disabled = false

        generateBtn.addEventListener("click", async () => {
            const prompt = promptEl.value.trim()
            if (!prompt) return

            outputEl.textContent = ""
            generateBtn.disabled = true

            // Stream text chunks as they are generated
            let chunkCount = 0
            const streamer = new TextStreamer(pipe.tokenizer, {
                skip_prompt: true,
                callback_function: (text) => {
                    chunkCount++
                    outputEl.textContent += text
                }
            })

            const startTime = performance.now()
            await pipe(prompt, {
                max_new_tokens: 200,
                do_sample: true,     // temperature only applies when sampling is enabled
                temperature: 0.7,
                streamer
            })
            const elapsed = (performance.now() - startTime) / 1000
            // Each callback delivers roughly one token, so this rate is an approximation
            statusEl.textContent = `Done: ~${Math.round(chunkCount / elapsed)} tok/s`

            generateBtn.disabled = false
        })
    </script>
</body>
</html>

Serve locally: npx serve . then open http://localhost:3000.


Part 3: React Component with WebGPU

// BrowserAI.tsx — React component for in-browser LLM
import { useState, useRef, useCallback } from "react"
import { pipeline, TextStreamer } from "@huggingface/transformers"

type PipelineInstance = Awaited<ReturnType<typeof pipeline>>

export function BrowserAI() {
    const [status, setStatus] = useState<"idle" | "loading" | "ready" | "generating">("idle")
    const [output, setOutput] = useState("")
    const [tokensPerSec, setTokensPerSec] = useState<number | null>(null)
    const pipeRef = useRef<PipelineInstance | null>(null)

    const loadModel = useCallback(async () => {
        setStatus("loading")
        try {
            pipeRef.current = await pipeline(
                "text-generation",
                "Qwen/Qwen3-1.7B-ONNX",
                {
                    device: "webgpu",
                    dtype: "q4",
                    progress_callback: (p: { status: string; loaded?: number; total?: number }) => {
                        // Status is already "loading"; only the progress text changes here
                        if (p.status === "progress" && p.loaded && p.total) {
                            const pct = Math.round(p.loaded / p.total * 100)
                            setOutput(`Downloading model: ${pct}%`)
                        }
                    }
                }
            )
            setStatus("ready")
            setOutput("")
        } catch (err) {
            setStatus("idle")
            setOutput(`Error: ${err instanceof Error ? err.message : "WebGPU not supported"}`)
        }
    }, [])

    const generate = useCallback(async (prompt: string) => {
        if (!pipeRef.current || status !== "ready") return
        setStatus("generating")
        setOutput("")

        const startTime = performance.now()
        let tokenCount = 0

        const streamer = new TextStreamer(pipeRef.current.tokenizer, {
            skip_prompt: true,
            callback_function: (token: string) => {
                tokenCount++
                setOutput(prev => prev + token)
            }
        })

        // do_sample must be true for temperature to have any effect
        await pipeRef.current(prompt, { max_new_tokens: 300, do_sample: true, temperature: 0.7, streamer })

        const elapsed = (performance.now() - startTime) / 1000
        setTokensPerSec(Math.round(tokenCount / elapsed))
        setStatus("ready")
    }, [status])

    return (
        <div style={{ padding: 20, maxWidth: 800 }}>
            <h2>Sovereign Browser AI</h2>
            <p>Model runs on <strong>your GPU</strong> — zero API calls, zero data leakage.</p>

            {status === "idle" && (
                <button onClick={loadModel}>Load AI Model (1.2GB, cached after first load)</button>
            )}

            {status === "loading" && <p>Loading model... {output}</p>}

            {(status === "ready" || status === "generating") && (
                <>
                    <textarea
                        rows={4}
                        cols={60}
                        placeholder="Enter your prompt..."
                        id="prompt-input"
                    />
                    <br />
                    <button
                        disabled={status === "generating"}
                        onClick={() => {
                            const el = document.getElementById("prompt-input") as HTMLTextAreaElement
                            generate(el.value)
                        }}
                    >
                        {status === "generating" ? "Generating..." : "Generate"}
                    </button>
                    {tokensPerSec !== null && <span> ({tokensPerSec} tok/s)</span>}
                    <pre style={{ marginTop: 16, background: "#f5f5f5", padding: 12 }}>
                        {output}
                    </pre>
                </>
            )}
        </div>
    )
}

Part 4: WebGPU vs WebAssembly Benchmarks

Measured throughput of the two Transformers.js browser backends:

| Hardware | WebGPU tok/s | WASM tok/s | Speedup |
|---|---|---|---|
| RTX 4090 (Chrome 131) | 28 | 3.1 | 9.0× |
| Apple M3 Max (Safari 17.4) | 22 | 4.8 | 4.6× |
| RTX 3060 12GB | 14 | 2.9 | 4.8× |
| Apple M3 Pro 18GB | 16 | 4.2 | 3.8× |
| CPU only (no GPU) | N/A | 2.1 | N/A |

WebGPU is roughly 4–9× faster than WASM in these measurements. The WASM fallback exists for browsers without WebGPU support.


Part 5: Progressive Enhancement with WASM Fallback

// Detect WebGPU and fall back gracefully
import { pipeline } from "@huggingface/transformers"

async function createPipeline() {
    // navigator.gpu can exist even when no usable adapter is available,
    // so request an adapter before choosing the backend
    let adapter = null
    if (navigator.gpu) {
        try {
            adapter = await navigator.gpu.requestAdapter()
        } catch {
            adapter = null
        }
    }

    const device = adapter ? "webgpu" : "wasm"
    const dtype = "q4"  // both backends support 4-bit quantisation

    console.log(`Using: ${device} (${adapter ? "GPU accelerated" : "CPU fallback"})`)

    return await pipeline(
        "text-generation",
        "Qwen/Qwen3-1.7B-ONNX",
        { device, dtype }
    )
}

Supported Models (Transformers.js 3, May 2026)

| Model | Size (q4) | WebGPU tok/s (M3 Max) | Best for |
|---|---|---|---|
| Qwen/Qwen3-1.7B-ONNX | 1.2 GB | 22 | General chat, summarisation |
| Xenova/Phi-3-mini-4k-ONNX | 2.1 GB | 16 | Reasoning, code |
| Xenova/TinyLlama-1.1B-ONNX | 0.7 GB | 31 | Fast responses, low memory |
| HuggingFace/smollm2-1.7B | 1.2 GB | 24 | Instruction following |

Conclusion

Browser AI with WebGPU and Transformers.js is sovereign: models download once and cache locally, inference runs on the user’s GPU, and prompts never leave the device. For applications requiring privacy, offline capability, or zero per-query cost, WebGPU inference is the correct architecture.

See On-Device AI Inference 2026 for native server-side inference, and Deploy React Apps Without Vercel 2026 for hosting the React wrapper application.


People Also Ask

What browser and hardware do I need for WebGPU AI inference?

Browser: Chrome 113+, Firefox 130+, or Safari 17.4+. On some of these versions and platforms WebGPU sits behind a flag, so feature-detect at runtime rather than checking version numbers. Hardware: any GPU with WebGPU driver support, which includes NVIDIA (GeForce 10+), AMD (RX 5000+), Apple Silicon (M1+), and Intel Iris Xe. CPU-only devices can fall back to WebAssembly, provided the app implements the fallback shown in Part 5. For comfortable LLM inference (Qwen3 1.7B at 20+ tok/s): 8GB+ VRAM or 16GB+ Apple unified memory. For basic inference at slower speeds: any GPU with 4GB+ VRAM.

Are there privacy concerns with Transformers.js models loaded from HuggingFace?

The model files download from the Hugging Face CDN on first use (a network request). After that, the files are held in the browser's Cache Storage, so subsequent inference sessions can run fully offline. The inference itself (your prompts and outputs) never leaves the browser. If the initial download is a concern, self-host the ONNX model files on your own server and point Transformers.js at it, for example by setting env.remoteHost, or env.localModelPath with env.allowRemoteModels = false, before calling pipeline().


Part 4: Browser Storage and Model Caching

WebGPU browser LLMs are only practical if the model files are cached and reused. Transformers.js stores downloaded model files in the browser's Cache Storage (the Cache API).

4.1 Browser cache behaviour

When the model downloads, its files are persisted in the origin's Cache Storage. On subsequent loads, the same cached files are reused.

If the cache is cleared, the model downloads again. Otherwise, for sovereign browser apps, the first visit is the only time the user needs network bandwidth for the model.

4.2 Offline support

To make the app work offline after the first run, use a service worker and a PWA manifest.

{
  "name": "Sovereign Browser AI",
  "short_name": "BrowserAI",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#111111",
  "theme_color": "#0f172a",
  "icons": [{ "src": "/icon-192.png", "sizes": "192x192", "type": "image/png" }]
}

A service worker can cache the application shell; the model files are already held in Cache Storage by Transformers.js.
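A cache-first strategy for the app shell can be sketched as follows. This is a minimal sketch, not the article's implementation: SHELL_CACHE and cacheFirst are hypothetical names, and the cache and the fetch function are passed in as parameters so the logic can be exercised outside a service worker.

```javascript
// Hypothetical cache name for the app shell (HTML, JS, CSS); not from the article.
const SHELL_CACHE = "app-shell-v1";

// Cache-first lookup: serve from the cache when possible, otherwise fetch
// from the network and store a copy for next time.
// `cache` needs match(request) and put(request, response), like the Cache API.
// `fetchFn` is the network function (the global fetch in a service worker).
async function cacheFirst(request, cache, fetchFn) {
  const cached = await cache.match(request);
  if (cached) return cached;

  const response = await fetchFn(request);
  // Only store successful responses. A Response body can be read once,
  // so keep a clone in the cache and hand the original to the page.
  if (response && response.ok) {
    await cache.put(request, response.clone());
  }
  return response;
}

// Inside a service worker this would be wired up roughly as:
//   self.addEventListener("fetch", (event) => {
//     event.respondWith(
//       caches.open(SHELL_CACHE).then((cache) => cacheFirst(event.request, cache, fetch))
//     );
//   });
```

Cache-first suits static shell files that change only when you ship a new version; bump the cache name on each release so stale files are not served indefinitely.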

4.3 Storage limits and cleanup

Browser storage is finite. Monitor the size of the cached model and remove older or unused assets if necessary.

Use the Storage Manager API:

const quota = await navigator.storage.estimate()
console.log(`Used: ${quota.usage}, Quota: ${quota.quota}`)

If the cache grows too large, delete the cached model files; they are downloaded again the next time the application needs them.
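Before starting a download, it can help to check whether the model will fit. The sketch below works on the object returned by navigator.storage.estimate(); the function names and the 10% headroom are assumptions chosen for illustration.

```javascript
// Decide whether a model of `modelBytes` fits in the remaining origin quota.
// `estimate` has the shape returned by navigator.storage.estimate():
// { usage: number, quota: number }, both in bytes.
// `headroom` is an assumed safety margin: the fraction of quota kept free.
function canStoreModel(estimate, modelBytes, headroom = 0.1) {
  const usage = estimate.usage ?? 0;
  const quota = estimate.quota ?? 0;
  const usable = quota * (1 - headroom) - usage;
  return modelBytes <= usable;
}

// Human-readable size for status messages (decimal units).
function formatBytes(bytes) {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let i = 0;
  while (value >= 1000 && i < units.length - 1) {
    value /= 1000;
    i++;
  }
  return `${value.toFixed(1)} ${units[i]}`;
}

// In the browser:
//   const estimate = await navigator.storage.estimate()
//   if (!canStoreModel(estimate, 1.2e9)) { /* offer a smaller model */ }
```

Browsers report quota as an estimate and may evict data under storage pressure, so treat a positive answer as a hint rather than a guarantee.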

Part 5: WebGPU Performance Tuning

WebGPU performance depends on GPU memory, compute workgroups, and tensor layout.

5.1 Use quantised models where possible

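Quantised ONNX models reduce memory use and improve speed. Look for q4 or q8 variants; Transformers.js also accepts dtype values such as fp16 and q4f16 when the model repository ships those files.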

const pipe = await pipeline("text-generation", "Qwen/Qwen3-1.7B-ONNX", {
    device: "webgpu",
    dtype: "q4"
})

5.2 Batch size and prompt length

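Keep prompt lengths moderate for browser inference. Long prompts increase prompt-processing time and GPU memory usage, because the attention cache grows with every token.

If you need longer context, split the input into chunks, summarize each chunk, and carry the running summary forward instead of the full earlier text.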
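One way to prepare long input for chunk-by-chunk summarization is sketched below. It assumes roughly 4 characters per token for English text, which is a heuristic and not the model tokenizer's real count; both function names are hypothetical.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This is a heuristic assumption, not the tokenizer's actual count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Split text into chunks whose estimated token count stays within `maxTokens`,
// breaking on paragraph boundaries so each chunk stays readable.
// A single paragraph larger than the budget becomes its own oversized chunk.
function chunkByTokenBudget(text, maxTokens) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // budget exceeded: close the current chunk
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be passed to the pipeline with a "summarize this" prompt, with the previous summary prepended to the next chunk.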

5.3 Workload throttling and UI

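Avoid blocking the main thread: await the pipeline call, update the UI from the streaming callback, and consider running the model in a Web Worker so that tokenization and decoding do not stall rendering. Show progress to the user during generation.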

A simple streaming callback keeps the UI responsive and provides real-time feedback.

Part 6: Fallback Strategies

Not every user has a WebGPU-capable browser. Implement graceful fallback to WASM or a remote inference service as a last resort.

6.1 WebGPU feature detection

const supportsWebGPU = !!navigator.gpu

6.2 WASM fallback

If WebGPU is unavailable, load the WASM runtime instead:

const device = supportsWebGPU ? "webgpu" : "wasm"

WASM is slower but more widely compatible. Keep it as a fallback to preserve functionality.

6.3 Progressive enhancement

The best pattern is progressive enhancement: use WebGPU when available, and degrade gracefully otherwise. Do not require WebGPU for essential tasks.

Part 7: Security and Privacy

A sovereign browser AI app must keep data local and minimize leakage.

7.1 Keep prompts in memory only

Do not persist prompt text or model output to a shared location unless the user explicitly saves it.

7.2 Avoid third-party telemetry

Use local static assets or trusted CDNs, and avoid libraries that phone home. For maximum sovereignty, bundle the runtime and model loader with the app.

7.3 Use same-origin policies

Deploy the app from a trusted origin and avoid embedding cross-origin scripts that can access the model or prompt data.

Part 8: Cross-Browser Compatibility

WebGPU support varies across browsers and platforms.

8.1 Safari and Apple silicon

Safari on Apple silicon supports WebGPU in recent versions, but the implementation may require slightly different precision handling. Test on both Chrome and Safari.

8.2 Multi-GPU and device selection

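On machines with more than one GPU, the adapter the browser picks can affect performance. WebGPU does not let a page enumerate GPUs, but requestAdapter() accepts a powerPreference hint ("high-performance" or "low-power"), which you can expose as a user setting.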
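A small helper for adapter selection might look like the sketch below. The powerPreference option is part of the WebGPU API, but it is only a hint that browsers may ignore; the helper name pickAdapter is hypothetical, and the gpu object is a parameter so the function can be tested with a stub.

```javascript
// Request a GPU adapter, preferring the high-performance (usually discrete) GPU
// and falling back to the browser's default choice.
// `gpu` is navigator.gpu in a browser. Returns null when no adapter is available.
async function pickAdapter(gpu, preference = "high-performance") {
  if (!gpu) return null;
  try {
    const preferred = await gpu.requestAdapter({ powerPreference: preference });
    if (preferred) return preferred;
    // No adapter for the hinted preference: let the browser decide.
    return await gpu.requestAdapter();
  } catch {
    // requestAdapter can reject in unusual environments; treat as unsupported.
    return null;
  }
}

// In the browser:
//   const adapter = await pickAdapter(navigator.gpu)
//   const backend = adapter ? "webgpu" : "wasm"
```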

8.3 GPU driver limitations

Some older drivers or GPU vendors may support WebGPU in a restricted mode. Always test on the minimum supported hardware for your target audience.

Part 9: Practical Browser AI Use Cases

Browser LLMs are ideal for offline note-taking, local document summarization, client-side search, and code assistance.

9.1 Personal knowledge base

Use a browser app to query local or preloaded documents without sending anything to the cloud.

9.2 Secure chat and journaling

Run the entire chat interface in-browser. The user’s prompts, the model, and the output stay on the local device.

9.3 Demo and proof-of-concept apps

Use WebGPU apps for demos where the audience wants to see a model run locally, not from an external API.

Part 10: Production Checklist

  • verify WebGPU support on target browsers
  • confirm model caching works (Transformers.js uses the browser's Cache Storage)
  • implement WASM fallback for unsupported browsers
  • keep remote network calls optional or disabled
  • use secure origins and CSP headers
  • quantify storage usage and cache limits
  • test on real devices, not only emulators
  • document recovery steps if the browser cache is cleared

A sovereign browser AI app is only sovereign when it can run without external cloud dependencies and when the user controls the model and data locally.

Part 11: Converting Models to ONNX and Quantisation

WebGPU browser runtimes rely on optimized model files. Convert models to ONNX and quantise them for realistic browser performance.

11.1 ONNX conversion workflow

Use a trusted tooling pipeline to convert a local model to ONNX. For example, with Hugging Face Optimum (check that your installed version supports the model architecture):

from optimum.exporters.onnx import main_export

# Export with key/value cache support, as needed for text generation
main_export("Qwen/Qwen3-1.7B", output="qwen3-1.7b-onnx", task="text-generation-with-past")

11.2 Quantisation for browser memory

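Use 4-bit quantisation to reduce model size and VRAM footprint. In the command below, quantize.py stands in for whichever quantisation script your toolchain provides: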

python quantize.py --model qwen3-1.7b-onnx --output qwen3-1.7b-q4.onnx

Quantised models can make the difference between an app that runs and one that fails on a 6GB GPU.

11.3 Verify model integrity

After conversion, run a few sample prompts and compare outputs with the original model. Save checksums for the ONNX files so you can verify them later.

Part 12: User Experience and Responsiveness

A browser LLM app should feel responsive, even if the first model load takes time.

12.1 Loading indicators and progress

Show download progress for the model and a clear message about initial setup.

12.2 Streaming output

Streaming tokens improves the perceived speed of the app. Use a callback that appends each token to the UI.

12.3 Memory warnings

If the GPU is low on memory, tell the user to close other tabs or use a smaller model. A proactive warning is better than a crash.

Part 13: Offline and Air-Gapped Deployment

For true sovereignty, support offline installation of the app.

13.1 Bundle the runtime assets

Package the Transformers.js runtime and the ONNX model in a local artifact repository. Deploy them with your app rather than fetching from a CDN.
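The bundling step above can be paired with a configuration helper like the sketch below. The property names (allowRemoteModels, allowLocalModels, localModelPath, backends.onnx.wasm.wasmPaths) are the settings Transformers.js 3 is understood to expose on its env object; verify them against the version you ship. The helper name and the default paths are hypothetical.

```javascript
// Point a Transformers.js-style `env` object at locally hosted assets.
// Property names are assumed to match Transformers.js 3; confirm before use.
function configureOffline(env, { modelPath = "/models/", wasmPath = "/wasm/" } = {}) {
  env.allowRemoteModels = false;  // never fetch from the Hugging Face Hub
  env.allowLocalModels = true;    // resolve model IDs under localModelPath
  env.localModelPath = modelPath; // e.g. /models/Qwen/Qwen3-1.7B-ONNX/...
  // ONNX Runtime's .wasm binaries are normally fetched from a CDN;
  // serve them from your own origin instead.
  if (env.backends?.onnx?.wasm) {
    env.backends.onnx.wasm.wasmPaths = wasmPath;
  }
  return env;
}

// In the app, before the first pipeline() call:
//   import { env } from "@huggingface/transformers"
//   configureOffline(env)
```

With remote models disabled, a missing local file fails loudly instead of silently reaching out to the network, which is the behaviour you want in an air-gapped deployment.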

13.2 Local browser hosting

Serve the application from a local web server or an internal network. Ensure the service worker and cached assets work without internet access.

Part 14: Governance and Documentation

Document the browser AI stack in a local MODEL_CARD.md.

Your documentation should include:

  • model source and version
  • conversion commands
  • browser compatibility matrix
  • supported devices and minimum VRAM
  • security controls for browser storage and caching

Part 15: Final Browser AI Readiness Checklist

  • WebGPU is feature-detected before usage
  • model caching works in the browser's Cache Storage
  • fallback to WASM is implemented
  • user prompts remain local
  • assets are served from a trusted origin
  • model conversion and quantisation workflow is documented
  • performance metrics are measured on real devices
  • offline deployment mode is available

Part 16: GPU and Browser Diagnostics

Monitoring browser GPU usage helps keep local inference reliable.

16.1 Browser debug tools

Open chrome://gpu in Chrome to inspect WebGPU availability and feature status. Look for warnings about unsupported features or memory limits.

16.2 Runtime memory diagnostics

If the app crashes, inspect the browser console for WebGPU errors such as Out of memory or device lost. These are more common on low-end GPUs.
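For a WebGPU device you created yourself (such as the one returned by checkWebGPU in Part 1), device loss can be observed through the device's lost promise. The sketch below is one hedged way to turn that into a user-facing message; watchDeviceLoss and the wording of the advice are assumptions.

```javascript
// Attach a handler to a WebGPU device's `lost` promise.
// GPUDevice.lost resolves with { reason, message } when the device becomes
// unusable, for example after the GPU process runs out of memory.
// `device` is a GPUDevice in the browser; any object with a `lost` promise works.
function watchDeviceLoss(device, onLost) {
  return device.lost.then((info) => {
    // reason "destroyed" means the app called device.destroy() itself.
    const intentional = info.reason === "destroyed";
    const advice = intentional
      ? null
      : "The GPU device was lost. Close other GPU-heavy tabs or switch to a smaller model, then reload.";
    onLost({ reason: info.reason, message: info.message, intentional, advice });
  });
}

// In the browser:
//   watchDeviceLoss(device, (report) => {
//     if (report.advice) statusEl.textContent = report.advice
//   })
```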

16.3 Performance profiling

Use the Performance panel in DevTools to capture script execution and frame timing. This tells you whether model loading or token generation is the bottleneck.

Part 17: Browser UX Patterns for Local AI

A good UX makes local AI feel polished.

17.1 Respect user intent

Show the user when the model is running and allow them to cancel long generations.
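The control flow behind a Cancel button can be sketched with an AbortController. This is a generic pattern rather than the Transformers.js API: nextToken is a hypothetical async function standing in for one decoding step, and with the real library you would hook cancellation into its stopping-criteria mechanism instead.

```javascript
// Generic cancellable generation loop.
// `nextToken(i)` is a hypothetical async step returning the next piece of
// text, or null when generation is finished.
// `signal` is an AbortSignal; `onToken` receives each piece for the UI.
async function generateWithCancel(nextToken, signal, onToken, maxTokens = 200) {
  let text = "";
  for (let i = 0; i < maxTokens; i++) {
    // Check between steps so a click on "Cancel" takes effect promptly.
    if (signal.aborted) return { text, cancelled: true };
    const token = await nextToken(i);
    if (token === null) break;
    text += token;
    onToken(token);
  }
  return { text, cancelled: false };
}

// In the UI:
//   const controller = new AbortController()
//   cancelButton.onclick = () => controller.abort()
//   const result = await generateWithCancel(step, controller.signal, appendToOutput)
```

Returning the partial text along with a cancelled flag lets the UI keep what was generated so far instead of discarding it.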

17.2 Manage expectations

Warn users when the first load will take several minutes and show progress so they know the app is alive.

17.3 Preserve local state

If the user opts in, cache recent prompts or results in IndexedDB so they can reload the page without losing context.

Part 18: Shipping Browser AI as a Secure App

A browser AI app should be packaged as a trusted asset.

18.1 Content Security Policy

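Serve the app with a strict CSP that permits only the assets you control. ONNX Runtime compiles WebAssembly, so script-src needs 'wasm-unsafe-eval', and connect-src 'self' assumes the model files are self-hosted.

Content-Security-Policy: default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; img-src 'self' data:; connect-src 'self';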

18.2 Subresource integrity

Use SRI on any CDN-hosted scripts if you must rely on external resources.

18.3 Local offline installation

If you deploy in an air-gapped environment, bundle all JS libraries and assets with the app so it does not depend on external networks.

Part 19: Model and Data Governance

Keep a local governance record for the browser model.

19.1 Model provenance

Record the exact model name, quantisation variant, and checksum used in the app.

19.2 Clearing cached data

If the app runs on a shared machine, make it easy to clear cached models and local data.

19.3 Maintenance schedule

Review and refresh models periodically, especially if the security or privacy requirements change.

Part 20: Final Browser AI Operational Checklist

  • model files are cached and reused locally
  • fallback to WASM is tested
  • browser compatibility is validated on target platforms
  • storage usage is monitored and bounded
  • UI shows progress and allows cancellation
  • service worker caches the app shell for offline use
  • the app ships with a strict CSP
  • model conversion and checksum workflows are documented
  • diagnostics are available for GPU errors
  • privacy controls are visible to the user

Part 21: Practical Browser AI Deployment Patterns

Deploy browser AI as a progressive web app or a static site.

21.1 Static hosting

Host the app on a local web server or internal CDN. The app shell and model loader can be served as static files.

21.2 Service worker caching

Use a service worker to keep the app shell available offline, while the browser's Cache Storage holds the heavy model artifacts.

21.3 Secure local hosting

Deploy the app over HTTPS only. WebGPU is exposed only in secure contexts (localhost counts as secure during development).

Part 22: Advanced Browser Security

Keep the local browser AI environment safe.

22.1 Permission requests

Ask for permission only when needed. Do not prompt the user for excessive browser permissions.

22.2 Content security policy

Enforce a CSP that blocks inline scripts and external origins. This reduces the risk of malicious code injection.

22.3 Clearing local caches

Provide a simple UI button to clear the model cache and application data. This gives users control over their local storage.
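The handler behind such a button could be as small as the sketch below. It assumes the model files live in a Cache Storage cache named "transformers-cache", which is understood to be the Transformers.js default; confirm the name in DevTools (Application > Cache storage) before relying on it. The function name is hypothetical.

```javascript
// Assumed default cache name used by Transformers.js; verify in DevTools.
const MODEL_CACHE_NAME = "transformers-cache";

// Delete the cache that holds downloaded model files.
// `cacheStorage` is the global `caches` object in a browser.
// Resolves to true if a cache was removed, false if there was nothing to clear.
async function clearModelCache(cacheStorage, name = MODEL_CACHE_NAME) {
  const existed = await cacheStorage.has(name);
  if (!existed) return false;
  return cacheStorage.delete(name);
}

// In the browser:
//   clearButton.onclick = async () => {
//     const cleared = await clearModelCache(caches)
//     statusEl.textContent = cleared ? "Model cache cleared" : "Nothing to clear"
//   }
```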

Part 23: Final Browser AI Maintenance Guidance

Local browser AI still needs maintenance.

23.1 Model refresh cadence

Review the model version and quantisation workflow periodically. If new browser-compatible models arrive, test them in a staging environment before deploying.

23.2 Performance reviews

Measure load time, inference throughput, and memory usage after each update. Keep a small benchmark suite for your target devices.
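A benchmark suite needs a consistent way to summarize runs. The sketch below reduces a list of timed generations to median, minimum, and maximum throughput; the function names and the shape of the run records are assumptions.

```javascript
// Tokens per second for one run; guards against a zero duration.
function tokensPerSecond(tokens, seconds) {
  return seconds > 0 ? tokens / seconds : 0;
}

// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Summarize runs of the form { tokens, seconds }.
// The median is less sensitive than the mean to a single slow warm-up run.
function summarizeRuns(runs) {
  const rates = runs.map((r) => tokensPerSecond(r.tokens, r.seconds));
  return {
    runs: rates.length,
    median: median(rates),
    min: Math.min(...rates),
    max: Math.max(...rates),
  };
}
```

Recording the same summary after each update makes regressions visible without needing to eyeball individual runs.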

23.3 Local documentation

Keep a local README that explains how to install, update, and troubleshoot the browser model app.

Part 24: Comparing Local Browser AI to Remote Inference

Browser AI is most valuable when it reduces external dependencies.

24.1 Latency and privacy benefits

Local browser inference avoids round-trip latency to a remote service, and user prompts never leave the device. This is powerful for privacy-sensitive workflows.

24.2 When remote inference is still needed

Use remote APIs only when the browser device lacks the GPU or memory required by the model. The ideal sovereign app supports both local and remote inference and chooses based on capability.

24.3 Hybrid local/remote fallback

A practical pattern is:

  • normal mode: local WebGPU inference
  • fallback mode: WASM on weaker devices
  • fallback remote: remote inference only when necessary and with user consent
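The three modes above can be expressed as a small decision function. This is a sketch under stated assumptions: the field names and the 4 GB threshold for WASM are illustrative, not values from the article.

```javascript
// Choose an inference mode following the pattern above:
// local WebGPU first, then local WASM, then remote only with user consent.
// Field names and the default memory threshold are illustrative assumptions.
function chooseInferenceMode({
  hasWebGPUAdapter,
  deviceMemoryGB,
  minWasmMemoryGB = 4,
  remoteConsent = false,
}) {
  if (hasWebGPUAdapter) return "webgpu";
  if (deviceMemoryGB >= minWasmMemoryGB) return "wasm";
  if (remoteConsent) return "remote";
  return "unavailable";
}
```

Keeping consent as an explicit input means the remote path can never be taken by accident: without it, a weak device gets an "unavailable" result that the UI can explain.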

Part 25: Advanced Caching and Model Selection

A browser AI app can choose the smallest model that satisfies the task.

25.1 Model tiers

Offer model tiers such as “small”, “medium”, and “large”. Load the appropriate model based on device memory and user preference.
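Tier selection can be sketched as a lookup over a small table. The model IDs and q4 sizes below come from the Supported Models table in this article; the tier names and memory thresholds are illustrative guesses. In Chromium, navigator.deviceMemory gives a coarse value capped at 8, which is why the thresholds stop there.

```javascript
// Model tiers, smallest first. IDs and sizes are from the article's table;
// the minMemoryGB thresholds are illustrative assumptions.
const MODEL_TIERS = [
  { tier: "small", model: "Xenova/TinyLlama-1.1B-ONNX", sizeGB: 0.7, minMemoryGB: 2 },
  { tier: "medium", model: "Qwen/Qwen3-1.7B-ONNX", sizeGB: 1.2, minMemoryGB: 4 },
  { tier: "large", model: "Xenova/Phi-3-mini-4k-ONNX", sizeGB: 2.1, minMemoryGB: 8 },
];

// Pick the user's preferred tier if the device can afford it,
// otherwise the largest tier that fits. Returns null if nothing fits.
function pickModelTier(deviceMemoryGB, preferredTier = null) {
  const affordable = MODEL_TIERS.filter((t) => deviceMemoryGB >= t.minMemoryGB);
  if (affordable.length === 0) return null;
  const preferred = affordable.find((t) => t.tier === preferredTier);
  return preferred ?? affordable[affordable.length - 1];
}

// In the browser (Chromium only; undefined elsewhere, so supply a default):
//   const choice = pickModelTier(navigator.deviceMemory ?? 4)
```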

25.2 Cache eviction policy

Implement a cache eviction policy for models and tokenizer artifacts. Use LRU semantics if storage gets tight.
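An LRU policy only needs a size and a last-used timestamp per cached model. The sketch below plans which entries to remove; the record shape and the function name are assumptions, and the actual deletion is left to whatever storage API holds the files.

```javascript
// Plan which cached models to evict, least recently used first, until the
// total size fits within `budgetBytes`.
// Each entry: { name, bytes, lastUsed }, where lastUsed is a timestamp in ms.
// Returns the names to evict, in eviction order.
function planLruEviction(entries, budgetBytes) {
  let total = entries.reduce((sum, e) => sum + e.bytes, 0);
  const oldestFirst = [...entries].sort((a, b) => a.lastUsed - b.lastUsed);
  const evict = [];
  for (const entry of oldestFirst) {
    if (total <= budgetBytes) break; // already within budget
    evict.push(entry.name);
    total -= entry.bytes;
  }
  return evict;
}
```

Separating the plan from the deletion keeps the policy easy to test and lets the UI show the user what will be removed before anything is deleted.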

25.3 Preloading for known workflows

If the user frequently uses a particular task, preload the model in the background when the page first loads.

Part 26: Accessibility and Inclusive UX

Local AI should be accessible to all users.

26.1 Keyboard navigation

Ensure the UI is fully keyboard-accessible and that focus states are clear.

26.2 Screen reader support

Use semantic HTML and ARIA roles for buttons, status messages, and generated output.

26.3 Performance considerations

On lower-end devices, provide a lightweight mode with fewer animations and a smaller model. This keeps the experience inclusive.

Part 27: Closing the Loop on Browser AI

WebGPU browser LLMs are a practical expression of local AI sovereignty. If you keep the model artifacts cached, the app offline-capable, and the user interface responsive, the result is a secure and compelling local intelligence experience. The key is to balance performance, privacy, and compatibility, and to document the exact browser and GPU requirements for your audience.

Tested on: Chrome 131 (Ubuntu 24.04, RTX 4090), Safari 17.4 (macOS Sequoia, M3 Max). Transformers.js 3.1.0. Last verified: May 1, 2026.

Anju Kushwaha

About the Author

Founder & Editorial Director

B-Tech Electronics & Communication Engineering | Founder of Vucense | Technical Operations & Editorial Strategy

Anju Kushwaha is the founder and editorial director of Vucense, driving the publication's mission to provide independent, expert analysis of sovereign technology and AI. With a background in electronics engineering and years of experience in tech strategy and operations, Anju curates Vucense's editorial calendar, collaborates with subject-matter experts to validate technical accuracy, and oversees quality standards across all content. Her role combines editorial leadership (ensuring author expertise matches topics, fact-checking and source verification, coordinating with specialist contributors) with strategic direction (choosing which emerging tech trends deserve in-depth coverage). Anju works directly with experts like Noah Choi (infrastructure), Elena Volkov (cryptography), and Siddharth Rao (AI policy) to ensure each article meets E-E-A-T standards and serves Vucense's readers with authoritative guidance. At Vucense, Anju also writes curated analysis pieces, trend summaries, and editorial perspectives on the state of sovereign tech infrastructure.
