mirror of https://github.com/tinygrad/tinygrad.git, synced 2026-01-09 15:08:02 -05:00
* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate app layer diff
* add .gitignore for generated files
* validate CPU/WEBGPU models in python
* prevent infinite generation if validation fails
* check if exported weight files are unique

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
183 lines
7.9 KiB
HTML
<!DOCTYPE html>

<head>
  <title>tinychat</title>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" href="../favicon.svg" type="image/svg+xml">

  <script defer src="../assets/cdn.jsdelivr.net/npm/@alpine-collective/toolkit@1.0.2/dist/cdn.min.js"></script>
  <script defer src="../assets/cdn.jsdelivr.net/npm/@alpinejs/intersect@3.x.x/dist/cdn.min.js"></script>
  <script defer src="../assets/cdn.jsdelivr.net/npm/@alpinejs/focus@3.x.x/dist/cdn.min.js"></script>
  <script defer src="../assets/unpkg.com/@marcreichel/alpine-autosize@1.3.x/dist/alpine-autosize.min.js"></script>
  <script defer src="../assets/unpkg.com/alpinejs@3.x.x/dist/cdn.min.js"></script>

  <script src="../assets/unpkg.com/dompurify@3.1.5/dist/purify.min.js"></script>
  <script src="../assets/unpkg.com/marked@13.0.0/marked.min.js"></script>
  <script src="../assets/unpkg.com/marked-highlight@2.1.2/lib/index.umd.js"></script>
  <script src="../assets/unpkg.com/@highlightjs/cdn-assets@11.9.0/highlight.min.js"></script>

  <script src="index.js"></script>

  <link rel="stylesheet" href="../assets/cdn.jsdelivr.net/npm/purecss@3.0.0/build/base-min.css">
  <link rel="stylesheet" href="../assets/cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.2/css/all.min.css"
    integrity="sha512-SnH5WK+bZxgPHs44uWIX+LLJAJ9/2PkPKZ5QiAj6Ta86w+fsb2TkcmfRyVX3pBnMFcV7oQPJkl9QevSCWr3W6A=="
    crossorigin="anonymous" referrerpolicy="no-referrer" />
  <link rel="stylesheet" href="../assets/unpkg.com/@highlightjs/cdn-assets@11.9.0/styles/vs2015.min.css">

  <link rel="stylesheet" href="index.css">
  <link rel="stylesheet" href="../common.css">
</head>

<body>
  <main x-data="state" x-init="console.log(endpoint)">
    <div class="home centered" x-show="home === 0" x-transition x-effect="
      $refs.inputForm.focus();
      if (home === 1) setTimeout(() => home = 2, 100);
      if (home === -1) setTimeout(() => home = 0, 100);
    " @popstate.window="
      if (home === 2) {
        cancelGeneration = true;
        if (maxContextReached) generating = false;
        if (!generating) cstate = { time: null, messages: [] };
        home = -1;
        time_till_first = 0;
        tokens_per_second = 0;
        total_tokens = 0;
      }
    ">
      <h1 class="title megrim-regular">tinychat</h1>
      <div class="histories-container-container">
        <template x-if="histories.length">
          <div class="histories-start"></div>
        </template>
        <div class="histories-container" x-intersect="
          $el.scrollTo({ top: 0, behavior: 'smooth' });
        ">
          <template x-for="_state in histories.toSorted((a, b) => b.time - a.time)">
            <div x-data="{ otx: 0, trigger: 75 }" class="history" @click="
              cstate = _state;
              updateTotalTokens(cstate.messages);
              home = 1;
              // ensure that going back in history will go back to home
              window.history.pushState({}, '', window.TINYCHAT_ROOT || '/');
            " @touchstart="
              otx = $event.changedTouches[0].clientX;
            " @touchmove="
              $el.style.setProperty('--tx', $event.changedTouches[0].clientX - otx);
              $el.style.setProperty('--opacity', 1 - (Math.abs($event.changedTouches[0].clientX - otx) / trigger));
            " @touchend="
              if (Math.abs($event.changedTouches[0].clientX - otx) > trigger) removeHistory(_state);
              $el.style.setProperty('--tx', 0);
              $el.style.setProperty('--opacity', 1);
            ">
              <h3 x-text="new Date(_state.time).toLocaleString()"></h3>
              <p x-text="$truncate(_state.messages[0].content, 80)"></p>
              <!-- delete button -->
              <button class="history-delete-button" @click.stop="removeHistory(_state);">
                <i class=" fas fa-trash"></i>
              </button>
            </div>
          </template>
        </div>
        <template x-if="histories.length">
          <div class="histories-end"></div>
        </template>
      </div>
    </div>
    <div x-ref="messages" class="messages" x-init="
      $watch('cstate', value => {
        $el.innerHTML = '';
        value.messages.forEach(({ role, content }) => {
          const div = document.createElement('div');
          div.className = `message message-role-${role}`;
          try {
            div.innerHTML = DOMPurify.sanitize(marked.parse(content));
          } catch (e) {
            console.log(content);
            console.error(e);
          }

          // add a clipboard button to all code blocks
          const codeBlocks = div.querySelectorAll('.hljs');
          codeBlocks.forEach(codeBlock => {
            const button = document.createElement('button');
            button.className = 'clipboard-button';
            button.innerHTML = '<i class=\'fas fa-clipboard\'></i>';
            button.onclick = () => {
              // navigator.clipboard.writeText(codeBlock.textContent);
              const range = document.createRange();
              range.setStartBefore(codeBlock);
              range.setEndAfter(codeBlock);
              window.getSelection()?.removeAllRanges();
              window.getSelection()?.addRange(range);
              document.execCommand('copy');
              window.getSelection()?.removeAllRanges();

              button.innerHTML = '<i class=\'fas fa-check\'></i>';
              setTimeout(() => button.innerHTML = '<i class=\'fas fa-clipboard\'></i>', 1000);
            };
            codeBlock.appendChild(button);
          });

          $el.appendChild(div);
        });

        $el.scrollTo({ top: $el.scrollHeight, behavior: 'smooth' });
      });
    " x-intersect="
      $el.scrollTo({ top: $el.scrollHeight, behavior: 'smooth' });
    " x-show="home === 2" x-transition>
    </div>
    <div class="input-container">
      <div class="input-performance">
        <span class="input-performance-point">
          <p class="monospace" x-text="(time_till_first / 1000).toFixed(2)"></p>
          <p class="megrim-regular">SEC TO FIRST TOKEN</p>
        </span>
        <span class="input-performance-point">
          <p class="monospace" x-text="tokens_per_second.toFixed(1)"></p>
          <p class="megrim-regular">TOKENS/SEC</p>
        </span>
        <span class="input-performance-point">
          <p class="monospace" x-text="total_tokens"></p>
          <p class="megrim-regular">TOKENS</p>
        </span>
      </div>
      <div class="loading-bar" x-show="loadingMessage !== ''">
        <p class="loading-text" id="loading-message">Loading:</p>
        <span id="progress-percentage">0%</span>
        <div class="progress-bar">
          <div class="progress"></div>
        </div>
      </div>
      <div class="input" x-show="loadingMessage === ''">
        <textarea x-ref="inputForm" id="input-form" class="input-form" autofocus rows=1 x-autosize
          :placeholder="generating ? placeholderText : 'Say something'" :disabled="generating" @input="
            home = (home === 0) ? 1 : home
            if (cstate.messages.length === 0 && $el.value === '') home = -1;

            if ($el.value !== '') {
              const messages = [...cstate.messages];
              messages.push({ role: 'user', content: $el.value });
              updateTotalTokens(messages);
            } else {
              if (cstate.messages.length === 0) total_tokens = 0;
              else updateTotalTokens(cstate.messages);
            }
          " x-effect="
            console.log(generating);
            if (!generating) $nextTick(() => {
              $el.focus();
              setTimeout(() => $refs.messages.scrollTo({ top: $refs.messages.scrollHeight, behavior: 'smooth' }), 100);
            });
          " @keydown.enter="await handleEnter($event)" @keydown.escape.window="$focus.focus($el)"></textarea>
        <button class="input-button" :disabled="generating" @click="await handleSend()">
          <i class="fas" :class="generating ? 'fa-spinner fa-spin' : 'fa-paper-plane'"></i>
        </button>
      </div>
    </div>
  </main>
</body>

</html>