Shpiel speaks the Hugging Face Hub API — read, write, and Xet — and stores models in the infrastructure your cluster already runs: OCI registries and filesystems today, object storage next. Every existing HF tool works unchanged.
```shell
# the whole integration:
export HF_ENDPOINT=https://shpiel.internal
hf download Qwen/Qwen3-0.6B  # LAN-speed, cached in your registry
```
hf CLI · huggingface_hub · vLLM · SGLang · TGI — no new tools, no new auth universe.
Why
Researchers live on the Hugging Face API: every training script ends in push_to_hub(), every inference engine starts with from_pretrained(). Clusters want weights as versioned, content-addressed, P2P-distributable artifacts — that's where cold-start time, egress cost, and reliability are actually won. Today the bridge between those planes is shell scripts, or a heavyweight self-hosted hub with its own database and auth universe.
Shpiel is the bridge as one boring binary: no database, one YAML file, identical on a laptop and in Kubernetes. On autoscaled GPU fleets it turns time-to-first-token from an internet-sized download into a LAN-speed pull — and, paired with Spegel, a peer-to-peer pull from neighboring nodes.
The pieces
Repo info, byte-range file resolution, tree listings, commits, preupload, git-LFS batch — matching the Hub's headers, error codes, and pagination quirks, because compatibility is only useful when it's exact.
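Byte-range file resolution is what lazy loaders depend on. As a minimal sketch, assuming the Hub's public URL layout `{endpoint}/{repo_id}/resolve/{revision}/{filename}`, this is the shape of request a client would make to read only the first bytes of a weights file. The helper names are hypothetical, the filename is illustrative, and the request is built but never sent.

```python
from urllib.parse import quote
from urllib.request import Request


def resolve_url(endpoint, repo_id, filename, revision="main"):
    """Build a Hub-style file URL (layout assumed from the public Hub)."""
    return (
        f"{endpoint.rstrip('/')}/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


def range_request(url, start, length):
    """Build (but do not send) a GET for `length` bytes starting at `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Illustrative values: the endpoint from the example above, and a filename
# that is assumed to exist in the repo.
url = resolve_url("https://shpiel.internal", "Qwen/Qwen3-0.6B", "model.safetensors")
req = range_request(url, 0, 8)
print(req.full_url)
print(req.get_header("Range"))
```

A server that answers such a request with `206 Partial Content` and the right `Content-Range` lets a client read a file's header without downloading the whole file.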
huggingface_hub 1.x uploads through the Xet protocol with no LFS fallback. Shpiel implements the CAS API (the first open-source server to do so), so stock clients push and pull at chunk level.
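To illustrate why chunk-level transfer matters, here is a toy sketch of content-defined chunking with content-addressed chunk IDs. It is not Shpiel's or hf_xet's code: the rolling hash, the chunk-size parameters, and the function names are all assumptions chosen to keep the example small.

```python
import hashlib
import random

# Fixed pseudo-random table for a toy rolling "gear" hash (illustrative values).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data, mask=0x3FF, min_size=64, max_size=4096):
    """Cut data wherever the rolling hash of recent bytes matches a bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def chunk_ids(data):
    """Content address of every chunk, in order."""
    return [hashlib.sha256(c).hexdigest() for c in split_chunks(data)]


src = random.Random(1)
original = bytes(src.getrandbits(8) for _ in range(64 * 1024))
edited = original[:32768] + b"a small edit" + original[32768:]

stored = set(chunk_ids(original))  # chunks the server already holds
to_upload = [c for c in chunk_ids(edited) if c not in stored]
print(f"{len(to_upload)} of {len(chunk_ids(edited))} chunks need uploading")
```

Because boundaries depend on content rather than on fixed offsets, an edit in the middle of a file leaves the chunks before it unchanged, and chunking usually falls back into step shortly after it.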
On a cache miss, Shpiel fetches from huggingface.co, persists to your backend, then serves. Request collapsing means a hundred nodes asking for the same model cost one upstream download.
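The request-collapsing idea can be sketched in a few lines. This is an illustration under assumptions, not Shpiel's implementation: `PullThroughCache` and `slow_upstream` are hypothetical names, the "backend" is a plain dict, and threads stand in for nodes.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class PullThroughCache:
    """Serve from a local store; on a miss, let one caller fetch for everyone."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical upstream download function
        self._lock = threading.Lock()
        self._store = {}         # stands in for the persistent backend
        self._inflight = {}      # key -> Future shared by concurrent callers

    def get(self, key):
        with self._lock:
            if key in self._store:
                return self._store[key]      # cache hit
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:                       # first caller does the fetch
                fut = Future()
                self._inflight[key] = fut
        if leader:
            try:
                data = self._fetch(key)
                with self._lock:
                    self._store[key] = data  # persist before releasing waiters
                fut.set_result(data)
            except Exception as exc:
                fut.set_exception(exc)       # waiters see the same failure
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
        return fut.result()                  # followers block here


calls = []

def slow_upstream(key):
    calls.append(key)
    time.sleep(0.05)  # simulate a slow internet download
    return f"bytes-of-{key}".encode()

cache = PullThroughCache(slow_upstream)
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(cache.get, ["Qwen/Qwen3-0.6B"] * 100))
print(len(results), "responses,", len(calls), "upstream fetch(es)")
```

Persisting the result before removing the in-flight entry is what keeps a late arrival from triggering a second download.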
Models land as OCI artifacts: one repository per model, one manifest per commit, one layer per file. The tar-layers format is a mountable image — Kubernetes image volumes and Spegel work out of the box.
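As a rough sketch of "one manifest per commit, one layer per file", the snippet below builds an OCI image manifest in which every file becomes its own single-entry tar layer. The manifest fields and annotation keys come from the OCI image spec. Which media types and annotations Shpiel actually writes is an assumption, and a mountable image would also need a proper image config rather than the empty one used here.

```python
import hashlib
import io
import json
import tarfile


def tar_layer(path, data):
    """Wrap one file in an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def descriptor(media_type, blob, annotations=None):
    """OCI content descriptor: media type, sha256 digest, size in bytes."""
    desc = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }
    if annotations:
        desc["annotations"] = annotations
    return desc


def manifest_for_commit(files, commit):
    """One manifest for one commit; one tar layer per file (layout assumed)."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        # Empty config keeps the sketch short; a real image config differs.
        "config": descriptor("application/vnd.oci.empty.v1+json", b"{}"),
        "layers": [
            descriptor(
                "application/vnd.oci.image.layer.v1.tar",
                tar_layer(path, data),
                {"org.opencontainers.image.title": path},
            )
            for path, data in sorted(files.items())
        ],
        "annotations": {"org.opencontainers.image.revision": commit},
    }


# Illustrative file contents and a placeholder commit id.
manifest = manifest_for_commit(
    {"config.json": b'{"model_type": "qwen3"}', "model.safetensors": b"\x00" * 16},
    commit="0123abcd",
)
print(json.dumps(manifest, indent=2))
```

Because each layer is addressed by digest, a file that does not change between two commits can be referenced by both manifests without being stored twice.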
The filesystem backend is byte-compatible with the huggingface_hub cache. Mount the volume, set HF_HUB_OFFLINE=1, and from_pretrained reads it directly.
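For orientation, the huggingface_hub cache keeps each repo under `models--<org>--<name>/` with `blobs/`, `refs/`, and `snapshots/<commit>/` subdirectories, where snapshot entries are symlinks to blobs. The sketch below writes that shape into a temporary directory. It is a simplification, not Shpiel's writer: the function names are hypothetical, the commit id is a placeholder, and blobs are named by the sha256 of their content for every file, whereas the real cache names them by the file's etag.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def repo_folder(repo_id, repo_type="model"):
    """Cache folder name for a repo, e.g. models--Qwen--Qwen3-0.6B."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def write_snapshot(cache_dir, repo_id, commit, files, ref="main"):
    """Lay out blobs/, refs/ and snapshots/<commit>/ for one revision."""
    root = Path(cache_dir) / repo_folder(repo_id)
    snapshot = root / "snapshots" / commit
    (root / "blobs").mkdir(parents=True, exist_ok=True)
    (root / "refs").mkdir(exist_ok=True)
    (root / "refs" / ref).write_text(commit)  # branch name -> commit id
    for name, data in files.items():
        # Simplification: sha256 of content as the blob name for every file.
        blob = root / "blobs" / hashlib.sha256(data).hexdigest()
        blob.write_bytes(data)
        link = snapshot / name
        link.parent.mkdir(parents=True, exist_ok=True)
        link.symlink_to(os.path.relpath(blob, link.parent))
    return snapshot


cache_dir = tempfile.mkdtemp()
snap = write_snapshot(
    cache_dir,
    "Qwen/Qwen3-0.6B",
    "0123abcd",  # placeholder; real commit ids are 40 hex characters
    {"config.json": b'{"model_type": "qwen3"}'},
)
print(snap)
print((snap / "config.json").read_bytes())
```

A client in offline mode resolves `refs/main` to a commit id and then opens files under `snapshots/<commit>/`, which is why a server that writes this layout can be read without any network call.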
Fan-out replication through a disk-spooled retry queue, Prometheus metrics with a ready-made Grafana dashboard, an append-only audit stream, an authenticated admin API, health probes.
Install
```shell
# container image
docker run -p 8080:8080 ghcr.io/loewenthal-corp/shpiel:latest \
  serve --local --listen-api :8080

# Helm chart
helm install shpiel oci://ghcr.io/loewenthal-corp/charts/shpiel

# from source
go install github.com/loewenthal-corp/shpiel/cmd/shpiel@latest

# binaries for linux/darwin on every release:
# github.com/loewenthal-corp/shpiel/releases
```
```shell
# laptop mode: fs store in ~/.shpiel, pull-through on
shpiel serve --local

export HF_ENDPOINT=http://127.0.0.1:8080
hf download Qwen/Qwen3-0.6B  # first pull caches it
hf download Qwen/Qwen3-0.6B  # second is served locally
hf upload my-org/my-model ./model
```
Beyond laptop mode, everything is one YAML file — config.example.yaml documents every knob. The Helm chart's config value is that file.
Compatibility
| Client | Reads | Writes |
|---|---|---|
| huggingface_hub / hf CLI 1.x | ✓ HTTP or chunk-level Xet | ✓ via Xet |
| huggingface_hub / hf CLI 0.x | ✓ | ✓ via git-LFS |
| vLLM, SGLang, TGI, from_pretrained | ✓ incl. Range / lazy loading | — |
| HF_HUB_OFFLINE=1 on a shared volume | ✓ fs backend | — |
Compatibility is enforced by an executable conformance suite that runs against every serving configuration, and by end-to-end tests that drive a real Python huggingface_hub/hf_xet client against the real binary. If it regresses, CI fails.