An HF‑compatible model relay.

Shpiel speaks the Hugging Face Hub API — read, write, and Xet — and stores models in the infrastructure your cluster already runs: OCI registries and filesystems today, object storage next. Every existing HF tool works unchanged.

# the whole integration:
export HF_ENDPOINT=https://shpiel.internal

hf download Qwen/Qwen3-0.6B   # LAN-speed, cached in your registry


hf CLI · huggingface_hub · vLLM · SGLang · TGI — no new tools, no new auth universe.

Why

The bridge as one boring binary.

Researchers live on the Hugging Face API: every training script ends in push_to_hub(), every inference engine starts with from_pretrained(). Clusters want weights as versioned, content-addressed, P2P-distributable artifacts — that's where cold-start time, egress cost, and reliability are actually won. Today the bridge between those planes is shell scripts, or a heavyweight self-hosted hub with its own database and auth universe.

Shpiel is the bridge as one boring binary: no database, one YAML file, identical on a laptop and in Kubernetes. On autoscaled GPU fleets it cuts cold-start time-to-first-token by replacing an internet-sized download with a LAN-speed pull and, paired with Spegel, a peer-to-peer pull from neighboring nodes.

The pieces

Exact HF front

Repo info, byte-range file resolution, tree listings, commits, preupload, git-LFS batch — matching the Hub's headers, error codes, and pagination quirks, because compatibility is only useful when it's exact.
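
As a rough illustration of byte-range file resolution from the client side, the sketch below builds (but does not send) a ranged GET against the Hub's `/{repo_id}/resolve/{revision}/{filename}` route. The endpoint is the `https://shpiel.internal` value from the example above; the helper name `ranged_resolve_request` is hypothetical and `config.json` is only an example file name.

```python
from urllib.parse import quote
from urllib.request import Request


def ranged_resolve_request(endpoint, repo_id, filename, start, end, revision="main"):
    """Build a GET for bytes [start, end] (inclusive) of one file in a model repo."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    url = "{}/{}/resolve/{}/{}".format(
        endpoint.rstrip("/"),
        quote(repo_id, safe="/"),    # keep the org/name slash
        quote(revision, safe=""),    # a branch name may itself contain a slash
        quote(filename, safe="/"),   # files may live in subfolders
    )
    # HTTP byte ranges are inclusive on both ends.
    return Request(url, headers={"Range": "bytes={}-{}".format(start, end)})


# Hypothetical usage: ask for the first KiB of a file.
req = ranged_resolve_request(
    "https://shpiel.internal", "Qwen/Qwen3-0.6B", "config.json", 0, 1023
)
print(req.full_url)
print(req.get_header("Range"))
```

A server that honors the range answers `206 Partial Content`; engines that load weights lazily rely on this to read only the tensors they need.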

Xet, server-side

huggingface_hub 1.x uploads through the Xet protocol with no LFS fallback. Shpiel implements the CAS API (the first open-source server that does), so stock clients push and pull at chunk level.

Pull-through caching

On a miss, fetch from huggingface.co, persist to your backend, serve. Request collapsing means a hundred nodes asking for the same model cost one upstream download.
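
Request collapsing is a general pattern, often called single-flight. The sketch below is a generic Python illustration of the idea, not Shpiel's implementation: the first caller for a key performs the fetch, and every caller that arrives while it is in flight waits for and shares that result. The names `SingleFlight`, `fetch_from_upstream`, and `node` are hypothetical.

```python
import threading
import time


class _Call:
    """State shared by everyone waiting on one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.waiters = 0
        self.value = None
        self.error = None


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = self._inflight[key] = _Call()
            else:
                call.waiters += 1
        if leader:
            try:
                call.value = fn()  # the only upstream fetch for this key
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value

    def waiting(self, key):
        """Number of callers currently parked behind the leader for this key."""
        with self._lock:
            call = self._inflight.get(key)
            return call.waiters if call else 0


MODEL = "Qwen/Qwen3-0.6B"
upstream_calls = 0
release = threading.Event()
results = []
results_lock = threading.Lock()
flight = SingleFlight()


def fetch_from_upstream():
    """Stand-in for a slow download; blocks until the demo releases it."""
    global upstream_calls
    upstream_calls += 1
    release.wait()
    return b"weights"


def node():
    data = flight.do(MODEL, fetch_from_upstream)
    with results_lock:
        results.append(data)


threads = [threading.Thread(target=node) for _ in range(100)]
for t in threads:
    t.start()
while flight.waiting(MODEL) < 99:  # wait until 99 nodes are queued behind the leader
    time.sleep(0.001)
release.set()
for t in threads:
    t.join()
print(upstream_calls, len(results))
```

In this sketch a failed fetch is re-raised to every waiter, and the key is cleared afterwards so a later request can retry.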

OCI backend

Models land as OCI artifacts: one repository per model, one manifest per commit, one layer per file. The tar-layers format is a mountable image — Kubernetes image volumes and Spegel work out of the box.
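
To make the "one manifest per commit, one layer per file" mapping concrete, here is a hedged sketch that assembles an OCI image manifest for a single commit. It only illustrates the shape and does not reproduce the tar-layers format: the layer media type, the use of the `org.opencontainers.image.title` and `org.opencontainers.image.revision` annotations, and the placeholder commit hash are all assumptions made for this example, not Shpiel's documented schema.

```python
import hashlib
import json

# Assumption: raw file bytes as layers. Shpiel's actual layer media types may differ.
LAYER_MEDIA_TYPE = "application/octet-stream"


def descriptor(media_type, blob, annotations=None):
    """Content-addressed OCI descriptor for one blob."""
    desc = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }
    if annotations:
        desc["annotations"] = annotations
    return desc


def manifest_for_commit(commit, files):
    """Build one manifest for one commit; `files` maps repo path -> bytes."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        # OCI 1.1 empty config: the two-byte JSON document "{}".
        "config": descriptor("application/vnd.oci.empty.v1+json", b"{}"),
        "layers": [
            descriptor(LAYER_MEDIA_TYPE, blob, {"org.opencontainers.image.title": path})
            for path, blob in sorted(files.items())
        ],
        "annotations": {"org.opencontainers.image.revision": commit},
    }


# Hypothetical commit hash and file contents, for illustration only.
commit = "a" * 40
files = {"config.json": b'{"hidden_size": 1024}', "model.safetensors": b"\x00" * 16}
manifest = manifest_for_commit(commit, files)
print(json.dumps(manifest, indent=2))
```

Because every layer is addressed by its digest, a file that is unchanged between two commits resolves to the same blob, so the registry stores it once.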

Filesystem backend

Byte-compatible with the huggingface_hub cache. Mount the volume, set HF_HUB_OFFLINE=1, and from_pretrained reads it directly.
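
For orientation, the huggingface_hub cache keeps each repo in a `models--{org}--{name}` folder containing `blobs/`, `refs/`, and `snapshots/{commit}/`. The sketch below computes where a file would sit on such a volume. The mount point `/models` and the commit hash are placeholders, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def repo_folder_name(repo_id, repo_type="model"):
    """Cache folder for a repo, e.g. models--Qwen--Qwen3-0.6B."""
    return "--".join([repo_type + "s", *repo_id.split("/")])


def snapshot_path(cache_dir, repo_id, commit, filename):
    """Path of one file as seen through a specific commit snapshot."""
    return PurePosixPath(cache_dir) / repo_folder_name(repo_id) / "snapshots" / commit / filename


def ref_path(cache_dir, repo_id, ref="main"):
    """File whose content is the commit hash that a branch or tag points to."""
    return PurePosixPath(cache_dir) / repo_folder_name(repo_id) / "refs" / ref


COMMIT = "a" * 40  # placeholder, not a real commit hash
print(snapshot_path("/models", "Qwen/Qwen3-0.6B", COMMIT, "config.json"))
print(ref_path("/models", "Qwen/Qwen3-0.6B"))
```

With the volume mounted and the cache location pointed at it (for example through the `HF_HUB_CACHE` environment variable), an offline client resolves `refs/main` to a commit and reads files from the matching snapshot folder.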

Ops built in

Fan-out replication through a disk-spooled retry queue, Prometheus metrics with a ready-made Grafana dashboard, an append-only audit stream, an authenticated admin API, health probes.

Install

Run it

# container image
docker run -p 8080:8080 ghcr.io/loewenthal-corp/shpiel:latest \
  serve --local --listen-api :8080

# Helm chart
helm install shpiel oci://ghcr.io/loewenthal-corp/charts/shpiel

# from source
go install github.com/loewenthal-corp/shpiel/cmd/shpiel@latest

# binaries for linux/darwin on every release:
# github.com/loewenthal-corp/shpiel/releases

Point your tools at it

# laptop mode: fs store in ~/.shpiel, pull-through on
shpiel serve --local

export HF_ENDPOINT=http://127.0.0.1:8080
hf download Qwen/Qwen3-0.6B   # first pull caches it
hf download Qwen/Qwen3-0.6B   # second is served locally
hf upload my-org/my-model ./model

Beyond laptop mode, everything is one YAML file — config.example.yaml documents every knob. The Helm chart's config value is that file.

Compatibility

Client	Reads	Writes
`huggingface_hub` / `hf` CLI 1.x	✓ HTTP or chunk-level Xet	✓ via Xet
`huggingface_hub` / `hf` CLI 0.x	✓	✓ via git-LFS
vLLM, SGLang, TGI, `from_pretrained`	✓ incl. Range / lazy loading	—
`HF_HUB_OFFLINE=1` on a shared volume	✓ fs backend	—

This matrix is enforced by an executable conformance suite that runs against every serving configuration, and by end-to-end tests that drive a real Python huggingface_hub/hf_xet client against the real binary. If compatibility regresses, CI fails.

Stop pulling weights over the internet.
