GotGemini

Gemma 4 12B gets local Mac apps and LiteRT-LM serving

Gemma 4 12B now runs through Google AI Edge Gallery, Google AI Edge Eloquent, and LiteRT-LM serving. The launch targets local multimodal agents on laptops with 16GB of memory.

v1 · Gemma 4 / ai-studio · June 4, 2026

Google AI Edge published a Gemma 4 12B workflow for local laptop use: Google AI Edge Gallery on macOS, Google AI Edge Eloquent on macOS, and a LiteRT-LM serve command that exposes a local OpenAI-compatible endpoint. The underlying model is Google DeepMind's Gemma 4 12B, a dense multimodal model with an encoder-free architecture.

Who it's for

The workflow is aimed at developers building local agents, desktop AI tools, multimodal experiments, and privacy-sensitive workflows that should run on a laptop rather than through a hosted API. Google says the model can run locally on consumer laptops with 16GB of VRAM or unified memory. The LiteRT-LM model card on Hugging Face says the current LiteRT-LM package is ready for macOS and Linux.

shell
# Install the LiteRT-LM CLI
pip install litert-lm

# Import the Gemma 4 12B LiteRT-LM package
litert-lm import \
  --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm \
  gemma4-12b

# Start the local OpenAI-compatible server
litert-lm serve

# Call the local endpoint
curl http://localhost:9379/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-12b,gpu",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
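
The curl call above can be wrapped in a small reusable shell helper. The following is a minimal sketch, not something taken from Google's documentation: the function names `json_escape`, `build_payload`, and `ask`, and the `LITERT_URL` and `LITERT_MODEL` variables, are hypothetical, while the URL and model string are copied from the example above.

```shell
#!/bin/sh
# Hypothetical helper names; URL and model string copied from the example above.
LITERT_URL="${LITERT_URL:-http://localhost:9379/v1/chat/completions}"
LITERT_MODEL="${LITERT_MODEL:-gemma4-12b,gpu}"

# Escape backslashes and double quotes so a prompt is safe inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a single-turn chat completion request body for the given prompt.
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
    "$LITERT_MODEL" "$(json_escape "$1")"
}

# Send the prompt to the local server and print the raw JSON response.
ask() {
  build_payload "$1" | curl -sS "$LITERT_URL" \
    -H "Content-Type: application/json" \
    -d @-
}
```

With the server already running in another terminal, `ask "Hello!"` should print the server's JSON response. Assuming the endpoint mirrors the OpenAI chat completions response schema, the reply text would sit under `choices[0].message.content`.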

The model weights and LiteRT-LM package are publicly downloadable, and the Hugging Face package is listed under Apache 2.0. Google did not announce hosted API pricing because this workflow is local inference.

Disclosed limits and requirements: Google says the model targets laptops with 16GB of VRAM or unified memory. The LiteRT-LM model card says the current package supports text and audio modalities; image and multi-token prediction support are planned for a future update. The card also says the model supports a context length of up to 32k, while its published benchmark table was measured at a context length of 2,048.

  • The LiteRT-LM model card says image and multi-token prediction support are not in the current LiteRT-LM package yet.
  • Google's launch post says performance is near the larger 26B MoE model on standard benchmarks, but it does not publish the full benchmark table in the announcement.
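
The 16GB requirement listed above can be checked before downloading the model. This is a rough sketch under a stated assumption: total system RAM is treated as a proxy for unified memory, which is reasonable on Apple silicon but does not measure the separate VRAM of a discrete GPU on Linux. The function names are hypothetical and are not part of LiteRT-LM.

```shell
#!/bin/sh
# Hypothetical helpers; not part of LiteRT-LM.

# Print total physical memory in bytes (0 if the platform is not recognized).
total_mem_bytes() {
  case "$(uname -s)" in
    Darwin) sysctl -n hw.memsize ;;
    Linux)  awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo ;;
    *)      echo 0 ;;
  esac
}

# Succeed if the byte count in $1 is at least $2 GiB.
has_at_least_gib() {
  awk -v bytes="$1" -v gib="$2" \
    'BEGIN { exit !(bytes >= gib * 1024 * 1024 * 1024) }'
}

# Report whether this machine meets a 16 GiB threshold.
check_memory() {
  if has_at_least_gib "$(total_mem_bytes)" 16; then
    echo "16 GiB or more of system memory detected"
  else
    echo "less than 16 GiB of system memory detected"
  fi
}
```

Running `check_memory` prints one of the two messages. On Linux, `MemTotal` excludes memory reserved by firmware and the kernel, so a nominal 16GB machine can report slightly less; lowering the threshold to 15 is a reasonable adjustment there.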
