This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. Run any open model, such as DeepSeek, Qwen, or Gemma.
For this tutorial, we'll use GLM-4.7-Flash, the strongest 30B MoE agentic and coding model as of Jan 2026 (it works great on a 24GB RAM/unified-memory device), to autonomously fine-tune an LLM with Unsloth. You can swap in any other model; just update the model names in your scripts.
We use llama.cpp, an open-source framework for running LLMs on Mac, Linux, and Windows devices. llama.cpp includes llama-server, which lets you serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.
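Once llama-server is up on port 8001, you can verify the OpenAI-compatible endpoint with a quick smoke test. This is a sketch; the `"model"` value is largely ignored by llama-server since it serves whichever model it was launched with:

```shell
# Smoke-test the local OpenAI-compatible chat endpoint.
# Assumes llama-server is already running on port 8001.
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one word."}]
      }'
```

Any OpenAI-compatible client library can point at `http://localhost:8001/v1` the same way.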
We need to install llama.cpp to deploy and serve local LLMs for use in Claude Code and other tools. We follow the official build instructions to get correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
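A minimal build sketch following llama.cpp's official CMake instructions (the `-j` parallelism and directory layout are conventions, not requirements):

```shell
# Clone and build llama.cpp with CUDA support.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Use -DGGML_CUDA=OFF for CPU-only inference.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Binaries (including llama-server) land in build/bin/.
ls build/bin/llama-server
```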
To deploy GLM-4.7-Flash for agentic workloads, we use llama-server. We apply Z.ai's recommended sampling parameters (temp 1.0, top_p 0.95) and enable --jinja for proper tool-calling support.
Run this command in a new terminal (or a tmux session). The setup below should fit comfortably on a 24GB GPU such as an RTX 4090 (it uses about 23GB). --fit on will also auto-offload, but if you see poor performance, reduce --ctx-size. We used --cache-type-k q8_0 --cache-type-v q8_0 to quantize the KV cache and reduce VRAM usage.
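Putting the flags above together, a launch command might look like the sketch below. The GGUF repo name and the --ctx-size value are assumptions; substitute the actual quant you downloaded and a context size that fits your VRAM:

```shell
# Hedged sketch of a llama-server launch with the flags discussed above.
# The -hf repo/quant and context size are placeholders -- adjust for your setup.
./build/bin/llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M \
  --port 8001 \
  --jinja \
  --temp 1.0 --top-p 0.95 \
  --ctx-size 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

The q8_0 KV-cache quantization roughly halves cache memory versus the default f16, which is what frees up room for a longer context on a 24GB card.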
Claude Code is Anthropic’s agentic coding tool that lives in your terminal, understands your codebase, and handles complex Git workflows via natural language.
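To point Claude Code at the local server instead of Anthropic's API, you can use its documented environment variables. A sketch, assuming your llama-server build exposes an Anthropic-compatible endpoint (if it only speaks the OpenAI API, a translation proxy would sit in between):

```shell
# Redirect Claude Code's API traffic to the local llama-server.
export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_AUTH_TOKEN="dummy"   # local server; any non-empty value works

# Launch Claude Code as usual -- requests now go to the local model.
claude
```

Setting these in your shell profile makes the redirection persistent across sessions.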