Native local inference

Bonsai CLI

Run Bonsai-8B from your terminal with a small native CLI. Bonsai downloads the GGUF model on first use, caches it locally, and streams generated text directly to standard output.

Default GGUF model: Bonsai-8B

Native runtime backend: llama.cpp

Metal: enabled on macOS arm64

$ ./bonsai "Explain GitHub Actions in one sentence."
Downloading Bonsai-8B.gguf on first run...
GitHub Actions is a CI/CD automation service that runs workflows from your repository when events such as pushes, pull requests, or releases happen.

$ terraform plan | ./bonsai "Summarize this Terraform plan."
The plan updates networking resources, leaves existing compute instances unchanged, and does not destroy any infrastructure.

Quick Start

Download the prebuilt binary for your platform, extract it, and run it with a prompt. The model is downloaded once and reused from the local cache after that.

Download

Get a prebuilt binary from the latest GitHub Release for macOS, Windows, or Linux.

Extract

Unpack the archive and place the executable wherever you keep command-line tools.

Run

Pass a prompt as an argument, or pipe input through standard input for summarization and review workflows.

macOS or Linux:

./bonsai "Explain GitHub Actions in one sentence."

Windows:

.\bonsai.exe "Explain GitHub Actions in one sentence."
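Below is a sketch of how an argument prompt and piped input might be combined, as in the `terraform plan | ./bonsai ...` example. It assumes the following, which Bonsai's actual behavior may not match:

- argument words are joined with spaces;
- piped text is appended after a blank line;
- piped input is detected by checking whether standard input is a character device.

The names `buildPrompt` and `stdinIsPiped` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// buildPrompt joins the argument words into the prompt and, when piped
// input is present, appends it after the prompt. The blank-line separator
// is an assumption of this sketch.
func buildPrompt(args []string, stdin string) string {
	prompt := strings.Join(args, " ")
	stdin = strings.TrimRight(stdin, "\n")
	if stdin == "" {
		return prompt
	}
	if prompt == "" {
		return stdin
	}
	return prompt + "\n\n" + stdin
}

// stdinIsPiped reports whether standard input comes from a pipe or file
// rather than an interactive terminal.
func stdinIsPiped() bool {
	info, err := os.Stdin.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice == 0
}

func main() {
	var piped string
	if stdinIsPiped() {
		data, err := io.ReadAll(os.Stdin)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		piped = string(data)
	}
	// A real CLI would take positional arguments from flag.Args() after parsing flags.
	fmt.Println(buildPrompt(os.Args[1:], piped))
}
```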

Usage

Bonsai works well as a direct prompt tool and as a Unix-style filter for existing command output, logs, plans, and text files.

./bonsai "Summarize this text in Japanese." < terraform_plan_result.log

terraform plan | ./bonsai "Summarize the following Terraform plan in Japanese."

Prompt text can be passed as command arguments.
Standard input is appended after the prompt when present.
Generated tokens are streamed to standard output as they arrive.
Settings can be controlled with flags or BONSAI_* environment variables.

Main Flags

Tune the model path, sampling behavior, context size, cache location, and Hugging Face authentication from the command line.

Flag	Default	Description
`-model-path`	auto	Local path to the GGUF model.
`-model-repo`	`prism-ml/Bonsai-8B-gguf`	Hugging Face model repository.
`-model-file`	`Bonsai-8B.gguf`	GGUF filename inside the repository.
`-cache-dir`	OS user cache directory	Root directory for downloaded models.
`-context-size`	`4096`	Context window size.
`-max-tokens`	`64`	Maximum number of generated tokens.
`-threads`	`runtime.NumCPU()/2`	Number of CPU threads, with a minimum of one.
`-temperature`	`0.5`	Sampling temperature.
`-top-k`	`20`	Top-k sampler setting.
`-top-p`	`0.9`	Top-p sampler setting.
`-repeat-penalty`	`1.0`	Repeat penalty; `1.0` disables it.
`-seed`	`0`	Random seed.
`-raw-prompt`	`false`	Send the prompt without applying the model chat template.
`-hf-token`	empty	Hugging Face token for gated models.

For Developers

Build Bonsai CLI from source with Go, CMake, a C/C++ toolchain, and the llama.cpp submodule checked out.

Build from source

git clone --recurse-submodules https://github.com/pluswing/bonsai-cli.git
cd bonsai-cli
./scripts/build-bonsai.sh

Docker smoke test

BONSAI_CONTEXT_SIZE=4096 \
BONSAI_MAX_TOKENS=24 \
BONSAI_PROMPT="Reply in one short sentence about CPU-only inference." \
docker compose up --build --abort-on-container-exit \
  --exit-code-from bonsai-cli-test