Native local inference

Bonsai CLI

Run Bonsai-8B from your terminal with a small native CLI. Bonsai downloads the GGUF model on first use, caches it locally, and streams generated text directly to standard output.

Default GGUF model: Bonsai-8B
Native runtime backend: llama.cpp
Metal: enabled on macOS arm64
$ ./bonsai "Explain GitHub Actions in one sentence."
Downloading Bonsai-8B.gguf on first run...
GitHub Actions is a CI/CD automation service that runs workflows from your repository when events such as pushes, pull requests, or releases happen.

$ terraform plan | ./bonsai "Summarize this Terraform plan."
The plan updates networking resources, leaves existing compute instances unchanged, and does not destroy any infrastructure.

Quick Start

Download the prebuilt binary for your platform, extract it, and run it with a prompt. The model is downloaded once and reused from the local cache after that.

1. Download

Get a prebuilt binary from the latest GitHub Release for macOS, Windows, or Linux.

2. Extract

Unpack the archive and place the executable wherever you keep command-line tools.

3. Run

Pass a prompt as an argument, or pipe input through standard input for summarization and review workflows.

macOS and Linux:
./bonsai "Explain GitHub Actions in one sentence."

Windows:
.\bonsai.exe "Explain GitHub Actions in one sentence."

Usage

Bonsai works well as a direct prompt tool and as a Unix-style filter for existing command output, logs, plans, and text files.

./bonsai "Summarize this text in Japanese." < terraform_plan_result.log
terraform plan | ./bonsai "Summarize the following Terraform plan in Japanese."

  • Prompt text can be passed as command arguments.
  • Standard input is appended after the prompt when present.
  • Generated tokens are streamed to standard output as they arrive.
  • Settings can be controlled with flags or BONSAI_* environment variables.
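
To make the argument-plus-stdin behavior concrete, here is a minimal Go sketch of how a prompt might be assembled. It is not Bonsai's actual code: `composePrompt` and `readPipedInput` are hypothetical names, and the blank line placed between the prompt and the piped text is an assumption about the separator.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// composePrompt joins the command-line arguments into one prompt and, when
// piped text is present, appends it after the prompt.
// Assumption: a blank line separates the two parts; the real separator may differ.
func composePrompt(args []string, piped string) string {
	prompt := strings.Join(args, " ")
	piped = strings.TrimSpace(piped)
	if piped == "" {
		return prompt
	}
	if prompt == "" {
		return piped
	}
	return prompt + "\n\n" + piped
}

// readPipedInput returns the contents of stdin only when stdin is not an
// interactive terminal, so a plain invocation does not block waiting for input.
func readPipedInput() (string, error) {
	info, err := os.Stdin.Stat()
	if err != nil {
		return "", err
	}
	if info.Mode()&os.ModeCharDevice != 0 {
		return "", nil // interactive terminal: nothing was piped
	}
	data, err := io.ReadAll(os.Stdin)
	return string(data), err
}

func main() {
	piped, err := readPipedInput()
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(1)
	}
	fmt.Println(composePrompt(os.Args[1:], piped))
}
```

Keeping `composePrompt` free of I/O makes the combination rule easy to test separately from terminal detection.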

Main Flags

Tune the model path, sampling behavior, context size, cache location, and Hugging Face authentication from the command line.

Flag             Default                  Description
-model-path      auto                     Local path to the GGUF model.
-model-repo      prism-ml/Bonsai-8B-gguf  Hugging Face model repository.
-model-file      Bonsai-8B.gguf           GGUF filename inside the repository.
-cache-dir       OS user cache directory  Root directory for downloaded models.
-context-size    4096                     Context window size.
-max-tokens      64                       Maximum number of generated tokens.
-threads         runtime.NumCPU()/2       Number of CPU threads, with a minimum of one.
-temperature     0.5                      Sampling temperature.
-top-k           20                       Top-k sampler setting.
-top-p           0.9                      Top-p sampler setting.
-repeat-penalty  1.0                      Repeat penalty.
-seed            0                        Random seed.
-raw-prompt      false                    Send the prompt without applying the model chat template.
-hf-token        empty                    Hugging Face token for gated models.

For Developers

Building Bonsai CLI from source requires Go, CMake, a C/C++ toolchain, and a checkout of the llama.cpp submodule.

Build from source

git clone --recurse-submodules https://github.com/pluswing/bonsai-cli.git
cd bonsai-cli
./scripts/build-bonsai.sh

Docker smoke test

BONSAI_CONTEXT_SIZE=4096 \
BONSAI_MAX_TOKENS=24 \
BONSAI_PROMPT="Reply in one short sentence about CPU-only inference." \
docker compose up --build --abort-on-container-exit \
  --exit-code-from bonsai-cli-test