Use Llama.cpp#

Lumen AI supports running models in-process with Llama.cpp via the lumen.ai.llm.LlamaCpp class, enabling you to leverage local models without external API calls. By default the Llama.cpp provider fetches and uses the Qwen/Qwen2.5-Coder-7B-Instruct-GGUF model, which strikes a good balance between hardware requirements and performance. For larger models we recommend running the llama-cpp-python server (https://llama-cpp-python.readthedocs.io/en/latest/server/) and connecting to its OpenAI-compatible endpoints.
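If you opt for the server route, the model runs in a separate process and Lumen connects to it over HTTP. The sketch below is a minimal, hedged example: it assumes a llama-cpp-python server is already listening on http://localhost:8000/v1 and that Lumen's OpenAI provider accepts an endpoint argument for OpenAI-compatible APIs; check the Lumen LLM provider reference for the exact parameter names supported by your version.

import lumen.ai as lmai

# Assumption: a llama-cpp-python server was started separately with its
# OpenAI-compatible endpoint on http://localhost:8000/v1, and the OpenAI
# provider accepts a custom endpoint and a dummy API key.
llm = lmai.llm.OpenAI(
    endpoint="http://localhost:8000/v1",
    api_key="not-needed",  # the local server does not validate keys
)

lmai.ui.ExplorerUI('<your-data-file-or-url>', llm=llm).servable()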

Note

The first time you use the Llama.cpp provider it will download the specified model, which may take some time.

Prerequisites#

  • Lumen AI installed in your Python environment.

  • Llama.cpp installed and configured on your system. Follow the Llama.cpp Installation Guide.

  • Decent hardware, such as a modern GPU, an ARM-based Mac, or a high-core-count CPU.

Using CLI Arguments#

Once configured, you can select Llama.cpp as the provider using a CLI argument:

lumen-ai serve <your-data-file-or-url> --provider llama-cpp

Using Python#

In Python, simply import the LLM wrapper lumen.ai.llm.LlamaCpp and pass it to the lumen.ai.ui.ExplorerUI:

import lumen.ai as lmai

llamacpp_llm = lmai.llm.LlamaCpp()

ui = lmai.ui.ExplorerUI(llm=llamacpp_llm)
ui.servable()

Configuring Models#

If you do not want to use the default model (Qwen/Qwen2.5-Coder-7B-Instruct-GGUF), you can override it by providing a model configuration via the model_kwargs parameter. This allows specifying different models for different scenarios; currently a default and a reasoning model can be configured. If no reasoning model is provided, the default model is always used.
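For instance, a configuration that adds a separate reasoning model might look like the sketch below. The model files here are illustrative placeholders (pick whichever GGUF quantization you prefer), and the "default"/"reasoning" keys correspond to the two scenarios described above.

import lumen.ai as lmai

# Illustrative sketch: a default model plus a dedicated reasoning model.
# The model_file names are hypothetical examples; substitute the GGUF
# files you actually want from each repository.
config = {
    "default": {
        "repo": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
        "model_file": "qwen2.5-coder-7b-instruct-q4_k_m.gguf",
    },
    "reasoning": {
        "repo": "bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF",
        "model_file": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",
    },
}

llm = lmai.llm.LlamaCpp(model_kwargs=config)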

You can override the model configuration by providing a repo and model_file to look up on Hugging Face, or a model_path pointing to a model on disk. Any other configuration options are passed through to the underlying llama.cpp model object.
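If you already have a GGUF file on disk, a model_path based configuration might look like the following sketch. The path is hypothetical, and the extra options (n_gpu_layers, n_ctx) are standard llama.cpp settings that are simply passed through:

import lumen.ai as lmai

config = {
    "default": {
        # Hypothetical local path; point this at any GGUF file on disk
        "model_path": "/models/my-model-Q4_K_M.gguf",
        # Extra options are forwarded to llama.cpp: offload all layers
        # to the GPU and set the context window size
        "n_gpu_layers": -1,
        "n_ctx": 8192,
    }
}

llm = lmai.llm.LlamaCpp(model_kwargs=config)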

As another example, let's replace Qwen 2.5 Coder with a quantized DeepSeek model found by searching Hugging Face, providing its repo name, model file, chat format and other configuration options in Python:

import lumen.ai as lmai

config = {
    "default": {
        # Hugging Face repository to download the model from
        "repo": "bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF",
        # Quantized GGUF file within that repository
        "model_file": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",
        # Chat template used to format prompts
        "chat_format": "qwen",
        # Context window size
        "n_ctx": 131072,
    }
}

llm = lmai.llm.LlamaCpp(model_kwargs=config)

lmai.ui.ExplorerUI('<your-data-file-or-url>', llm=llm).servable()

Note

Find all valid configuration options in the Llama.cpp API reference.

You can also select another model from the CLI:

lumen-ai serve --provider llama --model-kwargs '{
  "default": {
    "repo": "unsloth/Mistral-Small-24B-Instruct-2501-GGUF",
    "model_file": "Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf",
    "chat_format": "mistral-instruct"
  }
}'

Providing these arguments via the CLI can be cumbersome. Instead, you can pass the URL of the quantized model file via --llm-model-url and supply the model_kwargs as query parameters. If --llm-model-url is set, the provider automatically defaults to llama, and an error is raised if another provider is specified.

lumen-ai serve --llm-model-url "https://huggingface.co/unsloth/Mistral-Small-24B-Instruct-2501-GGUF/blob/main/Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf?chat_format=mistral-instruct"
lumen-ai serve --llm-model-url "https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/blob/main/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf?chat_format=qwen&n_ctx=131072"