* feat(gguf): improve model compatibility heuristic for Apple Silicon unified memory
* fix: resolve review
---------
Co-authored-by: Louis <louis@jan.ai>
Introduces support for the `fit` parameter and its associated configurations (`fit_target`, `fit_ctx`), allowing arguments to be adjusted automatically to the available device memory. This change spans the extension settings, guest-js types, and the Rust argument builder.
**Key changes:**
* **Settings & Types:** Added `fit`, `fit_target`, and `fit_ctx` to `settings.json` and synchronized these fields across the TypeScript definitions and the Rust `LlamacppConfig` struct.
* **Logic Updates:**
  * Implemented `add_fit_settings` in the `ArgumentBuilder` to handle the `--fit`, `--fit-target`, and `--fit-ctx` flags.
  * Modified `add_gpu_layers` to use `-1` as the default for loading all layers, while treating `100` as a manual override.
  * Updated several argument methods (batch size, context size, etc.) to append flags only when the values differ from the defaults, reducing command-line clutter.
  * Added a check to exclude the `fit` settings when using the `ik` backend fork.
* **Testing:** Significantly expanded the Rust test suite. Replaced basic assertions with dedicated helper functions (`assert_arg_pair`, `assert_has_flag`, `assert_no_flag`) and added comprehensive test cases for various configurations, including GPU layers, embedding mode, and backend-specific behavior.
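The flag handling described above can be sketched roughly as follows; `FitConfig` and the exact flag semantics are assumptions for illustration, not the actual `args.rs` implementation:

```rust
// Hypothetical sketch of the fit-flag handling; the real ArgumentBuilder
// differs in structure and field names.
#[derive(Default)]
struct FitConfig {
    fit: Option<String>,     // e.g. "auto" | "on" | "off" (assumed values)
    fit_target: Option<u32>, // target memory margin (assumed units)
    fit_ctx: Option<u32>,    // minimum context size to preserve
}

fn add_fit_settings(args: &mut Vec<String>, cfg: &FitConfig, is_ik_backend: bool) {
    // The ik backend fork does not understand the fit flags, so skip them.
    if is_ik_backend {
        return;
    }
    if let Some(fit) = &cfg.fit {
        args.push("--fit".into());
        args.push(fit.clone());
    }
    if let Some(target) = cfg.fit_target {
        args.push("--fit-target".into());
        args.push(target.to_string());
    }
    if let Some(ctx) = cfg.fit_ctx {
        args.push("--fit-ctx".into());
        args.push(ctx.to_string());
    }
}
```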
* refactor: migrate llamacpp backend logic to rust plugin
Moves the core logic for managing llama.cpp backends—including version detection, compatibility checking, migration, prioritization, and updates—from the TypeScript extension to the Rust Tauri plugin.
Changes:
- **tauri-plugin-llamacpp**:
- Added `src/backend.rs` containing the logic for backend management.
- Exposed new commands: `map_old_backend_to_new`, `list_supported_backends`, `determine_supported_backends`, `prioritize_backends`, `check_backend_for_updates`, `remove_old_backend_versions`, etc.
- Added unit tests for backend logic in Rust.
- Updated permissions and guest-js bindings to include new commands.
- **llamacpp-extension**:
- Refactored `src/backend.ts` and `src/index.ts` to delegate logic to the Rust plugin.
- Removed obsolete TypeScript implementation of backend logic and corresponding tests.
- Simplified configuration and update workflows by using the centralized Rust API.
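A minimal sketch of the kind of translation `map_old_backend_to_new` performs; the concrete pairs match examples given later in this log, but the real command in `src/backend.rs` covers more platforms and variants:

```rust
// Illustrative sketch only: legacy CPU-feature tags collapse into the
// common_cpus variant, and CUDA tags keep their major version.
fn map_old_backend_to_new(old: &str) -> String {
    match old {
        "win-avx2-x64" | "win-avx512-x64" | "win-noavx-x64" => {
            "win-common_cpus-x64".to_string()
        }
        s if s.contains("cu11") => "win-cuda-11-common_cpus-x64".to_string(),
        s if s.contains("cu12") => "win-cuda-12-common_cpus-x64".to_string(),
        other => other.to_string(), // already in the new naming scheme
    }
}
```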
* tests: fix parse backend version tests
* fix: correct backend directory path
* refactor: move llama.cpp config handling to Rust
- Removed duplicated TypeScript type definitions for LlamacppConfig, ModelPlan, DownloadItem, ModelConfig, etc.
- Added a new `src/guest-js/types.ts` that exports the consolidated types and a helper `normalizeLlamacppConfig` for converting raw config objects.
- Implemented a dedicated Rust module `args.rs` that builds all command‑line arguments for llama.cpp from a `LlamacppConfig` struct, handling embedding, flash‑attention, GPU/CPU flags, and other options.
- Updated `commands.rs` to construct arguments via `ArgumentBuilder`, validate paths, and log the generated args.
- Added more explicit error handling for invalid configuration arguments and updated the error enum to include `InvalidArgument`.
- Exported the new `cleanupLlamaProcesses` command and updated the guest‑JS API accordingly.
- Adjusted the TypeScript `loadLlamaModel` helper to use the new config normalization and argument shape.
- Improved logging and documentation for clarity.
* fix: ignore empty mmproj path arguments
Prevent adding the `--mmproj` flag when the provided path string is empty.
An empty `mmproj_path` previously caused an empty argument to be passed to the model loader, potentially leading to errors or undefined behavior. By filtering out empty strings before pushing the flag, the command line construction is now robust against malformed input.
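The guard can be sketched as follows (illustrative only; the actual builder lives in `args.rs` and the function name here is hypothetical):

```rust
// Only emit --mmproj when a non-empty path is supplied, so no dangling
// flag or empty argument reaches the model loader.
fn add_mmproj(args: &mut Vec<String>, mmproj_path: &str) {
    if !mmproj_path.is_empty() {
        args.push("--mmproj".into());
        args.push(mmproj_path.into());
    }
}
```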
* refactor: use String::new() for empty API key
Use `String::new()` instead of `"".to_string()` when no API key is supplied.
This eliminates an unnecessary heap allocation and clarifies that the intent is to create an empty string without creating a temporary literal.
* fix: set backend path environment variables for llama.cpp
Ensure that the backend executable’s directory is added to the appropriate
environment variable (`PATH`, `LD_LIBRARY_PATH`, or `DYLD_LIBRARY_PATH`)
before invoking `llama_load` and `get_devices`.
This change fixes load failures on Windows, Linux, and macOS where the
dynamic loader cannot locate the required libraries without the proper
search paths, and cleans up unused imports.
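A minimal sketch of the per-platform variable choice, assuming a string OS tag rather than the `cfg`-based dispatch the plugin presumably uses:

```rust
// Pick the dynamic-loader search-path variable for the current platform.
// Illustrative only; real code would branch on #[cfg(target_os = "...")].
fn loader_path_var(os: &str) -> &'static str {
    match os {
        "windows" => "PATH",
        "macos" => "DYLD_LIBRARY_PATH",
        _ => "LD_LIBRARY_PATH", // Linux and other Unix-likes
    }
}
```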
* refactor: centralize library path setup in Rust utilities
Move the library‑path configuration logic out of the TypeScript code into the
Rust `setup_library_path` helper. The TypeScript files no longer set the
`PATH`, `LD_LIBRARY_PATH`, or `DYLD_LIBRARY_PATH` environment variables
directly; instead they defer to the Rust side, which now accepts a
`Path` and performs platform‑specific normalization (including UNC‑prefix
trimming on Windows). This removes duplicated code, keeps environment
configuration consistent across the plugin, and simplifies maintenance.
The import order in `device.rs` was corrected and small formatting fixes
were applied. No functional changes to the public API occur.
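The prepend-and-normalize behavior can be sketched as below; `prepend_search_path` is a hypothetical name, and the real helper works with `std::env` and `Path` rather than plain strings:

```rust
// Prepend a directory to a PATH-like variable, trimming Windows'
// extended-length (UNC) prefix so the loader accepts the path.
fn prepend_search_path(existing: &str, new_dir: &str, sep: char) -> String {
    let dir = new_dir.strip_prefix(r"\\?\").unwrap_or(new_dir);
    if existing.is_empty() {
        dir.to_string()
    } else {
        format!("{dir}{sep}{existing}")
    }
}
```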
* feat: add CUDA path detection and warnings for llama.cpp
Add utilities to detect CUDA installations on Windows and Linux, automatically
inject CUDA paths into the process environment, and warn when the llama.cpp
binary requires CUDA but the runtime is not found. The library‑path setup has
been refactored to prepend new paths and normalize UNC prefixes for Windows.
This ensures the backend can load CUDA libraries correctly and provides
diagnostic information when CUDA is missing.
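The warning condition reduces to a simple predicate; this sketch assumes the backend name carries a `cuda` tag, consistent with the naming scheme used elsewhere in this log:

```rust
// Warn only when a CUDA-tagged backend is selected but no CUDA runtime
// was detected on the system. Sketch; the real check inspects paths too.
fn should_warn_missing_cuda(backend_name: &str, cuda_runtime_found: bool) -> bool {
    backend_name.contains("cuda") && !cuda_runtime_found
}
```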
* refactor: correctly map and store effective backend type
This update unifies backend type handling across the llamacpp extension.
Previously, the stored backend preference, the version string, and the
auto‑update logic used inconsistent identifiers (raw backend names versus
their effective mapped forms). The patch:
* Maps legacy backend names to their new “effective” type before any
comparison or storage.
* Stores the full `version/effectiveType` string instead of just the
type, ensuring the configuration and localStorage stay in sync.
* Updates all logging and warning messages to reference the effective
backend type.
* Simplifies the update check logic by comparing the effective type and
version together, preventing unnecessary migrations.
These changes eliminate bugs that occurred when the backend type
changed after an update and make the internal state more coherent.
* refactor: improve CUDA detection and migrate legacy libs
Enhance `_isCudaInstalled` to accept the backend directory and CUDA version, checking both the new and legacy installation paths. If a library is found in the old location, move it to the new `build/bin` directory and create any missing folders. Update `mapOldBackendToNew` formatting and remove duplicated comments. Minor consistency and readability fixes were also applied throughout the backend module.
* refactor: broaden llama backend archive regex
This update expands the regular expression used to parse llama‑cpp extension archives.
The new pattern now supports:
- Optional prefixes and the `-main` segment
- Version strings that include a hash suffix
- An optional `-cudart-llama` part
- A wide range of backend detail strings
These changes ensure `installBackend` can correctly handle the latest naming conventions (e.g., `k_llama-main-b4314-09c61e1-bin-win-cuda-12.8-x64-avx2.zip`) while preserving backward compatibility with older formats.
Added `mapOldBackendToNew` to translate legacy backend strings (e.g., `win-avx2-x64`, `win-avx512-cuda-cu12.0-x64`) into the new unified names (`win-common_cpus-x64`, `win-cuda-12-common_cpus-x64`). Updated backend selection, installation, and download logic to use the mapper, ensuring consistent naming across the extension and tests.
Updated tests to verify the mapping, the new download items, and correct extraction paths. Minor formatting updates were made to the Tauri command file for clearer logging. This change enables smoother migration of stored user preferences and reduces duplicate asset handling.
Co-authored-by: Akarshan Biswas <akarshan@menlo.ai>
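A std-only sketch of extracting the build tag and optional hash from an archive name like the example above; the extension uses a single regular expression, so this split-based version is only illustrative:

```rust
// Pull the `bNNNN` build segment and an optional trailing hex hash out of
// an archive file name, e.g. "k_llama-main-b4314-09c61e1-bin-win-...zip".
fn parse_archive_version(name: &str) -> Option<(String, Option<String>)> {
    let stem = name.strip_suffix(".zip")?;
    let parts: Vec<&str> = stem.split('-').collect();
    // Locate the build segment: a 'b' followed only by digits.
    let idx = parts.iter().position(|p| {
        p.len() > 1 && p.starts_with('b') && p[1..].chars().all(|c| c.is_ascii_digit())
    })?;
    let version = parts[idx].to_string();
    // A short hex hash may directly follow the build segment.
    let hash = parts
        .get(idx + 1)
        .filter(|h| h.len() >= 7 && h.chars().all(|c| c.is_ascii_hexdigit()))
        .map(|h| h.to_string());
    Some((version, hash))
}
```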
* feat: add configurable timeout for llamacpp connections
This change introduces a user-configurable read/write timeout (in seconds) for llamacpp connections, replacing the hard-coded 600s value. The timeout is now settable via the extension settings and used in both HTTP requests and server readiness checks. This provides flexibility for different deployment scenarios, allowing users to adjust connection duration based on their specific use cases while maintaining the default 10-minute timeout behavior.
* fix: correct timeout conversion factor and clarify settings description
The previous timeout conversion used `timeout * 100` instead of `timeout * 1000`, which incorrectly shortened the timeout to 1/10 of the intended value (e.g., 10 minutes became 1 minute). This change corrects the conversion factor to milliseconds. Additionally, the settings description was updated to explicitly state that this timeout applies to both connection and load operations, improving user understanding of its scope.
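The fix in sketch form, with the buggy factor noted for contrast:

```rust
// Seconds must be scaled by 1000 to yield milliseconds; the old code used
// `timeout * 100`, turning a 600 s (10 min) timeout into 60 s (1 min).
fn timeout_ms(timeout_secs: u64) -> u64 {
    timeout_secs * 1000
}
```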
* style: replace loose equality with strict equality in key comparison
This change updates the comparison operator from loose equality (`==`) to strict equality (`===`) when checking for the 'timeout' key. While the key is always a string in this context (making the behavior identical), using strict equality prevents potential type conversion issues and adheres to JavaScript best practices for reliable comparisons.
* refactor: Simplify Tauri plugin calls and enhance 'Flash Attention' setting
This commit introduces significant improvements to the llama.cpp extension, focusing on the 'Flash Attention' setting and refactoring Tauri plugin interactions for better code clarity and maintenance.
The backend interaction is streamlined by removing the unnecessary `libraryPath` argument from the Tauri plugin commands for loading models and listing devices.
* **Simplified API Calls:** The `loadLlamaModel`, `unloadLlamaModel`, and `get_devices` functions in both the extension and the Tauri plugin now manage the library path internally based on the backend executable's location.
* **Decoupled Logic:** The extension (`src/index.ts`) now uses the new, simplified Tauri plugin functions, which enhances modularity and reduces boilerplate code in the extension.
* **Type Consistency:** Added `UnloadResult` interface to `guest-js/index.ts` for consistency.
* **Updated UI Control:** The 'Flash Attention' setting in `settings.json` is changed from a boolean checkbox to a string-based dropdown, offering **'auto'**, **'on'**, and **'off'** options.
* **Improved Logic:** The extension logic in `src/index.ts` is updated to correctly handle the new string-based `flash_attn` configuration. It now passes the string value (`'auto'`, `'on'`, or `'off'`) directly as a command-line argument to the llama.cpp backend, simplifying the version-checking logic previously required for older llama.cpp versions. The old, complex logic tied to specific backend versions is removed.
This refactoring cleans up the extension's codebase and moves environment and path setup concerns into the Tauri plugin where they are most relevant.
* feat: Simplify backend architecture
This commit introduces a functional flag for embedding models and refactors the backend detection logic for cleaner implementation.
Key changes:
- Embedding Support: The loadLlamaModel API and SessionInfo now include an isEmbedding: boolean flag. This allows the core process to differentiate and correctly initialize models intended for embedding tasks.
- Backend Naming Simplification (Refactor): Consolidated the CPU-specific backend tags (e.g., win-noavx-x64, win-avx2-x64) into generic *-common_cpus-x64 variants (e.g., win-common_cpus-x64). This streamlines supported backend detection.
- File Structure Update: Changed the download path for CUDA runtime libraries (cudart) to place them inside the specific backend's directory (/build/bin/) rather than a shared lib folder, improving asset isolation.
* fix: compare
* fix: mmap settings and adjust flash attention
* fix: correct flash_attn and main_gpu flag checks in llamacpp extension
Previously the condition for `flash_attn` was always truthy, causing
unnecessary or incorrect `--flash-attn` arguments to be added. The
`main_gpu` check also used a loose inequality which could match values
that were not intended. The updated logic uses strict comparison and
correctly handles the empty string case, ensuring the command line
arguments are generated only when appropriate.
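The corrected guard can be sketched as follows; the accepted values mirror the 'auto'/'on'/'off' dropdown introduced earlier in this log, and the function name is hypothetical:

```rust
// Emit --flash-attn only for an explicit, meaningful value. The previous
// condition was effectively always truthy, so the flag was added even for
// an empty setting.
fn add_flash_attn(args: &mut Vec<String>, flash_attn: &str) {
    if matches!(flash_attn, "auto" | "on" | "off") {
        args.push("--flash-attn".into());
        args.push(flash_attn.into());
    }
}
```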
This commit introduces a new field, `is_embedding`, to the `SessionInfo` structure to clearly mark sessions running dedicated embedding models.
Key changes:
- Adds `is_embedding` to the `SessionInfo` interface in `AIEngine.ts` and the Rust backend.
- Updates the `loadLlamaModel` command signatures to pass this new flag.
- Modifies the llama.cpp extension's **auto-unload logic** to explicitly **filter out** and **not unload** any currently loaded embedding models when a new text generation model is loaded. This is a critical performance fix to prevent the embedding model (e.g., used for RAG) from being repeatedly reloaded.
Also includes minor code style cleanup/reformatting in `jan-provider-web/provider.ts` for improved readability.
* feat: Adjust RAM/VRAM calculation for unified memory systems
This commit refactors the logic for calculating **total RAM** and **total VRAM** in `is_model_supported` and `plan_model_load` commands, specifically targeting systems with **unified memory** (like modern macOS devices where the GPU list may be empty).
The changes are as follows:
* **Total RAM Calculation:** If no GPUs are detected (`sys_info.gpus.is_empty()` is true), **total RAM** is now set to `0`. This avoids confusing total system memory with dedicated GPU memory when planning model placement.
* **Total VRAM Calculation:** If no GPUs are detected, **total VRAM** is still calculated as the system's **total memory (RAM)**, as this shared memory acts as VRAM on unified memory architectures.
This adjustment improves the accuracy of memory availability checks and model planning on unified memory systems.
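The rule can be captured in a small worked sketch (function and parameter names are assumptions, not the actual command signatures):

```rust
// Returns (usable_ram, usable_vram). With no discrete GPUs reported,
// all system memory is treated as VRAM (unified memory, e.g. Apple
// Silicon) and the separate RAM budget is set to zero.
fn plan_memory(total_system_mem: u64, gpu_vram: &[u64]) -> (u64, u64) {
    if gpu_vram.is_empty() {
        (0, total_system_mem)
    } else {
        (total_system_mem, gpu_vram.iter().sum())
    }
}
```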
* fix: total usable memory in case there is no system vram reported
* chore: temporarily change to self-hosted runner mac
* ci: revert back to github hosted runner macos
---------
Co-authored-by: Louis <louis@jan.ai>
Co-authored-by: Minh141120 <minh.itptit@gmail.com>
The KV cache size calculation in `estimate_kv_cache_internal` now includes a fallback mechanism for models that do not explicitly define `key_length` and `value_length` in the GGUF metadata.
If these attention keys are missing, the head dimension (and thus the key/value length) is computed as `embedding_length / total_heads`. This improves robustness and compatibility with GGUF models that lack the proper keys in their metadata.
Also adds logging of the full model metadata for easier debugging of the estimation process.
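The fallback amounts to a one-line computation; for example, a model with `embedding_length = 4096` and 32 heads gets a head dimension of 128 when `key_length` is missing:

```rust
// Use the metadata value when present, otherwise derive the head
// dimension from embedding_length / total_heads. Sketch of the fallback
// described above; the real estimator handles more metadata keys.
fn head_dim(key_length: Option<u64>, embedding_length: u64, total_heads: u64) -> u64 {
    key_length.unwrap_or(embedding_length / total_heads)
}
```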