Inference

Alith is designed to integrate with modern inference engines through a unified interface architecture: each backend plugs into the same completion interface, so agent code does not change when the engine underneath does (see the sketch after the list below). The multi-backend design supports:

Core Inference Engines

  • ONNX Runtime: Production-grade execution with cross-platform optimizations.
  • llamacpp: Lightweight CPU inference with GGUF quantization support.
  • llamafile: Single-file deployment for edge computing scenarios.
  • vLLM: High-throughput GPU serving with PagedAttention.
  • SGLang: Advanced structured generation for complex workflows.
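Because every backend is exposed through the same completion interface, swapping engines leaves the agent layer untouched. The following is a minimal, hypothetical sketch: the EchoModel type is invented purely for illustration, while Completion, CompletionError, ResponseContent, ResponseToolCalls, ToolCall, and Agent mirror how they are used in the full ONNX Runtime example further down this page.

use alith::{Agent, Completion, CompletionError, ResponseContent, ResponseToolCalls, ToolCall};
use std::future::Future;

// A stand-in backend: any engine (ONNX Runtime, llamacpp, vLLM, ...) plugs in
// by implementing `Completion` and returning a response type that exposes its
// text content and tool calls.
pub struct EchoModel;

pub struct Response(String);

impl ResponseContent for Response {
    fn content(&self) -> String {
        self.0.clone()
    }
}

impl ResponseToolCalls for Response {
    fn toolcalls(&self) -> Vec<ToolCall> {
        vec![]
    }
}

impl Completion for EchoModel {
    type Response = Response;

    fn completion(
        &mut self,
        request: alith::Request,
    ) -> impl Future<Output = Result<Self::Response, CompletionError>> {
        // A real backend would run the prompt through its inference engine here.
        async move { Ok(Response(format!("echo: {}", request.prompt))) }
    }
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // The agent is constructed the same way regardless of which backend it wraps.
    let agent = Agent::new("echo agent", EchoModel, vec![]);
    println!("{}", agent.prompt("Hello!").await?);
    Ok(())
}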

Custom Operator Ecosystem

We extend framework capabilities through platform-specific optimizations (see the GPU execution sketch after this list):

  • Triton custom kernels for PyTorch acceleration
  • CUDA/HIP kernels for GPU-specific optimizations
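The custom kernels themselves live inside each backend; from user code, what you typically select is the execution provider that runs them. As a minimal sketch reusing the alith::inference::ort items from the full example below (names assumed to match that example), enabling CUDA execution looks like this:

use alith::inference::ort::{init, CUDAExecutionProvider};

fn main() -> Result<(), anyhow::Error> {
    // Register the CUDA execution provider once per process; sessions created
    // afterwards can offload supported operators to the GPU and fall back to
    // the default CPU kernels for everything else.
    init()
        .with_name("alith-gpu")
        .with_execution_providers([CUDAExecutionProvider::default().build()])
        .commit()?;
    Ok(())
}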

Integrations

ONNX Runtime

To demonstrate these integration capabilities, here is an example of invoking an ONNX model in Alith:

Note that running this program downloads the GPT-2 ONNX model and starts the inference engine locally, so the inference feature of the alith crate must be enabled.
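For example, a Cargo.toml entry along these lines enables the feature (the version is a placeholder; the example below additionally depends on tokio, rand, and anyhow):

[dependencies]
# Placeholder version; the "inference" feature name follows the note above.
alith = { version = "*", features = ["inference"] }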

use alith::{
    inference::{
        ort::{
            init, inputs, CUDAExecutionProvider, GraphOptimizationLevel, Result, Session,
            TensorRef,
        },
        tokenizers::Tokenizer,
    },
    Agent,
};
use alith::{Completion, CompletionError, ResponseContent, ResponseToolCalls, ToolCall};
use rand::Rng;
use std::future::Future;
use std::path::Path;

/// Max tokens to generate
const GEN_TOKENS: usize = 90;
/// Top_K -> Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens.
const TOP_K: usize = 5;

pub struct GPT2 {
    session: Session,
    tokenizer: Tokenizer,
}

pub struct Response(String);

impl ResponseToolCalls for Response {
    fn toolcalls(&self) -> Vec<ToolCall> {
        vec![]
    }
}

impl ResponseContent for Response {
    fn content(&self) -> String {
        self.0.to_string()
    }
}

impl Completion for GPT2 {
    type Response = Response;

    fn completion(
        &mut self,
        request: alith::Request,
    ) -> impl Future<Output = Result<Self::Response, alith::CompletionError>> {
        let mut rng = rand::thread_rng();
        async {
            let tokens = self
                .tokenizer
                .encode(request.prompt, false)
                .map_err(|err| CompletionError::Inference(err.to_string()))?;
            let mut tokens = tokens
                .get_ids()
                .iter()
                .map(|i| *i as i64)
                .collect::<Vec<_>>();

            let mut output = String::new();
            for _ in 0..request.max_tokens.unwrap_or(GEN_TOKENS) {
                // Raw tensor construction takes a tuple of (dimensions, data).
                // The model expects our input to have shape [B, _, S]
                let input = TensorRef::from_array_view((
                    vec![1, 1, tokens.len() as i64],
                    tokens.as_slice(),
                ))?;
                let outputs = self
                    .session
                    .run(inputs![input])
                    .map_err(|err| CompletionError::Inference(err.to_string()))?;
                let (dim, mut probabilities) = outputs["output1"]
                    .try_extract_raw_tensor()
                    .map_err(|err| CompletionError::Inference(err.to_string()))?;

                // The output tensor will have shape [B, _, S, V]
                // We want only the probabilities for the last token in this sequence,
                // which will be the next most likely token according to the model
                let (seq_len, vocab_size) = (dim[2] as usize, dim[3] as usize);
                probabilities = &probabilities[(seq_len - 1) * vocab_size..];

                // Sort each token by probability
                let mut probabilities: Vec<(usize, f32)> =
                    probabilities.iter().copied().enumerate().collect();
                probabilities.sort_unstable_by(|a, b| {
                    b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Less)
                });

                // Sample using top-k sampling
                let token = probabilities[rng.gen_range(0..=TOP_K)].0 as i64;

                // Add our generated token to the input sequence
                tokens.push(token);

                let token_str = self.tokenizer.decode(&[token as u32], true).unwrap();
                output.push_str(&token_str);
            }
            Ok(Response(output))
        }
    }
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Create the ONNX Runtime environment, enabling CUDA execution providers
    // for all sessions created in this process.
    init()
        .with_name("GPT-2")
        .with_execution_providers([CUDAExecutionProvider::default().build()])
        .commit()?;

    // Load GPT2 model
    let session = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level1)?
        .with_intra_threads(1)?
        .commit_from_url(
            "https://parcel.pyke.io/v2/cdn/assetdelivery/ortrsv2/ex_models/gpt2.onnx",
        )?;

    // Load the tokenizer and encode the prompt into a sequence of tokens.
    let tokenizer = Tokenizer::from_file(
        Path::new(env!("CARGO_MANIFEST_DIR"))
            .join("data")
            .join("tokenizer.json"),
    )
    .unwrap();

    let gpt2 = GPT2 { session, tokenizer };
    let agent = Agent::new("simple agent", gpt2, vec![])
        .preamble("You are a comedian here to entertain the user using humour and jokes.");
    let response = agent.prompt("Entertain me!").await?;
    println!("{}", response);

    Ok(())
}