Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
n > 1, temperature, top-p, presence/frequency penalties, all pass straight through to the upstream vLLM / SGLang / TRT-LLM server.
Minimal request
Streaming
Setstream: true. Response is SSE; each event is a JSON delta, terminated by data: [DONE].
Function / tool calling
Passtools + tool_choice. The model can decide to call tools; your code executes them and sends results back as follow-up messages. Identical shape to the OpenAI API.
Structured output
Request parameters
The model id (e.g.
"llama-3.3-70b"). Must be a model deployed in your workspace. Get the list via GET /v1/models.Array of messages. Each has
role ("system", "user", "assistant", "tool") and content (string). Tool calls use the same format as the OpenAI API.Return an SSE stream of delta chunks.
Maximum tokens to generate.
Sampling temperature, 0-2. Lower is more deterministic.
Nucleus sampling cutoff.
Up to 4 sequences that halt generation.
Function-calling tool definitions. Forwarded unchanged.
"auto" | "none" | "required" | { type: "function", function: { name } }.{ type: "json_object" } or { type: "json_schema", json_schema: {...} }.For deterministic sampling on supported models.
Return token-level log probabilities.
Number of top logprobs per token (0-5).
Number of completions to generate.
Penalty for repeating tokens already in the output,, 2 to 2.
Penalty for frequent-token usage,, 2 to 2.