RunInfra is part of RightNow, a research lab co-designing models and hardware. The papers below are the public output of that work, focused on two areas that matter to anyone running open models in production. The full reading list with PDFs, arXiv links, and code repos lives at runinfra.ai/research. This page is a brief index of what is published and how to cite it.
Research areas
Compute efficiency
Faster, leaner, memory-bounded ways to run models on existing hardware: sparse attention, early exit, and autonomous GPU kernel search.
Model architectures
New ways to compose, compute, and adapt model internals at training and inference time: recursive transformers and causal world models.
Published papers
Compute efficiency
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
A memory-bounded sparse attention mechanism that selects top-k keys in a single streaming pass over the sequence, fused as a Triton kernel for production inference workloads.
arXiv: 2605.02568 (PDF)
Code: github.com/RightNow-AI/streamindex
Tags: Sparse Attention, Streaming Top-k, Triton
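As a rough illustration of what "streaming top-k" means in general, the sketch below keeps the k highest-scoring positions in a single pass using a fixed-size min-heap, so memory stays at O(k) however long the sequence is. It is a plain-Python toy, not the StreamIndex kernel: the function name `streaming_top_k` and the idea of scoring keys with a flat list of floats are assumptions made for this example, and the paper's Triton implementation may work quite differently.

```python
import heapq
from typing import Iterable, List, Tuple


def streaming_top_k(scores: Iterable[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest (index, score) pairs seen in one pass over `scores`.

    Hypothetical helper for illustration only; not the StreamIndex API.
    Memory use is O(k): only the current best k candidates are kept.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap of (score, index)
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # New score beats the weakest kept candidate: swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report the survivors from highest to lowest score.
    return sorted(((i, s) for s, i in heap), key=lambda p: p[1], reverse=True)


# Toy attention scores for five key positions.
selected = streaming_top_k([0.1, 0.9, 0.3, 0.7, 0.5], k=2)
```

Because the input is consumed as an iterable, the same function works on a generator that produces scores block by block, which is the property that keeps memory bounded.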
TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Per-token early exit in LLM inference, driven by token-informed depth signals that decide when each token has enough computation to commit to a final logit.
arXiv: 2603.21365 (PDF)
Code: github.com/RightNow-AI/tide
Tags: LLM Inference, Early Exit, Efficiency
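For readers new to early exit, the generic idea can be sketched as a loop that stops running layers once an intermediate prediction looks confident enough. The example below uses the maximum softmax probability as the exit signal. That choice is an assumption standing in for TIDE's token-informed depth signal, which this page does not specify, and the names `early_exit_decode`, `readout`, and `threshold` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_decode(
    hidden: float,
    layers: Sequence[Callable[[float], float]],
    readout: Callable[[float], Sequence[float]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Run layers in order and stop once the readout is confident.

    Returns (predicted_token, layers_used). Illustrative only: the exit
    rule here (max softmax probability) is an assumed stand-in.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    probs: List[float] = []
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(readout(hidden))
        if max(probs) >= threshold:
            break  # this token has had enough computation
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, depth


# Toy model: each layer doubles a scalar state; the readout scores two tokens.
toy_layers = [lambda h: 2.0 * h] * 4
toy_readout = lambda h: [h, 0.0]
token, layers_used = early_exit_decode(1.0, toy_layers, toy_readout)
```

In this toy setup the token exits after two of the four layers, because confidence crosses the 0.9 threshold at that depth. A harder token, or a stricter threshold, would run deeper.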
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Autonomous GPU kernel optimization via iterative agent-driven search, using LLM agents to explore the kernel design space and validate candidates on real hardware.
arXiv: 2603.21331 (PDF)
Code: github.com/RightNow-AI/autokernel
Tags: GPU, Kernel Optimization, Agents
Model architectures
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
A recursive transformer where each layer generates its own weights through input-conditioned LoRA modulation, enabling dynamic capacity allocation without storing additional parameters.
arXiv: 2604.02051 (PDF)
Code: github.com/RightNow-AI/ouroboros
Tags: Transformers, LoRA, Weight Generation
HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
Hierarchical causal latent state machines for object-centric world modeling, with explicit slots for entities and the causal relationships between them.
arXiv: 2603.29090 (PDF)
Code: github.com/RightNow-AI/hclsm
Tags: World Models, Object-Centric, Causal