Research - RunInfra

RunInfra is part of RightNow, a research lab co-designing models and hardware. The papers below are the public output of that work, focused on two areas that matter to anyone running open models in production. The full reading list with PDFs, arXiv links, and code repos lives at runinfra.ai/research. This page is a brief index of what is published and how to cite it.

Research areas

Compute efficiency

Faster, leaner, memory-bounded ways to run models on existing hardware. Current topics: sparse attention, early exit, and autonomous GPU kernel search.

Model architectures

New ways to compose, compute, and adapt model internals at training and inference time. Current topics: recursive transformers and causal world models.

Published papers

Compute efficiency

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

A memory-bounded sparse attention mechanism that selects top-k keys in a single streaming pass over the sequence, fused as a Triton kernel for production inference workloads.

arXiv: 2605.02568 (PDF)
Code: github.com/RightNow-AI/streamindex
Tags: Sparse Attention, Streaming Top-k, Triton
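To make "streaming top-k" concrete, here is a minimal sketch of the general idea in plain Python: keep only the k best scores seen so far while making one pass over the sequence. It is an illustration under assumptions, not the StreamIndex kernel. The function name `streaming_top_k` and the min-heap approach are hypothetical choices made for this example; the paper's Triton implementation may work quite differently.

```python
import heapq
from typing import Iterable, List, Tuple


def streaming_top_k(scores: Iterable[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (index, score) pairs in a single pass.

    Hypothetical helper for illustration only. Memory stays O(k) however
    long the sequence is, because only a size-k min-heap is retained.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap keyed on score
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # New score beats the weakest kept score: swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report in descending score order.
    return [(idx, score) for score, idx in sorted(heap, reverse=True)]


if __name__ == "__main__":
    # Toy attention scores for five keys; keep the best two.
    print(streaming_top_k([0.1, 0.9, 0.3, 0.7, 0.5], k=2))
```

Because the input is consumed as an iterable, the scores never need to be materialized in full, which is the property "memory-bounded" refers to in this sketch.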

TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Per-token early exit in LLM inference, driven by token-informed depth signals that decide when each token has enough computation to commit to a final logit.

arXiv: 2603.21365 (PDF)
Code: github.com/RightNow-AI/tide
Tags: LLM Inference, Early Exit, Efficiency
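The general shape of per-token early exit can be sketched as follows. This is a toy illustration, not TIDE itself: the per-layer confidence values and the fixed threshold rule are assumptions standing in for the paper's token-informed depth signal, which this page does not describe, and `exit_depths` is a hypothetical name.

```python
from typing import List, Sequence


def exit_depths(per_layer_confidence: Sequence[Sequence[float]],
                threshold: float) -> List[int]:
    """For each token, return how many layers run before it exits.

    per_layer_confidence[t][l] is an assumed confidence score for token t
    after layer l. A token exits at the first layer whose confidence
    reaches the threshold; otherwise it runs every layer.
    """
    depths = []
    for token_conf in per_layer_confidence:
        depth = len(token_conf)  # default: no early exit
        for layer_number, conf in enumerate(token_conf, start=1):
            if conf >= threshold:
                depth = layer_number
                break
        depths.append(depth)
    return depths


if __name__ == "__main__":
    # Two tokens, three layers each; the first token exits after layer 2.
    print(exit_depths([[0.2, 0.95, 0.99], [0.1, 0.3, 0.6]], threshold=0.9))
```

The point of the sketch is that the exit decision is made independently for each token, so easy tokens use fewer layers than hard ones.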

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Autonomous GPU kernel optimization via iterative agent-driven search, using LLM agents to explore the kernel design space and validate candidates on real hardware.

arXiv: 2603.21331 (PDF)
Code: github.com/RightNow-AI/autokernel
Tags: GPU, Kernel Optimization, Agents

Model architectures

Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

A recursive transformer where each layer generates its own weights through input-conditioned LoRA modulation, enabling dynamic capacity allocation without storing additional parameters.

arXiv: 2604.02051 (PDF)
Code: github.com/RightNow-AI/ouroboros
Tags: Transformers, LoRA, Weight Generation

HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

Hierarchical causal latent state machines for object-centric world modeling, with explicit slots for entities and the causal relationships between them.

arXiv: 2603.29090 (PDF)
Code: github.com/RightNow-AI/hclsm
Tags: World Models, Object-Centric, Causal

Authors

All papers are joint work by Jaber Jaber (RunInfra founder, RightNow) and Osama Jaber (RightNow). Correspondence to jaber@runinfra.ai.

How to cite

Each paper's canonical BibTeX entry is on its arXiv page. Use the arXiv id as the identifier, as in this entry for StreamIndex:

@misc{streamindex2026,
  title  = {StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k},
  author = {Jaber, Jaber and Jaber, Osama},
  year   = {2026},
  eprint = {2605.02568},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
}
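If you cite several of the papers, an entry in the same shape can be generated from the arXiv id. The helper below is a convenience sketch: `arxiv_bibtex` is a hypothetical name, and the year is derived from the YYMM prefix of new-style arXiv ids. `primaryClass` is emitted only when you pass it, since this page gives the class for StreamIndex only; the arXiv page remains the canonical source.

```python
from typing import Optional


def arxiv_bibtex(key: str, title: str, authors: str, arxiv_id: str,
                 primary_class: Optional[str] = None) -> str:
    """Format a @misc BibTeX entry for an arXiv preprint (hypothetical helper)."""
    # New-style arXiv ids start with YYMM, so "2605.02568" means May 2026.
    year = 2000 + int(arxiv_id[:2])
    fields = [
        ("title", title),
        ("author", authors),
        ("year", str(year)),
        ("eprint", arxiv_id),
        ("archivePrefix", "arXiv"),
    ]
    if primary_class is not None:
        fields.append(("primaryClass", primary_class))
    body = "\n".join(f"  {name} = {{{value}}}," for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(arxiv_bibtex(
        key="streamindex2026",
        title="StreamIndex: Memory-Bounded Compressed Sparse Attention "
              "via Streaming Top-k",
        authors="Jaber, Jaber and Jaber, Osama",
        arxiv_id="2605.02568",
        primary_class="cs.LG",
    ))
```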
