RunInfra by RightNow

Research from the team.

Open papers from the RunInfra team on attention efficiency, LLM inference, kernel optimization, and the architectures behind production AI infrastructure.

Part of RightNow / Research lab co-designing models and hardware

Compute efficiency

Faster, leaner, memory-bounded ways to run models on existing hardware.

01
arXiv 2606.09682
cs.LG 2026
- Megakernels
- CUDA
- Agents
- Jaber Jaber
- Osama Jaber
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis
Compiles a Llama-family model into one persistent CUDA megakernel with no hand-written CUDA, statically certifies every agent-proposed schedule deadlock-free and race-free before launch, and self-improves the kernel through an unattended agent loop.
PDF arXiv Code
02
arXiv 2603.21331
cs.LG 2026
- GPU
- Kernel Optimization
- Agents
- Jaber Jaber
- Osama Jaber
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Uses LLM agents to iteratively search the GPU kernel design space, validating candidates on real hardware.
PDF arXiv Code
03
arXiv 2605.02568
cs.LG 2026
- Sparse Attention
- Streaming Top-k
- Triton
- Jaber Jaber
- Osama Jaber
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
A memory-bounded sparse attention mechanism that selects top-k keys in a single streaming pass over the sequence, fused as a Triton kernel for production inference workloads.
PDF arXiv Code
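
To make the idea of selecting the top-k keys in a single streaming pass concrete, here is a minimal Python sketch. It is not the paper's Triton kernel: the function name `streaming_top_k` and the use of a binary min-heap are assumptions chosen for illustration. It shows only the general pattern of keeping the k best scores seen so far, so memory stays proportional to k rather than to the sequence length.

```python
import heapq
from typing import Iterable, List, Tuple


def streaming_top_k(scores: Iterable[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest (index, score) pairs after one pass over `scores`.

    Hypothetical helper for illustration; only k entries are held at a time.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap: smallest kept score at heap[0]
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # New score beats the weakest kept entry, so swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report the kept entries from highest to lowest score.
    return sorted(((i, s) for s, i in heap), key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Scores could be query-key dot products arriving one at a time.
    print(streaming_top_k([0.1, 0.9, 0.3, 0.7, 0.5], k=2))
```

Because the input is consumed as an iterable, the scores never need to be materialized as a full list, which is the property the "memory-bounded" framing depends on.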
04
arXiv 2603.21365
cs.LG 2026
- LLM Inference
- Early Exit
- Efficiency
- Jaber Jaber
- Osama Jaber
TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Lets each token exit the layer stack early during LLM inference, using token-informed depth signals to decide when it has received enough computation to commit to a final logit.
PDF arXiv Code
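
As a rough illustration of per-token early exit, the sketch below runs one token through a stack of layers and stops as soon as a confidence signal crosses a threshold. Everything here is an assumption for illustration: the name `run_with_early_exit`, the toy layers, and the choice of maximum softmax probability as the exit signal are not taken from the paper, whose depth signals may be defined quite differently.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_with_early_exit(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    to_logits: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[Vector, int]:
    """Apply layers in order; stop once the top probability reaches `threshold`.

    Returns the logits at the exit point and the number of layers executed.
    The exit signal (max softmax probability) is an assumed stand-in.
    """
    depth = 0
    logits = to_logits(hidden)
    for layer in layers:
        hidden = layer(hidden)
        depth += 1
        logits = to_logits(hidden)
        if max(softmax(logits)) >= threshold:
            break  # this token is confident enough; skip remaining layers
    return logits, depth


if __name__ == "__main__":
    # Toy layers that double the hidden state, sharpening the prediction each step.
    toy_layers = [lambda h: [2.0 * x for x in h]] * 4
    logits, depth = run_with_early_exit([1.0, 0.0], toy_layers, lambda h: h, 0.95)
    print(depth, logits)
```

Each token gets its own exit depth, so tokens that become confident early skip the remaining layers while harder tokens run the full stack.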

Model architectures

New ways to compose, compute, and adapt model internals at training and inference time.

01
arXiv 2604.02051
cs.LG 2026
- Transformers
- LoRA
- Weight Generation
- Jaber Jaber
- Osama Jaber
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
A recursive transformer where each layer generates its own weights through input-conditioned LoRA modulation, enabling dynamic capacity allocation without storing additional parameters.
PDF arXiv Code
02
arXiv 2603.29090
cs.LG 2026
- World Models
- Object-Centric
- Causal
- Jaber Jaber
- Osama Jaber
HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
An object-centric world model built on hierarchical causal latent state machines, with explicit slots for entities and the causal relationships between them.
PDF arXiv Code