Research from the team.
Open papers from the RunInfra team on attention efficiency, LLM inference, kernel optimization, and the architectures behind production AI infrastructure.
Compute efficiency
Faster, leaner, memory-bounded ways to run models on existing hardware.
- 01 · arXiv 2605.02568 · cs.LG · 2026
- Sparse Attention
- Streaming Top-k
- Triton
Jaber Jaber
Osama Jaber
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
A memory-bounded sparse attention mechanism that selects top-k keys in a single streaming pass over the sequence, fused as a Triton kernel for production inference workloads.
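As a rough illustration of single-pass top-k selection, here is a minimal sketch that keeps only k candidates in memory while scanning a stream of attention scores. It is not the paper's Triton kernel: the function name `streaming_top_k` and the heap-based approach are assumptions made for illustration.

```python
import heapq
from typing import Iterable, List, Tuple


def streaming_top_k(scores: Iterable[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest (index, score) pairs in one pass with O(k) memory.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap of (score, index)
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # The new score beats the weakest kept candidate: swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report results from highest to lowest score.
    return [(idx, score) for score, idx in sorted(heap, reverse=True)]


scores = [0.1, 2.5, -0.3, 1.7, 0.9, 3.2]
top = streaming_top_k(scores, k=3)
```

Memory stays proportional to k regardless of sequence length, which is the property the "memory-bounded" label refers to.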
- 02 · arXiv 2603.21365 · cs.LG · 2026
- LLM Inference
- Early Exit
- Efficiency
Jaber Jaber
Osama Jaber
TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Per-token early exit in LLM inference, driven by token-informed depth signals that decide when each token has received enough computation to commit to a final logit.
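To make the idea of per-token early exit concrete, the sketch below stops at the first layer whose prediction is confident enough. The softmax-confidence threshold is an assumed stand-in for the paper's token-informed depth signal, and `early_exit_depth` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_depth(
    per_layer_logits: Sequence[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Return (exit_layer, token_id) for a single token.

    Assumption: the exit signal is the max softmax probability at each layer.
    Falls back to the last layer if no layer reaches the threshold.
    """
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            return layer, token
    return layer, token  # values from the final layer


layers = [
    [0.1, 0.2, 0.0],  # nearly uniform: not confident
    [1.0, 3.0, 0.5],  # token 1 at roughly 0.82 probability
    [0.0, 6.0, 0.0],  # token 1 at roughly 0.995 probability
]
exit_layer, token_id = early_exit_depth(layers, threshold=0.8)
```

With a threshold of 0.8 the token exits at the second layer (index 1) and skips the third.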
- 03 · arXiv 2603.21331 · cs.LG · 2026
- GPU
- Kernel Optimization
- Agents
Jaber Jaber
Osama Jaber
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Autonomous GPU kernel optimization via iterative agent-driven search, using LLM agents to explore the kernel design space and validate candidates on real hardware.
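The propose, validate, and measure loop described here can be sketched generically. Everything below is hypothetical: `propose` stands in for an LLM agent, `is_correct` for a correctness check against a reference, and `measure` for a benchmark on real hardware. The toy cost function exists only to keep the example deterministic.

```python
from typing import Callable, Tuple, TypeVar

C = TypeVar("C")


def iterative_search(
    initial: C,
    propose: Callable[[C, float], C],
    is_correct: Callable[[C], bool],
    measure: Callable[[C], float],
    rounds: int,
) -> Tuple[C, float]:
    """Greedy search: keep a candidate only if it is correct and faster.

    Assumes the initial candidate is already known to be correct.
    """
    best, best_cost = initial, measure(initial)
    for _ in range(rounds):
        candidate = propose(best, best_cost)
        if not is_correct(candidate):
            continue  # reject candidates that fail validation
        cost = measure(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Toy stand-ins: the "kernel" is just a tile size, and cost is lowest at 128.
best_tile, best_cost = iterative_search(
    initial=16,
    propose=lambda tile, _cost: tile * 2,
    is_correct=lambda tile: tile <= 256,
    measure=lambda tile: abs(tile - 128) + 1.0,
    rounds=6,
)
```

In a real setup the measurement would be noisy, so a practical loop would likely repeat timings and compare with a margin rather than a strict less-than.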
Model architectures
New ways to compose, compute, and adapt model internals at training and inference time.
- 01 · arXiv 2604.02051 · cs.LG · 2026
- Transformers
- LoRA
- Weight Generation
Jaber Jaber
Osama Jaber
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
A recursive transformer where each layer generates its own weights through input-conditioned LoRA modulation, enabling dynamic capacity allocation without storing additional parameters.
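One way to picture input-conditioned LoRA modulation is a shared weight matrix plus a low-rank update whose strength depends on the input. The sketch below is an assumption-laden toy, not the paper's formulation: the sigmoid gate, the matrices `A`, `B`, `G`, and the function names are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def modulated_lora_forward(
    x: Vector, W: Matrix, A: Matrix, B: Matrix, G: Matrix
) -> Vector:
    """Compute y = W x + B (gate(x) * (A x)) with gate(x) = sigmoid(G x).

    W is the shared base weight; A and B form a rank-r update; G produces
    one input-dependent gate per rank component (an assumed design).
    """
    base = matvec(W, x)
    low_rank = matvec(A, x)  # project down to rank r
    gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(G, x)]
    delta = matvec(B, [g * h for g, h in zip(gate, low_rank)])
    return [b + d for b, d in zip(base, delta)]


def recursive_forward(x, W, A, B, G, steps: int) -> Vector:
    """Apply the same modulated layer repeatedly, as a recursive model would."""
    for _ in range(steps):
        x = modulated_lora_forward(x, W, A, B, G)
    return x


W = [[1.0, 0.0], [0.0, 1.0]]  # identity base weight
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[1.0], [0.0]]            # rank-1 up-projection
G = [[0.0, 0.0]]              # zero gate weights give a constant gate of 0.5
y = modulated_lora_forward([1.0, 2.0], W, A, B, G)
```

Because the gate is recomputed from the current input on every step, the same stored parameters yield a different effective weight at each recursion.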
- 02 · arXiv 2603.29090 · cs.LG · 2026
- World Models
- Object-Centric
- Causal
Jaber Jaber
Osama Jaber
HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
Hierarchical causal latent state machines for object-centric world modeling, with explicit slots for entities and the causal relationships between them.