
Research from the team.

Open papers from the RunInfra team on attention efficiency, LLM inference, kernel optimization, and the architectures behind production AI infrastructure.

Part of RightNow / Research lab co-designing models and hardware

Compute efficiency

Faster, leaner, memory-bounded ways to run models on existing hardware.

  1. arXiv 2605.02568
    cs.LG, 2026
    • Sparse Attention
    • Streaming Top-k
    • Triton
    • Jaber Jaber
    • Osama Jaber

    StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    A memory-bounded sparse attention mechanism that selects top-k keys in a single streaming pass over the sequence, fused as a Triton kernel for production inference workloads.

  2. arXiv 2603.21365
    cs.LG, 2026
    • LLM Inference
    • Early Exit
    • Efficiency
    • Jaber Jaber
    • Osama Jaber

    TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

    Per-token early exit in LLM inference, driven by token-informed depth signals that decide when each token has enough computation to commit to a final logit.

  3. arXiv 2603.21331
    cs.LG, 2026
    • GPU
    • Kernel Optimization
    • Agents
    • Jaber Jaber
    • Osama Jaber

    AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

    Autonomous GPU kernel optimization via iterative agent-driven search, using LLM agents to explore the kernel design space and validate candidates on real hardware.
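
To make the "single streaming pass" idea in the StreamIndex entry concrete, here is a minimal sketch of one-pass top-k selection with a bounded min-heap. It is not the paper's Triton kernel. The function names `streaming_top_k` and `dot`, the dot-product scoring, and the toy query and keys are all assumptions chosen for illustration. The point it shows is that memory stays at O(k) no matter how long the sequence is.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple


def streaming_top_k(scores: Iterable[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (index, score) pairs in one pass.

    Only a size-k min-heap is kept while scores stream by, so memory
    is O(k) regardless of sequence length. (Hypothetical helper, not
    an API from the paper.)
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap keyed on score
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # New score beats the weakest kept entry, so swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report the survivors in descending score order.
    return [(idx, score) for score, idx in sorted(heap, reverse=True)]


def dot(q: Sequence[float], key: Sequence[float]) -> float:
    """Plain dot product, used here as an assumed attention score."""
    return sum(a * b for a, b in zip(q, key))


# Toy data (assumed): one query and four keys.
query = [1.0, 0.0]
keys = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]]

# Scores are produced lazily, so the full score vector is never stored.
selected = streaming_top_k((dot(query, key) for key in keys), k=2)
print(selected)
```

Attention would then be computed only over the selected key indices. A fused GPU kernel would organize this selection very differently, so treat the heap as a stand-in for the bounded-memory property rather than a description of the actual implementation.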

Model architectures

New ways to compose, compute, and adapt model internals at training and inference time.

  1. arXiv 2604.02051
    cs.LG, 2026
    • Transformers
    • LoRA
    • Weight Generation
    • Jaber Jaber
    • Osama Jaber

    Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

    A recursive transformer where each layer generates its own weights through input-conditioned LoRA modulation, enabling dynamic capacity allocation without storing additional parameters.

  2. arXiv 2603.29090
    cs.LG, 2026
    • World Models
    • Object-Centric
    • Causal
    • Jaber Jaber
    • Osama Jaber

    HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

    Hierarchical causal latent state machines for object-centric world modeling, with explicit slots for entities and the causal relationships between them.
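
As a rough illustration of what "input-conditioned LoRA modulation" in the Ouroboros entry could look like, the sketch below adds a low-rank update to a shared weight matrix and scales each rank component with a gate computed from the input. This is an assumed formulation, not the paper's: the gate matrix `G`, the sigmoid, the function name `modulated_lora_forward`, and the toy values are all hypothetical. Under these assumptions the effective weight is W + B diag(g(x)) A, so it changes with every input while W itself stays fixed.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def modulated_lora_forward(x: Vector, W: Matrix, A: Matrix,
                           B: Matrix, G: Matrix) -> Vector:
    """Compute y = W x + B (g(x) * (A x)) with g(x) = sigmoid(G x).

    W: shared base weights (d_out x d_in)
    A: LoRA down-projection (r x d_in)
    B: LoRA up-projection (d_out x r)
    G: assumed gating projection (r x d_in), one gate per rank component
    """
    base = matvec(W, x)
    down = matvec(A, x)
    gate = [sigmoid(z) for z in matvec(G, x)]  # depends on the input
    up = matvec(B, [g * d for g, d in zip(gate, down)])
    return [b + u for b, u in zip(base, up)]


# Toy values (assumed): d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
G = [[0.0, 0.0]]  # zero gate logits, so every gate equals 0.5

y = modulated_lora_forward([2.0, 4.0], W, A, B, G)
print(y)
```

In a recursive transformer the same block would be applied repeatedly, with the gate giving each pass a different effective weight. How the paper actually generates and conditions its LoRA factors is not stated on this page.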


© 2026 RunInfra. All rights reserved.