Research Engineer - RL Infrastructure at Prime Intellect

Work type: remote

Location: San Francisco | Remote

Salary: $150,000 – $300,000/yr

Type: Full-time

Summary

**Who this is for**

This role is designed for a systems engineer with a deep interest in the performance intricacies of large-scale AI model training. You'll focus on optimizing the core infrastructure that enables faster, more efficient, and more robust reinforcement learning and distributed training.

**Key highlights**

You will be responsible for building and optimizing the systems layer for massive RL training workloads, aiming to push training systems closer to hardware limits. This involves improving efficiency across compute, memory, and networking, and contributing to the architectural design of the RL training stack.

**You might be a good fit if you...**

- Have strong systems engineering experience in AI/ML infrastructure, particularly with large-scale model training.
- Possess deep familiarity with PyTorch and distributed training frameworks (e.g., DeepSpeed, FSDP, Megatron).
- Have experience optimizing training performance across kernels, memory, communication, or parallelization strategies.
- Understand GPU architecture, profiling, and performance debugging.
- Can identify and resolve bottlenecks across the entire training stack.

Job Description

## Building Open Superintelligence Infrastructure

Prime Intellect is building the open superintelligence stack: from frontier agentic models to the infrastructure that enables anyone to train, adapt, and deploy them.

We unify globally distributed compute into a single control plane and pair it with the full reinforcement learning post-training stack: environments, secure sandboxes, verifiable evaluations, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end RL at frontier scale, adapting models to real tools, workflows, and deployment environments.

We are looking for a Research Engineer to work on the systems layer behind large-scale RL training. This role is for someone who enjoys going deep on performance: optimizing kernels, improving memory and communication efficiency, scaling distributed workloads, and pushing the throughput and reliability of training systems closer to hardware limits.

If you care about making large-scale model training faster, cheaper, and more robust, we’d love to talk.

## What You’ll Work On

- Building and optimizing the systems layer for massive RL training workloads
- Improving efficiency across compute, memory, and networking
- Pushing the throughput and reliability of training systems closer to hardware limits
- Contributing to the architectural design of the RL training stack

## You May Be a Fit If You Have

- Strong systems engineering experience in AI/ML infrastructure, particularly with large-scale model training
- Deep familiarity with PyTorch and distributed training frameworks (e.g., DeepSpeed, FSDP, Megatron)
- Experience optimizing training performance across kernels, memory, communication, or parallelization strategies
- A solid understanding of GPU architecture, profiling, and performance debugging
- The ability to identify and resolve bottlenecks across the entire training stack

## Especially Exciting

## Why This Role Matters

The next frontier in AI will not be unlocked by models alone. It will be unlocked by systems that let those models train faster, adapt continuously, and operate across real environments at scale.

That infrastructure does not exist yet in the form the world needs.

We’re building it.

## Benefits & Perks

If you’re excited about building the systems foundation for frontier-scale RL and open superintelligence, we’d love to hear from you.
