About the job
AMD is looking for a Principal-level PyTorch training framework expert to help drive performance, scalability, and correctness of large-scale AI training on AMD Instinct™ accelerators. You will work at the intersection of PyTorch internals, distributed training, and hardware-aware optimization, partnering closely with compiler, kernel, driver, and architecture teams to deliver industry-leading training performance and developer experience.
Responsibilities
Act as a technical authority for PyTorch training at AMD, setting direction for performance, scalability, and reliability
Drive optimization of key PyTorch training workloads (LLMs/foundation models) across single-node and multi-node systems
Improve and debug training performance in areas such as DDP/FSDP, gradient checkpointing, mixed precision, memory planning, and communication/computation overlap
Partner with ROCm compiler/runtime, kernel, and driver teams to resolve performance bottlenecks and correctness issues across the full stack
Contribute to and influence upstream PyTorch (design discussions, code contributions, performance fixes, CI/debug)
Develop and maintain representative training benchmarks, profiling workflows, and performance regression detection for key models
Lead deep-dive investigations of performance regressions and hard correctness issues; drive cross-team resolution to closure
Mentor engineers and raise the bar on framework-quality code, performance engineering practices, and technical rigor
Engage with strategic customers/partners on training enablement, root-cause analysis, and best practices for AMD platforms
Qualifications
Minimum
No minimum qualifications listed.
Preferred
Deep experience with PyTorch internals and training systems (Autograd, optimizers, dataloading, compilation paths, runtime behavior)
Strong distributed training expertise: DDP, FSDP, tensor/pipeline parallelism concepts, collectives (NCCL/RCCL), multi-node debugging
Proven track record in performance engineering (profiling, tracing, kernel/runtime analysis, memory optimization, scaling studies)
Strong programming skills in Python and C/C++ (ability to land clean, maintainable changes in large codebases)
Familiarity with PyTorch ecosystem components such as TorchInductor / torch.compile, Triton, CUDA/HIP-style programming models, and performance tooling
Experience working across OS/hardware boundaries in Linux-based environments (experience with containers, CI, and drivers/runtimes is a plus)
Clear technical communication: design docs, code reviews, stakeholder updates, and cross-team coordination
Demonstrated ability to lead through influence (principal-level impact, mentoring, and architectural decision-making)