Principal Software Engineer – PyTorch Training Frameworks

About the job

AMD is looking for a Principal-level PyTorch training framework expert to help drive performance, scalability, and correctness of large-scale AI training on AMD Instinct™ accelerators. You will work at the intersection of PyTorch internals, distributed training, and hardware-aware optimization, partnering closely with compiler, kernel, driver, and architecture teams to deliver industry-leading training performance and developer experience.

Responsibilities

Act as a technical authority for PyTorch training at AMD, setting direction for performance, scalability, and reliability

Drive optimization of key PyTorch training workloads (LLMs/foundation models) across single-node and multi-node systems

Improve and debug training performance in areas such as DDP/FSDP, gradient checkpointing, mixed precision, memory planning, and communication/computation overlap

Partner with ROCm compiler/runtime, kernel, and driver teams to resolve performance bottlenecks and correctness issues across the full stack

Contribute to and influence upstream PyTorch (design discussions, code contributions, performance fixes, CI/debug)

Develop and maintain representative training benchmarks, profiling workflows, and performance regression detection for key models

Lead deep-dive investigations of performance regressions and hard correctness issues; drive cross-team resolution to closure

Mentor engineers and raise the bar on framework-quality code, performance engineering practices, and technical rigor

Engage with strategic customers/partners on training enablement, root-cause analysis, and best practices for AMD platforms

Qualifications

Minimum

No minimum qualifications listed.

Preferred

Deep experience with PyTorch internals and training systems (autograd, optimizers, data loading, compilation paths, runtime behavior)

Strong distributed training expertise: DDP, FSDP, tensor/pipeline parallelism, collectives (NCCL/RCCL), multi-node debugging

Proven track record in performance engineering (profiling, tracing, kernel/runtime analysis, memory optimization, scaling studies)

Strong programming skills in Python and C/C++ (ability to land clean, maintainable changes in large codebases)

Familiarity with PyTorch ecosystem components such as TorchInductor / torch.compile, Triton, CUDA/HIP-style programming models, and performance tooling

Experience working across OS/hardware boundaries in Linux-based environments (experience with containers, CI, and drivers/runtimes is a plus)

Clear technical communication: design docs, code reviews, stakeholder updates, and cross-team coordination

Demonstrated ability to lead through influence (principal-level impact, mentoring, and architectural decision-making)