Senior/Staff HPC Engineer
A pioneering AI-driven biotech organization is seeking a Senior- to Staff-level HPC Engineer to lead the development and optimization of its GPU-based computing infrastructure. This role is central to enabling large-scale training of deep learning models that power breakthroughs in biological research and healthcare innovation.
You'll be responsible for architecting and managing high-performance GPU clusters, implementing distributed training strategies, and collaborating with research teams to integrate scalable infrastructure into model development workflows. This is a high-impact role at the intersection of AI, biology, and systems engineering.
Key Responsibilities
- Design, deploy, and maintain high-performance GPU clusters, ensuring reliability, scalability, and efficient resource utilization.
- Implement distributed and parallel training techniques to accelerate deep learning model convergence across multi-GPU and multi-node environments.
- Optimize system performance through profiling, tuning, and analysis of deep learning workloads.
- Collaborate with AI researchers and ML engineers to integrate distributed training capabilities into model development pipelines.
- Develop strategies for resource allocation and prioritization to support growing computational demands.
- Troubleshoot and resolve issues related to cluster performance, training workflows, and infrastructure anomalies.
- Maintain clear and comprehensive documentation for cluster configurations, workflows, and operational best practices.
- Ensure compliance with data privacy and security standards, particularly in handling sensitive biological data.
Ideal Qualifications
- Master's or PhD in Computer Science, High-Performance Computing, Distributed Systems, or a related field.
- 2+ years of hands-on experience managing GPU clusters, including setup, configuration, and performance optimization.
- Deep expertise in distributed deep learning and parallel training methodologies.
- Proficiency with frameworks such as PyTorch, Megatron-LM, and DeepSpeed, and with CUDA and GPU-accelerated libraries such as cuDNN.
- Strong programming skills in Python, with experience in system-level optimization.
- Familiarity with profiling tools and performance tuning for HPC and deep learning environments.
- Experience with resource schedulers and orchestration tools (e.g., SLURM, Kubernetes).
- Background in cloud computing (AWS, GCP) and containerization (Docker).
This role offers the opportunity to shape the future of AI-powered biological discovery by building the infrastructure that enables cutting-edge model development. If you're passionate about high-performance systems and want to contribute to transformative science, we'd love to hear from you.
FAQs
What happens after I apply?
We understand that taking the time to apply is a big step. When you apply, your details go directly to the consultant sourcing talent for this role. Due to demand, we may not be able to get back to every applicant. However, we always keep your CV and details on file, so when we see similar roles, or skill sets that drive growth in organisations, we will reach out to discuss opportunities.
Should I apply even if I don't meet every requirement?
Yes. Even if this role isn't a perfect match, applying allows us to understand your expertise and ambitions, ensuring you're on our radar when the right opportunity arises.
Do you advertise all of your open roles?
We work in several ways. We advertise available roles on our site, but due to confidentiality we may not post them all. We also work with clients who are focused on skills and on understanding what is required to future-proof their business. That's why we recommend registering your CV, so you can be considered for roles that have yet to be created.
Do you offer support during the application process?
Yes, we help with CV and interview preparation. From customised support on how to optimise your CV, to interview preparation and compensation negotiations, we advocate for you throughout your next career move.