Senior System Software Engineer, NCCL - Partner Enablement

3 Days ago • 5 Years + • DevOps • Research & Development • $148,000 PA - $287,500 PA

Job Summary

Job Description

NVIDIA's GPU Communications Libraries and Networking team seeks a Senior System Software Engineer to focus on NCCL (NVIDIA Collective Communications Library) partner enablement. Responsibilities include troubleshooting functional and performance issues with NCCL, conducting performance analysis on GPU clusters, developing diagnostic tools and automation, providing HPC expertise to customers and support teams, creating training materials and webinars, and collaborating with internal teams across different time zones. The role requires deep expertise in parallel programming, high-performance networking (InfiniBand, RoCE, Ethernet), Linux, and a scripting language (preferably Python).
Must have:
  • 5+ years relevant experience
  • Parallel programming & communication runtime experience
  • Excellent C/C++ programming skills
  • HPC or AI community support experience
  • High-performance networking expertise (InfiniBand/RoCE/Ethernet)
  • Linux fundamentals & Python scripting
Good to have:
  • HPC cluster infrastructure experience
  • System administration experience (large clusters)
  • Network configuration debugging in large deployments
  • CUDA programming and/or GPU familiarity
  • Deep Learning framework experience (PyTorch, TensorFlow)
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL. Most DL/HPC applications run on large clusters with high-speed networking (InfiniBand, RoCE, Ethernet). This is an outstanding opportunity to get an end-to-end understanding of the AI networking stack. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Engage with our partners and customers to root cause functional and performance issues reported with NCCL

  • Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters

  • Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)

  • Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters

  • Document and conduct trainings/webinars for NCCL

  • Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

What we need to see:

  • B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience. Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design

  • Experience working with engineering or academic research community supporting HPC or AI

  • Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control

  • Expert in Linux fundamentals and a scripting language, preferably Python

  • Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)

  • Adaptability and passion to learn new areas and tools

  • Flexibility to work and communicate effectively across different teams and time zones

Ways to stand out from the crowd:

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters. Prior system administration experience, especially for large clusters. Experience debugging network configuration issues in large-scale deployments

  • Familiarity with CUDA programming and/or GPUs. Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow

  • Deep understanding of technology and passion for what you do

The base salary range is 148,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


Santa Clara, California, United States (On-Site)
