Senior System Software Engineer, Distributed Systems - DGX Cloud

1 Month ago • 6 Years + • DevOps • $148,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior System Software Engineer specializing in distributed systems for its DGX Cloud platform. The role involves designing, developing, and optimizing solutions for datacenter firmware, and collaborating with hardware and software teams to ensure seamless integration across the system. Responsibilities include automating GPU asset provisioning, configuration, and lifecycle management across cloud providers, defining reliability and availability requirements, and driving failure analysis. The ideal candidate has strong Python programming skills on Linux, system-level expertise, familiarity with industry standards (SPI, I2C, PCIe, UEFI, PLDM), and experience with distributed systems. This is a full-time position with remote options.
Must have:
  • 6+ years experience with Python & Linux
  • Distributed systems understanding
  • System programming (Go/Python)
  • Familiarity with SPI, I2C, PCIe, UEFI, PLDM
  • Data structures & algorithms expertise
Good to have:
  • Machine check architecture knowledge
  • Linux server design, x86/ARM architecture
  • Experience with large-scale distributed systems
  • Cloud AI infrastructure operational excellence
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is hiring engineers to scale up its AI Infrastructure. We expect you to have a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities. We also welcome out-of-the-box thinkers who can contribute new ideas and have a strong bias toward execution. Expect to be constantly challenged, improving, and evolving for the better. You and other engineers in this team will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications that affect core data science. If you're creative, passionate about what you do, and love having fun, what are you waiting for? Apply today!

What you’ll be doing:

  • Design and architect a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.

  • Design, develop, test, debug, and optimize creative solutions for datacenter firmware throughout its lifecycle.

  • Work closely with hardware, software, infrastructure, and business teams to transform new firmware features from idea to reality.

  • Define server-level reliability, availability, and serviceability requirements in collaboration with customers such as CSPs, and deliver fault-resilient solutions at scale that meet customer expectations.

  • Collaborate with hardware, software, and firmware teams to drive failure analysis and large-scale solution deployment.

  • Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications.

What we need to see:

  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience), with 6+ years of active development experience using Python as the primary programming language on Linux.

  • Highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.

  • Familiarity with industry standards and specifications such as SPI, I2C, PCIe, UEFI and PLDM.

  • System knowledge - how platform management works - areas like BMC-BIOS communication, thermal management, power management, firmware update, device monitoring, firmware security, etc.

  • Expert-level knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

  • Understanding of performance, security and reliability in complex distributed systems. Familiarity with system level architecture, data synchronization, fault tolerance and state management.

Ways to stand out from the crowd:

  • In-depth understanding of how machine check architecture and error flows interact with system firmware/software.

  • Familiarity with Linux server design, x86/ARM system architecture, and interconnects such as PCI and other I/O buses.

  • Proven operational excellence in designing and maintaining cloud AI infrastructure. Proficiency in architecting and running large-scale distributed systems, independent of cloud providers.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

The base salary range is 148,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

