Senior SRE Software Engineer, Storage and Data

Taipei City, Taiwan (On-site)

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer at NVIDIA, you'll be responsible for ensuring the reliability, availability, and performance of storage infrastructures for the DGX Cloud platform. This involves developing strategies for redundancy and disaster recovery, continuously analyzing and optimizing storage systems, developing automation scripts, implementing monitoring and alerting systems, and participating in on-call rotations. You'll collaborate with cross-functional teams, troubleshoot issues, conduct root cause analysis, and work with AI/ML workloads. The role requires expertise in storage systems, SRE principles, and automation, along with experience with various tools and technologies like Ansible, Python, AWS S3, and monitoring stacks.

Must have:

5+ years experience
Storage system administration
SRE experience
Automation scripting
Linux system administration
Problem-solving skills
Collaboration skills

Good to have:

Experience with OpenStack Swift, AWS S3, DDN, Lustre
Strong Linux and network troubleshooting skills
Experience with Kubernetes, OpenStack, Docker
Experience with Ansible, Chef, Puppet, Terraform

Perks:

Competitive salary
Generous benefits package

15 skills required for this role

linux

aws

ansible

openstack

networking

cross-functional

problem-solving

java

bash

unity

github

puppet

kubernetes

restful-api

chef

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of the storage infrastructure for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
Implement monitoring and alerting systems to proactively identify and address issues.
Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.
Proven experience in storage system administration and site reliability engineering.
Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
Experience in one or more of the following languages and tools: Ansible, Bash, Python, Go, YAML, Java.
Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
Experience in using observability and tracing-related tools like InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
Strong Linux and network troubleshooting skills, using a variety of commands and tools.
A demonstrated SRE mindset, a customer-first approach, and a passion for ensuring customer success.
Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the most desirable employers in the world. We have some of the most brilliant and talented people in the world working for us. If you are creative, autonomous and love a challenge, we want to hear from you. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
