Staff Software Engineer, AI/ML Telemetry Debugging Tools

Washington, United States (On-site)

1 Month ago • 8-13 Years • Artificial Intelligence • $189,000 PA - $284,000 PA

Job Summary

Job Description

This Staff Software Engineer role focuses on building large-scale distributed systems for ML workload monitoring and diagnostics. You'll apply distributed systems principles and ML expertise to create systems providing insights into ML workload performance. Responsibilities include driving technical strategy for ML workload profiling and debugging, building consensus across multiple platforms, developing infrastructure for diagnosing model performance issues, empowering ML engineers with performance optimization tools, and designing systems with iterative milestones. The role requires expertise in distributed systems, ML, performance analysis, and large-scale system design. You'll work within Google Cloud's fast-paced environment, collaborating with various teams to enhance the performance and observability of AI/ML workloads on GCP.

Must have:

8+ years software development experience
5+ years ML experience
5+ years large-scale system experience
Experience with distributed systems
Expertise in performance analysis and debugging

Good to have:

Experience with IaaS in accelerators
Kubernetes/GKE experience
GPU programming
TensorFlow experience

Perks:

Bonus
Equity
Benefits

7 skills required

tensorflow

algorithms

kubernetes

data-structures

pytorch

google-cloud-platform

networking

Job Details

Minimum qualifications:

Bachelor's degree or equivalent practical experience.
8 years of experience in software development and with data structures and algorithms.
5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
5 years of experience with performance, large-scale systems data analysis, visualization tools, or debugging.
5 years of experience in the Machine Learning field.
Experience with distributed systems.

Preferred qualifications:

Experience with IaaS in accelerators.
Experience building infrastructure and tooling for models and for diagnosing failures.
Experience in designing and implementing large-scale distributed systems.
Experience with one or more related competencies (e.g., Kubernetes, Google Kubernetes Engine, GPU programming, or TensorFlow).

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customer needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.

The Cloud ML Compute Services team is accountable for defining and driving the overall Cloud ML Compute IaaS product offering and technical strategy. We are leveraging Google's AI leadership to differentiate GCP and delight our customers with the best ML and high-performance computing (HPC) platform in the world for top talent, powered by TPUs, GPUs, and CPUs and supporting all ML frameworks (TensorFlow, PyTorch, and JAX).

In this role, you will build large-scale distributed systems for ML workload monitoring and diagnostics, combining distributed systems principles with ML expertise to build systems that provide insights into the performance degradation of ML workloads. You will be passionate about solving model convergence problems and building observability capabilities for AI/ML customers.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

The US base salary range for this full-time position is $189,000-$284,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. The range displayed on each job posting reflects the minimum and maximum target salaries for the position across all US locations. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits.

Responsibilities

Drive the technical strategy and roadmap for profiling large-scale ML workloads and debugging workload issues in real time.
Build consensus and alignment across multiple Product Area platforms, coreML, Google Compute Engine (GCE), and other ML teams to build a system that serves customer ML operations.
Build infrastructure and tooling that diagnose model performance issues, provide remediation steps, and give internal and external customers the observability to monitor workloads running on Google Cloud Platform (GCP).
Partner with and empower ML engineers, data scientists, and ML framework teams to optimize model performance on GCP through the tooling and capabilities needed for ideation.
Design and develop systems with incremental milestones, iterating as newer models launch in the market.

About The Company

Google

698 Active Jobs

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.
