Principal Software Engineer - Distributed Training System

2 Weeks ago • 6 Years + • Research & Development • Undisclosed

About the job

Job Description

The Principal Software Engineer will design and implement a distributed training system for trillion-parameter machine learning models. Responsibilities include optimizing training and inference on GPUs, implementing streaming training and publishing of these models, analyzing metrics to identify improvement opportunities, and developing scalable solutions. Collaboration with cross-functional teams is essential. The role requires expertise in high-performance C++, CUDA, Python, or C#, experience with machine learning and TensorFlow/PyTorch distributed training, and strong problem-solving and communication skills. The team works on various aspects of online advertising, impacting millions of users and advertisers.
Must have:
  • 6+ years software engineering experience
  • High-performance C++, CUDA, Python, or C# coding
  • Machine learning & TensorFlow/PyTorch experience
  • Distributed training system design & implementation
  • GPU utilization and optimization
  • Strong problem-solving and debugging skills
Good to have:
  • Ads, search, or content service domain knowledge
Perks:
  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Overview

The MAI Ads team in Microsoft APRD is responsible for providing the advertising industry with a state-of-the-art online advertising platform and service. Our team is at the core of this effort, working on research and development in the following areas: Selection (recall), Relevance, User Response Prediction (Click Prediction and Conversion Prediction), Autobidding, Large Language Models, and Large-Scale Machine Learning & Serving Systems. The team is a world-class R&D team of passionate and talented scientists and engineers who aspire to solve challenging problems and turn innovative ideas into high-quality products and services that help hundreds of millions of users and advertisers and directly impact our business.

Qualifications

• A Bachelor's, Master's, or PhD degree in CS/EE or a related area is required.
• 6+ years of industry experience in software engineering.
• Solid experience shipping high-performance code in C++, CUDA, Python, C#, or an equivalent language.
• Experience with machine learning and TensorFlow/PyTorch distributed training is preferred.
• Domain knowledge of ads, search, or content services is a plus.
• Quick learner with solid problem-solving and debugging skills.
• Good communication skills; fluent in English (both oral and written).


Responsibilities

• Design and implement a distributed training system for trillion-parameter machine learning models.
• Drive our team's efforts around utilization and optimization of training and inference on GPUs.
• Design and implement streaming training and publishing of trillion-parameter machine learning models.
• Analyze metrics and identify opportunities based on offline and online testing; develop and deliver robust and scalable solutions.
• Collaborate with cross-functional teams to deliver high-quality solutions.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
• Industry-leading healthcare
• Educational resources
• Discounts on products and services
• Savings and investments
• Maternity and paternity leave
• Generous time away
• Giving programs
• Opportunities to network and connect

About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.
