About the job
Summary by Outscal
Bungie seeks a Data Reliability Engineer to design, deploy, and maintain highly available data infrastructure, including Kafka, RabbitMQ, Redis, Elasticsearch, and Graphite. You'll troubleshoot issues, ensure data security, and collaborate with engineering teams on projects and services. Must have experience with Linux, infrastructure automation, and distributed production environments.
Data Reliability Engineering at Bungie is a core team of the Central Tech area that keeps our games and tooling running at scale. Our team owns the overall scalability, observability, and resilience of the databases, data processing platforms, and in-memory key-value stores used throughout the Bungie ecosystem. We partner with our engineering teams and business units on projects, services, designs, and processes. We are the stewards of architecture and provide tools and services to enable engineering teams to meet their design requirements.
RESPONSIBILITIES
- Design, deploy, and maintain highly available and scalable data infrastructure components including Kafka, RabbitMQ, Redis, Elasticsearch, and Graphite
- Perform capacity planning and scalability assessments for data platforms
- Troubleshoot and resolve issues related to data processing pipelines, message queuing, and performance, including participation in an on-call rotation
- Ensure data security, integrity, and compliance with industry best practices and regulatory requirements
- Document system configurations, procedures, and operational knowledge
- Advise service owners on industry and company standards and best practices
- Maintain reliability and performance levels for core data platform infrastructure
- Define and implement a data observability strategy
- Define and document a data ownership strategy
REQUIRED SKILLS
- Strong understanding of Linux operating systems and their administration
- Strong communication skills and the ability to collaborate effectively in a team environment
- Experience with infrastructure automation and configuration management (e.g., Ansible, Terraform…)
- Excellent troubleshooting skills and the ability to analyze and resolve complex infrastructure resource and application deployment issues
- Experience working in a distributed production environment
- Deep understanding of cluster management areas, such as scaling, consistency tuning, replication, and multi-datacenter configuration
- Familiarity with time-series monitoring systems and tools (e.g., Datadog, Prometheus, Grafana, and ELK)
- Experience designing and implementing logging and metric pipelines