We are looking for a Head of SRE to lead the design and management of a distributed infrastructure project. The role involves building a system from scratch, overseeing its deployment, maintenance, and uptime, and growing a global SRE team. The candidate will focus on scaling infrastructure, automation, and performance optimization across regions such as APAC and LATAM, while fostering a culture of continuous improvement and excellence.
Responsibilities:
- Lead infrastructure design, ensuring high availability and scalability
- Build and mentor a global SRE team that provides 24/7 support
- Define SLAs for uptime and performance, with an emphasis on automation
- Implement strategies for monitoring, incident response, and rapid recovery
- Collaborate with engineering teams on scalable architecture and processes
- Oversee security best practices and compliance
- Manage tools for infrastructure automation and incident management
- Ensure cost-effective vendor management and comprehensive documentation
Qualifications:
- 10+ years in SRE or infrastructure engineering, with 5+ years in leadership roles
- Experience managing large-scale cloud systems (AWS, GCP, Azure)
- Strong skills in automation (Terraform, Ansible) and scripting (Python, Bash)
- Expertise in Docker, Kubernetes, and network infrastructure
- Proven ability to meet SLAs and manage global teams
- Strong knowledge of CI/CD pipelines, incident management, and security practices
- Leadership, communication, and project management skills
Bonus Skills:
- Experience with decentralized or distributed systems
- Familiarity with observability tools (OpenTelemetry, Jaeger)
- Multi-cloud and hybrid cloud knowledge
- AWS, GCP, or Azure certifications
- Understanding of security frameworks (SOC 2, ISO 27001) and experience working in agile environments