Role Overview
We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for our infrastructure. Experience with large-scale infrastructure automation is a strong plus.
Responsibilities
- Design, implement and maintain scalable AI/ML infrastructure solutions.
- Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
- Automate deployment, configuration and management of infrastructure resources.
- Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades, and decommissioning.
- Implement CI/CD pipelines for infrastructure deployment and orchestration.
- Ensure security, compliance and best practices across infrastructure.
- Manage incident response for infrastructure resources (GPU, CPU, storage, network).
- Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
- Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
- Travel regionally and internationally to GMI data center locations.
Qualifications
- Bachelor's degree in Computer Science or related field.
- 3+ years of experience in data center operations, infrastructure, or systems engineering.
- Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
- Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
- Familiarity with Linux system administration and scripting (Python, Bash).
- Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
- Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
- Strong troubleshooting skills and ability to analyze system logs and performance metrics.
- Excellent communication and teamwork abilities.
- Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.