Role Overview

We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to keep our infrastructure running at maximum performance. Experience with large-scale infrastructure automation is a strong plus.

Responsibilities

  • Design, implement and maintain scalable AI/ML infrastructure solutions.
  • Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
  • Automate deployment, configuration and management of infrastructure resources.
  • Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades and decommissioning.
  • Implement CI/CD pipelines for infrastructure deployment and orchestration.
  • Ensure security, compliance and best practices across infrastructure.
  • Manage incident response for infrastructure resources (GPU, CPU, storage, network).
  • Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
  • Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
  • Travel regionally and internationally to GMI data center locations.

Qualifications

  • Bachelor's degree in Computer Science or related field.
  • 3+ years of experience in data center operations, infrastructure, or systems engineering.
  • Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
  • Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
  • Familiarity with Linux system administration and scripting (Python, Bash).
  • Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
  • Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
  • Strong troubleshooting skills and ability to analyze system logs and performance metrics.
  • Excellent communication and teamwork abilities.
  • Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.