Site Reliability Engineer (SRE)

Employment type

Full-time

Salary

Negotiable

Location

5BT2 Me Tri Ha, Nam Tu Liem, Hanoi

Working time

  • Working days: Monday to Friday
  • Working hours: 08:00 AM – 06:00 PM
  • Lunch break: 12:00 PM – 01:15 PM

Job Description

Reliability & Operations 

  • Operate and optimize AWS workloads across EC2, ECS/Fargate, and EKS, ensuring predictable latency and throughput under production load. 
  • Architect highly available, self-healing systems leveraging AWS features such as Auto Scaling Groups, Multi-AZ RDS, ALB/NLB failover, and S3 replication. 
  • Define, measure, and enforce SLIs/SLOs/SLAs for availability, latency (p95/p99), error rates, and saturation. 
  • Lead capacity planning, load testing, and performance benchmarking to prevent bottlenecks and optimize scaling. 
  • Validate resiliency through chaos testing, disaster recovery drills, and automated failover across multi-AZ and cross-region environments. 

Observability & Incident Management 

  • Build and maintain observability platforms using the ELK stack (Elasticsearch, Logstash, Kibana) for centralized logging and real-time analytics. 
  • Integrate Prometheus, Grafana, and CloudWatch for system metrics and application telemetry, aligned with error budgets. 
  • Develop automated alerting pipelines with CloudWatch Alarms and anomaly detection to reduce MTTR. 
  • Lead incident response, conduct blameless RCAs, and define preventive measures to continuously improve system reliability. 

Performance & Cost Efficiency 

  • Continuously optimize system performance by analyzing resource utilization, tuning AWS services (RDS, ElastiCache, EKS), and benchmarking workloads. 
  • Apply FinOps practices to ensure cost-efficient reliability: rightsizing, autoscaling policies, S3 lifecycle management, Graviton adoption, and Savings Plans. 
  • Balance reliability vs. cost trade-offs using error budgets and performance-per-dollar metrics. 

Security & Compliance 

  • Secure production systems with least-privilege IAM, VPC isolation, GuardDuty, Security Hub, and automated patching.
  • Integrate security monitoring and compliance checks into reliability workflows. 

Standardization & Knowledge Sharing 

  • Maintain runbooks, architecture diagrams, SLO/SLI definitions, and incident response playbooks for consistent operations. 
  • Provide standardized deployment templates (Terraform modules, Helm charts) to accelerate safe, reliable releases. 
  • Foster an SRE culture by embedding reliability reviews, chaos engineering, and error budget discussions into team processes. 

Experience & Skills

  • Bachelor’s or college degree in Information Technology, Mathematics and Informatics, Electronics and Telecommunications, or an equivalent field.
  • At least 5 years of experience in an SRE role.
  • Proven experience in deploying and operating infrastructure on AWS (EC2, S3, RDS, IAM, VPC, etc.). 
  • Proficient in Linux and system administration; able to write Bash scripts and basic code in a general-purpose language (Java, Python, .NET, Go, etc.).
  • Hands-on experience with CI/CD and IaC tools (Jenkins, GitLab CI, ArgoCD, Ansible, Terraform), artifact and repository management (Nexus, JFrog, Docker Registry), and secrets management with Vault.
  • Skilled in deploying and operating applications on VMs, Docker, and Kubernetes; good understanding of microservice and monolithic architectures and of GitOps.
  • Experience with service meshes, load balancers (HAProxy, Nginx, Kong), caches, and queue systems is a plus.
  • Ability to monitor, analyze, and optimize system performance. 
  • Able to read and understand technical documents in English; strong communication, teamwork, and cross-functional collaboration skills. 
  • Systematic mindset, proactive about improvement, and a strong sense of responsibility.

Benefits & Perks

  • Annual salary review 
  • 13th-month salary bonus, National Day bonus, New Year bonus, etc. 
  • Annual health checkups 
  • Comprehensive employee care: birthdays, weddings, childbirth, illnesses, etc. 
  • Social insurance and leave policies in accordance with company and legal regulations 
  • Internal offer: 50% staff discount on beverages at the company's own café
  • Internal activities: year-end party, team building, and other social events

Training & Rewards

  • Opportunities to learn through internal training programs and to work with the Management Board and experienced professionals from large corporations such as Vingroup, OneMount, Viettel, MoMo, VNPay, FPT, and Tiki
  • Quarterly awards for outstanding initiatives and achievements 
