Site Reliability Engineer (SRE)

Full time

Negotiable

5BT2 Me Tri Ha, Nam Tu Liem, Hanoi

Reliability & Operations

Operate and optimize AWS workloads across EC2, ECS/Fargate, and EKS, ensuring predictable latency and throughput under production load.
Architect highly available, self-healing systems leveraging AWS features such as Auto Scaling Groups, Multi-AZ RDS, ALB/NLB failover, and S3 replication.
Define, measure, and enforce SLIs/SLOs/SLAs for availability, latency (p95/p99), error rates, and saturation.
Lead capacity planning, load testing, and performance benchmarking to prevent bottlenecks and optimize scaling.
Validate resiliency through chaos testing, disaster recovery drills, and automated failover across multi-AZ and cross-region environments.

Observability & Incident Management

Build and maintain observability platforms using the ELK stack (Elasticsearch, Logstash, Kibana) for centralized logging and real-time analytics.
Integrate Prometheus, Grafana, and CloudWatch for system metrics and application telemetry, aligned with error budgets.
Develop automated alerting pipelines with CloudWatch Alarms and anomaly detection to reduce MTTR.
Lead incident response, conduct blameless RCAs, and define preventive measures to continuously improve system reliability.

Performance & Cost Efficiency

Continuously optimize system performance by analyzing resource utilization, tuning AWS services (RDS, ElastiCache, EKS), and benchmarking workloads.
Apply FinOps practices to ensure cost-efficient reliability: rightsizing, autoscaling policies, S3 lifecycle management, Graviton adoption, and Savings Plans.
Balance reliability vs. cost trade-offs using error budgets and performance-per-dollar metrics.

Security & Compliance

Secure production systems through least-privilege IAM, VPC isolation, GuardDuty and Security Hub monitoring, and automated patching.
Integrate security monitoring and compliance checks into reliability workflows.

Standardization & Knowledge Sharing

Maintain runbooks, architecture diagrams, SLO/SLI definitions, and incident response playbooks for consistent operations.
Provide standardized deployment templates (Terraform modules, Helm charts) to accelerate safe, reliable releases.
Foster an SRE culture by embedding reliability reviews, chaos engineering, and error budget discussions into team processes.

Requirements

Bachelor’s or college degree in Information Technology, Mathematics – Informatics, Electronics & Telecommunications, or an equivalent field.
Minimum of 5 years of experience in SRE.
Proven experience in deploying and operating infrastructure on AWS (EC2, S3, RDS, IAM, VPC, etc.).
Proficient in Linux and system administration; able to write Bash scripts and basic code (Java, Python, .NET, Go…).
Hands-on experience with CI/CD and IaC tools (Jenkins, GitLab CI, ArgoCD, Ansible, Terraform), artifact/repository management (Nexus, JFrog, Docker Registry), and secrets management with Vault.
Skilled in deploying and operating applications on VMs, Docker, and Kubernetes; good understanding of microservice and monolithic architectures and of GitOps.
Experience with service mesh, load balancers (HAProxy, Nginx, Kong), caching, and queue systems is preferred.
Ability to monitor, analyze, and optimize system performance.
Able to read and understand technical documents in English; strong communication, teamwork, and cross-functional collaboration skills.
Systematic mindset, proactive about improvements, and a strong sense of responsibility.

Benefits

Annual salary review
13th-month salary bonus, National Day bonus, New Year bonus, etc.
Annual health checkups
Comprehensive employee care: birthdays, weddings, childbirth, illnesses, etc.
Social insurance and leave policies in accordance with company policy and legal regulations
Internal offer: 50% staff discount on beverages at the company's own café
Internal activities: Year-End Party, Team Building, Parties, etc.

Opportunities to learn, join internal training programs, and work with the Management Board and experienced professionals from large corporations such as Vingroup, OneMount, Viettel, MoMo, VNPay, FPT, Tiki, etc.
Quarterly awards for outstanding initiatives and achievements