Working time
- Working days: Monday to Friday
- Working hours: 08:00 AM – 06:00 PM
- Lunch break: 12:00 PM – 01:15 PM
Job Description
Reliability & Operations
- Operate and optimize AWS workloads across EC2, ECS/Fargate, and EKS, ensuring predictable latency and throughput under production load.
- Architect highly available, self-healing systems leveraging AWS features such as Auto Scaling Groups, Multi-AZ RDS, ALB/NLB failover, and S3 replication.
- Define, measure, and enforce SLIs/SLOs/SLAs for availability, latency (p95/p99), error rates, and saturation.
- Lead capacity planning, load testing, and performance benchmarking to prevent bottlenecks and optimize scaling.
- Validate resiliency through chaos testing, disaster recovery drills, and automated failover across multi-AZ and cross-region environments.
Observability & Incident Management
- Build and maintain observability platforms using the ELK stack (Elasticsearch, Logstash, Kibana) for centralized logging and real-time analytics.
- Integrate Prometheus, Grafana, and CloudWatch for system metrics and application telemetry, aligned with error budgets.
- Develop automated alerting pipelines with CloudWatch Alarms and anomaly detection to reduce MTTR.
- Lead incident response, conduct blameless RCAs, and define preventive measures to continuously improve system reliability.
Performance & Cost Efficiency
- Continuously optimize system performance by analyzing resource utilization, tuning AWS services (RDS, ElastiCache, EKS), and benchmarking workloads.
- Apply FinOps practices to ensure cost-efficient reliability: rightsizing, autoscaling policies, S3 lifecycle management, Graviton adoption, and Savings Plans.
- Balance reliability vs. cost trade-offs using error budgets and performance-per-dollar metrics.
Security & Compliance
- Secure production systems with least-privilege IAM, VPC isolation, GuardDuty, Security Hub, and automated patching.
- Integrate security monitoring and compliance checks into reliability workflows.
Standardization & Knowledge Sharing
- Maintain runbooks, architecture diagrams, SLO/SLI definitions, and incident response playbooks for consistent operations.
- Provide standardized deployment templates (Terraform modules, Helm charts) to accelerate safe, reliable releases.
- Foster an SRE culture by embedding reliability reviews, chaos engineering, and error budget discussions into team processes.
Experience & Skills
- Bachelor's or college degree in Information Technology, Mathematics – Informatics, Electronics & Telecommunications, or an equivalent field.
- Minimum of 5 years of experience in SRE.
- Proven experience in deploying and operating infrastructure on AWS (EC2, S3, RDS, IAM, VPC, etc.).
- Proficient in Linux and system administration; able to write Bash scripts and basic code (Java, Python, .NET, Go, etc.).
- Hands-on experience with CI/CD and IaC tools (Jenkins, GitLab CI, ArgoCD, Ansible, Terraform), artifact/repository management (Nexus, JFrog, Docker Registry), and secrets management with Vault.
- Skilled in deploying and operating applications on VMs, Docker, and Kubernetes; good understanding of microservice and monolithic architectures and of GitOps.
- Preferred: experience with service mesh, load balancers (HAProxy, Nginx, Kong), caching, and queueing systems.
- Ability to monitor, analyze, and optimize system performance.
- Able to read and understand technical documents in English; strong communication, teamwork, and cross-functional collaboration skills.
- Systematic mindset, proactive about improvement, and a strong sense of responsibility.
Benefits & Perks
- Annual salary review
- 13th-month salary bonus, National Day bonus, New Year bonus, etc.
- Annual health checkups
- Comprehensive employee care: birthdays, weddings, childbirth, illnesses, etc.
- Social insurance and leave policies in accordance with company and legal regulations
- Internal offer: 50% staff discount on beverages at the company's own café
- Internal activities: Year-End Party, Team Building, Parties, etc.
Training & Rewards
- Opportunities to join internal training programs and to work with the Management Board and experienced professionals from large corporations such as Vingroup, OneMount, Viettel, MoMo, VNPay, FPT, Tiki, etc.
- Quarterly awards for outstanding initiatives and achievements
Contact
- Name: Ms. Pham Thi Nga, Human Resources Executive
- Department: People Operations
- Email: [email protected]
- Website: https://cardoctor.com.vn/

