Senior SRE Engineer (R3386)
Shield AI
What You'll Do:
- Design, implement, and maintain robust monitoring, logging, and alerting systems
- Define incident response procedures and participate in on-call rotations
- Identify and resolve reliability and performance issues across services
- Develop automation tools to streamline operations and reduce manual intervention
- Collaborate with engineering teams to ensure new services are production-ready
- Conduct root cause analyses and implement post-incident improvements
- Champion a culture of reliability, observability, and operational excellence
Required Qualifications:
- 5+ years of experience in Site Reliability Engineering, DevOps, or a related field
- Strong experience with AWS services (EC2, ECS/EKS, RDS, IAM, etc.)
- Deep understanding of Kubernetes and containerized deployments
- Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
- Strong scripting or programming skills (Python, Go, Bash, etc.)
- Experience with infrastructure-as-code (Terraform, CloudFormation, or similar)
- Solid understanding of networking, Linux systems, and distributed architectures
Preferred Qualifications:
- Experience with service meshes (e.g., Istio or Linkerd)
- Familiarity with security best practices in cloud environments
- Exposure to GitOps workflows and tools (e.g., ArgoCD or Flux)