Senior Site Reliability Engineer (SRE) - Kubernetes Focus
- ₹15L – ₹40L • 0.1% – 0.5%
- Remote
- 4 years of exp
- Full Time
Onsite or remote
Alert Mend
About the job
About AlertMend.io
At AlertMend.io, we’re on a mission to simplify and automate incident management in Kubernetes environments. Our SaaS platform empowers teams to define, diagnose, and remediate infrastructure issues with streamlined workflows and AI-driven insights, ensuring uptime and operational efficiency. Join us in building cutting-edge solutions that are transforming how businesses manage their cloud infrastructure!
What You’ll Do
As a Senior SRE, you will be responsible for designing, building, and maintaining systems that power AlertMend.io’s platform. You will collaborate with our development team to ensure high reliability, scalability, and automation for Kubernetes-based environments. This role involves working closely with our customers to solve complex infrastructure issues, optimize workflows, and improve the platform’s overall performance.
Responsibilities
- Lead the development and execution of reliability strategies for our Kubernetes-based infrastructure.
- Automate remediation workflows and enhance our platform's integration with tools like Prometheus, Grafana, Alertmanager, and Slack/MS Teams.
- Collaborate with engineering teams to implement best practices for Kubernetes, cloud-native technologies, and GitOps.
- Develop and maintain infrastructure-as-code scripts for deployments on AWS, GCP, Azure, or self-hosted Kubernetes environments.
- Provide in-depth troubleshooting for complex platform issues, including persistent volume claim (PVC) failures, problematic pod states (Pending, ImagePullBackOff), and root cause analysis.
- Help define SLAs and SLOs, and improve incident response processes.
- Mentor junior engineers and share knowledge across teams.
Requirements
- 5+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, preferably in Kubernetes-focused environments.
- Strong expertise in Kubernetes (EKS, AKS, GKE, or self-hosted), container orchestration, and cloud-native technologies.
- Experience with monitoring and observability tools like Prometheus, Grafana, and integration with alerting systems such as Alertmanager.
- Proficiency with infrastructure automation (Terraform, Helm, GitOps).
- Solid scripting and coding skills (Bash).
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure.
- Strong understanding of networking, security, and Kubernetes operational patterns.
- Excellent communication skills and the ability to work collaboratively in a remote team environment.
Nice to Have
- Experience with AI/ML applications in incident management and automation.
- Familiarity with tools like PagerDuty, Jira, and ServiceNow for incident management and notifications.
- Experience in optimizing workflows for platform operations teams and SREs.
What We Offer
- Competitive salary and equity package.
- Fully remote work with flexible hours.
- A collaborative and dynamic team focused on building the future of cloud operations.
- Opportunities for professional growth and career advancement.
- Supportive work environment that values innovation and problem-solving.