Alertmend.io
AI-Powered SRE and Platform Troubleshooting Companion

Senior Site Reliability Engineer (SRE) - Kubernetes Focus

  • ₹15L – ₹40L • 0.1% – 0.5%
  • Remote
  • 4 years of exp
  • Full Time
Posted: 1 month ago
Visa Sponsorship

Not Available

Remote Work Policy

Onsite or remote

Hires remotely in
Relocation

Allowed
Skills
Automation
AWS Cloud Services
Microsoft Azure
Kubernetes
GCP
Hiring contact

Alert Mend

About the job

About AlertMend.io
At AlertMend.io, we’re on a mission to simplify and automate incident management in Kubernetes environments. Our SaaS platform empowers teams to define, diagnose, and remediate infrastructure issues with streamlined workflows and AI-driven insights, ensuring uptime and operational efficiency. Join us in building cutting-edge solutions that are transforming how businesses manage their cloud infrastructure!

What You’ll Do
As a Senior SRE, you will be responsible for designing, building, and maintaining systems that power AlertMend.io’s platform. You will collaborate with our development team to ensure high reliability, scalability, and automation for Kubernetes-based environments. This role involves working closely with our customers to solve complex infrastructure issues, optimize workflows, and improve the platform’s overall performance.

Responsibilities

  • Lead the development and execution of reliability strategies for our Kubernetes-based infrastructure.
  • Automate remediation workflows and enhance our platform's integration with tools like Prometheus, Grafana, Alertmanager, and Slack/MS Teams.
  • Collaborate with engineering teams to implement best practices for Kubernetes, cloud-native technologies, and GitOps.
  • Develop and maintain infrastructure-as-code scripts for deployments on AWS, GCP, Azure, or self-hosted Kubernetes environments.
  • Provide in-depth troubleshooting and root cause analysis for complex platform issues, including persistent volume claim (PVC) problems and pod status failures (Pending, ImagePullBackOff).
  • Help define SLAs and SLOs, and improve incident response processes.
  • Mentor junior engineers and share knowledge across teams.

Requirements

  • 5+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, preferably in Kubernetes-focused environments.
  • Strong expertise in Kubernetes (EKS, AKS, GKE, or self-hosted), container orchestration, and cloud-native technologies.
  • Experience with monitoring and observability tools like Prometheus, Grafana, and integration with alerting systems such as Alertmanager.
  • Proficiency with infrastructure automation (Terraform, Helm, GitOps).
  • Solid scripting and coding skills (Bash).
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure.
  • Strong understanding of networking, security, and Kubernetes operational patterns.
  • Excellent communication skills and the ability to work collaboratively in a remote team environment.

Nice to Have

  • Experience with AI/ML applications in incident management and automation.
  • Familiarity with tools like PagerDuty, Jira, and ServiceNow for incident management and notifications.
  • Experience in optimizing workflows for platform operations teams and SREs.

What We Offer

  • Competitive salary and equity package.
  • Fully remote work with flexible hours.
  • A collaborative and dynamic team focused on building the future of cloud operations.
  • Opportunities for professional growth and career advancement.
  • Supportive work environment that values innovation and problem-solving.

About the company

Alertmend.io
AI-Powered SRE and Platform Troubleshooting Companion
1-10 Employees

Founders

Alert Mend
Founder • 3 years
India
