Site Reliability Engineer (SRE) Lead
- $220k – $300k
- Full Time
Liz Chmiel
About the job
About us.
Trumid is a dynamic fintech revolutionizing the landscape of fixed income trading. With intelligent, easy-to-use, electronic solutions, we are rapidly growing and seeking exceptional talent to help redefine the boundaries of technology and finance.
Founded in 2014 by a team of fixed income market experts, Trumid has quickly become one of the top three corporate bond e-trading platforms in the U.S. Today, over 1,300 traders from an extensive and expanding client network of 890+ buy- and sell-side institutions transact on Trumid monthly.
With a rich history of innovation and a unique ability to deliver it at scale, we collaborate closely with our clients, iterating quickly toward optimal solutions. With market share and client engagement at all-time highs and our pace of product development faster than ever, this is an exciting and transformative time at Trumid.
Our business model thrives on participation, and so does our company culture. We rely on every team member’s contribution to help us accomplish our goals. To succeed at Trumid, you must be curious, passionate about your craft, ambitious, collaborative, and driven. Learn more at www.trumid.com.
The opportunity.
Trumid is looking for a Lead Site Reliability Engineer (SRE) to ensure our systems' reliability, scalability, and performance as we continue to grow. This role offers a unique opportunity to shape our fast-growing firm's reliability practices and infrastructure. You will be crucial in optimizing our existing infrastructure, implementing new technologies, and enhancing our incident response capabilities.
As a Lead SRE, you will oversee the stability and performance of our trading platform, which serves a large and growing client base. You’ll work closely with development and DevOps teams to build scalable solutions and automate processes to enhance system reliability. You will also play a critical role in incident management, problem resolution, and capacity planning, ensuring that our systems meet our users' high expectations.
This role is ideal for someone passionate about reliability, automation, and efficiency. You will have the chance to lead initiatives that directly impact our platform's stability and user experience, ensuring that we maintain the highest levels of service availability.
Responsibilities will include:
- Transform the SRE function to evolve, simplify, and scale existing solutions. Innovate and create new solutions and practices where needed.
- Drive improvements in system reliability, scalability, and performance through innovative solutions and industry best practices.
- Lead incident response efforts, including troubleshooting, resolution, and conducting post-mortem analysis to prevent future incidents.
- Automate repetitive tasks to reduce manual intervention and improve operational efficiency.
- Collaborate closely with software development, DevOps, and infrastructure teams to embed reliability into the development lifecycle.
- Design, implement, and maintain highly available, scalable, and resilient infrastructure to meet the demands of our growing client base.
- Develop and maintain monitoring, logging, and alerting frameworks to ensure system health and to identify and resolve issues preemptively.
- Conduct capacity planning and performance tuning to support future growth.
About you.
- SRE expert with strong foundational knowledge of SRE best practices.
- Demonstrated hands-on experience managing large-scale, highly available cloud-based systems.
- Deep understanding of cloud components in at least one of the major cloud providers (e.g., AWS, GCP, Azure), including infrastructure, services, and tooling.
- Expertise in containerization and orchestration tools (e.g., Docker, Kubernetes) and experience with deployment strategies such as blue-green and canary deployments.
- Strong knowledge of CI/CD pipelines and experience in integrating reliability practices within CI/CD processes.
- Proficient with monitoring and observability tools (e.g., Prometheus, Grafana, Alertmanager) to ensure system health and to create effective alerting mechanisms.
- Experience with Infrastructure as Code (IaC) tools like Terraform and Ansible and experience automating infrastructure deployment and management.
- Excellent problem-solving skills, with a focus on diagnosing complex issues in large-scale distributed systems.
- Strong scripting and programming skills in Python, Bash, Go, or similar languages.
- Strong communication and collaboration skills, capable of working effectively with cross-functional teams in a fast-paced environment.
- Passion for reliability, automation, and continuous improvement.
- Bachelor's degree in computer science (or equivalent) and at least 10 years of professional experience at a fast-paced, technology-oriented company. Experience with financial and trading systems is a plus but not required.
Employee Benefits.
- Highly competitive compensation
- Fully paid medical, dental, and vision coverage
- Remote work
- Team-oriented and collaborative company culture
Trumid is an equal-opportunity employer.
In compliance with the New York City Pay Transparency Law, the base salary range for this role in New York City is between $220,000 and $300,000. This range does not include discretionary bonuses or other compensation or benefits offered with this job. Several factors are considered when determining a candidate’s salary.