B2B
Scale Stage
Rapidly increasing operations

Site Reliability Engineer

No equity | Remote • Atlanta | 1 year of experience | Full Time

Posted: 2 years ago

Visa Sponsorship: Not Available

Hires remotely in: United States

Relocation Allowed

Skills

Python

PHP

Customer Service

Networking

Ruby

Git

Infrastructure

Automation

Scalability

PHP Frameworks

SCRUM

Nginx

DevOps

DNS

Infrastructure Monitoring

Amazon Web Services

Amazon S3

Amazon RDS

Amazon SQS

Jenkins

PHPUnit

Network Security

Agile

Apache Tomcat

Capacity Planning

Performance Monitoring

Test Automation

Microsoft Windows

Agile/Scrum

Performance Tuning

Site Reliability Engineering

Performance Testing

Big Data Infrastructure

Agile Software Development

Teamcity

AWS/EC2/ELB/S3/DynamoDB

IT Infrastructure Management

Computer Networking

Reliability

Forecasting

Root Cause Analysis

System testing

Active Directory

agile methodologies

Amazon Redshift

Networking & TCP/IP

Monitoring

AWS Cloud Services

Performance Management

Docker

Amazon AWS EC2 API

Ansible

ec2

AWS S3

Capacity Building

Cloud Based Infrastructure

AWS CloudFormation

DevOps Engineering

Git & GitHub

Agile methodology

AWS Redshift

AWS RDS

AWS

Elasticache

AWS/EC2/S3

Reliability Engineering

Demand Planning/Forecasting

Microsoft Active Directory

Terraform

Apache Maven

Strategic Planning & Capacity Management

Reliability Testing

Database Performance Tuning

Amazon Lambda

Root Cause analysis and corrective plan

Redshift

HTTP/DHCP/DNS

SLA Management

Application deployment (Docker)

CI Jenkins

Web Servers (Apache / Nginx / Node)

Amazon Elasticache

AWS Lambda

Amazon EC2

Apache Web Server

Docker / Docker Compose / Kubernetes

Jira/FeatureBee/Teamcity/Confluence/Trac

Reliability and Autonomy

Amazon ECS

AWS/EC2/ELB/S3/DynamoDB/VPC/RDS/ElasticSearch

Networking: TCP/IP DNS DHCP VLAN

DNS, DHCP, UDP, TCP, IPv6, IPv4, RIP, SSH, HTTP, NAT/PAT, ARP/ND, ICMP

AWS/Lambda/DynamoDB/Cognito

AWS/EC2/S3/RDS

Problem Solving / Root Cause Failure Analysis

AWS (EC2/EMR/S3/RedShift)

Unit Testing TeamCity

Ansible/Docker

DevOps (Ansible)

Web Servers (Apache - Nginx)

Configuration Management/Ansible/Vagrant

99.9% Network Uptime

Cloudformation

SRE / DevOps

Amazon CloudFormation

Selenium WebDriver, Core Java, TestNG, Jenkins

AWS ElastiCache

Continuous Integration Server - Jenkins, TeamCity

IaC

AWS CodeBuild

DevOps/Linux/Docker/Jenkins/Chef/Puppet/Git

Gunicorn / Nginx

Root Cause Analysis and Problem Solving (8D, 5-Why, DMAIC)

AWS EC2 / S3 / Lambda / RDS / IAM

99.99+ Uptime

AWS Services, Linux, CI/CD Tools, Jenkins, Scripting Languages (Python, Bash)

DynamoDB/S3/SNS/SQS/CloudFormation/CodeBuild/CodeCommit/CloudFront/Route53/SES

Redshift Render

SRE

AWS EC2/ECS/ECR/S3/ElastiCache

Hashicorp Terraform

AWS Lambda, EC2, S3, SNS, SQS, Kinesis, Terraform

Kubernetes, Jenkins, Jira, Visual Studio, GitHub, Bitbucket, Terraform, Ansible

DevOps, Chef, Docker, Ansible, Jenkins, GitHub, Splunk, Terraform, Kubernetes, Maven

Infrastructure As Code (IaC)

CloudFormation, ECS, Kinesis, EMR, Security, X-Ray, AWS CodeCommit, AWS CodeBuild

About the job

QGenda is a fast-growing, Atlanta-based healthcare software company with an amazing corporate culture, where we strive to be the best place to be a customer. Our software is used by thousands of hospital departments around the world to automatically generate optimized physician work schedules that accommodate complex business rules and assign the appropriate medical provider based on skill level, specialty, availability, and preferences.

As a Site Reliability Engineer, you will work with our product development teams to increase the scalability, reliability, and performance of our systems. You’ll build and extend existing automation for the configuration and monitoring of our AWS-hosted applications. You’ll evaluate new AWS services and tools to determine whether they could be used in our environments. You’ll bring a focus on platform health and monitoring so that we can deliver the best possible experience to our customers.

Apply Online: https://qgenda.applytojob.com/apply/7W5gJYZ2Nq/Site-Reliability-Engineer

Site Reliability Engineer Key Responsibilities:

Assist in Development Operations:
Partner with software engineering teams to ensure scalability and reliability are designed and implemented in new features and products
Promote site reliability fundamentals across the Product Development department and the organization as a whole
Work closely with development and operations teams to build highly available, cost-effective systems

Build and Maintain Infrastructure:
Write automation code for provisioning and operating infrastructure
Oversee infrastructure, including its provisioning, for customer-facing applications hosted in AWS in production and pre-production environments
Maintain an understanding of new cloud computing capabilities on Amazon Web Services and look for opportunities to use them in our products

Ensure Application Uptime and Performance:
Use extensive metrics to identify issues before they impact our customers
Establish end-to-end monitoring and alerting on all critical aspects of the system to ensure SLAs are met and to receive proactive notifications of possible issues
Design platforms for extremely high uptime and ensure that our production SLAs are measured, monitored, and maintained
Identify the underlying root causes of critical production issues and recommend or implement long-term fixes
Participate in service capacity planning, demand forecasting, software performance analysis, and system tuning

Assure High Security Across the Application and Organization:
Troubleshoot problems across the entire cloud-based stack (network, databases, and application) and build automation to prevent recurrence
Develop effective documentation, tooling, and alerts to identify and address reliability risks
Participate in an on-call rotation with other members of the Development Team

Site Reliability Engineer Knowledge, Skills and Abilities:

Advanced proficiency with at least one scripting or programming language, preferably Ruby or Python
Solid Linux administration experience; experience with Windows and Active Directory is a plus
Strong experience supporting applications running Ruby, Python or PHP
Experience with Nginx, Apache, Docker or similar technologies
Hands-on experience building infrastructure and supporting applications in AWS using services such as Lambda, EC2, ECS, S3, SNS, SQS, RDS, Redshift, and ElastiCache
Strong understanding of networking and DNS
Familiarity with configuration management and infrastructure as code (IaC) tools such as Ansible, Terraform, or CloudFormation
Availability for off-hours deployment and upgrades of production systems during release and maintenance windows
Firm understanding of, and experience with, Agile and Scrum SDLC processes
Experience using a distributed version control system (Git preferred) for checking in code, branching, merging, pull requests, code review, etc.
Knowledge of CI/CD best practices and tools such as AWS CodeBuild, Jenkins and TeamCity
Experience designing and delivering secure, high-performance, and highly available cloud services
Experience working with stakeholders to define and track SLIs, SLOs and SLAs using metrics and monitoring to ensure the objectives are met or exceeded

Education / Professional Certifications or Licenses Required:

Bachelor's degree (B.S. preferred) from a major university in a related field

QGenda Compensation & Perks:

Competitive Salary
Bonus Eligible
401k Employer Match

QGenda Benefits & Culture:

Full Health and Dental (QGenda pays 100% of the individual premiums)
Employee-centric work culture
3 "Flex Hours" per week
Relaxed vacation policy
Company outings
Costco membership
Casual dress
Opportunity to be part of a fast-growing software company with hundreds of customers and thousands of users around the world