Site Reliability Engineer at Five9

Work type: remote

Location: United States (Remote)

Salary: $71,800 – $190,000/yr

Type: Full-time

Summary

This role is ideal for a mid-level engineer who thrives at the intersection of software development and systems operations. You should have at least 3 years of experience managing large-scale production environments and a strong grasp of the "SRE mindset": prioritizing automation and toil reduction over manual fixes. The team splits its time roughly 50/50 between coding and operational work, which suits someone who wants to keep their programming skills sharp while mastering cloud infrastructure.

The compensation range is broad ($71,800 to $190,000 per year), leaving room for growth as you progress. Flexibility is a major draw: the role is fully remote for those living more than 50 miles from San Ramon, CA, and hybrid for those living closer. Five9 also offers standout benefits, including 100% employer-paid healthcare premiums for employees and a generous equity program.

You might be a good fit if you:

• Have hands-on experience orchestrating containers with Kubernetes and Docker.
• Are proficient in Python or Java and can build your own automation tools.
• Are comfortable with a 24/7 on-call rotation and leading incident post-mortems.
• Have mastered Infrastructure as Code using tools like Terraform or Ansible.

Job Description

Join us in bringing joy to customer experience. Five9 is a leading provider of cloud contact center software, bringing the power of cloud innovation to customers worldwide.

Living our values every day results in our team-first culture and enables us to innovate, grow, and thrive while enjoying the journey together. We celebrate diversity and foster an inclusive environment, empowering our employees to be their authentic selves.

We are seeking a Site Reliability Engineer (SRE) to join our team and help build and maintain highly reliable, scalable systems. This role combines software engineering and operations expertise to ensure our services meet ambitious reliability targets while enabling rapid development and deployment. The position is approximately 50% software development and 50% operational work, focusing on automation, monitoring, and system reliability rather than manual operations. The team works collaboratively with our platform, application, and database teams to provide a reliable and available service.

Key Responsibilities:

• Dashboards & Metrics: Design and implement comprehensive dashboards covering OS/platform-level and application-level monitoring, organized into primary (RED) and secondary (USE) indicators.
• Availability & Reliability: Establish and maintain SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets for the service.
• Performance Monitoring: Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users.
• Incident Response: Participate in on-call rotations and lead incident response efforts, including post-mortem analysis and remediation. Maintain the official on-call routing. Assign application-level problems to the engineering team and track them to resolution.

• CI/CD Pipeline Management: Maintain continuous integration and deployment pipelines, working with our cloud and on-premises deployment teams.
• Infrastructure as Code: Develop and maintain infrastructure using tools like Terraform, Ansible, or similar.
• Configuration Management: Automate system configuration and ensure consistency across environments. Recommend and implement best practices for configuration control.

• Security Automation: Ensure security scanning systems are in place and review escalated vulnerabilities.
• Access Control: Maintain proper authentication, authorization, and audit logging systems.
• Compliance Reporting: Ensure systems meet regulatory requirements and industry standards.
• Security Incident Response: Participate in security incident response and remediation efforts.

• Resource Management: Monitor and optimize cloud resource usage and costs, watching for both planned and unplanned resource changes.
• Capacity Planning: Analyze usage patterns and plan for future capacity needs.
• Cost Analysis: Provide recommendations for cost-effective architecture and resource allocation.
• Right-sizing: Implement automated scaling and resource optimization strategies.

• Shared Infrastructure: Build and maintain common services like notification systems, caching layers, and message queues or third-party software stacks.
• Database Operations: Manage database reliability, performance, and scaling (where not handled by dedicated DB teams).
• Service Mesh & Networking: Implement and maintain service discovery, load balancing, and network policies.
• Developer Tools: Create and maintain tools and platforms that improve developer productivity and system reliability.

Required Qualifications:

• Production Systems: 3+ years managing large-scale production environments.
• On-call Experience: Comfortable with 24/7 on-call responsibilities and incident response.
• System Administration: Strong Linux/Unix system administration skills.
• Networking: Understanding of TCP/IP, DNS, load balancing, and network security.
• Database Systems: Experience with SQL and NoSQL databases in production environments.

• Programming Languages: Proficiency in at least two of: Python, Shell, PHP, Java, or similar languages.
• Cloud Platforms: Experience with one of AWS, GCP, or Azure infrastructure and services.
• Containerization: Hands-on experience with Docker, Kubernetes, and container orchestration.
• Monitoring & Observability: Experience with Prometheus, Grafana, ELK stack, or similar tools.
• Infrastructure as Code: Proficiency with Terraform, CloudFormation, or similar tools.
• Version Control: Expert-level Git usage and collaborative development practices.

• SLI/SLO Management: Experience defining and maintaining service level objectives.
• Error Budget Policy: Understanding of error budget concepts and implementation.
• Toil Reduction: Track record of identifying and eliminating repetitive manual work.
• Capacity Planning: Experience with performance testing and capacity management.

Preferred Qualifications:







Work Location: This role is fully remote for candidates who reside outside a 50-mile radius of our San Ramon office. For candidates who reside within 50 miles of our San Ramon location, this role is hybrid and requires 3 days a week (M, W, TH) in our San Ramon office.

As part of our continued commitment to diversity, equity, and inclusion, Five9 supports pay transparency during the entire recruitment process. Actual compensation packages are based on several factors that are unique to each candidate including, but not limited to: skill set, depth of experience, certifications, and specific work location. The range displayed reflects the minimum and maximum target for new hire salaries for the job across the United States. Your recruiter can share more about the specific compensation package during your hiring process.

Additionally, the total compensation package for this position may also include an annual performance bonus, stock, and/or other applicable incentive compensation plans.

Our total reward package also includes:





All compensation and benefits are subject to the requirements and restrictions set forth in the applicable plan documents and any written agreements between the parties.

The US base salary range for this role is below.
$71,800—$190,000 USD
