Site Reliability Manager

Checkr
Denver, CO Remote Full Time
POSTED ON 12/1/2022 CLOSED ON 1/21/2023

What are the responsibilities and job description for the Site Reliability Manager position at Checkr?

We’re looking for a Site Reliability Manager with extensive leadership and observability experience in cloud-based applications. In this role, you will lead, manage, and mentor a team of SREs, define and track metrics for the company's SLOs, SLIs, and SLAs, and operationalize incident management, communication, and handling. The SRE Manager will be responsible for the availability, performance, and quality of all external- and internal-facing application endpoints that help drive Checkr’s business. Extensive knowledge of AWS, Kubernetes, and event orchestration is desired. Experience with DataDog, PagerDuty, and Atlassian tooling (Jira, Confluence) is highly preferred. You will use it to identify strategies for improving our full-stack telemetry and monitoring capabilities, while mentoring SREs on observability-related work and supporting their career development.

The SRE Manager will work cross-functionally with Cloud Operations, Platform, and Product Engineering, combining operations work with software engineering principles to assist and contribute to the high availability of Checkr’s production systems. You will serve as a partner to our Product Engineering teams to strategize on making their services more performant, scalable, observable, and reliable. We believe every engineering team at Checkr should be responsible for the software they build, and SREs play a critical part in providing the tools, practices, and expertise to make that happen. 

We are growing and evolving the SRE team to help meet Checkr’s product-first reliability goals for 2023 and beyond. Having established a strong foundation--including a containerized microservices architecture (AWS, Kong, Kubernetes, Kafka, MySQL, and MongoDB), CI/CD, full-stack monitoring, structured incident response, and a blameless postmortem culture--we are focused on implementing new capabilities like:

  • Automating observability and alerting across an ever-changing landscape of microservices
  • Automated Service Reliability Scorecards and Production Readiness Standards
  • Software engineering project work, proposed and driven by individual SRE team members, to remove operational bottlenecks and increase velocity in ways we’ve never considered before

What a typical week may look like at Checkr

  • Expand and improve our observability and monitoring footprint in line with cost optimizations.
  • Coordinate with the product team(s) to assist with sprint planning for task and project-based work.
  • Drive and delegate the day-to-day escalations and incidents with on-call engineering teams.
  • Collaborate with other Engineering Managers to define metrics and dashboarding requirements.
  • Ensure stakeholders and partners are informed of incidents, working with other departments such as account management, legal, and marketing on outbound communication.
  • Review the work of the SRE team, help them get unblocked, and provide mentoring.
  • Meet with the team and individuals weekly to collaborate and discuss topics related to processes, planning, and goals.
  • Manage and assist the on-call incident commander and owners in resolving production reliability issues, ensuring timely communication, retrospectives, and postmortems are performed and delivered.
  • Participate in design and production reviews for new features, products, or infrastructure.
  • Plan for the growth of Checkr’s infrastructure, reliability/resiliency, and resources.

What we value in a Site Reliability Manager

SREs combine some level of experience in both software engineering and operations and may hail from various backgrounds and job titles, including production or application engineers, software developers with a strong DevOps mindset, SysAdmins with solid systems and programming skills, and Cloud Infrastructure or DevOps engineers. We are looking for someone with the following experience:

  • 7 years working in a relevant role, including 3 years of technical leadership experience mentoring engineers
  • 3 years of experience architecting and administering observability stacks, either managed or self-hosted (e.g., DataDog, New Relic, Prometheus, Elastic Stack/ELK, OpenTelemetry)
  • Operation of containerized microservices running on the public cloud, asynchronous event processing, and databases
  • Knowledge of Linux, Git, and CI/CD pipelines
  • On-call support of highly available production systems
  • Design and build new tools to automate repetitive tasks, prevent incidents, or improve time to resolution (TTR) using an object-oriented programming language such as Python
  • Experience with automation and Infrastructure as Code using tools like Terraform, Terragrunt, or CloudFormation
  • Understand how application components interact and contribute to architectural discussions
  • Unwavering commitment to operational security and best practices
  • Ownership: identify problems, propose solutions, and then coach and guide a team to implement them.
  • Connection: motivated to help other teams improve their service reliability and to continuously improve tooling and services.

What you get

  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive compensation and opportunity for advancement
  • 100% medical, dental, and vision coverage
  • Up to 25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend

The base salary for this position will vary based on geography and other factors. In accordance with Colorado and New York law, the base salary for this role is $180,625 - $212,500 if filled within Colorado and $212,500 - $250,000 if filled within the city limits of NYC.

Equal Employment Opportunities at Checkr
Checkr is committed to hiring talented and qualified individuals with diverse backgrounds for all of its tech, non-tech, and leadership roles. Checkr believes that the gathering and celebration of unique backgrounds, qualities, and cultures enriches the workplace.
Checkr also welcomes the opportunity to consider qualified applicants with prior arrest or conviction records. Checkr’s commitment to diversity extends to hiring talented individuals in spite of a prior criminal history in accordance with local, state, and/or federal laws, including San Francisco’s Fair Chance Ordinance.

This job has expired.