What are the responsibilities and job description for the Lead Site Reliability Engineer position at TalentMinded?
The opportunity
Our client, a Canadian-based global SaaS company, is looking to add a Lead Site Reliability Engineer to their team. We need a self-starter with an infrastructure operations background who is excited by the opportunity to support the migration, availability, security, and releases of multiple global cloud products.
Who you are - and why you should join
You may be a Team Lead or Manager looking to return to hands-on work, or a Senior SRE seeking the next step of leading the technical work. You will focus on improving the reliability, resiliency, and scalability of our customer-facing products in the AWS Cloud. You will evaluate and evolve our current Cloud operational practices, procedures, and tooling. As part of the team, you will continue to respond to availability alerts and security issues and plan for timely updates and releases.
The new Lead Site Reliability Engineer will:
- Take ownership of service readiness from a Cloud reliability perspective for new AWS/public cloud technologies introduced in the enterprise. You will define the Cloud hosting service branding tier (Gold, Silver, Bronze) that corresponds to the level of service assurance an application requires, define and assign SLOs (Availability, Reliability, and Observability), and design compliance dashboards against the SLO targets.
- Create and execute a configuration management strategy and an automation strategy for the enterprise. You will implement a monthly Site Reliability service review and host a showcase call for leadership to highlight how we are tracking on Operational Site Reliability.
- Gather and analyze metrics from systems and applications to assist in performance tuning and fault finding. You will implement dashboards in an Observability tool to surface performance patterns that need attention and work with Development and QA teams to fix them. You will also support QA teams in performance testing and help them isolate performance bottlenecks.
- Advise and guide leadership on technical solutions and participate directly in investigating performance and availability issues. You will advocate for better, unified practices and implementation across DevOps teams. In addition, you will participate in system design consulting, platform management, and capacity planning. You will analyze root cause analyses (RCAs) and create a service health scorecard to highlight opportunities for remediation.
- Build and integrate an automation playbook for every Incident response alert. You will reduce manual toil and apply software engineering skills to IT Operations, from automating OS patching to rolling out a configuration management tool and its capabilities.
What you bring:
- You have a degree in Computer Science, Engineering, or Math. You hold the AWS Certified Solutions Architect - Associate certification, and you are pursuing a security certification.
- You have worked as a Site Reliability Engineer. You come with a blend of engineering and cloud administration experience. You can apply sound engineering principles, operational discipline, and mature automation, drawing on best practices you have already put into practice as well as the latest industry thinking on availability, reliability, and performance.
- You can build trusting relationships at any level of the organization. You respect diverse approaches and can champion your own choices. You have flexible communication skills and are comfortable creating documentation and making presentations. You thrive working in interdisciplinary groups that bring teams together and build great products.
TalentMinded welcomes and encourages applications from people with disabilities. Should you require accommodation in any aspect of the selection process, please contact us at careers@talentminded.ca and we will be happy to assist.