What are the responsibilities and job description for the Senior Site Reliability Engineer position at Angi?
About the Role
Site Reliability Engineers (SREs) on the Telemetry team are responsible for ensuring that Angi’s Insights Platform can be relied upon to support the needs of our mission-critical systems. The SRE role at Angi differs from the role at many other organizations: rather than being embedded in development teams, you will work in a team of SREs tasked with completing company objectives. The team works together to address client needs as any development group would, which makes it easier to share knowledge between team members and gives clients a more consistent experience. We build all of our solutions on EKS in AWS with Terraform, and we use Weave Flux, Prometheus, Cortex, Loki, Tempo, and Grafana to provide telemetry services for our clients. Every day you’ll find yourself managing these services, providing solutions based on their data, or working with clients on how to properly use our Telemetry Platform.
We are looking for experienced Site Reliability Engineers who meet the following criteria:
Technical:
- A working knowledge of metrics, logs, and distributed tracing practices.
- Depth of knowledge in at least one of those practices.
- Comfortable contributing to a shared codebase.
- Understand Kubernetes and the container orchestration concepts it uses.
- Passionate about process automation and familiar with enough different approaches to entertain several before deciding on which to pursue.
- A healthy amount of curiosity for containerized technology and how it works.
Execution:
- Experience identifying changes that improve processes from a reliability and performance perspective.
- Enjoy finding solutions in low-information situations.
- Comfortable using telemetry data to spot parts of a system that do not scale, research solutions, and implement a migration plan that mitigates the problem.
- Enjoy working to determine what service information is important enough to drive service levels and creating the means for teams to use that data.
Collaboration and Communication:
- Have a curiosity for current and new practices that lead to collaboration and process change.
- Enjoy documenting and sharing solutions to interesting challenges with others.
- Have participated in post-mortems and have definite opinions on how they serve the organization.
- Experience working as a team to support a critical core system.
As an SRE you will:
- Determine what information is important enough to drive service levels for our services.
- Use service level information to determine reliability on our Telemetry Platform.
- Participate in an on-call rotation that responds to incidents concerning the Telemetry Platform.
- Contribute to solutions defined in GitLab projects and GitHub repositories.
- Maintain AWS EKS clusters using our Terraform modules.
- Automate complex business challenges that require your specific skill set.
- Contribute to core infrastructure pieces that allow Angi to scale to meet the needs of its clients.
- Use the Telemetry Platform to assist in investigations that happen across the organization.
- Plan and shape the growth of Angi’s infrastructure as we iterate it over time.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Have an understanding of large-scale system design, monitoring, observability, and operational practices.
- Have strong programming skills in Go, Python, and/or Ruby.
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have experience with Weave Flux, Nginx, Kubernetes, Terraform, Prometheus, Loki, Cortex, Tempo, or similar technologies
- Are compelled to keep a constant eye on the observability space, identifying changes in practices and technologies as they arise and planning ahead for them.
Projects you could work on:
- Contribute to our team’s Telemetry Platform that consists of Prometheus, Cortex, Loki, Tempo, and Grafana deployed in EKS using Terraform and Weave Flux on AWS.
- Contribute to projects across the organization to address challenges at which your skill set excels.
- Work with our dev teams to determine how to make their paging strategy more meaningful and less problematic.
- Develop ways to help our development teams instrument their services so that they collect the information about our applications needed for investigation.
- Work to reduce the level of effort needed to use the instrumentation that the teams are creating.
- Provide valuable feedback and collaborate with the teams whose products we use as we iterate on our own infrastructure.
Compensation & Benefits:
- The salary band for this position ranges from 140k - 200k, commensurate with experience and performance. Compensation may vary based on factors such as cost of living.
- This position is eligible for a competitive year-end performance bonus & equity package
- Full medical, dental, vision package to fit your needs
- Flexible vacation policy; work hard and take time when you need it
- Pet discount plans & retirement plan with company match (401K)
- The rare opportunity to work with sharp, motivated teammates solving some of the most unique challenges and changing the world