ProdOps Engineer
A ProdOps Engineer at iHerb is entrusted and empowered to keep the lights on, not only for our development and engineering teams but for the organization at large as our team grows. By taking on the daily operation of iHerb’s vast infrastructure and services, you will enable our customers, partners, and other departments to accelerate iHerb’s drive forward as an industry leader. You will also develop broad, fulfilling technical knowledge of many tools, systems, and their integrations. As a level two engineer, you will take extra steps to work with our partners and help them offload their operational demands to our team by developing automation and incident playbooks. You will bring your advanced knowledge of scripting and automation not only to assist our customers but also to help ensure that ProdOps’ own processes operate efficiently, reliably, and with the highest availability.
Objectives of this Role
Operate the production environment by monitoring availability and having a holistic view of system health, utilization, and environmental changes.
Lead in developing proactive monitoring solutions, with a focus on streamlining remediation through response plays, scripting, and automation whenever possible.
Continually improve the availability and reliability of production workloads by measuring system performance and taking steps to optimize and autoscale based on trends and changing demands.
Assist engineering and development with formulating playbooks for deployments and upgrades that ProdOps can use for future maintenance, with a focus on automating as much of the process as possible.
Contribute to 24x7 operational support, triage, and incident management for IT and beyond, as our influence grows.
Daily Responsibilities
Respond to alerts with appropriate urgency and timing.
Gather and analyze system metrics to determine if changes to capacity or configuration should be made, and work with the appropriate teams and resources to implement them, always considering opportunities to automate/autoscale.
Partner with engineering, development, and other teams to continually improve our operational support offerings and capabilities and provide demos and use cases for review.
Participate in system design and capacity planning, with a focus on operational readiness and relevant, effective monitoring.
Open, manage, and escalate incidents based on our capabilities and playbooks provided by our partners and customers throughout the organization, continually reviewing them to find ways to automate.
Manage problems and continually improve our tools, with a focus on automation to reduce repetitive incidents and degraded performance.
Communicate with affected departments about upcoming maintenance or open incidents.
Develop automated maintenance and incident notifications based on a correlation of alerts, system metrics, and findings from previous incidents.
Automate the generation of reports and dashboards using scripting languages and API calls.
Required Skills and Qualifications
At least three years of experience in an internet/web operations environment.
At least one year of experience working in or closely with a NOC group.
Experience with enterprise networking (switching, routing, load balancing, and firewalls).
Experience with containers and orchestration (Kubernetes).
Experience with enterprise monitoring and status dashboarding solutions (Datadog, Statuspage).
Skilled in scripting and process automation components (Bash, Python, PowerShell, APIs, etc.).
Skilled in provisioning workloads and infrastructure in cloud environments (AWS, Azure, GCP).
In-depth, practical knowledge of DNS.
In-depth, practical knowledge and experience with Windows and Linux Server OS.
Must be able to quickly recognize, understand, and act on alerts from monitoring tools.
Must be able to manage and participate in the remediation of incidents of critical severity while maintaining a calm demeanor and attention to detail and process.
Must possess excellent verbal and written communication skills, with the ability to write timely status updates for critical incidents.
Must be driven and self-starting, and must possess a customer-focused mindset.