What are the responsibilities and job description for the DevOps Engineer position at Reltio?
About Reltio
At Reltio, we’re on a mission to enable digital transformation by delivering a single source of truth for enterprise data designed for the digital experience economy. We disrupted the master data management (MDM) software market when we launched the first cloud-native MDM software-as-a-service (SaaS) platform. The Reltio Connected Data Platform leverages a cloud-native multi-tenant architecture and our ecosystem to enable speed, agility and flexibility at scale. Companies across industries rely on Reltio to deliver mission-critical, secure, trusted real-time data at scale to create connected omnichannel experiences for their customers, partners and employees.
We’ve earned numerous awards and top rankings for our technology, our culture and our people. Reltio was founded on a distributed workforce and offers flexible work arrangements to help our people manage their personal and professional lives. So if you’re ready to work on unrivaled technology where your desire to be part of a collaborative team is met with a laser-focused mission to enable digital transformation with connected data, let’s talk.
About the team
The Reltio DevOps team is a tight-knit team of 12 highly responsive, highly technical Cloud engineers helping the company scale its cloud-based services through a heavy focus on automation and continuous delivery.
What is exciting about this opportunity
- Dynamic pace and opportunity to work with the hottest technology (Linux, Cassandra, Elastic, Spark, Serverless architecture, Multi-cloud, Docker/K8S) in AWS and GCP.
- Exposure to challenging problems in blending the MDM and Big Data worlds, with a focus on performance optimization, reliability, and availability.
- Opportunity to bring new automated approaches to scalability, capacity management, and elasticity.
Responsibilities
Senior Cloud DevOps Engineer/SME in infrastructure management and services for Reltio cloud services (MDM, RIQ, RDM, and other services or components that form the foundation of the core Reltio product line). The DevOps team is the first line of investigation and troubleshooting for service reliability issues pertaining to DevOps.
Infrastructure, Automation, and Deployments:
- Using end-to-end environment tests, ensure that Engineering dev and test environments are functional and monitored comparably to customer-facing environments. No alerting is expected for non-customer-facing environments.
- Own delivery and deployment of all instances and capacity increases for AWS and GCP environments using the current provisioning automation tools. Own expectations to internal customers for delivery dates of infrastructure build-outs and deployments.
- Work with the engineering team to profile and find improvements to get a better cost/ratio.
- Document and maintain critical operations processes for incident management, change management, and root-cause analysis. Ensure that processes are consistently followed by the team.
- Ensure that all systems and applications are monitored in four areas: Logging, Infrastructure Monitoring, Application Performance Monitoring, and end-to-end service level monitoring. Manage the list of checks and provide alert escalation paths for all checks. Own the reliability of the Monitoring and Logging services that support Reltio services and applications.
- Own and ensure the security patching and updating for all Reltio services and applications.
- Identify and fix areas that do not conform to Reltio RASP Principles. Provide guidance and lead projects to ensure RASP compliance as directed by Senior Engineering Leaders.
- Work with Engineering teams to apply architecture knowledge to provide high-availability cloud deployment solutions.
- Take ownership of the reliability of the complete Reltio Platform infrastructure.
Daily Operations:
- Own the scheduling and staffing model for the 24x7 on-call support rotation for all DevOps team members. Act as primary escalation POC for critical issues. Escalate to senior engineering leaders, customer support, and product management where necessary.
- Ensure the team is trained in full-stack troubleshooting and in effectively communicating findings to other teams. Improve hands-on skills in Linux, Cloud services, Cassandra, Elasticsearch, and Reltio applications management to ensure quick resolution of issues.
- Ensure alerting policy aligns with incident severity policy in all cases. Develop a root-cause analysis process and have the team perform it on every incident. Templatize and develop automation recommendations for common alert situations.
- Vigilantly monitor and improve performance of platform components.
- Continuously look for ways to optimize and streamline the Reltio infrastructure to facilitate a seamless experience for our customers.
- Update status and prioritization of DevOps Jira issues and ensure the team commits to scheduled tasks and deliverables.
- Track and monitor SLAs for various services and infrastructure and work within the team to manage them.
- Own communications and status on key customer issues and collaborate with customer-facing teams on resolutions and timelines.
Required Skills
- 2 years maintaining, supporting and deploying large-scale enterprise systems in the cloud
- 2 years hands-on experience in big data, data ingest pipelines, master data management, and data analytics delivered as SaaS services
- Proven experience with AWS/GCP; solution architecture certification is a plus
- 1 yr experience in cloud-based infrastructure architecture and security advisement, and deep experience in environment management and hardware topology design
- Strong experience architecting and building reliable, highly available, scalable and distributed systems
- Experience with administration of NoSQL databases (Cassandra preferred) and distributed indexes
- Experience with environment monitoring and alerting with automated issue correction, using tools including Zabbix, SignalFx, and Grafana
- 2 yrs experience in Kubernetes based container orchestration
- 2 yrs experience in infrastructure as code, application deployment and configuration using tools like AWS CloudFormation, Terraform, Chef, Ansible
- 2 yrs experience in source code control and management using GitHub/Bitbucket
- 2 yrs experience in setting up CI/CD pipelines using Jenkins
- 3 yrs experience with Linux system administration
- 2 yrs experience with container technologies and Kubernetes
- Working experience with cost optimization and use of tools like Spotinst and CloudHealth is a plus
- Experience with administration of Spark is a plus
- Understanding of performance and profiling techniques of complex systems is a plus
- Excellent communication skills
- BS/MS in computer science or equivalent experience