What are the responsibilities and job description for the Site Reliability Engineer position at Capgemini?
Job Summary:
We are looking for a skilled Site Reliability Engineer to join our Site Reliability Engineering (SRE) team. The primary focus of this role is to ensure the reliability, availability, and performance of our systems and services. You will work closely with software engineers, DevOps teams, and other SREs to build and maintain resilient systems that meet our service level objectives (SLOs). Your expertise will help us identify potential reliability risks, automate processes, and improve our incident response capabilities.
Key Responsibilities:
Reliability Engineering:
- Design and implement strategies to improve the reliability and availability of our services.
- Develop and maintain service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs) to measure and ensure system reliability.
- Identify and mitigate potential risks to system reliability through proactive measures, including redundancy, fault tolerance, and capacity planning.
Monitoring and Alerting:
- Set up and fine-tune monitoring and alerting systems to detect anomalies and issues in real time.
- Track SLIs against their SLOs and SLAs through dashboards and alerts to measure system reliability and performance.
Performance and Reliability Analysis:
- Analyze system performance data to identify bottlenecks, trends, and potential issues.
- Work with development and operations teams to optimize application performance and improve system reliability.
Automation and Tooling:
- Automate the collection and processing of observability data to reduce manual effort and improve accuracy.
- Develop custom tools and scripts to extend observability capabilities as needed.
Collaboration:
- Work closely with development teams to integrate observability best practices into the software development lifecycle.
- Collaborate with security teams to ensure observability tools and practices align with security and compliance requirements.
Qualifications:
Experience:
- 5 years of experience in Site Reliability Engineering, DevOps, or a related role with a focus on system reliability and performance.
- Strong background in monitoring, alerting, and incident management tools and practices.
- Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
Skills:
- Proficiency in scripting and automation languages (e.g., Python, Bash, Go).
- Strong understanding of networking, system performance, and reliability principles.
- Knowledge of service level management, including SLOs, SLIs, and SLAs.
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Soft Skills:
- Excellent problem-solving and analytical skills, with a proactive approach to identifying and addressing system vulnerabilities.
- Strong communication skills, with the ability to work effectively with cross-functional teams.
- A commitment to continuous learning and staying current with the latest industry trends and technologies.
Mandatory skill sets needed:
- GitHub Actions
- AWS CloudFormation
- AWS CodePipeline
- In-depth understanding of how to operationalize secure coding practices.