New Energy Platform is a new business unit within Centrica building the future energy supply platform for our UK customers.
As a key part of the Technology function, we are creating a Site Reliability Engineering (SRE) team. The SRE team will work with our squad-based engineering teams, the global security and networks teams in Centrica DTS, and key vendors to drive the reliability agenda.
The SRE team will improve the experience for our existing customer base and enable the growth agenda by ensuring our levels of service consistently meet customer expectations.
Role accountabilities:
Deep understanding of SRE philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation
Focus on system reliability, performance, and supportability by balancing feature development velocity and reliability with well-defined SLOs
Improve CI/CD pipelines to increase development squads' velocity and confidence while automating provisioning, quality controls, security auditing and maintenance
Establish, manage and optimise our monitoring solutions to achieve observability
Support squads with best practices in monitoring and improving alert thresholds
Design monitoring systems that prioritize the customer perspective and experience
Contribute to architectural and design principles to drive reliability, scalability and reusability for a large-scale distributed platform
Work with development squads to implement automation opportunities to drive down toil and reduce technical debt
Carry out end-to-end stability inspections to take a holistic view of system health and proactively mitigate customer impacts
Firefight stability problems with business teams and engage in troubleshooting, service capacity planning and demand forecasting, platform performance analysis and system tuning
Conduct post-incident reviews and trend analysis and own the learning loop back to the development squads
Provide reports on system health built around the service level indicators (SLIs)
The role requires flexibility to participate in rotating on-call duties and timely post-mortems of production incidents.
Competencies, Experience and Qualifications:
Experience and Qualifications
BS degree in Computer Science or related technical field involving coding or systems engineering
Significant experience working in an SRE or DevOps team supporting a scaled production platform
Certification(s) within Cloud Architecture and/or AWS
Experience of implementing, maintaining and optimising a CI/CD pipeline
Real-world coding experience, whether with traditional compiled languages, scripting languages, or both
Experience of working within cloud computing and familiarity with Infrastructure as Code
Working knowledge of contemporary monitoring and analytics tooling and best practice
Working knowledge of automation tooling and best practice
Excellent investigative and diagnostic abilities and strong problem-solving skills, combined with the ability to take courageous decisions, often with limited time and information, in order to restore service
Strong technical knowledge across cloud, infrastructure and application domains
Experience using Terraform
SecDevOps: integrating secure development practices and controls into the development/deployment process
Capability for continual improvement and for picking up new technology skills very quickly
Ability to engage, build and sustain stakeholder relationships and influence decisions
Excellent oral and written communication skills, including the ability to explain technology solutions in business terms and clearly communicate to both technical and non-technical staff
Able to lead virtual teams and hold peers to agreed standards of delivery and performance
Strong analytical skills to extract key information from multiple data sources to drive superior operational performance
Calm under pressure and able to take the lead in complex situations
Experience of the Azure DevOps product is desirable