Senior Site Reliability Engineer
Job Description
A Senior Site Reliability Engineer (Observability) is needed for a global pioneer in Cloud and Internet Intelligence. They give organizations visibility and insight into a borderless network, arming their clients with a precise understanding of how the network impacts their applications, users and customers.
This role is a unique opportunity for an experienced SRE to provide the tools, services, and infrastructure to monitor and observe the Platform, leveraging cloud-native tools and enabling developers to instrument, analyse, and monitor the application.
Permanent position, Hybrid in London.
Responsibilities
Responsibilities involve designing, deploying, and maintaining cloud-native monitoring services on AWS that are both elastic and resilient to failure. It is also fundamental to establish standards and best practices for the instrumentation of container-based services and cloud-managed services. Maintaining their pipeline is key to ensuring that notifications are timely, accurate, and directed to the appropriate channels. Automation is a priority, as it allows the monitoring platforms to scale smoothly and promotes a self-service approach.
Requirements
• Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
• Strong knowledge of modern logging tool sets, including Logstash or Fluentd.
• Understanding of Prometheus and its ecosystem, including Alertmanager.
• Good knowledge of Application Performance Monitoring tools and crash reporting tools, such as Sentry.
• Good knowledge of cloud provider managed services, and how they can be leveraged in their context.
• Ability to write high-quality code in Python, Go, or equivalent languages.
This is an exciting opportunity for a Senior SRE to join an expanding global business. If you are interested, please apply with your CV.