Sr. Site Reliability Engineer

IT, Development Operations & Security (London, UK)

Monitoring & Incident Management:

Improve the studio’s reliability through monitoring, rapid response, communication and coordination.

Develop and manage the deployment architecture for the application; develop the monitoring architecture; and implement monitoring agents, dashboards, escalations and alerts.

Routinely identify operational problems by observing and studying system architecture, functionality and performance results. Troubleshoot procedures with the studio architect, investigate surfaced issues, and handle incidents.

Identify operational priorities by assessing operational objectives. Determine project objectives, such as efficiency, cost savings, energy conservation, operator convenience, safety, and environmental quality, and estimate relevance, time, and costs.

Development & Data Analysis:

Develop operational solutions by defining, studying, estimating, and screening alternative solutions; calculating economics; and determining the impact on all systems.

Create new tools to facilitate automated monitoring of the studio’s operational environment.

Anticipate operational problems by studying operating targets, modes of operation, and unit limitations, and by monitoring unit performance.

Improve operational quality results by studying, evaluating, and recommending process re-architecting; implementing changes; and contributing information and opinion to unit design and modification teams.

Provide operational management information by collecting, analyzing, and summarizing operating and engineering data and trends.

Update job knowledge by participating in educational opportunities; reading professional publications; maintaining personal networks; participating in professional organizations.

Accomplish engineering and organization mission by completing related results as needed.

Operations Engineer Skills and Qualifications: 

Mastery of Linux systems and networking administration

  • High-level understanding of Linux/Unix operating systems

  • Strong systems engineering and troubleshooting skills

  • Strong understanding of TCP/IP, SSL and DNS

  • Ability to create and maintain technical documentation

  • Good understanding of web server configuration and management (Apache, Nginx)

  • Knowledge of load-balancing concepts

  • Experience with service performance monitoring and automation

  • Experience with systems and application security

  • Ability to analyze and troubleshoot networking, performance, system and infrastructure issues using standard Linux/Unix tools

  • Ability to administer network firewalls


Cloud Management

  • AWS expertise (EC2, VPC, S3, RDS, Route 53 DNS integration, CodeDeploy, IAM, ACM)


Monitoring Systems

  • Nagios, Sensu, Grafana, Munin, Check_MK, CloudWatch, and/or Datadog

  • Backend - Graphite, Prometheus, InfluxDB

  • Writing checks & scripts

  • Log/application level (Splunk, Elasticsearch, Apache)

  • Ability to diagnose infrastructure as a whole


Database fundamentals

  • Administer and maintain MySQL and other open source databases

  • Write and perform basic queries to evaluate database stability, integrity and performance

  • Knowledge of NoSQL databases (Couchbase, MongoDB, etc.) is good to have


Scripting

  • Shell scripting (Bash)

  • Python

Configuration management

  • Chef, Ansible or Puppet

  • Provisioning - Packer, Terraform, CloudFormation

  • Containerisation - Docker Swarm, Kubernetes, or AWS ECS/EKS

CI/CD

  • Jenkins, AWS CI/CD

Source code management 

  • Preferably Git or SVN

Bonus to have (Recommended, but not required):

  • Basic knowledge of containers, e.g. Docker or Kubernetes

  • PHP