Sr. Site Reliability Engineer
IT, Development Operations & Security (London-UK)
Monitoring & Incident Management:
Improve the studio’s reliability through monitoring, rapid response, communication and coordination.
Develop and manage the deployment architecture for the application, develop the monitoring architecture and implement monitoring agents, dashboards, escalations and alerts.
Routinely identify operational problems by observing and studying system architecture, functionality and performance results. Troubleshoot procedures with the studio architect, investigate surfaced issues, and handle incidents.
Identify operational priorities by assessing operational objectives. Determine project objectives, such as efficiency, cost savings, energy conservation, operator convenience, safety, and environmental quality, by estimating relevance, time, and costs.
Development & Data Analysis:
Develop operational solutions by defining, studying, estimating, and screening alternative solutions; calculating economics; determining impact on all systems.
Create new tools to facilitate automated monitoring of the studio’s operational environment.
Anticipate operational problems by studying operating targets, modes of operation, and unit limitations; monitoring unit performance.
Improve operational quality results by studying, evaluating, and recommending process re-architecting; implementing changes; contributing information and opinion to unit design and modification teams.
Provide operational management information by collecting, analyzing, and summarizing operating and engineering data and trends.
Update job knowledge by participating in educational opportunities; reading professional publications; maintaining personal networks; participating in professional organizations.
Accomplish engineering and organization mission by completing related results as needed.
Operations Engineer Skills and Qualifications:
Mastery of Linux systems and networking administration
High level understanding of Linux/Unix operating systems
Strong systems engineering and troubleshooting skills
Strong understanding of TCP/IP, SSL, and DNS
Ability to create and maintain technical documentation
Good understanding of web server configuration and management (Apache, Nginx)
Knowledge of load balancing concepts
Experience with service performance monitoring and automation
Experience with systems and application security
Ability to analyze and troubleshoot networking, performance, system and infrastructure issues using standard Linux/Unix tools
Ability to administer network firewalls
AWS expertise (EC2, VPC, S3, RDS, Route 53 (DNS) integration, CodeDeploy, IAM, ACM)
Monitoring - Nagios, Sensu, Grafana, Munin, Check_MK, CloudWatch, and/or Datadog
Backend - Graphite, Prometheus, InfluxDB
Writing checks & scripts
Log/Application Level (Splunk, Elasticsearch, Apache)
Ability to diagnose infrastructure as a whole
Administer and maintain MySQL and other open source databases
Write and perform basic queries to evaluate database stability, integrity and performance
Knowledge of NoSQL databases (Couchbase, MongoDB, etc.) is good to have
Shell scripting (Bash)
Configuration management - Chef, Ansible, or Puppet
Provisioning - Packer, Terraform, CloudFormation
Containerisation - Docker Swarm, Kubernetes, or AWS ECS/EKS
CI/CD - Jenkins, AWS CI/CD
Source code management - preferably Git or SVN
Bonus to have (Recommended, but not required):
Basic knowledge of containers, e.g. Docker or Kubernetes