Senior Site Reliability Engineer

Engineering - Remote

At Chattermill we use cutting-edge AI technology to give leading companies the key to improving their customer experience. We work with many of the most exciting companies in the world (Uber, HelloFresh, Transferwise, and Skyscanner to name a handful!) and are passionate about helping them put their customers at the heart of their decision making.

In our 6 years we’ve grown from two co-founders to a team of 50 (and counting) bright and diverse individuals. Chattermill was recently ranked 16th among the fastest-growing tech companies in the UK by Deloitte and 77th among the fastest-growing companies in Europe in the FT1000. We have ambitious plans to keep growing and are now looking for a Senior Site Reliability Engineer to join our Engineering team and ensure the stability and scalability of our platform.

One of our core company values is that We Act as Responsible Owners. We are hoping the right person shares this belief, takes pride in the stability and scalability of our platform, and wants to own the SRE function at Chattermill. We are growing quickly, and there is scope for the person coming into this role to progress towards a Lead position and scale the SRE team.

As a Senior Site Reliability Engineer, you will:

  • Take an active part in all stages of our engineering process, from design and implementation to support and maintenance
  • Help colleagues from different teams (software engineers, QA engineers, data scientists) set up the right infrastructure for their workloads
  • Provide expertise and guidance to build self-healing systems with high availability and horizontal scalability
  • Ensure the health of all environments by monitoring technical and business metrics, setting up alerts for things going wrong, acting proactively to prevent disasters, and acting fast and effectively when they happen
  • Drive our incident management process, apply root cause analysis when investigating incidents with other engineers, and help define preventive measures that rule out the whole class of identified issues
  • Improve our GitLab-based CI/CD pipelines, which involves spinning up prod-like test environments to run our e2e tests and canary releases with automatic rollback based on metrics
  • Play a proactive role in identifying performance bottlenecks and other architectural issues and provide guidance on how to mitigate them in a planned and timely manner
  • Take an active role in improving the stability and scalability of our Kafka-based data pipeline to facilitate agile experimentation for our Data Science team and enable even more complex data integration with our clients

What we’re looking for:

  • Extensive experience managing production k8s clusters with data-intensive workloads in Google Cloud Platform
  • Experience in complex infrastructure migrations of mission-critical systems with zero downtime
  • Operational experience with solutions in our stack (Postgres, Elasticsearch, Kafka, Redis)
  • Proficiency in more than one programming language (preferably including Go) and the ability to identify and automate routine, repetitive tasks
  • Strong architectural background in distributed systems
  • Experience in setting up central logging on the ELK stack
  • Experience in managing highly available Prometheus (with Thanos), setting up alerts with Prometheus’s Alertmanager, and creating dashboards in Grafana
  • Ability to define an SLA and provide a viable plan for how to stick to it
  • Understanding of the infrastructure-as-code principle and experience applying it successfully with tools like Terraform
  • Ability to explain the OSI model and to diagnose and debug network issues in a cloud environment
  • Deep knowledge of Linux and the ability to explain how it works under the hood
  • Good communication skills, an interest in building effective relationships with colleagues, and the ability to explain things in simple terms to non-technical stakeholders
  • Interest in providing cutting-edge infrastructure: the ability to assess new technologies, evaluate the maintenance costs of different alternatives, and prove their viability, plus a willingness to facilitate adoption of new solutions within the team

Nice to have:

  • Experience working as a backend engineer
  • Experience in setting up data infrastructure for AI companies

Why join us?

  • A competitive salary as well as the ability to share in the company’s success through options
  • We want you to grow with us, so we place huge importance on providing our people with great opportunities to develop and progress, such as a £500 (yearly) personal development budget, a progression framework, unlimited access to a fully stocked library and biweekly Breakfast and Learns
  • Great progression opportunities - we want you to grow with us!
  • A flexible Health & Wellness benefits budget that can be spent on health insurance, physical and mental health or other needs, starting at £50 pcm and growing by £25 pcm for each year of service
  • 25 days holiday (in addition to bank holidays) + 1 day for your birthday + 1 day for every year of service up to 5 years
  • Perks including discounts on cinema tickets, utilities and more
  • Flexible working conditions and the opportunity to work from home 
  • Lovely office with great classes, events, and a rooftop terrace (when not in a pandemic!)
  • Regular company socials planned by our great colleagues!