Site Reliability Engineer - Databases – London, Tallinn at TransferWise
London, GB / Tallinn, EE

Site Reliability Engineer (Embedded in Database Engineering team)

We are looking for a Site Reliability Engineer to join our growing Service Platform team in London or Tallinn.

The team is responsible for building the platform on which microservices run, ensuring it is robust and resilient to failure. We want to deliver sustainable solutions for product teams, with a healthy dose of paranoia about how complex distributed systems can fail.

Your Mission

You'll be working hands-on in our Service Platform team, embedded in the Database Engineering team. We expect you to build and own the vision for Platform-as-a-Service, specifically Database as a Service, at TransferWise: leading the way in piloting and implementing new solutions on both the server and client side, tracking their impact and iteratively reducing overhead in all technical aspects of building new services. You will create tooling to automate database infrastructure deployment in any environment. We are invested in AWS and Kubernetes, so you'll be making the most of these abstractions. You'll also work with our product engineering teams to understand their needs and onboard them to new solutions.

Must Have

  • Practical experience managing databases or search engines, such as Postgres, MySQL, MongoDB, Cassandra or Elasticsearch
  • Experience scaling out databases
  • Experience building scalable, automated cloud platforms, preferably on AWS
  • Experience with infrastructure automation tools (Ansible/Puppet/Chef)
  • Experience with containerisation technology and orchestration platforms, e.g. Docker, Kubernetes
  • Basic sysadmin skills in debugging issues with disk, network, app/JVM performance etc.
  • An unwillingness to settle for downtime and outages, and no desire to be woken up in the middle of the night

Nice to Have

  • Knowledge of, and a keen eye on, newer architectural concepts such as microservices, service mesh and lambda programming
  • Experience running performance and load tests at scale, and the ability to forecast future capacity
  • Experience with languages such as Java, Groovy, Python, Go etc.
  • Experience handling data or event pipelines
  • Sysadmin knowledge of DNS, network topology, switches, routers, CDNs and anti-DDoS

Key Areas of the Role

  • Define and create standard operating procedures that are compliant and auditable
  • Own mission-critical shared infrastructure: run it, maintain it and schedule upgrades
  • Isolate environments and work with various engineering teams to figure out how best to meet their needs
  • Treat failover and disaster recovery (DR) events as something that should happen regularly and be seamless
  • Engage regularly with our blameless postmortem culture, always focusing on continuous improvement