We are looking for a Site Reliability Engineer - Messaging/Kafka to join our growing Service Platform team in Budapest.
The team is responsible for stateful distributed systems, including Kafka and ZooKeeper, which are critical for key workloads in the backend processing system that powers and drives the horizontal scaling of the platform.
You’ll be working hands-on, designing, architecting and implementing our stateful systems on AWS as well as in on-premise data centres. You will also help our product engineers use the clients to interact with these systems, laying down best practices and rules. You should have a deep understanding of distributed systems, including their tradeoffs and the failure scenarios that can occur.
It’s not a team that works in isolation; there are multiple teams in platform who will be able to help. You will be working closely with the platform-as-a-service and infrastructure-as-a-service teams to ensure we are providing the best platform for product engineers.
- Messaging systems (Kafka/ActiveMQ/RabbitMQ)
- Leader election (ZooKeeper/etcd)
- Understanding of Java application code, and/or other JVM languages
- Sysadmin ability to debug issues with disk, network, app performance, etc.
- Experience with infrastructure automation tools (Ansible/Puppet/Chef)
- Experience building out scalable and automated cloud platforms, preferably on AWS
Nice to Have
- Knowledge of Spring frameworks, especially Messaging
- Experience with Kafka Streams or other equivalent streaming tech
- Amazon’s messaging offerings, such as SQS, Kinesis
- Sysadmin knowledge of networks and firewalls for AWS and on-prem
- Have run performance and load tests at scale, and are able to forecast future capacity
Key Areas of the Role
- Define and create standard operating procedures that are compliant and auditable
- Own mission-critical shared infrastructure: run, maintain and schedule upgrades
- Isolate environments and work with various engineering teams to figure out how best to suit their needs
- Treat failover and DR events as something that needs to happen regularly and should be seamless