SRE Engineer
We're seeking a Site Reliability Engineer (SRE) to manage and oversee our cloud services hosted on AWS. This new colleague will play a critical role in ensuring system reliability, adhering to SRE principles, and responding to emergencies in a 24/7 on-call setup.
Responsibilities:
- Monitor and manage the SAFEQ Cloud print service, ensuring high availability and reliability within the AWS environment.
- Develop and implement tools and practices for automating routine tasks to improve system scalability and resilience (Terraform, Ansible, Bamboo, Git, CloudWatch, Kubernetes).
- Set up alerts and monitoring metrics for proactive identification and mitigation of system issues (CloudWatch, Prometheus, Alertmanager).
- Participate in capacity planning and performance tuning to enhance system performance.
- Collaborate with software engineering teams to ensure seamless deployment, efficient issue resolution, and effective crisis management (on-call duty, “war room” sessions, consultation on L4-level bugs).
- Conduct root cause analysis (post-mortems) following system incidents; define corrective actions and preventive measures, and implement them (Terraform, Ansible improvements).
- In the backlog for the coming quarters: deployment pipelines, disaster recovery improvements, and deployment automation improvements.
- The team is remote (CZE, UK, ARG), with European standards of work and quality ensured.
- The SRE team provides the environment for a product developed by 15 R&D teams; SRE is the 16th.
- Possibility to work with other teams for a few sprints to pick up their know-how and spirit.
Requirements:
- Senior engineer. You know AWS, you know SRE, you can do it.
- Fluent English, good communication skills.
- Experience in an SRE role.
- Proficiency with AWS and its various services and resources.
- Solid understanding of the software development life cycle, CI/CD pipelines.
- Problem-solving skills, with the ability to think systematically.
- Knowledge of networking, security, and database systems.
- Availability for on-call duties in a 24/7 setup (duty rotates among team members).
- Bonus points for: Python/Go/Bash (deployment scripts)
What you get in return:
- Working with cutting-edge technologies daily, seamlessly blending them with time-tested, efficient methods. Check out ARTIN's tech stack: TechRadar.
- We have a continuously growing number of exciting projects, each special in its own way. When you feel ready for a new challenge, you can switch between projects, which lets you grow as you experiment in a new environment. Embrace our culture of learning by doing.
- Our team believes in creating a healthy work-life balance. We openly communicate to find a rhythm that works for both project and individual needs. We offer the ultimate flexibility with an unlimited vacation policy, allowing you to take as much time off as you need without worrying about hitting a specific limit.
- We are remote-first, which means you can work from anywhere you want. No face-to-face meetings are mandatory, but you are always welcome to visit our Brno and Prague offices.