Site Reliability Engineer
About the job
The Digital Health Technology team powers digital experiences and engagement to enhance the lives of millions of people every day through connected care. We build, deliver and manage a portfolio of data management platforms and mobile offerings in support of our core businesses. We thrive on simple and elegant architecture and agility. You’ll be immersed in a dynamic high-growth environment and empowered to excel, take informed risks, and drive ingenuity across the enterprise.
Today is your day. At ResMed, we believe every day is a new day for groundbreaking ideas and unparalleled opportunity. Ours is a culture focused on what we can accomplish today, and where it can lead us tomorrow. We are thinkers and innovators who constantly challenge ourselves to do everything better than the day before, so that people around the world can breathe easier. We are constantly thinking beyond the horizon and challenging ourselves to remain visionaries of the industry we created. Every ResMed employee at every level of the organization can do extraordinary work and deliver innovation. We have always done things our own way. With integrity. With passion. With the aspiration to improve lives with our expertise. That’s why we hire the best.
Let’s Talk About The Team
The ResMed Digital Health Technology team powers digital experiences and engagement to enhance the lives of millions of people every day through connected care. We are a group of passionate engineers who thrive on simple and elegant architecture and agility.
Innovation is at our core and we are embracing new and emerging technologies as we build next-generation solutions. As a member of our DHT team, you’ll be immersed in an exciting, agile and dynamic high-growth environment, where you’ll be empowered to excel and where taking informed risks is rewarded.
ResMed is seeking a Site Reliability Engineer (SRE) to help define and execute a Site Reliability Engineering strategy for its rapidly expanding Digital Health Technology group. The candidate will improve and influence engineering best practices across the operations and development communities. The position requires extensive hands-on technical expertise coupled with broad industry knowledge, associated technology product knowledge, and excellent communication skills.
Let’s Talk About Responsibilities
- Monitoring and metrics — establishing desired service behavior, measuring how the service is actually behaving (availability, latency, and overall system health), and correcting discrepancies
- Emergency response — noticing and responding effectively to service failures in order to preserve the service’s conformance to its SLA (service-level agreement)
- Change management — altering the behavior of a service while preserving service reliability
- Capacity planning — projecting future demand and ensuring that a service has enough computing resources in appropriate locations to satisfy that demand
- Performance — design, development, and engineering related to scalability, isolation, latency, throughput, and efficiency
- Scaling systems sustainably through mechanisms such as automation
- Evolving systems by pushing for changes that improve reliability and velocity
- Conducting incident responses and blameless postmortems
Let’s Talk About Qualifications And Experience
- Bachelor’s degree in Computer Science or Information Systems or equivalent technical discipline
- Minimum 5 years’ working experience in an enterprise 24/7 production environment supporting critical, real-time applications, ideally in a public cloud such as AWS, Azure, or GCP
- Minimum 3 years of experience focused on site reliability for high-traffic applications, with a solid understanding of SLOs, SLIs, and SLAs and experience implementing them
- Minimum 2 years of coding experience, preferably in Java or Python
- Systematic problem-solving approach, combined with strong communication skills and a sense of ownership and drive
- Strong analytical skills to identify and understand the root cause of critical issues
- Full-stack debugging and performance optimization ability, including knowledge of cloud systems (load balancing, caching, content distribution, etc.), continuous integration/build systems, Java, SQL databases, and in-memory data stores such as ElastiCache or Redis
- Strong experience with monitoring tools such as AppDynamics and Datadog
- Track record of monitoring and analyzing system performance and isolating issues or bottlenecks that could impact reliability, performance, and scalability
- Good verbal and written communication skills, and the ability to work effectively with geographically remote teams
Good to have
- Experience using Atlassian tools like Bamboo, Confluence, JIRA and Stash
- Understanding of the product development life cycle, including Agile Scrum, TDD, and BDD
- Experience with Machine Learning
Joining us is more than saying “yes” to making the world a healthier place. It’s discovering a career that’s challenging, supportive and inspiring. Where a culture driven by excellence helps you not only meet your goals, but also create new ones. We focus on creating a diverse and inclusive culture, encouraging individual expression in the workplace and thriving on the innovative ideas this generates. If this sounds like the workplace for you, apply now!