Senior Platform & Infrastructure Developer, Observability
We continue to expand our highly talented Infrastructure teams and are seeking a candidate with large-scale production experience with metrics, events, logs, and traces. In this role, you’ll work with the Observability team to design, develop, and maintain solutions that help R&D teams monitor the behavior and performance of their workloads, reduce the likelihood and impact of incidents, and troubleshoot issues more effectively.
Candidates with an operations (DevOps/SysOps/TechOps) background who have supported infrastructure at scale are encouraged to apply for this role. If you are a firm believer in Infrastructure as Code, continuous deployment/delivery practices, and helping teams understand how their services behave in real-world scenarios, then you might be a great fit!
LOCATION: This role is open to Remote-Canada; Remote-United States; Hybrid-Waterloo, ON, CA; and Hybrid-Eden Prairie, MN, USA
Technical Responsibilities
- Design, configure, integrate, deploy, and operate Observability systems and tools. These systems collect metrics/logs/events from many backend services deployed across multiple AWS accounts and regions, and are consumed by multiple teams
- Work with engineering teams to enable them to support their services from development to production
- Ensure our Observability platform exceeds its goals for availability, capacity, efficiency, scalability, and performance, and meets our internal SLOs
- Build the next generation of observability using OpenTelemetry metrics, log aggregation tools, and tracing
- Write libraries and APIs that provide a simple, unified interface to other developers when they use our monitoring, logging, and event processing systems
- Enhance the existing alerting capabilities with Slack, Jira, and PagerDuty
- Help build a continuous deployment system guided by metrics and data
- Bring anomaly detection into the observability stack
- Participate in the 24×7 on-call rotation after at least 6 months of employment
What you know
- Strong Python or Go skills
- A cloud platform of your choice, with a preference for AWS (Lambda, CloudWatch, IAM, EC2, ECS, S3)
- Solid understanding of Kubernetes
- Prometheus, PromQL, Thanos, Alertmanager, Grafana, etc.
- Strong knowledge of standard monitoring protocols/frameworks (Prometheus exposition format, InfluxDB line protocol, SNMP, JMX, etc.)
- Log aggregation tooling: syslog, Fluent Bit, Fluentd, CloudWatch Logs
- Comfortable working with Git, GitHub, and common CI/CD approaches
- IaC tooling like CloudFormation or Terraform
How you do things
- Excited to use your expertise and be prescriptive about the right way forward
- Able to work well alongside SRE, platform and development teams
- Able to work independently and know when to reach out for support
- Passionate about automation – we do everything-as-code
Other interesting things we’d cheer about:
- Distributed tracing tools (e.g., Jaeger, Sentry, Zipkin, Grafana Tempo)
- Java
- Some familiarity with open Observability initiatives (e.g., OpenTracing, OpenCensus, OpenMetrics)
- Knowledge of Kafka
- Familiarity with monitoring/observability in GCP and Azure
- AWS Certifications
- Comfortable with SQL