Senior Platform & Infrastructure Developer, Observability

We continue to expand our highly talented Infrastructure teams and are seeking a candidate who has large-scale production experience with metrics, events, logs, and traces. In this role, you’ll work with the Observability team to design, develop, and maintain solutions that help R&D teams monitor the behavior and performance of their workloads, reduce the likelihood and impact of incidents, and troubleshoot issues more effectively.

Candidates with an operations (DevOps/SysOps/TechOps) background who have supported infrastructure at scale are encouraged to apply for this role. If you are a firm believer in Infrastructure as Code, continuous deployment/delivery practices, and helping teams understand how their services behave in real-world scenarios, then you might be a great fit!

LOCATION: This role is open to Remote (Canada), Remote (United States), Hybrid (Waterloo, ON, Canada), and Hybrid (Eden Prairie, MN, USA)

Technical Responsibilities

Design, configure, integrate, deploy, and operate Observability systems and tools. These systems include the collection of metrics, logs, and events from many backend services deployed across multiple AWS accounts and regions, for consumption by multiple teams
Work with engineering teams to enable them to support their services from development to production
Ensure our Observability platform exceeds its goals for availability, capacity, efficiency, scalability, and performance, and meets our internal SLOs
Build the next generation of observability using OpenTelemetry metrics, log aggregation tools, and tracing
Write libraries and APIs that provide a simple, unified interface to other developers when they use our monitoring, logging, and event processing systems
Enhance the existing alerting capabilities with Slack, Jira, and PagerDuty
Help build a continuous deployment system guided by metrics and data
Bring anomaly detection into the observability stack
Participate in a 24×7 on-call rotation after at least 6 months of employment

What you know

Strong proficiency in Python or Go
Cloud of choice, preference for AWS – Lambda, CloudWatch, IAM, EC2, ECS, S3
Solid understanding of Kubernetes
Prometheus, PromQL, Thanos, Alertmanager, Grafana, etc.
Strong knowledge of standard monitoring protocols/frameworks – Prometheus exposition format, InfluxDB line protocol, SNMP, JMX, etc.
Log aggregation tooling: syslog, Fluent Bit, Fluentd, CloudWatch Logs
Comfortable working with Git, GitHub, and common CI/CD approaches
IaC tooling like CloudFormation or Terraform

How you do things

Excited to use your expertise and be prescriptive about the right way forward
Able to work well alongside SRE, platform and development teams
Able to work independently and know when to reach out for support
Passionate about automation – we do everything-as-code

Other interesting things we’d cheer about:

Distributed tracing tools (e.g., Jaeger, Sentry, Zipkin, Grafana Tempo)
Java
Some familiarity with open Observability initiatives (e.g., OpenTracing, OpenCensus, OpenMetrics)
Knowledge of Kafka
Familiarity with monitoring/observability in GCP and Azure
AWS Certifications
Comfortable with SQL
