Senior Platform & Infrastructure Developer, Observability – Digital Nova Scotia – Leading Digital Industry

We continue to expand our highly talented Infrastructure teams and are seeking a candidate who has large-scale production experience with metrics, events, logs, and traces. In this role, you’ll work with the Observability team to design, develop, and maintain solutions that help R&D teams monitor the behavior and performance of their workloads, reduce the likelihood and impact of incidents, and troubleshoot issues more effectively.

Candidates with an operations (DevOps/SysOps/TechOps) background who have supported infrastructure at scale are encouraged to apply for this role. If you are a firm believer in Infrastructure as Code, continuous deployment/delivery practices, and helping teams understand how their services behave in real-world scenarios, then you might be a great fit!

LOCATION: This role is open to Remote-Canada, Remote-United States, Hybrid-Waterloo, ON, CA, & Hybrid-Eden Prairie, MN, USA

Technical Responsibilities

  • Design, configure, integrate, deploy, and operate Observability systems and tools. These systems collect metrics/logs/events from many backend services deployed across multiple AWS accounts and regions, and are consumed by multiple teams
  • Work with engineering teams to enable them to support their services from development to production
  • Ensure our Observability platform exceeds goals for availability, capacity, efficiency, scalability, and performance while meeting our internal SLOs
  • Build the next generation of observability using OpenTelemetry metrics, log aggregation tools, and tracing
  • Write libraries and APIs that provide a simple, unified interface to other developers when they use our monitoring, logging, and event processing systems
  • Enhance the existing alerting capabilities with Slack, Jira, and PagerDuty
  • Help build a continuous deployment system guided by metrics and data
  • Bring anomaly detection into the observability stack
  • Participate in the 24×7 on-call rotation after at least 6 months of employment

What you know

  • Strong with Python or Go
  • A cloud platform of your choice, with a preference for AWS (Lambda, CloudWatch, IAM, EC2, ECS, S3)
  • Solid understanding of Kubernetes
  • Prometheus, PromQL, Thanos, Alertmanager, Grafana, etc.
  • Strong knowledge of standard monitoring protocols/frameworks – Prometheus exposition format, InfluxDB line protocol, SNMP, JMX, etc.
  • Log aggregation tooling – syslog, Fluent Bit, Fluentd, CloudWatch Logs
  • Comfortable working with Git, GitHub, and common CI/CD approaches
  • IaC tooling like CloudFormation or Terraform

How you do things

  • Excited to use your expertise and be prescriptive about the right way forward
  • Able to work well alongside SRE, platform and development teams
  • Able to work independently and know when to reach out for support
  • Passionate about automation – we do everything-as-code

Other interesting things we’d cheer about:

  • Distributed tracing tools (e.g., Jaeger, Sentry, Zipkin, Grafana Tempo)
  • Java
  • Some familiarity with open Observability initiatives (e.g., OpenTracing, OpenCensus, OpenMetrics)
  • Knowledge of Kafka
  • Familiar with monitoring/observability in GCP and Azure
  • AWS Certifications
  • Comfortable with SQL