Site Reliability Engineer

NTT DATA Services strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.

We are currently seeking a Site Reliability Engineer to join our team in Halifax, Nova Scotia (CA-NS), Canada (CA).

The Site Reliability Engineering (SRE) team drives the reliability, recoverability, and operational efficiency of the Technology Asset Inventory (TAI) product portfolio. Reporting to the SRE Lead, this role's key activities include implementing advanced observability, troubleshooting complex systems, automating tasks, and managing technical debt.

Members of the SRE team are expected to work closely with the TAI user community on day-to-day usage of the products, as well as with our internal development and engineering squads and the offshore support team that provides first-line support.

Candidates will have the technical skills required to support these products on a Linux platform. Prior task automation experience in at least one programming language is expected. Hands-on experience with at least one pillar of observability is required, ideally including experience defining system monitoring rather than just reacting to alerts.

Responsibilities include:

  • Building and maintaining front-to-back knowledge of the Technology Asset Inventory product portfolio, and then specializing in one or two of its systems
  • Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with product delivery engineers
  • Reducing the cost of support (hours of effort) by eliminating operational issues, optimizing and automating tasks, developing operational tools, and driving client self-service to minimize constraints
  • Identifying and prioritizing technical debt that risks instability or creates wasteful operational toil
  • Consulting with clients (the Firm’s internal development community) to maximize their productivity, including troubleshooting toolchain issues
  • Being operationally responsive, including sharing an on-call rotation with the rest of a large, global team (with a time-off-in-lieu system)

Required Qualifications / Skills

  • 5 years of strong Linux troubleshooting experience
  • 5 years of task automation experience in any programming language
  • Practical experience of at least one pillar of observability (metrics, logs or traces)
  • Working knowledge of at least ONE of the following areas:
      • Databases (Sybase, DB2, MSSQL, etc.)
      • SQL
      • REST services (API)
      • Load balancing and networking
      • Performance troubleshooting and resolution
  • Confident collaboration skills

Desired Skills

  • Python development for task automation
  • Experience with site reliability engineering practices such as service level objectives (SLOs), error budgets, blameless postmortems, and toil reduction
  • Prior experience creating operational dashboards (Splunk, Grafana, etc.)