Cloud Support Engineer-SRE

  • Location:

    London/Glasgow

  • Sector:

    Digital & Technology, Finance & Banking

  • Job type:

    Contract

  • Salary/Rate:

    £449 max pay

  • Contact:

    Paul Smith

  • Contact email:

    psmith@skillfindergroup.com

  • Job ref:

    19723USER_38

  • Consultant:

    Paul Smith

AWS Site Reliability Engineer (Data Platform)

Fully onsite in London or Glasgow

12-month contract, inside IR35

Role Summary

We are seeking an AWS Site Reliability Engineer (SRE) to support, scale, and improve a cloud-native data platform built on AWS, Snowflake, and Databricks. This role focuses on enhancing platform reliability through automation, disaster recovery testing, resiliency engineering, observability best practices, and proactive SLO/SLI/SLA management.

Key Responsibilities

  • Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using Infrastructure as Code (IaC) and CI/CD.
  • Lead resiliency and disaster recovery initiatives, including scheduled DR drills, fault injection, and validation of recovery processes across AWS and data platform components.
  • Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; leverage error budgets to guide reliability-focused improvements.
  • Build and operate end-to-end observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
  • Partner with data engineering and platform teams to embed reliability-by-design into architectural decisions and delivery practices.
  • Perform root cause analysis (RCA) and drive continuous improvement to reduce operational toil and enhance platform availability and performance.
  • Own and drive resolution of platform-related incidents and service requests, ensuring efficient operational support while identifying and automating recurring issues.

Required Skills & Experience

  • Strong practical understanding of SRE principles, including SLO/SLI/SLA design and error budget management.
  • Solid hands-on experience with AWS services (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
  • Experience with observability tooling, monitoring, and alerting best practices.
  • Proficiency in automation and IaC using tools such as Terraform, CloudFormation, or CDK.
  • Scripting experience with Python and Bash.
  • Exposure to modern data platforms such as Snowflake and/or Databricks.

Nice to Have

  • Experience running DR tests, chaos engineering activities, or resiliency testing in cloud environments.
  • Familiarity with CI/CD pipelines and GitOps workflows.
  • Background supporting large-scale data or analytics platforms.

Technology Skill Level

  • Amazon Web Services (AWS): Intermediate (P2)