Location: London/Glasgow
Sector:
Job type:
Salary/Rate: £449 max pay
Contact: Amy Hughes
Contact email: ahughes@skillfindergroup.com
Job ref: 19723USER_75
Consultant: Amy Hughes
AWS Site Reliability Engineer (Data Platform)
Fully onsite in London or Glasgow
12-month contract, inside IR35
Role Summary
We are seeking an AWS Site Reliability Engineer (SRE) to support, scale, and improve a cloud-native data platform built on AWS, Snowflake, and Databricks. This role focuses on enhancing platform reliability through automation, disaster recovery testing, resiliency engineering, observability best practices, and proactive SLO/SLI/SLA management.
Key Responsibilities
- Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using Infrastructure as Code (IaC) and CI/CD.
- Lead resiliency and disaster recovery initiatives, including scheduled DR drills, fault injection, and validation of recovery processes across AWS and data platform components.
- Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; leverage error budgets to guide reliability-focused improvements.
- Build and operate end-to-end observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
- Partner with data engineering and platform teams to embed reliability-by-design into architectural decisions and delivery practices.
- Perform root cause analysis (RCA) and drive continuous improvement to reduce operational toil and enhance platform availability and performance.
- Own and drive resolution of platform-related incidents and service requests, ensuring efficient operational support while identifying and automating recurring issues.
Required Skills & Experience
- Strong practical understanding of SRE principles, including SLO/SLI/SLA design and error budget management.
- Solid hands-on experience with AWS services (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
- Experience with observability tooling, monitoring, and alerting best practices.
- Proficiency in automation and IaC using tools such as Terraform, CloudFormation, or CDK.
- Scripting experience with Python and Bash.
- Exposure to modern data platforms such as Snowflake and/or Databricks.
Nice to Have
- Experience running DR tests, chaos engineering activities, or resiliency testing in cloud environments.
- Familiarity with CI/CD pipelines and GitOps workflows.
- Background supporting large-scale data or analytics platforms.
Technology Skill Level
- Amazon Web Services (AWS): Intermediate (P2)
