Site Reliability Engineer – AI Agents

1001 - 5000 employees

Founded 2011

₿ Crypto

💸 Finance

💳 Fintech

Kraken Digital Asset Exchange is a cryptocurrency platform that facilitates the buying and selling of over 200 cryptocurrencies, including Bitcoin, Ethereum, and many others. Founded in 2011, Kraken provides a comprehensive suite of features for both beginner and advanced traders, such as advanced trading interfaces and margin trading. The platform emphasizes industry-leading security, deep liquidity, and 24/7 customer support, making it a trusted choice for users worldwide. Kraken caters to individual investors as well as institutional clients, offering services like OTC trading and custody. The company is committed to transparency with its proof of reserves and mission-driven values. Kraken operates globally, supporting clients in over 190 countries, with a quarterly trading volume exceeding $207 billion. However, users are advised of the high risk of crypto investments and the lack of regulation in some jurisdictions.

🔥 0 minutes ago

🇬🇧 United Kingdom – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Docker

Kubernetes

Python

Terraform

📋 Description

• Design, build, and operate the infrastructure layer supporting AI agent workflows in production
• Ensure reliability, scalability, and observability of agentic systems across internal and external products
• Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
• Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
• Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
• Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
• Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
• Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
• Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
• Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
• Implement access controls and security best practices across AI infrastructure environments
• Document architecture, runbooks, and best practices to support knowledge sharing across the team

🎯 Requirements

• 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
• Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
• Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
• Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
• Proficiency with Infrastructure as Code tools, particularly Terraform
• Experience with containerization and orchestration, particularly Kubernetes and Docker
• Solid understanding of cloud infrastructure, preferably AWS
• Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
• Experience designing and operating observability, monitoring, and alerting systems
• Experience implementing incident response procedures and participating in on-call rotations
• Strong collaboration skills working across data, AI, and engineering teams
• High ownership mindset in a fast-moving, high-stakes production environment

🏖️ Benefits

• Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.
• We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.
• Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.
• Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.
