Principal DevOps Engineer

1001 - 5000 employees

Founded 2007

☁️ SaaS

🤖 Artificial Intelligence

🤝 B2B

💰 Post-IPO Debt on 2024-09

Zeta Global is an AI-powered marketing cloud that uses proprietary AI and trillions of consumer signals to help brands acquire, grow, and retain customers more efficiently. The Zeta Marketing Platform (ZMP) offers a suite of tools, including data management, a customer data platform (CDP), an email service provider (ESP), and a demand-side platform (DSP), to create individualized customer experiences and improve marketing outcomes. Zeta emphasizes omnichannel marketing, customer intelligence, and data-driven marketing strategies, partnering with brands, agencies, and publishers worldwide to accelerate brand growth and engagement. The platform addresses customer acquisition, growth, and retention through predictive AI and actionable consumer data.

Principal DevOps Engineer

🕒 May 8

🇺🇸 United States – Remote

💵 $180k - $210k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Apache

AWS

DNS

Docker

DynamoDB

EC2

Grafana

Java

JavaScript

Kafka

Kubernetes

MySQL

Node.js

Prometheus

Python

React

Ruby

TCP/IP

Terraform

Zeta Global

📋 Description

• Design, build, and operate production-grade CI/CD pipelines enabling multiple developers on multiple teams to deploy concurrently to production, multiple times daily, with zero-downtime guarantees.
• Implement and optimize advanced deployment strategies including canary releases, blue/green deployments, rolling updates, incremental rollouts, and feature flag-gated releases via Statsig.
• Build self-service deployment tooling that empowers developers to own their release process while enforcing safety guardrails, automated rollback triggers, and automated compliance gates.
• Establish deployment observability with real-time canary analysis, automated health scoring, and progressive delivery metrics integrated with Grafana, Prometheus, and Honeycomb.
• Champion CI/CD workflows using GitLab CI/CD, Helm charts, and Terraform to ensure infrastructure and application deployments are version-controlled, auditable, and reproducible.
• Define and enforce SLOs/SLIs/SLAs across services, establishing error budgets that balance velocity with reliability.
• Lead incident response processes, including on-call rotations, runbook development, blameless postmortems, and incident command structure.
• Design and implement robust observability stacks leveraging Grafana, Prometheus, Loki, and Honeycomb for metrics, logging, tracing, and alerting at scale.
• Proactively identify and eliminate reliability risks through chaos engineering, load testing, capacity planning, and failure mode analysis.
• Reduce operational toil through automation, self-healing infrastructure patterns, and intelligent alerting to minimize mean time to detection (MTTD) and mean time to recovery (MTTR).
• Manage and optimize AWS infrastructure spanning EC2, SQS, DynamoDB, and related services with Infrastructure as Code (Terraform) best practices.
• Design and operate Kafka-based event streaming infrastructure for high-throughput, low-latency data pipelines supporting real-time marketing and analytics workloads.
• Ensure robust networking across the platform, including DNS management, service mesh configuration, load balancing, TCP/IP optimization, routing policies, and VPC architecture.
• Manage containerization strategy using Docker, ensuring efficient image builds, vulnerability scanning, registry management, and runtime security.
• Support data infrastructure operations across Snowflake, MySQL, and other database platforms, collaborating with data engineering teams on reliability and performance.

🎯 Requirements

• 10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles, with demonstrated impact at staff or principal level.
• Expert-level Kubernetes knowledge, including cluster administration, Helm chart authoring, custom controllers/operators, network policies, RBAC, and multi-cluster management on AWS EKS.
• Deep expertise in CI/CD pipeline architecture and advanced deployment strategies (canary, blue/green, progressive delivery, feature flag integration) at scale.
• Strong proficiency with Infrastructure as Code using Terraform, including module design, state management, and multi-environment orchestration.
• Expert knowledge of Docker containerization, including multi-stage builds, security hardening, image optimization, and container runtime management.
• Production experience with Apache Kafka, including cluster management, topic design, consumer group strategies, and operational monitoring for high-throughput streaming workloads.
• Strong networking fundamentals: DNS (Route 53, internal DNS), TCP/IP, routing, API Gateway, load balancing (ALB/NLB), service mesh, VPC peering, transit gateways, and network troubleshooting.
• Extensive AWS experience spanning EKS, EC2, SQS, DynamoDB, IAM, VPC, CloudWatch, and related services in production environments.
• Hands-on experience with observability platforms: Grafana (dashboards, alerting), Prometheus (metrics, PromQL), Loki (log aggregation), and Honeycomb (distributed tracing, BubbleUp analysis).
• Working familiarity with multiple language stacks including Node.js, React, Python, Java, and Ruby, sufficient to understand build systems, dependency management, and runtime characteristics.
• Experience operating within regulated environments, with practical knowledge of GDPR, CCPA, SOC 2, and compliance automation in MarTech or AdTech domains.
• Proven ability to influence engineering culture, drive adoption of new practices, and communicate complex technical strategies clearly to both technical and non-technical stakeholders.
• Demonstrated experience with GitLab CI/CD pipelines, including advanced pipeline features such as parent-child pipelines, dynamic environments, and security scanning integration.

🏖️ Benefits

• Unlimited PTO
• Excellent medical, dental, and vision coverage
• Employee Equity
• Employee Discounts, Virtual Wellness Classes, and Pet Insurance
• And more

Similar Jobs

Director of SRE

🕒 April 30

Intus Care

11 - 50

⚕️ Healthcare Insurance

☁️ SaaS

🤖 Artificial Intelligence

Director of SRE managing reliability and operational excellence for a healthcare EMR platform. Leading a blended team for stability and operational efficiency in a high-stakes environment.

🇺🇸 United States – Remote

💵 $175k - $200k / year

💰 $13.1M Venture Round on 2023-01

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

Azure

Cloud

Grafana

Kubernetes

Prometheus

DevOps Engineer

🕒 April 14

Creyos (formerly Cambridge Brain Sciences)

51 - 200

⚕️ Healthcare Insurance

☁️ SaaS

🔬 Science

DevOps Engineer focusing on enhancing the efficiency and reliability of software deployment processes at Creyos. Work on automating configuration management and implementing CI/CD pipelines.

🇺🇸 United States – Remote

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

Python

Ruby

Ruby on Rails

Terraform

Executive Director – Central Head of DevSecOps

🕒 April 9

Las Vegas Sands Corp.

10,000+ employees

🎮 Gaming

Executive Director overseeing global DevSecOps functions, including infrastructure and application security for Sands. Leading teams to ensure compliance and optimize solutions while supporting IT initiatives.

🇺🇸 United States – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Cyber Security

Docker

JavaScript

Kubernetes

Python

Ruby

SDLC

DevOps Architect / SME, MultiCloud

🕒 April 8

EITACIES Inc.

51 - 200

🏢 Enterprise

🔒 Cybersecurity

🤖 Artificial Intelligence

DevOps Architect leading platform engineering standards across a multi-cloud, hybrid environment at Eitacies Inc. Focus on automation, infrastructure, and cloud architecture.

🇺🇸 United States – Remote

💵 $60 / hour

⏰ Full Time

🟠 Senior

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Cloud

DNS

Docker

DynamoDB

Firewalls

Google Cloud Platform

Kubernetes

Python

SQL

Terraform

Director of Global IT DevOps – AI Infrastructure

🕒 April 6

Endeavour. Inspired Infrastructure.

51 - 200

🤖 Artificial Intelligence

⚡ Energy

🏢 Enterprise

Director of Global IT DevOps & AI Infrastructure at Endeavour transforming global infrastructure. Overseeing AI product lifecycles and leading technical teams while ensuring operational excellence.

🇺🇸 United States – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

Azure

Cloud

Distributed Systems

DNS

ERP