Principal DevOps Engineer

🕒 May 8

Zeta Global

1001 - 5000 employees

Founded 2007

☁️ SaaS

🤖 Artificial Intelligence

🤝 B2B

💰 Post-IPO Debt on 2024-09

Zeta Global is an AI-powered marketing cloud that uses proprietary AI and trillions of consumer signals to help brands acquire, grow, and retain customers more efficiently. The Zeta Marketing Platform (ZMP) offers a comprehensive suite of tools, including data management, a customer data platform (CDP), an email service provider (ESP), and a demand-side platform (DSP), to create individualized customer experiences and improve marketing outcomes. Zeta emphasizes omnichannel marketing, customer intelligence, and data-driven marketing strategies, partnering with brands, agencies, and publishers worldwide to accelerate brand growth and engagement. Its platform is designed to tackle complex marketing challenges with solutions for customer acquisition, growth, and retention through predictive AI and actionable consumer data.

📋 Description

• Design, build, and operate production-grade CI/CD pipelines enabling multiple developers on multiple teams to deploy concurrently to production, multiple times daily, with zero-downtime guarantees.
• Implement and optimize advanced deployment strategies including canary releases, blue/green deployments, rolling updates, incremental rollouts, and feature flag-gated releases via Statsig.
• Build self-service deployment tooling that empowers developers to own their release process while enforcing safety guardrails, automated rollback triggers, and automated compliance gates.
• Establish deployment observability with real-time canary analysis, automated health scoring, and progressive delivery metrics integrated with Grafana, Prometheus, and Honeycomb.
• Champion CI/CD workflows using GitLab CI/CD, Helm charts, and Terraform to ensure infrastructure and application deployments are version-controlled, auditable, and reproducible.
• Define and enforce SLOs/SLIs/SLAs across services, establishing error budgets that balance velocity with reliability.
• Lead incident response processes, including on-call rotations, runbook development, blameless postmortems, and incident command structure.
• Design and implement robust observability stacks leveraging Grafana, Prometheus, Loki, and Honeycomb for metrics, logging, tracing, and alerting at scale.
• Proactively identify and eliminate reliability risks through chaos engineering, load testing, capacity planning, and failure mode analysis.
• Reduce operational toil through automation, self-healing infrastructure patterns, and intelligent alerting to minimize mean time to detection (MTTD) and mean time to recovery (MTTR).
• Manage and optimize AWS infrastructure spanning EC2, SQS, DynamoDB, and related services with Infrastructure as Code (Terraform) best practices.
• Design and operate Kafka-based event streaming infrastructure for high-throughput, low-latency data pipelines supporting real-time marketing and analytics workloads.
• Ensure robust networking across the platform, including DNS management, service mesh configuration, load balancing, TCP/IP optimization, routing policies, and VPC architecture.
• Manage containerization strategy using Docker, ensuring efficient image builds, vulnerability scanning, registry management, and runtime security.
• Support data infrastructure operations across Snowflake, MySQL, and other database platforms, collaborating with data engineering teams on reliability and performance.

🎯 Requirements

• 10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles, with demonstrated impact at staff or principal level.
• Expert-level Kubernetes knowledge, including cluster administration, Helm chart authoring, custom controllers/operators, network policies, RBAC, and multi-cluster management on AWS EKS.
• Deep expertise in CI/CD pipeline architecture and advanced deployment strategies (canary, blue/green, progressive delivery, feature flag integration) at scale.
• Strong proficiency with Infrastructure as Code using Terraform, including module design, state management, and multi-environment orchestration.
• Expert knowledge of Docker containerization, including multi-stage builds, security hardening, image optimization, and container runtime management.
• Production experience with Apache Kafka, including cluster management, topic design, consumer group strategies, and operational monitoring for high-throughput streaming workloads.
• Strong networking fundamentals: DNS (Route 53, internal DNS), TCP/IP, routing, API Gateway, load balancing (ALB/NLB), service mesh, VPC peering, transit gateways, and network troubleshooting.
• Extensive AWS experience spanning EKS, EC2, SQS, DynamoDB, IAM, VPC, CloudWatch, and related services in production environments.
• Hands-on experience with observability platforms: Grafana (dashboards, alerting), Prometheus (metrics, PromQL), Loki (log aggregation), and Honeycomb (distributed tracing, BubbleUp analysis).
• Working familiarity with multiple language stacks including Node.js, React, Python, Java, and Ruby, sufficient to understand build systems, dependency management, and runtime characteristics.
• Experience operating within regulated environments, with practical knowledge of GDPR, CCPA, SOC 2, and compliance automation in MarTech or AdTech domains.
• Proven ability to influence engineering culture, drive adoption of new practices, and communicate complex technical strategies clearly to both technical and non-technical stakeholders.
• Demonstrated experience with GitLab CI/CD pipelines, including advanced pipeline features such as parent-child pipelines, dynamic environments, and security scanning integration.

🏖️ Benefits

• Unlimited PTO
• Excellent medical, dental, and vision coverage
• Employee Equity
• Employee Discounts, Virtual Wellness Classes, and Pet Insurance
• And more!

Similar Jobs

🕒 May 7

Scribe

51 - 200

☁️ SaaS

⚡ Productivity

🏢 Enterprise

Staff Database Reliability Engineer managing data infrastructure and leading database initiatives at Scribe. Ensuring operational excellence and driving observability across database systems.

Amazon Redshift

AWS

BigQuery

Django

Kafka

Postgres

Python

RabbitMQ

Redis

SQL

Terraform

Go

🕒 May 4

NVIDIA

10,000+ employees

🤖 Artificial Intelligence

🎮 Gaming

Site Reliability Engineer at NVIDIA designing and maintaining large scale Kubernetes clusters. Ensuring system reliability and operational efficiency through automation and monitoring practices.

Cloud

Distributed Systems

Kubernetes

Linux

Perl

Python

Ruby

Go

🕒 May 3

1Password

501 - 1000

🔒 Cybersecurity

☁️ SaaS

⚡ Productivity

Staff Security Engineer leading DevSecOps within Corporate Security team at 1Password. Responsible for securing developer environments and overseeing GitHub security.

Python

Terraform

🕒 May 2

Ad Hoc LLC

501 - 1000

🏛️ Government

🤖 Artificial Intelligence

🔌 API

Staff DevOps Engineer responsible for leading and improving cloud infrastructure for VA services. Collaborating with stakeholders and mentoring team members in software engineering best practices.

Ansible

Terraform

🕒 May 2

National Resident Matching Program® (NRMP®)

11 - 50

📚 Education

⚕️ Healthcare Insurance

Manager, DevOps responsible for software delivery practices and cloud platform oversight at NRMP. Leading release management and cross-functional team coordination in a complex environment.

AWS

Cloud

SDLC