Principal DevOps Engineer

🕒 1 month ago

🇺🇸 United States – Remote

💵 $180,000 - $210,000 / year

⏰ Full-time

🔴 Expert

⛑ DevOps and Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Zeta Global

1001 - 5000 employees

Founded 2007

☁️ SaaS

🤖 Artificial Intelligence

🤝 B2B

💰 Post-IPO Debt in 2024-09

Zeta Global is an AI-powered marketing cloud that uses proprietary AI and trillions of consumer signals to acquire, grow, and retain customers more efficiently. The Zeta Marketing Platform (ZMP) offers a comprehensive suite of tools, including data management, a customer data platform (CDP), an email service provider (ESP), and a demand-side platform (DSP), to create individualized customer experiences and improve marketing outcomes. Zeta emphasizes omnichannel marketing, customer intelligence, and data-driven marketing strategies, and works with brands, agencies, and publishers worldwide to accelerate brand growth and customer retention. Its platform is designed to address complex marketing challenges with solutions for customer acquisition, growth, and retention, powered by predictive AI and actionable consumer data.

Description

• Design, build, and operate production-grade CI/CD pipelines enabling multiple developers on multiple teams to deploy concurrently to production, multiple times daily, with zero-downtime guarantees.
• Implement and optimize advanced deployment strategies including canary releases, blue/green deployments, rolling updates, incremental rollouts, and feature flag-gated releases via Statsig.
• Build self-service deployment tooling that empowers developers to own their release process while enforcing safety guardrails, automated rollback triggers, and automated compliance gates.
• Establish deployment observability with real-time canary analysis, automated health scoring, and progressive delivery metrics integrated with Grafana, Prometheus, and Honeycomb.
• Champion CI/CD workflows using GitLab CI/CD, Helm charts, and Terraform to ensure infrastructure and application deployments are version-controlled, auditable, and reproducible.
• Define and enforce SLOs/SLIs/SLAs across services, establishing error budgets that balance velocity with reliability.
• Lead incident response processes, including on-call rotations, runbook development, blameless postmortems, and incident command structure.
• Design and implement robust observability stacks leveraging Grafana, Prometheus, Loki, and Honeycomb for metrics, logging, tracing, and alerting at scale.
• Proactively identify and eliminate reliability risks through chaos engineering, load testing, capacity planning, and failure mode analysis.
• Reduce operational toil through automation, self-healing infrastructure patterns, and intelligent alerting to minimize mean time to detection (MTTD) and mean time to recovery (MTTR).
• Manage and optimize AWS infrastructure spanning EC2, SQS, DynamoDB, and related services with Infrastructure as Code (Terraform) best practices.
• Design and operate Kafka-based event streaming infrastructure for high-throughput, low-latency data pipelines supporting real-time marketing and analytics workloads.
• Ensure robust networking across the platform, including DNS management, service mesh configuration, load balancing, TCP/IP optimization, routing policies, and VPC architecture.
• Manage containerization strategy using Docker, ensuring efficient image builds, vulnerability scanning, registry management, and runtime security.
• Support data infrastructure operations across Snowflake, MySQL, and other database platforms, collaborating with data engineering teams on reliability and performance.

🎯 Requirements

• 10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles, with demonstrated impact at staff or principal level.
• Expert-level Kubernetes knowledge, including cluster administration, Helm chart authoring, custom controllers/operators, network policies, RBAC, and multi-cluster management on AWS EKS.
• Deep expertise in CI/CD pipeline architecture and advanced deployment strategies (canary, blue/green, progressive delivery, feature flag integration) at scale.
• Strong proficiency with Infrastructure as Code using Terraform, including module design, state management, and multi-environment orchestration.
• Expert knowledge of Docker containerization, including multi-stage builds, security hardening, image optimization, and container runtime management.
• Production experience with Apache Kafka, including cluster management, topic design, consumer group strategies, and operational monitoring for high-throughput streaming workloads.
• Strong networking fundamentals: DNS (Route 53, internal DNS), TCP/IP, routing, API Gateway, load balancing (ALB/NLB), service mesh, VPC peering, transit gateways, and network troubleshooting.
• Extensive AWS experience spanning EKS, EC2, SQS, DynamoDB, IAM, VPC, CloudWatch, and related services in production environments.
• Hands-on experience with observability platforms: Grafana (dashboards, alerting), Prometheus (metrics, PromQL), Loki (log aggregation), and Honeycomb (distributed tracing, BubbleUp analysis).
• Working familiarity with multiple language stacks including Node.js, React, Python, Java, and Ruby, sufficient to understand build systems, dependency management, and runtime characteristics.
• Experience operating within regulated environments, with practical knowledge of GDPR, CCPA, SOC 2, and compliance automation in MarTech or AdTech domains.
• Proven ability to influence engineering culture, drive adoption of new practices, and communicate complex technical strategies clearly to both technical and non-technical stakeholders.
• Demonstrated experience with GitLab CI/CD pipelines, including advanced pipeline features such as parent-child pipelines, dynamic environments, and security scanning integration.

🏖️ Benefits

• Unlimited PTO
• Excellent medical, dental, and vision coverage
• Employee Equity
• Employee Discounts, Virtual Wellness Classes, and Pet Insurance
• And more!
