Senior Incident Manager

Job not on LinkedIn

🕒 7 days ago

🏄 California – Remote

💵 $125,000 - $195,000 / year

⏰ Full-time

🟠 Senior

👔 Manager

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Lambda

51 - 200 employees

🤖 Artificial Intelligence

☁️ SaaS

🔧 Hardware

💰 €39,700,000 Venture Round in 2022-11

Lambda is a cloud computing company that provides on-demand GPU instances and clusters suited to AI training and inference. It offers a range of GPU products, including on-demand cloud GPU instances billed by the minute, large-scale private GPU clusters, and PCIe servers with customizable NVIDIA Tensor Core GPUs. Lambda is known for its cloud dedicated to AI developers, which lets them launch GPU instances with an emphasis on NVIDIA's latest hardware. The company also offers workstation products configured with NVIDIA GPUs designed for deep learning and other AI applications.

Description

• Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
• Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
• Act as the liaison between leadership and external teams during incidents/post-incidents to provide updates and status summaries.
• Own the incident response lifecycle including:
  - Assisting Technical Triage
  - Escalation
  - Coordination
  - Resolution
• Ensure timely and accurate communication with internal stakeholders and leadership.
• Maintain incident response documentation and operational playbooks.
• Conduct analysis on incidents and identify patterns/trends for improvement in response and systems reliability.
• Work in an On-Call Rotation to respond to, lead, and coordinate incidents.
• Drive alignment during outages involving multiple infrastructure layers.
• Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.

🎯 Requirements

• 8+ years experience in incident management, site reliability engineering, or infrastructure operations
• Experience managing incidents in large-scale distributed infrastructure environments
• Strong understanding of:
  - Data center operations
  - GPU compute clusters
  - Networking and storage infrastructure
  - Cloud or hybrid infrastructure platforms
• Proven ability to lead high-pressure incident response situations
• Experience with incident management frameworks (ITIL, SRE, or equivalent)
• Excellent communication and stakeholder management skills
• Experience with incident tracking and monitoring tools such as:
  - PagerDuty
  - ServiceNow
  - Jira
  - Datadog
  - Prometheus / Grafana

🏖️ Benefits

• Health, dental, and vision coverage for you and your dependents
• Wellness and commuter stipends for select roles
• 401k Plan with 2% company match (USA employees)
• Flexible paid time off plan that we all actually use

Similar Jobs

🕒 7 days ago

Gray

1001 - 5000

🤝 B2B

🛍️ eCommerce

📱 Media

Site Quality Manager overseeing quality control for construction projects across multiple US locations. Responsibilities include managing quality programs and conducting inspections/audits for compliance.

🗣️🇺🇸🇬🇧 English required

🕒 7 days ago

Airbnb

5001 - 10000

👥 B2C

🛍️ eCommerce

Platform Product Manager leading product vision and strategy for equity products at Airbnb. Collaborating with cross-functional teams to enhance safety and integrity for Airbnb users.

🇺🇸 United States – Remote

💵 $179,000 - $207,000 / year

💰 Post-IPO Equity in 2020-12

⏰ Full-time

🟠 Senior

👔 Manager

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

🕒 7 days ago

LoanCare

1001 - 5000

💸 Finance

👥 B2C

🏠 Real Estate

Escrow Manager leading escrow operations including staff management at LoanCare. Responsible for vendor oversight and regulatory compliance in mortgage servicing.

🇺🇸 United States – Remote

💵 $64,800 - $121,500 / year

⏰ Full-time

🟠 Senior

🔴 Expert

👔 Manager

🗣️🇺🇸🇬🇧 English required

🕒 7 days ago

Theoria Medical

1001 - 5000

⚕️ Health Insurance

☁️ SaaS

📡 Telecommunications

Manager of Revenue Cycle Management overseeing billing and collections at Theoria Medical. Responsible for optimizing revenue cycle operations and communication with stakeholders.

🗣️🇺🇸🇬🇧 English required

🕒 7 days ago

HealthEdge

1001 - 5000

⚕️ Health Insurance

☁️ SaaS

💳 Fintech

Manager of Professional Services overseeing and developing Business Consultants at HealthEdge. Delivering consulting services across software implementation and expansion engagements with a focus on healthcare payers.

🇺🇸 United States – Remote

💵 $132,000 - $147,000 / year

⏰ Full-time

🟠 Senior

🔴 Expert

👔 Manager

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required