Site Reliability Engineer

🕒 1 month ago

🗣️🇺🇸🇬🇧 English required

Mistral AI

11 - 50 employees

Fast, open-source, and secure language models. Easier specialization of models for business use cases by leveraging private data and usage feedback.

Description

• Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
• Design, build, and maintain scalable, highly available, and fault-tolerant infrastructures to support our web services and ML workloads.
• Make sure our platform, inference, and model training environments are always highly available, and enable seamless replication of work environments across several HPC clusters.
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging, and alerting systems) for both our client-facing APIs and large training runs.
• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, and Terraform.
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
• Design and develop new workflows and tooling to improve the reliability, availability, and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
• Document processes and procedures to ensure consistency and knowledge sharing across the team.
• Contribute to open-source projects, research publications, blog articles, and conferences.

🎯 Requirements

• Master’s degree in Computer Science, Engineering, or a related field
• 7+ years of experience in a DevOps/SRE role
• Strong experience with cloud computing and highly available distributed systems
• Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
• Experience working against reliability KPIs (observability, alerting, SLAs)
• Hands-on experience with CI/CD, containerization, and orchestration tools (Docker, Kubernetes...)
• Knowledge of monitoring, logging, alerting, and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
• Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
• Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
• Strong understanding of networking, security, and system administration concepts
• Excellent problem-solving and communication skills
• Self-motivated and able to work well in a fast-paced startup environment

Your application will be all the more interesting if you also have:

• Experience in an AI/ML environment
• Experience with high-performance computing (HPC) systems and workload managers (Slurm)
• Experience with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)

🏖️ Benefits

• 💰 Competitive salary and equity
• 🚑 Healthcare: Medical/Dental/Vision covered for you and your family
• 👴🏻 401(k): 6% matching
• 🏝️ PTO: 18 days
• 🚗 Transportation: reimbursement of office parking charges, or $120/month for public transport
• 🏀 Sport: $120/month reimbursement for gym membership
• 🥕 Meal stipend: $400 monthly allowance for meals
• 🌎 Visa sponsorship
• 🤝 Coaching: we offer BetterUp coaching on a voluntary basis

Similar jobs

🕒 1 month ago

TrueML

51 - 200

💳 Fintech

💸 Finance

👥 B2C

Senior Manager, DevOps leading infrastructure and platform engineering efforts at TrueML. Focus on cloud architecture and CI/CD standards for machine learning-driven products.

🇺🇸 United States – Remote

💵 $150,000 - $220,000 / year

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 1 month ago

Sweed POS

11 - 50

🛒 Retail

🛍️ eCommerce

🤝 B2B

DevOps Engineer optimizing infrastructure and implementing automation for Sweed's cannabis retail platform. Collaborate with global teams to enhance development and deployment processes.

🇺🇸 United States – Remote

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 1 month ago

Cyngn

51 - 200

🚗 Transportation

☁️ SaaS

🔧 Hardware

Deployment Engineer optimizing autonomy for Cyngn's autonomous robotic systems deployed across North America. Leading on-site deployments and ensuring customer satisfaction in a diverse team environment.

🇺🇸 United States – Remote

💵 $100,000 - $125,000 / year

💰 €20,000,000 Post-IPO Equity in 2022-04

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🦅 H1B visa sponsor

🗣️🇺🇸🇬🇧 English required

🕒 1 month ago

URBN (Urban Outfitters, Anthropologie Group, Free People & Nuuly)

10,000+ employees

👥 B2C

🛒 Retail

👗 Fashion

Senior DevOps Engineer optimizing cloud infrastructure on GCP for Nuuly. Leading CI/CD initiatives and collaborating with developers to enhance system performance.

🇺🇸 United States – Remote

⏰ Full-time

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required

🕒 1 month ago

GitLab

1001 - 5000

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Site Reliability Engineer for GitLab focusing on Environment Automation and managing isolated environments. Collaborating with the team to ensure reliability, scalability, and security of services.

🇺🇸 United States – Remote

💵 $103,600 - $222,000 / year

💰 Secondary Market in 2020-11

⏰ Full-time

🟡 Mid-level

🟠 Senior

⛑ DevOps and Site Reliability Engineer (SRE)

🗣️🇺🇸🇬🇧 English required