Site Reliability Engineer


11 - 50 employees

🔧 Hardware

🏢 Enterprise

🤖 Artificial Intelligence

💰 €10,000,000 Seed Round in 2022-04

Hydra Host is a provider of high-performance computing solutions, offering access to dedicated bare-metal GPU servers optimized for AI and HPC workloads. Its platform lets users access and rent top-tier GPUs worldwide, with strong performance, enhanced security, and full customization options. Hydra Host's infrastructure includes a marketplace, known as Brokkr, that offers a wide range of GPU configurations and solutions suited to critical applications such as artificial intelligence, big data, and machine learning. With its robust, secure, and scalable solutions, Hydra Host gives customers full control over their server environments, with options for scalability and future-proofing. The company's offerings are recognized by leading companies seeking efficient and innovative computing solutions.


🕒 7 months ago

🐊 Florida – Remote

💵 $140,000 - $200,000 / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Cloud

Grafana

Kubernetes

Prometheus

Python


Hydra Host


Description

• Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments
• Evaluate and integrate monitoring and QA tools to find the right tools for the job
• Create a unified monitoring platform and processes that datacenter and device teams will integrate with to monitor their components (live servers, lifecycle, networks, power, etc.)
• Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms
• Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments
• Integrate all of the aforementioned systems to create holistic platform health statistics reporting
• Design disaster-recovery processes in collaboration with DevOps
• Ensure we are meeting uptime SLAs across all platform deployments
• Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs
• Establish observability standards across the stack: logs, metrics, traces, alerts, and actionable on-call playbooks
• Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability
• Drive incident response, root cause analysis, and post-mortems
• Turn incident findings into tooling and process improvements
• Establish the monitoring infrastructure and dashboards that enable everyone, from engineers to execs, to know what's going on
• Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets

🎯 Requirements

• 5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments
• Deep familiarity with monitoring and observability tooling: you've implemented and managed such systems, especially Prometheus, Grafana, and Zabbix
• Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)
• Track record of managing production system uptime and SLAs and building tools to support them
• Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and processes
• Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks
• Strong proficiency with infrastructure as code and DevOps workflows
• Experience with distributed tracing, log aggregation, and alert tuning
• Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently
• Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders

🏖️ Benefits

• Competitive compensation: base salary + performance bonus + equity
• Exposure to high-performance computing and state-of-the-art GPU environments
• A core role in ensuring our systems are reliable, observable, and meet customer SLAs
• Remote work environment with a strong culture of ownership and autonomy
• No red tape: find the right solution, work with the team, get feedback, and get the job done


Similar Jobs

Intermediate DevOps Engineer

🕒 7 months ago

AbacusNext

201 - 500

☁️ SaaS

🤝 B2B

DevOps Engineer designing and implementing automation processes at CARET, enhancing efficiency for legal and accounting firms. Collaborates with different teams leveraging cloud technologies for optimal service delivery.

🇺🇸 United States – Remote

💵 $90,000 - $110,000 / year

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Ansible

AWS

Azure

Cloud

DNS

Docker

Kubernetes

MongoDB

Python

Redis

SQL

Terraform

Senior DevOps Engineer – Application Deployment

🕒 8 months ago

SkillTude Talent Solutions

1 - 10

🎯 Recruiting

👥 HR Tech

☁️ SaaS

Senior DevOps Engineer responsible for application deployments and cloud infrastructure management. Collaborating with teams to automate processes and ensure high performance of applications.

🇺🇸 United States – Remote

💵 $10,000 / month

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

AWS

Azure

Cloud

Docker

Grafana

Kubernetes

Prometheus

Python

Terraform

Vault

Web3 Infrastructure DevOps Engineer

🕒 8 months ago

Generative AI

51 - 200

DevOps Engineer managing decentralized infrastructure involving blockchain systems at Loti AI, Inc. Engage in on-chain likeness rights enforcement and deploy IPFS and Arweave solutions.

🇺🇸 United States – Remote

⏰ Full Time

🟡 Mid-level

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Docker

IPFS

Kubernetes

Senior Full Stack Developer – TypeScript, DevOps

🕒 8 months ago

Ten Mile Square Technologies, LLC.

11 - 50

🏢 Enterprise

☁️ SaaS

🔒 Cybersecurity

Senior Full Stack TypeScript Developer designing a loan origination system for a financial services enterprise. Collaborating with tech leads and product teams using cutting-edge technologies.

🇺🇸 United States – Remote

💵 $161,000 / year

⏰ Full Time

🟠 Senior

⛑ DevOps & SRE Engineer

🗣️🇺🇸🇬🇧 English required

Angular

Apollo

AWS

GraphQL

Jenkins

Linux

Node.js

SQL

TypeScript

DevOps Engineer

🕒 8 months ago

Resolve Tech Solutions

501 - 1000

☁️ SaaS

🏢 Enterprise

🤖 Artificial Intelligence