Staff Software Engineer, GPU Infrastructure – HPC

🕒 January 16

🏄 California – Remote

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🦅 Sponsors H1B Visa

🗣️🇺🇸🇬🇧 English required

Cohere

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading AI platform that provides enterprises with advanced language models and an integrated workspace designed for efficiency and security. With a family of high-performance generative and retrieval models, Cohere enables organizations to streamline workflows, improve data security, and uncover insights across industries through multilingual capabilities. Its focus on customized AI solutions ensures the protection of critical data while facilitating seamless integration into existing organizational processes.

Description

• Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
• Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
• Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
• Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
• Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
• Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
• Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

🎯 Requirements

• Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
• Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
• Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
• Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
• Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
• Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)

Similar Jobs

🕒 January 15

Principal Software Engineer focused on rebuilding ecares, a healthcare coordination platform. Leading development efforts in a startup environment to optimize patient care delivery and collaboration.

🇺🇸 United States – Remote (US)

💵 $175,000 - $250,000 / year

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🗣️🇺🇸🇬🇧 English required

🕒 January 14

CivicPlus

501 - 1000

📋 Compliance

🏛️ Government

☁️ SaaS

Principal Software Engineer role at CivicPlus involves leading technical integration across acquisitions, fostering engineering culture, and providing architectural leadership for SaaS systems.

🇺🇸 United States – Remote (US)

💵 $145,000 - $225,000 / year

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🦅 Sponsors H1B Visa

🗣️🇺🇸🇬🇧 English required

🕒 January 12

Agiloft

201 - 500

🏢 Enterprise

☁️ SaaS

🤖 Artificial Intelligence

Software Engineer developing core platform features for Agiloft's data-first contract lifecycle management software. Building scalable backend services and collaborating with design and product teams.

🇺🇸 United States – Remote (US)

💰 $45,000,000 Private Equity Round in 2020-08

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🦅 Sponsors H1B Visa

🗣️🇺🇸🇬🇧 English required

🕒 January 12

Iterable

501 - 1000

🤖 Artificial Intelligence

🤝 B2B

Principal Engineer driving technical strategy and architectural coherence at Iterable. Responsible for engineering excellence across several key product areas.

🇺🇸 United States – Remote (US)

💰 $200,000,000 Series E in 2021-06

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🦅 Sponsors H1B Visa

🗣️🇺🇸🇬🇧 English required

🕒 January 12

Prefect

51 - 200

🤖 Artificial Intelligence

☁️ SaaS

🏢 Enterprise

Staff Product Engineer at Prefect developing end-to-end solutions for AI-driven workflows. Collaborating with senior engineers and leadership to innovate in the fast-evolving AI tooling space.

🇺🇸 United States – Remote (US)

💵 $225,000 - $280,000 / year

⏰ Full-time

🔴 Specialist

🧑‍💻 Full-stack Engineer

🗣️🇺🇸🇬🇧 English required