Staff Software Engineer, GPU Infrastructure – HPC

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Artificial Intelligence • Enterprise • SaaS

Cohere is a leading AI platform that provides enterprises with advanced language models and an integrated workspace designed for efficiency and security. With a family of high-performance generative and retrieval models, Cohere enables organizations to streamline workflows, improve data security, and uncover insights across industries through multilingual capabilities. Its focus on customized AI solutions protects critical data while easing integration into existing organizational processes.

🕒 January 16

🏄 California – Remote

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Cloud

Kubernetes

Linux

Python

PyTorch

TensorFlow

Description

• Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
• Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
• Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
• Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
• Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
• Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
• Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

🎯 Requirements

• Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
• Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
• Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
• Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
• Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
• Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)

Similar Jobs

Principal Software Engineer – Digital Care Platform

🕒 January 15

Essen Health Care

501 - 1000

🏥 Healthcare

Principal Software Engineer focused on rebuilding ecares, a healthcare coordination platform. Leading development efforts in a startup environment to optimize patient care delivery and collaboration.

🇺🇸 United States – Remote (USA)

💵 $175,000 - $250,000 / year

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🗣️🇺🇸🇬🇧 English required

Principal Software Engineer

🕒 January 9

HappyCo

51 - 200

☁️ SaaS

🏠 Real Estate

Principal Software Engineer leading architectural design and technical strategy at HappyCo. Focused on modernizing SaaS platform while ensuring system stability and efficiency.

🇺🇸 United States – Remote (USA)

💰 $52,000,000 Venture Round in 2022-01

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🗣️🇺🇸🇬🇧 English required

Staff Software Engineer

🕒 January 9

Kin Insurance

501 - 1000

🛡️ Insurance

💸 Finance

👥 B2C

Staff Engineer developing products and features at Kin, a digital insurer focused on smart and fast home insurance solutions. Leading engineering projects and mentoring team members in a Ruby on Rails environment.

🇺🇸 United States – Remote (USA)

💵 $152,000 - $200,000 / year

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

AWS

JavaScript

NoSQL

RDBMS

Ruby

Ruby on Rails

SQL

TypeScript

Staff Software Engineer

🕒 January 8

Zippy

51 - 200

🛡️ Insurance

💸 Finance

💳 Fintech

Staff Engineer at Zippy architecting large, scalable technical solutions for the digital loan platform in manufactured housing. Leading technical initiatives and mentoring engineers across the organization.

🇺🇸 United States – Remote (USA)

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🗣️🇺🇸🇬🇧 English required

Cloud

Distributed Systems

Staff Software Engineer

🕒 January 8

SOCKET

51 - 200

📡 Telecommunications

Staff Software Engineer contributing to Socket web application development and team building. Engaging with design teams for user experience and helping to shape product roadmap.

🇺🇸 United States – Remote (USA)

⏰ Full Time

🔴 Expert

🧑‍💻 Full-stack Engineer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

ElasticSearch

GraphQL

JavaScript

Node.js

Postgres

React

TypeScript