Staff Software Engineer, GPU Infrastructure – HPC

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Artificial Intelligence • Enterprise • SaaS

Cohere is a leading AI platform that offers enterprises advanced language models and an integrated workspace designed for efficiency and security. With a range of powerful generative and retrieval models, Cohere enables organizations to streamline workflows, improve data security, and unlock insights across industries through multilingual capabilities. Its focus on tailored AI solutions protects critical data and eases integration into existing organizational processes.

🕒 6 months ago

🏄 California – Remote

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Cloud

Kubernetes

Linux

Python

PyTorch

TensorFlow


Description

• Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
• Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
• Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
• Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
• Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
• Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
• Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

🎯 Requirements

• Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
• Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
• Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
• Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
• Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
• Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)

Similar Jobs

Principal Software Engineer – Digital Care Platform

🕒 6 months ago

Essen Health Care

501 - 1000

🏥 Healthcare

Principal Software Engineer focused on rebuilding ecares, a healthcare coordination platform. Leading development efforts in a startup environment to optimize patient care delivery and collaboration.

🇺🇸 United States – Remote

💵 $175,000 - $250,000 / year

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🗣️🇺🇸🇬🇧 English required

Principal Software Engineer

🕒 6 months ago

HappyCo

51 - 200

☁️ SaaS

🏠 Real Estate

Principal Software Engineer leading architectural design and technical strategy at HappyCo. Focused on modernizing SaaS platform while ensuring system stability and efficiency.

🇺🇸 United States – Remote

💰 €52,000,000 Venture Round in 2022-01

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🗣️🇺🇸🇬🇧 English required

Staff Software Engineer

🕒 6 months ago

Kin Insurance

501 - 1000

🛡️ Insurance

💸 Finance

👥 B2C

Staff Engineer developing products and features at Kin, a digital insurer focused on smart and fast home insurance solutions. Leading engineering projects and mentoring team members in a Ruby on Rails environment.

🇺🇸 United States – Remote

💵 $152,000 - $200,000 / year

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

AWS

JavaScript

NoSQL

RDBMS

Ruby

Ruby on Rails

SQL

TypeScript

Staff Software Engineer

🕒 6 months ago

Zippy

51 - 200

🛡️ Insurance

💸 Finance

💳 Fintech

Staff Engineer at Zippy architecting large, scalable technical solutions for the digital loan platform in manufactured housing. Leading technical initiatives and mentoring engineers across the organization.

🇺🇸 United States – Remote

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🗣️🇺🇸🇬🇧 English required

Cloud

Distributed Systems

Staff Software Engineer

🕒 6 months ago

SOCKET

51 - 200

📡 Telecommunications

Staff Software Engineer contributing to Socket web application development and team building. Engaging with design teams for user experience and helping to shape product roadmap.

🇺🇸 United States – Remote

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

ElasticSearch

GraphQL

JavaScript

Node.js

Postgres

React

TypeScript