Staff Software Engineer, GPU Infrastructure – HPC

🕒 4 months ago

🏄 California – Remote

⏰ Full-time

🔴 Expert

🧑‍💻 Full-Stack Developer

🦅 H1B Visa Sponsor

🗣️🇺🇸🇬🇧 English required

Cohere

11 - 50 employees

🤖 Artificial Intelligence

🏢 Enterprise

☁️ SaaS

Cohere is a leading AI platform that offers enterprises advanced language models and an integrated workspace designed for efficiency and security. With a range of powerful generative and retrieval models, Cohere enables organizations to streamline workflows, improve data security, and unlock insights across industries through multilingual capabilities. Its focus on tailored AI solutions protects critical data and eases integration into existing organizational processes.

Description

• Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
• Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
• Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
• Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
• Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
• Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
• Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

🎯 Requirements

• Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
• Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
• Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
• Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
• Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
• Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

🏖️ Benefits

• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
• 6 weeks of vacation (30 working days!)
