• AI Chip Architect - Reliability Focus China
  • Michael Page in China, China
  • 4 weeks ago

About Our Client

Our client is a fast-growing AI chip company that plans to go public (IPO) in early 2025.

Job Description

1. Design, develop, and optimize the reliability architecture of our AI supercomputing systems to meet high-availability and performance objectives.
2. Collaborate with multi-disciplinary teams to understand system requirements and devise strategies to meet these goals.
3. Drive root cause analysis of reliability issues and devise plans to improve system robustness and uptime.
4. Develop reliability prediction models and carry out regular system risk assessments.
5. Analyze failure modes, predict future failures, and develop strategies to minimize downtime.
6. Support the creation and execution of test strategies to verify system performance and reliability.
7. Stay abreast of advancements in AI, supercomputing, and reliability engineering to help inform future system design.

The Successful Applicant

1. A master's degree in Computer Engineering, Electrical Engineering, or a related field. An advanced degree would be a plus.
2. Proven experience in system architecture, with a focus on AI supercomputing and reliability engineering.
3. Strong knowledge of GPU architecture, high performance computing (HPC), and deep learning applications.
4. Familiarity with hardware testing, fault detection, and fault-tolerant systems.
5. Strong analytical and problem-solving skills.
6. Excellent communication skills to articulate complex technical issues to diverse teams.

What's on Offer

Opportunity to make a big impact on the AI chip market

