  • 1 month ago

Job Description

We are a leading financial institution dedicated to pioneering trustworthy, reliable, and human-centric AI systems that change banking for the better. Capital One has been at the forefront of using machine learning to build intelligent, real-time, automated customer experiences. From alerting customers about unusual transactions to providing instant responses to... inquiries, our AI and ML applications prioritize simplicity and empathy in banking. With substantial investments in public cloud infrastructure and machine learning platforms, we are uniquely positioned to harness AI's transformative potential. Our commitment to building exceptional applied science and engineering teams drives our pursuit of breakthrough product experiences and scalable, high-performance AI infrastructure. Join us at Capital One and help reshape how we serve our customers and businesses through emerging AI capabilities.

Role Overview

As a Senior Distinguished Engineer in AI Systems, you'll play a pivotal role in establishing the bedrock of our enterprise AI capabilities. Your responsibilities will span diverse initiatives, including designing robust and secure infrastructure, constructing large-scale distributed training clusters, deploying advanced AI models for real-time applications, and supporting cutting-edge AI research and development—all within our public cloud infrastructure. Collaborating closely with a team of AI engineers and researchers, you'll help envision the future state of our capabilities while spearheading the design and implementation of key services. Sample projects you'll tackle include:

Architecting fault-tolerant infrastructure to support large-scale training tasks resiliently, leveraging containerization and checkpointing libraries.

Developing infrastructure for deploying and serving large ML models in our public cloud environment.

Orchestrating a thousand-node training cluster optimized for storage and networking efficiency, with tightly integrated training pipelines for parallelism strategies.

Designing and executing performance benchmarks for AI software systems, informing technology selection and optimization efforts.

Creating applications that harness Large Language Models (LLMs) and Foundation Models (FMs).

Establishing capabilities to facilitate MLOps for foundation models.

Basic Qualifications

Bachelor's degree in Computer Science, Computer Engineering, or a related technical field.

Minimum of 7 years' experience designing and building distributed computing, High-Performance Computing (HPC), and large-scale ML systems.

Minimum of 5 years' experience developing AI and ML algorithms in Python or C/C++.

Minimum of 3 years' experience with the full ML development lifecycle using AI and ML frameworks in public cloud environments.

Preferred Qualifications

Master's degree or PhD in Engineering, Computer Science, or a related technical field, or equivalent practical experience focusing on modern AI techniques.

Experience designing large-scale distributed platforms and systems in cloud environments such as AWS, Azure, or GCP.

Expertise in architecting cloud systems for security, availability, performance, scalability, and cost-effectiveness.

Proficiency in delivering large models through the MLOps lifecycle, from exploration to serving.

Hands-on experience building GPU clusters in public cloud environments with tightly integrated storage and networking.

Familiarity with the complete stack for distributed training of large models, including ML compilers and frameworks such as PyTorch, TensorFlow, and Lightning.

Experience in various areas of the AI technology stack, including prompt engineering, guardrails, vector databases/knowledge bases, LLM hosting, and fine-tuning.

Record of research publications in top peer-reviewed conferences, or notable industry contributions to neural networks, distributed training, and ML systems.

Note: Capital One is open to considering sponsorship for employment authorization for qualified applicants.

Salary Range: The minimum and maximum full-time annual salaries for this role vary by location. Please reach out for specific details.

Employment Type: Full-Time
Location: Atlanta, GA, USA

Apply - Distinguished Engineer, Generative AI Systems - Remote | WFH Atlanta