AI Engineer - Model Performance
Fathom
San Francisco, CA
Engineering, Information Technology
About Fathom
We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity. From instant, searchable call summaries to seamless CRM updates and team-wide sharing, Fathom transforms meetings from a source of friction into a place for alignment and momentum.
We're a small company that creates magical experiences through the hard work of focused builders. We try to live our values - Care Deeply, Seek Leverage, Share Ownership, Sustain Urgency, and Be Tenacious - in everything we do, every day.
We started Fathom to rid us all of the tyranny of note-taking, and people seem to really love what we've built so far:
- #1 Most Used App of the Year on HubSpot for 2025
- #1 Rated on G2 with 4,500+ reviews and a perfect 5/5 rating
- #1 Product of the Day and #2 AI Product of the Year
- Most installed AI meeting assistant on both the Zoom and HubSpot marketplaces
- We're hitting revenue and usage records every week
We think you'll be pretty excited about Fathom too if you give it a try. Sign up today (it's free)!
Role Overview
We're hiring a Model Performance Engineer to own the speed, cost, and reliability of our model inference stack, and to build the fine-tuning infrastructure that makes the rest of the AI team faster.
This is not a research role. You'll be optimizing real systems serving millions of meetings: choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable.
You'll own two things:
- Inference performance. You'll make our models faster and cheaper through speculative decoding, quantization, serving configuration, GPU selection, batching strategies, cold start mitigation, and adapter swapping. Our traffic is extremely spiky (meetings end in 30-minute blocks), so you need to think in terms of throughput curves rather than average load. Our team places a high value on offering a fast product.
- Fine-tuning pipelines. The AI team constantly fine-tunes models for new tasks: distilling large teacher models for classification, training adapters for domain-specific behavior, DPO for preference tuning. Right now each project reinvents the training loop. You'll build repeatable infrastructure so an AI Engineer can go from dataset to deployed model more quickly.
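To make the "throughput curves" point in the first item concrete, here is a minimal sketch of a concurrency sweep that reports p50 and p99 latency per level. It is an illustration only: `fake_request`, `run_level`, `sweep`, the concurrency levels, and the 1 ms delay are all hypothetical stand-ins, not part of Fathom's stack. A real harness would call the serving endpoint instead of sleeping.

```python
import asyncio
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def fake_request(delay_s):
    # Hypothetical stand-in for one inference call; swap in a real client call.
    start = time.perf_counter()
    await asyncio.sleep(delay_s)
    return (time.perf_counter() - start) * 1000  # service latency in ms


async def run_level(concurrency, total_requests, delay_s=0.001):
    """Issue total_requests with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def one():
        async with sem:
            return await fake_request(delay_s)

    return await asyncio.gather(*(one() for _ in range(total_requests)))


def sweep(levels=(1, 8, 32), total_requests=64):
    """Return {concurrency: {"p50_ms": ..., "p99_ms": ...}} for each level."""
    results = {}
    for level in levels:
        latencies = asyncio.run(run_level(level, total_requests))
        results[level] = {
            "p50_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99),
        }
    return results


if __name__ == "__main__":
    for level, stats in sweep().items():
        print(level, stats)
```

With a real backend, the interesting signal is how far p99 pulls away from p50 as the concurrency level rises; the sleeping stand-in here will not show that effect.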
Examples of the work:
- Benchmark FP8 quantization across GPU families, find that FP8 KV cache causes catastrophic repetition loops, identify static quantization as 6% faster than dynamic on certain hardware, and ship a production config that gets a 1.3x speedup with <1% quality degradation
- Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding: discover that ngram speculation degrades ASR quality while EAGLE3 draft models don't, and that torch.compile makes certain GPUs 7% slower
- Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized tune ready for serving, so a teammate can train a small classifier in an afternoon instead of a week
- Optimize GPU spend: know which GPU families are best for batch workloads (stable under high concurrency) vs latency-sensitive paths (40% faster, but tail latency blows up under load), and when a 30% cost premium isn't worth it
- Debug production inference issues: trace a quality regression to a serving framework upgrade that changed the default attention backend, or find that audio format handling in the multimodal pipeline silently drops segments
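The GPU-spend example above comes down to cost per unit of work: a price premium only pays off when the sustained speedup exceeds it. A minimal sketch of that arithmetic follows. The 40% speedup and 30% premium are the figures from the posting; the hourly price, the baseline throughput, the 1.10x speedup under load, and the names `GpuOption`, `cost_per_1k_requests`, and `premium_pays_off` are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    hourly_cost_usd: float    # hypothetical price
    requests_per_hour: float  # hypothetical sustained throughput


def cost_per_1k_requests(gpu: GpuOption) -> float:
    """Dollars spent to serve 1,000 requests at the sustained throughput."""
    return 1000 * gpu.hourly_cost_usd / gpu.requests_per_hour


def premium_pays_off(speedup: float, cost_premium: float) -> bool:
    """True when the throughput gain outweighs the relative price increase."""
    return speedup > 1 + cost_premium


# Baseline GPU versus a GPU that costs 30% more.
baseline = GpuOption("gpu_a", 2.00, 10_000)
# At light load the pricier GPU delivers the full 40% speedup.
fast_light_load = GpuOption("gpu_b", 2.00 * 1.30, 10_000 * 1.40)
# Assumed: under heavy load the effective speedup shrinks to 10%.
fast_heavy_load = GpuOption("gpu_b", 2.00 * 1.30, 10_000 * 1.10)

if __name__ == "__main__":
    for option in (baseline, fast_light_load, fast_heavy_load):
        print(option.name, round(cost_per_1k_requests(option), 4))
```

Under these assumed numbers the faster GPU is cheaper per request at light load and more expensive once its effective speedup falls below 1.30x, which is the break-even point for a 30% premium.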
Requirements
Hard Skills:
- Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar): not just deploying them, but tuning them, including attention backends, scheduling strategies, CUDA graph warmup, and prefix caching
- Hands-on quantization experience: you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
- Production fine-tuning experience: LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
- Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines, not notebooks
- Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead
- Cost modeling for GPU infrastructure: you've had to choose between GPU types and justify the tradeoff
Nice to Have:
- Experience with multimodal models (audio/vision encoders + LLM decoders)
- Experience with Modal, Ray Serve, or similar serverless GPU platforms
- Understanding of audio processing (codecs, chunking, sample rates)
- Experience building internal tooling that other engineers use. This role succeeds when the rest of the team ships faster
Not Required:
- ML research background or publications
- Prompt engineering expertise (we have a team for that)
- Frontend or full-stack experience
- Masters/PhD (though it's fine if you have one)
What We Offer:
- The opportunity to shape the foundational software services of a growing company
- A role that balances innovation and incremental improvement
- A dynamic and collaborative engineering team
- Competitive compensation and benefits
- A supportive environment that encourages innovation and personal growth
- Opportunity for impact. We're established enough to ship instead of fighting fires and early enough that your work will have a real impact.
- Startup experience. You'll work closely with our CEO, a 2x Founder/CEO with a background in computer science and product design.
- We embrace being fully remote. We schedule meetings sparingly and instead rely heavily on async comms (Slack, Notion, Loom).
Our Hiring Process:
- You'll meet the entire team. We think it's important that you get to meet everyone you'll be working with.
- No bullshit. Ask us anything you like. We've never understood why companies pretend they're something that they're not in the hiring process. You're going to find out eventually, so we'd rather you know who we are up front so we can both make sure this is a good fit for all involved.
- Quick turnaround time. We know you have lots of options, so we move fast, usually in less than a week from start to finish.
Include a brief write-up or demo of inference optimization or model serving work you've done. We care about the reasoning behind your decisions: why you chose a specific quantization strategy, how you diagnosed a performance regression, what tradeoffs you navigated. A GitHub repo, blog post, or even a few paragraphs in your cover letter works.



