
Hi, I am Simon.


I am a Computer Science Ph.D. student at Stanford University, part of the Scaling Intelligence Lab with Prof. Azalia Mirhoseini. I have also worked and collaborated with Prof. Christopher Ré in the Hazy Research Lab, and with Prof. Tatsunori Hashimoto through the rotation program.

I studied Electrical Engineering and Computer Sciences as an undergrad at Berkeley. I was fortunate to be involved in the SLICE Lab working with Prof. Sophia Shao, and the RISE Lab working with Prof. Ion Stoica.

My work centers on the intersection of computer systems and machine learning, with a focus on enabling scaling and progressing toward self-improvement. Most recently, I spent some time pre-training language models at Cohere and experimenting with evolutionary algorithms at Sakana AI. Previously, I designed GPUs at Apple, scaled out distributed systems at Anyscale, and made cars drive themselves at NVIDIA DRIVE.

If you are interested in my journey, please check out the rest of this site. Feel free to contact me at simonguo [@] stanford dot edu.

Resume CV

Research


Currently, I'm interested in the interplay of machine learning and systems that enables self-improvement, as well as in improving models' code-generation capabilities through post-training and scaling synthetic data. I've published at machine learning and computer systems conferences.

Highlights:

Kevin: Multi-Turn RL for Generating CUDA Kernels

Carlo Baronio*, Pietro Marsella*, Ben Pan*, Simon Guo, Silas Alberti

EXAIT and ES-FoMo-III Workshops at International Conference on Machine Learning (ICML), 2025

Multi-Turn RL Training for Generating CUDA Kernels


arxiv
KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang*, Simon Guo*, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

* indicates equal contribution

International Conference on Machine Learning (ICML), 2025
DL4C (Best Paper) & SSI-FM Workshop at International Conference on Learning Representations (ICLR), 2025

Benchmark and environment to evaluate LLMs' ability to generate efficient GPU kernels


arxiv
BAM! Just Like that, Simple and Efficient Parameter Upcycling for Mixture of Experts

Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Üstün, Acyr Locatelli

Conference on Neural Information Processing Systems (NeurIPS), 2024
NGSM (Spotlight) and ES-FoMo-II Workshop at International Conference on Machine Learning (ICML), 2024

Upcycling MoE with Mixture-of-Attention for more efficient MoE pre-training


arxiv
Parallelism in Bundle Adjustment for SLAM

Simon Zirui Guo, Yakun Sophia Shao

ACM Student Research Competition at IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022

Speed up SLAM by exploiting structural sparsity and custom kernels on Tensor Cores


Extended Abstract
Gemmini: An Open-Source, Full-System DNN Accelerator Design and Evaluation Platform

Hasan Genc, Seah Kim, Vadim Vadimovich Nikiforov, Simon Zirui Guo, Borivoje Nikolić, Krste Asanović, Yakun Sophia Shao

First Workshop on Open-Source Computer Architecture Research (OSCAR) at ACM/IEEE International Symposium on Computer Architecture (ISCA), 2022

Design Space Exploration for DNN accelerators across the stack


Workshop Presentation
D3: A Dynamic Deadline-Driven Approach for Building Autonomous Vehicles

Ionel Gog, Sukrit Kalra, Peter Schafhalter*, Joseph E. Gonzalez, Ion Stoica

* I worked as an undergraduate research assistant for this author

European Conference on Computer Systems (EuroSys), 2022

OS for self-driving cars and robots


ACM