Member of Technical Staff @ Anthropic

Mike Merrill

About Me

I am a Member of Technical Staff at Anthropic, where I work on building new evals.

Previously I was a postdoc at Stanford Computer Science with Ludwig Schmidt, where I co-created Terminal-Bench. Terminal-Bench has been featured on model cards from virtually every frontier lab.

I did my PhD at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, where I was advised by Tim Althoff. During that time I was a Student Researcher at Google Research and an ML Research Intern at Apple Health AI.

Contact

Publications

This list may be out of date. See my Google Scholar for an up-to-date list.

...

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, and Ludwig Schmidt

ICLR, 2026 [PDF] [Website]

...

Language Models Still Struggle to Zero-shot Reason about Time Series

Mike A. Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, and Tim Althoff

EMNLP, 2024 [PDF] [Data & Code]

...

BLADE: Benchmarking Language Model Agents for Data-Driven Science

Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, and Tim Althoff

EMNLP, 2024 [PDF] [Data & Code]

...

Are Language Models Actually Useful for Time Series Forecasting?

Mingtian Tan, Mike A. Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen

NeurIPS [Spotlight 🔎], 2024 [PDF]

...

Transforming Wearable Data into Health Insights using Large Language Model Agents

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu

Nature Communications, 2024 [PDF] [Google Research Blog]

...

Homekit2020: A Benchmark for Time Series Classification on a Large Mobile Sensing Dataset with Laboratory Tested Ground Truth of Influenza Infections

Mike A. Merrill, Esteban Safranchik, Arinbjörn Kolbeinsson, Piyusha Gade, Ernesto Ramirez, Ludwig Schmidt, Luca Foschini, and Tim Althoff

CHIL, 2023 [PDF]

...

Self-supervised Pretraining and Transfer Learning Enable Flu and COVID-19 Predictions in Small Mobile Sensing Datasets

Mike A. Merrill and Tim Althoff

CHIL, 2023 [PDF]

...

CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis

Ge Zhang, Mike A. Merrill, Yang Liu, Jeffrey Heer, and Tim Althoff

EPJ Data Science, 2022 [PDF] *Co-First Author

...

Globem dataset: Multi-year datasets for longitudinal human behavior modeling generalization

Xuhai Xu, Han Zhang, Yasaman Sefidgar, Yiyi Ren, Xin Liu, Woosuk Seo, Jennifer Brown, Kevin Kuehn, Mike A. Merrill, Paula Nurius, Shwetak Patel, Tim Althoff, Margaret E. Morris, Eve Riskin, Jennifer Mankoff, and Anind K. Dey

NeurIPS, 2022 [PDF]

...

MULTIVERSE: Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Analysis Approaches

Mike A. Merrill, Ge Zhang, and Tim Althoff

KDD, 2021 [PDF]

...

CrossCheck: Integrating self-report, behavioral sensing, and smartphone use to identify digital indicators of psychotic relapse

Dror Ben-Zeev, Rachel Brian, Rui Wang, Weichen Wang, Andrew T. Campbell, Min S. H. Aung, Michael Merrill, Vincent W. S. Tseng, Tanzeem Choudhury, Marta Hauser, John M. Kane, and Emily A. Scherer

Psychiatric Rehabilitation Journal, 2017 [PDF]

...

CrossCheck: toward passive sensing and detection of mental health changes in people with schizophrenia

Rui Wang, Min S. H. Aung, Saeed Abdullah, Rachel Brian, Andrew T. Campbell, Tanzeem Choudhury, Marta Hauser, John Kane, Michael Merrill, Emily A. Scherer, Vincent W. S. Tseng, and Dror Ben-Zeev

Ubicomp, 2016 [PDF]

...

Assessing mental health issues on college campuses: Preliminary findings from a pilot study

Vincent W. S. Tseng, Michael Merrill, Franziska Wittleder, Saeed Abdullah, Min Hane Aung, and Tanzeem Choudhury

Ubicomp, 2016 [PDF]