I am a Member of Technical Staff at Anthropic, where I work on building new evals.
Previously I was a postdoc at Stanford Computer Science with Ludwig Schmidt, where I co-created Terminal-Bench. Terminal-Bench has been featured on model cards from virtually every frontier lab.
I did my PhD at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, where I was advised by Tim Althoff. I was also a Student Researcher at Google Research and an ML Research Intern at Apple Health AI.
This list may be out of date. See my Google Scholar for an up-to-date list.
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, and Ludwig Schmidt
Language Models Still Struggle to Zero-shot Reason about Time Series
, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, and Tim Althoff
EMNLP, 2024 [PDF] [Data & Code]
BLADE: Benchmarking Language Model Agents for Data-Driven Science
Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, , Jeffrey Heer, and Tim Althoff
EMNLP, 2024 [PDF] [Data & Code]
Are Language Models Actually Useful for Time Series Forecasting?
Mingtian Tan, , Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen
NeurIPS [Spotlight], 2024 [PDF]
Transforming Wearable Data into Health Insights using Large Language Model Agents
, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu
Nature Communications, 2024 [PDF] [Google Research Blog]
Homekit2020: A Benchmark for Time Series Classification on a Large Mobile Sensing Dataset with Laboratory Tested Ground Truth of Influenza Infections
, Esteban Safranchik, Arinbjorn Kolbeinsson, Piyusha Gade, Ernesto Ramirez, Ludwig Schmidt, Luca Foschini, and Tim Althoff
CHIL, 2023 [PDF]
Self-supervised Pretraining and Transfer Learning Enable Flu and COVID-19 Predictions in Small Mobile Sensing Datasets
and Tim Althoff
CHIL, 2023 [PDF]
CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis
Ge Zhang, , Yang Liu, Jeffrey Heer, and Tim Althoff
EPJ Data Science, 2022 [PDF] *Co-First Author
Globem dataset: Multi-year datasets for longitudinal human behavior modeling generalization
Xuhai Xu, Han Zhang, Yasaman Sefidgar, Yiyi Ren, Xin Liu, Woosuk Seo, Jennifer Brown, Kevin Kuehn, , Paula Nurius, Shwetak Patel, Tim Althoff, Margaret E. Morris, Eve Riskin, Jennifer Mankoff, and Anind K. Dey
NeurIPS, 2022 [PDF]
MULTIVERSE: Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Analysis Approaches
, Ge Zhang, and Tim Althoff
KDD, 2021 [PDF]
CrossCheck: Integrating self-report, behavioral sensing, and smartphone use to identify digital indicators of psychotic relapse
Dror Ben-Zeev, Rachel Brian, Rui Wang, Weichen Wang, Andrew T. Campbell, Min S. H. Aung, , Vincent W. S. Tseng, Tanzeem Choudhury, Marta Hauser, John M. Kane, and Emily A. Scherer
Psychiatric Rehabilitation Journal, 2017 [PDF]
CrossCheck: toward passive sensing and detection of mental health changes in people with schizophrenia
Rui Wang, Min S. H. Aung, Saeed Abdullah, Rachel Brian, Andrew T. Campbell, Tanzeem Choudhury, Marta Hauser, John Kane, , Emily A. Scherer, Vincent W. S. Tseng, and Dror Ben-Zeev
Ubicomp, 2016 [PDF]
Assessing mental health issues on college campuses: Preliminary findings from a pilot study
Vincent W. S. Tseng, , Franziska Wittleder, Saeed Abdullah, Min Hane Aung, and Tanzeem Choudhury
Ubicomp, 2016 [PDF]