About Me
Hello, I am Tinghao Xie 谢廷浩, a 3rd-year ECE PhD candidate at Princeton, advised by Prof. Prateek Mittal. Previously, I received my Bachelor's degree in Computer Science and Technology from Zhejiang University.
My Research
I hope to fully explore the breadth and depth of safe, secure, robust, and reliable AI systems. Specifically:
- I am currently working on the safety and security of large (language / vision) models.
- I have been working on multi-modal safety and security. A draft will be shared soon!
- ✨ Check out our new LLM safety benchmark, 🥺SORRY-Bench – evaluate LLM safety refusal behaviors systematically!
- How easily can current text-to-image systems generate copyrighted content (and how can we prevent such generations)? Check out our 🐱CopyCat project.
- “AI safety” and “AI security” are different! See our position paper 📖 AI Risk Management Should Incorporate Both Safety and Security.
- LLM safety is brittle🫙 – removing as little as 3% of parameters / 2.5% of ranks can compromise model safety.
- Did you know that 🚨fine-tuning an aligned LLM can compromise its safety, even when users do not intend to? Check out our work on 🚨LLM Fine-tuning Risks [website] [paper] [code], which was exclusively reported by 📰The New York Times!
- I also have extensive research experience in DNN backdoor attacks and defenses for CV models:
- Check out my 📦backdoor-toolbox on GitHub, which has helped many backdoor researchers!
News & Facts
- [2024/05] I’m interning at Meta GenAI (Menlo Park, CA) this summer 🏖️. Feel free to reach out if you are nearby!
- [2024/05] Officially a PhD candidate now!
- [2023/10] Two papers accepted to ICLR 2024!
- [2023/10] Our preprint 🚨Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! is available. It was exclusively reported by 📰The New York Times and covered by many other media outlets!
- I enjoy: 🧗 Rock Climbing, Skiing, Open Water Diving, Basketball, Swimming, Billiards, Bowling…
Publications/Manuscripts
📖 SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Tinghao Xie*, Xiangyu Qi*, Yi Zeng*, Yangsibo Huang*, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
Preprint (Under Review)
📖 Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He*, Yangsibo Huang*, Weijia Shi*, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
Preprint (Under Review)
📖 AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek Mittal
Preprint (Under Review)
📖 Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
ICML 2024
📖 Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi*, Yi Zeng*, Tinghao Xie*, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal†, Peter Henderson†
ICLR 2024 (oral)
📰 This work was exclusively reported by The New York Times and covered by many other media outlets!
📖 BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
ICLR 2024
📖 Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
Xiangyu Qi, Tinghao Xie, Jiachen T. Wang, Tong Wu, Saeed Mahloujifar, Prateek Mittal
USENIX Security 2023
📖 Revisiting the Assumption of Latent Separability for Backdoor Defenses
Xiangyu Qi*, Tinghao Xie*, Yiming Li, Saeed Mahloujifar, Prateek Mittal
ICLR 2023
📖 Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks
Xiangyu Qi*, Tinghao Xie*, Ruizhe Pan, Jifeng Zhu, Yong Yang, Kai Bu
CVPR 2022 (oral)