About Me
Hello, I am Tinghao Xie 谢廷浩, a 3rd-year ECE PhD candidate at Princeton, advised by Prof. Prateek Mittal. Previously, I received my Bachelor's degree in Computer Science and Technology from Zhejiang University.
My Research
I hope to fully explore the breadth and depth of safe, secure, robust, and reliable AI systems. Specifically:
I currently work on large (language / vision) model safety and security.
- I have recently been working on multimodal safety and security. We all know generative AI has safety issues, but can we use generative AI tools to improve safety? In our recent [paper], we show that GenAI tools like T2I models and LLMs can help us automatically identify vulnerabilities in (NSFW) image classifiers!
- 🔍Our red-teaming results indicate that current NSFW image classifiers can be fooled by context shifts of benign visual elements – e.g., GPT-4o fails to recognize nude content when the image atmosphere appears misty and serene.
- 🚨Such issues directly translate to safety vulnerabilities in commercial T2I products – we show it is possible to rewrite NSFW prompts (by adding certain benign elements) and lure DALLE-3 (or ChatGPT) into generating nude images!
- ✨ Check out our new LLM safety benchmark, 🥺SORRY-Bench – evaluate LLM safety refusal systematically!
- How easily can current text-to-image systems generate copyrighted content (and how can we prevent such generations)? Check out our 🐱CopyCat project.
- “AI safety” and “AI security” are different! See our position paper 📖 AI Risk Management Should Incorporate Both Safety and Security.
- LLM safety is brittle🫙 – removing as little as 3% of parameters / 2.5% of ranks can compromise model safety.
- Did you know that 🚨fine-tuning an aligned LLM can compromise its safety, even when users do not intend to? Check out our work on 🚨LLM Fine-tuning Risks [website] [paper] [code], which was exclusively reported by 📰The New York Times!
I also have extensive research experience in DNN backdoor attacks and defenses for CV models:
- Check out my 📦backdoor-toolbox on GitHub, which has helped many backdoor researchers!
News & Facts
- [2025/03] Looking for internship opportunities for 🏝️summer or 🍂fall 2025!
- [2025/01] Three papers accepted by ICLR 2025!
- [2024/05] I’m interning at Meta GenAI (Menlo Park, CA) this summer 🏖️. Feel free to reach out if you are nearby!
- [2024/05] Officially a PhD candidate now!
- [2023/10] Two papers accepted by ICLR 2024!
- [2023/10] Our preprint 🚨Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! is available. It was exclusively reported by 📰The New York Times and covered by many other media outlets!
- I enjoy: 🧗 Rock Climbing, Skiing, Open Water Diving, Basketball, Swimming, Billiards, Bowling…
Publications/Manuscripts
📖 Red-teaming NSFW Image Classifiers with Generative AI Tools
Tinghao Xie, Yueqi Xie, Alireza Zareian, Shuming Hu, Felix Juefei-Xu, Xiaowen Lin, Ankit Jain, Prateek Mittal, Li Chen
Preprint (Under Review)
📖 SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Tinghao Xie*, Xiangyu Qi*, Yi Zeng*, Yangsibo Huang*, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ICLR 2025
📖 On Evaluating the Durability of Safeguards for Open-Weight LLMs
Xiangyu Qi*, Boyi Wei*, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson
ICLR 2025
📖 Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He*, Yangsibo Huang*, Weijia Shi*, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
ICLR 2025
📖 AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek Mittal
Preprint (Under Review)
📖 Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
ICML 2024
📖 Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi*, Yi Zeng*, Tinghao Xie*, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal†, Peter Henderson†
ICLR 2024 (oral)
📰 This work was exclusively reported by The New York Times and covered by many other media outlets!
📖 BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
ICLR 2024
📖 Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
Xiangyu Qi, Tinghao Xie, Jiachen T. Wang, Tong Wu, Saeed Mahloujifar, Prateek Mittal
USENIX Security 2023
📖 Revisiting the Assumption of Latent Separability for Backdoor Defenses
Xiangyu Qi*, Tinghao Xie*, Yiming Li, Saeed Mahloujifar, Prateek Mittal
ICLR 2023
📖 Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks
Xiangyu Qi*, Tinghao Xie*, Ruizhe Pan, Jifeng Zhu, Yong Yang, Kai Bu
CVPR 2022 (oral)