About Me

Hello, I am Tinghao Xie 谢廷浩, a 4th-year ECE PhD candidate at Princeton, advised by Prof. Prateek Mittal. Previously, I was a research intern at Meta (GenAI). I earned my Bachelor's degree in Computer Science at Zhejiang University.

My Research

I hope to fully explore the breadth and depth of safe, secure, robust, and reliable AI systems. Specifically:

  • I currently work on the safety and security of large (language / vision) models.

    • I have recently been working on safety and security problems in multi-modal systems. For example, NSFW image classifiers are a typical safeguard for today's text-to-image (T2I) systems – they check whether a generated image is safe and block anything NSFW. However, in our recent [paper]:
      • We show these classifiers can be systematically fooled when benign visual elements of an image are shifted. For instance, while an NSFW image of “🖼️a nude person in an empty scene” is easily blocked by most NSFW classifiers, a stealthier one depicting “🖼️a nude person blending into a group of dressed people” may evade detection.
      • 🚨Alarmingly, we show these failures translate to real-world T2I(V) systems, including DALL-E 3, Sora, Gemini, and Grok, where malicious users can rewrite NSFW prompts (by adding certain benign elements) to jailbreak these systems into generating nude images. For example, querying DALL-E 3 and Imagen 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from 0% to over 44%.
    • ✨ Check out our new LLM safety benchmark, 🥺SORRY-Bench – evaluate LLM safety refusal systematically!
    • How easily can current text-to-image systems generate copyrighted content (and how can we prevent such generations)? Check out our 🐱CopyCat project.
    • “AI safety” and “AI security” are different! See our position paper 📖 AI Risk Management Should Incorporate Both Safety and Security.
    • LLM safety is brittle🫙 – removing as little as 3% of parameters / 2.5% of ranks can compromise model safety.
    • Did you know that 🚨fine-tuning an aligned LLM can compromise its safety, even when users do not intend to? Check out our work on 🚨LLM Fine-tuning Risks [website] [paper] [code], which was exclusively reported by 📰The New York Times!
  • I also have extensive research experience in DNN backdoor attacks and defenses for CV models (see my backdoor-related publications below).

News & Facts

Publications/Manuscripts

📖 Red-teaming NSFW Image Classifiers as Text-to-Image Safeguards
Tinghao Xie, Yueqi Xie, Alireza Zareian, Shuming Hu, Felix Juefei-Xu, Xiaowen Lin, Ankit Jain, Prateek Mittal, Li Chen
Preprint (Under Review)

📖 SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Tinghao Xie*, Xiangyu Qi*, Yi Zeng*, Yangsibo Huang*, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ICLR 2025

📖 On Evaluating the Durability of Safeguards for Open-Weight LLMs
Xiangyu Qi*, Boyi Wei*, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson
ICLR 2025

📖 Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He*, Yangsibo Huang*, Weijia Shi*, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
ICLR 2025

📖 AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek Mittal
Preprint (Under Review)

📖 Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
ICML 2024

📖 Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi*, Yi Zeng*, Tinghao Xie*, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal†, Peter Henderson†
ICLR 2024 (oral)
📰 This work was exclusively reported by The New York Times and covered by many other media outlets!

📖 BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
ICLR 2024

📖 Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
Xiangyu Qi, Tinghao Xie, Jiachen T. Wang, Tong Wu, Saeed Mahloujifar, Prateek Mittal
USENIX Security 2023

📖 Revisiting the Assumption of Latent Separability for Backdoor Defenses
Xiangyu Qi*, Tinghao Xie*, Yiming Li, Saeed Mahloujifar, Prateek Mittal
ICLR 2023

📖 Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks
Xiangyu Qi*, Tinghao Xie*, Ruizhe Pan, Jifeng Zhu, Yong Yang, Kai Bu
CVPR 2022 (oral)