Posts by Collection

portfolio

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

publications

Ensemble of Narrow DNN Chains

Tinghao Xie

[pdf] Manuscript (Course Essay for Machine Learning at the University of Oxford), 2021

We propose the extremely small Ensemble of Narrow DNN Chains (ENDC) framework.

A Handbook for Deep Learning with their Piecemeal Intuitions from Causal Theory

Tinghao Xie

[pdf] Manuscript (Course Essay for Computational Learning Theory at the University of Oxford), 2022

We summarize the intuitions from causality embraced by recent deep learning works.

Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks

Xiangyu Qi*, Tinghao Xie*, Ruizhe Pan, Jifeng Zhu, Yong Yang and Kai Bu

[pdf] CVPR 2022 (oral)

We propose the first gray-box and physically realizable weights attack algorithm for backdoor injection, namely subnet replacement attack (SRA).

Revisiting the Assumption of Latent Separability for Backdoor Defenses

Xiangyu Qi*, Tinghao Xie*, Yiming Li, Saeed Mahloujifar, Prateek Mittal

[pdf] ICLR 2023

Is the latent separation unavoidable for backdoor poisoning attacks? In this paper, we design adaptive backdoor poisoning attacks to present counter-examples against this assumption.

Towards A Proactive ML Approach for Detecting Backdoor Poison Samples

Xiangyu Qi, Tinghao Xie, Jiachen T. Wang, Tong Wu, Saeed Mahloujifar, Prateek Mittal

[pdf] USENIX Security 2023

We suggest a proactive backdoor defense paradigm in which defenders engage proactively with the entire model training and poison detection pipeline, directly enforcing and magnifying distinctive characteristics of the post-attacked model to facilitate poison detection. We introduce the technique of Confusion Training (CT) as a concrete instantiation of our framework. CT applies an additional poisoning attack to the already poisoned dataset, actively decoupling benign correlation while exposing backdoor patterns to detection.

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal

[pdf] Preprint (Under Review)

We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward – finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi*, Yi Zeng*, Tinghao Xie*, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal$^†$, Peter Henderson$^†$

[pdf] [project website] Preprint (Under Review)

📰This work was exclusively reported by New York Times and many other social medias!

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta’s open release of Llama models and OpenAI’s APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing — even if a model’s initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

Red-teaming NSFW Image Classifiers as Text-to-Image Safeguards

Tinghao Xie, Yueqi Xie, Alireza Zareian, Shuming Hu, Felix Juefei-Xu, Xiaowen Lin, Ankit Jain, Prateek Mittal, Li Chen

[pdf] ACL 2026 Findings

Not Safe for Work (NSFW) image classifiers play a critical role in safeguarding text-to-image (T2I) systems. However, a concerning phenomenon has emerged in T2I systems – changes in text prompts that manipulate benign image elements can result in failed detection by NSFW classifiers – dubbed “context shifts.” For instance, while a NSFW image of “🖼️a nude person in an empty scene” can be easily blocked by most NSFW classifiers, a stealthier one that depicts “🖼️a nude person blending in a group of dressed people” may evade detection. We ask: how to systematically reveal NSFW image classifiers’ failure against such context shifts?

Towards this end, we present an automated red-teaming framework that leverages a set of generative AI tools. We propose an exploration-exploitation approach: First, in the exploration stage, we synthesize a diverse and massive 36K NSFW image dataset that facilitates our study of context shifts. We find that varying fractions (e.g., 4.1% to 36% nude and sexual content) of the dataset are misclassified by NSFW image classifiers like GPT-4o and Gemini. Second, in the exploitation stage, we leverage these failure cases to train a specialized LLM that rewrites unseen seed prompts into more evasive versions, increasing the likelihood of detection evasion by up to 6 times. Alarmingly, we show these failures translate to real-world T2I and even T2V systems like DALL-E 3, Sora, Nano Banana, and Veo 3 – beyond the open-weight image generators in our main study. For example, querying DALL-E 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from 0 to over 50%.

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.

Tinghao Xie
谢廷浩

Posts by Collection

portfolio

Portfolio item number 1

Portfolio item number 2

publications

Ensemble of Narrow DNN Chains

A Handbook for Deep Learning with their Piecemeal Intuitions from Causal Theory

Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks

Revisiting the Assumption of Latent Separability for Backdoor Defenses

Towards A Proactive ML Approach for Detecting Backdoor Poison Samples

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Red-teaming NSFW Image Classifiers as Text-to-Image Safeguards

talks

Talk 1 on Relevant Topic in Your Field

Tutorial 1 on Relevant Topic in Your Field

Talk 2 on Relevant Topic in Your Field

Conference Proceeding talk 3 on Relevant Topic in Your Field

teaching

Teaching experience 1

Teaching experience 2

Tinghao Xie谢廷浩

Posts by Collection

portfolio

publications

talks

teaching

Tinghao Xie
谢廷浩