This schedule is tentative and subject to change. All times are in CET (Central European Time).
- April 17: Deadline for seminar registration. Interested students should register through https://seminars.cs.uni-saarland.de/.
- April 30 (9:00 – 10:30): Introduction to the seminar course.
- May 27 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the first batch.
- May 28 (9:00 – 11:00): Presentation and discussion session for the papers in the first batch, and office hours.
- June 18 (9:00 – 10:30): A tutoring session for the course project. The description of the course project will be sent via email.
- June 24 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the second batch.
- June 25 (9:00 – 11:00): Presentation and discussion session for the papers in the second batch, and office hours.
- July 15 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the third batch.
- July 16 (9:00 – 11:00): Presentation and discussion session for the papers in the third batch, and office hours.
- August 12 (12:00 noon): Deadline for submitting the project report and code.
Complete Reading List
1st Batch: Red teaming and adversarial testing
- Red Teaming Language Models with Language Models
by Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving, in EMNLP 2022
- Curiosity-driven Red-teaming for Large Language Models
by Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, and Pulkit Agrawal, in ICLR 2024
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
by Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson, in ICLR 2024
- On Evaluating Adversarial Robustness of Large Vision-Language Models
by Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin, in NeurIPS 2023
2nd Batch: Fake content generation and watermarking
- On the Risk of Misinformation Pollution with Large Language Models
by Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang, in Findings of EMNLP 2023
- A Watermark for Large Language Models
by John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein, in ICML 2023
- Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust
by Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein, in NeurIPS 2023
- Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
by Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei, Aounon Kumar, Atoosa Chegini, Wenxiao Wang, and Soheil Feizi, in ICLR 2024
3rd Batch: Poisoning attacks and robust training
- Poisoning Language Models During Instruction Tuning
by Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein, in ICML 2023
- Universal Jailbreak Backdoors from Poisoned Human Feedback
by Javier Rando and Florian Tramèr, in ICLR 2024
- Jailbroken: How Does LLM Safety Training Fail?
by Alexander Wei, Nika Haghtalab, and Jacob Steinhardt, in NeurIPS 2023
- Safe RLHF: Safe Reinforcement Learning from Human Feedback
by Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang, in ICLR 2024