This schedule is tentative and subject to change. All times are in CET (Central European Time).
- April 17: Deadline for seminar registration. Interested students should register through https://seminars.cs.uni-saarland.de/.
- April 30 (9:00 – 10:30): Introduction to the seminar course.
- May 27 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the first batch.
- May 28 (9:00 – 11:00): Presentation and discussion session for the papers in the first batch, and office hours.
- June 18 (9:00 – 10:30): A tutoring session for the course project. The description of the course project will be sent via email.
- June 24 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the second batch.
- June 25 (9:00 – 11:00): Presentation and discussion session for the papers in the second batch, and office hours.
- July 15 (12:00 noon): Deadline for submitting the reports and presentation slides for the papers in the third batch.
- July 16 (9:00 – 11:00): Presentation and discussion session for the papers in the third batch, and office hours.
- August 12 (12:00 noon): Deadline for submitting the project report and code.
Complete Reading List
1st Batch: Red teaming and adversarial testing
- Red Teaming Language Models with Language Models
by Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving, in EMNLP 2022
- Curiosity-driven Red-teaming for Large Language Models
by Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, and Pulkit Agrawal, in ICLR 2024
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
by Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson, in ICLR 2024
- On Evaluating Adversarial Robustness of Large Vision-Language Models
by Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin, in NeurIPS 2023
2nd Batch: Fake content generation and watermarking
- On the Risk of Misinformation Pollution with Large Language Models
by Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang, in Findings of EMNLP 2023
- A Watermark for Large Language Models
by John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein, in ICML 2023
- Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust
by Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein, in NeurIPS 2023
- Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
by Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei, Aounon Kumar, Atoosa Chegini, Wenxiao Wang, and Soheil Feizi, in ICLR 2024
3rd Batch: Poisoning attacks and robust training
- Poisoning Language Models During Instruction Tuning
by Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein, in ICML 2023
- Universal Jailbreak Backdoors from Poisoned Human Feedback
by Javier Rando and Florian Tramèr, in ICLR 2024
- Jailbroken: How Does LLM Safety Training Fail?
by Alexander Wei, Nika Haghtalab, and Jacob Steinhardt, in NeurIPS 2023
- Safe RLHF: Safe Reinforcement Learning from Human Feedback
by Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang, in ICLR 2024