Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Development of a Red-Teaming Dataset for Defending Large Language Models against Attacks

https://doi.org/10.15514/ISPRAS-2024-36(5)-10

Abstract

Modern large language models are large-scale systems whose complex internal mechanisms make response generation effectively a black box. Although aligned large language models include built-in defenses, recent studies demonstrate that they remain vulnerable to attacks. In this study, we expand existing malicious datasets obtained from attacks so that the corresponding vulnerabilities in large language models can later be addressed through the alignment procedure. In addition, we evaluate modern large language models on our malicious dataset, demonstrating their existing weaknesses.
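
To make the evaluation described above concrete, the following minimal Python sketch illustrates one way such experiments are typically organized: each malicious prompt from a red-teaming dataset is sent to a model, responses are checked for refusals, and the fraction of non-refused prompts is reported as the attack success rate. This is not the authors' code; query_model, the refusal markers, and the placeholder prompts are assumptions introduced only for illustration.

# Minimal sketch (not the authors' implementation) of evaluating a model
# on a red-teaming dataset of malicious prompts.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def query_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (API or local checkpoint)."""
    return "I'm sorry, but I can't help with that."

def attack_success_rate(prompts: list[str]) -> float:
    """Fraction of malicious prompts that the model does NOT refuse."""
    successes = 0
    for prompt in prompts:
        response = query_model(prompt).lower()
        refused = any(marker in response for marker in REFUSAL_MARKERS)
        successes += 0 if refused else 1
    return successes / len(prompts) if prompts else 0.0

if __name__ == "__main__":
    # Hypothetical prompts standing in for the red-teaming dataset.
    dataset = ["<malicious prompt 1>", "<malicious prompt 2>"]
    print(f"Attack success rate: {attack_success_rate(dataset):.2%}")

In a real experiment, query_model would wrap an aligned model under test, and refusal detection is usually performed by a classifier or human annotation rather than keyword matching.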

About the Authors

Irina Sergeevna ALEKSEEVSKAIA
Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Programmer at the Trusted Artificial Intelligence Research Center and a postgraduate student at ISP RAS in the field of artificial intelligence and machine learning. Research interests: large language models, adversarial attacks, backdoor attacks, alignment of large language models.



Konstantin Vladimirovich ARKHIPENKO
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Junior Research Fellow at the Trusted Artificial Intelligence Research Center and a specialist at the Department of System Programming of Lomonosov Moscow State University. Research interests: defense of machine learning models against adversarial attacks, explainable artificial intelligence.



Denis Yuryevich TURDAKOV
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Cand. Sci. (Phys.-Math.), Head of the Information Systems Department at the Institute for System Programming since 2017. Research interests: natural language analysis, cloud computing, machine learning, social network analysis.





For citations:


ALEKSEEVSKAIA I.S., ARKHIPENKO K.V., TURDAKOV D.Yu. Development of a Red-Teaming Dataset for Defending Large Language Models against Attacks. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(5):143-152. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(5)-10



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)