When AI Is Fooled: Hidden Risks in LLM-Assisted Grading

Milani A.
2025-01-01

Abstract

This study investigates how targeted attacks can compromise the reliability and applicability of large language models (LLMs) in educational assessment, highlighting security vulnerabilities that are frequently underestimated in current AI-supported learning environments. As LLMs and other AI tools are increasingly integrated into grading, feedback, and the wider evaluation workflow, educators are adopting them for their potential to increase efficiency and scalability. However, this rapid adoption also introduces new risks. A largely unexplored threat is prompt injection, whereby a student acting as an attacker embeds malicious instructions within a seemingly ordinary assignment submission to influence the model’s behaviour and obtain a more favourable evaluation. To the best of our knowledge, this is the first systematic comparative study of the vulnerability of popular LLMs in a real-world educational context. We analyse a representative scenario involving prompt injection in exam assessment to show how easily such manipulations can bypass the teacher’s oversight and distort results, thereby disrupting the entire evaluation process. By modelling the structure and behavioural patterns of LLMs under attack, we aim to clarify the underlying mechanisms and expose the limitations of these models when used in educational settings.
AI misuse detection
education
educational evaluation
generative AI
human-in-the-loop AI
large language models
prompt injection
trustworthy AI

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14085/57763