Morpho-Phraseological Based Classification of CEFR Italian L2 Learner Writing Proficiency

Milani, Alfredo
2024-01-01

Abstract

This paper presents a model for automatically classifying writing proficiency in Italian as a second language (L2) according to the Common European Framework of Reference for Languages (CEFR). The proposed method integrates quantitative lexical and morphosyntactic analysis with phraseological dimensions. Phraseological aspects include the ability to use and understand fixed expressions, idioms, and other multiword units that are common in a language; they reflect the depth of language comprehension typically shown by native speakers. Specific techniques for encoding phraseological features are introduced, and basic phraseological statistics, previously unavailable for Italian, are extracted from an Italian corpus. The proposed model was compared experimentally with widely used machine-learning models on a dataset of written texts produced by non-native speakers for official Italian CEFR certification exams. In these experiments, the proposed model outperformed previous work on CEFR classification of Italian L2 proficiency in accuracy and all relevant prediction metrics, demonstrating the effectiveness of integrating morphosyntactic and phraseological features.
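As a rough illustration of what combining lexical, morphosyntactic, and phraseological signals in one feature vector might look like, the sketch below computes three simple per-text features. It is not the paper's feature set or encoding technique, which the abstract does not specify. The lexicon `MULTIWORD_UNITS`, the function names, and the choice of features (type-token ratio, mean sentence length, multiword-unit density) are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon of Italian multiword units (illustrative only;
# the paper derives its phraseological statistics from a corpus).
MULTIWORD_UNITS = {
    ("in", "bocca", "al", "lupo"),
    ("a", "mio", "avviso"),
    ("per", "quanto", "riguarda"),
}


def tokenize(text):
    """Lowercase the text and return its word tokens (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_multiword_units(tokens, lexicon=MULTIWORD_UNITS):
    """Count occurrences of each lexicon entry as a contiguous token sequence."""
    count = 0
    for unit in lexicon:
        n = len(unit)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == unit:
                count += 1
    return count


def extract_features(text):
    """Return a small dictionary of assumed proficiency-related features."""
    tokens = tokenize(text)
    # Naive sentence split on terminal punctuation; empty chunks are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    if n_tokens == 0:
        return {
            "type_token_ratio": 0.0,
            "mean_sentence_length": 0.0,
            "mwu_per_100_tokens": 0.0,
        }
    return {
        # Lexical diversity: distinct word forms over total tokens.
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Crude syntactic-complexity proxy: tokens per sentence.
        "mean_sentence_length": n_tokens / len(sentences),
        # Phraseological density: multiword units per 100 tokens.
        "mwu_per_100_tokens": 100.0 * count_multiword_units(tokens) / n_tokens,
    }
```

A feature dictionary like this could then be passed to any standard classifier trained on CEFR-labelled texts. A realistic system would replace the naive tokenizer and the hand-written lexicon with a proper morphosyntactic analyzer and corpus-derived phraseological statistics.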
Machine learning
Classification algorithms
Complexity measures
Text complexity
Language proficiency
L2 learners
NLP
Complexity theory
Syntactics
Linguistics
Feature extraction
Accuracy
Standards
Vocabulary
Europe
Current measurement

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14085/42691