Learning to Classify Text Complexity for the Italian Language Using Support Vector Machines

Santucci, V.; Forti, L.; Santarelli, F.; Spina, S.; Milani, A.

doi:10.1007/978-3-030-58802-1_27

Natural language processing is one of the most active fields of research in the machine learning community. In this work we propose a supervised classification system that, given a text written in Italian as input, predicts its linguistic complexity as a level of the Common European Framework of Reference for Languages (CEFR). Building the system involved three steps: (i) collecting a dataset of texts labeled by linguistic experts, (ii) defining vectorisation procedures that transform any text into a numerical representation, and (iii) training a support vector machine model. Experiments were conducted following a statistically sound design, and the results show that the system reaches a good prediction accuracy.
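
To make step (ii) concrete, the following is a minimal sketch of what a vectorisation procedure might look like. The abstract does not list the features actually used, so the function name `vectorise` and the three features chosen here (mean sentence length, mean word length, and type-token ratio) are illustrative assumptions, not the authors' method. Vectors of this kind are what a support vector machine would then be trained on, paired with the expert-assigned CEFR labels.

```python
import re
from typing import List


def vectorise(text: str) -> List[float]:
    """Map a text to a small numeric vector (hypothetical features).

    Returns [mean sentence length in words,
             mean word length in characters,
             type-token ratio].
    These features are assumptions for illustration only.
    """
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware for str patterns, so accented Italian letters are kept.
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        # Empty or punctuation-only input: return a zero vector.
        return [0.0, 0.0, 0.0]
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(set(words)) / len(words)
    return [mean_sentence_len, mean_word_len, type_token_ratio]


# Toy input, not taken from the paper's dataset.
simple = vectorise("Il gatto dorme. Il cane mangia.")
print(simple)
```

In a full pipeline, each labeled text would be passed through such a function, and the resulting vectors and labels would be handed to an SVM implementation for training; that part is omitted here because the abstract gives no detail on the kernel or hyperparameters.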