Machine Learning Algorithms for Predicting Academic Achievement

Abstract
In this research, two machine learning classifiers, a multilayer perceptron (MLP) and a gradient boosting model (GB), were implemented to predict the degree of academic achievement in Spanish and mathematics of basic education students in two stages, sixth grade of primary school (2008) and third year of secondary school (2011), based on contextual variables obtained from the Enlace test of the state of Tlaxcala, Mexico. Thirteen input variables were considered, and their relative importance was determined with the random forest (RF) classifier. The MLP and GB classifiers were trained and tested with a dataset of 11 036 records of students who remained in the school system from 2008 to 2011. The models were trained and tested in prediction for 2008 and 2011. In Spanish, MLP outperformed GB with a global classification accuracy (PG) of 70.1% in 2008 and 61.1% in 2011. GB obtained better performance in mathematics, with a PG of 68.8% in 2008 and 63.5% in 2011. The score in Spanish was observed to have a strong association with the degree of academic achievement in mathematics. Scores in Spanish and mathematics have greater relative importance than the contextual factors considered, such as sex, scholarship, and school shift. In the population of students analyzed, the proportion of women in Spanish and mathematics is higher than the proportion of men in achievement levels 1 (elementary) and 2 (good or excellent); in contrast, in both subjects this proportion is reversed at achievement level 0 (insufficient).
Keywords: supervised learning, decision trees, school context, artificial neural networks, cross validation.

The Enlace test was applied on a census basis starting in 2006 to all students from third to sixth grade of primary school and all three years of secondary school. In 2008 it was applied to all three years of high school. The objective of this test was to evaluate the academic achievement of the students in the subjects of Spanish and mathematics; the benchmark for this test is the national curriculum (Martínez, 2015). Enlace results are measured on a standardized scale that ranges from 200 to 800 points and has four levels of achievement (0 = Insufficient, 1 = Elementary, 2 = Good, 3 = Excellent). The information has served as support for teachers to compare the results of their school with others with similar characteristics and to identify curricular content that students did not acquire, in order to take pertinent actions (SEP, 2008). Along with the Enlace test, questionnaires were administered to a sample of students, parents, teachers, and principals of the schools included in the test to find out personal characteristics, family environment, reading habits, housing characteristics, school infrastructure, and teaching methods, and to identify limiting factors associated with learning.
Applications of machine learning models to assess school performance are reported in the literature (e.g., Đambić, Krajcar and Bele, 2016). Altabrawee, Osama, and Qaisir (2019) applied four machine learning algorithms (neural network, Bayesian classifier, decision trees, and logistic regression) to predict student performance in a computer course based on the use of the Internet as a learning medium; the variables considered included the time spent on social networks, hours of study, sex, and education of family members. The neural network achieved the best performance, with an overall accuracy of 77.0% and an area under the ROC curve (AUCROC) of 0.807.
In Mexico, related work has been carried out to evaluate the effect of factors external and internal to the school on the academic achievement of students. Fernández (2003) used indices of global family capital (educational level of the mother, comfort equipment in the home, availability of books and computers), of the sociocultural context of the school, and of the organizational climate; hierarchical analysis was applied to assess learning in Spanish and mathematics. This author reports that an increase in the global family capital index is associated with higher performance in Spanish and mathematics; however, when household deprivation is widespread (illiteracy, low income), student outcomes are low.
In this research, the following objectives were proposed: first, to implement two supervised machine learning classifiers, namely a multilayer neural network and a gradient boosting algorithm, to predict the degree of academic achievement (0: insufficient, 1: elementary, 2: good or excellent) in the subjects of Spanish and mathematics of sixth graders

Data collection and preparation
The database of academic records used in this study (2008-2011) for the state of Tlaxcala is a subset of the national database of the Enlace test, which was applied from 2006 to 2014 to all students from the third year of primary school to the third year of secondary school. Its purpose was to generate information for parents, teachers, managers, and society in general about the academic achievement of students in the educational system in Spanish and mathematics (SEP, 2008). Data records were stored by student, application year, educational level, and grade, with data on the academic achievement of the students and on the place where the school is located, as well as the scores in Spanish and mathematics (Table 2). Table 3 presents the description of the four data sets analyzed by subject and year for the same school population. In order to reduce the imbalance between classes or levels of academic achievement, the good and excellent levels were grouped into level two of academic achievement.
Table 3. Description of the data sets analyzed by subject, year, and level of academic achievement in the state of Tlaxcala, for a total of 11 036 student records. Columns: year and subject evaluated; levels of academic achievement.

Figure 1 shows the distribution of achievement levels by gender. In ESP2008 there is a higher proportion of women in class 2, and in MAT2008 women also predominate in classes 1 and 2. The same is observed in ESP2011 and in MAT2011, where women predominate in both classes.

Machine learning classifiers
The purpose of supervised machine learning classifiers is to predict a target class from input variables or features. In this work, three supervised learning classifiers were implemented: multilayer perceptron (MLP) and gradient boosting (GB) to predict the level of academic achievement, and random forest (RF) to determine the relative importance of the predictor variables.

Multilayer perceptron
The MLP classifier is a network of neurons connected by weights or parameters, structured in an input layer (in), one or more hidden layers (h), and an output layer (out). The basic architecture of MLP consists of three layers (Figure 2). The more layers the network has, the more complex it is and the greater its capacity to solve complex problems (Borkar and Rajeswari, 2014). The net input and the activation of the hidden layer are computed as:

$$Z^{(h)} = A^{(in)} W^{(h)}, \qquad A^{(h)} = \phi(Z^{(h)})$$

where $A^{(in)}$ is the array of features of the input samples, $W^{(h)}$ is the weight matrix of the hidden layer, and $\phi(\cdot)$ is the activation function. Similarly, the activation of the output layer is generated:

$$Z^{(out)} = A^{(h)} W^{(out)}, \qquad A^{(out)} = \phi(Z^{(out)})$$

where $W^{(out)}$ is the array of output weights and $A^{(out)}$ is a probability matrix with the predicted responses or classes of the network. To determine the classification error, the target class is compared with the predicted class. The backpropagation algorithm is used to distribute the errors, and partial derivatives with respect to the network weights are obtained to update the model (Raschka and Mirjalili, 2017).
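The forward pass described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the study: the hidden layer size, the sigmoid hidden activation, the softmax output, the random weights, and the omission of bias terms are all assumptions made only to keep the example short.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, used here (by assumption) for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(row):
    """Turn one row of net inputs into class probabilities that sum to one."""
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def forward(a_in, w_h, w_out):
    """Compute Z(h) = A(in) W(h), A(h) = phi(Z(h)), then the output layer."""
    z_h = matmul(a_in, w_h)
    a_h = [[sigmoid(v) for v in row] for row in z_h]
    z_out = matmul(a_h, w_out)
    a_out = [softmax(row) for row in z_out]
    return a_h, a_out

# Hypothetical sizes: 13 input features, 5 hidden units, 3 achievement classes.
rng = random.Random(0)
n_features, n_hidden, n_classes = 13, 5, 3
w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_features)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)] for _ in range(n_hidden)]
a_in = [[rng.random() for _ in range(n_features)] for _ in range(4)]  # 4 synthetic samples

a_h, a_out = forward(a_in, w_h, w_out)
predicted = [row.index(max(row)) for row in a_out]  # most probable class per sample
print(predicted)
```

Each row of `a_out` plays the role of a row of the probability matrix $A^{(out)}$; training would adjust `w_h` and `w_out` by backpropagation, which this sketch does not include.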

Gradient Boosting Classifier
The GB classifier consists of a set of individual decision trees that are trained sequentially, so that each tree corrects the errors of the previous trees. To predict a new observation, the predictions of all the individual trees in the model are added. GB can use any loss function as long as it is differentiable. For example, a model $f_1$ is fitted to predict the response variable $y$, and the errors $y - f_1(x)$ are calculated; then a model $f_2$ is fitted that tries to predict the errors of the previous model; next, a model $f_3$ is fitted that tries to correct the errors of the previous models, and this is repeated $m$ times. To avoid overfitting, a regularization parameter called the learning rate ($\lambda$) is used, which limits the influence of each model in the ensemble.
The idea behind boosting is to sequentially fit multiple simple models, where each model uses information from the previous model to "learn from its mistakes" and improve with each iteration; the final value is obtained by combining the predictions of all the models (Rogers and Gunn, 2005).
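The sequential fitting of residuals can be sketched as follows. This is a simplified illustration under stated assumptions, not the classifier used in the study: it uses a squared-error loss on a single numeric feature, one-split regression trees (stumps) as the simple models, and made-up data; the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, m=50, learning_rate=0.1):
    """Fit m stumps in sequence, each one to the errors left by the previous ones."""
    base = sum(y) / len(y)  # initial constant model
    models = []
    predictions = [base] * len(y)
    for _ in range(m):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]  # y - f(x)
        stump = fit_stump(x, residuals)
        models.append(stump)
        # The learning rate limits the influence of each new model.
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(f(xi) for f in models)

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

model = gradient_boost(x, y)
mean_y = sum(y) / len(y)
mse_baseline = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_baseline, mse_boosted)
```

The final prediction is the sum of the scaled contributions of all stumps, so a smaller `learning_rate` needs more iterations `m` to reach the same training error.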

Random forest
RF is an ensemble of decision trees whose goal is to average multiple decision trees to build a more robust model that generalizes better and is less susceptible to overfitting (Raschka and Mirjalili, 2017). To predict the class, each tree applies its rules and the class is assigned by majority vote (Breiman, 2001). The RF algorithm is summarized as follows: a bootstrap sample of size n is drawn from the training set (random sampling with replacement); a tree is grown from this sample; at each node, d features are selected at random without replacement; and the node is split with the feature that provides the best division according to the information gain (IG) objective function, which is defined by:

$$IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$

where $f$ is the feature used to perform the division; $N_p$ is the total number of samples in the parent node; $N_{left}$ is the number of samples in the left child node; $N_{right}$ is the number of samples in the right child node; $I$ is the measure of impurity (Gini, entropy, or classification error); $D_p$ is the dataset in the parent node; $D_{left}$ is the dataset in the left child node; and $D_{right}$ is the dataset in the right child node.
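As a small worked example of the information gain criterion, the sketch below evaluates two candidate splits with Gini impurity as the measure $I$. The label lists are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, left, right, impurity=gini):
    """IG = I(D_p) - (N_left / N_p) I(D_left) - (N_right / N_p) I(D_right)."""
    n_p = len(parent)
    return (impurity(parent)
            - len(left) / n_p * impurity(left)
            - len(right) / n_p * impurity(right))

parent = [0, 0, 0, 0, 1, 1, 1, 1]  # balanced parent node, Gini impurity 0.5

# A split that separates the classes perfectly recovers all the impurity.
pure_split = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split whose children are as mixed as the parent gains nothing.
mixed_split = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(pure_split, mixed_split)  # 0.5 0.0
```

A tree-growing routine would compute this quantity for every candidate feature and threshold at a node and keep the split with the largest value.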
RF is also used to determine the importance of a set of variables in the model. The RF algorithm creates classifiers with a random selection of features; this achieves a good exploration of feature subsets, from which the variables with greater importance are selected (Rogers and Gunn, 2005).

Performance criteria of prediction models
To evaluate the performance of the MLP, GB and RF classification models, the metrics are obtained from a confusion matrix (MC), which describes the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The AUCROC value (area under the ROC curve) is interpreted as the probability that, given two samples, one positive and one negative, the classifier assigns a higher probability to the positive sample, that is, a correct classification (Mandrekar, 2010). Its value ranges between zero and one; the higher the AUCROC, the better the classification, and a value close to 0.50 indicates a poor classification. The P-S curve is the result of plotting precision (P) versus sensitivity (S). This makes it possible to observe from which value of S there is a degradation of P, and vice versa. The ideal result is a curve that approaches the upper right corner (high P and S); the closer the area under the P-S curve (AUCP-S) is to one, the better the model (Saito and Rehmsmeier, 2015).
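The probabilistic reading of AUCROC can be checked directly by comparing every positive sample with every negative sample. The sketch below does this for a binary case with made-up scores; for the three achievement classes, the same computation would be applied per class in a one-vs-rest fashion, which is an assumption about how the per-class curves were obtained.

```python
def auc_roc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a correct ordering.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up predicted probabilities of the positive class and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = auc_roc(scores, labels)
print(auc)  # 8 of the 9 positive-negative pairs are ordered correctly
```

This pairwise count is quadratic in the number of samples, so it is meant only to make the interpretation concrete, not to replace a library routine on large datasets.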

After the selection of the optimal hyperparameters of the MLP and GB classifiers, the final evaluation of their performance was carried out. To determine the PG of each model, the complete data set of each input scenario (Table 3), the optimal combination of hyperparameters selected, and the cross-validation (CV) procedure (k = 5) were considered. In each iteration, a PG value of the model is obtained; at the end of the k iterations, the average of the PG values and their standard deviation were calculated, along with the other metrics proposed in the study.

Relative importance of input characteristics
RF was used to determine the relative importance of the input features or variables in predicting the target class. RF builds a large number of classifiers based on randomly selected subsets of variables. At each node of an RF tree, the input variable that maximizes the information gain (the splitting criterion) is selected to partition the node. Variable importance measures are used to assess the contribution of each variable to the performance of the machine learning model (Rogers and Gunn, 2005).
To calculate the relative importance of the 13 predictor variables with the RF model, the cross-validation (CV) procedure (k = 5) was applied to select the optimal hyperparameters. Subsequently, the feature importance option of the Python Scikit-learn library was applied to the RF model. This stage was carried out for the four data sets: ESP2008, ESP2011, MAT2008 and MAT2011.
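A possible shape of this step with scikit-learn is sketched below. It is a sketch under stated assumptions rather than the study's code: the data are synthetic (generated with `make_classification`), the feature names `x0` to `x12` are placeholders for the 13 predictors, and the hyperparameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for one dataset: 13 features, 3 achievement classes.
X, y = make_classification(n_samples=600, n_features=13, n_informative=5,
                           n_classes=3, random_state=0)
feature_names = [f"x{i}" for i in range(13)]  # placeholder names

# Hypothetical grid; 5-fold cross validation selects the best combination.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Impurity-based importances of the refitted best model; they sum to one.
importances = search.best_estimator_.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

The same procedure would be repeated once per dataset (ESP2008, ESP2011, MAT2008, MAT2011) to obtain one ranking for each.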

Optimal hyperparameters
The optimal hyperparameters of the MLP and GB classifiers, selected with cross validation and grid search for each dataset, are shown in Table 6. In general, the optimal hyperparameters of each classifier depend on the dataset analyzed.
Table 6 note: regularizer; ta: learning rate; mi: maximum number of iterations; pa: tree depth; ha: leaves per tree.

Classifier performance
Table 7 presents the average performance and its standard deviation in prediction for the MLP and GB models in the four scenarios analyzed in the Enlace test. In terms of global classification accuracy, MLP was superior to GB in the ESP2008 scenario and GB was superior to MLP in the MAT2008 scenario (Table 7).
In the subject of Spanish, MLP and GB performed better in 2008 at classifying classes 1 and 2 than class 0 (F1 of 77.0% and 75.0% versus 33.0%, respectively). On the other hand, in 2011, MLP and GB performed better at classifying classes 0 and 1 than class 2. This is reinforced by the AUCP-S results, with values of 0.70 and 0.60 for classes 0 and 1 (Table 8, Figure 3). With the 2011 mathematics dataset, GB was slightly better than MLP (GB had an average PG of 63.5%). GB outperformed MLP at classifying class 0, with an F1 of 77%. Both models had low performance in classifying classes 1 and 2. The AUCP-S results confirm that both models performed well in classifying class 0 and poorly for classes 1 and 2 (0.37).

Discussion
The results obtained show the association of the 13 input variables with the three classes or levels of academic achievement (insufficient, elementary, and good or excellent) obtained by the students of the state of Tlaxcala. Among the most important variables were the score obtained in mathematics, for predicting the level of academic achievement in Spanish, and the score obtained in Spanish, for predicting the level of academic achievement in mathematics. Next in order of importance, although to a lesser extent, are the geographical location of the school, the population of the locality where the school is located, and the sex of the student. The influence of the pun_esp and pun_mat variables on the classification of students into achievement levels stands out. Fernández (2003) points out that learning the Spanish language results from a process of accumulation of pedagogical experiences that students have during their time at school, whereas learning mathematics is a constructive process that relates the formulation or understanding of concepts to problem solving. It is therefore useful to observe the students' scores in Spanish in order to predict their achievement level in mathematics and vice versa; that is, performance in the other subject serves as a general measure of the students' ability.
Among the metrics selected to compare the two models, the area under the P-S curve (AUCP-S) provided more information for assessing the performance of the classifiers in discriminating the percentage of correctly classified samples, and the results are similarly reflected in the confusion matrix. The GB and MLP classifiers represent an alternative approach for identifying the variables or contextual factors that favor or limit the academic achievement of students, and they constitute a decision-making support tool for identifying low-performing students and proposing focused solutions to structural problems such as school dropout.

Future lines of research
To complement this work, it is important to consider other contextual variables, such as the education of the parents, characteristics of the family, and characteristics of the school, which can be obtained by crossing the information on school performance from the Enlace test with the context questionnaires administered alongside the test.
These new variables can be analyzed with machine learning models to assess their influence on school performance. In addition, it is possible to test other classification algorithms such as support vector machines (SVM) or k-nearest neighbors (KNN). Likewise, with the approach applied in this study, it is possible to select school information from other entities or regions of the country to evaluate school performance in other socioeconomic contexts.
Introduction
The evaluation of student learning through large-scale tests (state or national) makes it possible to obtain information about students' degree of academic achievement and the associated contextual variables. The Organization for Economic Cooperation and Development [OECD] (2005) found evidence of how factors such as the school context, school supplies, and processes are related to the students' learning process. Mexico began using standardized tests to measure the academic achievement of students in the last two decades. The Ministry of Public Education (SEP) has databases of the students who enroll annually at each educational level and of the results of the tests that are applied at the national level, such as the Educational Quality and Achievement Examinations (Excale) and the National Assessment of Academic Achievement in Schools (Enlace), or at the international level, such as the Programme for International Student Assessment (PISA) (National Institute for the Evaluation of Education [INEE], 2019).
Đambić, Krajcar and Bele (2016) applied a logistic regression model for the early detection of students with performance problems in a computer course. The model obtained a classification error of 19.0%. Ray et al. (2020) used two classification models (random forest and support vector machine) to predict the school performance of a group of university students based on input variables such as sex, hours of study, percentage of class attendance, and monthly family income. The random forest model obtained a global classification accuracy of 94.0% and the support vector machine 79.0%.

(2008) and third-year secondary students (2011) in the state of Tlaxcala based on data from the Enlace test; second, to compare the degree of academic achievement in Spanish and mathematics in 2008 and 2011; and third, to determine the relative importance of 13 predictor variables in the classification of academic achievement. The predictor variables were math score, Spanish score, scholarship, school shift, support, type of location, gender, type of school, size of location, marginalization, and geographic location (altitude, latitude, and longitude).

Source: Enlace test 2008-2013 (SEP, 2008)
From the information available for the state of Tlaxcala, the subset of students who took the Enlace test during four consecutive years (2008 to 2011), from sixth grade of primary school to the third year of secondary school, was selected; this period marks the beginning of an educational

Figure 2. Architecture of the multilayer perceptron classifier

The confusion matrix counts true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The rows represent the number of samples in the observed class and the columns the number of predictions in each class. The MC diagonal corresponds to the number of samples that the algorithm correctly classifies in each class. If MC has nonzero values only on the diagonal, the classifier correctly classifies all samples. The overall classification accuracy (PG) metric measures the overall proportion of correctly classified samples and is calculated as:

$$PG = \frac{TP + TN}{TP + TN + FP + FN}$$

The metrics to measure the performance of the classifier in each class are precision (P), sensitivity (S), specificity (E), and F1 score. They are defined with the following expressions:

$$P = \frac{TP}{TP + FP}, \qquad S = \frac{TP}{TP + FN}, \qquad E = \frac{TN}{TN + FP}, \qquad F1 = \frac{2 \, P \, S}{P + S}$$

In this case, the value of F1 summarizes P and S in a single metric, is an appropriate estimator for unbalanced classes, and varies between zero and one. The receiver operating characteristic (ROC) curve relates values of S to 1 - E. The different points on the curve correspond to the cut-off points used to determine whether the test results are positive.
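The metrics above can be computed per class from a multi-class confusion matrix by treating one class as positive and the rest as negative. The sketch below uses a made-up 3 x 3 matrix (rows: observed class, columns: predicted class); the counts are illustrative only and are not results of the study. For more than two classes, PG reduces to the sum of the diagonal divided by the total number of samples.

```python
def global_accuracy(mc):
    """PG: correctly classified samples (diagonal) over all samples."""
    correct = sum(mc[i][i] for i in range(len(mc)))
    total = sum(sum(row) for row in mc)
    return correct / total

def class_metrics(mc, k):
    """Precision, sensitivity, specificity and F1 for class k (one-vs-rest)."""
    total = sum(sum(row) for row in mc)
    tp = mc[k][k]
    fn = sum(mc[k]) - tp                 # observed k, predicted as another class
    fp = sum(row[k] for row in mc) - tp  # predicted k, observed another class
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, specificity, f1

# Made-up confusion matrix for classes 0, 1 and 2.
mc = [[50, 10, 0],
      [5, 80, 15],
      [0, 10, 30]]

pg = global_accuracy(mc)
p0, s0, e0, f1_0 = class_metrics(mc, 0)
print(pg, p0, s0, e0, f1_0)
```

The sketch does not guard against empty classes; a class with no predicted or observed samples would need explicit handling to avoid division by zero.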
To train the models, each data set was divided into two random partitions, 80% for training and 20% for testing. For each dataset (ESP2008, MAT2008, ESP2011 and MAT2011), the RF classifier was applied to assess the relative importance of the input variables.

Table 4. Number of observations in the test sets by target class or level of academic achievement, in Spanish and mathematics. Classes: 0: insufficient; 1: elementary; 2: good or excellent.

The selection of the optimal hyperparameters consists of finding the combination of hyperparameter values that maximizes the performance of the classifier according to a metric; in this study, PG was used. The selection of hyperparameters for each classifier was performed through cross validation (CV). Training was performed with a random sample of 80% of the total data set. The CV method consists of randomly subdividing the training set into k disjoint subsets of the same size. Then, for each combination of hyperparameter values (Table 5), the model is executed k times. In each iteration, one of the disjoint subsets is used as a validation set and the rest as a training set (80% training and 20% validation), and a value of the PG performance metric is obtained. After evaluating the different combinations, the combination of hyperparameter values that maximizes the average PG obtained from CV (k = 5) is selected.

Table 5. Values for the search and selection of hyperparameters of the multilayer perceptron (MLP) and gradient boosting (GB) classifiers
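The cross-validation and grid search procedure described above can be sketched in a few lines. To keep the example self-contained, a one-dimensional k-nearest-neighbors classifier stands in for MLP and GB, the data are synthetic, and the grid of `n_neighbors` values is hypothetical; only the fold logic and the selection by average accuracy mirror the text.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def knn_predict(train, query, n_neighbors):
    """Majority vote among the n_neighbors training points closest to query."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:n_neighbors]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def cv_accuracy(data, n_neighbors, k=5):
    """Average accuracy over k folds, each used once as the validation set."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        correct = sum(knn_predict(train, data[i][0], n_neighbors) == data[i][1]
                      for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k

# Synthetic two-class data: (feature value, class label).
rng = random.Random(1)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
        + [(rng.gauss(2.0, 1.0), 1) for _ in range(50)])

grid = [1, 3, 5, 7]  # hypothetical hyperparameter values
results = {n: cv_accuracy(data, n) for n in grid}
best = max(results, key=results.get)  # value with the highest average accuracy
print(results, best)
```

With k = 5, each iteration trains on four folds (80%) and validates on the remaining one (20%), which matches the proportions given in the text.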

MLP and GB obtained low performance in classifying class 2, with F1 = 0.41. This is confirmed by the P-S curves, which show a low AUCP-S value for class 2 (Figure 3).

Figure 5. Relative importance of input variables obtained with the random forest (RF) classifier. A: Spanish 2008, B: mathematics 2008

The two machine learning algorithms (MLP and GB) obtained an overall correct classification performance (PG) greater than 60.0%. A limitation of the work was the presence of unbalanced target classes, which restricted the performance of the classifiers because a classifier tends to give greater importance to the majority classes. To improve the work, additional context variables can be considered to see whether they improve the classification. Alvarez et al. (2007) used student-related variables referring to socioeconomic indicators, characteristics of the school, and institutional aspects (state pedagogy, union influence, etc.) to determine which factors influence school performance in the mathematics, science, and reading tests of PISA; likewise, Hussain and Qasim (2021) used historical grade data to predict student grades using machine learning algorithms.

Conclusions
The multilayer perceptron (MLP) and gradient boosting (GB) machine learning classifiers obtained comparable performance in terms of overall classification accuracy (PG) in predicting levels of academic achievement (0: insufficient, 1: elementary, and 2: good or excellent) of primary and secondary school students in the state of Tlaxcala, based on contextual variables extracted from the Enlace test (National Assessment of Academic Achievement in Schools). In mathematics, GB had a PG of 68.8% in 2008 and 63.5% in 2011; likewise, in 2008, MLP and GB performed better in classifying classes 1 and 2 than class 0 (insufficient). In contrast, in 2011, in both subjects, MLP and GB performed better at classifying classes 0 and 1 than class 2. The contextual variables used in this study showed an association with the levels of academic achievement, in particular the variables scholarship, sex, and school shift. The score in Spanish obtained by a student influences the level of academic achievement in mathematics and vice versa. These results show the importance of machine learning algorithms for identifying relevant factors that affect the school performance of students, based on the analysis of the large volumes of school information held by the Ministry of Public Education.
(Table 1), score obtained in the test (scale from 200 to 800), scholarship holder, shift, type of support, and geographic location of the school.
Score intervals for determining the class or level of academic achievement in