The Use of Machine Learning to Provide an Early Indication of Programming Students in Need of Support

Kerr, Oliver ORCID: 0000-0001-7531-3659 (2024) The Use of Machine Learning to Provide an Early Indication of Programming Students in Need of Support. Doctoral thesis, University of Central Lancashire.

PDF (Thesis) - Submitted Version, 2MB. Available under License Creative Commons Attribution Non-commercial.

Digital ID: http://doi.org/10.17030/uclan.thesis.00052793

Abstract

Programming is a core component of any university-level computer science course. When learning to program, students’ efforts can be hampered by a variety of misconceptions pertaining to fundamental programming concepts. These can range from a complete misunderstanding of a concept, to small, yet frequent mistakes that can lead to logical errors within programs. The misconceptions students hold can prevent them from developing appropriate mental models of concepts, which can ultimately create a barrier to students’ learning. These misconceptions can cause issues in terms of students’ understanding of the content they are being taught and can also have a detrimental impact on students’ confidence. As such, it is necessary to identify students who are likely to require support with learning to program at the earliest possible opportunity. The present research, therefore, intends to establish a deeper understanding of the mental models students hold of core programming concepts prior to starting their degrees, and how they develop during the first semester of teaching within an introductory programming module. How students’ mental models relate to their prior experiences and their perceived levels of confidence is also explored as part of this work, as well as how these factors link to students’ performance within their programming module.

There are two distinct parts to this investigation. The first part focuses on the design and development of an aptitude test, termed the Programming Checkup, which is the main data collection mechanism for this research. The Programming Checkup was subsequently issued to students on two occasions: first at the beginning of their courses, and again towards the end of the first semester, thereby allowing for an examination of students’ progress throughout the initial stage of their introductory programming module. The second part of the investigation explores the potential for using machine learning, together with students’ responses to the Programming Checkup at the beginning of their courses, to predict students’ results in their first introductory programming assessment.

The findings from the analysis conducted during this investigation indicate that prior programming experience clearly benefits students, both in their likelihood of holding appropriate mental models and in their levels of confidence and anxiety surrounding learning to program. Likewise, having previously studied computer science also benefits students, although not as substantially as prior programming experience. Previously studying a mathematics-based subject after leaving school does not benefit students, either in their likelihood of holding appropriate mental models or in their levels of confidence or anxiety surrounding learning to program, to the same extent as previously studying computer science or having prior programming experience. Furthermore, factors that point to some students being intrinsically motivated, such as intending to work in a software engineering role after graduation or considering themselves to be “self-taught programmers”, are associated with higher levels of confidence and with a greater likelihood of holding appropriate mental models.

One of the main intentions of this investigation was to explore how students’ responses to the Programming Checkup at the beginning of their course can be used to help identify students who are likely to require support with learning to program. As such, an exploration of how machine learning can be used to predict the results students achieve in their first introductory programming assessment was undertaken, with both classification and regression approaches considered. The best performing regression model was the Random Forest Regressor, which achieved an average RMSE of 0.1686 when trained on the full training dataset and 0.1687 when evaluated on the holdout testing dataset. This close agreement indicates that the model has not overfitted the training data, and that it can make predictions accurate enough to give an indication of a student’s performance and thus serve as a guide for identifying students who would likely benefit from additional support. Similarly, the Random Forest Classifier was the best performing classification model, achieving an average AUC of 0.7400 when trained on the full training dataset. However, it achieved an average AUC of only 0.6595 when evaluated on the holdout test set, indicating a substantial amount of overfitting, potentially due to the inherent class imbalance that arises when a result of 50% is used as the pass/fail threshold. There is, therefore, a clear need for future work to establish a more appropriate threshold, and to explore ways of improving the performance of both the regression and classification models. Nonetheless, this investigation has demonstrated the potential of the approach, which can be improved and expanded upon in future research stemming from this work.
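The evaluation described above can be illustrated with a minimal scikit-learn sketch. This is not the thesis’s actual pipeline: the feature set, sample sizes, and target are synthetic stand-ins (binary responses to hypothetical checkup items, and an assessment mark in [0, 1]), chosen only to show how a Random Forest Regressor is scored with RMSE and a Random Forest Classifier with AUC on a holdout set, with pass/fail derived from a 50% threshold.

```python
# Illustrative sketch only: synthetic data standing in for Programming
# Checkup responses; not the thesis's real features or model settings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_students, n_items = 200, 20

# Hypothetical features: binary responses to 20 checkup items.
X = rng.integers(0, 2, size=(n_students, n_items)).astype(float)
# Hypothetical target: assessment mark in [0, 1], loosely tied to responses.
y = np.clip(X.mean(axis=1) + rng.normal(0, 0.1, n_students), 0.0, 1.0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Regression: predict the continuous mark, report holdout RMSE.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
rmse = mean_squared_error(y_test, reg.predict(X_test)) ** 0.5

# Classification: pass/fail at a 50% threshold, report holdout AUC.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, (y_train >= 0.5).astype(int))
auc = roc_auc_score((y_test >= 0.5).astype(int),
                    clf.predict_proba(X_test)[:, 1])

print(f"holdout RMSE: {rmse:.4f}, holdout AUC: {auc:.4f}")
```

Comparing the training-set and holdout-set scores of each model, as the abstract does, is what reveals overfitting: a large gap between the two (as with the classifier’s AUC) signals that the model has fitted noise rather than a generalisable pattern.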

