Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams

Shippey, Thomas, Bowes, David and Hall, Tracy (2018) Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams. Information and Software Technology, 106 . pp. 142-160. ISSN 0950-5849

[thumbnail of Author Accepted Manuscript]
PDF (Author Accepted Manuscript) - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Official URL:


Context: Identifying defects in code early is important. A wide range of static code metrics have been evaluated as potential defect indicators. Most of these metrics offer only high level insights and focus on particular pre-selected features of the code. None of the currently used metrics clearly performs best in defect prediction. Objective: We use Abstract Syntax Tree (AST) n-grams to identify features of defective Java code that improve defect prediction performance. Method: Our approach is bottom-up and does not rely on pre-selecting any specific features of code. We use non-parametric testing to determine relationships between AST n-grams and faults in both open source and commercial systems. We build defect prediction models using three machine learning techniques. Results: We show that AST n-grams are very significantly related to faults in some systems, with very large effect sizes. The occurrence of some frequently occurring AST n-grams in a method can mean that the method is up to three times more likely to contain a fault. AST n-grams can have a large effect on the performance of defect prediction models. Conclusions: We suggest that AST n-grams offer developers a promising approach to identifying potentially defective code.

Repository Staff Only: item control page