'What is this corpus about?': Using topic modelling to explore a specialised corpus

Murakami, A, Thompson, P, Hunston, S and Vajn, Dominik (ORCID: 0000-0001-8047-0026) (2017) 'What is this corpus about?': Using topic modelling to explore a specialised corpus. Corpora, 12 (2). pp. 243-277. ISSN 1749-5032

PDF (Author Accepted Manuscript) - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Official URL: https://doi.org/10.3366/cor.2017.0118


This paper introduces topic modelling, a machine learning technique that automatically identifies 'topics' in a given corpus. The paper illustrates its use in the exploration of a corpus of academic English. It first offers an intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in topic models is characterised by a set of co-occurring words, and we demonstrate that such topics bring us rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates the chronological change of a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.
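
To make concrete the idea that a topic is characterised by a set of co-occurring words, the following is a minimal sketch of one common kind of topic model, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The abstract does not name a specific algorithm or tool, so the choice of method, the function names (fit_lda, top_words, doc_proportions), the hyperparameter values and the four toy documents are all assumptions made for illustration. This is not the authors' procedure or data.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts before resampling it.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of topic k: how common k is in this document times
                # how strongly k is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return up to n most frequent words assigned to each topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


def doc_proportions(doc_topic, alpha=0.1):
    """Return the smoothed share of each topic in each document."""
    result = []
    for row in doc_topic:
        total = sum(row) + alpha * len(row)
        result.append([(c + alpha) / total for c in row])
    return result


# Hypothetical toy corpus: four tiny tokenised "documents".
docs = [
    "species habitat forest species climate".split(),
    "forest climate habitat species forest".split(),
    "policy market trade policy economy".split(),
    "economy trade market policy trade".split(),
]
doc_topic, topic_word = fit_lda(docs)
proportions = doc_proportions(doc_topic)
print(top_words(topic_word))
print(proportions)
```

Each topic is reported as its most frequent words, and each document as a mixture of topics. Which words end up in which topic depends on the random seed and the hyperparameters, so the printed lists are best read as a starting point for interpretation rather than as fixed categories.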
