Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/115169
Type: Theses
Title: A methodology for predictive topic modelling; or, any excuse to watch Love Actually
Author: Glenny, Vanessa Grace
Issue Date: 2018
School/Discipline: School of Mathematical Sciences
Abstract: Topic modelling is an area of natural language processing (NLP) in which a corpus of text documents is summarised by an underlying structure of 'topics', or themes. Due to the incredibly complex nature of human language, we often require ways to meaningfully summarise the information contained in a piece of text. Topic models provide a method in which we are able to keep substantial semantic information, but still work with a small number of variables. Topic modelling has mainly been applied to machine learning problems, with little emphasis on prediction from text. This thesis provides a statistical framework for prediction from, or about, text, using topic models as a data reduction method and the topics themselves as predictors. The results of this thesis show that while using individual words as predictors in a regression model remains the most accurate method, it is far too computationally expensive to apply to large corpora. However, the topic regression models proposed here perform comparably, and at a much lower computational cost. We also show that incorporating more information, such as the structure of language, into topic model inference improves the predictive capability of the topics. This thesis therefore proposes a computationally viable, well-performing method for prediction from text. From here, we may consider adapting additional topic models to a regression framework, depending on the problem at hand and its requirements. These methods, while tested in this thesis on relatively small corpora, would also be applicable to big data problems.
Advisor: Bean, Nigel Geoffrey
Mitchell, Lewis
Tuke, Simon Jonathan
Dissertation Note: Thesis (MPhil) -- University of Adelaide, School of Mathematical Sciences, 2018
Keywords: Topic modelling
natural language processing
prediction
statistical modelling
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at http://www.adelaide.edu.au/legals
Appears in Collections: Research Theses

Files in This Item:
File: Glenny2018_MPhil.pdf
Size: 1.09 MB
Format: Adobe PDF

