Importance of data preparation when analysing written responses to open-ended questions: An empirical assessment and comparison with manual coding

Publication: Contribution to journal, Journal article, Research, peer-reviewed

In a world where consumer texts grow more numerous each day, automated text analysis can deliver valuable insights about consumer attitudes and behaviours. The present research was methodological in nature and focused on pre-processing of text data, which generally is the most time-consuming stage of analysis. Using responses to an open-ended question from 4341 consumers, document-term matrices (DTM) were created from varying combinations of n-grams (unigrams, bigrams, trigrams and combinations thereof), stemming (yes or no) and low-frequency term thresholding (retaining all terms or excluding those used < 0.1%, < 1% or < 5%). By comparison to a fixed standard (manually derived content coding of respondents’ answers), the relative impact of the three pre-processing steps was assessed. Partial least squares discriminant analysis (PLS-DA) was used to do so, and classifier performance was evaluated using AUC-ROC scores. Inclusion of bigrams and trigrams in DTMs did not influence classification performance, and stemming had only a minor impact. Inclusion of all and very rare features (< 0.1%) improved classification performance. The results were invariant to sample size and replicated in subsets of 2000, 1000 and 500 participants. The results may be specific to the short length of the answers (median words = 4), although they held in a sub-sample of the 500 longest answers (median words = 41). Future research should directly test the influence of these pre-processing steps, for example, through topic modelling.
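The pre-processing grid described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`ngrams`, `toy_stem`, `build_dtm`), the whitespace tokenisation, the toy suffix-stripping stemmer and the four example responses are all hypothetical. It also assumes that the low-frequency threshold refers to the share of responses in which a term occurs, which the abstract does not spell out.

```python
from collections import Counter
from itertools import product


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def build_dtm(responses, ngram_sizes=(1,), stem=False, min_doc_share=0.0):
    """Build a document-term count matrix.

    Returns (vocabulary, rows), where rows[i][j] is the count of
    vocabulary[j] in response i. Terms occurring in a smaller share of
    responses than min_doc_share are dropped (assumed threshold definition).
    """
    docs = []
    for text in responses:
        tokens = text.lower().split()
        if stem:
            tokens = [toy_stem(t) for t in tokens]
        terms = []
        for n in ngram_sizes:
            terms.extend(ngrams(tokens, n))
        docs.append(Counter(terms))

    # Document frequency: number of responses containing each term.
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    vocabulary = sorted(
        term for term, df in doc_freq.items() if df / n_docs >= min_doc_share
    )
    rows = [[counts[term] for term in vocabulary] for counts in docs]
    return vocabulary, rows


# Made-up responses; the thresholds are exaggerated for such a tiny sample.
responses = [
    "tastes good",
    "good price",
    "healthy and tastes good",
    "easy cooking",
]

# One DTM per combination of n-gram sizes, stemming and threshold.
for sizes, stem, share in product([(1,), (1, 2)], [False, True], [0.0, 0.5]):
    vocab, rows = build_dtm(responses, sizes, stem, share)
    print(sizes, stem, share, len(vocab))
```

With real data, the thresholds in the abstract would correspond to `min_doc_share` values of 0.001, 0.01 and 0.05 under the assumption above. Each resulting matrix could then be passed to a classifier and scored against the manual codes.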

Original language: English
Article number: 104270
Journal: Food Quality and Preference
Volume: 93
Number of pages: 14
ISSN: 0950-3293
DOI
Status: Published - 2021

ID: 272576808