The website Footnote 2 was used to gather tweet IDs Footnote 3; this website provides researchers with metadata of a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). This circumvents the restriction of the Twitter Search API on retrieving older tweets (i.e., the historical limit when requesting tweets based on a search query). The R package 'rtweet' and its complementary 'lookup_status' function were used to collect the tweets in JSON format. The JSON file comprises a table with the tweets' information, including the creation date, the tweet text, and the source (i.e., type of Twitter client).
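As an illustration, a minimal R sketch of hydrating tweet IDs with 'rtweet' is given below; the file names and credentials are placeholders, and the function is named lookup_statuses() in the rtweet version assumed here (newer releases call it lookup_tweets()).

library(rtweet)
library(jsonlite)

# Hypothetical input: one tweet ID per line, obtained from the website above
ids <- readLines("tweet_ids.txt")

# Authentication details are placeholders; rtweet also supports interactive authentication
token <- create_token(app = "my_app",
                      consumer_key = "KEY", consumer_secret = "SECRET",
                      access_token = "TOKEN", access_secret = "TOKEN_SECRET")

# Hydrate the IDs into a data frame with creation date, text, source, etc.
tweets <- lookup_statuses(ids, token = token)

# Store the raw result as JSON for the cleaning steps described below
write_json(tweets, "tweets_raw.json")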
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) we removed tweets that belonged to the top 0.5 percentile of user activity because we considered them non-representative of the normal user population, such as users who created more than 2000 tweets within four weeks; (2) tweets from users with early access to the 280-character limit were removed; (3) tweets from users who were not represented in both the pre- and post-CLC datasets were removed. This last procedure ensured a consistent user sample over time (within-group design, Nusers = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
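A minimal dplyr sketch of the user-level exclusions follows; the data frame and column names (user_id, lang, is_retweet, period) are assumptions for illustration, and the filter for automated tweets is omitted.

library(dplyr)

# Count tweets per user and determine the activity cut-off (top 0.5 percentile)
user_counts   <- tweets %>% count(user_id, name = "n_user")
cutoff        <- quantile(user_counts$n_user, 0.995)
regular_users <- user_counts$user_id[user_counts$n_user <= cutoff]

cleaned <- tweets %>%
  filter(lang == "nl", !is_retweet) %>%     # keep Dutch tweets, drop retweets
  filter(user_id %in% regular_users) %>%    # drop hyperactive (non-representative) users
  group_by(user_id) %>%
  filter(n_distinct(period) == 2) %>%       # keep users present both pre- and post-CLC
  ungroup()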
The tweet texts were converted into ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs increase the character count when they occur within the tweet; however, URLs do not add to the character count when they are located at the end of a tweet. To avoid misrepresenting the actual character limit that users experienced, tweets with URLs (but not media URLs, such as attached images or videos) were excluded.
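For illustration, a minimal sketch of these preprocessing steps on a character vector txt of tweet texts could look as follows; the regular expressions are approximations rather than the exact patterns used.

# Convert to ASCII encoding (non-convertible characters are transliterated)
txt <- iconv(txt, from = "UTF-8", to = "ASCII//TRANSLIT")

# Remove line breaks and references to screen names
txt <- gsub("[\r\n]+", " ", txt)
txt <- gsub("@\\w+", "", txt)

# Exclude tweets that contain (non-media) URLs
has_url <- grepl("https?://\\S+", txt)
txt <- txt[!has_url]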
Token and bigram analysis
The R package Footnote 5 'quanteda' was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation marks, and symbols). In addition, token frequency matrices were computed containing: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-score is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative token frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score formula used in the analysis is presented in Eqs. (1) and (2), where N is the total number of tokens per dataset (i.e., pre- and post-CLC). The formula is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
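A minimal quanteda sketch of the tokenization and T-score computation is given below; the vectors txt_pre and txt_post are assumed inputs, and the T-score line follows the general Church et al. (1991) formulation, which may differ in detail from Eqs. (1) and (2).

library(quanteda)

# Tokenize and derive token frequencies per dataset
f_pre  <- colSums(dfm(tokens(txt_pre)))    # f(token pre)
f_post <- colSums(dfm(tokens(txt_post)))   # f(token post)

N_pre  <- sum(f_pre)                       # total tokens pre-CLC
N_post <- sum(f_post)                      # total tokens post-CLC

# Align the two vocabularies, treating unseen tokens as frequency 0
vocab <- union(names(f_pre), names(f_post))
fp <- f_pre[vocab];  fp[is.na(fp)] <- 0
fq <- f_post[vocab]; fq[is.na(fq)] <- 0

p_pre  <- fp / N_pre                       # P(token pre)
p_post <- fq / N_post                      # P(token post)

# T-score in the spirit of Church et al. (1991):
# t = (P(post) - P(pre)) / sqrt(P(post)/N_post + P(pre)/N_pre)
t_score <- (p_post - p_pre) / sqrt(p_post / N_post + p_pre / N_pre)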
Part-of-speech (POS) analysis
The R package Footnote 6 'openNLP' was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctions, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates with a maximum entropy (maxent) probability model that predicts the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on the CoNLL-X Alpino Dutch Treebank data (Buchholz and Marsi, 2006). The openNLP POS model has been reported to reach an accuracy of 87.3% when applied to English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the accuracy of the POS tagger. However, identical analyses were performed on the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.
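A minimal sketch of the POS tagging step with openNLP is shown below; it assumes the Dutch maxent models from the companion 'openNLPmodels.nl' package are installed, and the example sentence is purely illustrative.

library(NLP)
library(openNLP)

pos_tags <- function(text) {
  s <- as.String(text)
  # Sentence and word annotations must precede POS tagging
  a <- annotate(s, list(Maxent_Sent_Token_Annotator(language = "nl"),
                        Maxent_Word_Token_Annotator(language = "nl")))
  a <- annotate(s, Maxent_POS_Tag_Annotator(language = "nl"), a)
  words <- subset(a, type == "word")
  sapply(words$features, `[[`, "POS")      # one POS tag per token
}

# Count POS categories for a single (illustrative) tweet text
table(pos_tags("Dit is een voorbeeld van een korte tweet."))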