Gender Recognition on Dutch Tweets
For such high numbers of features, it is known that k-NN learning is unlikely to yield useful results (Beyer et al.). Normalized 1-gram About features. For only one feature type, character trigrams, LP with PCA manages to reach a higher accuracy than SVR, but the difference is not statistically significant.
After this, we examine the classification of individual authors (Section 5).
However, our starting point will always be SVR with token unigrams, this being the best-performing combination.

Results

In this section, we present the overall results of the gender recognition.
The unigrams do not judge him to write in an extremely female way, but all other feature types do. One gets the impression that gender recognition is more sociological than linguistic, showing what women and men were blogging about back in A later study Goswami et al.
A group which is very active in studying gender recognition (among other traits) on the basis of text is that around Moshe Koppel. Instead, we will just look at the distribution of the various features over the female and male texts.
We will focus on the token n-grams and the normalized character 5-grams. Original 4-gram About K features. SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme with a score of SVR with PCA on the other hand, is less convinced, and even classifies him as female for unigrams 1.
A model, called profile, is constructed for each individual class, and the system determines for each author to which degree they are similar to the class profile.
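The profile idea can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the LP system used in the paper: feature values are turned into z-scores (so over- and underuse become positive and negative numbers), a class profile is the mean z-score vector of the class's authors, and similarity is a plain dot product. The feature names, class names, and numbers are hypothetical toy data.

```python
from statistics import mean, pstdev

def zscores(rates_by_author):
    """Convert raw feature rates into per-feature z-scores across all authors."""
    features = sorted({f for rates in rates_by_author.values() for f in rates})
    stats = {}
    for f in features:
        values = [rates.get(f, 0.0) for rates in rates_by_author.values()]
        # Guard against zero spread so the division below is always defined.
        stats[f] = (mean(values), pstdev(values) or 1.0)
    return {
        author: {f: (rates.get(f, 0.0) - stats[f][0]) / stats[f][1] for f in features}
        for author, rates in rates_by_author.items()
    }

def class_profile(z, authors):
    """Average the z-score vectors of the given authors into one profile."""
    features = next(iter(z.values())).keys()
    return {f: mean(z[a][f] for a in authors) for f in features}

def similarity(author_z, profile):
    """Dot product: high when the author over/underuses what the class does."""
    return sum(author_z[f] * profile[f] for f in profile)

# Hypothetical toy data: two features, two training authors per class.
rates = {
    "a1": {"feat_x": 4.0, "feat_y": 1.0},
    "a2": {"feat_x": 3.0, "feat_y": 0.0},
    "b1": {"feat_x": 0.0, "feat_y": 5.0},
    "b2": {"feat_x": 1.0, "feat_y": 4.0},
    "held_out": {"feat_x": 5.0, "feat_y": 0.0},
}
z = zscores(rates)
# The held-out author is excluded from both profiles.
profile_a = class_profile(z, ["a1", "a2"])
profile_b = class_profile(z, ["b1", "b2"])
score_a = similarity(z["held_out"], profile_a)
score_b = similarity(z["held_out"], profile_b)
```

The z-score statistics use no class labels, and the held-out author contributes to neither profile, so the score is derived without access to that author's own class.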
In this way, we derived a classification score for each author without the system having any direct or indirect access to the actual gender of the author. For this reason, we did all classification with SVR and LP twice, once building a male model and once a female model.
However, it does not manage to achieve good results with the principal components that were best for the other two systems.
There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models. Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features.
This means that the content of the n-grams is more important than their form. Figure 5 shows all token unigrams. However, we cannot conclude that what is wiped away by the normalization, use of diacritics, capitals and spacing, holds no information for the gender recognition.
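To make the difference between original and normalized character n-grams concrete, here is a small sketch. It assumes that normalization means stripping diacritics, lowercasing, and collapsing runs of whitespace; the exact normalization steps in the paper may differ, and the function names and example text are hypothetical.

```python
import unicodedata
from collections import Counter

def normalize(text):
    """Strip diacritics, lowercase, and collapse whitespace runs (assumed steps)."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD, diacritics are separate combining marks (category "Mn").
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return " ".join(no_marks.lower().split())

def char_ngrams(text, n):
    """Count all overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

tweet = "Één  HELE mooie dag"  # hypothetical example text
original_5grams = char_ngrams(tweet, 5)
normalized_5grams = char_ngrams(normalize(tweet), 5)
```

Comparing the two counters shows which n-grams exist only because of diacritics, capitals, or spacing, which is the information that normalization wipes away.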
This apparently colours not only the discussion topics, which might be expected, but also the general language use. Again, we take the token unigrams as a starting point. However, as with any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata.
From each user's tweets, we removed all retweets, as these did not contain original text by the author.
From the aboutusers who are assigned a gender by TwiQS, we took a random selection in such a manner that the volume distribution i. For SVR, one would expect symmetry, as both classes are modeled simultaneously, and differ merely in the sign of the numeric class identifier.
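The symmetry expectation for a regression model can be checked on a toy case. The sketch below uses ordinary least squares on a single feature as a stand-in for SVR (it is not SVR, and real SVR implementations may deviate from this idealized behaviour); the data and function name are hypothetical. Flipping the sign of the numeric class identifier flips the sign of every predicted score.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical one-feature training data with class identifiers -1 and +1.
xs = [0.1, 0.4, 0.5, 0.9, 1.2]
labels = [-1, -1, 1, 1, 1]

slope_pos, icpt_pos = fit_line(xs, labels)                # one class as +1
slope_neg, icpt_neg = fit_line(xs, [-y for y in labels])  # signs swapped

new_x = 0.7
score_pos = slope_pos * new_x + icpt_pos
score_neg = slope_neg * new_x + icpt_neg
```

In this idealized setting, building the model for the other class yields the same ranking of authors with opposite sign, which is why symmetric results would be expected.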
Interestingly, it is SVR that degrades at higher numbers of principal components, while TiMBL, said to need fewer dimensions, manages to hold on to the recognition quality.
They used lexical features, and present a very good breakdown of various word types.
If we search for the word parlement (parliament) in our corpus, which is used 40 times by Sargentini, we find two more female authors (each using it once), as compared to 21 male authors with up to 9 uses.
Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.
In the example tweet, e. Currently the field is getting an impulse for further development now that vast data sets of user-generated data are becoming available. And, obviously, it is unknown to which degree the information that is present is true. In this paper, we start modestly, by attempting to derive just the gender of the authors automatically, purely on the basis of the content of their tweets, using author profiling techniques.
The most obvious male is author with a resounding Looking at his texts, we indeed see a prototypical young male Twitter user: Feature type Unigram 1: These percentages are presented below in Section

Profiling Strategies

In this section, we describe the strategies that we investigated for the gender recognition task.