The best recognizable female author is not as focused as her male counterpart. For the other feature types, we see some variation, but most scores are found near the top of the lists.

This may support our hypothesis that all feature types are doing more or less the same. The tokenizer is able to identify hashtags and Twitter user names to the extent that these conform to the conventions used in Twitter, i.

Although LIWC appears to be a very interesting addition, it hardly adds anything to the classification. They used [...] features, and present a very good breakdown of various word types.

On re-examination, we see a clearly male first name and also a profile photo.

The class separation value is a variant of Cohen's d (Cohen). If we search for the word parlement (parliament) in our corpus, which is used 40 times by Sargentini, we find two more female authors each using it once, as compared to 21 male authors with up to 9 uses.

In scores, too, we see far more variation. This number was treated as just another hyperparameter to be selected. Then, as [...] of our features were based on tokens, we tokenized all text samples, using our own specialized tokenizer for tweets. Where Cohen assumes the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.
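The variant of Cohen's d described here can be illustrated with a minimal sketch. It assumes the value is the difference of the two class means divided by the sum of the two standard deviations. The function name `class_separation`, the use of sample standard deviations, and the sign convention are assumptions made for illustration, not details given in the text.

```python
from statistics import mean, stdev


def class_separation(values_a, values_b):
    """Difference of the class means divided by the sum of the two
    standard deviations (assumed reading of the variant of Cohen's d).

    Assumptions: sample standard deviations are used, and the sign is
    positive when the first class has the higher mean.
    """
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        raise ValueError("both classes have zero standard deviation")
    return (mean(values_a) - mean(values_b)) / spread


if __name__ == "__main__":
    # Hypothetical per-author scores for one feature in the two classes.
    female_scores = [1.0, 2.0, 3.0]
    male_scores = [4.0, 5.0, 6.0]
    print(class_separation(female_scores, male_scores))
```

Dividing by the sum rather than by a pooled standard deviation avoids having to assume that both classes have the same spread.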

Again, we take the token unigrams as a starting point. As for style, the only real factor is echt (really). From this material, we considered all tweets with a date stamp in [...] and [...]. In all, there were about 23 million users present.

In the example tweet, we find e. The only hyperparameters we varied in the grid search are the metric (Numerical and Cosine distance) and the weighting (no weighting, information gain, gain ratio, chi-square, shared variance, and standard deviation).
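The size of this search space can be made concrete with a small sketch that enumerates every combination of metric and weighting. The string labels and the function name `hyperparameter_grid` are illustrative assumptions, not the actual option names of any particular learner.

```python
from itertools import product

# Labels taken from the description above; they are plain strings here,
# not the real option names of a specific machine learning package.
METRICS = ["numerical", "cosine"]
WEIGHTINGS = [
    "no weighting",
    "information gain",
    "gain ratio",
    "chi-square",
    "shared variance",
    "standard deviation",
]


def hyperparameter_grid():
    """Return every (metric, weighting) combination as a dictionary."""
    return [
        {"metric": metric, "weighting": weighting}
        for metric, weighting in product(METRICS, WEIGHTINGS)
    ]


if __name__ == "__main__":
    for setting in hyperparameter_grid():
        print(setting)
```

With two metrics and six weightings, the grid contains twelve settings.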

Several errors could be traced back to the fact that the account had moved on to another user since [...]. We could have used different dividing strategies, but chose balanced folds in order to give an equal chance to all machine learning techniques, also those that have trouble with unbalanced data.

This type of character n-gram has the clear advantage of not needing any preprocessing in the form of tokenization. This restriction brought the number of users down to about [...]. Figure 5 shows all token unigrams.
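To illustrate why character n-grams need no tokenization, the following minimal sketch counts n-grams directly on the raw text of a tweet. It assumes overlapping n-grams taken from the unmodified string, including spaces and punctuation. The function name `char_ngrams` and the example text are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all overlapping character n-grams in the raw text.

    No tokenization or normalization is applied (an assumption for this
    sketch); spaces and punctuation are part of the n-grams.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    # Hypothetical tweet text; the hashtag is treated like any other characters.
    counts = char_ngrams("echt leuk #fb", 5)
    for ngram, count in counts.items():
        print(repr(ngram), count)
```

A text shorter than n simply yields no n-grams.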

[...] get the impression that Dutch is not his native language, which is supported by his name. The authors do not report the set of slang words, but the non-dictionary words appear to be more related to style than to content, showing that purely linguistic behaviour can contribute information for gender recognition as well.

In fact, for all the token n-grams, it would seem that the further one goes away from the unigrams, the worse the accuracy gets.

Gender Recognition on Dutch Tweets

However, our starting point will always be SVR with token unigrams, this being the best performing combination. But it might also mean that the gender just influences all feature types to a similar degree.

When adding more information sources, such as profile fields, they reach an accuracy of