Determining Gender on Twitter

One of the things that is missing from Twitter is an indication of a person’s gender.  When you register for Twitter, you do not indicate whether you are male or female, and there is nothing later on that directly identifies your Twitter account as being male or female (see footnote 1).

Nonetheless, the balance between males and females in various groups is often an important question, and so we must rely on indirect techniques.

The approach I use is fairly straightforward.  I first look at the first name of the person.  Some first names are very predictive — it is unlikely that a Charles is going to be female, and so any Charles we see I will count as male.

To build the list of names, I look to the Social Security Administration’s index of popular baby names.  You can download the top 1000 male and 1000 female first names for any given year.  By building up a database of names used over the past century it’s possible to get a fairly comprehensive list of about 5000 distinctly male or female first names.
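As a rough illustration of the name lookup, here is a minimal sketch in Python.  It assumes the baby-name data has already been loaded as (name, sex, count) rows; the function names, the 95% cutoff for calling a name "distinctly" male or female, and the rule of using only the first word of the display name are all assumptions made for the example, not a description of the actual database.

```python
from collections import defaultdict


def build_name_table(rows, threshold=0.95):
    """Map lowercase first names to 'M' or 'F' when one sex dominates.

    rows: iterable of (name, sex, count) tuples, with sex 'M' or 'F'.
    threshold: hypothetical cutoff; names below it for both sexes are
    treated as androgynous and left out of the table.
    """
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, count in rows:
        counts[name.lower()][sex] += count

    table = {}
    for name, c in counts.items():
        total = c["M"] + c["F"]
        if total == 0:
            continue
        if c["M"] / total >= threshold:
            table[name] = "M"
        elif c["F"] / total >= threshold:
            table[name] = "F"
    return table


def guess_from_first_name(display_name, table):
    """Return 'M', 'F', or None based on the first word of the name."""
    parts = display_name.strip().split()
    if not parts:
        return None
    return table.get(parts[0].lower())
```

With a table built this way, a Charles would be counted as male, while a name that is split evenly between the sexes in the data, or an abbreviation like "S.E.", would come back as None and fall through to the next step.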

Still, some names are androgynous (Chris? Lee? Pat?), and we have to look at the person’s profile for indications.  People often say “wife of” or “father to”, etc.  By looking for these key phrases in a person’s profile I can make an educated guess.
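The profile check could be sketched along the following lines.  The word lists and the "of"/"to" pattern are illustrative assumptions, not the actual phrase list, and the function name is hypothetical.

```python
import re

# Hypothetical relationship words; a real list would likely be longer.
MALE_TERMS = ("husband", "father", "dad", "son", "brother")
FEMALE_TERMS = ("wife", "mother", "mom", "daughter", "sister")


def _pattern(terms):
    # Matches phrases such as "wife of" or "father to", case-insensitively.
    return re.compile(r"\b(?:%s)\s+(?:of|to)\b" % "|".join(terms), re.IGNORECASE)


MALE_RE = _pattern(MALE_TERMS)
FEMALE_RE = _pattern(FEMALE_TERMS)


def guess_from_profile(bio):
    """Return 'M', 'F', or None if the bio has no clue or conflicting clues."""
    male = bool(MALE_RE.search(bio))
    female = bool(FEMALE_RE.search(bio))
    if male and not female:
        return "M"
    if female and not male:
        return "F"
    return None
```

A simple pattern like this will produce false positives (a bio that says "looking for a wife of my own" would match), which is one reason the result is only an educated guess.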

Even with all these techniques, I can classify a person’s gender only about 50% of the time.  Lots of people use abbreviations (unless you knew, how would you classify Crossfire host S.E. Cupp?).  Other people use fake names that are gender neutral.  And other people just have oddball names or nicknames that aren’t in my database.

There are researchers out there who feel they can look at the wording people use in their status updates to make a better assessment of people’s gender, but I have not moved towards that approach yet.  They may be right, but I suspect that they may just be fitting a predictive model to their existing data.  Regardless, it’s very time-consuming to track down a corpus of a person’s tweets, especially when you’re trying to deal with large volumes of tweets in real time.

Given that we can only guess at half of the Twitter users, what do we do about the half we cannot guess? At this point, I mostly ignore them.  I make the assumption that the overall population will have the same gender distribution as the identifiable portion.  But even for the portion we can guess, there is always a realistic chance that some or all are being deceptive in their Twitter profile.
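The arithmetic behind that assumption is simple enough to sketch.  The function and key names below are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def extrapolate(male, female, unknown):
    """Project the gender split of classified accounts onto the whole group.

    Assumes the unclassified accounts have the same male/female split as
    the classified ones, which cannot be verified from the data itself.
    """
    classified = male + female
    if classified == 0:
        raise ValueError("no classified accounts to extrapolate from")
    total = classified + unknown
    male_share = male / classified
    female_share = female / classified
    return {
        "coverage": classified / total,           # fraction we could classify
        "male_share": male_share,
        "female_share": female_share,
        "est_male_total": male_share * total,     # projected onto everyone
        "est_female_total": female_share * total,
    }
```

For example, with 300 accounts classified as male, 200 as female, and 500 unknown, coverage is 50% and the 60/40 split among the classified accounts is applied to all 1,000.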

No matter what, estimates of gender on Twitter will always carry a caveat: they are unverifiable, though we can reasonably assume they get better as the sample size grows.
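One way to put a rough number on the sample-size effect is the standard margin of error for a proportion.  This sketch assumes the classified accounts behave like a simple random sample, which is itself an assumption; it says nothing about systematic problems such as deceptive profiles or names missing from the database.

```python
import math


def margin_of_error(share, n, z=1.96):
    """Approximate 95% margin of error for a proportion from n classified accounts.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); z=1.96
    corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(share * (1 - share) / n)
```

Because the margin shrinks with the square root of n, classifying 100 times as many accounts narrows it by a factor of 10.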

1 I appreciate that gender can be more complicated than male/female; unfortunately I don’t think I can do much about it in this case.
