Gender Attribution Methodology

The gender attribution team used two sources of information:

  1. IBM-GNR (Global Name Recognition), a name search technology produced by IBM. IBM-GNR is a commercial product that performs various name disambiguation tasks, two of which are relevant to our methodology: (1) associating names and surnames with one or (more often) several countries of likely origin, and (2) associating given names with male and female gender, expressed as probability estimates. These associations originate from a database produced by U.S. immigration authorities in the first half of the 1990s, when the authorities registered the names and surnames, nationality, and gender of all foreign citizens entering the United States. The database contains roughly 750,000 full names; in addition, variants of registered names and surnames are considered, according to country-sensitive spelling and abbreviation rules. More information can be found in Breschi et al. (2017a, 2017b).
  2. The WIPO worldwide gender-name dictionary (WGND), produced by the World Intellectual Property Organization (WIPO). It lists 6.2 million names from 182 countries. For each name, it records a gender for every country in which that name appears in the source data. The construction of the WGND drew on previous gender studies (see the literature review) as well as national public statistical institutions. See Martínez et al. (2016) for details. In some countries, certain names are used for both women and men; these names are given an “unknown” status in the countries where this is the case.
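
To make the “unknown” rule concrete, the following is a minimal sketch of how a country-specific name dictionary of this kind could be collapsed into one label per name-country pair. It is not the WGND's actual construction: the function name `build_gender_lookup`, the record layout, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def build_gender_lookup(records):
    """Collapse (name, country, gender) records into one label per name-country pair.

    A pair observed with both genders is labelled "unknown", mirroring the
    rule described above. The record layout is a hypothetical simplification.
    """
    seen = defaultdict(set)
    for name, country, gender in records:
        seen[(name.lower(), country)].add(gender)
    return {
        key: next(iter(genders)) if len(genders) == 1 else "unknown"
        for key, genders in seen.items()
    }


# Hypothetical records, invented for illustration only.
records = [
    ("Andrea", "IT", "M"),
    ("Andrea", "DE", "F"),
    ("Kim", "US", "F"),
    ("Kim", "US", "M"),
]
lookup = build_gender_lookup(records)
print(lookup[("andrea", "IT")])
print(lookup[("kim", "US")])
```

Keying the lookup on the name-country pair, rather than on the name alone, is what lets the same name carry different genders in different countries.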

Using these two sources of country-specific gender-attributed names, the gender attribution team attributes gender to U.S. Patent and Trademark Office (USPTO) inventor names through the following steps:

  1. For each inventor name, the IBM-GNR returns the fraction of instances it identifies as male in the data source and the fraction it identifies as female. In addition, it returns a “frequency” metric that indicates the frequency with which each name appears in the complete data set. A very uncommon name will be assigned a very low frequency, indicating that gender attribution will be unreliable for that name.
  2. For each inventor first name, female gender is attributed if the name is identified as female in 97% or more of cases, and male gender is attributed if it is identified as male in 98% or more of cases. These thresholds were chosen by manual inspection of the distribution of gender shares from step 1. Names with a frequency metric of 5% or less are excluded because their attribution is unreliable. This step resulted in the gender attribution of 71.49% (2,499,999) of inventors.
  3. When the inventor’s first name is majority one gender but does not reach the thresholds established in step 2, the second (or middle) name is taken into consideration. If the second name reaches the threshold value, the appropriate gender is attributed to the inventor. This step results in 35,581 additional names being assigned a gender, raising the attribution rate to 72.90%.
  4. After steps 1–3, 943,725 inventor names remain unattributed. For these, we rely on WIPO’s WGND (described in the data sources above). Because the WGND is country-specific, we must first use the IBM-GNR to attribute a country of origin to each remaining inventor.
  5. For each likely country of origin, the GNR attaches a measure of significance, which measures the share of instances in which the name or surname is associated with a given country of origin. The present algorithm uses the vector of countries associated with the surname to assign a country of origin. The surname is used because the first name is the name of interest for gender attribution and thus cannot also be part of the decision rule for country of origin.
  6. Of the set of countries associated with each surname, only those with at least 10% significance are considered. After dropping those under 10%, the list is sorted by significance in descending order. This step raises a further problem, best explained through the example of the surname “Smith.” It could be given a 30% significance for Germany, 20% for the United Kingdom, and 10% each for Ireland and Australia. In principle, the surname would be associated with Germany, but the English-speaking countries together add up to a higher level of significance than Germany. To address this problem, some countries are collapsed into linguistic groups, producing a list of countries and linguistic groups associated with each surname. Each linguistic group is ranked in the overall list as a single entry, and its member countries are then ranked within the group.
  7. After sorting, each surname, and hence each inventor, is associated with the ordered list of linguistic groups and individual countries. The “Smith” example from step 6 would first be associated with the United Kingdom, Ireland, the United States, Australia, Canada, and so on (all English-speaking countries), and then with Germany, Switzerland, and Austria.
  8. With linguistic groups and countries associated with each surname, the first name and at least one of the associated countries are matched to name-country pairs in the WGND data set. More than one linguistic group is kept per inventor because, for some names, no name-country pair for the first linguistic group exists in the WGND data set. In those cases, the most significant linguistic group included in the data set is used.
  9. For some inventors with rare surnames, we were not able to create a list of likely countries of origin. In these cases, country of residence is substituted for country of origin. This happened in 3.92% (37,003) of cases.
  10. After steps 4–9 using the WGND, an additional 498,620 inventor names were given gender attributions.
  11. Last, the cases with no name-country match in the WGND process must be addressed. These cases account for 18% (169,405) of the inventor names remaining after step 3. For these, we use the WGND gender attribution despite the lack of a name-country match, and attribute gender only if two conditions are satisfied: (1) all instances of the name in the WGND have the same gender and (2) the majority of instances generated by GNR coincide with the gender attributed by the WGND.
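
The country-ranking logic of steps 5–8 can be sketched as follows. This is an illustrative reading of the steps, not the team's implementation. The “Smith” significances are the hypothetical figures from step 6 (reading the 10% as applying to Ireland and Australia each); the mapping is a small excerpt of Table A.1.3; the 5% entry for France, the `wgnd` dictionary, the function names, and the alphabetical ordering of group members that the GNR did not return are assumptions added for illustration.

```python
from collections import defaultdict

# Small excerpt of the country -> linguistic group mapping in Table A.1.3.
LINGUISTIC_GROUP = {
    "GB": "English", "IE": "English", "AU": "English", "US": "English", "CA": "English",
    "DE": "German", "AT": "German", "CH": "German",
}


def rank_countries(significance, min_share=10):
    """Order candidate countries of origin for one surname.

    significance: dict of country code -> significance in percent.
    Countries below min_share are dropped, groups are ranked by their summed
    significance, and countries are ranked within each group.
    """
    kept = {c: s for c, s in significance.items() if s >= min_share}
    group_total = defaultdict(int)
    for country, share in kept.items():
        # Countries absent from the mapping form their own one-country group.
        group_total[LINGUISTIC_GROUP.get(country, country)] += share

    ranked = []
    for group in sorted(group_total, key=lambda g: -group_total[g]):
        members = [c for c in kept if LINGUISTIC_GROUP.get(c, c) == group]
        members.sort(key=lambda c: -kept[c])
        # Assumption: remaining group members follow in alphabetical order.
        others = sorted(
            c for c, g in LINGUISTIC_GROUP.items() if g == group and c not in kept
        )
        ranked.extend(members + others)
    return ranked


def attribute_with_wgnd(first_name, ranked_countries, wgnd):
    """Return the gender of the first name-country pair found, else None."""
    for country in ranked_countries:
        gender = wgnd.get((first_name.lower(), country))
        if gender is not None:
            return gender
    return None


smith = {"DE": 30, "GB": 20, "IE": 10, "AU": 10, "FR": 5}  # FR added to show the 10% cut
wgnd = {("john", "US"): "M", ("john", "DE"): "M"}          # hypothetical dictionary entries

order = rank_countries(smith)
print(order)
print(attribute_with_wgnd("John", order, wgnd))
```

With these inputs the English-speaking group (40% in total) outranks Germany (30%), so the first name is looked up against English-speaking countries before German-speaking ones, as in the “Smith” example.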

Following these 11 steps, known as the “baseline” method, we were able to attribute gender to 92.08% (3,206,605) of USPTO inventors. However, the attribution rate is much higher for the United States than for countries such as China, India, and the Republic of Korea. Prior studies with a similar aim that attempted to attribute gender to Asian names share this problem. Therefore, some additional steps were implemented to create a “baseline-augmented” method. Thresholds for these steps were all set by manual inspection of the distribution of GNR shares for each group/country.

  1. For surnames primarily associated with China, Singapore, Taiwan, Macao, and Hong Kong, we attribute a gender if it is identified in 60% or more of GNR cases.
  2. For surnames primarily associated with the Republic of Korea, the threshold is set at 80%.
  3. For surnames primarily associated with India, the threshold is set at 90%.

These additional steps for Asian countries attribute gender to a further 1.1% (38,188) of total names, bringing the attribution rate up to 93.18% (3,244,813) of total inventors.
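
The threshold rules of baseline steps 2–3 and of the three baseline-augmented steps can be sketched as below. This is a simplified illustration rather than the production code: the `NameStats` type and function names are hypothetical, the sample shares are invented, and the sketch assumes that the frequency filter applies to the first name only and that a second name reaching a threshold determines the gender on its own.

```python
from typing import NamedTuple, Optional


class NameStats(NamedTuple):
    """GNR-style output for one given name; all values in percent."""
    female: float
    male: float
    frequency: float


def by_threshold(stats: NameStats, female_min: float, male_min: float) -> Optional[str]:
    """Attribute a gender only if its share reaches the required threshold."""
    if stats.female >= female_min:
        return "F"
    if stats.male >= male_min:
        return "M"
    return None


def baseline(first: NameStats, middle: Optional[NameStats] = None) -> Optional[str]:
    """Steps 2-3: first name at 97% (female) / 98% (male), else the second name."""
    if first.frequency <= 5.0:
        return None  # too rare for a reliable attribution
    gender = by_threshold(first, 97.0, 98.0)
    if gender is None and middle is not None:
        gender = by_threshold(middle, 97.0, 98.0)
    return gender


# Lower thresholds keyed by the country primarily associated with the surname.
AUGMENTED_MIN = {
    "CN": 60.0, "SG": 60.0, "TW": 60.0, "MO": 60.0, "HK": 60.0,
    "KR": 80.0,
    "IN": 90.0,
}


def augmented(first: NameStats, surname_country: str) -> Optional[str]:
    """Baseline-augmented steps: one threshold per surname country, both genders."""
    threshold = AUGMENTED_MIN.get(surname_country)
    if threshold is None:
        return None
    return by_threshold(first, threshold, threshold)


# Invented shares for illustration.
print(baseline(NameStats(female=99.1, male=0.9, frequency=40.0)))
print(baseline(NameStats(2.5, 97.5, 40.0), NameStats(0.5, 99.5, 30.0)))
print(augmented(NameStats(30.0, 70.0, 20.0), "CN"))
print(augmented(NameStats(30.0, 70.0, 20.0), "KR"))
```

In this sketch a name that is 70% male is attributed under the 60% threshold for China-associated surnames but stays unattributed under the 80% threshold for the Republic of Korea.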


Breschi, S., Lissoni, F., Miguelez, E., 2017a. Foreign-origin inventors in the USA: testing for diaspora and brain gain effects. J Econ Geogr 17, 1009–1038. https://doi.org/10.1093/jeg/lbw044

Breschi, S., Lissoni, F., Tarasconi, G., 2017b. Inventor Data for Research on Migration & Innovation: The Ethnic-Inv Pilot Database, in: Fink, C., Miguelez, E. (Eds.), The International Mobility of Talent and Innovation: New Evidence and Policy Implications. Cambridge University Press.

Martínez, G.L., Raffo, J., Saito, K., 2016. Identifying the Gender of PCT inventors (No. 33), WIPO Economic Research Working Papers. World Intellectual Property Organization - Economics and Statistics Division.


Appendix

Below are three tables. Table A.1.1 shows the non-attribution rates for the process detailed in steps 1–11, Table A.1.2 shows the rates after the inclusion of the three additional baseline-augmented steps, and Table A.1.3 shows the linguistic groups used in the attribution process.

Table A.1.1: Gender non-attribution cases
(% of total inventors worldwide, with country breakdown)

               Baseline                Baseline-augmented      Ethnic-based            Ethnic-based-augmented
Country        %       # (,000)        %       # (,000)        %       # (,000)        %       # (,000)
All countries  7.92    276             6.82    237             9.44    329             5.45    190
of which:
AT             0.01    0.3             0.01    0.3             0.01    0.5             0.01    0.2
AU             0.03    0.9             0.02    0.8             0.04    1.4             0.02    0.7
BE             0.02    0.6             0.02    0.6             0.02    0.8             0.01    0.4
BR             0.01    0.5             0.01    0.5             0.02    0.6             0.01    0.4
CA             0.14    5               0.13    4.5             0.23    8.1             0.11    3.9
CH             0.03    0.9             0.03    0.9             0.05    1.6             0.02    0.7
CN             0.99    34.6            0.85    29.6            0.84    29.3            0.78    27.3
DE             0.12    4.3             0.12    4.2             0.26    9               0.11    3.7
DK             0.01    0.4             0.01    0.4             0.03    0.9             0.01    0.4
ES             0.01    0.3             0.01    0.3             0.02    0.6             0.01    0.3
FI             0.03    0.9             0.03    0.9             0.04    1.4             0.01    0.4
FR             0.10    3.6             0.10    3.6             0.10    3.6             0.06    2
GB             0.06    2.2             0.06    2.1             0.11    3.8             0.05    1.8
GR             0.00    0.1             0.00    0.1             0.01    0.2             0.00    0.1
HK             0.03    1.2             0.02    0.8             0.02    0.6             0.01    0.5
HU             0.01    0.3             0.01    0.3             0.02    0.7             0.01    0.3
IE             0.00    0.1             0.00    0.1             0.01    0.3             0.00    0.1
IL             0.05    1.7             0.05    1.7             0.10    3.4             0.04    1.5
IN             0.26    8.9             0.25    8.7             0.42    14.5            0.25    8.7
IT             0.04    1.4             0.04    1.4             0.07    2.6             0.04    1.3
JP             1.56    54.4            1.56    54.2            2.03    70.9            1.48    51.7
KR             0.81    28.3            0.66    23              0.32    11.3            0.20    6.9
MX             0.00    0.1             0.00    0.1             0.00    0.1             0.00    0.1
MY             0.04    1.3             0.03    1               0.02    0.7             0.01    0.5
NL             0.06    2               0.05    1.9             0.11    3.9             0.05    1.9
NO             0.01    0.5             0.01    0.4             0.02    0.7             0.01    0.4
NZ             0.00    0.1             0.00    0.1             0.01    0.2             0.00    0.1
PL             0.00    0.1             0.00    0.1             0.01    0.2             0.00    0.1
PT             0.00    0               0.00    0               0.00    0               0.00    0
RU             0.02    0.6             0.02    0.6             0.03    1.2             0.02    0.6
SE             0.03    1.2             0.03    1.2             0.05    1.7             0.02    0.8
SG             0.09    3.2             0.07    2.4             0.05    1.9             0.04    1.5
TR             0.00    0.1             0.00    0.1             0.01    0.4             0.00    0.1
TW             0.84    29.3            0.35    12.3            0.14    4.8             0.12    4.1
US             2.36    82.2            2.14    74.2            3.95    137.8           1.78    62
ZA             0.01    0.3             0.01    0.3             0.01    0.4             0.01    0.2

Table A.1.2: Gender non-attribution cases (% of total inventors by country)

Country        Baseline    Baseline-augmented    Ethnic-based    Ethnic-based-augmented
               %           %                     %               %
AT             1.66        1.63                  3.05            1.45
AU             3.23        2.94                  5.2             2.52
BE             3.24        3.17                  4.56            2.41
BR             6.84        6.84                  8.7             5.99
CA             5.21        4.74                  8.55            4.05
CH             2.5         2.44                  4.42            2
CN             62.34       53.33                 52.8            49.22
DE             1.6         1.57                  3.36            1.39
DK             3.17        3.12                  6.09            2.92
ES             1.95        1.92                  3.58            1.66
FI             5.11        5.08                  7.55            2.12
FR             3.05        3.01                  3.05            1.65
GB             2.05        1.93                  3.55            1.68
GR             6.92        6.92                  19.47           6.75
HK             17.78       11.82                 9.23            6.85
HU             4.38        4.38                  9.9             4.08
IE             2.71        2.68                  5.07            2.09
IL             5.81        5.79                  11.79           5.28
IN             29.19       28.5                  47.36           28.27
IT             3.16        3.15                  5.76            2.95
JP             10.73       10.69                 13.98           10.19
KR             30.75       24.99                 12.3            7.5
MX             1.72        1.68                  2.97            1.49
MY             36.88       28.63                 19.56           14.79
NL             4.78        4.72                  9.4             4.56
NO             5.11        5.07                  8.01            4.25
NZ             3.07        2.92                  4.55            2.32
PL             2.98        2.92                  6.41            2.4
PT             1.88        1.88                  2.04            1.31
RU             6.45        6.44                  11.81           6.47
SE             2.99        2.91                  4.29            2.06
SG             33.78       25.07                 20.19           15.9
TR             7.01        7.01                  35.38           6.85
TW             31.67       13.3                  5.25            4.43
US             4.87        4.39                  8.16            3.67
ZA             5.41        5.41                  8.65            5.06

All countries  7.92        6.82                  9.44            5.45

Table A.1.3: Linguistic groups

1 2 3 4 5 6 7 8 9 10
Country code Linguistic group Country code Linguistic group Country code Linguistic group Country code Linguistic group Country code Linguistic group
AD Spanish CW CW IQ Arabic MX Spanish SR Dutch
AE Arabic CY Greek IR Persian MY MY SS SS
AF Persian CZ Slavic IS Occidental Scandinavian MZ MZ ST ST
AG English DE German IT Italian NA NA SU Russian
AI English DJ DJ JE English NC French SV Spanish
AL AL DK Oriental Scandinavian JM English NE NE SX English
AM AM DM English JO Arabic NF English SY Arabic
AN AN DO Spanish JP JP NG NG SZ SZ
AO AO DZ Arabic KE KE NI Spanish TC English
AR Spanish EC Spanish KG Turkic NL Dutch TD TD
AS AS EE Finnic KH KH NO Occidental Scandinavian TF TF
AT German EG Arabic KI KI NP NP TG TG
AU English EH Arabic KM KM NR NR TH TH
AW AW ER ER KN English NU NU TJ Persian
AZ Turkic ES Spanish KP Korean NZ English TL TL
BA Serbo-Croatian ET ET KR Korean OM OM TM Turkic
BB English FI Finnic KW Arabic PA Spanish TN Arabic
BD BD FJ English KY English PE Spanish TO TO
BE Dutch FK English KZ Turkic PF French TR Turkic
BF BF FM English LA LA PG PG TT English
BG Oriental Slavic FO Occidental Scandinavian LB Arabic PH PH TV TV
BH Arabic FR French LC English PK PK TW Chinese
BI BI GA GA LI German PL Slavic TZ TZ
BJ BJ GB English LK LK PM French UA Russian
BM English GD English LR LR PN English UG UG
BN Malay GE GE LS LS PR Spanish US English
BO Spanish GF French LT Baltic PS Arabic UY Spanish
BQ BQ GG English LU French PT Portuguese UZ Turkic
BR Portuguese GH GH LV Baltic PW PW VA Italian
BS English GI Spanish LY Arabic PY Spanish VC English
BT BT GL GL MA Arabic QA Arabic VE Spanish
BW BW GM GM MC French RE French VG English
BY Russian GN GN MD MD RO RO VI English
BZ English GP GP ME Serbo-Croatian RS Serbo-Croatian VN VN
CA English GQ GQ MF French RU Russian VU VU
CD CD GR Greek MG MG RW RW WF WF
CF CF GT Spanish MH MH SA Arabic WS WS
CG CG GU GU MK Oriental Slavic SB SB YE Arabic
CH German GW GW ML ML SC SC YU Serbo-Croatian
CI CI GY English MM MM SD SD ZA ZA
CK English HK Chinese MN MN SE Oriental Scandinavian ZM ZM
CL Spanish HN Spanish MO Chinese SG Chinese ZW ZW
CM CM HR Serbo-Croatian MP MP SH SH    
CN Chinese HT French MQ MQ SI Serbo-Croatian    
CO Spanish HU HU MR Arabic SJ Occidental Scandinavian    
CR Spanish ID Malay MS MS SK Slavic    
CS Slavic IE English MT MT SL SL    
CU Spanish IL IL MU MU SM Italian    
CV CV IM English MV MV SN SN    
CX CX IN IN MW MW SO SO    

Note: Country codes are presented in odd columns and their corresponding linguistic groups in even columns. If an even column contains a country code instead of a linguistic group, no action was taken and that country was not grouped with linguistically similar ones. To identify relevant linguistic groups, we used several sources, including the CIA Factbook 2016, Ethnologue (https://www.ethnologue.com/), Wikipedia, and CEPII (Centre d'Etudes Prospectives et d'Informations Internationales; see in particular Melitz and Toubal, 2014).