Gender Attribution Methodology
The gender attribution team used two sources of information for gender attribution:
- IBM-GNR (Global Name Recognition), a name search technology produced by IBM. IBM-GNR is a commercial product that performs various name disambiguation tasks, two of which are relevant to our methodology: (1) associating names and surnames with one or (more often) several countries of likely origin, and (2) associating names with male and female gender, expressed as probability estimates. These associations originate from a database produced by U.S. immigration authorities in the first half of the 1990s, when those authorities registered the names and surnames, nationality, and gender of all foreign citizens entering the United States. The database contains roughly 750,000 full names; in addition, variants of registered names and surnames are considered, according to country-sensitive spelling and abbreviation rules. More information can be found in Breschi et al. (2017a, 2017b).
- The WIPO worldwide gender-name dictionary (WGND), produced by the World Intellectual Property Organization (WIPO). It lists 6.2 million names from 182 countries. For each name, it records a gender for each country in which that name appears in the source data. The construction of the WGND drew on previous gender studies (see the literature review) as well as national public statistical institutions. In certain countries, some names are used for both women and men; these names are given an “unknown” status for the countries where this is the case. See Martínez et al. (2016) for details.
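To illustrate the “unknown” status, here is a minimal sketch of a country-specific name dictionary. It is not the WGND's actual construction: the record format, the function name `build_name_country_dictionary`, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def build_name_country_dictionary(records):
    """Map (name, country) to 'F', 'M', or 'unknown'.

    `records` is a hypothetical iterable of (name, country_code, gender)
    tuples; the real WGND sources and format may differ.
    """
    genders_seen = defaultdict(set)
    for name, country, gender in records:
        genders_seen[(name.lower(), country)].add(gender)
    # A name recorded with both genders in one country becomes "unknown"
    # for that country only.
    return {
        key: next(iter(genders)) if len(genders) == 1 else "unknown"
        for key, genders in genders_seen.items()
    }


# Hypothetical sample records, not taken from the WGND.
records = [
    ("Andrea", "IT", "M"),
    ("Andrea", "DE", "F"),
    ("Kim", "US", "F"),
    ("Kim", "US", "M"),
]
name_country_gender = build_name_country_dictionary(records)
```

In this sketch the same name can carry different genders in different countries, which is why a country of origin is needed before the dictionary can be consulted.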
Using these two sources of country-specific gender-attributed names, the gender attribution team attributes gender to U.S. Patent and Trademark Office (USPTO) inventor names through the following steps:
- For each inventor name, the IBM-GNR returns the fraction of instances it identifies as male in the data source and the fraction it identifies as female. In addition, it returns a “frequency” metric that indicates the frequency with which each name appears in the complete data set. A very uncommon name will be assigned a very low frequency, indicating that gender attribution will be unreliable for that name.
- For each inventor first name, female gender is attributed if the name is identified as female in 97% or more of cases, and male gender is attributed if it is identified as male in 98% or more of cases. These threshold values were set by manual inspection of the distribution of gender shares from the first step. However, any name with a frequency metric of 5% or less is excluded because its attribution is unreliable. This step resulted in the gender attribution of 71.49% (2,499,999) of inventors.
- When the inventor’s first name is majority one gender but does not reach the thresholds established in step 2, the second (or middle) name is taken into consideration. When the second name does reach the threshold value, the appropriate gender is attributed to the inventor. This step results in 35,581 additional names being assigned a gender, raising the attribution rate to 72.90%.
- After steps 1–3, 943,725 inventor names remain unattributed. For these, we rely on WIPO’s WGND (described among the data sources above). Because the WGND is country-specific, we must first use the IBM-GNR to attribute a country of origin to each remaining inventor.
- For each likely country of origin, the GNR attaches a measure of significance: the share of instances in which the name or surname is associated with that country. The algorithm assigns a country of origin using only the vector of countries associated with the surname. The first name is the name of interest for gender attribution and thus cannot also be part of the decision rule for country of origin.
- Of the set of associated countries for each surname, only those with at least 10% significance are considered. After dropping those under 10%, the list is sorted by significance in descending order. This sorting raises a further problem, best explained through the example of the “Smith” surname. It could be given a 30% significance for Germany, 20% for the United Kingdom, and 10% each for Ireland and Australia. In principle, the surname would be associated with Germany, but the English-speaking countries together add up to a higher significance (40%) than Germany. To address this problem, some countries are collapsed into linguistic groups to create a list of countries and languages associated with surnames. Each linguistic group is sorted within the larger list of countries as a single entry, and its member countries are then sorted within the group.
- After sorting, each inventor’s surname is associated with an ordered list of linguistic groups and individual countries. The “Smith” example from step 6 would first be associated with the United Kingdom, Ireland, the United States, Australia, Canada, and so on (all English-speaking countries), and then with Germany, Switzerland, and Austria.
- With linguistic groups and countries associated with each surname, the first name and at least one of the associated countries are matched to name-country pairs in the WGND data set. More than one linguistic group is kept per inventor because, for some name-country pairs, the first linguistic group does not exist in the WGND data set. In those cases, the most significant linguistic group included in the data set is used.
- For some inventors with rare surnames, we were not able to create a list of likely countries of origin. In these cases, country of residence is substituted for country of origin. This happened in 3.92% (37,003) of cases.
- After steps 4–9 using the WGND, an additional 498,620 inventor names were given gender attributions.
- Last, the cases of no name-country match in the WGND process must be addressed. These cases account for 18% (169,405) of inventor names. To address these, we use the WGND gender attribution despite no name-country match, and attribute gender only if two conditions are satisfied: (1) all instances in the WGND are that gender and (2) the majority of instances generated by GNR coincide with the gender attributed by the WGND.
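The surname-based ranking and WGND lookup in steps 5–8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the `LINGUISTIC_GROUP` mapping is a hypothetical, partial stand-in for Table A.1.3, the function names are invented, and the dictionary entries are made-up sample data. The significance values reproduce the “Smith” example, read as 10% each for Ireland and Australia.

```python
from collections import defaultdict

# Hypothetical, partial country-to-group mapping (stand-in for Table A.1.3).
LINGUISTIC_GROUP = {
    "GB": "English", "IE": "English", "AU": "English",
    "US": "English", "CA": "English",
    "DE": "German", "AT": "German", "CH": "German",
}
MIN_SIGNIFICANCE = 10  # percent; countries below this are dropped


def group_of(country):
    """Ungrouped countries form a group of their own."""
    return LINGUISTIC_GROUP.get(country, country)


def rank_countries(significance):
    """Order countries by group significance, then by their own significance."""
    kept = {c: s for c, s in significance.items() if s >= MIN_SIGNIFICANCE}
    group_total = defaultdict(int)
    for country, share in kept.items():
        group_total[group_of(country)] += share
    ranked = []
    for group in sorted(group_total, key=lambda g: (-group_total[g], g)):
        # Include every member of the group, even those GNR did not return.
        members = {c for c in kept if group_of(c) == group}
        members |= {c for c, g in LINGUISTIC_GROUP.items() if g == group}
        ranked += sorted(members, key=lambda c: (-kept.get(c, 0), c))
    return ranked


def lookup_gender(first_name, ranked_countries, name_country_gender):
    """Return the entry for the most significant country present in the dictionary."""
    for country in ranked_countries:
        gender = name_country_gender.get((first_name.lower(), country))
        if gender is not None:
            return gender, country
    return None, None  # no name-country match (handled separately in step 11)


# The "Smith" example: Germany leads alone, but the English group leads overall.
smith = {"DE": 30, "GB": 20, "IE": 10, "AU": 10, "FR": 5}
ranked = rank_countries(smith)

# Hypothetical dictionary entries, not taken from the WGND.
sample_dictionary = {("robin", "DE"): "M", ("robin", "SE"): "F"}
match = lookup_gender("Robin", ranked, sample_dictionary)
```

Here `ranked` lists the English-speaking countries before the German-speaking ones, and the lookup falls through to Germany because the sample dictionary has no entry for any English-speaking country. Ties between equally significant countries are broken alphabetically, which is a choice of this sketch and not something the text specifies.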
Following these 11 steps, known as the “baseline” method, we were able to attribute gender to 92.08% (3,206,605) of USPTO inventors. The United States has a much higher attribution rate than countries such as China, India, and the Republic of Korea. This problem is shared by prior studies with a similar aim that have attempted to attribute gender to Asian names. Therefore, some additional steps were implemented to create a “baseline-augmented” method. Thresholds for these steps were all set by manual inspection of the distribution of GNR shares for each group/country.
- For surnames primarily associated with China, Singapore, Taiwan, Macao, and Hong Kong, we attribute a gender if it is identified in 60% or more of GNR cases.
- For surnames primarily associated with the Republic of Korea, the threshold is set at 80%.
- For surnames primarily associated with India, the threshold is set at 90%.
These additional steps for Asian countries result in the attribution of a further 1.1% (38,188) of total names, bringing the attribution rate up to 93.18% (3,244,813) of total inventors.
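The threshold rule, in both its baseline form (steps 2–3) and its augmented form, can be sketched as below. This is an illustrative sketch under assumptions, not the team's code: the function names and the dictionary layout of the GNR output are hypothetical, shares are written as fractions, and the country codes are ISO codes chosen here. The text does not say whether the 5% frequency filter applies to the middle name or to the augmented steps, so the sketch applies it to the first name only.

```python
BASELINE_FEMALE_THRESHOLD = 0.97
BASELINE_MALE_THRESHOLD = 0.98
MIN_FREQUENCY = 0.05  # first names at or below this frequency are excluded

# Augmented thresholds by country primarily associated with the surname.
AUGMENTED_THRESHOLD = {
    "CN": 0.60, "SG": 0.60, "TW": 0.60, "MO": 0.60, "HK": 0.60,
    "KR": 0.80,
    "IN": 0.90,
}


def meets_threshold(shares, female_threshold, male_threshold):
    """Return 'F' or 'M' if a gender share reaches its threshold, else None."""
    if shares["female"] >= female_threshold:
        return "F"
    if shares["male"] >= male_threshold:
        return "M"
    return None


def baseline_attribute(first, middle=None):
    """Steps 2-3: first name, then middle name if the first is inconclusive.

    `first` and `middle` are hypothetical dicts with 'female' and 'male'
    shares; `first` also carries the GNR 'frequency' metric.
    """
    if first["frequency"] <= MIN_FREQUENCY:
        return None
    gender = meets_threshold(
        first, BASELINE_FEMALE_THRESHOLD, BASELINE_MALE_THRESHOLD)
    has_majority = max(first["female"], first["male"]) > 0.5
    if gender is None and middle is not None and has_majority:
        gender = meets_threshold(
            middle, BASELINE_FEMALE_THRESHOLD, BASELINE_MALE_THRESHOLD)
    return gender


def augmented_attribute(shares, surname_country):
    """Augmented step: one lower threshold for both genders, by country."""
    threshold = AUGMENTED_THRESHOLD.get(surname_country)
    if threshold is None:
        return None
    return meets_threshold(shares, threshold, threshold)
```

For example, a first name that is 97.5% male falls short of the 98% baseline threshold, while a name that is 65% female is attributed under the 60% threshold but not under the 80% or 90% thresholds.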
Breschi, S., Lissoni, F., Miguelez, E., 2017a. Foreign-origin inventors in the USA: testing for diaspora and brain gain effects. J Econ Geogr 17, 1009–1038. https://doi.org/10.1093/jeg/lbw044
Breschi, S., Lissoni, F., Tarasconi, G., 2017b. Inventor Data for Research on Migration & Innovation: The Ethnic-Inv Pilot Database, in: Fink, C., Miguelez, E. (Eds.), The International Mobility of Talent and Innovation: New Evidence and Policy Implications. Cambridge University Press.
Martínez, G.L., Raffo, J., Saito, K., 2016. Identifying the Gender of PCT inventors (No. 33), WIPO Economic Research Working Papers. World Intellectual Property Organization - Economics and Statistics Division.
Below are three tables: Table A.1.1 shows the non-attribution rates for the process detailed in steps 1–11, Table A.1.2 shows the rates after the inclusion of steps 12–14, and Table A.1.3 shows the linguistic groups used in the attribution process.
Table A.1.1: Gender non-attribution cases (% of total inventors worldwide and country breakdown)
Columns (four repeated pairs): % | # (thousands), each with an "of which:" breakdown.
Table A.1.2: Gender non-attribution cases (% of total inventors by country)
Table A.1.3: Linguistic groups
|Country code||Linguistic group||Country code||Linguistic group||Country code||Linguistic group||Country code||Linguistic group||Country code||Linguistic group|
|BG||Oriental Slavic||FO||Occidental Scandinavian||LB||Arabic||PH||PH||TV||TV|
Note: Country codes are presented in odd columns and their corresponding linguistic groups in even columns. Where an even column contains a country code instead of a linguistic group, no action was taken and that country was not grouped with linguistically similar ones. To identify relevant linguistic groups, we used several sources, including the CIA Factbook 2016, Ethnologue (https://www.ethnologue.com/), Wikipedia, and CEPII (Centre d'Etudes Prospectives et d'Informations Internationales; see in particular Melitz and Toubal, 2014).