Inventor Disambiguation Workshop Summary

More than 100 people attended the event in person and online on September 24, 2015. Six research teams presented exciting new computational approaches for identifying unique inventor entities across 40 years of U.S. Patent and Trademark Office (USPTO) patent data.  

The workshop was part of an effort to find creative new approaches to disambiguating inventor names and thereby obtain better information on innovators and the latest technologies they develop. In the run-up to the workshop, research teams from the United States, Europe, Australia, and China submitted their inventor disambiguation algorithms.

Nicholas Monath and Andrew McCallum from the University of Massachusetts (UMass) Amherst authored the winning algorithm, which will now be integrated into the PatentsView data platform. The UMass team received a $25,000 stipend to support continued work on the algorithm and to compensate team members for their technical guidance during the integration effort.

USPTO Deputy Director Russell Slifer opened the workshop.

USPTO Chief Economist Alan Marco provided an overview of the PatentsView initiative and set the goals for the workshop.

Joseph Bailey from USPTO presented the evaluation approach and outcomes of the workshop.

Participant presentations:

The judges' presentations:


Disambiguation Workshop Participant Information

Algorithm Submission Checklist

  • Tab-delimited file showing the result of disambiguating the rawinventor table. Note that only the rawinventor table provided on the workshop website should be disambiguated (you do not need to disambiguate the applications database).
    The first column of the tab-delimited file should be an inventor ID which is constructed by taking the hyphenated combination of the patent number and sequence fields for each inventor in the rawinventor table. The second column should be an integer ID generated by your program. Inventor IDs that are predicted to refer to the same unique individual should be assigned the same integer ID. For example, in the following excerpt the second and fourth inventors in the list are believed to be the same individual:
    1234567-1	1
    1234567-2	2
    1234567-3	3
    2345678-1	2
    2345678-2	4
    Note that some other sources of patent data may contain an “inventor ID” column which may not agree with the identifier that we are asking you to use. If you use a different inventor ID than the one described here, then we will not be able to evaluate your results properly. (A short format-validation sketch follows this checklist.)
  • Plain text or Word file describing the computing setup used and the run time of the algorithm. The description of the computing setup should include processor speed, number of cores/processors, amount of RAM, and amount of hard disk space. If applicable, describe the GPU or distributed computing setup used.
  • Source code for your disambiguation algorithm, including any preprocessing steps. If the code is available online, then you may submit a link. Otherwise, provide a compressed file.
  • Draft program documentation (note that minimal user documentation will be required for participants asked to continue to the second round of evaluation)
  • Draft write-up of methods
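
The required output format is easy to get subtly wrong (stray header rows, non-integer cluster IDs, malformed patent-sequence keys). Below is a minimal self-check you could run before submitting; the file name is hypothetical, and the inventor-ID pattern is a permissive guess based on the description above, not an official specification.

```python
import csv
import re
import sys

# Hypothetical file name; point this at your own submission file.
SUBMISSION_PATH = "disambiguation_output.tsv"

# Inventor IDs are the hyphenated patent number and sequence, e.g. "1234567-1".
# Patent numbers can contain letters (design/plant patents), so be permissive.
INVENTOR_ID_RE = re.compile(r"^[A-Za-z0-9]+-\d+$")

def validate(path):
    errors = 0
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                print(f"line {line_no}: expected 2 tab-delimited columns, got {len(row)}")
                errors += 1
                continue
            inventor_id, cluster_id = row
            if not INVENTOR_ID_RE.match(inventor_id):
                print(f"line {line_no}: malformed inventor ID {inventor_id!r}")
                errors += 1
            if not cluster_id.isdigit():
                print(f"line {line_no}: cluster ID {cluster_id!r} is not an integer")
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if validate(SUBMISSION_PATH) else 0)
```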

Next steps

  • We will evaluate your output file against withheld labeled data by computing the precision and recall rates as described in the evaluation documentation.
  • The judges will review the precision and recall rates as well as your description of run time and computing setup. The judges will choose up to three participating groups to advance to the next stage of evaluation by September 7.
  • The teams that advance to the next evaluation stage must be prepared to work in the test environment as described in the evaluation documentation.

Program Requirements

Participants will write a program that reads in files containing processed patent data from 1976 to 2014 and produces an output file giving predictions for which inventors in the data correspond to the same underlying individual.

Input files

The input files consist of parsed text and XML data from USPTO on published patent grants (1976-2014) and applications (2001-2014). Participants are free to use any portion of this data for their algorithms as they see fit. The data tables are provided both as individual CSV downloads and as a MySQL export file which can be used to populate a new MySQL database.
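
As a starting point, the CSV downloads can be inspected directly with standard tooling. The sketch below assumes a file named rawinventor.csv; actual file names and column headers should be confirmed against the workshop download, since the materials above only guarantee that patent number and sequence fields exist.

```python
import pandas as pd

# Load the rawinventor table from the CSV download. The file name is taken
# from the table name used in these instructions; confirm it against the
# actual download before relying on it.
rawinventor = pd.read_csv("rawinventor.csv", dtype=str)

# Inspect the schema rather than assuming it.
print(rawinventor.shape)
print(rawinventor.columns.tolist())
print(rawinventor.head())

# Alternatively, the MySQL export can be loaded into a fresh database with
# the standard client, e.g.:  mysql -u <user> -p patentsview < export.sql
# (the database and export file names here are placeholders).
```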

Output file format

The output file should be a tab-delimited file with two columns and no header. The first column should be an inventor ID that is constructed by taking the hyphenated combination of the patent number and sequence fields for each inventor in the rawinventor table. The second column should be an integer ID generated by your program. Inventor IDs that are predicted to refer to the same unique individual should be assigned the same integer ID. For example, in the following excerpt the second inventor on the first patent and the first inventor on the second patent are believed to be the same individual:

1234567-1	1
1234567-2	2
1234567-3	3
2345678-1	2
2345678-2	4
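
For concreteness, here is a minimal sketch of producing a file in this format. The disambiguation step is a trivial placeholder, and the column names patent_id and sequence are assumptions standing in for the patent number and sequence fields named above.

```python
import pandas as pd

# Read the rawinventor table; "patent_id" and "sequence" are assumed column
# names for the patent number and sequence fields.
rawinventor = pd.read_csv("rawinventor.csv", dtype=str)

def disambiguate(frame):
    """Placeholder: give every inventor mention its own cluster ID.
    A real algorithm would assign the same integer to mentions that it
    predicts refer to the same individual."""
    return list(range(len(frame)))

# Inventor ID = hyphenated combination of patent number and sequence.
rawinventor["inventor_id"] = rawinventor["patent_id"] + "-" + rawinventor["sequence"]
rawinventor["cluster_id"] = disambiguate(rawinventor)

# Two tab-delimited columns, no header, as specified above.
rawinventor[["inventor_id", "cluster_id"]].to_csv(
    "disambiguation_output.tsv", sep="\t", header=False, index=False
)
```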

Runtime and computing resources

  • The algorithm should not run for more than 5 days when processing all patent application and grant data (2001-2014 for applications; 1976-2014 for grants).
  • The implementation should be runnable on hardware equivalent to a single Amazon Web Services (AWS) instance. For reference, currently the largest compute-optimized AWS instance provides 36 virtual CPUs and 60 GB memory.
  • American Institutes for Research (AIR) and the panel will review any requests for software or hardware updates that might be required to accommodate the incorporation of a novel algorithm into the current PatentsView workflow. These requests must be submitted in your letter of intent to participate.

External data

  • Proprietary data sets cannot be included in any algorithm. 
  • AIR and the panel will review any requests to incorporate additional nonproprietary data into a submitted algorithm. Please specify any additional data you intend to use in your letter of intent to participate. 

Evaluation

Algorithms were evaluated for accuracy, run time, and usability. The evaluation took place in two phases. We briefly describe the evaluation criteria here; refer to the evaluation criteria documentation for further details and definitions. 

First Phase

This is an initial round of self-testing in which participants infer links for the bulk patent database. They may train their algorithms using any part of the provided data, as well as any additional data sets that have been submitted to the workshop organizers. During this round, participants will be evaluated on the following criteria (a sketch of the pairwise precision and recall computation follows the list):

  • Recall Rate defined as $$\text{Recall} = \frac{\text{# of true positives}}{\text{# of true positives} + \text{# of false negatives}}$$
  • Precision Rate defined as $$\text{Precision} = \frac{\text{# of true positives}}{\text{# of true positives} + \text{# of false positives}}$$
  • Self-reported algorithm run-time
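
The exact true/false positive counts are defined in the evaluation documentation; purely as an illustration, the sketch below computes precision and recall under the common pairwise reading, where a true positive is a pair of mentions placed in the same cluster by both the prediction and the labeled data.

```python
from itertools import combinations

def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall over inventor mentions.

    `predicted` and `truth` map mention ID -> cluster label; only mentions
    labeled in both are compared. O(n^2) in the number of labeled mentions,
    which is manageable for evaluation-sized data sets.
    """
    mentions = sorted(set(predicted) & set(truth))
    tp = fp = fn = 0
    for a, b in combinations(mentions, 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when no pairs
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Worked example mirroring the excerpt above: the prediction links mentions
# 1234567-2 and 2345678-1; the (hypothetical) labels also link 2345678-2 to
# that person, so precision = 1.0 and recall = 1/3.
pred = {"1234567-1": 1, "1234567-2": 2, "1234567-3": 3, "2345678-1": 2, "2345678-2": 4}
true = {"1234567-1": 1, "1234567-2": 2, "1234567-3": 3, "2345678-1": 2, "2345678-2": 2}
print(pairwise_precision_recall(pred, true))  # (1.0, 0.333...)
```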

Second Phase

Up to three workshop participants will be invited to a second phase of evaluation, where they will be asked to run their disambiguation algorithm in a server environment that we provide. During this phase, participants will be evaluated on:

  • Algorithm generalizability, which we will assess by computing the recall rate and precision rate for different sets of training and evaluation data.
  • Self-reported algorithm run-time
  • Usability of the implementation. We will ask participants in this phase of evaluation to provide user documentation for their algorithm implementation.


Training Datasets

Four patent data sets were provided to workshop participants for the training of their algorithms. Each is a human-labeled research data set with validated inventor identities. The data sets were previously developed, curated, and validated for research purposes. These four research data sets were generously provided by Erica Fuchs and colleagues, Ivan Png and colleagues, Pierre Azoulay and colleagues, and Manuel Trajtenberg and colleagues. AIR is providing the four research data sets in multiple formats, as described below: 

The original optoelectronic human-labeled data set and full documentation can all be accessed at http://www.cmu.edu/epp/disambiguation.

| File name | Source | Data description |
| --- | --- | --- |
| als_training_data.csv | Azoulay et al., 2010 | A training data set with 15,000 records. The data is a bootstrap sample of record comparisons based on a labeled data set of approximately 5,000 researchers in the academic life sciences and their US patents. |
| is_inventors.csv | Trajtenberg et al., 2008 | Original data set of all Israeli inventors and their US patents. |
| ens_inventors.csv, ens_patents.csv | Chunmian et al., forthcoming | Original data set of engineers and scientists and their patents. |
| benchmark_epfl.rar | Lissoni et al., 2010 | An archive containing a database of labeled inventors affiliated with the Ecole Polytechnique Federale de Lausanne. |
| benchmark_france.rar | Lissoni et al., 2010 | An archive containing a database of labeled inventors from French universities. |
| td_patent.csv | Chunmian et al., forthcoming; Trajtenberg et al., 2008 | Patent fields from the processed bulk patent data for patents matched to Trajtenberg or Png (see methods). |
| td_inventor.csv | Chunmian et al., forthcoming; Trajtenberg et al., 2008 | Inventor fields from the processed bulk patent data for all inventors on patents matched to Trajtenberg or Png (see methods). |
| td_assignee.csv | Chunmian et al., forthcoming; Trajtenberg et al., 2008 | Assignee fields from the processed bulk patent data for all assignees on patents matched to Trajtenberg or Png (see methods). |
| td_class.csv | Chunmian et al., forthcoming; Trajtenberg et al., 2008 | USPC classes from the processed bulk patent data for patents matched to Trajtenberg or Png (see methods). |
| epo_patent.csv | EPO Worldwide Patent Statistical Database (PATSTAT) | Additional patent fields from PATSTAT for patents that appear in benchmark_epfl.rar or benchmark_france.rar. |
| epo_inventor.csv | EPO Worldwide Patent Statistical Database (PATSTAT) | Additional inventor fields from PATSTAT for inventors on patents that appear in benchmark_epfl.rar or benchmark_france.rar. |
| epo_cpc.csv | EPO Worldwide Patent Statistical Database (PATSTAT) | CPC classifications from PATSTAT for patents that appear in benchmark_epfl.rar or benchmark_france.rar. |

Intellectual Property

PatentsView data and the underlying codebase are open to the public: the data are available on the main website, and the codebase is available in our GitHub repository (https://github.com/PatentsView) under the BSD 2-Clause open-source license. All submitted algorithms and related information will be subject to the same open-source license and will be made available to the public in the PatentsView GitHub repository. 

Teams

There is no limit to the size or composition of participating research teams; however, government employees are not eligible to participate. Travel expenses will be provided for only one representative from each of the top teams (up to 15 teams) to present their team's technology to the workshop attendees. 

Governance

AIR hosted the workshop with the support of a judges' panel of three subject matter experts. 

AIR and the judges’ panel reviewed all “intent to participate” submissions and oversaw the evaluation of submitted algorithms in advance of the technical workshop. 

Location

The final workshop was held at the USPTO headquarters in Alexandria, Virginia.