PatentsView Project Disambiguation Algorithms
PatentsView uses a complex series of algorithms and post-processing techniques to track the patenting activity of inventors and assignees over time. The disambiguation process[1] is a value-added service applied to the U.S. Patent and Trademark Office’s (USPTO’s) raw, publicly available data on granted patents from 1976 to the present. The data are updated quarterly. The process provides unique identifiers for patent inventors, assignees, and locations. The disambiguation methodology is constantly evaluated and updated to incorporate recent developments emerging from computer and information sciences as well as reported issues from patent data users.
USPTO does not track inventors and assignees over time. Each patent application is a singular event, and applicants are not required to use a consistent format for their own names, their co-inventors’ names, the assignee(s), or even the locations associated with them or their assignees. This results in many data inconsistencies: typos, short names or nicknames, and acronyms, to name a few.
A business such as IBM, for example, could appear as International Business Machines, IBM, or sometimes even IMB.
An individual such as John Smith could appear as J. Smith, Johnny Smith, or John Smit.
Therefore, to associate the same inventor or assignee with more than one patent, it is necessary to cluster like entities together. The current disambiguation algorithm was created and integrated in 2015 by a University of Massachusetts Amherst team, Nicholas Monath and Andrew McCallum. It grew out of PatentsView’s Inventor Disambiguation Technical Workshop, a workshop hosted by USPTO that focused on developing new approaches to inventor disambiguation.
Four-Step Disambiguation Process Overview
The PatentsView disambiguation process is split into four steps:
The first step in the process is to construct “canopies” that contain mentions that are likely to refer to the same entity. The purpose of this step is to lower the total number of comparisons required in later steps, which lowers the amount of time and/or computing power needed.
The next step is to calculate similarity scores for the mentions in each canopy. Three different similarity models are used, one each for assignees, inventors, and locations. Scores produced in this step are used in the clustering algorithm in the next step.
After computing similarity scores, a clustering algorithm is used to create clusters from mentions with scores above a predefined threshold. This threshold was determined by testing different values and comparing the quality of the resulting clusters. The clusters produced in this step represent the final sets of mentions that the algorithm believes refer to the same entity.
After clustering, the results are evaluated so that they can be compared with those of other clustering algorithms. These can be either past algorithms that are being improved on or future algorithms that improve on the current one. Three metrics are used for evaluation: precision, recall, and F1. These metrics are explained in detail on the evaluation page.
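The four steps above can be illustrated with a small Python sketch. It is not the PatentsView implementation: the canopy key (first initial plus the first three letters of the last name), the name-similarity rule, the single-link clustering, the 0.8 threshold, and the pairwise form of precision, recall, and F1 are all assumptions chosen to keep the example short. Mentions are also treated as unique strings here, whereas real mentions are separate occurrences tied to individual patents.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(mention):
    """Lowercase a raw mention, drop periods, and return (first, last) tokens."""
    tokens = mention.lower().replace(".", "").split()
    return tokens[0], tokens[-1]


def build_canopies(mentions):
    """Step 1: group mentions by a cheap key (assumed rule, see lead-in)."""
    canopies = defaultdict(list)
    for m in mentions:
        first, last = normalize(m)
        canopies[first[0] + "_" + last[:3]].append(m)
    return canopies


def similarity(a, b):
    """Step 2: toy inventor score in [0, 1], the mean of two name scores."""
    (first_a, last_a), (first_b, last_b) = normalize(a), normalize(b)
    if first_a.startswith(first_b) or first_b.startswith(first_a):
        first_score = 1.0  # initial or short form, e.g. "j" vs "john"
    else:
        first_score = SequenceMatcher(None, first_a, first_b).ratio()
    last_score = SequenceMatcher(None, last_a, last_b).ratio()
    return (first_score + last_score) / 2


def cluster(mentions, threshold=0.8):
    """Step 3: single-link clustering; only pairs within a canopy are compared."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            m = parent[m]
        return m

    for canopy in build_canopies(mentions).values():
        for a, b in combinations(canopy, 2):
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for m in mentions:
        clusters[find(m)].add(m)
    return list(clusters.values())


def pairwise_scores(predicted, gold):
    """Step 4: pairwise precision, recall, and F1 against hand-labeled clusters."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    pred, true = pairs(predicted), pairs(gold)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(true) if true else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


mentions = ["John Smith", "J. Smith", "Johnny Smith", "John Smit",
            "Jack Smirnov", "Jane Doe"]
gold = [{"John Smith", "J. Smith", "Johnny Smith", "John Smit"},
        {"Jack Smirnov"}, {"Jane Doe"}]
print(pairwise_scores(cluster(mentions), gold))  # prints (1.0, 1.0, 1.0)
```

In this toy data, "Jack Smirnov" falls into the same canopy as the John Smith variants but scores below the threshold against each of them, so it stays in its own cluster. "Jane Doe" lands in a separate canopy and is never compared with the others, which is the saving in comparisons that canopies are meant to provide.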
The disambiguation process methods documentation is here: Disambiguation Methods.
PatentsView Project Source Code
The source code for the PatentsView project can be found in the GitHub repository.
[1] The disambiguation process operates on a set of “mentions.” Mentions are the separate occurrences of a name, location, or other text field of interest observed in the raw patent data. In the John Smith example above, there are three additional “mentions,” each a different variation of John Smith.