PatentsView Project Disambiguation Algorithms
PatentsView uses a series of algorithms and post-processing techniques to track the patenting activity of inventors and assignees over time. The disambiguation process is a value-added service applied to the U.S. Patent and Trademark Office’s (USPTO’s) raw, publicly available data on patents granted from 1976 to the present; the data are updated quarterly. The process assigns unique identifiers to patent inventors and assignees. The disambiguation methodology is continually evaluated and updated to incorporate recent developments in computer and information sciences as well as issues reported by patent data users.
USPTO does not track inventors and assignees over time. Each patent application is treated as a single event, and applicants are not required to use a consistent format for their own names, their co-inventors’ names, the assignee(s), or even the locations associated with them or their assignees. This results in a plethora of data inconsistencies, such as typos, short names or nicknames, and acronyms.
A business such as IBM, for example, could appear as International Business Machines, IBM, or sometimes even the misspelling IMB.
An individual such as John Smith could appear as J. Smith, Johnny Smith, or John Smit.
Therefore, to associate the same inventor or assignee with more than one patent, it is necessary to cluster like entities together. The current disambiguation algorithm was created and integrated in 2015 by Nicholas Monath and Andrew McCallum of the University of Massachusetts Amherst. It grew out of PatentsView’s Inventor Disambiguation Technical Workshop, a USPTO-hosted workshop focused on developing new approaches to inventor disambiguation.
Four-Step Disambiguation Process Overview
The PatentsView disambiguation process consists of four steps:
Constructing Canopies [1]
The first step in the process is to construct “canopies” that contain mentions that are likely to refer to the same entity. The purpose of this step is to lower the total number of comparisons required in later steps, which lowers the amount of time and/or computing power needed.
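The canopy idea can be sketched in a few lines of Python. This is a minimal illustration, not the PatentsView implementation: the blocking rule used here (first initial plus the first four letters of the last name) and the function names `canopy_key` and `build_canopies` are assumptions chosen only to show how grouping mentions reduces the number of pairwise comparisons.

```python
from collections import defaultdict


def canopy_key(name: str) -> str:
    """Hypothetical blocking key: first initial + first four letters of the last name."""
    parts = name.replace(".", "").lower().split()
    first, last = parts[0], parts[-1]
    return f"{first[0]}_{last[:4]}"


def build_canopies(mentions):
    """Group mentions that share a blocking key into the same canopy."""
    canopies = defaultdict(list)
    for mention in mentions:
        canopies[canopy_key(mention)].append(mention)
    return dict(canopies)


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


mentions = ["John Smith", "J. Smith", "Johnny Smith", "John Smit", "Maria Garcia"]
canopies = build_canopies(mentions)

# Comparisons needed without canopies versus within canopies only.
all_pairs = pair_count(len(mentions))
canopy_pairs = sum(pair_count(len(group)) for group in canopies.values())

print(canopies)
print(f"pairs without canopies: {all_pairs}, pairs within canopies: {canopy_pairs}")
```

Only mentions inside the same canopy are compared in later steps, so the savings grow quickly as the number of mentions increases.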
Applying Similarity Metrics
The next step is to calculate similarity scores between mentions in each canopy. The inventor and assignee processes use different methods to model similarity. Scores produced in this step are used in the clustering algorithm in step three.
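As a rough illustration of scoring two mentions within a canopy, the sketch below combines a string similarity on names with the overlap of co-inventor sets. This is an assumption made for illustration only; the actual inventor and assignee similarity models are described in the methods documentation. The field names (`name`, `coinventors`), the weight, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Share of items two sets have in common; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mention_similarity(m1: dict, m2: dict, name_weight: float = 0.7) -> float:
    """Weighted mix of name similarity and co-inventor overlap (illustrative weights)."""
    name_score = name_similarity(m1["name"], m2["name"])
    context_score = jaccard(set(m1["coinventors"]), set(m2["coinventors"]))
    return name_weight * name_score + (1 - name_weight) * context_score


# Hypothetical mentions; the co-inventor names are made up for the example.
a = {"name": "John Smith", "coinventors": ["A. Lee"]}
b = {"name": "John Smit", "coinventors": ["A. Lee"]}
c = {"name": "Maria Garcia", "coinventors": ["B. Chen"]}

score_ab = mention_similarity(a, b)
score_ac = mention_similarity(a, c)
print(f"a vs b: {score_ab:.3f}, a vs c: {score_ac:.3f}")
```

Mentions that share a near-identical name and a co-inventor score much higher than unrelated mentions, which is the signal the clustering step relies on.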
Clustering by Similarity Scores
After computing similarity scores, PatentsView uses an agglomerative hierarchical clustering model to group mentions into clusters, each representing a single inventor or assignee with a unique identifier that can be tracked over time. The similarity threshold was chosen by testing different values and comparing the quality of the resulting clusters.
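The sketch below shows one simple form of agglomerative clustering with a similarity threshold. It is an assumption-laden illustration, not the production algorithm: it uses average linkage, a plain string similarity from Python's standard library, and an arbitrary threshold of 0.75, none of which are taken from the PatentsView documentation.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between two mentions, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def average_linkage(c1, c2) -> float:
    """Mean similarity over all cross-cluster pairs of mentions."""
    scores = [similarity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(mentions, threshold: float):
    """Repeatedly merge the two most similar clusters until none reach the threshold."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_linkage(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


mentions = ["John Smith", "John Smit", "Johnny Smith", "Maria Garcia"]
clusters = agglomerate(mentions, threshold=0.75)

# Assign each cluster an illustrative identifier.
entity_ids = {m: f"entity-{k}" for k, cluster in enumerate(clusters) for m in cluster}
print(clusters)
print(entity_ids)
```

Raising the threshold produces more, smaller clusters; lowering it merges more aggressively, which is why the threshold has to be tuned against evaluation results.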
Clustering Algorithms Evaluation
After clustering, the results are evaluated so that they can be compared against those of other clustering algorithms. These can be either past algorithms that the current one improves on or future algorithms intended to improve on the current one. Three metrics are used for evaluation: precision, recall, and F1 score. These metrics are explained in detail on the evaluation page.
Full documentation of the disambiguation process methods is available on the Disambiguation Methods page.
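One common way to compute these metrics for a clustering is pairwise: every pair of mentions placed in the same cluster counts as a predicted link, and that set is compared with the links implied by a hand-labeled reference. The sketch below assumes this pairwise variant; the evaluation page may define the metrics differently. The mention identifiers `p1` through `p4` and the function names are hypothetical.

```python
from itertools import combinations


def coclustered_pairs(clusters):
    """Set of unordered mention pairs that share a cluster."""
    return {frozenset(pair) for cluster in clusters for pair in combinations(sorted(cluster), 2)}


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of a predicted clustering against a reference."""
    pred_pairs = coclustered_pairs(predicted)
    true_pairs = coclustered_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Reference: p1, p2, p3 are the same inventor; p4 is someone else.
truth = [["p1", "p2", "p3"], ["p4"]]
# Prediction: splits the true cluster and wrongly links p3 with p4.
predicted = [["p1", "p2"], ["p3", "p4"]]

precision, recall, f1 = pairwise_scores(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision falls when distinct entities are merged, recall falls when one entity is split across clusters, and F1 (the harmonic mean of the two) summarizes both in a single number.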
PatentsView Project Source Code
The source code for the PatentsView project can be found in the GitHub repository.
[1] The disambiguation process operates on a set of “mentions.” Mentions are defined as the separate occurrences of a name, assignee, or other text field of interest that are observed in the raw patent data. In the John Smith example above, there are three additional “mentions,” each a different variation of John Smith.