Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org
The U.S. Patent and Trademark Office receives thousands of patent applications every year. Often, the same inventor will apply for multiple patents. Other times, multiple inventors with similar names will each apply for a patent.
The issue researchers and innovation enthusiasts have run into is that, when analyzing patent data, there is no standard way to tell whether an inventor named on multiple patents is the same person or different people with a similar name.
PatentsView uses algorithms to make that determination, a process known as entity resolution or disambiguation. The process is not perfect, and the PatentsView team is constantly working to make the algorithm more accurate.
The first step in any improvement process is to evaluate how well the current system works. Olivier Binette, a PhD candidate in Statistical Science at Duke University, explored this question in his publication "Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org."
Challenges for the PatentsView algorithm
Binette notes in his paper that the PatentsView entity resolution algorithm faces three main challenges in accurately determining whether the names on multiple patent applications belong to one or more than one inventor.
First, when researchers apply the PatentsView algorithm to benchmark datasets (smaller subsets of larger datasets that are used to train and test algorithms), the results tend to be more accurate than when the algorithm is applied to the larger, real-world data. This is likely because many of the false links between inventors with similar names do not appear in the benchmark dataset.
Second, the number of patent pairs that share a common inventor is very small compared to the total number of possible pairs. This imbalance creates a challenge for training the PatentsView algorithm to classify pairs of records as either sharing an inventor or not sharing an inventor.
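To get a feel for how lopsided this comparison is, the short Python sketch below counts matching pairs against all possible pairs. The numbers are hypothetical (one million inventor mentions, four patents per inventor) and are not taken from PatentsView data; the function name is also just illustrative.

```python
from math import comb

def matching_pair_fraction(cluster_sizes):
    """Fraction of all record pairs that share an inventor.

    `cluster_sizes` lists how many records belong to each inventor.
    """
    n_records = sum(cluster_sizes)
    matching = sum(comb(size, 2) for size in cluster_sizes)  # pairs within an inventor
    total = comb(n_records, 2)                               # every possible pair
    return matching / total

# Hypothetical example: 250,000 inventors with 4 records each (1,000,000 records).
sizes = [4] * 250_000
fraction = matching_pair_fraction(sizes)
print(f"{fraction:.2e}")  # roughly 3 matching pairs per million pairs
```

Under these assumed numbers, a classifier that labeled every pair as "different inventors" would be right almost all of the time, which is why raw accuracy on pairs says little about how well the algorithm works.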
Finally, researchers have used many different methods to sample benchmark datasets and to adjust their estimates according to those samples. This creates an additional challenge in training the PatentsView algorithm.
Binette argues that his method for estimating the performance of the PatentsView algorithm addresses all three challenges.
His method uses three different representations of precision and recall. Precision is the fraction of record pairs the algorithm links to the same inventor that truly belong to the same inventor. Recall is the fraction of record pairs truly belonging to the same inventor that the algorithm links together. So, an algorithm with high precision would rarely merge different inventors who happen to have similar names. An algorithm with high recall would rarely split one inventor's patents across several identities.
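As an illustration, pairwise precision and recall can be computed by comparing the pairs an algorithm links against the pairs that truly match. The sketch below is a minimal Python example on made-up data; the function names and the toy records are assumptions for illustration and are not Binette's code or one of his three specific representations.

```python
from itertools import combinations

def matching_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster label.

    `clusters` maps a record ID (an inventor mention) to a cluster label.
    """
    by_label = {}
    for record, label in clusters.items():
        by_label.setdefault(label, []).append(record)
    pairs = set()
    for members in by_label.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of `predicted` against `truth`."""
    pred_pairs = matching_pairs(predicted)
    true_pairs = matching_pairs(truth)
    correct = pred_pairs & true_pairs
    # Convention assumed here: with no pairs to judge, the score is 1.0.
    precision = len(correct) / len(pred_pairs) if pred_pairs else 1.0
    recall = len(correct) / len(true_pairs) if true_pairs else 1.0
    return precision, recall

# Hypothetical toy data: five inventor mentions, two true inventors (A and B).
truth = {"m1": "A", "m2": "A", "m3": "A", "m4": "B", "m5": "B"}
# The algorithm wrongly moves m3 from inventor A's group into inventor B's group.
predicted = {"m1": "x", "m2": "x", "m3": "y", "m4": "y", "m5": "y"}

precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)  # 0.5 0.5
```

Here the algorithm links four pairs and only two of them are correct, so precision is 0.5. There are four true pairs and the algorithm finds two of them, so recall is also 0.5.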
He tested each representation using PatentsView’s current disambiguated inventor data. For the test, he treated that data as the ground truth, then randomly added in errors before calculating precision and recall.
He repeated the process 100 times. Then, he performed additional tests on two existing benchmark datasets and a hand-disambiguated dataset.
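The general shape of this kind of test can be sketched in Python: treat one clustering as the truth, inject random errors, measure precision and recall, and repeat 100 times. Everything specific below is an assumption for illustration: the synthetic data, the 5% error rate, and the error model (moving a record to a randomly chosen inventor) are not taken from Binette's paper or code.

```python
import random
from itertools import combinations
from statistics import mean

def matching_pairs(clusters):
    """Set of unordered record pairs that share a cluster label."""
    by_label = {}
    for record, label in clusters.items():
        by_label.setdefault(label, []).append(record)
    pairs = set()
    for members in by_label.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def precision_recall(predicted, truth):
    """Pairwise precision and recall of `predicted` against `truth`."""
    pred_pairs, true_pairs = matching_pairs(predicted), matching_pairs(truth)
    correct = pred_pairs & true_pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 1.0
    recall = len(correct) / len(true_pairs) if true_pairs else 1.0
    return precision, recall

def add_random_errors(truth, error_rate, rng):
    """Hypothetical error model: each record is, with probability
    `error_rate`, reassigned to an inventor chosen uniformly at random."""
    labels = sorted(set(truth.values()))
    return {
        record: rng.choice(labels) if rng.random() < error_rate else label
        for record, label in truth.items()
    }

rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic "ground truth": 200 mentions spread evenly over 50 inventors.
truth = {f"mention{i}": f"inventor{i % 50}" for i in range(200)}

# Repeat the corrupt-and-measure step 100 times.
results = [
    precision_recall(add_random_errors(truth, 0.05, rng), truth)
    for _ in range(100)
]
avg_precision = mean(p for p, _ in results)
avg_recall = mean(r for _, r in results)
print(f"precision ~ {avg_precision:.3f}, recall ~ {avg_recall:.3f}")
```

Because the errors are injected on purpose, the true precision and recall of each corrupted clustering are known exactly, which makes it possible to check whether an estimator recovers them.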
Using this method, Binette found that PatentsView’s inventor disambiguation algorithm had a precision between 79% and 91% and a recall between 91% and 95%, which is much lower than the 100% found by previous testing on benchmark datasets. This shows that PatentsView’s current entity resolution algorithm overestimates the number of matching pairs, linking some records that belong to different inventors.
Binette’s evaluation method gives PatentsView a way to reliably analyze the effectiveness of changes made to the entity resolution algorithm in the future. Dive deeper into Binette’s method and review his code on his PatentsView Evaluation page on GitHub.