Disambiguation Step 4: Clustering Algorithms Evaluation
An important part of developing and understanding these algorithms is the evaluation of their performance through analysis of the results. However, this is a challenge because there is nothing to compare the results against to determine if disambiguation outputs the desired results. To tackle this challenge, the team used nine distinct manually labeled datasets created from samples of the larger dataset. Two methods of evaluation were used in conjunction with these datasets. The first uses the standard metrics from computer science algorithm evaluation; precision, recall, and F1. These will be discussed in more detail later. The second method uses a metric constructed by the PatentsView team that focuses on the number of differently spelled names that were correctly assigned.
Evaluating the Quality of Disambiguation Algorithms
The metrics for evaluating the quality of disambiguation algorithms compare the predicted clusters produced by the algorithm against the clusters from the evaluation datasets. Figure 6, below, shows an example of partially incorrect clustering.
For each predicted cluster, 3 cases must be quantified:
 the number of mentions correctly in the cluster (true positives),
 the number of mentions incorrectly included in the cluster (false positives), and
 the number of mentions incorrectly left out of the cluster (false negatives).
In figure 6, cluster A has two true positives (General Electric and General Electric), one false positive (Motorola Electronics), and no false negatives. Cluster B contains one true positive (Motorola), no false positives, and one false negative (Motorola Electronics).
The above cases must be quantified in relation to the results of the clustering algorithm being evaluated. Once the numbers of true positives, false positives, and false negatives are determined, they can be used to calculate evaluation metrics for determining the reliability of the algorithm. These metrics are precision, recall, and F1. The calculation of these metrics is as follows:
 Precision – Measures what fraction of mentions predicted to be in a cluster are true positives. The formula for this metric is:
 Recall – Measures what fraction of mentions that should have been in the cluster were predicted correctly. The formula for this metric is:
 F1 Score – F1 is a metric that balances the tradeoff between precision and recall to create a more allinclusive metric. The formula for this metric is:
Algorithm Evaluation Methodology
The team took two key approaches for evaluating the algorithm using precision, recall, and F1 metrics defined above.
The first is a comparison, with the caveat that test cases were intentionally selected to be hard to disambiguate. Some results of this evaluation are shown in table 5, below:
Table 5. Evaluation Results That Compare the Precision, Recall, and F1 Scores for the Pairwise F1 Metric on PatentsView Disambiguation Methodologies


Evaluation Datasets 



NBER Assignee 
PatentsView Assignee 
Location (Inventor) 
Location (Assignee) 
Disambiguation methodology for PatentsView data updated through 05/28/2018 
Precision 
0.838 
0.977 
1.000 
0.998 
Recall 
0.971 
0.977 
1.000 
0.975 

F1 
0.900 
0.977 
1.000 
0.986 

Disambiguation methodology for PatentsView data updated through 11/27/2018 
Precision 
0.957 
1.000 
1.000 
0.999 
Recall 
0.928 
0.981 
0.980 
0.990 

F1 
0.942 
0.991 
0.990 
0.994 
The second approach uses the same metrics as above (precision, recall, and F1) but focuses on the number of distinct entity names rather than on the number of mentions. The idea with this approach is that the algorithm could be performing well using the overall metrics without performing well at disambiguation.
For example, say we have 300 mentions of ‘Green Farms Corn Technologies,’ one mention of ‘Greens Farms Corn Technologies’ and two mentions of ‘Green Farms Corn Technology.’ If our algorithm correctly clustered the 300 ‘Green Farms Corn Technologies’ together but excluded the other three mentions, it would have a very high recall (300/300+3 = .99). However, the algorithm would not have succeeded in clustering any of the entities with differently spelled names.
Name Variation metrics are used to measure disambiguation performance. These metrics measure how many different spellings of a particular name are included in a single predicted cluster. So, in the example in the previous paragraph, our recall would now be much lower since we correctly identified one name (one true positive) and falsely excluded two names (two false negatives) for a recall of 1/3 = .3333. The results for this more challenging metric are shown in Table 6, below:
Table 6. Evaluation Results Comparing the Average, True Positives, False Positives, False Negatives, and F1 Scores for the Name Variation Evaluation Metric on PatentsView Disambiguation Methodologies


Evaluation Datasets 



NBER Assignee 
PatentsView Assignee 
Location (Inventor) 
Location (Assignee) 
Disambiguation methodology for PatentsView data updated through 05/28/2018 
Average 
2.49 ± 15.25 
41.73 ± 84.89 
0.65 ± 0.56 
1.94 ± 2.70 
True positive (TP) 
7,305 
2,036 
26 
229 

False positive (FP) 
10,155 
2,682 
0 
18 

False negative (FN) 
7,841 
1,951 
32 
197 

F1 
0.448 
0.468 
0.619 
0.681 

Disambiguation methodology for PatentsView data updated through 11/27/2018 
Average 
1.90 ± 7.02 
31.14 ± 54.29 
0.69 ± 0.54 
0.78 ± 1.30 
True positive (TP) 
6,716 
2,207 
25 
352 

False positive (FP) 
5,312 
1,676 
1 
13 

False negative (FN) 
8,430 
1,780 
33 
74 

F1 
0.494 
0.561 
0.595 
0.890 