
Disambiguation Step 4: Clustering Algorithms Evaluation

1. Constructing Canopies | 2. Applying Similarity Metrics | 3. Clustering by Similarity Score | 4. Clustering Algorithms Evaluation

An important part of developing and understanding these algorithms is evaluating their performance through analysis of the results. This is a challenge, however, because there is no preexisting ground truth against which to compare the results and determine whether the disambiguation produces the desired output. To tackle this challenge, the team used nine distinct manually labeled datasets created from samples of the larger dataset. Two methods of evaluation were used in conjunction with these datasets. The first uses the standard metrics for evaluating computer science algorithms: precision, recall, and F1, which are discussed in more detail below. The second uses a metric constructed by the PatentsView team that focuses on the number of differently spelled names that were correctly assigned.

Evaluating the Quality of Disambiguation Algorithms

The metrics for evaluating the quality of disambiguation algorithms compare the predicted clusters produced by the algorithm against the clusters from the evaluation datasets. Figure 6, below, shows an example of partially incorrect clustering.
[Figure 6. Partially incorrect clustering of General Electric and Motorola mentions into clusters A and B]

For each predicted cluster, 3 cases must be quantified:

  1. the number of mentions correctly in the cluster (true positives),
  2. the number of mentions incorrectly included in the cluster (false positives), and
  3. the number of mentions incorrectly left out of the cluster (false negatives).

In Figure 6, cluster A has two true positives (General Electric and General Electric), one false positive (Motorola Electronics), and no false negatives. Cluster B contains one true positive (Motorola), no false positives, and one false negative (Motorola Electronics).
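To make this counting concrete, below is a minimal sketch (in Python; not the PatentsView implementation) of how the three counts can be derived for the two clusters in Figure 6. The mention identifiers and the count_errors helper are invented for illustration.

```python
# Minimal sketch: counting true positives, false positives, and false negatives
# for one predicted cluster against its manually labeled counterpart.
# Mention ids are invented; they stand in for individual assignee mentions.

def count_errors(predicted_cluster, true_cluster):
    """Return (true_positives, false_positives, false_negatives) for one cluster."""
    predicted, truth = set(predicted_cluster), set(true_cluster)
    tp = len(predicted & truth)   # mentions correctly placed in the cluster
    fp = len(predicted - truth)   # mentions incorrectly included in the cluster
    fn = len(truth - predicted)   # mentions incorrectly left out of the cluster
    return tp, fp, fn

# Cluster A from Figure 6: two General Electric mentions plus a stray Motorola mention.
predicted_a = [(1, "General Electric"), (2, "General Electric"), (3, "Motorola Electronics")]
labeled_a   = [(1, "General Electric"), (2, "General Electric")]

# Cluster B from Figure 6: Motorola, with the Motorola Electronics mention left out.
predicted_b = [(4, "Motorola")]
labeled_b   = [(4, "Motorola"), (3, "Motorola Electronics")]

print(count_errors(predicted_a, labeled_a))  # (2, 1, 0)
print(count_errors(predicted_b, labeled_b))  # (1, 0, 1)
```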

These cases must be quantified against the output of the clustering algorithm being evaluated. Once the numbers of true positives, false positives, and false negatives are determined, they can be used to calculate evaluation metrics that summarize the reliability of the algorithm: precision, recall, and F1. These metrics are calculated as follows:

  • Precision – Measures the fraction of mentions predicted to be in a cluster that are true positives: Precision = TP / (TP + FP)
  • Recall – Measures the fraction of mentions that should have been in the cluster that were predicted correctly: Recall = TP / (TP + FN)
  • F1 Score – Balances the tradeoff between precision and recall in a single summary metric: F1 = 2 × Precision × Recall / (Precision + Recall). A short computational sketch of all three metrics follows this list.
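
Below is a minimal sketch of how these three formulas can be computed from the counts, using cluster A from Figure 6 as input; the function names are illustrative and not part of the PatentsView codebase.

```python
# Minimal sketch: precision, recall, and F1 from TP/FP/FN counts.

def precision(tp, fp):
    # Fraction of mentions predicted to be in the cluster that are correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of mentions that belong in the cluster that were recovered.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Cluster A from Figure 6: 2 true positives, 1 false positive, 0 false negatives.
print(round(precision(2, 1), 3))    # 0.667
print(round(recall(2, 0), 3))       # 1.0
print(round(f1_score(2, 1, 0), 3))  # 0.8
```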

Algorithm Evaluation Methodology

The team took two key approaches to evaluating the algorithm using the precision, recall, and F1 metrics defined above.

The first is a direct comparison against the manually labeled evaluation datasets, with the caveat that the test cases were intentionally selected to be difficult to disambiguate. Some results of this evaluation are shown in Table 5, below:

Table 5. Evaluation Results That Compare the Precision, Recall, and F1 Scores for the Pairwise F1 Metric on PatentsView Disambiguation Methodologies

Disambiguation methodology for PatentsView data updated through 05/28/2018

  Metric     | NBER Assignee | PatentsView Assignee | Location (Inventor) | Location (Assignee)
  Precision  | 0.838         | 0.977                | 1.000               | 0.998
  Recall     | 0.971         | 0.977                | 1.000               | 0.975
  F1         | 0.900         | 0.977                | 1.000               | 0.986

Disambiguation methodology for PatentsView data updated through 11/27/2018

  Metric     | NBER Assignee | PatentsView Assignee | Location (Inventor) | Location (Assignee)
  Precision  | 0.957         | 1.000                | 1.000               | 0.999
  Recall     | 0.928         | 0.981                | 0.980               | 0.990
  F1         | 0.942         | 0.991                | 0.990               | 0.994

The second approach uses the same metrics as above (precision, recall, and F1) but focuses on the number of distinct entity names rather than on the number of mentions. The motivation is that an algorithm can score well on the mention-level metrics without actually performing well at disambiguation, because a large number of identically spelled mentions can dominate the counts.

For example, say we have 300 mentions of ‘Green Farms Corn Technologies,’ one mention of ‘Greens Farms Corn Technologies,’ and two mentions of ‘Green Farms Corn Technology.’ If our algorithm correctly clustered the 300 ‘Green Farms Corn Technologies’ mentions together but excluded the other three mentions, it would have a very high recall (300 / (300 + 3) ≈ 0.99). However, the algorithm would not have succeeded in clustering any of the entities with differently spelled names.

Name variation metrics measure this aspect of disambiguation performance directly: they measure how many different spellings of a particular name are included in a single predicted cluster. In the example from the previous paragraph, recall would now be much lower, since the algorithm correctly identified one name (one true positive) and falsely excluded two names (two false negatives), for a recall of 1/3 ≈ 0.33, as sketched below.
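
Below is a minimal sketch contrasting the mention-level recall with the name-variation recall for the corn-technology example; the data and variable names are illustrative only.

```python
# Minimal sketch: mention-level recall vs. name-variation recall for the
# 'Green Farms Corn Technologies' example. Data are illustrative only.

# 303 mentions in the true cluster: 300 identical spellings plus three
# mentions spread across two variant spellings.
true_cluster = (["Green Farms Corn Technologies"] * 300
                + ["Greens Farms Corn Technologies"]
                + ["Green Farms Corn Technology"] * 2)

# The algorithm grouped only the 300 identically spelled mentions.
predicted_cluster = ["Green Farms Corn Technologies"] * 300

# Mention-level recall: 300 of the 303 mentions were recovered.
mention_recall = len(predicted_cluster) / len(true_cluster)

# Name-variation recall: count distinct spellings instead of individual mentions.
true_names = set(true_cluster)            # 3 distinct spellings
predicted_names = set(predicted_cluster)  # 1 distinct spelling
name_recall = len(predicted_names & true_names) / len(true_names)

print(round(mention_recall, 2), round(name_recall, 2))  # 0.99 0.33
```

The results for this more challenging metric are shown in Table 6, below: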

Table 6. Evaluation Results Comparing the Average, True Positives, False Positives, False Negatives, and F1 Scores for the Name Variation Evaluation Metric on PatentsView Disambiguation Methodologies

Disambiguation methodology for PatentsView data updated through 05/28/2018

  Metric               | NBER Assignee | PatentsView Assignee | Location (Inventor) | Location (Assignee)
  Average              | 2.49 ± 15.25  | 41.73 ± 84.89        | 0.65 ± 0.56         | 1.94 ± 2.70
  True positive (TP)   | 7,305         | 2,036                | 26                  | 229
  False positive (FP)  | 10,155        | 2,682                | 0                   | 18
  False negative (FN)  | 7,841         | 1,951                | 32                  | 197
  F1                   | 0.448         | 0.468                | 0.619               | 0.681

Disambiguation methodology for PatentsView data updated through 11/27/2018

  Metric               | NBER Assignee | PatentsView Assignee | Location (Inventor) | Location (Assignee)
  Average              | 1.90 ± 7.02   | 31.14 ± 54.29        | 0.69 ± 0.54         | 0.78 ± 1.30
  True positive (TP)   | 6,716         | 2,207                | 25                  | 352
  False positive (FP)  | 5,312         | 1,676                | 1                   | 13
  False negative (FN)  | 8,430         | 1,780                | 33                  | 74
  F1                   | 0.494         | 0.561                | 0.595               | 0.890