Disambiguation Step 3: Clustering by Similarity Score
The next step in the disambiguation process is to use a clustering algorithm to group mentions together. We use hierarchical agglomerative clustering to cluster inventors, assignees, and locations.
- Compute the similarity score for each pair of mentions within each canopy.
- Group together the two most similar records. For assignees, if there are more than 1,000 records in the canopy, only a sample is compared, as described above for inventors, to reduce computational overhead.
- Repeat the comparisons between all mentions and the newly formed clusters from the previous step.
- For assignees, the similarity between the cluster and any other mention is defined as the maximum similarity between any element in the cluster and the mention it is being compared with.
- For locations, the similarity between the cluster and any other mention is defined as the similarity between the concatenation of all mentions in the cluster and the mention it is being compared with. Groups are represented by the canonical name when comparing similarity measures between a group and a mention.
- Each time records are clustered together, they form a node in the tree that is being built for the canopy.
- After all the records in a canopy have been merged into a single tree, the final clusters are the subtrees whose similarity score exceeds the empirically determined threshold.
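The two cluster-similarity rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production implementation: `string_similarity` is a stand-in built on the standard library's `difflib`, the function names are hypothetical, and the loop stops merging once the best score falls below the threshold instead of building the full tree first. Comparing two clusters by concatenating each side is also an assumption that extends the cluster-versus-mention rule described for locations.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; the real scoring model is not shown here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def max_linkage(cluster_a, cluster_b, similarity=string_similarity):
    """Assignee-style rule: maximum similarity over all cross pairs."""
    return max(similarity(a, b) for a in cluster_a for b in cluster_b)


def concat_linkage(cluster_a, cluster_b, similarity=string_similarity):
    """Location-style rule (assumed form): compare the concatenated mentions."""
    return similarity(" ".join(cluster_a), " ".join(cluster_b))


def cluster_canopy(mentions, threshold, linkage=max_linkage):
    """Greedily merge the most similar pair until no pair reaches the threshold."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        # Score every pair of current clusters and keep the best one.
        best_score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        # j > i, so popping j leaves index i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


canopy = ["Acme Corp", "Acme Corp.", "ACME Corporation", "Apex Industries"]
print(cluster_canopy(canopy, threshold=0.6))
print(cluster_canopy(canopy, threshold=0.6, linkage=concat_linkage))
```

Passing the linkage rule as a parameter keeps the merge loop identical for both entity types, which mirrors how the steps above differ only in how a cluster is compared with a mention.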
In summary, all records in a canopy are clustered based on their similarity scores. Each time records are clustered together, they form a node in a tree. Once all records have been merged into a tree, the final clusters are the subtrees whose similarity score exceeds the empirically determined threshold. The maximum-similarity rule used for assignees is called single linkage. To incrementally add data to a clustering produced by hierarchical agglomerative clustering, we use its incremental variant, Grinch.
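The tree-then-cut view can be illustrated with a small sketch. It assumes a non-empty canopy, uses `difflib` as a placeholder similarity, and applies single linkage throughout; the `Node` class and the function names are hypothetical and are not taken from the actual pipeline or from Grinch.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Node:
    mentions: list        # every mention (leaf) under this node
    score: float = 1.0    # similarity at which the two children were merged
    children: tuple = ()  # empty for a leaf


def similarity(a, b):
    """Placeholder similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkage(x, y):
    """Single linkage: maximum similarity over all cross pairs of mentions."""
    return max(similarity(a, b) for a in x.mentions for b in y.mentions)


def build_tree(mentions):
    """Merge the most similar pair repeatedly until one root remains."""
    nodes = [Node([m]) for m in mentions]
    while len(nodes) > 1:
        score, i, j = max(
            (linkage(nodes[i], nodes[j]), i, j)
            for i, j in combinations(range(len(nodes)), 2)
        )
        right = nodes.pop(j)  # j > i, so index i is still valid
        left = nodes[i]
        nodes[i] = Node(left.mentions + right.mentions, score, (left, right))
    return nodes[0]


def cut_tree(node, threshold):
    """Return the highest subtrees whose merge score meets the threshold."""
    if not node.children or node.score >= threshold:
        return [node.mentions]
    clusters = []
    for child in node.children:
        clusters.extend(cut_tree(child, threshold))
    return clusters


tree = build_tree(["Acme Corp", "Acme Corp.", "ACME Corporation", "Apex Industries"])
print(cut_tree(tree, threshold=0.6))
```

Keeping the full tree, instead of stopping at the first merge below the threshold, means the same tree can be cut again at a different threshold without recomputing any similarities.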