Disambiguation Step 3: Clustering by Similarity Score
Within each canopy, mentions are further grouped into clusters. After similarity scores are computed for the mentions inside each canopy (step 2 of the PatentsView disambiguation process), adaptive cluster analysis, also called adaptive clustering, groups the patent mentions into final clusters. Each final cluster represents the set of mentions associated with the same entity. Two distinct clustering algorithms are used: one for inventor mentions and another for assignee and location mentions.
Inventor Clustering Algorithm
The inventor clustering algorithm takes the following steps:
- Compute the similarity score for a random sample of mentions in each canopy, starting with the smallest canopies. Sampling accommodates the large size of inventor canopies and reduces the computing resources required.
- Join pairs of mentions whose similarity score exceeds a threshold. The threshold was determined experimentally by comparing the results obtained with different values.
- Continue to join mentions and clusters of mentions as long as joining improves the clustering criteria. The criteria are detailed in the UMass-IESL Methods Description for PatentsView Inventor Disambiguation Technical Workshop.
- Continue until no remaining join meets the threshold value. Each remaining cluster then represents a single inventor entity.
- Move to larger canopies, copying over clusters formed in the smaller canopies.
- Repeat steps 2–5 with these larger canopies, using the copied clusters.
- Assign a label to each cluster based on the most common name among its elements.
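The threshold-joining and labeling steps above can be illustrated with a minimal sketch. It is not the PatentsView implementation: `toy_similarity` is a placeholder for the step 2 similarity model, the names `cluster_canopy` and `label_cluster`, the threshold of 0.8, and the `max_pairs` sampling limit are all assumptions made for illustration, and the clustering criteria from the UMass-IESL document are replaced here by plain connected-component joining with a union-find structure. Copying clusters from smaller canopies into larger ones is not shown.

```python
import itertools
import random
from collections import Counter
from difflib import SequenceMatcher


def toy_similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 1]; stands in for the real step 2 model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_canopy(mentions, threshold=0.8, max_pairs=10_000, seed=0,
                   similarity=toy_similarity):
    """Join mentions whose sampled pairwise score exceeds the threshold."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = list(itertools.combinations(range(len(mentions)), 2))
    if len(pairs) > max_pairs:
        # Hypothetical sampling rule: score only a random subset of pairs.
        pairs = random.Random(seed).sample(pairs, max_pairs)

    for i, j in pairs:
        if similarity(mentions[i], mentions[j]) > threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


def label_cluster(cluster):
    """Label a cluster with the most common name among its elements."""
    return Counter(cluster).most_common(1)[0][0]


mentions = ["John A. Smith", "John A. Smith", "John Smith",
            "Maria Garcia", "Maria Garcia"]
clusters = cluster_canopy(mentions)
labels = sorted(label_cluster(c) for c in clusters)
```

Because every join here depends only on a pairwise score, one spurious high-scoring pair can chain two unrelated clusters together; the additional clustering criteria mentioned in the steps above are presumably what guards against that in practice.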
Assignee and Location Clustering Algorithm
The clustering algorithm for assignee and location mentions takes the following steps:
- Compute the similarity score for each pair of mentions within each canopy.
- Cluster together the two most similar mentions, that is, the pair from step 1 with the highest score. If the canopy holds more than 1,000 mentions, the clustering algorithm compares a random sample instead.
- Repeat the comparison between the clusters formed in step 2 and the remaining mentions. The similarity between clusters and mentions is calculated differently for assignees and locations.
- For assignees, use the maximum similarity between any element in the cluster and the mention being compared.
- For locations, use the combination of all elements in the cluster rather than the maximum similarity of any one element.
- When mentions are clustered together, form a node in the final cluster.
- After all of the mentions are clustered, form the final clusters from any cluster that exceeds the similarity threshold.
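The difference between the assignee and location rules can be sketched as a choice of linkage in a greedy agglomerative loop. This is an illustration under stated assumptions, not the PatentsView code: `toy_similarity`, `linkage_score`, and `agglomerate` are hypothetical names, the threshold of 0.7 is arbitrary, and the "combination of all elements" used for locations is assumed here to be the mean of the pairwise scores. The sketch stops merging once the best score falls to the threshold instead of building the full tree and cutting it afterwards, and it omits the random sampling applied to canopies with more than 1,000 mentions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def toy_similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 1]; stands in for the real step 2 model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkage_score(c1, c2, combine, similarity=toy_similarity):
    """Combine all cross-cluster pairwise scores with `combine`."""
    return combine(similarity(a, b) for a in c1 for b in c2)


def agglomerate(mentions, combine, threshold, similarity=toy_similarity):
    """Repeatedly merge the most similar pair of clusters.

    combine=max  -> assignee-style rule (best single element decides)
    combine=mean -> assumed location-style rule (all elements contribute)
    """
    clusters = [[m] for m in mentions]  # start with singleton clusters
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage_score(clusters[i], clusters[j], combine, similarity))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# The same toy strings are used for both rules to isolate the linkage effect.
names = ["Acme Corp", "Acme Corporation", "Acme Corporation Ltd"]
by_max = agglomerate(names, max, threshold=0.7)
by_mean = agglomerate(names, mean, threshold=0.7)
```

With these toy strings, the maximum rule merges all three names into one cluster because "Acme Corp" is close enough to a single member of the existing cluster, while the mean rule leaves "Acme Corp" on its own because its average similarity to the cluster falls below the threshold.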