
Disambiguation Step 2: Applying Similarity Metrics


The second step in the PatentsView disambiguation process is to calculate similarity scores, which are used to recluster the canopies described in Step 1. All mentions within the canopies are reclustered based on the similarity metrics described below. The comparison process differs for each entity type (inventors, assignees, and locations), but in every case the calculation produces a numerical score that represents the degree of similarity.

Assignee Similarity Score

The assignee similarity model is a pairwise model: it scores the similarity between two assignees (also called pairwise similarity). Three features are used in the new assignee model: (a) name match, (b) a character n-gram “bag-of-attributes” model, and (c) PermID identifiers. These techniques are designed to identify text variants caused by misspellings or other data limitations. For example, it is apparent that “General Electric” and “General Electirc” are likely to be the same company. The character n-gram model learns a weighted bag-of-attributes representation that characterizes two assignee names as similar if they share several sequences of characters that are unique to the two names relative to all other assignee names (via TF-IDF weights).

PermID[1] is a publicly available knowledge base of business entities. To measure the similarity of two assignee names, we determine whether the two strings might refer to the same entity in PermID. For instance, “General Electric” and “General Electric Co” will both be “close” to the entity 4295903128 (General Electric Co), so our PermID-based feature encourages the two to be similar. Consider, however, “Oregon State University” and “Oregon University”. These two names have high textual similarity despite referring to different real-world assignees. A priori, it might be difficult for the assignee model to determine that these are different entities (especially because the word “state” appears in many other assignee names). The two names, however, resolve to different PermID identifiers and are therefore highly dissimilar under our new assignee model. Note that we use a high-precision string match to link assignee names to PermID identifiers.
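
As an illustration of the PermID feature, the following minimal sketch uses a hypothetical in-memory name-to-identifier index and a hypothetical normalize() helper; the production linking step is a high-precision string match against the full knowledge base.

```python
# Minimal sketch, not PatentsView's implementation: a tiny stand-in for a
# PermID name-to-identifier index. The lookup table and normalize() are
# hypothetical; the entity ID for General Electric Co is from the text above.
PERMID_INDEX = {
    "general electric": 4295903128,
    "general electric co": 4295903128,
}

def normalize(name: str) -> str:
    """Hypothetical normalizer applied before the high-precision match."""
    return " ".join(name.lower().split())

def permid_mismatch(name1: str, name2: str) -> int:
    """Return 1 only when both names resolve to *different* PermID entities."""
    id1 = PERMID_INDEX.get(normalize(name1))
    id2 = PERMID_INDEX.get(normalize(name2))
    return 1 if id1 is not None and id2 is not None and id1 != id2 else 0

print(permid_mismatch("General Electric", "General Electric Co"))  # 0: same entity
```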

Table 1. Assignee Bag-of-Attributes Representation

| Assignee Name | Julia Chan | Tyson Phillips | Fred Smith | Bob Jones | Jia Lu |
|---|---|---|---|---|---|
| Green Solutions, Inc. | 1 | 0 | 1 | 0 | 0 |
| Green Solutions, Ltd. | 1 | 1 | 0 | 0 | 0 |

 

We then calculate the cosine similarity between the vector representations of the two assignee mentions being compared, where the vectors contain the TF-IDF values for the character n-grams in the two strings; a minimal sketch of this computation follows. Table 2 describes the assignee similarity metrics in detail.
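
A minimal sketch of this computation, assuming character trigrams and a precomputed IDF table (an empty placeholder here), might look like this:

```python
# Sketch only: TF-IDF weighted character n-gram cosine similarity
# between two assignee name strings.
import math
from collections import Counter

def char_ngrams(s: str, n: int = 3) -> list:
    s = s.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def tfidf_vector(name: str, idf: dict) -> dict:
    counts = Counter(char_ngrams(name))
    return {g: tf * idf.get(g, 1.0) for g, tf in counts.items()}

def cosine_similarity(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(g, 0.0) for g, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# idf would be computed over the full corpus of assignee names; this
# empty placeholder gives every n-gram equal weight.
idf = {}
v1 = tfidf_vector("General Electric", idf)
v2 = tfidf_vector("General Electirc", idf)
print(cosine_similarity(v1, v2))  # high: the misspelling shares most trigrams
```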

Table 2. Assignee Similarity Metrics

| Feature | Description | Possible Values | Feature Weight |
|---|---|---|---|
| Exact Name Match | Indicator for whether the names of the two assignees are exactly the same. This feature has an infinite weight for our clustering algorithm. | 0 or Infinity | 1.0 |
| Prefix/Suffix Match | Indicator for whether the first (or last) four characters of each word in the two strings match. | 0 or Infinity | 1.0 |
| Name Similarity | TF-IDF weighted character n-gram similarity model. | Any value between 0 and 1 | 1.5 |
| PermID Mismatch | A binary indicator for whether the two assignee names refer to two different PermID entities. | 0 or 1 | -100.0 |
| Acronym Match | Indicator for whether one assignee name is an acronym for the other, based on a dictionary of company name acronyms. | 0 or Infinity | 1.0 |
| "Relaxed" Name Match | Indicator for whether, after both names are converted to lowercase and punctuation, spaces, and particular irrelevant words (e.g., org, ltd, co) are removed, the two names are the same. | 0 or Infinity | 1.0 |

 

Each metric in the above model takes a value according to its description and possible values, and these values are summed to produce the final similarity score. This score is used in the final clustering algorithm to sort mentions into their final clusters. A value of “infinity” means the model is certain the two mentions refer to the same assignee, and no other metric can change that.
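
As a concrete illustration, the following sketch (our reading of the scoring rule, with hypothetical feature values) shows how the weighted values combine and how an infinite value short-circuits the sum:

```python
# Illustrative sketch only: combining per-feature values into one score.
# Feature names and weights follow Table 2; treating any infinite feature
# value as an immediate match is our reading of the text above.
ASSIGNEE_WEIGHTS = {
    "exact_name_match": 1.0,
    "prefix_suffix_match": 1.0,
    "name_similarity": 1.5,
    "permid_mismatch": -100.0,
    "acronym_match": 1.0,
    "relaxed_name_match": 1.0,
}

def assignee_score(features: dict) -> float:
    """features maps feature name -> raw value (may be float('inf'))."""
    if any(v == float("inf") for v in features.values()):
        return float("inf")  # hard match: other metrics cannot override it
    return sum(ASSIGNEE_WEIGHTS[name] * v for name, v in features.items())

# Example: textually similar names that resolve to different PermID entities.
print(assignee_score({"name_similarity": 0.9, "permid_mismatch": 1}))  # -98.65
```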

Inventor Similarity Score

The inventor disambiguation method uses a learned linear model to determine the similarity between two sets of records. Each feature contributes a linear function of its value plus a bias term, and the resulting per-feature scores are summed to produce a final score.
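
Schematically, the model computes something like the following (placeholder weights and biases, not the learned parameters):

```python
# Sketch of the learned linear model described above: each feature value
# passes through a per-feature weight plus bias, and the results are summed.
def inventor_similarity(values: dict, weights: dict, biases: dict) -> float:
    return sum(weights[f] * v + biases[f] for f, v in values.items())

# Placeholder inputs for illustration only.
score = inventor_similarity(
    {"first_name": 0.8, "coinventor": 0.4},
    weights={"first_name": 6.0, "coinventor": 9.5},
    biases={"first_name": 0.0, "coinventor": 0.0},
)
print(score)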

For name similarity, we use a rule-based name_match_score function, designed to estimate the likelihood that the names in a group of first or middle names sharing a common last name refer to the same person. The function takes, as input, a list of first or middle names and a last name that is common to all the names in the group. It does the following (a code sketch follows the list):

  1. Check for penalty cases. If the list of names is empty, we return a “no name penalty.” Similarly, if the list of names is larger than a set maximum size, we return a “too many names penalty.” Last, we check whether the input last name matches the group’s common last name; if it does not, we return a “mismatch on common last name penalty.”
  2. After these penalty checks have been completed, we compute pairwise distances between the names. For each pair of names, we first check whether the first characters of the two strings match. If they do not, we increment the firstLetterMismatches variable by 1.
  3. Next, we run our editDistance function to compute the difference between the given pair of names and accumulate it in a nameMismatches variable.
  4. The firstLetterMismatches value is then multiplied by an initial_mismatch_weight and subtracted from the score.
  5. The nameMismatches value is multiplied by a name_mismatch_weight, and this value is also subtracted from the score.
  6. The final score is returned by the function.
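
A runnable sketch of this logic, with assumed penalty constants and a standard Levenshtein distance standing in for editDistance:

```python
# Sketch of the rule-based name_match_score logic in steps 1-6 above.
# The penalty constants, maximum group size, and editDistance
# implementation are assumptions; only the control flow follows the
# documented steps.
from itertools import combinations

NO_NAME_PENALTY = -10.0             # assumed value
TOO_MANY_NAMES_PENALTY = -10.0      # assumed value
LAST_NAME_MISMATCH_PENALTY = -10.0  # assumed value
MAX_NAMES = 8                       # assumed "set maximum size"

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, assumed to be what editDistance computes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_match_score(names, last_name, common_last_name,
                     initial_mismatch_weight=6.0, name_mismatch_weight=6.0):
    # Step 1: penalty cases.
    if not names:
        return NO_NAME_PENALTY
    if len(names) > MAX_NAMES:
        return TOO_MANY_NAMES_PENALTY
    if last_name != common_last_name:
        return LAST_NAME_MISMATCH_PENALTY
    # Steps 2-3: pairwise first-letter and edit-distance mismatches.
    first_letter_mismatches = 0
    name_mismatches = 0
    for a, b in combinations(names, 2):
        if a[:1].lower() != b[:1].lower():
            first_letter_mismatches += 1                         # step 2
        name_mismatches += edit_distance(a.lower(), b.lower())   # step 3
    # Steps 4-5: weighted penalties subtracted from the score.
    score = 0.0
    score -= initial_mismatch_weight * first_letter_mismatches
    score -= name_mismatch_weight * name_mismatches
    return score  # step 6

print(name_match_score(["John", "Jon"], "Smith", "Smith"))
```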

For the remaining types of features, such as co-inventors, patent classifications, and lawyers, we measure the cosine similarity, Shannon entropy, and a size/quantity term (Table 3). See the Methods document[1] for more details. 
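
As an illustration (our reading, not the Methods document's code), the cosine similarity and Shannon entropy of two co-inventor count vectors can be computed as follows:

```python
# Sketch of the set-based features used for co-inventors, classifications,
# and lawyers: cosine similarity between count vectors plus Shannon entropy.
import math
from collections import Counter

def cosine_sim(c1: Counter, c2: Counter) -> float:
    dot = sum(v * c2.get(k, 0) for k, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def shannon_entropy(c: Counter) -> float:
    total = sum(c.values())
    return -sum((v / total) * math.log2(v / total) for v in c.values() if v)

# Hypothetical co-inventor counts for two inventor mention clusters.
coinv_a = Counter({"J. Chan": 3, "T. Phillips": 1})
coinv_b = Counter({"J. Chan": 2, "F. Smith": 1})
print(cosine_sim(coinv_a, coinv_b), shannon_entropy(coinv_a))
```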

 

[1] Monath, N., & McCallum, A. (n.d.). Discriminative hierarchical coreference for inventor disambiguation. University of Massachusetts Amherst Information Extraction and Synthesis Laboratory. https://www.patentsview.org/data/presentations/UMassInventorDisambiguation.pdf

Table 3 shows these metrics and the weight assigned to each.

Table 3. Inventor Similarity Metrics

| Feature | Weight Name | Value | Description |
|---|---|---|---|
| First name | name_mismatch_weight | 6.0 | |
| Middle name | name_mismatch_weight | 3.0 | |
| Patent Title Embedding | cosine_similarity_weight | 10.0 | cosine_similarity_weight * cos_sim(fv1, fv2) |
| Coinventor | cosine_similarity_weight | 9.5 | |
| Coinventor | entropy_weight | 0.125 | |
| Coinventor | complexity_weight | 0.5 | |
| Assignee | cosine_similarity_weight | 9.5 | |

Note: CPC = Cooperative Patent Classification; IPCR = International Patent Classification-Revised; NBER = National Bureau of Economic Research; USPC = U.S. Patent Classification System.

 

Location Similarity Score

To determine similarity for locations, as for assignees, we focus on name similarity. This helps handle frequent variants and misspellings of city names. Most of the measures compare only records with matching states and countries. In addition to the location name, the similarity measure considers the assignees and inventors associated with a given location. Table 4 provides more detail.

Table 4. Location Similarity Metrics

| Feature | Description | Possible Values | Feature Weight |
|---|---|---|---|
| Exact Name Match | Exact match between city, state, and country for both location mentions being compared. | 0 or Infinity | 1.0 |
| Nonexistent Location Match | Indicator for whether two locations (in the same canopy, and therefore associated with the same assignee or inventor) have the same city name and one does not exist in the MaxMind database. | 0 or Infinity | 1.0 |
| Relaxed Name Match | Indicator for whether, after both locations are converted to lowercase and punctuation and spaces are removed, the two names are the same. | 0 or Infinity | 1.0 |
| City Name Similarity | Jaro-Winkler similarity between city names if the state and country are the same; 0 if the state and country differ. The Jaro-Winkler similarity is a common measure of how similar two strings are. | Any value between 0 and 1 | 1.0 |
| Name Incompatibility | A binary indicator for whether the Jaccard similarity of the assignee names is less than 0.75. | 0 or 1 | -10.0 |
| Overwhelming Number of Records Match | Indicator for whether, when comparing two locations' clusters during the clustering process, the two city names are the same (using the relaxed match described above) and one cluster has more than 1.5 times as many records associated with it as the other. | 0 or Infinity | 1.0 |
| Inventor or Assignee Similarity | Cosine similarity of the vector representations of the assignee or inventor names associated with each location. For location canopies defined by an inventor, this uses assignees, and vice versa. | Any value between 0 and 1 | 0.5 |
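
For illustration, the City Name Similarity feature from Table 4 can be sketched as follows, using a standard textbook Jaro-Winkler implementation (assumed, not PatentsView's code):

```python
# Sketch of the City Name Similarity feature: Jaro-Winkler similarity of
# city names, gated on matching state and country.
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def city_name_similarity(city1, state1, country1, city2, state2, country2):
    # Per Table 4: the feature is 0 unless state and country both match.
    if (state1, country1) != (state2, country2):
        return 0.0
    return jaro_winkler(city1.lower(), city2.lower())

print(city_name_similarity("Pittsburgh", "PA", "US", "Pittsburg", "PA", "US"))
```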