Disambiguation Step 2: Applying Similarity Metrics
The second step in the PatentsView disambiguation process is to calculate similarity scores. Similarity scores are used to cluster the mentions within each of the canopies described in Step 1, based on the similarity metrics described below. The comparison process differs for the two entity types, inventors and assignees, but both methods represent similarity as a numerical score.
The assignee similarity model is a pairwise model: it scores the similarity between two assignees (also called pairwise similarity). Two features are used in the new assignee model: (a) name match and (b) a custom character n-gram “bag-of-attributes” model. These techniques are designed to identify cases where two assignees should be clustered together even though one name is misspelled, abbreviated, or shortened. For example, “General Electric” and “General Electirc” are very likely the same company. The character n-gram model learns a weighted bag-of-attributes representation that characterizes two assignee names as similar if they share several character sequences that are rare among all the other assignee names (TF-IDF weights).
For data released through June 30, 2022, PermID, a publicly available knowledge base of business entities, was incorporated as a feature in our assignee disambiguation model. Because PermID hasn't been maintained since 2021, we removed the PermID feature and migrated to a custom n-gram model that uses the organization names available in the raw patent data to train the “bag-of-attributes” model. We call it a custom n-gram model because it treats each common English word as a single feature instead of a set of n-gram features. In other words, common English words are not broken down into character n-grams during vectorization.
For example, the assignee “Ramat Metal Works” would previously generate the following n-grams:
[' R', 'Ra', 'am', 'ma', 'at', 't ', ' Ra', 'Ram', 'ama', 'mat', 'at ', ' Ram', 'Rama', 'amat', 'mat ', ' Rama', 'Ramat', 'amat ', ' Ramat', 'Ramat ', ' Ramat ', ' M', 'Me', 'et', 'ta', 'al', 'l ', ' Me', 'Met', 'eta', 'tal', 'al ', ' Met', 'Meta', 'etal', 'tal ', ' Meta', 'Metal', 'etal ', ' Metal', 'Metal ', ' Metal ', ' W', 'Wo', 'or', 'rk', 'ks', 's ', ' Wo', 'Wor', 'ork', 'rks', 'ks ', ' Wor', 'Work', 'orks', 'rks ', ' Work', 'Works', 'orks ', ' Works', 'Works ', ' Works ']
With the custom model, the words “Metal” and “Works” are treated as common words and are kept whole rather than broken into n-grams:
[' R', 'Ra', 'am', 'ma', 'at', 't ', ' Ra', 'Ram', 'ama', 'mat', 'at ', ' Ram', 'Rama', 'amat', 'mat ', ' Rama', 'Ramat', 'amat ', ' Ramat', 'Ramat ', ' Ramat ', 'Metal', 'Works']
This is done to reduce the generated similarity score between assignees that share common words.
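The difference between the two lists above can be sketched in a few lines of Python. This is only an illustration, not the PatentsView implementation: the function names, the n-gram length range (2 to 7, chosen because it reproduces the lists shown above), and the `common_words` set are all assumptions.

```python
def char_wb_ngrams(word, min_n=2, max_n=7):
    """Character n-grams of a single word padded with one space on each side.

    The length range 2..7 is an assumption inferred from the example lists.
    """
    padded = f" {word} "
    grams = []
    for n in range(min_n, max_n + 1):
        if n > len(padded):
            break
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


def custom_ngrams(name, common_words):
    """Keep common words whole; break every other word into n-grams."""
    features = []
    for word in name.split():
        if word.lower() in common_words:
            features.append(word)  # one feature for the whole word
        else:
            features.extend(char_wb_ngrams(word))
    return features


# Hypothetical common-word list; the real list is not given in the text.
COMMON_WORDS = {"metal", "works"}

old_style = custom_ngrams("Ramat Metal Works", common_words=set())
new_style = custom_ngrams("Ramat Metal Works", common_words=COMMON_WORDS)
print(len(old_style), len(new_style))
```

Because “Metal” and “Works” each contribute a single feature instead of 21 n-grams, two unrelated names that merely share those words overlap in far fewer features.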
Table 1. Assignee Bag-of-Attributes Representation
|Assignee Mention||Julia Chan||Tyson Phillips||Fred Smith||Bob Jones||Jia Lu|
|Green Solutions, Inc.||1||0||1||0||0|
|Green Solutions, Ltd.||1||1||0||0||0|
We then calculate the cosine similarity between the vector representations of the two assignee mentions being compared, where each vector contains the weighted values for the character n-grams in that mention's string. Table 2 describes the assignee similarity metrics in detail.
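As a rough illustration of the cosine similarity step, the sketch below builds TF-IDF weighted character n-gram vectors for a tiny corpus and compares them. It is a simplified stand-in, not the production model: the n-gram lengths (2 and 3), the smoothed IDF formula, and all function names are assumptions, and it does not apply the common-word handling described earlier.

```python
import math
from collections import Counter


def char_ngrams(text, min_n=2, max_n=3):
    """Lowercased character n-grams of the space-padded string (lengths assumed)."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def tfidf_vectors(names):
    """Return one sparse {n-gram: weight} vector per name."""
    counts = [Counter(char_ngrams(name)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    # Smoothed inverse document frequency: rarer n-grams get larger weights.
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v2.get(g, 0.0) for g, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


names = ["General Electric", "General Electirc", "Ramat Metal Works"]
vectors = tfidf_vectors(names)
print(cosine_similarity(vectors[0], vectors[1]))  # misspelled pair
print(cosine_similarity(vectors[0], vectors[2]))  # unrelated pair
```

The misspelled pair shares most of its n-grams and therefore scores much higher than the unrelated pair.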
Table 2. Assignee Similarity Metrics
|Feature||Description||Possible Values||Feature Weight|
|Exact Name Match||Indicator for whether the names of the two assignees are exactly the same. This feature has an infinite weight for our clustering algorithm.||0 or Infinity||1.0|
|Prefix/Suffix Match||Indicator for whether the first (or last) four characters of each word in the two strings match.||0 or Infinity||1.0|
|Name Similarity||TF-IDF weighted character n-gram similarity model||Any value between 0 and 1||1.5|
|Acronym Match||Indicator for whether one assignee name is an acronym for the other, based on a dictionary of company name acronyms.||0 or Infinity||1.0|
|"Relaxed" Name Match||Indicator for whether the two names are the same after both are converted to lowercase and have punctuation, spaces, and particular irrelevant words (e.g., org, ltd, co) removed.||0 or Infinity||1.0|
Each metric in the above model has its value determined based on the description and possible values. These values are totaled together to find the final similarity score. This score is used to identify clusters. A value of “infinite” implies that the model believes the two assignee mentions identify the same assignee, and other metrics will not change that.
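A minimal sketch of how such a total could be computed is shown below. It assumes that each feature value is multiplied by its weight from Table 2 before summing, which the text does not state explicitly. It covers only the exact match, "relaxed" match, and name similarity features; the helper names are hypothetical, and the irrelevant-word list contains only the three examples given in Table 2.

```python
import math
import re

# Only the examples named in Table 2; the full list is not given in the text.
IRRELEVANT_WORDS = {"org", "ltd", "co"}
INDICATOR_WEIGHT = 1.0        # weight of the 0-or-Infinity features (Table 2)
NAME_SIMILARITY_WEIGHT = 1.5  # weight of the n-gram similarity (Table 2)


def relaxed_name(name):
    """Lowercase, drop punctuation, irrelevant words, and spaces."""
    no_punct = re.sub(r"[^\w\s]", " ", name.lower())
    return "".join(w for w in no_punct.split() if w not in IRRELEVANT_WORDS)


def indicator(matched):
    """Indicator features take the value Infinity on a match, else 0."""
    return math.inf if matched else 0.0


def assignee_score(name_a, name_b, name_similarity):
    """Sum the weighted feature values; name_similarity is a value in [0, 1]."""
    features = [
        (indicator(name_a == name_b), INDICATOR_WEIGHT),
        (indicator(relaxed_name(name_a) == relaxed_name(name_b)), INDICATOR_WEIGHT),
        (name_similarity, NAME_SIMILARITY_WEIGHT),
    ]
    return sum(value * weight for value, weight in features)


print(assignee_score("Green Solutions, Ltd.", "green solutions co", 0.8))
print(assignee_score("General Electric", "Ramat Metal Works", 0.1))
```

Once any indicator feature is infinite, the sum is infinite regardless of the name similarity value, which mirrors the statement that other metrics cannot change that outcome.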
Inventor Similarity Score
The inventor disambiguation method uses a learned linear model that determines the similarity between two sets of records. Each feature is computed as a linear function of its value and a bias term. The resulting scores from each of the features are summed to produce a final score.
For the computation of name similarity, we use a rule-based name_match_score function. The function is designed to determine the likelihood that names from a group of first or middle names with the same last name match. The function takes, as input, a list of first or middle names and a last name that is common to all the names in the group. It does the following:
- Check the number of penalty cases. If the list of names is empty, we return a “no name penalty.” Similarly, if the list of names is larger than a set maximum size, we return a “too many names penalty.” Last, we check if the last name matches the common names for this group. If this is not the case, we return a “mismatch on common last name penalty.”
- After these penalty checks have been completed, we begin to compute pairwise distances between the names. For all pairs of names, we begin by checking if the first characters in the two strings match. If they do not, we increase the firstLetterMismatches variable by 1.
- Next, we run our editDistance function to compute the difference between the given pair of names and store this in a nameMismatches variable.
- The firstLetterMismatches value is then multiplied by an intial_mismatch_weight and subtracted from the score.
- The nameMismatches value is multiplied by a name_mismatch_weight, and this value is also subtracted from the score.
- The final score is returned by the function.
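The steps above can be sketched as follows. This is an illustrative reading of the description, not the actual name_match_score code: every constant is a placeholder value, the starting score of 0 is assumed, edit distances are assumed to accumulate over all pairs, and the common last name check is modeled as "more than one distinct last name in the group".

```python
from itertools import combinations

# All values below are placeholders; the real penalties and weights are not given.
NO_NAME_PENALTY = -1.0
TOO_MANY_NAMES_PENALTY = -10.0
LAST_NAME_MISMATCH_PENALTY = -10.0
MAX_NAMES = 5
INITIAL_MISMATCH_WEIGHT = 2.0
NAME_MISMATCH_WEIGHT = 0.5


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]


def name_match_score(names, last_names):
    """Score a group of first (or middle) names that should share a last name."""
    # Penalty checks.
    if not names:
        return NO_NAME_PENALTY
    if len(names) > MAX_NAMES:
        return TOO_MANY_NAMES_PENALTY
    if len(set(last_names)) > 1:
        return LAST_NAME_MISMATCH_PENALTY

    # Pairwise comparisons.
    first_letter_mismatches = 0
    name_mismatches = 0
    for a, b in combinations(names, 2):
        if a[:1] != b[:1]:
            first_letter_mismatches += 1
        name_mismatches += edit_distance(a, b)

    score = 0.0
    score -= first_letter_mismatches * INITIAL_MISMATCH_WEIGHT
    score -= name_mismatches * NAME_MISMATCH_WEIGHT
    return score


print(name_match_score(["John", "Jon"], ["Smith"]))
print(name_match_score(["John", "Ron"], ["Smith"]))
```

A group of near-identical first names is penalized only slightly, while a group whose names differ in their first letter loses the larger first-letter penalty as well.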
For the remaining types of features, such as co-inventors, patent classifications, and lawyers, we measure the cosine similarity, Shannon entropy, and a size/quantity term (Table 3). See the Methods document for more details.
Table 3 shows the rest of these metrics and the weight assigned to each portion.
Table 3. Inventor Similarity Metrics
|Patent Title Embedding||cosine_similarity_weight||10.0||cosine_similarity_weight * cos_sim(fv1,fv2)|
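The formula in the Patent Title Embedding row can be illustrated with a short Python sketch. The weight of 10.0 and the formula come from Table 3; the function names and the toy embedding vectors are assumptions, and how the embeddings themselves are produced is not covered here.

```python
import math

COSINE_SIMILARITY_WEIGHT = 10.0  # value listed in Table 3


def cos_sim(fv1, fv2):
    """Cosine similarity between two dense feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(fv1, fv2))
    norm1 = math.sqrt(sum(a * a for a in fv1))
    norm2 = math.sqrt(sum(b * b for b in fv2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm1 * norm2)


def title_embedding_score(fv1, fv2):
    """cosine_similarity_weight * cos_sim(fv1, fv2), as in Table 3."""
    return COSINE_SIMILARITY_WEIGHT * cos_sim(fv1, fv2)


# Toy three-dimensional "embeddings" used purely for illustration.
print(title_embedding_score([0.2, 0.7, 0.1], [0.25, 0.6, 0.15]))
```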
 Monath, N., & McCallum, A. (n.d.). Discriminative hierarchical coreference for inventor disambiguation. University of Massachusetts Amherst Information Extraction and Synthesis Laboratory. https://s3.amazonaws.com/data.patentsview.org/documents/UMassInventorDisambiguation.pdf