Disambiguation Step 2: Applying Similarity Metrics
The second step in the PatentsView disambiguation process is to calculate similarity scores, which are used to recluster the canopies described in Step 1. All mentions within a canopy are reclustered based on the similarity metrics described below. The comparison process differs for each entity type (inventors, assignees, and locations), but the method of calculation is the same: a numerical score represents the degree of similarity.
Assignee Similarity Score
Assignee similarity is computed by a model that combines several metrics into a single similarity score. Some of the metrics are based on the similarity of the names, while the slightly more complicated metrics relate to locations, inventors, and patent classifications.
The latter part of the calculation relies on the assumption that mentions with overlapping lists of these characteristics are more likely to represent the same assignee. This is done through what is called a bag-of-attributes representation. A bag-of-attributes representation creates a column for each inventor, location, or patent classification and a row for each mention being compared. Table 1 shows an example of this representation for assignees.
Table 1. Assignee Bag-of-Attributes Representation
Assignee Mention | Julia Chan | Tyson Phillips | Fred Smith | Bob Jones | Jia Lu |
---|---|---|---|---|---|
Green Solutions, Inc. | 1 | 0 | 1 | 0 | 0 |
Green Solutions, Ltd. | 1 | 1 | 0 | 0 | 0 |
One of the similarity metrics included in the model is the cosine similarity of these values for each pair of mentions being compared. Cosine similarity is a number between 0 and 1 that measures how similar two sets of values are; here, the two sets are rows of the bag-of-attributes representation above. Table 2 contains the formal definition for all metrics in the model.
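As an illustration, the two rows of Table 1 can be compared directly. This is a minimal sketch rather than the PatentsView implementation: the helper names `to_vector` and `cosine_similarity` are hypothetical, and treating an empty bag as zero similarity is an assumption.

```python
import math

# Inventors associated with each assignee mention (values from Table 1).
mentions = {
    "Green Solutions, Inc.": {"Julia Chan", "Fred Smith"},
    "Green Solutions, Ltd.": {"Julia Chan", "Tyson Phillips"},
}

# One column per attribute in the comparison, in Table 1 order.
columns = ["Julia Chan", "Tyson Phillips", "Fred Smith", "Bob Jones", "Jia Lu"]


def to_vector(attributes, columns):
    """Return a 0/1 row: 1 where the mention has the attribute, else 0."""
    return [1 if col in attributes else 0 for col in columns]


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty bag is treated as no similarity
    return dot / (norm_u * norm_v)


inc = to_vector(mentions["Green Solutions, Inc."], columns)
ltd = to_vector(mentions["Green Solutions, Ltd."], columns)
similarity = cosine_similarity(inc, ltd)
```

The two mentions share one inventor out of two each, so `similarity` comes out at about 0.5.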
Table 2. Assignee Similarity Metrics
Feature | Description | Possible Values | Feature Weight |
---|---|---|---|
Prefix/Suffix Match | Indicator for whether the first (or last) four characters of each word in the two strings match. | 0 or Infinity | 1.0 |
Name Similarity | The Jaccard similarity between the names of the two assignee mentions is used. This takes the number of distinct letters in common between the two strings divided by the total number of distinct letters in both strings combined. | Any value between 0 and 1 | 1.5 |
Name Incompatibility | A binary indicator for whether the Jaccard similarity (described above) of the assignee names is less than 6. | 0 or 1 | -100.0 |
Inventor(s) Similarity | Cosine similarity of the vector representation of inventor names associated with each assignee. | Any value between 0 and 1 | 1.0 |
Location Similarity | Cosine similarity of the vector representation of locations (concatenated city/state/country) associated with each assignee. | Any value between 0 and 1 | 0.5 |
Name and Location Match | A binary indicator for whether the name similarity is very high (>0.89) and the location similarity is also very high (>0.95), with one special condition that the location is not Washington, D.C. | 0 or 1 | 100.0 |
Word-Wise Relaxed Name Similarity | Cosine similarity of the vector representation of each individual name (or word in an organization name) for each assignee. | Any value between 0 and 1 | 100.0 |
Patent Classification Similarity | Sum of cosine similarities of the vector representations of four separate classifications (CPC, NBER, USPC, and IPCR) associated with the assignee. | Any value between 0 and 4 | 1.0 |
Bias | Constant feature added to score. | Any value | -6.0 |
Each metric in the above model takes a value according to its description and possible values. These values are totaled to produce the final similarity score, which the final clustering algorithm uses to sort mentions into their final clusters. A value of Infinity means the model treats the two mentions as the same assignee, and no other metric can change that.
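The letter-based Jaccard similarity and the totaling step can be sketched as follows. This is an illustration under stated assumptions, not the production model: the function names are hypothetical, names are lowercased and non-letters ignored, each feature value is assumed to be multiplied by its Table 2 weight before summing, and the bias is assumed to be a constant feature with value 1.

```python
import math


def jaccard_letters(name_a, name_b):
    """Distinct letters in common divided by distinct letters in either name."""
    # Assumption: case is ignored and only alphabetic characters count.
    letters_a = {ch for ch in name_a.lower() if ch.isalpha()}
    letters_b = {ch for ch in name_b.lower() if ch.isalpha()}
    union = letters_a | letters_b
    if not union:
        return 0.0  # assumption: two empty names share nothing
    return len(letters_a & letters_b) / len(union)


# Feature weights copied from Table 2 (only a subset is used here).
WEIGHTS = {
    "prefix_suffix_match": 1.0,
    "name_similarity": 1.5,
    "inventor_similarity": 1.0,
    "location_similarity": 0.5,
    "bias": -6.0,
}


def total_score(features):
    """Sum of weight * value over the supplied features (an assumed reading)."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


name_sim = jaccard_letters("Green Solutions, Inc.", "Green Solutions, Ltd.")
example = {
    "name_similarity": name_sim,
    "inventor_similarity": 0.5,   # hypothetical value
    "location_similarity": 1.0,   # hypothetical value
    "bias": 1.0,                  # constant feature (assumed to be 1)
}
score = total_score(example)

# An infinite feature value dominates every finite contribution.
forced_match = total_score({"prefix_suffix_match": math.inf, "bias": 1.0})
```

In this sketch an Infinity value survives the sum regardless of the other terms, which matches the behavior described above.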
Inventor Similarity Score
Calculating similarity scores for inventors uses a different model. Part of the inventor similarity score is based on name similarity, contained in a Name Match Score function. This function determines the likelihood that names from a cluster of first/middle names with the same last name refer to the same individual. The input of the function is a list of first or middle names that share a last name. The output is the final similarity score of the names.
The steps that this function takes are as follows:
- Check for penalties, which may include a “no name” penalty (when the list is empty), “too many names” penalty (when the list is too large), and the “mismatch on common last name” penalty (when the list does not share last names).
- Compute the difference for each pair by taking the following steps:
  - Check if the first letter matches. Increase firstLetterMismatches by 1 if not.
  - Compute how many changes need to be made to create an exact match (edit distance). Increase nameMismatches by this amount.
- Multiply firstLetterMismatches by initial_mismatch_weight, and subtract the result from the score.
- Multiply nameMismatches by name_mismatch_weight, and subtract the result from the score.
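The pairwise steps above can be sketched as follows. This is a minimal illustration, not the PatentsView code: it omits the penalty checks, assumes the score starts at zero, assumes case-insensitive comparison, and uses Levenshtein distance as the edit distance (the source does not say which variant is used). The function names are hypothetical; the example weights are the first-name values from Table 3.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_name_score(names, initial_mismatch_weight, name_mismatch_weight):
    """Score the pairwise steps only; the penalty checks are left out."""
    first_letter_mismatches = 0
    name_mismatches = 0
    for a, b in combinations(names, 2):
        # Assumption: comparison ignores case.
        if a[:1].lower() != b[:1].lower():
            first_letter_mismatches += 1
        name_mismatches += edit_distance(a.lower(), b.lower())
    score = 0.0  # assumption: the score starts at zero
    score -= first_letter_mismatches * initial_mismatch_weight
    score -= name_mismatches * name_mismatch_weight
    return score


# First-name weights from Table 3.
close_names = pairwise_name_score(["John", "Jon"], 20.0, 6.0)
distant_names = pairwise_name_score(["John", "Tom"], 20.0, 6.0)
```

With these weights, "John" and "Jon" lose only one edit's worth of score, while "John" and "Tom" lose both the first-letter penalty and three edits.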
Once the Name Match similarity has been calculated, it is combined with the rest of the metrics in the model (including the cosine similarity described in the assignee section) to produce a final score.
Table 3 shows the rest of these metrics and the weight assigned to each portion.
Table 3. Inventor Similarity Metrics
Feature | Weight Name | Value | Description |
---|---|---|---|
First name | initial_mismatch_weight | 20.0 | |
First name | no_name_penalty | 4.0 | |
First name | name_mismatch_weight | 6.0 | |
First name | mismatch_on_common_last | –1000.0 | |
First name | too_many_names_penalty | –1000.0 | |
Middle name | initial_mismatch_weight | 8.0 | |
Middle name | no_name_penalty | 0.35 | |
Middle name | name_mismatch_weight | 3.0 | |
Middle name | mismatch_on_common_last | –1000.0 | |
Middle name | too_many_names_penalty | –1000.0 | |
Patent Title Embedding | cosine_similarity_weight | 10.0 | cosine_similarity_weight * cos_sim(fv1,fv2) + cosine_similarity_bias |
Patent Title Embedding | cosine_similarity_bias | –0.25 | cosine_similarity_weight * cos_sim(fv1,fv2) + cosine_similarity_bias |
Coinventor | cosine_similarity_weight | 9.5 | |
Coinventor | cosine_similarity_bias | –0.2 | |
Coinventor | entropy_weight | 0.125 | |
Coinventor | complexity_weight | 0.5 | |
Location | cosine_similarity_weight | 9.5 | |
Location | cosine_similarity_bias | –0.1 | |
Location | entropy_weight | 0.25 | |
Location | complexity_weight | 0.5 | |
Assignee | cosine_similarity_weight | 9.5 | |
Assignee | cosine_similarity_bias | –0.1 | |
Assignee | entropy_weight | 0.125 | |
Assignee | complexity_weight | 0.5 | |
Lawyers | cosine_similarity_weight | 5.0 | |
Lawyers | cosine_similarity_bias | 0.0 | |
CPC | cosine_similarity_weight | 1.5 | |
CPC | cosine_similarity_bias | 0.0 | |
IPCR | cosine_similarity_weight | 1.5 | |
IPCR | cosine_similarity_bias | 0.0 | |
NBER | cosine_similarity_weight | 1.5 | |
NBER | cosine_similarity_bias | 0.0 | |
USPC | cosine_similarity_weight | 1.5 | |
USPC | cosine_similarity_bias | 0.0 |
Note: CPC = Cooperative Patent Classification; IPCR = International Patent Classification-Revised; NBER = National Bureau of Economic Research; USPC = U.S. Patent Classification System.
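The expression given for the Patent Title Embedding rows, `cosine_similarity_weight * cos_sim(fv1,fv2) + cosine_similarity_bias`, can be sketched as below. The embedding vectors are hypothetical, the function `feature_score` is an illustrative name, and the entropy and complexity weights are left out because the table does not define how they are applied.

```python
import math


def cos_sim(fv1, fv2):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(fv1, fv2))
    norm1 = math.sqrt(sum(a * a for a in fv1))
    norm2 = math.sqrt(sum(b * b for b in fv2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an all-zero vector contributes no similarity
    return dot / (norm1 * norm2)


def feature_score(fv1, fv2, cosine_similarity_weight, cosine_similarity_bias):
    """cosine_similarity_weight * cos_sim(fv1, fv2) + cosine_similarity_bias."""
    return cosine_similarity_weight * cos_sim(fv1, fv2) + cosine_similarity_bias


# Hypothetical two-dimensional title embeddings for two inventor mentions.
title_a = [3.0, 4.0]
title_b = [4.0, 3.0]

# Patent Title Embedding weight and bias from Table 3.
title_score = feature_score(title_a, title_b, 10.0, -0.25)
unrelated_score = feature_score([1.0, 0.0], [0.0, 1.0], 10.0, -0.25)
```

Because the bias is negative, two mentions with unrelated titles contribute slightly less than zero rather than nothing.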
Location Similarity Score
Location similarity is based largely on name similarity between mentions, but it also uses the cosine similarity of the inventors or assignees associated with the city mentions being compared. State and/or country are included as attributes to aid in the comparison. The focus on name similarity arises from the fact that cities with the same name are likely to be distinguished by their associated state/province and/or country. Part of the name similarity model checks whether two cities have the same name and one of them does not exist in the MaxMind database used for these comparisons. This is intended to accommodate cases where the wrong state was entered, but the same inventor/assignee has patented in that city and entered the correct name on a different patent. Table 4 shows the model for computing location similarity scores.
Table 4. Location Similarity Metrics
Feature | Description | Possible Values | Feature Weight |
---|---|---|---|
Exact Name Match | Exact match between city, state, and country for both the location mentions being compared. | 0 or infinity | 1.0 |
Nonexistent Location Match | Indicator for whether two locations (in the same canopy, which therefore are associated with the same assignee or inventor) have the same city name, and one does not exist in the MaxMind database. | 0 or infinity | 1.0 |
Relaxed Name Match | Indicator for whether, after both locations are converted to lowercase and have punctuation and spaces removed, the two names are the same. | 0 or infinity | 1.0 |
City Name Similarity | Jaro-Winkler similarity between city names: 1 if the state and country are the same; 0 if the state and country are not the same. The Jaro-Winkler similarity is a common measure of how similar two strings are. | Any value between 0 and 1 | 1.0 |
Name Incompatibility | A binary indicator for whether the Jaccard similarity (described above) of the assignee names is less than 0.75. | 0 or 1 | -10.0 |
Overwhelming Number of Records Match | Indicator for whether, when two locations' clusters are compared during the clustering process, the two city names are the same (using a relaxed match as described above) and one cluster has more than 1.5 times as many records associated with it as the other. | 0 or infinity | 1.0 |
Inventor or Assignee Similarity | Cosine similarity of the vector representation of either assignee or inventor names associated with each location. For location canopies defined by an inventor, assignee names are used, and vice versa. | Any value between 0 and 1 | 0.5 |
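Two of the indicators in Table 4, the relaxed name match and the overwhelming number of records match, can be sketched as follows. This is an illustration only: the function names are hypothetical, punctuation is taken to mean ASCII punctuation (an assumption), and the record counts in the example are made up.

```python
import string


def relaxed_name(name):
    """Lowercase the name and drop punctuation and whitespace."""
    # Assumption: only ASCII punctuation is stripped.
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in name.lower() if ch not in drop)


def relaxed_name_match(name_a, name_b):
    """True when the two names are identical after normalization."""
    return relaxed_name(name_a) == relaxed_name(name_b)


def overwhelming_records_match(city_a, count_a, city_b, count_b, ratio=1.5):
    """True when the cities match loosely and one cluster has more than
    `ratio` times as many records as the other."""
    if not relaxed_name_match(city_a, city_b):
        return False
    larger, smaller = max(count_a, count_b), min(count_a, count_b)
    return larger > ratio * smaller


# Hypothetical cluster sizes for two spellings of the same city.
lopsided = overwhelming_records_match("St. Louis", 400, "St Louis", 12)
balanced = overwhelming_records_match("St. Louis", 15, "St Louis", 12)
```

In the model, a true result for either indicator corresponds to the Infinity value in the table, which forces the two mentions together.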