Disambiguation Step 1: Constructing Canopies
The algorithm begins by grouping mentions into “canopies,” each containing a set of mentions that are likely to match. Mentions within a canopy are then compared with each other to determine which ones refer to the same organization or individual. After these comparisons, each canopy is divided further into “clusters,” with similarity scores determining which mentions belong to which cluster. Figure 1 shows the process visually.
The rules for clustering assignees differ from the rules for clustering inventors. To understand how this process works, one must first understand the difference between patent assignees and inventors.
The inventor associated with a patent is the individual who initially conceived of the invention for which the patent confers rights.
The assignee refers to the entity that holds property rights for the patent. This is typically a company, university, research lab, or other patent-owning entity.
Sometimes the inventor and the assignee are the same, but it is more likely that a company or organization will be the assignee. Inventions are often developed by employees of an organization that reserves the property rights to patents on anything an employee invents while working there.
Breaking Down Canopies Into Clusters
To build assignee canopies, the algorithm looks at the words in each name and creates a canopy for each distinct set of four characters at the beginning of a word. If the assignee name contains multiple words, the assignee is placed in the canopy for the beginning of each word. For example, assume there are four mentions: “General Electric,” “General Electirc,” “Motorola,” and “Motorola Electronics.” The canopies created from this set of mentions would cover three sets of beginning characters: “gene,” “elec,” and “moto.” Figure 2 shows visually how these mentions would be grouped into canopies.
The above approach is used for all assignees, whether individuals or organizations. Inventors follow different, more detailed rules because the larger number of inventor mentions means many more comparisons must be made. Because there are fewer assignee mentions, even counting individual assignees, that extra step is unnecessary for assignees.
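The assignee rule can be sketched in a few lines of Python. This is only an illustration of the four-character prefix idea described above, not the production implementation: the function names are hypothetical, and the lowercasing, the punctuation stripping, and the handling of words shorter than four characters (kept whole here) are assumptions.

```python
import re
from collections import defaultdict


def assignee_canopy_keys(name, prefix_len=4):
    """Return one canopy key per word: its first `prefix_len` characters.

    Assumption: names are lowercased and split on non-alphanumeric
    characters; words shorter than `prefix_len` are used whole.
    """
    words = re.findall(r"[a-z0-9]+", name.lower())
    return {word[:prefix_len] for word in words}


def build_assignee_canopies(mentions):
    """Map each canopy key to the list of mentions placed in that canopy."""
    canopies = defaultdict(list)
    for mention in mentions:
        for key in sorted(assignee_canopy_keys(mention)):
            canopies[key].append(mention)
    return dict(canopies)


mentions = [
    "General Electric",
    "General Electirc",
    "Motorola",
    "Motorola Electronics",
]
canopies = build_assignee_canopies(mentions)
for key in sorted(canopies):
    print(key, canopies[key])
```

Note that the misspelled “General Electirc” still lands in both the “gene” and “elec” canopies, so it will be compared with the correctly spelled mention, and “Motorola Electronics” appears in both “moto” and “elec.”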
The canopy assignment rules for inventors always create four levels of canopy, each larger and less specific than the one before. A mention is placed in every canopy whose rule it satisfies. The rules are as follows:
- Mentions that match exactly on first, last, and middle names
- Mentions that share a last name (with suffixes removed) and first name
- Mentions that share a last name (with suffixes removed) and the first five characters of the first name
- Mentions that share a last name (with suffixes removed) and the first three characters of the first name
Figure 3 shows this process visually.
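As a rough sketch of the four inventor rules, each mention can be given one key per rule, and mentions that share a key share a canopy. The function names, the suffix list, and the lowercasing are assumptions made for illustration; the source does not specify which suffixes are removed or how names are normalized.

```python
from collections import defaultdict

# Assumption: an illustrative suffix list, not the one used in production.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def strip_suffix(last_name):
    """Lowercase a last name and drop trailing suffix tokens such as 'Jr.'."""
    tokens = last_name.lower().replace(".", "").replace(",", " ").split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def inventor_canopy_keys(first, middle, last):
    """Return the four canopy keys for one mention, most specific first."""
    first = first.lower().strip()
    middle = middle.lower().strip()
    last_clean = strip_suffix(last)
    return [
        ("exact", first, middle, last.lower().strip()),
        ("first_full", first, last_clean),
        ("first_5", first[:5], last_clean),
        ("first_3", first[:3], last_clean),
    ]


def build_inventor_canopies(mentions):
    """Map each canopy key to the mentions that satisfy its rule."""
    canopies = defaultdict(list)
    for mention in mentions:
        for key in inventor_canopy_keys(*mention):
            canopies[key].append(mention)
    return dict(canopies)


mentions = [
    ("Jonathan", "Q", "Smith"),
    ("Jonathon", "", "Smith Jr."),
    ("Jon", "", "Smith"),
]
canopies = build_inventor_canopies(mentions)
```

In this example, all three mentions share the three-character canopy for “jon” and “smith,” only the first two share the five-character canopy for “jonat,” and none share an exact-match canopy, which shows how the canopies grow as the rules become less specific.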
The last set of canopies is created for locations, using the inventor and assignee mentions after they have been processed by the above algorithms. A separate canopy is created for each cluster in the previous output, that is, for each distinct inventor or assignee. Each canopy stores every location associated with the mentions in its cluster. This results in roughly 3.5 million canopies, but they are relatively small and thus require few calculations per canopy to create clusters. These clusters hold mentions that are likely to represent the same place. For instance, New York, NY, and New York City, NY, would be in the same cluster because they represent the same location. Figure 4 shows this process visually.
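The location step can be sketched as a simple grouping: one canopy per inventor or assignee cluster, holding the distinct location strings from that cluster's mentions. The data layout below (two dictionaries keyed by a mention ID) and all identifiers are hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict


def build_location_canopies(mention_to_cluster, mention_to_location):
    """Create one location canopy per entity cluster.

    mention_to_cluster: mention ID -> ID of its inventor/assignee cluster
    mention_to_location: mention ID -> raw location string on that mention
    (Both structures are assumptions made for this sketch.)
    """
    canopies = defaultdict(set)
    for mention_id, cluster_id in mention_to_cluster.items():
        location = mention_to_location.get(mention_id)
        if location is not None:
            canopies[cluster_id].add(location)
    return dict(canopies)


mention_to_cluster = {"m1": "inventor_A", "m2": "inventor_A", "m3": "assignee_B"}
mention_to_location = {
    "m1": "New York, NY",
    "m2": "New York City, NY",
    "m3": "Schaumburg, IL",
}
location_canopies = build_location_canopies(mention_to_cluster, mention_to_location)
```

Here the canopy for the hypothetical cluster "inventor_A" contains only two location strings, so deciding whether “New York, NY” and “New York City, NY” are the same place takes a single comparison.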