6 posts
Hegedus
Last seen: 09/10/2021 - 17:39
Joined: 10/24/2018 - 20:10
Duplicate Values for Organization Name

Hi,

I am working with the assignee file and noticing that some organizations are listed two or more times. For example:

2anza8ga48wsus63rjst1ifmw    3    NULL    NULL    Infineon Technologies AG
5ysm5zlcaf2b02d62mgic7512    3    NULL    NULL    Infineon Technologies AG
org_61gyUoVVQyeF60uJoBif    3    NULL    NULL    Infineon Technologies AG
pn5la4wmxkdt5gof0ubafymhq    3    NULL    NULL    Infineon Technologies AG

Isn't the purpose of the disambiguation process to combine these into a single entity id?

I am also working with Natural Language Processing, where a term of art is the lemma: the base form of a word, which allows variants such as plurals to be grouped under a single entity. Perhaps that could be considered here, as in the sample below. Note the plural "Americas" in the second line. In reality these are probably the same business entity.

org_0T3DUOVT6gX9RCesn7iE    2    NULL    NULL    Infineon Technologies America Corp.
org_KvWrsyblXUCdRpJcqGns    2    NULL    NULL    Infineon Technologies Americas Corp.
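A minimal sketch of the lemma-style grouping suggested above, assuming a crude plural-stripping rule in place of a real lemmatizer. The `LEGAL_SUFFIXES` list and the `normalize_org` and `group_by_key` helpers are hypothetical names for illustration, not part of the PatentsView disambiguation process.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form tokens to ignore when comparing names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "llc", "ag", "ltd", "co"}


def normalize_org(name):
    """Return a crude comparison key for an organization name."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    key = []
    for tok in tokens:
        if tok in LEGAL_SUFFIXES:
            continue
        # Naive plural stripping stands in for real lemmatization.
        if len(tok) > 3 and tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]
        key.append(tok)
    return " ".join(key)


def group_by_key(rows):
    """Group (assignee_id, organization) pairs by their normalized key."""
    groups = defaultdict(list)
    for assignee_id, org in rows:
        groups[normalize_org(org)].append(assignee_id)
    return dict(groups)


rows = [
    ("org_0T3DUOVT6gX9RCesn7iE", "Infineon Technologies America Corp."),
    ("org_KvWrsyblXUCdRpJcqGns", "Infineon Technologies Americas Corp."),
]
groups = group_by_key(rows)
```

With the two sample rows, both ids end up under one key. Dropping legal-form tokens also merges names such as "X LLC" and "X, Inc.", which may be distinct legal entities, so the output is better treated as a list of candidates to review than as a final merge.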

 

Andy

Hegedus
Last seen: 09/10/2021 - 17:39
Joined: 10/24/2018 - 20:10
Follow On Data

Hi,

The base file has 471,974 rows.  If I group by organization, then there are 429,689 unique values.

The top duplicates, with their row counts, are:

NULL    36562
GM Global Technology Operations LLC    689
Broadcom Corporation    152
GM Global Technology Operations, Inc.    147
Unity Semiconductor Corporation    138
Warsaw Orthopedic, Inc.    129
Dow Global Technologies LLC    125
The Invention Science Fund I, LLC    93
The Invention Science Fund I LLC    87
Cordis Corporation    70
United Parcel Service of America, Inc.    70
Elwha LLC    69
Broadcom Corp.    49
Rohm and Haas Electronic Materials LLC    49
Amkor Technology, Inc.    47
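A count like the one above can be reproduced with a short script. This is a sketch that assumes the assignee file is tab-delimited with the organization name in the fifth column, as in the sample rows earlier in the thread; the `top_duplicates` helper and its parameters are hypothetical.

```python
import csv
import io
from collections import Counter


def top_duplicates(tsv_file, org_column=4, limit=15):
    """Count rows per organization name and return names seen more than once."""
    reader = csv.reader(tsv_file, delimiter="\t")
    counts = Counter(row[org_column] for row in reader if len(row) > org_column)
    return [(org, n) for org, n in counts.most_common(limit) if n > 1]


# Sample rows copied from the first post; a real run would open the file instead.
sample = io.StringIO(
    "2anza8ga48wsus63rjst1ifmw\t3\tNULL\tNULL\tInfineon Technologies AG\n"
    "5ysm5zlcaf2b02d62mgic7512\t3\tNULL\tNULL\tInfineon Technologies AG\n"
    "org_61gyUoVVQyeF60uJoBif\t3\tNULL\tNULL\tInfineon Technologies AG\n"
    "pn5la4wmxkdt5gof0ubafymhq\t3\tNULL\tNULL\tInfineon Technologies AG\n"
    "org_0T3DUOVT6gX9RCesn7iE\t2\tNULL\tNULL\tInfineon Technologies America Corp.\n"
    "org_KvWrsyblXUCdRpJcqGns\t2\tNULL\tNULL\tInfineon Technologies Americas Corp.\n"
)
result = top_duplicates(sample)
```

The literal string NULL is counted like any other value here, which matches how it appears at the top of the list above.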

 

Andy

PVTeam
Role: moderator
Last seen: 10/01/2021 - 10:33
Joined: 10/17/2017 - 10:47
RE: DUPLICATE VALUES FOR ORGANIZATION NAME

Hi,

The purpose of disambiguation is to merge multiple assignee records (with varied spellings, punctuation, etc.) into a single assignee when they are found to be the same entity.

Location is a major factor in the assignee disambiguation algorithm. When assignees are missing location information, the algorithm doesn’t have enough information to identify them as the same entity.

 In the case you brought up with the assignee ‘Infineon Technologies AG’:

  • There are 4 different assignee ids named ‘Infineon Technologies AG’. What distinguishes them is location (you can check this by joining the assignee and rawassignee tables and looking for Infineon Technologies AG). 3 of the 4 do not have an associated location, so the disambiguation process ends up clustering them as distinct from the 4th assignee, which does have a location.
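The join mentioned above could look roughly like the following sketch. It assumes that rawassignee carries `assignee_id` and `rawlocation_id` columns and that assignee has `id` and `organization`; check these names against the data dictionary before relying on them. The rows are invented placeholders loaded into an in-memory SQLite database, not real PatentsView data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignee (id TEXT PRIMARY KEY, organization TEXT);
CREATE TABLE rawassignee (
    patent_id TEXT, assignee_id TEXT, rawlocation_id TEXT, organization TEXT
);
-- Placeholder rows for illustration only.
INSERT INTO assignee VALUES ('a1', 'Example Org AG'), ('a2', 'Example Org AG');
INSERT INTO rawassignee VALUES
    ('p1', 'a1', 'loc1', 'Example Org AG'),
    ('p2', 'a1', 'loc1', 'Example Org AG'),
    ('p3', 'a2', NULL,   'Example Org AG');
""")

# For each assignee id sharing a name, count raw records and how many of
# them carry a location. COUNT(column) skips NULL values.
QUERY = """
SELECT a.id,
       COUNT(*) AS n_raw,
       COUNT(r.rawlocation_id) AS n_with_location
FROM assignee AS a
JOIN rawassignee AS r ON r.assignee_id = a.id
WHERE a.organization = ?
GROUP BY a.id
ORDER BY a.id
"""
rows = conn.execute(QUERY, ("Example Org AG",)).fetchall()
```

An id whose `n_with_location` is 0 is one that the algorithm had no location information for.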

 
It is possible in some cases (such as ‘Infineon Technologies AG’) that there is only one organization in the world with that name. We will note this and see if it can be addressed in the future as part of the disambiguation process.

 Thanks,

 PVTeam

domika
Last seen: 03/30/2021 - 14:27
Joined: 03/30/2021 - 14:24
Patents are exclusive rights…

Patents are exclusive rights granted by the state to inventors. Two related terms that often appear are inventor and invention: the inventor is the person who creates, while the invention is the inventor's idea or inspiration.

This exclusive patent right relates to all activities, starting from the cost of the logo, brand name and so on, which you can submit and have patented. If it is used without permission, it will eventually create a copyright

rippledj
Last seen: 09/25/2021 - 16:54
Joined: 09/16/2021 - 14:05
RE: DUPLICATE VALUES FOR ORGANIZATION NAME

Hi,

Since my server is not very powerful, I merged the two provided assignee tables into a single table that contains only the patent ID and the organization name / inventor name. Full duplicates are then collapsed with GROUP BY, and I no longer need two JOIN statements, which saves some query time.

I'm curious why the assignee_id strings are so long, since there is no obvious need for it. Perhaps there is a reason and I just haven't read the documentation yet. Long ids may make JOIN statements a little slower and put a small burden on RAM.

I can also make a suggestion: many records have `name_first` containing an organization name, and those records mostly (but not always) have name_last = '' AND organization = '' (2538 records). You could move those values to their proper place in the organization column and include them in the disambiguation algorithm.
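A rough sketch of that clean-up, assuming each record is a dict with `name_first`, `name_last`, and `organization` keys. The `move_misplaced_org` helper is hypothetical, and the rule (empty last name and empty organization) is only the heuristic described above, so it would also catch individuals recorded with a first name only.

```python
def move_misplaced_org(record):
    """Move name_first into organization when it is the only populated name field."""
    first = (record.get("name_first") or "").strip()
    last = (record.get("name_last") or "").strip()
    org = (record.get("organization") or "").strip()
    if first and not last and not org:
        fixed = dict(record)  # copy so the input record is left unchanged
        fixed["organization"] = first
        fixed["name_first"] = ""
        return fixed
    return record


# Invented example records, not real data.
misplaced = {"name_first": "Example Widgets LLC", "name_last": "", "organization": ""}
person = {"name_first": "Jane", "name_last": "Doe", "organization": ""}

fixed_misplaced = move_misplaced_org(misplaced)
fixed_person = move_misplaced_org(person)
```

Only the first record is changed; the second keeps its first and last name as they were.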

I'm particularly amazed by the disambiguation algorithm and I believe I read something about how you accomplished this, but it was with specialized software for Windows.  I'm curious about building a Python3 script that can do the same thing.

rippledj
Last seen: 09/25/2021 - 16:54
Joined: 09/16/2021 - 14:05
I can also make suggestion…

Thank you.