3 posts
rory_mullen
Last seen: 02/08/2024 - 06:15
Joined: 01/01/2024 - 15:16
g_assignee_disambiguated not merging well with g_detail_desc_text

Hi there, 

Thanks for creating this great platform!

I'm trying to merge the g_assignee_disambiguated file with the g_detail_desc_text_yyyy files on patent_id, and I'm finding that only very few patent_id values match across these files. Specifically, for

2015: ~300,000 patent ids in g_detail_desc_text_2015, only ~10,000 matches with g_assignee_disambiguated are found
2016: ~305,000 patent ids in g_detail_desc_text_2016, only ~10,000 matches with g_assignee_disambiguated are found

I have not checked other years, but it does seem to me that these match rates are too low. For example, for the corresponding pg_assignee_disambiguated and pg_detail_desc_text_yyyy files, the match rates are around ten times higher, yielding over 100,000 matches for the same years.

I could be mistaken, but is it possible that something went wrong with the recent December 2023 update to the g_assignee_disambiguated file? 

Thank you again!

Best wishes, 

Rory

rory_mullen
Last seen: 02/08/2024 - 06:15
Joined: 01/01/2024 - 15:16
My stupid mistake, sorry everyone!

My mistake: the g_assignee_disambiguated dataset is absolutely fine. I was filtering for assignee_sequence == 1 (as in the pre-grant "pg" data), but I should have been filtering for assignee_sequence == 0. Hopefully this helps somebody avoid my stupid mistake in the future :)
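For anyone hitting the same thing, here is a rough sketch of the filter and merge in pandas. The tiny in-memory tables and the assignee_name and description_text columns are made up for illustration, and tab-separated input is an assumption too; only patent_id and assignee_sequence come from this thread. Looking up the smallest sequence value instead of hard-coding 0 or 1 assumes each patent's first-listed assignee carries the table's minimum value.

```python
import io

import pandas as pd

# Hypothetical miniature stand-ins for g_assignee_disambiguated and
# g_detail_desc_text_yyyy; the real files are far larger and have more columns.
assignee_tsv = io.StringIO(
    "patent_id\tassignee_sequence\tassignee_name\n"
    "10000001\t0\tAlpha Corp\n"
    "10000001\t1\tBeta LLC\n"
    "10000002\t0\tGamma Inc\n"
)
text_tsv = io.StringIO(
    "patent_id\tdescription_text\n"
    "10000001\tfirst description\n"
    "10000002\tsecond description\n"
)

# Read everything as text first so patent_id stays a string.
assignees = pd.read_csv(assignee_tsv, sep="\t", dtype=str)
texts = pd.read_csv(text_tsv, sep="\t", dtype=str)

# Find where the sequence starts in this table rather than assuming 0 or 1.
assignees["assignee_sequence"] = assignees["assignee_sequence"].astype(int)
first_seq = assignees["assignee_sequence"].min()
first_assignees = assignees[assignees["assignee_sequence"] == first_seq]

# A left merge with indicator=True shows how many text rows found an assignee.
merged = texts.merge(first_assignees, on="patent_id", how="left", indicator=True)
match_count = int((merged["_merge"] == "both").sum())
print(f"sequence starts at {first_seq}; {match_count} of {len(texts)} rows matched")
```

Counting the rows marked "both" in the _merge column is a quick way to sanity-check the match rate before working with the merged data.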

Best wishes, 

Rory

PVTeam
Role: moderator
Last seen: 09/10/2024 - 13:29
Joined: 10/17/2017 - 10:47
Merging issues

Hi Rory,
Glad you were able to find your answer! Yes, some tables begin indexing at 0, and others begin indexing at 1. Standardizing this pattern across tables is planned for a future update, but in the meantime, you can always check the starting value of the sequence column for any table in our data dictionary.
Another occasional cause of failed ID matches between tables is patent IDs that are not consistently stored in your software as string/text values. Python's pandas library, for example, when reading in large files without being told the data types of the columns, will try to infer the data type for each column in each chunk of the file that it reads in. The same value can then end up stored as a number in one place and as a string in another, and the two will not match.
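As a minimal sketch of the string-typing advice above, the snippet below reads two small in-memory tables with patent_id explicitly typed as a string and then merges them. The sample IDs, the description_text column, and the tab separator are assumptions for illustration, not a description of the actual files.

```python
import io

import pandas as pd

# Hypothetical sample rows: one numeric-looking ID and one with a letter prefix.
raw_left = "patent_id\tassignee_sequence\n10000001\t0\nD0900001\t0\n"
raw_right = "patent_id\tdescription_text\n10000001\tfirst\nD0900001\tsecond\n"

# An integer and a string never compare equal, even when they look the same.
same = 10000001 == "10000001"  # False

# Declaring the dtype up front stops pandas from inferring it for this column.
left = pd.read_csv(io.StringIO(raw_left), sep="\t", dtype={"patent_id": str})
right = pd.read_csv(io.StringIO(raw_right), sep="\t", dtype={"patent_id": str})

# Every ID is now a Python str, whether or not it looks numeric.
all_str = all(isinstance(value, str) for value in left["patent_id"])

# validate raises an error if either side has duplicate patent_id values.
merged = left.merge(right, on="patent_id", how="inner", validate="one_to_one")
print(all_str, len(merged))
```

Passing the same dtype mapping to every read_csv call that touches patent_id keeps both sides of the merge in the same representation.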
Thank you for using PatentsView!

Best,
PVTeam