Skip to main content
 
 
 
IN THIS SECTION
5 posts
behroozkh
Last seen: 03/17/2021 - 17:50
Joined: 03/10/2021 - 11:38
Crosswalk between application.tsv and patent.tsv

Hi,

I'm new to this Forum. I have searched for this topic, but couldn't find anything related. If the question is repetitive, appologies in advance.

I am trying to select all the patents that have either an assignee or an inventor from a Nordic country (Sweden, Denmark, Norway, Finland, and Iceland).

I have used both raw data (rawlocation.tsv, rawinventor.tsv, rawassignee.tsv, patent.tsv, and application.tsv) and disambiguated data (patent_inventor.tsv, patent_assignee.tsv, location.tsv,application.tsv,and patent.tsv) to filter the data to the desired output. The number of patent_ids with Nordic assignees or inventors are 167,381 and 163,119 using disambiguated and raw data, respectively.

My problem here is that when I try to filter patent.tsv using the patent_id from the previous task, I get only 16,949 and 16,681 patents using disambiguated and raw patent_ids, respectively. But, the using the same patent_id, I can extract almost all the corresponding applications from application.tsv. 

I further figured out that the column 'id' in patent.tsv (with 7,528,963 rows) overlaps with only 726,704 values in the column 'patent_id' in application.tsv (with 7,526,704), which is about just a tenth and explains why I got almost a tenth of the desired patent_id.

Is there a problem with the columns 'id' and 'number' in patent.tsv? Or, is there a crosswalk between application.tsv and patent.tsv?

I appreciate any assistance in advance. 

Regards,

Behrooz

Russ
Last seen: 12/06/2024 - 09:12
Joined: 11/14/2017 - 22:15
UNDERLYING DATA ISSUES

behroozkh,

The problem is probably with the underlying data.  The granted patent data is available from 1976 on while the application data is only available from March 2001 on.  Also an invertor can choose to suppress publication  if they are only applying for a patent in the US so there isn't application data for every granted patent since March 2001.  

There is granted_patent_crosswalk on https://www.patentsview.org/download/pregrantpublications.html.  It's built from the granted patent xml files and only has 3,228,546 rows. I'm not sure why the granted xml doesn't always include the document number (suppression thing or something else?).  I suggested ways of making a crosswalk in a previous post  I don't know if processing the application xml would produce a more complete crosswalk.  Is there something specific you are looking for by joining?

Note that some of the links in the previous post go to reedtech.com for  xml data they no longer host.  The files can now be found on https://bulkdata.uspto.gov/

Russ

behroozkh
Last seen: 03/17/2021 - 17:50
Joined: 03/10/2021 - 11:38
PROBLEM JOINING PATENT.TSV

Hi Russ,

Thanks for your detailed response.

I may have explained the issue poorly, but actually my problem is not with the application.tsv, but with the patent.tsv. In this long post, I will try to explain my problem more clearly. Appologies for the length of this post in advance.

What I am trying to do is to separately left-join patent_assignee.tsv and patent_inventor.tsv with the patent.tsv (based on the condition that each of the patents should correspond to a location in a Nordic country), in order to access their abstract, date, and number of claims. 

So, having imported location.tsv, patent_assignee.tsv, patent_inventor.tsv, patent.tsv, I first created a list of Nordic location_ids. Then, I generated two separate tables nordic_patent_assignee and nordic_patent_inventor. Finally, I separately left-joined the two tables with patent.tsv. Please see the screenshots below:

create nordic patents

Creating a list of Nordic locations and patents 

merge nordic patent assignee

Merging nordic_patent_assignee with the patent table

merge nordic patent inventor

Merging nordic_patent_inventor with the patent table

My understanding is that the patent.tsv file should contain all the patents granted by USPTO (from 1976 onwards). Therefore, every single unique patent_id in the tables patent_assignee, patent_inventor, rawassignee, and rawinventor must have a match in the column 'id' in patent.tsv.

I may be totally wrong, but I feel that it can be difficult or nearly impossible to do any join of this kind, because the columns 'patent_id' in patent_assignee and patent_inventor (also in rawassignee, and rawinventor) do not entirely match the columns 'id' \ 'number' in patent.tsv. To test this hypothesis, I tried to create a list of unique patent IDs from the column 'patent_id' in the tables patent_assignee, patent_inventor, rawassignee, and rawinventor to check the number of matches with the column 'id' in patent.tsv. Below, please see the number of unique patent_ids in each table (which should ideally be the expected number of matches) as well as the actual number of matches:

Unique number of patents in patent.tsv: 7,528,963 

Unique patent_ids:

patent_assignee: 6,654,578 - patent_inventor: 7,527,876 - rawassignee: 6,654,578 - rawinventor: 7,527,876

Number of matches with the unique set of  patent IDs (column 'id') in patent.tsv:

patent_assignee: 551,512 - patent_inventor: 728,621 - rawassignee: 551,512 - rawinventor: 728,621

I raised the issue of application.tsv in my initial post only because the patent_id in patent_assignee and patent_inventor does match the patent_id in application.tsv (unlike in patent.tsv). I have also checked the crosswalk you very kindly suggested, but it does not help to do the join I was interested.

Hope that my explanation was clearer this time, and thanks again for your assistance.

Regards,

Behrooz

Russ
Last seen: 12/06/2024 - 09:12
Joined: 11/14/2017 - 22:15
PYTHON PROBLEM?

Hi Behrooz,

My python isn't great but I am seeing some of the same things you are seeing.  I managed to write out a csv of the 127,049 patent numbers where the assignee is from a Nordic country.  A perl script then found them all in patent.tsv.  I also wrote out the 12,000 matches the python script found and it looks like it was matching just the non utility patents for some reason.  It found 789 plant patents, 10877 design patents and 323 reissued patents.  

Another option would be to use the query tool.  It looks like you would have to issue two separate queries for each Nordic country, one for inventors and one for assignees.  You could then merge the csv files.

Russ

nordic_patent_ids=list(set(nordic_patent_assignee['patent_id'].to_list()))
result_file = open("nordic_patents.csv",'w')
wr = csv.writer(result_file, dialect='excel')
for item in nordic_patent_ids:
    wr.writerow([item,])

behroozkh
Last seen: 03/17/2021 - 17:50
Joined: 03/10/2021 - 11:38
PYTHON PROBLEM

Hi Russ,

Thanks a lot for your help.

I did exactly as you suggested and merged the files. 

Regards,

Behrooz