Hi,
I'm new to this forum. I have searched for this topic but couldn't find anything related. If the question is a repeat, apologies in advance.
I am trying to select all the patents that have either an assignee or an inventor from a Nordic country (Sweden, Denmark, Norway, Finland, and Iceland).
I have used both the raw data (rawlocation.tsv, rawinventor.tsv, rawassignee.tsv, patent.tsv, and application.tsv) and the disambiguated data (patent_inventor.tsv, patent_assignee.tsv, location.tsv, application.tsv, and patent.tsv) to filter down to the desired output. The number of patent_ids with Nordic assignees or inventors is 167,381 using the disambiguated data and 163,119 using the raw data.
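For reference, the selection step looks roughly like the sketch below. It is a simplified illustration, not my exact script: the column names ('id' and 'country' in the location table, 'patent_id' and 'location_id' in the inventor and assignee link tables), the two-letter country codes, and the tiny in-memory sample are all assumptions made for the example and may not match the real files.

```python
import csv
import io

# ISO country codes for Sweden, Denmark, Norway, Finland, and Iceland
# (assumes the country column holds two-letter codes).
NORDIC = {"SE", "DK", "NO", "FI", "IS"}


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts, keeping every value as a string."""
    return list(csv.DictReader(handle, delimiter="\t"))


def nordic_patent_ids(location_rows, link_rows):
    """Return the patent_ids whose linked location has a Nordic country code.

    Assumed columns (hypothetical, may differ from the real files):
      location_rows: 'id', 'country'
      link_rows:     'patent_id', 'location_id'
    """
    nordic_locations = {
        row["id"]
        for row in location_rows
        if row["country"].strip().upper() in NORDIC
    }
    return {
        row["patent_id"]
        for row in link_rows
        if row["location_id"] in nordic_locations
    }


# Made-up sample rows standing in for location.tsv, patent_inventor.tsv,
# and patent_assignee.tsv.
locations = read_tsv(io.StringIO("id\tcountry\nloc1\tSE\nloc2\tUS\nloc3\tFI\n"))
inventors = read_tsv(io.StringIO("patent_id\tlocation_id\n7000001\tloc1\n7000002\tloc2\n"))
assignees = read_tsv(io.StringIO("patent_id\tlocation_id\n7000002\tloc2\n7000003\tloc3\n"))

# A patent qualifies if either an inventor or an assignee is Nordic.
selected = nordic_patent_ids(locations, inventors) | nordic_patent_ids(locations, assignees)
print(sorted(selected))
```

With the real files, the `io.StringIO(...)` objects would be replaced by `open(path, newline="", encoding="utf-8")`.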
My problem is that when I filter patent.tsv using the patent_ids from the previous step, I get only 16,949 and 16,681 patents for the disambiguated and raw patent_ids, respectively. However, using the same patent_ids, I can extract almost all of the corresponding applications from application.tsv.
I also found that the 'id' column in patent.tsv (7,528,963 rows) shares only 726,704 values with the 'patent_id' column in application.tsv (7,526,704 rows). That is only about a tenth, which would explain why I get roughly a tenth of the expected patents.
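The overlap check can be sketched as follows. This is a minimal illustration on made-up data, not my exact code: it assumes both files are tab-separated with a header row, and it reads every identifier as a string so that nothing is converted to a number along the way. The helper names `column_values` and `overlap_report` are hypothetical.

```python
import csv
import io


def column_values(handle, column):
    """Collect the distinct values of one column, read as stripped strings."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column].strip() for row in reader}


def overlap_report(patent_ids, application_patent_ids, sample_size=5):
    """Count shared identifiers and list a few that fail to match on each side."""
    shared = patent_ids & application_patent_ids
    return {
        "patent_ids": len(patent_ids),
        "application_patent_ids": len(application_patent_ids),
        "shared": len(shared),
        "only_in_patent": sorted(patent_ids - application_patent_ids)[:sample_size],
        "only_in_application": sorted(application_patent_ids - patent_ids)[:sample_size],
    }


# Made-up sample rows standing in for patent.tsv and application.tsv.
patent_file = io.StringIO("id\tnumber\n7000001\t7000001\n7000002\t7000002\nD500001\tD500001\n")
application_file = io.StringIO("id\tpatent_id\na1\t7000001\na2\t7000003\na3\tD500001\n")

report = overlap_report(
    column_values(patent_file, "id"),
    column_values(application_file, "patent_id"),
)
for key, value in report.items():
    print(key, value)
```

Looking at the unmatched samples from each side next to each other should show whether the two columns differ only in formatting or really contain different identifiers.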
Is there a problem with the columns 'id' and 'number' in patent.tsv? Or, is there a crosswalk between application.tsv and patent.tsv?
Thanks in advance for any assistance.
Regards,
Behrooz