2 posts
RANDom Student
Last seen: 10/09/2023 - 10:54
Joined: 03/09/2023 - 21:19
Merging PatentsView and AI Patent Dataset data

Dear PatentsView Community:

I’m trying to merge data on granted patents from the PatentsView bulk download database with USPTO's Artificial Intelligence Patent Dataset (AIPD). I found that many records from AIPD do not merge onto granted patents. Specifically, of 6,901,479 patents I used from AIPD, only 4,022,699 (about 58%) merged onto granted patents, while 2,878,780 (about 42%) did not merge.

Can anyone suggest why so many records from AIPD didn’t merge? I reviewed all reports, articles, and technical documentation on AIPD but haven't found anything suggestive.

Here are my steps:

  • For granted patents, I used PatentsView’s g_patent table. I restricted the data to utility patents using patent_type == "utility", and I restricted the date range to 1976-2020 inclusive. With these criteria, there were 6,913,035 patents.
  • For AIPD, I used the ai_model_predictions file, DTA version. I restricted the data to patents (i.e., excluded PGPubs) using flag_patent == 1, and I excluded reissue patents by dropping records where doc_id starts with RE. With these criteria, there were 6,901,479 patents, which matches the count of utility patents in Appendix D of Giczy, Pairolero, and Toole (2022).
  • I merged the PatentsView and AIPD data on patent_id (from g_patent) and doc_id (from AIPD) using a left outer join (i.e., keep all records from PatentsView and only records from AIPD where the key matched). I manually inspected the resulting table and noticed that, for records where the merge worked, the date fields from PatentsView and AIPD matched. This suggests to me that the merge worked correctly in many cases. (I also changed both date fields from string to datetime format before the merge, but I don’t think that should have affected the merge.)
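One way to narrow down a partial merge like the one described in these steps is to count how many keys fall on each side of the join and to check what type each key column actually holds. The sketch below uses only the Python standard library. The sample values are made up, and the helper names merge_report and key_types are hypothetical. In pandas, passing indicator=True to merge gives a similar breakdown through the _merge column.

```python
from collections import Counter


def merge_report(left_keys, right_keys):
    """Count distinct keys that appear on both sides or on one side only."""
    left, right = set(left_keys), set(right_keys)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


def key_types(keys):
    """Tally the Python type name of each key, e.g. {'str': 3}."""
    return dict(Counter(type(k).__name__ for k in keys))


# Made-up sample values, not real records from either dataset.
patent_ids = ["3930271", "10000000", "D257752"]  # stands in for g_patent.patent_id
doc_ids = [3930271, 10000000, 4000000]           # stands in for AIPD doc_id

report = merge_report(patent_ids, doc_ids)
types_left = key_types(patent_ids)
types_right = key_types(doc_ids)

print(report)
print(types_left, types_right)
```

In this toy case nothing matches because one side holds strings and the other holds integers, even though two of the patent numbers are the same. Looking at a handful of the left_only and right_only keys side by side is usually the quickest way to see how the two key formats differ.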

Since g_patent should represent the “universe” of granted patents, I would think that nearly all records for granted patents from AIPD would merge onto records from g_patent.

Thanks in advance for any suggestions!

Role: moderator
Last seen: 04/26/2024 - 10:39
Joined: 10/17/2017 - 10:47
Data Type Issue

Hello! We attempted the same join based on the steps you shared above (thank you for the step-by-step description of what you did!).

We found that changing the data type of doc_id to "int" brings the number of matched records much closer to 6 million.
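As a rough illustration of this kind of key normalization (a sketch only, not the code referred to in this reply), the example below converts both key columns to integers before matching. It assumes the mismatch comes from formatting differences such as leading zeros, stray whitespace, or numeric values stored as floats. That cause is an assumption, not something confirmed in this thread. The sample values are made up, and the helper name to_int_key is hypothetical.

```python
def to_int_key(value):
    """Return the key as an int when it is purely numeric, otherwise None."""
    # Numeric values read in as floats (e.g. 10000000.0) become ints first.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    text = str(value).strip()
    if text.isascii() and text.isdigit():
        return int(text)  # int() also drops leading zeros
    return None  # non-numeric IDs such as reissue ("RE...") or design ("D...")


# Made-up sample values, not real records from either dataset.
patent_ids = ["3930271", "4000000", "10000000"]          # stands in for patent_id
doc_ids = ["03930271", " 4000000", 10000000.0, "RE12345"]  # stands in for doc_id

# Matching on the raw values finds nothing in this toy case.
matched_before = len(set(patent_ids) & set(doc_ids))

# Matching on normalized integer keys, skipping IDs that are not numeric.
left = {k for k in map(to_int_key, patent_ids) if k is not None}
right = {k for k in map(to_int_key, doc_ids) if k is not None}
matched_after = len(left & right)

print(matched_before, matched_after)
```

Returning None for non-numeric IDs keeps them out of the match rather than raising an error, which fits the steps above, where reissue patents are dropped and only utility patents are kept.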

Please let us know if you would like us to share our code.

Thank you,