alagaesia
Last seen: 01/01/2019 - 15:04
Joined: 11/16/2018 - 16:06
Little data quality improvements and some questions

Hello everybody,

I'm writing my master's thesis on patents, working with patentsview.org data. My goal is to classify patents according to their citation network. I'm using the patent, uspatentcitation and cpc_current tables. While working with them I found a few small errors, and I also have some questions. Thank you for any answers.

PS. I'm using Python 3, with Pandas, Networkx and Demon (https://github.com/GiulioRossetti/DEMON).

Patent table:

- Should I use "id" or "number" as the primary key when I join with, for example, cpc_current? The two are often equal.

- Lines 4243120, 4277941, 4308329, 4348258, 4390841 and 4400719: pd.read_csv('patent.tsv', sep='\t') raises an error: expected 11 fields, saw 12.
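One way to locate rows like these before handing the file to pandas is to count the fields on every line and compare against the header. The sketch below is only an illustration: the helper name find_malformed_rows and the three-column sample are made up for the example, and the real patent.tsv has 11 columns according to the error message.

```python
import csv
import io


def find_malformed_rows(lines, delimiter="\t"):
    """Return the header's field count and a list of (line_number, field_count)
    for every row whose field count differs from the header's."""
    # QUOTE_NONE: treat quote characters as plain text, so each physical
    # line is exactly one row.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the 1-based number of the line just read.
            bad.append((reader.line_num, len(row)))
    return expected, bad


# Invented sample data: the third line carries one extra tab in its last field.
sample = io.StringIO(
    "id\ttitle\tabstract\n"
    "P1\tWidget\tA widget.\n"
    "P2\tGadget\twith a stray\ttab\n"
    "P3\tGizmo\tA gizmo.\n"
)
expected, bad = find_malformed_rows(sample)
print(expected, bad)
```

For the real file, the same function could be called on an open file handle, for example open('patent.tsv', encoding='utf-8', newline='').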

UsPatentCitation:

- When reading the table with pd.read_csv('uspatentcitation.tsv', sep='\t'), many nodes have 11 spaces after the id, like '3930271           '. Do you know why? Should I remove the spaces manually?
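A minimal sketch of removing the padding with pandas, assuming the ids are read as strings. The column names patent_id and citation_id and the sample rows are assumptions made for this example; check them against the header of the actual file.

```python
import io

import pandas as pd

# Invented sample with the same kind of trailing padding as in the question.
sample = io.StringIO(
    "patent_id\tcitation_id\n"
    "P1\t3930271           \n"
    "P2\tP1\n"
)

# dtype=str keeps ids as text, so padded values are read as-is instead of
# being converted to numbers.
citations = pd.read_csv(sample, sep="\t", dtype=str)

# Strip leading and trailing whitespace from the id columns (assumed names).
for column in ("patent_id", "citation_id"):
    citations[column] = citations[column].str.strip()

print(citations["citation_id"].tolist())
```

Stripping both id columns before building the citation graph avoids ending up with two distinct nodes for the same patent, one padded and one not.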

Last but not least: I created an ER schema of the provided tables. May I share it? I could not find one anywhere.

Thank you in advance

Alessandro

PVTeam
Role: moderator
Last seen: 09/10/2024 - 13:29
Joined: 10/17/2017 - 10:47
Re: LITTLE DATA QUALITY IMPROVEMENTS AND SOME QUESTIONS

Hi Alessandro,

Thanks for reaching out!

  1. You should use the ‘id’ column in the patent table as the key to join with other tables. 
     
  2. Thanks for pointing out the issue with those lines in the patent table. That problem is likely caused by a tab character in the title or abstract field - we will add this to the list of things to be fixed going forward!
     
  3. In the US patent citation table there should, in fact, not be a lot of spaces after the patent id; you should definitely remove those. We will look into the issue and add this to the list of things to be fixed going forward!
     
  4. Finally, we would love to see your ER diagram; thanks for putting that together! Perhaps you could send it to us so we can look it over, and then you could post it on this forum as a possible resource for other researchers!
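As an illustration of point 1, a join keyed on the patent table's id column might look like the sketch below. The cpc_current column names patent_id and section_id, and all of the sample rows, are assumptions made for the example rather than a description of the actual files.

```python
import pandas as pd

# Invented sample rows; ids are kept as strings.
patent = pd.DataFrame(
    {
        "id": ["P1", "P2"],
        "number": ["P1", "P2"],
        "title": ["Widget", "Gadget"],
    }
)
cpc_current = pd.DataFrame(
    {
        "patent_id": ["P1", "P1", "P2"],  # assumed foreign key to patent.id
        "section_id": ["A", "B", "D"],
    }
)

# Left join from patent.id to cpc_current.patent_id. validate raises a
# MergeError if "id" turns out not to be unique in the patent table.
joined = patent.merge(
    cpc_current,
    left_on="id",
    right_on="patent_id",
    how="left",
    validate="one_to_many",
)
print(joined[["id", "section_id"]])
```

A patent can carry several CPC classifications, so the result has one row per (patent, classification) pair rather than one row per patent.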

 

Best, 

The PatentsView team

 

alagaesia
Last seen: 01/01/2019 - 15:04
Joined: 11/16/2018 - 16:06

Hello,

thank you for your answer! 

1. Yeah! All my code was written using number as the primary key.

2 and 3. I think you're right, thanks

4. ER schema: It's a draft right now. I'll share the draw.io schema with you so we can improve it together.

Best,

Alessandro