Little data quality improvements and some questions


Sun, 12/16/2018 - 04:54

alagaesia



Hello everybody,

I'm writing my master's thesis on patents and I'm working with the patentsview.org data. My goal is to classify patents according to their citation network. I'm using the patent, uspatentcitation and cpc_current tables. While working with them I found a few small errors, and I also have some questions. Thank you in advance for your answers.

PS. I'm using Python 3 with pandas, NetworkX and DEMON (https://github.com/GiulioRossetti/DEMON).

Patent table:

- Should I use "id" or "number" as the primary key when I join with, for example, cpc_current? The two values are often equal.

- Lines 4243120, 4277941, 4308329, 4348258, 4390841 and 4400719: pd.read_csv('patent.tsv', sep='\t') raises an error on them: expected 11 fields, saw 12.
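
To narrow down the parse errors above, here is a minimal sketch that reports every row whose field count differs from the expected one. It uses only the standard csv module. The function name find_bad_rows and the three-column sample are made up for illustration; on the real file one would open patent.tsv and pass expected_fields=11. Disabling quote handling is an assumption about the cause (a stray quote character can merge lines), not a confirmed diagnosis.

```python
import csv
import io


def find_bad_rows(handle, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    # QUOTE_NONE treats quote characters as ordinary text, so a stray '"'
    # cannot merge several physical lines into one logical row.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    bad = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the 1-based count of lines read so far,
            # header included.
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample standing in for patent.tsv: the last row has an extra tab.
sample = io.StringIO(
    "id\ttype\tnumber\n"
    "3930271\tutility\t3930271\n"
    "3930272\tutility\t3930272\textra\n"
)
bad_rows = find_bad_rows(sample, expected_fields=3)
print(bad_rows)
```

Once the offending rows are known, they can be inspected by hand to see whether the extra field comes from a tab inside a text column or from something else.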

UsPatentCitation:

- When I read the table with pd.read_csv('uspatentcitation.tsv', sep='\t'), many nodes have 11 spaces after the id, for example '3930271 '. Do you know why? Should I remove the spaces manually?

Last but not least: I created an ER schema of the provided tables. May I share it? I could not find one anywhere.
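
In case removing the padding is the way to go, this is a minimal sketch of how it could be done in pandas. The column names patent_id and citation_id and the sample rows are assumptions made for illustration; they should be replaced with the actual header of uspatentcitation.tsv.

```python
import io

import pandas as pd

# Hypothetical sample standing in for uspatentcitation.tsv; the first
# cited id carries trailing spaces like the ones described above.
sample = io.StringIO(
    "patent_id\tcitation_id\n"
    "4000000\t3930271           \n"
    "4000001\t3930272\n"
)

# dtype=str keeps every identifier as text, so non-numeric ids are not
# coerced and the padding stays visible until it is removed explicitly.
citations = pd.read_csv(sample, sep="\t", dtype=str)

# Strip leading and trailing whitespace from the id columns only.
for col in ["patent_id", "citation_id"]:
    citations[col] = citations[col].str.strip()

print(citations["citation_id"].tolist())
```

Stripping before building the graph matters because '3930271' and a padded '3930271 ' would otherwise become two different nodes.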

Thank you in advance

Alessandro
