Hello everybody,
I'm writing my master's thesis on patents and I'm working with the patentsview.org data. My goal is to classify patents according to their citation network. I'm using the patent, uspatentcitation and cpc_current tables. While working with them I found a few small errors and I have some questions. Thank you for any answers.
PS. I'm using Python 3 with pandas, NetworkX and DEMON (https://github.com/GiulioRossetti/DEMON).
Patent table:
- Should I use "id" or "number" as the primary key when joining with, for example, cpc_current? They are often equal.
- Lines 4243120, 4277941, 4308329, 4348258, 4390841, 4400719: pd.read_csv('patent.tsv', sep='\t') raises an error: expected 11 fields, saw 12.
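In case it helps to reproduce the problem, here is a minimal sketch of how the offending rows could be located without pandas. The function name find_bad_rows is my own (hypothetical), and I am assuming the file is UTF-8 and that quote characters should be treated as ordinary text; I do not know the actual cause of the extra field.

```python
import csv


def find_bad_rows(path, delimiter="\t", encoding="utf-8"):
    """Return (line_number, field_count) for every row whose number of
    fields differs from the header row.

    Hypothetical helper: quoting is disabled (an assumption), so each
    physical line is treated as exactly one record.
    """
    bad = []
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        header = next(reader)  # line 1 defines the expected width
        for row in reader:
            if len(row) != len(header):
                # reader.line_num is the 1-based line just read from the file
                bad.append((reader.line_num, len(row)))
    return bad
```

Calling find_bad_rows('patent.tsv') should list the line numbers to inspect by hand; they may be offset from the numbers pandas reports, depending on how each tool counts the header.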
UsPatentCitation:
- When reading the table with pd.read_csv('uspatentcitation.tsv', sep='\t'), a lot of nodes have 11 spaces after the id, like '3930271 '. Do you know why? Should I remove the spaces manually?
Last but not least: I created an ER schema of the provided tables. May I share it? I could not find one anywhere.
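For now I am considering a workaround along these lines: read the ids as text and strip the padding before building the graph. This is only a sketch; the function name iter_citation_edges is hypothetical, and the column names patent_id and citation_id are my assumption about the file header, so they may need adjusting.

```python
import csv


def iter_citation_edges(path, source_col="patent_id", target_col="citation_id"):
    """Yield (citing_id, cited_id) pairs with surrounding whitespace removed.

    Column names are assumptions about the uspatentcitation header.
    Ids stay strings, so non-numeric ids are not altered.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # "or ''" guards against short rows, where DictReader fills None
            source = (row[source_col] or "").strip()
            target = (row[target_col] or "").strip()
            yield source, target
```

The cleaned pairs could then be passed to NetworkX, for example with add_edges_from on a DiGraph. In pandas the same clean-up would be a .str.strip() on the id column after reading it as a string.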
Thank you in advance
Alessandro