4 posts
Hegedus
Last seen: 02/04/2024 - 19:23
Joined: 10/24/2018 - 20:10
Trouble Importing g_patent.tsv

Hi,

I am attempting to create an updated database for my research. I have done this several times without issue. This time I am having tremendous problems with corrupted databases when working with the g_patent.tsv file.

I have downloaded the latest version twice, and the problem happens with both copies.

My process uses SQLite3 to create the database:

1. Create new database using DB Browser for SQLite

2. Load g_patent.tsv with automatic data type detection disabled. (This is necessary because patent_id values start out as numbers but contain text near the end of the file. Side note: if the text cases came first in the file, auto-detect would register the type correctly.)

3. Set the data types with the following command.

CREATE TABLE "patent" (
    "patent_id"    TEXT,
    "patent_type"    TEXT,
    "patent_date"    TEXT,
    "patent_title"    TEXT,
    "patent_abstract"    TEXT,
    "wipo_kind"    TEXT,
    "num_claims"    INTEGER,
    "withdrawn"    INTEGER,
    "filename"    TEXT
);

4. Create an index on the patent_id with the following command

CREATE INDEX "pat_pat" ON "patent" (
    "patent_id"    ASC
);
 

This is where the error occurs, and the database becomes corrupted, with the following messages:

(11) database corruption at line 66843 of [1b256d97b5]
(11) database corruption at line 66993 of [1b256d97b5]
(11) statement aborts at 19: [CREATE INDEX "pat_pat" ON "patent" (
    "patent_id"    ASC
);] database disk image is malformed
(1) executeSQL:  "database disk image is malformed (CREATE INDEX \"pat_pat\" ON \"patent\" (\n\t\"patent_id\"\tASC\n);)" (, :0)
 

 

This appears to be reproducible, with the same line numbers reported as the corruption point each time.
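One way to narrow this down is to ask SQLite whether the database file is already damaged before the index is built. The sketch below uses Python's standard sqlite3 module and the built-in PRAGMA integrity_check; the function name check_then_index is hypothetical, and the table and index names are taken from the commands above. As far as I know, the line number in a message such as "database corruption at line 66843 of [1b256d97b5]" refers to a line in the SQLite source code identified by the bracketed check-in hash, not to a row of the imported file, so rows 66843 and 66993 of the data may be unrelated to the failure.

```python
import sqlite3


def check_then_index(con):
    """Run PRAGMA integrity_check and build the index only if it reports 'ok'.

    Returns the list of messages from integrity_check, which is ['ok']
    when SQLite finds no problems in the database file.
    """
    messages = [row[0] for row in con.execute("PRAGMA integrity_check")]
    if messages != ["ok"]:
        # The file is already damaged; creating an index would fail too.
        return messages
    con.execute(
        'CREATE INDEX IF NOT EXISTS "pat_pat" ON "patent" ("patent_id" ASC)'
    )
    con.commit()
    return messages


# Example use (the path is a placeholder):
# con = sqlite3.connect("patents.db")
# print(check_then_index(con))
# con.close()
```

If integrity_check already reports problems right after the import, the damage happened while the data was written, not while the index was created.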

Row 66843

"10067308"    "utility"    "2018-09-04"    "Low profile fiber distribution hub"    "Certain embodiments of a fiber distribution hub include a swing frame pivotally mounted within an enclosure having a low profile. For example, the enclosure can have a depth of less than about nine inches. Termination modules can be mounted to the swing frame and oriented to slide at least partially in a front-to-rear direction to facilitate access to connectors plugged into the termination modules. Splitter modules and connector storage regions can be provided within the enclosure."    "B2"    31    0    "ipg180904.xml"
 

Row 66993

"10067459"    "utility"    "2018-09-04"    "Image forming apparatus"    "An image forming apparatus includes a main assembly, an operating portion provided slidably between a first position and a second position of the main assembly, a supporting position, a slide rail, a slidable member, and an urging unit. A relationship of engagement between the slide rail and the slidable member is set so that the engagement between the slide rail and the slidable member when the operating portion is in a position between the first position and the second position is looser than the engagement between the slide rail and the slidable member when the operating portion is in the first position or the second position."    "B2"    6    0    "ipg180904.xml"
 

I see no obvious issues.

I do not seem to have issues with 

g_cpc_at_issue.tsv

g_assignee_disambiguated.tsv

g_us_patent_citation.tsv

only g_patent.tsv

Any clues?

Andy

Russ
Last seen: 03/21/2024 - 09:05
Joined: 11/14/2017 - 22:15
input file has extra tabs

Hi Andy,

It's been a while since you've posted! It looks like 5 of the lines have an extra tab in them, though they don't line up with the line numbers you reported. Maybe try eliminating them to see if you can get it to load? There are a bunch of rows without extracts (the field is just ""), but that shouldn't keep the file from loading. They start with Row 2293993, patent_id "4468297". The patent_ids are all unique, so that shouldn't be the cause of your error. Everything else looks ok.

Rows with an extra tab
Row 4700167  patent_id "6888315"
Row 5585054  patent_id "7776999"
Row 5707291  patent_id "7899705"
Row 6160253  patent_id "8354499"
Row 7509034  patent_id "9711533"
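A check of this kind can be sketched in a few lines of Python. This is only an illustration, not necessarily how the rows above were found: the function name rows_with_unexpected_tabs is hypothetical, the expected count of 9 fields comes from the CREATE TABLE statement earlier in the thread, and the sketch assumes every record sits on a single physical line. It splits naively on tabs and ignores quoting on purpose, so a tab inside a quoted field is counted as an extra column.

```python
def rows_with_unexpected_tabs(lines, expected_fields=9):
    """Return (row_number, field_count) for lines whose naive tab split
    does not produce the expected number of fields.

    Row numbers are 1-based and count every line passed in, including
    a header line if the file has one.
    """
    bad = []
    for row_number, line in enumerate(lines, start=1):
        field_count = line.rstrip("\r\n").count("\t") + 1
        if field_count != expected_fields:
            bad.append((row_number, field_count))
    return bad


# Example use (the path is a placeholder):
# with open("g_patent.tsv", encoding="utf-8", newline="") as f:
#     for row_number, field_count in rows_with_unexpected_tabs(f):
#         print(row_number, field_count)
```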

I hope that helps,
Russ Allen
 

PVTeam
Role: moderator
Last seen: 04/24/2024 - 12:31
Joined: 10/17/2017 - 10:47
extra tabs in g_patent.tsv

Thank you Andy and Russ for pointing this out!

Looking at the five specific records you identified, it looks like these extra tab characters were introduced by a character-escaping oversight when those records were added to our database. The original abstract or title text for those patents includes the literal character sequence "\t", and when that was read into our database, those two characters were combined into a single tab character in the data. 
We will be correcting this mistaken character replacement in the next data update. In the meantime, there are a few options you can consider for performing a corrected data load: 

  • You can edit those specific tab characters out directly, as Russ suggested. 
  • You can check the settings or parameters of the program you currently use to load the data to see whether there is an option that controls quotation mark binding, and set it to its strictest value.
  • You can try loading the data into your database using a different program that will handle the tabs for you. Python's Pandas library, for example, should automatically detect that those tab characters are part of a field rather than field separators, and provides several in-built functions for connecting to and working with SQL databases.
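The third option can be sketched without pandas by using Python's standard csv and sqlite3 modules; the csv reader likewise treats a tab inside a double-quoted field as part of the field. This is a minimal sketch under stated assumptions, not the PatentsView team's method: the function names create_patent_table and load_patent_tsv are hypothetical, the schema is copied from the first post, and the file is assumed to begin with a header row and to quote text fields with double quotes.

```python
import csv
import sqlite3

INSERT_SQL = 'INSERT INTO "patent" VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)'


def create_patent_table(con):
    """Create the table with the column types from the first post."""
    con.execute(
        'CREATE TABLE IF NOT EXISTS "patent" ('
        '"patent_id" TEXT, "patent_type" TEXT, "patent_date" TEXT, '
        '"patent_title" TEXT, "patent_abstract" TEXT, "wipo_kind" TEXT, '
        '"num_claims" INTEGER, "withdrawn" INTEGER, "filename" TEXT)'
    )


def load_patent_tsv(tsv_file, con):
    """Insert every 9-field record; return the record numbers skipped.

    Record numbers count parsed records, with the header (assumed to be
    present) as record 1.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quotechar='"')
    next(reader, None)  # skip the assumed header row
    skipped = []

    def good_rows():
        for record_number, row in enumerate(reader, start=2):
            if len(row) == 9:
                yield row
            else:
                skipped.append(record_number)

    con.executemany(INSERT_SQL, good_rows())
    con.commit()
    return skipped


# Example use (both paths are placeholders):
# con = sqlite3.connect("patents.db")
# create_patent_table(con)
# with open("g_patent.tsv", encoding="utf-8", newline="") as f:
#     print(load_patent_tsv(f, con))
# con.close()
```

Returning the skipped record numbers instead of dropping them silently makes it easy to see whether any rows still fail to parse into nine fields.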

 

Thank you for your feedback, and please let us know if any other issues come up with the files!
Best,
PVTeam

Hegedus
Last seen: 02/04/2024 - 19:23
Joined: 10/24/2018 - 20:10
Thank you for the update and…

Thank you for the update. It was a head-scratching problem.

I was working with the folks on the SQLite forum, and some of them could import the file without issue. I was finally able to as well, but with an odd constraint: if the data file and the database file were both on my internal drive, it worked. If they were on external drives, errors were produced. I had started to suspect a hardware issue.

Andy