5 posts
spsg
Last seen: 08/23/2021 - 02:07
Joined: 08/22/2021 - 07:28
Errors in Detailed Description (2005 onwards)?

Hi, there seems to be a problem with the text in the Detailed Description data from 2005 onwards, for both granted patents and pre-grant publications of applications.

The issue is that words and numbers (references to features in the Drawings) in the Description text are run together, with the spaces between them missing. Two examples from 2018 granted patents (see bolded portions):

"As shown inFIGS. 3A, 4A, and 5A, the main part of the airbag25is formed by folding a single fabric piece (also referred to as a base fabric sheet, or a fabric panel) along a fold line26, which is a folding portion at the center in the width direction, to be superposed on itself in the automobile width direction, and joining the superposed parts. To distinguish the two superposed parts of the airbag25, the part located on the inner side will be referred to as a first fabric portion27, and the part located on the outer side will be referred to as a second fabric portion28."

"Continuing with the prior embodiment and in some other instances, the mobile device app303is also configured to communicate a geographical location of the employee's mobile device to the employee monitor301to use or as one or more automated clock actions. So, assuming the employee consents and is not coerced in any manner, the employee's mobile device has the mobile device app303installed and processing on that mobile device and the geographical position can be used to initiate automated clock actions."

Most, if not all, records from 2005 onwards seem to have this problem. I have not checked the Brief Summary text to see whether the problem is present in those records too.

Records from 1976 to 2004 do not appear to have this problem.

I suspect this is related to the change of the raw data XML format to version 4.0+ from 2005 onwards.

PVTeam
Role: moderator
Last seen: 08/31/2021 - 13:26
Joined: 10/17/2017 - 10:47

Thank you for bringing this to our attention. We will investigate and respond when we know more about this potential error.

Best,

PVTeam

Russ
Last seen: 09/16/2021 - 17:44
Joined: 11/14/2017 - 22:15
markup in the xml

It looks like there is markup in the XML that isn't being removed properly. The first example is patent 10035488 in ipg180731.zip:

<p id="p-0044" num="0043">As shown in <figref idref="DRAWINGS">FIGS. 3A, 4A, and 5A</figref>, the main part of the airbag <b>25</b> is formed by folding a single fabric piece (also referred to as a base fabric sheet, or a fabric panel) along a fold line <b>26</b>, which is a folding portion at the center in the width direction, to be superposed on itself in the automobile width direction, and joining the superposed parts. To distinguish the two superposed parts of the airbag <b>25</b>, the part located on the inner side will be referred to as a first fabric portion <b>27</b>, and the part located on the outer side will be referred to as a second fabric portion <b>28</b>.</p>
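One plausible way to end up with the bunched text is stripping whitespace from every text node before joining the pieces. This is only an assumption, since the thread does not show the PatentsView parser. A minimal sketch with Python's standard library, run on a shortened copy of the paragraph above:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the paragraph from patent 10035488 quoted above.
xml = (
    '<p id="p-0044" num="0043">As shown in '
    '<figref idref="DRAWINGS">FIGS. 3A, 4A, and 5A</figref>, '
    'the main part of the airbag <b>25</b> is formed by folding '
    'a single fabric piece along a fold line <b>26</b>.</p>'
)
para = ET.fromstring(xml)

# Hypothetical buggy approach (an assumption, not the actual parser):
# strip each text node, then concatenate. The spaces that sat next to
# the <figref> and <b> tags are lost.
bunched = "".join(chunk.strip() for chunk in para.itertext())

# Concatenating the text nodes unchanged keeps those spaces.
clean = "".join(para.itertext())

print(bunched)
print(clean)
```

The stripped version reproduces the "inFIGS." and "airbag25is" pattern from the first post, while the unstripped version reads normally.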

spsg
Last seen: 08/23/2021 - 02:07
Joined: 08/22/2021 - 07:28

Hi Russ, yes, you are right, I think the markup is not being removed correctly during the extraction process. However, if I extract the Detailed Description text from the original XML source files (found on the USPTO bulk data downloads webpage), it comes out cleanly without the formatting error. I used the lxml etree 'tostring' method for each paragraph.
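For anyone who wants to try the same thing, here is a rough sketch of that kind of extraction. It uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml's etree.tostring accepts the same method="text" argument. The helper name paragraph_text is made up for illustration, and the sample is a shortened version of the paragraph Russ quoted.

```python
import xml.etree.ElementTree as ET


def paragraph_text(p_xml: str) -> str:
    """Return the plain text of one <p> element, dropping inline tags."""
    para = ET.fromstring(p_xml)
    # method="text" serializes only the text content, so inline markup
    # such as <b> and <figref> disappears but the spaces around it stay.
    return ET.tostring(para, encoding="unicode", method="text")


sample = (
    '<p id="p-0044" num="0043">To distinguish the two superposed parts '
    'of the airbag <b>25</b>, the part located on the inner side will be '
    'referred to as a first fabric portion <b>27</b>.</p>'
)
text = paragraph_text(sample)
print(text)
```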

That said, it is far more convenient to extract everything from a single PatentsView TSV file than to download all those XML files and batch-process them.

Hi PVTeam, thanks for the response. Looking forward to an update when you identify/resolve this :)

PVTeam
Role: moderator
Last seen: 08/31/2021 - 13:26
Joined: 10/17/2017 - 10:47

Thank you both! We will adjust the parser to fix the data moving forward and then we will start replacing the older files after we complete the current data update. 

Best,

PVTeam