3 posts
WouterJ
Last seen: 11/08/2017 - 11:15
Joined: 11/07/2017 - 04:44
Data Download: Encoding, Encapsulation, Escaping, ...

Hi

In the Patents_DB_dictrionary-bulk_downloads.xlsx file, I can't find any information on the following items:

  • Encoding being used: cp1252, UTF-8, ?. I would need this in order to be sure I am loading the characters correctly.
  • Encapsulation: For what columns is encapsulation used? How is the encapsulation character escaped in case it appears inside the data?
  • Escaping: I notice some characters seem to be HTML-escaped. What method of escaping is used here? Is it part of the export method, or is this the original source data?
  • Newlines: Should I be expecting newlines within encapsulated fields? If so, Unix or Windows newlines, or both?

 

Also, there is an issue I am facing with the data. Some fields, like "name_first" in the "inventor" table, don't seem to be encapsulated, and contain text much longer than the specified 64 characters. For example, for inventor ID 4101894-1 the "name_first" is:

"""""""""""Melvin """"""""""""Cy"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This seems to be an encapsulation issue. Yet none of the other fields use encapsulation.

It is mentioned that the data is in "tab delimited format" but as explained here, knowing this is insufficient to correctly and reliably load the data. Hence, additions to the data dictionary would be welcome.

Thanks!

WouterJ
Last seen: 11/08/2017 - 11:15
Joined: 11/07/2017 - 04:44
Some more findings

After a long search, I've found that some of the files use \n line endings instead of \r\n (the patent.tsv file, for example). I received what should have been identical source files from another user, also the 2017-08-08 version, but that file had \r\n line endings. So this seems to have changed recently, and within the same version...
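For anyone who wants to check which line endings their own copy uses, here is a minimal sketch. It assumes the file (or a chunk of it) has been read in binary mode so that Python does not translate newlines; the function name `count_line_endings` is made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    # Count CRLF pairs first, then subtract them from the raw LF and CR
    # counts to get the line endings that stand alone.
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR (Unix)
    cr = data.count(b"\r") - crlf   # CR not followed by LF (old Mac style)
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Small in-memory sample with mixed endings (illustrative data only).
sample = b"id\ttitle\r\n1\tfoo\n2\tbar\r\n"
print(count_line_endings(sample))  # {'crlf': 2, 'lf': 1, 'cr': 0}
```

For a real file, the bytes could come from something like `open("patent.tsv", "rb").read()`, memory permitting. A file reporting non-zero counts for more than one style has mixed line endings.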

PVTeam
Role: moderator
Last seen: 11/29/2024 - 15:02
Joined: 10/17/2017 - 10:47
RE: Data Download: Encoding, Encapsulation, Escaping...
  • Encoding being used: cp1252, UTF-8, ?. I would need this in order to be sure I am loading the characters correctly.

A: The files are all either ASCII (if all characters can be ASCII-encoded) or UTF-8 otherwise.
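Since ASCII is a subset of UTF-8, decoding everything as UTF-8 should cover both cases. A minimal sketch for verifying this on your own data is below; the helper name `looks_like_utf8` is hypothetical, and the check only tells you whether the bytes are valid UTF-8, not whether they were originally written that way.

```python
def looks_like_utf8(raw: bytes) -> bool:
    # Strict decoding raises UnicodeDecodeError on byte sequences that are
    # not valid UTF-8, which is typically the case for non-ASCII cp1252 text.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8(b"plain ascii"))          # True
print(looks_like_utf8("é".encode("utf-8")))     # True
print(looks_like_utf8("é".encode("cp1252")))    # False
```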
 

  • Encapsulation: For what columns is encapsulation used? How is the encapsulation character escaped in case it appears inside the data?

A: There is no encapsulation. We do not expect to see tab characters within fields
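If there is no encapsulation, a loader should be told not to treat quote characters specially, otherwise values made up of double quotes can be misparsed. A minimal sketch using Python's standard `csv` module follows; the sample row is shortened, made-up data, and `read_tsv` is an illustrative name rather than anything provided with the downloads.

```python
import csv
import io


def read_tsv(text_stream):
    # QUOTE_NONE makes the reader keep double quotes as ordinary characters
    # instead of interpreting them as field encapsulation.
    reader = csv.reader(text_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Illustrative sample only: a field that contains literal double quotes.
sample = 'id\tname_first\nX-1\t"""Melvin ""Cy"""\n'
rows = read_tsv(io.StringIO(sample, newline=""))
print(rows[1])  # ['X-1', '"""Melvin ""Cy"""']
```

When reading from disk, opening the file with `open(path, encoding="utf-8", newline="")` lets the `csv` module handle the line endings itself.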
 

  • Escaping: I notice some characters seem to be HTML-escaped. What method of escaping is used here? Is it part of the export method, or is this the original source data?

A: The HTML character escaping is done during raw data processing using a custom Python process, but we know there are some remaining issues. These will be addressed in the future.
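For readers who need plain text in the meantime, the standard library's `html.unescape` decodes named and numeric character references. This is only a sketch: because the escaping comes from a custom process with known issues, there is no guarantee that this function reverses it exactly, and the wrapper name `unescape_field` is made up.

```python
import html


def unescape_field(value: str) -> str:
    # Decode HTML character references such as &amp; or &#233; in one pass.
    # Double-escaped input (e.g. "&amp;amp;") would need a second pass.
    return html.unescape(value)


print(unescape_field("AT&amp;T"))     # AT&T
print(unescape_field("caf&#233;"))    # café
```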

  • Newlines: Should I be expecting newlines within encapsulated fields? If so, Unix or Windows newlines, or both?

A: Yes, both
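With newlines possible inside fields and no encapsulation, a line-by-line reader can split one record across several lines. One possible workaround, sketched below, is to keep joining physical lines until the expected number of tab-separated fields is reached. This is a heuristic and an assumption on my part, not a documented method: it needs the column count from the data dictionary, it normalizes embedded newlines to `\n`, and it cannot detect a newline inside the last field of a record. The name `rejoin_records` is hypothetical.

```python
def rejoin_records(lines, n_fields):
    # Accumulate physical lines until the joined text holds enough tabs
    # for a full record, then split it into fields.
    records, parts = [], []
    for line in lines:
        parts.append(line.rstrip("\r\n"))   # drop Unix or Windows line ending
        joined = "\n".join(parts)
        if joined.count("\t") >= n_fields - 1:
            records.append(joined.split("\t"))
            parts = []
    if parts:
        # Incomplete trailing record, kept so it can be inspected.
        records.append("\n".join(parts).split("\t"))
    return records


# Illustrative data: the second record has a newline inside its middle field.
lines = ["1\tfoo\tbar\r\n", "2\tmulti\n", "line\tbaz\r\n", "3\ta\tb\n"]
print(rejoin_records(lines, 3))
# [['1', 'foo', 'bar'], ['2', 'multi\nline', 'baz'], ['3', 'a', 'b']]
```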