Hi
In the Patents_DB_dictrionary-bulk_downloads.xlsx file, I can't find any information on the following items:
- Encoding: Which encoding is used: cp1252, UTF-8, or something else? I need this to be sure I am loading the characters correctly.
- Encapsulation: For which columns is encapsulation used? How is the encapsulation character escaped when it appears inside the data?
- Escaping: I notice some characters seem to be HTML-escaped. What method of escaping is used here? Is it part of the export method, or is this the original source data?
- Newlines: Should I expect newlines within encapsulated fields? If so, Unix or Windows newlines, or both?
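Until the dictionary covers these points, the raw bytes can at least be probed for them. Below is a minimal sketch in Python; the function name `probe_dump` and the sample bytes are made up for illustration and are not taken from the actual export. It only reports what is present in the bytes and cannot tell which encoding or escaping was intended.

```python
import re

# Matches named, decimal and hexadecimal HTML entities such as &amp; &#233; &#xE9;
HTML_ENTITY = re.compile(rb"&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def probe_dump(raw: bytes) -> dict:
    """Collect basic facts about a raw tab-delimited dump (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False
    crlf = raw.count(b"\r\n")
    return {
        # Pure ASCII is valid in both UTF-8 and cp1252, so also report
        # whether there are any non-ASCII bytes to judge by.
        "valid_utf8": utf8_ok,
        "has_non_ascii": any(b > 0x7F for b in raw),
        "crlf_newlines": crlf,
        "lf_only_newlines": raw.count(b"\n") - crlf,
        "double_quotes": raw.count(b'"'),
        "html_entities": len(HTML_ENTITY.findall(raw)),
    }

# Made-up sample: one Windows newline, one Unix newline, UTF-8 "é",
# an HTML entity and two literal double quotes.
sample = b'id\tname_first\r\n1\tJos\xc3\xa9 &amp; "Cy"\n'
print(probe_dump(sample))
```

If `valid_utf8` is false while `has_non_ascii` is true, the file is not UTF-8, but that alone does not confirm cp1252; only the data dictionary can settle it.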
Also, there is an issue I am facing with the data. Some fields, like "name_first" in the "inventor" table, don't seem to be encapsulated, and contain text much longer than the specified 64 characters. For example, for inventor ID 4101894-1, the "name_first" is:
"""""""""""Melvin """"""""""""Cy"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
This seems to be an encapsulation issue. Yet none of the other fields use encapsulation.
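To find other rows like this one, the file can be read with quote handling switched off, so that every quote character is kept as literal data, and each value compared against the documented length. This is a minimal sketch in Python under the assumption that fields are separated by tabs and never encapsulated; the helper name `overlong_fields` and the two-column sample are hypothetical.

```python
import csv
import io

def overlong_fields(text: str, column: str, max_len: int):
    """Yield (line_number, value) for values of `column` longer than max_len."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat double quotes as ordinary characters
    )
    for row in reader:
        value = row[column]
        # A short row leaves missing columns as None; skip those.
        if value is not None and len(value) > max_len:
            yield reader.line_num, value

# Made-up sample: the second data row holds 70 quote characters.
sample = "id\tname_first\n1\tAnn\n2\t" + '"' * 70 + "\n"
hits = list(overlong_fields(sample, "name_first", 64))
for line_no, value in hits:
    print(line_no, len(value))
```

With `csv.QUOTE_NONE`, a newline inside a field would still split the record, so this check is only reliable if fields never contain newlines, which is one of the open questions above.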
It is mentioned that the data is in "tab delimited format" but as explained here, knowing this is insufficient to correctly and reliably load the data. Hence, additions to the data dictionary would be welcome.
Thanks!