Skip to main content
 
 
 
IN THIS SECTION
3 posts
WouterJ
Last seen: 11/08/2017 - 11:15
Joined: 11/07/2017 - 04:44
Data Download: Newlinews and Tabs in text fields

Hi

First of all, thank you for providing this data. Unfortunately though the data contains some problems that make it unsuitable to upload into common database platforms, like MySQL and MS SQL Server. While I was loading the data, I noticed that some (probably not more than a handful) records contain tabs and newlines inside text fields. These fields are encapsulated by double quotes, so technically this is correct. Yet this is such an uncommon practice that MySQL and MS SQL Server are not even designed to be able to load it.

After digging a bit deeper, I noticed these newlines and tab characters appear in pieces of text that look like local paths (c:\users\....) or URL's. If a \t \n \r appears in this path, the character sequence has been replaced by the control character (tab, newline). This indicates that something went wrong while exporting the data, or worse, while loading the data into your own database.

Some examples would be the patent table and the otherreference table. For the otherreference table I have been able to compile a list of affected records. I will attach the uuid-patent_id pairs below.

I can suggest a couple of fixes, from easy to implement, to harder:

  • Perform a search in all text columns for tab and newline characters, and remove them. Doing this inside your database is much easier than doing it once the data has been exported. You should do a visual check, but they seem to be errors in the data to begin with, so manipulating the data a bit more shouldn't be a problem.
  • If you are convinced the tabs and newlines need to remain in the data, it would be better to choose a field and newline separator during the export that does not appear inside your data. Although it is very common to use \r\n as line-separator and \t as field-separator, this is no fixed requirement, and using another character actually avoids the issue of encapsulation, and the issue of MySQL and MS SQL Server not supporting newlines within encapsulated fields.
  • And then the hardest solution... Since this issue indicates a possibility that at some point the data has been loaded wrongly, it might be interesting to try and find the loading procedure that caused this issue. This might prevent future issues. I have been in contact with another user who has had no issue loading previous versions of the data into MySQL, hence I suspect this is a recent issue.

 

I'm looking forward to using this data, yet don't have the time required to implement a custom fix to remove the "invalid" newlines and tabs from the current source files.

Regards

Wouter

"Appendix": List of otherreference rows containing tab and newline issues in the text column

069aj5agwj1y51tlk1qark54j    D675840
0iwdsypz792x5rlh092wt8i3j    7978592
0jdla7l5khavwi3tdb7g1ho88    8402319
1cs6kdllx4j7c3hqqmlre3xvz    7499590
1kzjntfxshx4npa5m3xoi5f7b    6647208
1v5a3tv1i9gsjytl5y8zblwhc    7934254
2fr0ycudtrwb8eqczsw81fab9    7698638
39xojhtedrqj0z15hve5e21a2    7446180
3iqu4tp0jmiyyz49ytev43ski    7921459
3pye22pu2qnhcyvzfeg1xxtrz    7917588
46m40s6o1fsvnmxaxxr00jja1    8929039
4aflaft1ye5bimosdgirco8en    9419824
4j5go4phadsspzpjss2sn2mm6    8238467
4mziublufpmnsjw54qh3v5o2n    7178166
4qbg377fvcrt1tioelzuijrdf    6652015
4qhvspvn99sztlrb72k93ni7c    6644136
52g08pzzb01r4fc2uthv8usct    7130466
5i7qsfuhf36i6ovave4sx4ex9    7801785
7tta8n2grf3gg5amx0xzdvg7y    7224668
80za5ruj5gcto360b9qork7sq    8771720
85sbrny2gro93muwcbojac7pm    D702979
8jz6bpe4y0tsaabm4vwmftxk3    7770225
8ko7mawf17pubi3wnntsc9nhv    8753555
99bmy9sju4qr3li8vi6hffpxo    7673137
9sja8k8v0qcvjnn56qshlhn9r    7978592
9uyhcyneb5656el7zt5gvpp7c    7072963
ahaoj5ohtaweib3s3vdnszp02    8587602
bgnj1f9v1lk1dv1ub8huoptcn    7712138
bllsvvqd73tofg2fuapbguq8o    8347320
c5slzx1puwc4knzlp2fscgsc4    7472172
cpzye3n1nbek7w2lqto4ukvd7    8262403
cr2kltvj7fipx3l82tbpllkn4    7342918
d92y0cp8fonuflmlem0ha55dj    8748799
db47zkwre7afcye7qtvvhjzs4    D691901
djspokjbtvz5bzu4dkfgzrk63    6823278
dknat50vwrhdwvai9mdvx492i    8231889
dku9mvd0hk0ovxu6qagwqsq9r    6736352
dy3zo7dyvu1f97kdbxjr4vv49    8311147
e4aj8v5yh2z3fc9k2tgx2k2m1    7460015
es6eha3tiqa4ddszfz6n4s6tf    6961735
ewsb4slqduvt2rv4qh1jlnwl2    8020091
f744sjpmjyi9nm2bcz0vxx10i    9710227
fscovl9urf2puvww1bjtvzmwd    7634800
g5lw0felfysbf8xlkfs95peez    7574740
g9205le6lkvyit2cg2zr76kme    7219131
gnpend66urrm3qb0zta7pkzqz    7146305
h87z1zc81onqq66cp9ogyczqm    8209604
hithvnppmovef0ck3eshpybh4    7477151
ia3yye12x3accrlhcz4619qyw    9021080
it014g1l7btp89rhpxc45lb1w    6829437
iwqyjn34bi0rnqao2i2rskpoi    9584259
k0khnopywoahtkjqvq056qsgk    9478685
k2p8f80znhcth7crb4qvya9mp    9710227
k7wmlprtf4jqtktft184w35y3    9128246
kh1ew83y1gm8muv9d854y1en6    8923433
krw9e7dah2uluokad2jvarc8q    7629888
kuh7o0har28yx2fgtclq5zp7e    9000924
lcqkoiwoax4cxdum8v6h686hu    7748038
lp3q0zmnd5wetxoo4v539dgly    7529186
mb4x035xuewsw7y7llynmujxe    7220539
mcc9n9ql7op07qafm99pqzczn    8199016
mpmdoh1dkunj2le04022psgq4    7653695
n6d1fzedx4gl8avp5oys002cm    8587602
navc7yrlcwkohdokqidz4s7kx    7481243
nfo1cqv3f0bgand8f3d39zqxk    8238470
ng7i1fzhxj1dfpd99ojp4woef    8259841
ntwb02080ufp9kpv4ql4xealp    9471829
o1s94oh0lqb7zumtdmuvd73c2    8989303
ol4vq1gvw62hk05rl441hh0hs    8006243
otk4wis6qayq9ae8x0ri84uts    9646227
p1jcbknezvtyh1jpwtckn8fy3    8199668
p9f0a3c7b9wuxust71eq3p5qn    7756930
q742ej7in0xaz6olydkoemwvj    6859271
qypjn48hnmbctv7uxz0arvkmf    8681837
rmkyemrszb9mk0k3iek291d6b    7162649
rprth9bjqxcn50knl104luibz    8254491
ruaecp5im0w6ys5enh5y8wztt    6647208
sm85nwcxhbrmmhlcdcbsjtw9z    8777305
t3y7xs1mcoo39sncjgmvbwkh8    7657938
tg1bj7ir27nlaacual2emrcy5    D779670
tglfzzxalkdg9h37xrb8it0e1    7340776
tt6f30l4dnu10adazp4qss8dh    7420101
u9on7kubufu4lqckzmz9g2cx8    6829437
uhcyk862t5uq6vidjssn0ec3w    D699568
v04il63whdixg2y241cdj8d1b    7809857
vcafd3gl69pcjdeiqs45ibrd7    9027121
vdateapifbmvj8zxwsvpswtdz    9710227
w7wtark3ejbodiislwyoexzsk    7913303
warwake27agvt0qbo0e17000v    6846455
wfxw41c07ep1fn8g7kmrya28x    6760679
wnt2joalgg9zvp9me1p3f52s1    7003808
wsnvf4qiwwbs92rgueppuy3fr    7873695
x6qiv8b5gk2i05j01uhdizo5i    9647857
x72w3ikixh3tndwrpbsn4z2s3    7565549
xlva56ia68zxftfgaz10v30th    8243847
xr7pq1yvl7tpy5xt74qhsfial    D689354
yj2xe9hem9z6sl3atqf5l9oy3    7917588
yt2xzc13kqjf3fge2yo7qxvfd    7237264
yxq28zjj1kedhm8h8rlqko0v7    9419816
zf7pb6osrzc3exsaug7y4qlnv    D615040
zkm1qw8n8lk9ju4ci5bqwzb96    D677066
 

WouterJ
Last seen: 11/08/2017 - 11:15
Joined: 11/07/2017 - 04:44
Example row

I'd like to illustrate the above mentioned issue with the example below

069aj5agwj1y51tlk1qark54j    D675840    "Kuchera, Ben, “The Transformer of Office Chairs: a Review of the Trey Chair,” [online], Aug. 15, 2007. Retrieved from the Internet: http://arstechnica.com
eviewsother    he-transformer-of-office-chairs-a-review-of-the-trey-chairs.ars (8 pgs.)."    1

You can see a newline has been inserted just after "http://arstechnica.com" and the line continues with "eviewsother". It is safe to assume the actual text was "http://arstechnica.com/reviewsother" and the /r has been replaced with a newline at some point.

PVTeam
Role: moderator
Last seen: 04/24/2024 - 12:31
Joined: 10/17/2017 - 10:47
Thank you for your comments…

Thank you for your comments on how to improve on PV data quality. We are planning to include your issue in the next PV database update. 

~PatentsView Team