Hi
First of all, thank you for providing this data. Unfortunately though the data contains some problems that make it unsuitable to upload into common database platforms, like MySQL and MS SQL Server. While I was loading the data, I noticed that some (probably not more than a handful) records contain tabs and newlines inside text fields. These fields are encapsulated by double quotes, so technically this is correct. Yet this is such an uncommon practice that MySQL and MS SQL Server are not even designed to be able to load it.
After digging a bit deeper, I noticed these newlines and tab characters appear in pieces of text that look like local paths (c:\users\....) or URL's. If a \t \n \r appears in this path, the character sequence has been replaced by the control character (tab, newline). This indicates that something went wrong while exporting the data, or worse, while loading the data into your own database.
Some examples would be the patent table and the otherreference table. For the otherreference table I have been able to compile a list of affected records. I will attach the uuid-patent_id pairs below.
I can suggest a couple of fixes, from easy to implement, to harder:
- Perform a search in all text columns for tab and newline characters, and remove them. Doing this inside your database is much easier than doing it once the data has been exported. You should do a visual check, but they seem to be errors in the data to begin with, so manipulating the data a bit more shouldn't be a problem.
- If you are convinced the tabs and newlines need to remain in the data, it would be better to choose a field and newline separator during the export that does not appear inside your data. Although it is very common to use \r\n as line-separator and \t as field-separator, this is no fixed requirement, and using another character actually avoids the issue of encapsulation, and the issue of MySQL and MS SQL Server not supporting newlines within encapsulated fields.
- And then the hardest solution... Since this issue indicates a possibility that at some point the data has been loaded wrongly, it might be interesting to try and find the loading procedure that caused this issue. This might prevent future issues. I have been in contact with another user who has had no issue loading previous versions of the data into MySQL, hence I suspect this is a recent issue.
I'm looking forward to using this data, yet don't have the time required to implement a custom fix to remove the "invalid" newlines and tabs from the current source files.
Regards
Wouter
"Appendix": List of otherreference rows containing tab and newline issues in the text column
069aj5agwj1y51tlk1qark54j D675840
0iwdsypz792x5rlh092wt8i3j 7978592
0jdla7l5khavwi3tdb7g1ho88 8402319
1cs6kdllx4j7c3hqqmlre3xvz 7499590
1kzjntfxshx4npa5m3xoi5f7b 6647208
1v5a3tv1i9gsjytl5y8zblwhc 7934254
2fr0ycudtrwb8eqczsw81fab9 7698638
39xojhtedrqj0z15hve5e21a2 7446180
3iqu4tp0jmiyyz49ytev43ski 7921459
3pye22pu2qnhcyvzfeg1xxtrz 7917588
46m40s6o1fsvnmxaxxr00jja1 8929039
4aflaft1ye5bimosdgirco8en 9419824
4j5go4phadsspzpjss2sn2mm6 8238467
4mziublufpmnsjw54qh3v5o2n 7178166
4qbg377fvcrt1tioelzuijrdf 6652015
4qhvspvn99sztlrb72k93ni7c 6644136
52g08pzzb01r4fc2uthv8usct 7130466
5i7qsfuhf36i6ovave4sx4ex9 7801785
7tta8n2grf3gg5amx0xzdvg7y 7224668
80za5ruj5gcto360b9qork7sq 8771720
85sbrny2gro93muwcbojac7pm D702979
8jz6bpe4y0tsaabm4vwmftxk3 7770225
8ko7mawf17pubi3wnntsc9nhv 8753555
99bmy9sju4qr3li8vi6hffpxo 7673137
9sja8k8v0qcvjnn56qshlhn9r 7978592
9uyhcyneb5656el7zt5gvpp7c 7072963
ahaoj5ohtaweib3s3vdnszp02 8587602
bgnj1f9v1lk1dv1ub8huoptcn 7712138
bllsvvqd73tofg2fuapbguq8o 8347320
c5slzx1puwc4knzlp2fscgsc4 7472172
cpzye3n1nbek7w2lqto4ukvd7 8262403
cr2kltvj7fipx3l82tbpllkn4 7342918
d92y0cp8fonuflmlem0ha55dj 8748799
db47zkwre7afcye7qtvvhjzs4 D691901
djspokjbtvz5bzu4dkfgzrk63 6823278
dknat50vwrhdwvai9mdvx492i 8231889
dku9mvd0hk0ovxu6qagwqsq9r 6736352
dy3zo7dyvu1f97kdbxjr4vv49 8311147
e4aj8v5yh2z3fc9k2tgx2k2m1 7460015
es6eha3tiqa4ddszfz6n4s6tf 6961735
ewsb4slqduvt2rv4qh1jlnwl2 8020091
f744sjpmjyi9nm2bcz0vxx10i 9710227
fscovl9urf2puvww1bjtvzmwd 7634800
g5lw0felfysbf8xlkfs95peez 7574740
g9205le6lkvyit2cg2zr76kme 7219131
gnpend66urrm3qb0zta7pkzqz 7146305
h87z1zc81onqq66cp9ogyczqm 8209604
hithvnppmovef0ck3eshpybh4 7477151
ia3yye12x3accrlhcz4619qyw 9021080
it014g1l7btp89rhpxc45lb1w 6829437
iwqyjn34bi0rnqao2i2rskpoi 9584259
k0khnopywoahtkjqvq056qsgk 9478685
k2p8f80znhcth7crb4qvya9mp 9710227
k7wmlprtf4jqtktft184w35y3 9128246
kh1ew83y1gm8muv9d854y1en6 8923433
krw9e7dah2uluokad2jvarc8q 7629888
kuh7o0har28yx2fgtclq5zp7e 9000924
lcqkoiwoax4cxdum8v6h686hu 7748038
lp3q0zmnd5wetxoo4v539dgly 7529186
mb4x035xuewsw7y7llynmujxe 7220539
mcc9n9ql7op07qafm99pqzczn 8199016
mpmdoh1dkunj2le04022psgq4 7653695
n6d1fzedx4gl8avp5oys002cm 8587602
navc7yrlcwkohdokqidz4s7kx 7481243
nfo1cqv3f0bgand8f3d39zqxk 8238470
ng7i1fzhxj1dfpd99ojp4woef 8259841
ntwb02080ufp9kpv4ql4xealp 9471829
o1s94oh0lqb7zumtdmuvd73c2 8989303
ol4vq1gvw62hk05rl441hh0hs 8006243
otk4wis6qayq9ae8x0ri84uts 9646227
p1jcbknezvtyh1jpwtckn8fy3 8199668
p9f0a3c7b9wuxust71eq3p5qn 7756930
q742ej7in0xaz6olydkoemwvj 6859271
qypjn48hnmbctv7uxz0arvkmf 8681837
rmkyemrszb9mk0k3iek291d6b 7162649
rprth9bjqxcn50knl104luibz 8254491
ruaecp5im0w6ys5enh5y8wztt 6647208
sm85nwcxhbrmmhlcdcbsjtw9z 8777305
t3y7xs1mcoo39sncjgmvbwkh8 7657938
tg1bj7ir27nlaacual2emrcy5 D779670
tglfzzxalkdg9h37xrb8it0e1 7340776
tt6f30l4dnu10adazp4qss8dh 7420101
u9on7kubufu4lqckzmz9g2cx8 6829437
uhcyk862t5uq6vidjssn0ec3w D699568
v04il63whdixg2y241cdj8d1b 7809857
vcafd3gl69pcjdeiqs45ibrd7 9027121
vdateapifbmvj8zxwsvpswtdz 9710227
w7wtark3ejbodiislwyoexzsk 7913303
warwake27agvt0qbo0e17000v 6846455
wfxw41c07ep1fn8g7kmrya28x 6760679
wnt2joalgg9zvp9me1p3f52s1 7003808
wsnvf4qiwwbs92rgueppuy3fr 7873695
x6qiv8b5gk2i05j01uhdizo5i 9647857
x72w3ikixh3tndwrpbsn4z2s3 7565549
xlva56ia68zxftfgaz10v30th 8243847
xr7pq1yvl7tpy5xt74qhsfial D689354
yj2xe9hem9z6sl3atqf5l9oy3 7917588
yt2xzc13kqjf3fge2yo7qxvfd 7237264
yxq28zjj1kedhm8h8rlqko0v7 9419816
zf7pb6osrzc3exsaug7y4qlnv D615040
zkm1qw8n8lk9ju4ci5bqwzb96 D677066