<p>
Using <code>ipcr.tsv</code>, I am trying to match patents to their subclasses. Here is my code, which uses the Python <code>polars</code> library:
</p>
<pre>
import polars as pl

# Lazily scan the TSV, reading the four columns of interest as strings
lf = (
    pl.scan_csv(
        file="patent_analysis/data/citations_dummy/ipcr.tsv",
        sep="\t",
        dtypes={
            "patent_id": pl.Utf8,
            "section": pl.Utf8,
            "ipc_class": pl.Utf8,
            "subclass": pl.Utf8,
        },
    )
    .select(["patent_id", "section", "ipc_class", "subclass"])
)
</pre>
<p>
However, many <code>patent_id</code> values appear on more than one row:
</p>
<pre>
print(lf.filter(pl.col("patent_id").is_duplicated()).unique().collect())
shape: (701194, 4)
┌───────────┬─────────┬───────────┬──────────┐
│ patent_id ┆ section ┆ ipc_class ┆ subclass │
│ ---       ┆ ---     ┆ ---       ┆ ---      │
│ str       ┆ str     ┆ str       ┆ str      │
╞═══════════╪═════════╪═══════════╪══════════╡
│ 10048897  ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 9486337   ┆ B       ┆ 21        ┆ C        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 9507950   ┆ G       ┆ 06        ┆ Q        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 8979528   ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ...       ┆ ...     ┆ ...       ┆ ...      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10237684  ┆ G       ┆ 1         ┆ S        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 11069981  ┆ H       ┆ 1         ┆ Q        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10143461  ┆ A       ┆ 61        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10540238  ┆ G       ┆ 6         ┆ F        │
└───────────┴─────────┴───────────┴──────────┘
</pre>
<p>
Some rows share a <code>patent_id</code> but have different subclasses. For example:
</p>
<pre>
lf.filter(pl.col("patent_id") == "10048897").unique().collect()
shape: (3, 4)
┌───────────┬─────────┬───────────┬──────────┐
│ patent_id ┆ section ┆ ipc_class ┆ subclass │
│ ---       ┆ ---     ┆ ---       ┆ ---      │
│ str       ┆ str     ┆ str       ┆ str      │
╞═══════════╪═════════╪═══════════╪══════════╡
│ 10048897  ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10048897  ┆ H       ┆ 04        ┆ L        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10048897  ┆ H       ┆ 03        ┆ M        │
└───────────┴─────────┴───────────┴──────────┘
</pre>
<p>
Given that a single <code>patent_id</code> can map to several subclasses, what is the best way to match each patent to its subclass?
</p>