f.sun
Last seen: 07/18/2022 - 16:41
Joined: 07/07/2022 - 00:02
ipcr.tsv: duplicates when matching patent_id to subclass

<p>
Using <code>ipcr.tsv</code>, I am trying to match patents to their subclasses. Here is my code (it uses the Python library Polars):
</p>

<pre>
import polars as pl

# Lazily scan the tab-separated file, reading the four columns of interest as strings.
lf = (
    pl.scan_csv(
        file="patent_analysis/data/citations_dummy/ipcr.tsv",
        sep="\t",
        dtypes={
            "patent_id": pl.Utf8,
            "section": pl.Utf8,
            "ipc_class": pl.Utf8,
            "subclass": pl.Utf8
        }
    )
    .select(
        ["patent_id", "section", "ipc_class", "subclass"]
    )
)
</pre>

<p>
However, many patent_id values appear in more than one row:
</p>

<pre>
print(lf.filter(pl.col("patent_id").is_duplicated()).unique().collect())

shape: (701194, 4)
┌───────────┬─────────┬───────────┬──────────┐
│ patent_id ┆ section ┆ ipc_class ┆ subclass │
│ ---       ┆ ---     ┆ ---       ┆ ---      │
│ str       ┆ str     ┆ str       ┆ str      │
╞═══════════╪═════════╪═══════════╪══════════╡
│ 10048897  ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 9486337   ┆ B       ┆ 21        ┆ C        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 9507950   ┆ G       ┆ 06        ┆ Q        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 8979528   ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ...       ┆ ...     ┆ ...       ┆ ...      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10237684  ┆ G       ┆ 1         ┆ S        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 11069981  ┆ H       ┆ 1         ┆ Q        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10143461  ┆ A       ┆ 61        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10540238  ┆ G       ┆ 6         ┆ F        │
└───────────┴─────────┴───────────┴──────────┘
</pre>

<p>
Some rows with the same patent_id have different subclasses.
</p>

<pre>
lf.filter(pl.col("patent_id") == "10048897").unique().collect()
shape: (3, 4)
┌───────────┬─────────┬───────────┬──────────┐
│ patent_id ┆ section ┆ ipc_class ┆ subclass │
│ ---       ┆ ---     ┆ ---       ┆ ---      │
│ str       ┆ str     ┆ str       ┆ str      │
╞═══════════╪═════════╪═══════════╪══════════╡
│ 10048897  ┆ G       ┆ 06        ┆ F        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10048897  ┆ H       ┆ 04        ┆ L        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 10048897  ┆ H       ┆ 03        ┆ M        │
└───────────┴─────────┴───────────┴──────────┘
</pre>

<p>
Considering the duplicates, what's the best way to match a patent to its subclass?
</p>

f.sun
Never mind

Ah, never mind! I didn't realize a patent could have multiple subclasses.
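
Since a patent can carry several classifications, one way to handle the repeated patent_id values is to collect every distinct subclass per patent instead of expecting a single row. Below is a minimal sketch of that idea in plain Python rather than Polars, so it does not depend on a particular Polars version. The sample rows are hypothetical (the values for 10048897 and 9486337 are copied from the output above, and one exact duplicate row is added for illustration), and the helper name subclasses_by_patent is my own.

```python
from collections import defaultdict

# Hypothetical sample rows with the same four columns as the query above:
# (patent_id, section, ipc_class, subclass).
rows = [
    ("10048897", "G", "06", "F"),
    ("10048897", "H", "04", "L"),
    ("10048897", "H", "03", "M"),
    ("10048897", "G", "06", "F"),  # exact duplicate, added for illustration
    ("9486337", "B", "21", "C"),
]


def subclasses_by_patent(rows):
    """Map each patent_id to a sorted list of its distinct subclass codes."""
    grouped = defaultdict(set)
    for patent_id, section, ipc_class, subclass in rows:
        # Concatenate section + class + subclass, e.g. "G" + "06" + "F" -> "G06F".
        grouped[patent_id].add(f"{section}{ipc_class}{subclass}")
    return {pid: sorted(codes) for pid, codes in grouped.items()}


result = subclasses_by_patent(rows)
for patent_id, codes in result.items():
    print(patent_id, codes)
```

Using a set per patent drops exact duplicate rows while keeping every genuinely different subclass. The same grouping could presumably be expressed as a group-by aggregation in Polars, and whether to keep all subclasses or pick one per patent depends on the analysis.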