Pagination Not Working

7 posts

Mon, 01/27/2025 - 00:37

bxb888

Pagination Not Working

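import json
import time

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your PatentsView API key

assignee_type = [2, 3]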
fields = [
    "patent_id",
    "patent_type",
    "application.filing_date",
    "assignees.assignee_organization",
    "assignees.assignee_type",
    "assignees.assignee_country",
    "assignees.assignee_sequence",
    "assignees.assignee_id",
    "ipcr.ipc_sequence",
    "ipcr.ipc_section",
    "patent_num_us_patents_cited",
    "inventors.inventor_country",
    "inventors.inventor_id",
]

query = {
    "_and": [
        {"patent_type":"utility"},
        {"assignees.assignee_type": assignee_type},
        {"_gte": {"application.filing_date": "1989-01-01"}},
        {"_lte": {"application.filing_date": "2023-12-31"}}
    ]
}

field_list = json.dumps(fields)
sort_param = json.dumps([{"patent_id": "asc"}])

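# Initial URL. The options JSON is built outside the f-string because nested
# double quotes inside an f-string are a syntax error before Python 3.12.
options = json.dumps({"size": 1000})
url = f"https://search.patentsview.org/api/v1/patent/?q={json.dumps(query)}&f={field_list}&s={sort_param}&o={options}"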

REQUEST_LIMIT = 45
REQUEST_INTERVAL = 60
requests_made = 0
last_request_time = 0

def fetch_patent_data(url, api_key):
    global requests_made, last_request_time

    current_time = time.time()
    time_since_last_request = current_time - last_request_time

    if requests_made >= REQUEST_LIMIT and time_since_last_request < REQUEST_INTERVAL:
        sleep_time = REQUEST_INTERVAL - time_since_last_request
        print(f"Rate limit reached. Sleeping for {sleep_time:.2f} seconds...")
        time.sleep(sleep_time)
        requests_made = 0  # Reset counter after sleep

    headers = {"X-Api-Key": api_key}
    response = requests.get(url, headers=headers)
    requests_made += 1
    last_request_time = time.time()


    if response.status_code == 200:
        data = response.json()
        return data["patents"]
    else:
        # Report the status code and the API's X-Status-Reason headers
        status_reason = response.headers.get("X-Status-Reason")
        status_reason_code = response.headers.get("X-Status-Reason-Code")
        print("Error fetching data:")
        print(f"  Status Code: {response.status_code}")
        print(f"  X-Status-Reason: {status_reason}")
        print(f"  X-Status-Reason-Code: {status_reason_code}")
        print(f"  Response Text: {response.text}")
        return []

all_patent_data = []
page_count = 0  # pages fetched so far (avoids shadowing the built-in iter)
while True:
    patent_data = fetch_patent_data(url, API_KEY)
    if not patent_data:
        break
    all_patent_data.extend(patent_data)
    page_count += 1
    print(page_count)
    if len(patent_data) < 1000: #Check if less than 1000 results were returned which indicates end of pagination
        print(len(all_patent_data))
        break

    # Prepare the URL for the next page using the last patent_id
    last_patent_id = patent_data[-1]["patent_id"]
    print(last_patent_id)
    options = json.dumps({"after": last_patent_id, "size": 1000})
    url = f"https://search.patentsview.org/api/v1/patent/?q={json.dumps(query)}&f={field_list}&s={sort_param}&o={options}"

print("Patent data downloaded. Convert to csv file next")

Mon, 01/27/2025 - 08:24

Russ

patent_id needs padding

Hey bxb888,

I had the same problem, but in R! Since a change around November, the patent_id has to be zero-padded in the `after` parameter, and nowhere else. I'm far from Pythonic, but here's my solution:

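import re

def zero_pad(patent_id):
    # Left-pad to 8 characters, then move the zeroes behind any letter
    # prefix, e.g. "RE36479" -> "0RE36479" -> "RE036479".
    return re.sub(
        pattern=r'^(0+)([A-Z]+)(\d+)',
        repl=r'\2\1\3',
        string=patent_id.zfill(8)
    )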
    
# usage in my code:
if primary_key == "patent_id":
    after = zero_pad(after)

Utility patent_ids get leading zeroes where necessary, and non-utility patent_ids get their numeric portion padded, e.g. RE036479.
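
Applied to the loop in the original post, only the value sent in `after` needs to change. Here is a minimal sketch of that, assuming the eight-character width that `zero_pad` implies (I have not checked it against the API documentation). `pad_after_value` and `next_page_options` are hypothetical helper names, not part of the API:

```python
import json
import re


def pad_after_value(patent_id: str) -> str:
    """Pad a patent_id to eight characters, keeping any letter prefix first.

    Assumption: the width of 8 is taken from the zero_pad function in this
    thread, not from the API documentation.
    """
    match = re.fullmatch(r"([A-Z]*)(\d+)", patent_id)
    if match is None:
        # Unknown shape: send it unchanged rather than guess.
        return patent_id
    prefix, digits = match.groups()
    return prefix + digits.zfill(8 - len(prefix))


def next_page_options(last_patent_id: str, size: int = 1000) -> str:
    """Build the JSON string for the `o` parameter of the next page request."""
    return json.dumps({"after": pad_after_value(last_patent_id), "size": size})


print(pad_after_value("9999999"))   # 09999999
print(pad_after_value("RE36479"))   # RE036479
print(next_page_options("9999999"))
```

In the original loop, the next-page URL would then end in `&o={next_page_options(last_patent_id)}`, while the `patent_id` values stored from the results stay unpadded.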

On the plus side, the API now sorts patent_id more naturally: for example, ids below 10 million no longer come after ids above 10 million (a numeric-like sort instead of an alphabetic one).
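
To see the difference between the two orderings, here is a small illustration in plain Python. It does not call the API, and the ids are made up for the example:

```python
ids = ["10000000", "9999999", "10000001"]

# A plain string sort compares character by character, so "1..." sorts before "9...".
alpha_order = sorted(ids)
print(alpha_order)    # ['10000000', '10000001', '9999999']

# Sorting on the numeric value gives the more natural order described above.
numeric_order = sorted(ids, key=int)
print(numeric_order)  # ['9999999', '10000000', '10000001']

# Zero-padding to a fixed width makes a string sort agree with the numeric one.
padded_order = sorted(i.zfill(8) for i in ids)
print(padded_order)   # ['09999999', '10000000', '10000001']
```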

If you want, you can just send requests and sleep and retry when you get throttled. The PatentsView team has a Python wrapper for the original version of the API, which also produces CSV files. I haven't committed my changes for the new version of the API, but here's the code it uses:

    r = requests.post(url, headers=headers, json=params)

    # sleep then retry on a 429 Too many requests
    if 429 == r.status_code:
        print("Throttled response from the api, retrying in {} seconds".format(r.headers["Retry-After"]))
        time.sleep(int(r.headers["Retry-After"]))  # Number of seconds to wait before sending next request
        r = requests.post(url, headers=headers, json=params)
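
The snippet above retries once. If you want to retry more than once, the same idea can be wrapped in a small loop. This is only a sketch: `send_with_retry` is a hypothetical helper, the 60-second fallback for a missing `Retry-After` header is my assumption, and the demonstration uses canned responses instead of real HTTP calls:

```python
import time
from types import SimpleNamespace


def send_with_retry(send, max_retries=3, sleep=time.sleep):
    """Call send() and retry while the response status is 429.

    send is any zero-argument callable returning an object with
    .status_code and .headers, for example
    lambda: requests.post(url, headers=headers, json=params).
    """
    response = send()
    for _ in range(max_retries):
        if response.status_code != 429:
            break
        # Retry-After is read as a number of seconds, as in the snippet above.
        # Assumption: fall back to 60 seconds if the header is missing.
        wait_seconds = int(response.headers.get("Retry-After", 60))
        print(f"Throttled, retrying in {wait_seconds} seconds")
        sleep(wait_seconds)
        response = send()
    return response


# Demonstration with canned responses; waits are recorded instead of slept.
canned = iter([
    SimpleNamespace(status_code=429, headers={"Retry-After": "2"}),
    SimpleNamespace(status_code=200, headers={}),
])
waits = []
result = send_with_retry(lambda: next(canned), sleep=waits.append)
print(result.status_code, waits)  # 200 [2]
```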

I hope this helps
Russ Allen

Mon, 01/27/2025 - 11:31

Russ

no good way

Hi Steve,

What I do is make HEAD requests from time to time on the JSON object that the API team's Swagger UI page is based on. The timestamp seems to be updated when there's an API release (maybe a three- or four-week release cycle?).

curl -I https://search.patentsview.org/static/openapi.json

When I notice a change in Last-Modified, I run the test cases in the R package for the API (I'm a contributor). If a test case breaks, I try to figure out what changed, or open an API bug if I can't figure it out. If my Python were better I'd try to port the R package!
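
For anyone who would rather script the check than run curl by hand, here is a rough Python equivalent using only the standard library. `fetch_last_modified` and `is_newer` are hypothetical names, and where the previous value gets stored between runs is left up to you:

```python
import urllib.request
from email.utils import parsedate_to_datetime

OPENAPI_URL = "https://search.patentsview.org/static/openapi.json"


def fetch_last_modified(url=OPENAPI_URL):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def is_newer(current, previous):
    """Return True if the current HTTP date is later than the previous one."""
    if previous is None:
        return current is not None
    if current is None:
        return False
    return parsedate_to_datetime(current) > parsedate_to_datetime(previous)


# Offline demonstration with made-up header values (no request is sent here).
changed = is_newer("Tue, 04 Feb 2025 10:00:00 GMT", "Mon, 06 Jan 2025 10:00:00 GMT")
print(changed)  # True
```

A scheduled job could call `fetch_last_modified()`, compare the result with the value saved from the previous run, and kick off the test suite when `is_newer` returns True.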

Russ
