4 posts
Last seen: 09/05/2023 - 05:28
Joined: 06/28/2021 - 07:28
Pre-Grant Data discrepancy with PAIR/PatEx

I am currently using the PAIR/PatEx data together with PatentsView, which I rely on for the disambiguated inventor and assignee information. I have filtered both datasets according to the data description, but a discrepancy of about 171,000 applications remains.

Briefly, I did the following:
a) Dropped all observations before 2006 and after 2019
b) Dropped all non-utility non-provisional applications
c) Confirmed that there are no duplicates by application number in PatEx.
The resulting sizes were:
PatentsView: 4,793,067
PatEx: 4,964,247
I looked up specific application numbers from PatEx that failed to match, and these were not in PatentsView.
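A minimal base R sketch of the duplicate check and the lookup described above. The data frames `patex` and `pv`, the column name `application_number`, and the helper `pad_app_number` are hypothetical, and the toy rows are illustrative only. The zero-padding step rests on an assumption: if one source stores application numbers as numbers, leading zeros would be lost and otherwise identical records would fail to match. Whether that applies to these files is not confirmed here.

```r
# Hypothetical toy tables; the real ones would be read from the PatEx and
# PatentsView bulk downloads.
patex <- data.frame(
  application_number = c("13786274", "5629532", "14169700"),
  stringsAsFactors = FALSE
)
pv <- data.frame(
  application_number = c("05629532", "14169700"),
  stringsAsFactors = FALSE
)

# Assumption: application numbers are 8 characters wide; left-pad with zeros
# so that "5629532" and "05629532" compare as equal.
pad_app_number <- function(x, width = 8L) {
  x <- trimws(as.character(x))
  paste0(strrep("0", pmax(0L, width - nchar(x))), x)
}
patex$application_number <- pad_app_number(patex$application_number)
pv$application_number <- pad_app_number(pv$application_number)

# Step (c): confirm there are no duplicate application numbers in PatEx.
stopifnot(anyDuplicated(patex$application_number) == 0)

# PatEx application numbers with no counterpart in PatentsView.
unmatched <- patex$application_number[
  !patex$application_number %in% pv$application_number
]

# Net difference between the two row counts reported in the post.
net_gap <- 4964247 - 4793067
```

The net difference in row counts is only a lower bound on the mismatch, since each dataset can contain applications that the other lacks.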

Is the Pre-Grant Data only a subset of the published applications (utility patents after 2005)? How is this subset defined?

Role: moderator
Last seen: 04/24/2024 - 12:31
Joined: 10/17/2017 - 10:47

Hi Sanjana,

Thank you for reaching out to our team! The Pre-Grant Data we release should contain information on all applications (2005-present) and is not intended to be a subset of any larger data source. 

We collect and extract this data based on the XML files that USPTO releases through their bulk download resource page. For the Pre-Grant data specifically, we extract all the XML data from the Patent Application Full Text Data (No Images) (MAR 15, 2001 - PRESENT) section.

Thank you for bringing the discrepancy between these data sources to our attention. We will investigate the differences in data to determine whether this is related to missing data in the XML files themselves or something in our database. If you have any examples of applications that look to be missing from our data that would be very helpful.

Please let us know if you have any further questions.



Last seen: 09/05/2023 - 05:28
Joined: 06/28/2021 - 07:28

Dear PVTeam,

Thank you for getting back to me so quickly. I truly appreciate it!

Some of the application numbers I could not find a match for include: 13786274, 13447927, 14169700, 14169702, 14172403, 16085602, 16307332, 13373750, 12803968, 59000004.

I had a few follow up questions I was hoping to check with you on:

1. I noticed that there was going to be a new release in June and was wondering when this will be?

2. Is the disambiguation also complete for 2005 onwards?

3. Can I access an updated or more detailed pre-grant data dictionary anywhere?

4. I also noticed some irregular dates: e.g., application numbers 05629532, 05497504, and 06557282 have the dates 0975-11-06, 1074-08-14, and 1873-11-18. These are unlikely to be an encoding problem, as most other dates look fine. I also compared these dates to the filing_date in PAIR: some were missing, while others had a different date. For example, application number 05945628 has the date 1078-09-25 in PV but a filing_date of 1978-09-25 in PAIR.
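One way to screen for dates like those in item 4 is sketched below in base R. It uses only the application numbers and dates quoted above. The data frame names, the `min_year` threshold of 1900, and the helper `n_diff_chars` are assumptions made for illustration. The dates are handled as strings so that the check does not depend on how a given platform parses years before 1000.

```r
# Dates quoted in the post; pv and pair are hypothetical table names.
pv <- data.frame(
  application_number = c("05629532", "05497504", "06557282", "05945628"),
  date = c("0975-11-06", "1074-08-14", "1873-11-18", "1078-09-25"),
  stringsAsFactors = FALSE
)
pair <- data.frame(
  application_number = "05945628",
  filing_date = "1978-09-25",
  stringsAsFactors = FALSE
)

# Assumed plausibility threshold: flag any year earlier than this.
min_year <- 1900L
pv$year <- as.integer(substr(pv$date, 1, 4))
pv$suspect <- is.na(pv$year) | pv$year < min_year

# Left join on application number; PAIR dates are NA where nothing matched.
checked <- merge(pv, pair, by = "application_number", all.x = TRUE)

# Count the character positions at which two equal-length strings differ.
# A result of 1 suggests a single-digit error such as 1078 vs 1978.
n_diff_chars <- function(a, b) {
  mapply(function(x, y) {
    if (is.na(x) || is.na(y) || nchar(x) != nchar(y)) return(NA_integer_)
    sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
  }, a, b, USE.NAMES = FALSE)
}
checked$chars_differ <- n_diff_chars(checked$date, checked$filing_date)
```

Rows with `suspect == TRUE` and `chars_differ == 1` are candidates for a single-digit error in the source, but that remains a guess until it is checked against the original XML.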

Look forward to hearing from you.





Last seen: 09/05/2023 - 05:28
Joined: 06/28/2021 - 07:28

Dear PVT Team,

I was hoping to check if there was any update on the discrepancy between the datasets. I just checked the match again with the latest data and there still appears to be a significant difference:

Mismatched applications: 469410

Applications in PatEx but not PV: 147025

Applications in PV but not PatEx: 322385
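For reference, counts like the three above can be produced with set differences in both directions. This is a base R sketch on made-up identifiers, not the code that produced the reported numbers. The vector names `patex_ids` and `pv_ids` are hypothetical.

```r
# Made-up identifiers standing in for the filtered application numbers.
patex_ids <- c("A1", "A2", "A3", "A4")
pv_ids <- c("A3", "A4", "A5")

# Applications present in one source but absent from the other.
only_patex <- setdiff(patex_ids, pv_ids)
only_pv <- setdiff(pv_ids, patex_ids)

counts <- c(
  in_patex_not_pv = length(only_patex),
  in_pv_not_patex = length(only_pv),
  mismatched = length(only_patex) + length(only_pv)
)

# Consistency check on the figures reported in the post.
reported_total_ok <- (147025 + 322385) == 469410
```

Keeping `only_patex` and `only_pv` as vectors rather than just their lengths makes it easy to pull sample application numbers from each side for inspection.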

I used the same filters as last time and have included the R code below for your reference. I would greatly appreciate your advice on how to proceed. Is a new version addressing the discrepancy likely to be released any time soon?



filter(filing_date > "2005-12-31" & filing_date < "2019-01-01") %>% 
 filter(date > "2005-12-31" & date < "2019-01-01") %>% 
 filter(type_application == "utility") %>%