9 posts
Russell Jurney
Last seen: 04/10/2019 - 15:30
Joined: 03/29/2019 - 16:44
Unawarded Patent Application Text?

I am interested in building a classifier that estimates the odds that a patent will be granted, based on an application's abstract, summary, or full text. I have granted patents from the PatentsView dataset, but I also need patent applications that may or may not have been granted. I have read that this data is public, but I can't find it.

Where can I find patent application full text?

Thanks!

Russell Jurney
Last seen: 04/10/2019 - 15:30
Joined: 03/29/2019 - 16:44
I've found patent…

I've found patent application data, but now I need to tie it to granted patents.

https://developer.uspto.gov/product/patent-application-full-text-dataxml

PVTeam
Role: moderator
Last seen: 03/15/2024 - 15:25
Joined: 10/17/2017 - 10:47
Re: Response to Unawarded Patent Application Text

Hi Russell,

You can visit our service desk and request the current version of the applications database, which may have the information you are looking for.

Thanks,

PVTeam

Russ
Last seen: 03/15/2024 - 23:34
Joined: 11/14/2017 - 22:15
application serial numbers

Russell,

You can join grants and applications on the application serial number. application-reference.document-id.doc-number is common to both the grant xml and the application xml, with a few caveats:

  1. Application xml is only available for applications published since 2001 (http://patents.reedtech.com/parbft.php), while grant xml is available for patents issued from 1976 on (http://patents.reedtech.com/pgrbbib.php).
  2. It is possible for an inventor to suppress publication of an application, which I assume excludes it from the application xml files. See https://www.uspto.gov/web/offices/pac/mpep/s1122.html
  3. I found 305 granted patents that aren't available in the grant xml files. At least 9 of them are recent enough that their applications should be in application xml files.
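
As a rough illustration of the join key described above, here is a minimal sketch that pulls application-reference/document-id/doc-number out of one application record and one grant record. The two XML snippets are heavily simplified, made-up stand-ins (real USPTO records carry many more elements and attributes), and the function name application_serial is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified records; the serial and publication numbers are placeholders.
SAMPLE_APPLICATION = """<us-patent-application>
  <us-bibliographic-data-application>
    <publication-reference>
      <document-id><doc-number>PUB-0001</doc-number></document-id>
    </publication-reference>
    <application-reference>
      <document-id><doc-number>SERIAL-0001</doc-number></document-id>
    </application-reference>
  </us-bibliographic-data-application>
</us-patent-application>"""

SAMPLE_GRANT = """<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><doc-number>PATENT-0001</doc-number></document-id>
    </publication-reference>
    <application-reference>
      <document-id><doc-number>SERIAL-0001</doc-number></document-id>
    </application-reference>
  </us-bibliographic-data-grant>
</us-patent-grant>"""


def application_serial(xml_text):
    """Return the application serial number of one record, or None if absent."""
    root = ET.fromstring(xml_text)
    # The same path is searched in both grant and application records.
    return root.findtext(".//application-reference/document-id/doc-number")


app_key = application_serial(SAMPLE_APPLICATION)
grant_key = application_serial(SAMPLE_GRANT)
print(app_key, grant_key, app_key == grant_key)
```

If the two keys match, the application and the grant describe the same filing, which is what makes the join possible.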

 

Russ

Russell Jurney
Last seen: 04/10/2019 - 15:30
Joined: 03/29/2019 - 16:44
Thanks PVTeam and Russ!…

Thanks PVTeam and Russ!

I was able to download 10 years of patent application data using a script from here: https://developer.uspto.gov/product/patent-application-full-text-dataxml. I chopped the compound XML files into individual files using csplit, then used Python/lxml to read the XML in each file and extract the application ID and the text fields that interest me. Next I joined that data to the PatentsView application data to verify that the join key worked, and it did. I used a left outer join to create a label for each application: granted/not granted. Now I have my training data :)
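
For anyone without csplit handy, the splitting step can be approximated in plain Python. This is only a sketch under the assumption that each record in a compound file begins with its own `<?xml` declaration; the function name split_compound_xml and the tiny sample are hypothetical, and this is not the code from the gist.

```python
import re
import xml.etree.ElementTree as ET


def split_compound_xml(text):
    """Split a file of concatenated XML documents into individual documents.

    Assumes every document starts with its own '<?xml ' declaration.
    """
    starts = [match.start() for match in re.finditer(r"<\?xml ", text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Made-up compound file holding two tiny records.
COMPOUND = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><title>First</title></us-patent-application>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><title>Second</title></us-patent-application>\n"
)

documents = split_compound_xml(COMPOUND)
# Parse from bytes so the encoding declaration is honored by the parser.
titles = [ET.fromstring(doc.encode("utf-8")).findtext("title") for doc in documents]
print(len(documents), titles)
```

Each returned string can then be parsed on its own, which is what makes the per-record extraction easy to run in parallel.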

If anyone is curious, the code that I used to do all of the above operations using bash and Python is here: https://gist.github.com/rjurney/0b90d71a25b327214b24f5dbe8f2ee67

It probably needs a bit of editing to run for others, but should speed up the next guy's work doing the same thing. This is part of a book I'm writing at https://github.com/rjurney/deep_products

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; in the pyspark shell `spark` already exists
spark = SparkSession.builder.getOrCreate()

# Load PatentsView mapping between patent application IDs and patent IDs
patent_applications = spark.read.csv('data/application.tsv.gz', header=True, sep='\t')

# Load the JSON patent application records
applications = spark.read.json('data/applications/json/*.jsonl.gz')

# Left join patent applications with patent/application mappings to create a patent_id (granted) / None (ungranted) field.
# Add .limit(50000) to the applications dataframe if you run out of disk space
pas = applications.join(patent_applications, applications.application_id == patent_applications.number, 'left')
pas.createOrReplaceTempView('pas')

# Extract the granted field: 1 if a patent_id was matched (granted), 0 otherwise
final_pas = spark.sql("""SELECT application_id, title, abstract, description, INT(patent_id IS NOT NULL) AS granted FROM pas""")
final_pas.write.json('data/patent_applications/2019-04-06.jsonl.gz', compression='gzip')

# Take a peek!
final_pas.show()

Thanks!

Russ
Last seen: 03/15/2024 - 23:34
Joined: 11/14/2017 - 22:15
very interesting!

Russell,

This looks really interesting, but I think you are including application data that is too recent. It can take years for an application to become an issued patent. I would guess that very few applications under a year old have already become granted patents, but they could be granted in the next few years. Also, I'm not sure if there's a limit to how long you'd have to wait before declaring that an application won't be granted. Ex: a friend of mine applied for a patent in 2004 that was published in 2005 but wasn't granted until 2015. I don't know about my friend, but I gave up on it ever being granted! It's pn 8,935,202 from 20050004928. Check out https://www.uspto.gov/dashboards/patents/main.dashxml for grant statistics. To me it looks like your training data shouldn't include applications published within the last 24 months, at a minimum.

Russ 
 

Russ
Last seen: 03/15/2024 - 23:34
Joined: 11/14/2017 - 22:15
a little more

Russell,

You should also check how current your application.tsv file is. The one that's in downloads now has patents granted through November 27, 2018.

Your join code could have another use here. The application xml has the document number that can be cited in granted patents. If we process only the grant xml, it is impossible to determine whether a cited document number went on to become a granted patent, because the grant xml does not include the document number. It could be added to the application table via a join with the application xml on application serial number. We'd then be able to join the application table to usapplicationcitation on the document number to see if the cited application has become a granted patent. The topic came up here a few months ago but no one provided working code! See http://www.patentsview.org/community/forum/7/topic/107

I also wanted to mention the BulkDownloader at https://github.com/USPTO/PatentPublicData. You could use it to pull down application xml as well as other bulk data files (or, to your original question, find where bulk files are).
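
A toy sketch of that two-step join, using plain dictionaries in place of real tables. All values are made up; `number` and `patent_id` follow the column names used in the Spark code earlier in the thread, while `citing_patent`, `cited_doc`, and the function name resolve_cited_grants are hypothetical.

```python
# Serial number -> published document number, as extracted from application xml (made up).
doc_number_by_serial = {"S1": "PUB-1", "S2": "PUB-2"}

# Application table rows: serial number and the patent it became (made up).
application_table = [{"number": "S1", "patent_id": "P100"}]

# Citations of published applications by granted patents (made up).
citations = [
    {"citing_patent": "P200", "cited_doc": "PUB-1"},
    {"citing_patent": "P201", "cited_doc": "PUB-2"},
]


def resolve_cited_grants(doc_number_by_serial, application_table, citations):
    """Attach the granted patent id, if any, to each cited application."""
    # Step 1: add the document number to the application table via the serial number.
    patent_by_doc = {
        doc_number_by_serial[row["number"]]: row["patent_id"]
        for row in application_table
        if row["number"] in doc_number_by_serial
    }
    # Step 2: join citations to that mapping on the document number.
    return [
        {**citation, "cited_patent": patent_by_doc.get(citation["cited_doc"])}
        for citation in citations
    ]


resolved = resolve_cited_grants(doc_number_by_serial, application_table, citations)
for row in resolved:
    print(row)
```

A `cited_patent` of None means the cited application has no matching grant in the table, either because it was never granted or because the table is not current enough.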

I hope this helps,
Russ/mustberuss on github
 

Russell Jurney
Last seen: 04/10/2019 - 15:30
Joined: 03/29/2019 - 16:44
Russ, that is exactly what I…
[Image removed: chart of granted applications per year, 2000-2018]

Russ, that is exactly what I do! I join the application number with the applications/granted patent table to get a label for granted/not granted. Counting the number of grants on the latest data produces the above chart, covering 2000-2018. Grants tank after 2015, so the model would either have to leave out more recent data or take the time frame of the grant into account via the right features. I'm not sure how this might work.

The link has code for acquiring the XML patent applications, splitting them into individual files (it worked out easier), processing them in parallel to extract the fields of interest as JSON documents, and then loading that JSON into PySpark and joining it with the applications table from patentsview.org. Then I create this chart. I'm going to blog about it.
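
For readers who want to reproduce that kind of per-year view without Spark, here is a minimal sketch that computes the share of granted applications per year from (year, granted) pairs. The records are made up purely for illustration and the function name grant_rate_by_year is hypothetical; real numbers would come from the joined data.

```python
from collections import defaultdict


def grant_rate_by_year(records):
    """Return {year: fraction of applications labeled granted} for (year, granted) pairs."""
    totals = defaultdict(int)
    grants = defaultdict(int)
    for year, granted in records:
        totals[year] += 1
        grants[year] += granted
    return {year: grants[year] / totals[year] for year in sorted(totals)}


# Made-up labels: older applications have had time to be granted, recent ones have not.
toy_records = [(2014, 1), (2014, 0), (2018, 0), (2018, 0)]

rates = grant_rate_by_year(toy_records)
for year, rate in rates.items():
    print(year, rate)
```

A sharp drop in the most recent years is the signature of applications that simply have not been examined yet, not of applications that were refused.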

 

Russ
Last seen: 03/15/2024 - 23:34
Joined: 11/14/2017 - 22:15
another swing...

Hi Russell,

I don't know how machine learning works, but let me take another swing at this. You can let me know how misguided I am. I'm guessing your grants go through 2018-11-27 (unless pvteam sent you something more recent) but your scripts pulled apps up to 2019-04-01. Nothing granted after 2018-11-27 will be labeled as granted, which may mislead your machine learning. If you repeat this same exercise two years from now, many patent apps published before 2019-04-01 will have been granted, which I'm assuming would yield a different model than one trained today.

So as not to be like the cartoon character sawing off the limb he's sitting on, I wouldn't pull applications more recent than 2016-11-27. That would give them two years after publication to become granted patents, assuming 24 months as the average grant time. The assumption is that if they weren't granted in the first two years after publication, they won't go on to become granted patents. (You could still be fooled by not waiting long enough, like my friend whose patent was granted a decade after it was published.)
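
The cutoff arithmetic above can be sketched in a few lines. The 24-month window and the 2018-11-27 date come from this thread; the helper names months_before and keep_for_training are hypothetical, and the window length is an assumption you may want to tune.

```python
import calendar
from datetime import date


def months_before(d, months):
    """Return the date `months` calendar months before `d`, clamping the day if needed."""
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Last grant date covered by the application.tsv mentioned in this thread.
GRANTS_THROUGH = date(2018, 11, 27)
# Assumed waiting period before an ungranted application counts as "not granted".
WINDOW_MONTHS = 24

CUTOFF = months_before(GRANTS_THROUGH, WINDOW_MONTHS)


def keep_for_training(published, cutoff=CUTOFF):
    """Keep only applications published on or before the cutoff date."""
    return published <= cutoff


print(CUTOFF)
print(keep_for_training(date(2016, 1, 5)), keep_for_training(date(2019, 4, 1)))
```

Applications published after the cutoff are dropped because their "not granted" label may only reflect that they are still pending.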

I could be wrong, but to me it looks like you aren't accounting for the time it takes an application to become a granted patent.  That's all I'm trying to point out.  I wouldn't mind being totally wrong!

thanks,
Russ