Why are there discrepancies in PatentsView data?
PatentsView data come from multiple sources. The core data are parsed from raw patent files released by the USPTO at https://bulkdata.uspto.gov/. The files are available in three formats, depending on the time period: plain text for 1976-2001, XML v2.5 for 2002-2004, and XML v4 for 2005 to present. These files contain the main bibliographic information about granted patents, such as grant date, filing date, inventor names and locations, assignee names and locations, related documents, citations, and other fields. Discrepancies in these files are mostly due to data inconsistencies at the time of filing or are introduced when the patent documents are scanned into digital format. The PatentsView team employs several programmatic methods to resolve these inconsistencies during parsing and post-processing. We have also developed, and continue to apply, a dedicated quality control procedure to ensure the integrity and quality of the PatentsView database before it is released to the community.
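The mapping from grant year to raw file format described above can be sketched as a small lookup. This is only an illustration: the function name `raw_format_for_year` is hypothetical and is not part of any PatentsView or USPTO tool, and the year boundaries are taken directly from the answer above.

```python
def raw_format_for_year(grant_year: int) -> str:
    """Return the raw file format for a grant year (hypothetical helper).

    Boundaries follow the periods listed in the answer above:
    plain text for 1976-2001, XML v2.5 for 2002-2004, XML v4 from 2005 on.
    """
    if grant_year < 1976:
        # The answer above does not describe any files before 1976.
        raise ValueError("no raw files are described for years before 1976")
    if grant_year <= 2001:
        return "plain text"
    if grant_year <= 2004:
        return "XML v2.5"
    return "XML v4"


# Print the format for a few years on either side of each boundary.
for year in (1976, 2001, 2002, 2004, 2005, 2020):
    print(year, raw_format_for_year(year))
```

A parser that handles the full time span would need one code path per format, which is one reason inconsistencies can differ between periods.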
How do you evaluate the quality of disambiguation algorithms?
At the 2015 Inventor Disambiguation workshop, the PatentsView team presented an evaluation approach based on available "ground truth" data from earlier academic studies that provide disambiguated inventor identities. Six research teams presented exciting new computational approaches for identifying unique inventor entities across 40 years of USPTO patent data. Nicholas Monath and Andrew McCallum from the University of Massachusetts Amherst authored the successful algorithm, which was integrated into the PatentsView data platform in March 2016.
Because the disambiguation of inventors, assignees, locations, and lawyers is an ongoing effort, some errors are likely to appear in PatentsView query results. The team welcomes feedback as we continue to improve our disambiguation methodology.
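To make the idea of evaluating against "ground truth" data more concrete, here is a minimal sketch of pairwise precision and recall, a common way to score entity disambiguation. This is an assumption for illustration only: the answer above does not state which metrics the PatentsView team used, and the function names and toy record IDs below are hypothetical.

```python
from itertools import combinations


def same_entity_pairs(assignment):
    """Return the set of record pairs that are assigned to the same entity."""
    groups = {}
    for record, entity in assignment.items():
        groups.setdefault(entity, []).append(record)
    pairs = set()
    for records in groups.values():
        # Sort so each pair has one canonical ordering.
        pairs.update(combinations(sorted(records), 2))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Score predicted entity IDs against ground-truth IDs (hypothetical helper).

    Only records present in both mappings are scored.
    """
    shared = predicted.keys() & truth.keys()
    pred_pairs = same_entity_pairs({r: predicted[r] for r in shared})
    true_pairs = same_entity_pairs({r: truth[r] for r in shared})
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Toy data: records r1-r3 belong to inventor A, r4 to inventor B.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
# A made-up algorithm output that splits A and wrongly merges r3 with r4.
predicted = {"r1": "x", "r2": "x", "r3": "y", "r4": "y"}

precision, recall = pairwise_precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points to records from different people being merged into one entity, while low recall points to one person's records being split across several entities.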