Potential issues with citation data

Dear all,

Thanks for your community work. During my research I ran into what I believe are two problems with the 'g_us_patent_citation' data.

1- The citation category (citation_category) is missing for some patents. As far as I know, patents granted from 2001 onward distinguish between citations made by the examiner and citations made by the applicant. In the dataset, however, this variable is consistently missing up until patent 6334220, the very first patent granted in 2002. This does not look like a random error: on Google Patents, the patent documents carry this distinction starting from 6167569, the very first patent granted in 2001.
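In case it helps to reproduce this, here is a minimal sketch of the kind of check I mean. It assumes the table is available as a tab-separated file with `patent_id` and `citation_category` columns; the function name and the file name in the comment are my own placeholders, and the sample rows are made up purely to illustrate the logic.

```python
import csv
from typing import Iterable, Mapping, Optional


def first_patent_with_category(rows: Iterable[Mapping[str, str]]) -> Optional[int]:
    """Return the lowest numeric patent_id that has a non-empty citation_category."""
    first = None
    for row in rows:
        patent_id = (row.get("patent_id") or "").strip()
        category = (row.get("citation_category") or "").strip()
        # Skip non-numeric ids (design, reissue, ...) and rows without a category.
        if patent_id.isdigit() and category:
            number = int(patent_id)
            if first is None or number < first:
                first = number
    return first


# Made-up rows, for illustration only (not real records from the dataset).
sample_rows = [
    {"patent_id": "6167569", "citation_category": ""},
    {"patent_id": "6334219", "citation_category": ""},
    {"patent_id": "6334220", "citation_category": "cited by examiner"},
    {"patent_id": "D456789", "citation_category": "cited by examiner"},
]
first_seen = first_patent_with_category(sample_rows)

# Against the real file (file name assumed), the same function could be used as:
# with open("g_us_patent_citation.tsv", newline="", encoding="utf-8") as f:
#     print(first_patent_with_category(csv.DictReader(f, delimiter="\t")))
```

On the full table, I would expect this to return a number around 6167569 if the 2001 grants were coded, rather than 6334220.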

2- The second problem is miscategorization in the range where the category is available. Until patent 8353062 (granted in 2013), no citation is labeled 'cited by applicant'; every citation is labeled either as cited by examiner or as 'cited by other'. This is implausible and may point to a systematic coding problem. Perhaps your scraping covers only the first IDS (Information Disclosure Statement) form, whereas applicants may submit this form several times during prosecution, before the examiner's search (this is just speculation on my part).
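A rough sketch of how the category labels can be tabulated over a range of patent numbers is below. Again, the column names `patent_id` and `citation_category` are assumed, the function name is a placeholder of mine, and the sample rows are invented to show the pattern I am describing, not copied from the data.

```python
from collections import Counter
from typing import Iterable, Mapping


def category_counts(rows: Iterable[Mapping[str, str]], start: int, end: int) -> Counter:
    """Count citation_category labels for numeric patent ids in [start, end]."""
    counts = Counter()
    for row in rows:
        patent_id = (row.get("patent_id") or "").strip()
        if not patent_id.isdigit() or not start <= int(patent_id) <= end:
            continue
        category = (row.get("citation_category") or "").strip().lower()
        counts[category or "(missing)"] += 1
    return counts


# Invented rows mimicking the pattern described above (illustration only).
sample_rows = [
    {"patent_id": "6334220", "citation_category": "cited by examiner"},
    {"patent_id": "7000000", "citation_category": "cited by other"},
    {"patent_id": "8353061", "citation_category": "cited by examiner"},
    {"patent_id": "8353062", "citation_category": "cited by applicant"},
]
before = category_counts(sample_rows, 6334220, 8353061)
after = category_counts(sample_rows, 8353062, 99999999)
```

If the coding were correct, I would expect 'cited by applicant' to show up in the first range as well, not only from 8353062 onward.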

I appreciate your efforts and just wanted to bring this to your attention. I am also curious where the problem originates.

Thank you.