Refining your patent data with OpenRefine
Refining patent data (specially Assignee/Inventor names) is one of the most time consuming and crucial task involved in patent analytics. There are multiple methods available to solve the need but in this blog we are going to talk about ‘OpenRefine’.
This Tool Started Its life as Google refine but now we all know it as the OpenRefine. OpenRefine makes data cleaning simpler and more efficient while keeping the data secure and private with in our workstation.
It is most helpful in the cases where we are stuck with inventor names such as ‘Jim, Daniels’ and ‘Daniels, Jim’. Searching and correcting such errors are important to have a proper and accurate analysis of the patent data.
In order to proceed, we first need to download OpenRefine. Once you download and install the application you can open the applications (BTW it runs in a web browser and also without any active internet connection)
It can accept a lot of input file types, but for this example I took the same good old Excel sheet. You can perform your basic cleaning (trimming/replacing and stuff) either in Excel itself or OpenRefine can do that for you. Cleaning data through OpenRefine can be found in more details here.
We all know there can be multiple Inventors for a single patent, so the first thing we will do is to split those columns.
We need to separate the columns with a “|” separator
In OpenRefine we work with facets, for applying cleaning algorithms and rules to sort and simplify our patent data.
To apply a facet > Text Facet
A cluster would be formed at the left side of the dashboard.
Click on the cluster and apply the required algorithm for data cleaning. You can find the details about these algorithms and use the one you like or the one meets your needs.
Now, we can see how this algorithm has captured all the discrepancies within the inventor’s name. We can select and merge the one we need and can cluster and match again using different algorithms.
We can also use customs facets using Open Refine Expression Language (GREL) to use functions which are not defined in the facet settings.
Once you are done with clustering and refine all your patent data (Inventors/Assignee) you need to concatenate the data which we had split into several columns. This is the phase open refine need to develop a bit. Even though it has a concatenate function it doesn’t concatenate well for our case where there are blank cells.