Thesis Explorer: Find Supporting Theses for Your Project

by RS3655 in Circuits > Software




Have you ever been frustrated at not being able to find research papers for your project that back up something you know is true? Personally, I feel that finding the right research paper for your project is a tedious job (sometimes even more tedious than the project itself!). What's more frustrating is that you must cite supporting research papers as references before moving on to your next point. Frequently, this supporting evidence is found on webpages that illustrate the point perfectly. But you cannot cite a webpage in your thesis! Reviewers want solid proof of your point, and that needs to come in the form of a RESEARCH PAPER, nothing else. Sometimes you may find very strong evidence in a newspaper article, but you don't know its original source.

You might type the same keywords into Google Scholar, Microsoft Academic or anywhere else, but the results are not up to your expectations. The papers Google Scholar finds don't illustrate the point nearly as clearly as the webpage or newspaper article you found earlier.

That's exactly the scenario I was in, and I thought: if there were a program that could read the perfect webpage or article I had already found and then find me the most suitable research paper, everything would be SO EASY! And that's when I came up with Article Finder, a JavaScript-based scholarly-literature search tool.

You just enter the URL of the webpage or type in the text of the newspaper write-up, and Article Finder will extract the necessary keywords using Watson Natural Language Understanding and show you the matching research papers from both Microsoft Academic and Google Scholar. What's more, Article Finder has a client-server structure, so setting it up on one computer (the server) makes it accessible to other computers on the network, meaning multiple researchers in an institution can use it.

Requirements:


So basically, we'll be using the following services in Article Finder:

  1. Watson Natural Language Understanding API
  2. Google Scholar website
  3. Microsoft Academic website

Our software requirements are NodeJS and a web browser.

Note that we are using Google Scholar and Microsoft Academic entirely in the way they allow, respecting their terms and conditions. We aren't using any API to scrape data from the sites; rather, we use them through their own interfaces, and their results are displayed in those very interfaces.

Structure


Basically, we have a frontend (client) and a backend (server). The frontend is where you enter your URL or article text. Pressing the send button generates a POST request to the backend. The backend receives the document, analyzes it with the Watson Natural Language Understanding API to extract the concepts, categories, keywords and entities in the document, and then sends these extracted items back to the frontend as the response to the POST request.

The frontend then conducts a search on both Microsoft Academic and Google Scholar, using the extracted items as the search query, and shows you the matching research papers from both sites. The results are occasionally way off, but other than that it works pretty well.
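
To make the exchange concrete, here is a rough sketch of what the two messages could look like (the field names are my own illustration, not necessarily what the actual code uses):

POST request (frontend to backend):
{ "type": "url", "document": "https://example.com/some-article" }

Response (backend to frontend):
{ "query": "machine learning neural networks artificial intelligence" }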

Set Up a Watson Natural Language Understanding Service

IBM Cloud's Watson Natural Language Understanding is a cloud-native product that uses deep learning to extract metadata from text, such as entities, keywords, categories, sentiment, emotion, relations and syntax.

Simply put, it will read our document (using deep learning) and extract our search keywords, with which we can conduct a proper search. First, however, we need to create an instance of the service in IBM Cloud.

Go to Watson Natural Language Understanding's home page and click GET STARTED FREE. Enter your email and a password, verify the email, fill in some personal information, and create your account. AGREE to the acknowledgement and click PROCEED.

This takes you to the service-creation page. Choose any location and select the Lite plan (which is completely free). Of course, if you belong to an organization or have the budget, you may choose whichever plan suits you. The Lite plan is great if you want Article Finder running on your personal computer and won't be using it all day. Click Create, and your new Watson NLU service is ready to go.

This takes you to your NLU service's page. Click Manage in the list at the left of the page. You will see your API key and the URL. Copy and save these, as we will need to enter them into Article Finder.

Download Article Finder


For those of you who are too lazy to go through the entire build process, you can simply download Article Finder here for Windows or Linux. Just make sure you haven't forgotten to install NodeJS. Extract the folder and run the backend.js file using NodeJS; alternatively, click run.bat (on Windows) or run.sh (on Linux) to launch the backend directly. Just don't forget to enter your API key and URL in the credentials.js file.

Then, we can view Article Finder by typing http://127.0.0.1:3000/file/frontend.html in our browser.
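
For reference, once NodeJS is installed and the credentials are filled in, launching the backend manually from the extracted folder is just:

node backend.js

The frontend page above will load only while the backend is running.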

In case errors pop up, I would suggest deleting the node_modules folder and entering the following in the command prompt:

npm install


Install Modules Through NPM

2.jpg
npm.png

Download the package.json file from the attachment. Copy it into your folder and open the folder in your shell (e.g. Command Prompt for Windows). Then enter the following:

npm install

This installs all the necessary packages for Article Finder.
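
If you would rather write the file yourself instead of downloading the attachment, a minimal package.json could look something like this (the module list is based on the modules mentioned in this guide; the versions are only examples, and the attached file may differ):

{
  "name": "article-finder",
  "version": "1.0.0",
  "description": "Finds research papers matching a webpage or write-up",
  "main": "backend.js",
  "dependencies": {
    "express": "^4.17.1",
    "body-parser": "^1.19.0",
    "ibm-watson": "^6.1.1"
  }
}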


The Backend: Setting Up the Server

backend1.jpg

The backend mainly receives the URL or write-up from the frontend, sends it to the Watson NLU service and receives the analyzed JSON result. It then extracts the keywords, concepts, categories and entities from the JSON response, combines them into a string (the search query), and sends this search query back to the frontend.

First, we set up a server in NodeJS using express, give it some headers so it can function over the network (without which we'd get a CORS error) and configure it to serve static files and handle POST requests. I am using POST for the frontend to communicate with the backend, over port 3000, and the body-parser module to parse the POST requests as JSON objects and extract the data sent from the frontend.
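
Here is a minimal sketch of that setup, assuming the express and body-parser modules installed earlier (the directory served and the exact middleware in the real backend.js may differ):

// backend.js (sketch): express server with CORS headers, static files and JSON parsing
const express = require('express');
const bodyParser = require('body-parser');

const app = express();

// allow requests from other computers on the network (prevents CORS errors)
app.use((req, res, next) => {
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
  next();
});

// serve the frontend under /file, e.g. http://127.0.0.1:3000/file/frontend.html
app.use('/file', express.static(__dirname));

// parse incoming POST bodies as JSON
app.use(bodyParser.json());

app.listen(3000, () => console.log('Article Finder backend listening on port 3000'));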

The Backend: Setting Up the Credentials


First, we create a file, credentials.js, to store and export our API key, URL and the version of Watson NLU. I am using version 2018-11-16, but you may use any available version you like. Next, we initialize the Watson NLU API by importing these credentials. We also define the features to extract with Watson NLU: concepts, categories, keywords, metadata and entities, as well as the maximum number of items for each extracted feature. The backend.js file (the main backend program) will read these exported credentials to initialize Watson NLU.
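
As a sketch, credentials.js could export something along these lines, and backend.js could then use it to initialize the NLU client through IBM's ibm-watson package (the feature/limit property names are my own choice; only the API key, URL and version are fixed by the SDK):

// credentials.js (sketch)
module.exports = {
  apikey: 'YOUR_API_KEY_HERE',
  url: 'YOUR_SERVICE_URL_HERE',
  version: '2018-11-16',
  // which features to extract and how many items of each
  features: {
    concepts:   { limit: 5 },
    categories: { limit: 3 },
    keywords:   { limit: 5 },
    entities:   { limit: 5 }
  }
};

// backend.js (sketch): initializing Watson NLU from the exported credentials
const NaturalLanguageUnderstandingV1 = require('ibm-watson/natural-language-understanding/v1');
const { IamAuthenticator } = require('ibm-watson/auth');
const credentials = require('./credentials');

const nlu = new NaturalLanguageUnderstandingV1({
  version: credentials.version,
  authenticator: new IamAuthenticator({ apikey: credentials.apikey }),
  serviceUrl: credentials.url
});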

The Backend: Extraction of Features and POST Requests


We define a POST request router for receiving requests from the frontend. In its callback, we define the features to be extracted from the text, the maximum number of items for each feature, and the type of analysis (URL or text), based on the frontend's request. As described on Watson's site, the extracted features are:

  1. Categories return a five-level taxonomy of the content; the top three categories are returned.
  2. Concepts return high-level concepts in the content. For example, a research paper about deep learning might return the concept "Artificial Intelligence" although the term is not mentioned.
  3. Entities identify people, cities, organizations, and other entities in the content.
  4. Keywords return important keywords in the content.

Besides those, Watson can also extract metadata from the text, i.e. the author name, the title and the date published (in case these exist). Each required feature can be enabled by setting the respective property to TRUE, and its limit can be defined, all in the credentials.js file.

The higher the limit for each extracted feature, the more accurate the analysis will be; however, it will also cost you more (on a paid plan). If you are using the Lite plan, be aware that each extracted item is deducted from your quota of 30,000 NLU items per month, so lower limits let you use Article Finder with lower accuracy but for a longer time. Also, more features don't necessarily mean more accurate search results; too many terms just confuse the search engines, which then can't find any matching research paper.

Next, we need to extract the detected features from Watson NLU's JSON response. We simply loop through the results and store the items in string arrays. As categories contain a taxonomy of items separated by slashes (/), we extract each word separated by a slash and store them in an array. Then we combine all the items into a single array. To remove duplicate items, we convert the array into a JavaScript Set and then convert the Set back into an array using the spread (...) operator. These unique array elements are then joined into a string, which is sent to the frontend as the JSON response to the POST request.
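
Putting the routing, analysis and clean-up together, the POST handler could look roughly like the following sketch. It assumes the app, nlu and credentials objects from the earlier steps, and the route name /analyze and request field names are my own placeholders; the real backend.js may structure this differently:

// POST handler (sketch): analyze the document and reply with a search query
app.post('/analyze', async (req, res) => {
  try {
    const params = {
      features: {
        concepts:   { limit: credentials.features.concepts.limit },
        categories: { limit: credentials.features.categories.limit },
        keywords:   { limit: credentials.features.keywords.limit },
        entities:   { limit: credentials.features.entities.limit }
      }
    };
    if (req.body.type === 'url') {
      params.url = req.body.document;
      // metadata (author, title, date) is only available for URL/HTML input
      params.features.metadata = {};
    } else {
      params.text = req.body.document;
    }

    const { result } = await nlu.analyze(params);

    // collect the detected items into one array of strings
    const items = [];
    result.keywords.forEach(k => items.push(k.text));
    result.concepts.forEach(c => items.push(c.text));
    result.entities.forEach(e => items.push(e.text));
    // categories look like "/science/computer science", so split them on slashes
    result.categories.forEach(c => items.push(...c.label.split('/').filter(w => w)));

    // remove duplicates with a Set, spread it back into an array, join into a string
    const query = [...new Set(items)].join(' ');
    res.json({ query });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});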

The Frontend: HTML


The frontend is the area for user input as well as the output. To determine the type of document, we include a drop-down list with two options: text / URL. Then we include a text area for entering the URL or the write-up, and a search button. We add some CSS and a background to make the page look appealing. We will then use the DOM (through JavaScript) to control the various aspects of the page.

The Frontend: JavaScript


In this frontend-control part, we use the DOM to get the input values and display the results. To create a dynamic look, we aren't writing the associated text in the HTML itself but in JavaScript, which shows and removes the text as and when required.

When the user selects the type of document, enters the text or URL and clicks the SEND button, a POST request is generated using the Fetch API and sent to the backend. The backend then sends back the analyzed keywords as a string, which is our search query. Note that Google Scholar has a maximum query length of 256 characters (extra ones are ignored); as far as I know, this is not the case with Microsoft Academic.
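
A sketch of that request from the frontend side, reusing the hypothetical /analyze route and field names from the backend sketch:

// frontend.js (sketch): send the document to the backend and get back the search query
async function getSearchQuery(type, documentText) {
  const response = await fetch('http://127.0.0.1:3000/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ type: type, document: documentText })
  });
  const data = await response.json();
  return data.query;
}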

We then need to conduct a search using the two search engines with our received search query. Google Scholar search URLs are of the form:

https://scholar.google.com/scholar?q=

Microsoft Academic search URLs are of the form:

https://academic.microsoft.com/search?q=

We append our search query to these prefixes as the value of the href attribute of an auto-clicking link. Microsoft Academic search results are displayed in an inline frame (iframe), while the Google Scholar results are shown in a separate tab, because Google Scholar doesn't allow its search results to be shown inside a frame.
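
As a rough sketch of that last step (the iframe ID is my own placeholder, and I use window.open here instead of the auto-clicking link described above, which has the same effect):

// search both engines with the query returned by the backend (sketch)
function showResults(query) {
  // Microsoft Academic results load inside an iframe on the page
  document.getElementById('academic-frame').src =
    'https://academic.microsoft.com/search?q=' + encodeURIComponent(query);

  // Google Scholar refuses to load inside a frame, so open it in a new tab;
  // it only reads the first 256 characters of the query anyway
  window.open('https://scholar.google.com/scholar?q=' +
    encodeURIComponent(query.slice(0, 256)), '_blank');
}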

We also create an error-handler function to display an error notification together with the error message, as well as text to be shown when the user has not entered all the information necessary for the search. With these, we have a new and fully functional Article Finder program.
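
A minimal sketch of such a handler (the element ID is again a placeholder of my own):

// display an error notification on the page (sketch)
function showError(err) {
  const box = document.getElementById('error-box');
  box.textContent = 'Something went wrong: ' + err.message;
  box.style.display = 'block';
}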

Future Modifications


Article Finder is a simple, open-source and legitimate ;) search tool for finding research papers, and it is certainly neither perfect nor complete. I just wanted to provide a simple solution to a problem I faced while searching for research papers. Since Google Scholar and Microsoft Academic were free, I decided to use them for this project, while respecting the T&C of both services. There are surely tons of other sites, and Article Finder could be greatly improved by including them. Furthermore, the analysis of the input document could be improved with more sophisticated ML techniques, leading to more accurate results. All nerds out there are welcome to improve upon this program. Best of luck out there.

Regards,