Milestone check
Hello, can you please check the milestone progress on my semester project? The notebook in which I'm developing the model, along with a description of my progress so far, is located in the repository's root, in a file called MVI_milestone.ipynb
Since the file is quite large when expanded on GitLab, and you may not feel like importing it into Colab or downloading it to your PC just to read the text, I am pasting the milestone description into this issue:
Task description
In this task, we are creating an efficient clustering model for papers from the ArXiv e-print repository. Apart from the actual model, a survey of state-of-the-art text clustering algorithms must be included in the report.
Dataset retrieval and preprocessing
Since the metadata dataset is only hosted on Kaggle, where downloads are limited to pre-signed URLs, I have re-hosted it on my Google Drive. If you want to run this notebook yourself, download the linked dataset, compress it using gzip (a simple `gzip -k <snapshot_file>` should be enough) and upload it to your own Google Drive or Colab environment.
There is little data pre-processing needed for this dataset. Kaggle already provides a JSON snapshot of paper metadata, updated weekly. This snapshot contains basic information about each paper, such as its ID, title, abstract, and authors, as well as more detailed information: revision history, submitter-chosen categories, submitter name, journal information, etc.
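For anyone reproducing this, here is a minimal sketch of streaming the gzipped snapshot (the file name follows Kaggle's naming and may differ for your copy):

```python
import gzip
import json

# File name follows Kaggle's snapshot naming; adjust to your own copy.
SNAPSHOT_PATH = "arxiv-metadata-oai-snapshot.json.gz"

# The snapshot is JSON Lines (one metadata record per line), so it can be
# streamed without loading all ~2 million records into memory at once.
with gzip.open(SNAPSHOT_PATH, "rt", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)
        # Fields used for the documents; more are available (versions, authors, ...).
        paper_id, title, abstract = paper["id"], paper["title"], paper["abstract"]
```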
After re-hosting the dataset on my Google Drive (since I'm developing this model in Google Colab), I prepared a class to parse it, called PaperRepository. Parsing takes approximately 2 minutes on a machine provided by Google Colab Pro.
During JSON parsing, this class generates documents: the text passages that are passed to the embedding models. These documents contain the paper title and abstract; they do not include information about the authors or the ArXiv category.
The title and abstract of each paper need to be cleaned to remove LaTeX/MathJax formatting and accent marks. I experimented with using pandoc to convert as much of this markup as possible into Unicode representations, but abandoned this approach for two reasons:
- The preprocessing would take an extremely long time, since it would essentially consist of rendering almost 2 million LaTeX documents.
- Embedding models are capable of embedding words, not mathematical formulas, so there would be little value in performing the conversion.
Thus, PaperRepository only removes backslashes, newlines, and sigil characters from the document.
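Roughly, the cleaning boils down to something like the following (the exact sigil set shown here is illustrative, not necessarily the one the notebook uses):

```python
import re

# Assumed sigil set (TeX-style control characters); the exact set used in the
# notebook may differ.
_SIGILS = re.compile(r"[\\$#%&~^_{}]")
_SPACE = re.compile(r"\s+")

def clean_document(title: str, abstract: str) -> str:
    """Join title and abstract, dropping backslashes, newlines and sigils."""
    text = _SIGILS.sub(" ", f"{title} {abstract}")
    return _SPACE.sub(" ", text).strip()  # collapses newlines and extra spaces
```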
I have also experimented with including author names in the documents. However, because a few common surnames (especially Asian and Spanish ones) are so prevalent among ArXiv authors, the models used for testing consistently created three clusters: papers with Asian co-authors, papers with Spanish co-authors, and a large cluster of everything else.
PaperRepository also allows retrieving ArXiv papers by ID, getting all documents for a specific ArXiv category, and getting the categories of a document. To save memory, string category labels are converted to numeric IDs.
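To illustrate the shape of this interface, here is a simplified sketch (method names and internals are illustrative; the notebook's actual API may differ):

```python
# Illustrative sketch only -- the real PaperRepository in the notebook may differ.
class PaperRepositorySketch:
    def __init__(self):
        self._category_ids = {}    # string label -> numeric ID
        self._documents = {}       # ArXiv ID -> cleaned document text
        self._doc_categories = {}  # ArXiv ID -> set of numeric category IDs

    def _category_id(self, label: str) -> int:
        # Intern each string label as a small integer to save memory.
        return self._category_ids.setdefault(label, len(self._category_ids))

    def get_document(self, arxiv_id: str) -> str:
        return self._documents[arxiv_id]

    def get_categories(self, arxiv_id: str) -> set:
        return self._doc_categories[arxiv_id]

    def documents_in_category(self, label: str) -> list:
        cid = self._category_ids[label]
        return [doc for aid, doc in self._documents.items()
                if cid in self._doc_categories[aid]]
```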
Current model progress
My initial efforts focus on leveraging the top2vec library, which creates a joint space of document and word embeddings. After reducing its dimensionality using UMAP, it clusters the documents (using HDBSCAN) and finds the most representative words for each cluster.
I train top2vec with Google's pre-trained universal-sentence-encoder model on a random sample of 10% of the documents in the dataset. This produces the initial clusters. I then add the remaining 90% of documents to the trained model.
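The workflow boils down to a few top2vec calls, roughly like this (it assumes `documents` is the list of cleaned title+abstract strings from PaperRepository):

```python
import random
from top2vec import Top2Vec

# Sample 10% of the corpus for training; `documents` comes from PaperRepository.
random.seed(42)
sample_idx = set(random.sample(range(len(documents)), k=len(documents) // 10))
train_docs = [documents[i] for i in sorted(sample_idx)]
rest_docs = [d for i, d in enumerate(documents) if i not in sample_idx]

# UMAP dimensionality reduction and HDBSCAN clustering happen inside Top2Vec.
model = Top2Vec(train_docs, embedding_model="universal-sentence-encoder")

# Embed the remaining 90% of documents and assign them to the learned topics.
model.add_documents(rest_docs)
```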
Documents similar to a specific paper can then be found using the model's search_documents_by_documents function, which provides simple recommender functionality.
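For example:

```python
# Find the 5 documents most similar to the document with internal ID 0;
# returns parallel lists of documents, similarity scores and document IDs.
docs, scores, doc_ids = model.search_documents_by_documents(doc_ids=[0], num_docs=5)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc[:80]}")
```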
Ongoing work
In the coming days, I will be focusing on three major areas:
Evaluation
To be able to compare how different models perform, I need to find an effective measure of clustering quality.
One option is using the existing category labels from ArXiv, which is why I don't currently include them in the document text. However, if I used these labels as a simple measure of model quality, the clusters could end up merely mirroring ArXiv categories, which would be no improvement over users directly browsing their chosen category on ArXiv.
Thus, if I end up using category labels, I cannot focus solely on matching clusters to them. They may still come in useful, however.
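If I do use them, the standard external clustering metrics in scikit-learn could be a starting point; a minimal sketch (the variable names are placeholders for whatever I end up wiring in):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# `category_labels` would hold the primary ArXiv category ID per document
# (from PaperRepository) and `cluster_labels` the model's cluster assignment
# (e.g. Top2Vec's per-document topic); both names are placeholders.
ari = adjusted_rand_score(category_labels, cluster_labels)
nmi = normalized_mutual_info_score(category_labels, cluster_labels)
print(f"adjusted Rand index: {ari:.3f}, normalized mutual information: {nmi:.3f}")
```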
I will also be exploring the evaluation criteria used in the papers introducing embedding-based clustering packages such as top2vec and BERTopic.
Further exploration of embedding models
Since the pre-trained embedding models from Google may be suboptimal for the specific language of scientific papers, after solving the evaluation problem, I will focus my efforts in two directions:
- Training a specific doc2vec model on the corpus of ArXiv paper abstracts and titles
- Using SPECTER, a model pre-trained on scientific documents using their citations as a relatedness measure (a loading sketch follows this list)
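For the second direction, here is a sketch of loading SPECTER through its Hugging Face transformers release (whether I use this distribution is still open; the title/abstract strings are placeholders for cleaned documents from PaperRepository):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
encoder = AutoModel.from_pretrained("allenai/specter")

# SPECTER expects "title [SEP] abstract" as its input; these strings are
# placeholders for the cleaned documents from PaperRepository.
title = "Example paper title"
abstract = "Example abstract text."
inputs = tokenizer(title + tokenizer.sep_token + abstract,
                   padding=True, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)
# The first ([CLS]) token embedding serves as the paper-level embedding.
embedding = outputs.last_hidden_state[:, 0, :]
```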