TopiQ: Topic Modeling and Graph Analytics on Research Paper Abstracts

This repository contains a Jupyter Notebook that demonstrates Topic Modeling on research paper abstracts using Large Language Models (LLMs). The project goes beyond basic topic modeling by incorporating advanced graph analytics, offering deep insights into the relationships between extracted topics. The notebook also provides comprehensive visualizations of topic clustering, topic networks, and their respective structures.

Dataset

The dataset used in this project is sourced from Kaggle:

Research Abstracts Dataset

This dataset consists of research paper abstracts, which are analyzed to uncover latent topics and their relationships. Please download the dataset from the link above and place it in the working directory before running the notebook.

Dependencies

The project requires the following Python libraries:

Core Libraries:

pandas: For data manipulation and analysis.
numpy: For numerical computations.
matplotlib: For visualizing clustering, topics, and graphs.

Clustering and Dimensionality Reduction:

umap-learn: For reducing high-dimensional embeddings to lower dimensions.
hdbscan: For hierarchical density-based clustering of data.

Machine Learning Models:

sentence-transformers: For generating embeddings of text data.
transformers: To use the Llama2 model for topic extraction.

Topic Modeling Techniques:

KeyBERT: For extracting representative keywords of a topic.
UMAP: For dimensionality reduction.
HDBSCAN: For identifying meaningful clusters.
MMR (Maximal Marginal Relevance): For selecting diverse and representative topics.

Graph Analytics:

networkx: For creating and analyzing topic graphs.

Notebook Features

1. Data Exploration

Thorough exploration of the Research Abstracts Dataset to understand its structure and content. This step prepares the data for further analysis and modeling.

2. Text Embedding

Utilizes sentence-transformers to convert research abstracts into high-dimensional embeddings for more effective topic modeling.

3. Dimensionality Reduction

Uses UMAP (Uniform Manifold Approximation and Projection) to reduce the high-dimensional embeddings to a 2D space, making it easier to visualize and interpret the data.

4. Clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is employed to identify clusters of research abstracts, each representing a distinct topic.

5. Topic Extraction

Extracts representative topics using multiple techniques:
- KeyBERT for extracting keywords.
- Llama2 (LLM) for advanced topic extraction.
- MMR (Maximal Marginal Relevance) for selecting the most diverse and representative topics.

6. Graph Construction and Analytics

Constructs a graph from the extracted topics based on similarity.
Performs advanced graph analytics:
- Betweenness Centrality: Identifies influential nodes in the topic network.
- Community Detection: Groups related topics into communities.
- K-Cores: Finds dense subgraphs within the topic network.
- Subgraph Generation: Creates subgraphs for specific topics to analyze their local structure.

7. Visualization

Visualizes clustering, topic distributions, and graph structures for better understanding and interpretation.

Team

This project was developed by:

Mohammed Saqlain
Nishaan Padanthaya

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
assets		assets
README.md		README.md
topic_modelling_using_llms.ipynb		topic_modelling_using_llms.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TopiQ: Topic Modeling and Graph Analytics on Research Paper Abstracts

Dataset

Dependencies

Core Libraries:

Clustering and Dimensionality Reduction:

Machine Learning Models:

Topic Modeling Techniques:

Graph Analytics:

Notebook Features

1. Data Exploration

2. Text Embedding

3. Dimensionality Reduction

4. Clustering

5. Topic Extraction

6. Graph Construction and Analytics

7. Visualization

Team

About

Uh oh!

Releases

Packages

Languages

saqlain2204/Topic-Modelling-Using-Large-Language-Models

Folders and files

Latest commit

History

Repository files navigation

TopiQ: Topic Modeling and Graph Analytics on Research Paper Abstracts

Dataset

Dependencies

Core Libraries:

Clustering and Dimensionality Reduction:

Machine Learning Models:

Topic Modeling Techniques:

Graph Analytics:

Notebook Features

1. Data Exploration

2. Text Embedding

3. Dimensionality Reduction

4. Clustering

5. Topic Extraction

6. Graph Construction and Analytics

7. Visualization

Team

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages