Article Network missing link prediction
The goal of this post is to predict missing links from a network.
The dataset is a citation network of research articles, represented as a graph G = (V, E). The nodes correspond to scientific articles and the existence of a directed edge between nodes u and v indicates that paper u cites paper v. Each article is also associated with information such as title, publication year, author names, journal name and a short abstract.
We will build a machine learning model to predict from a pair of nodes if an edge is present (1) or not (0). A number of edges have been randomly removed from the initial graph, and the binary classifier performance will be measured on a test subset.
Get the network as a graph
import pandas as pd
training = pd.read_csv('./data/training_set.txt', sep = ' ', header = None)
training.head(3)
0 | 1 | 2 | |
---|---|---|---|
0 | 9510123 | 9502114 | 1 |
1 | 9707075 | 9604178 | 1 |
2 | 9312155 | 9506142 | 0 |
The training set has 3 columns: the first two indicating nodes IDs and the third one if there is a link between the node pair. The graph is directed: the first node is the source (quoting article) and the second node is the target (quoted article).
We also have access to the file node_information
, containing additional information about the research articles. The file has no Header and columns can be named for the sake of clarity.
We can use id as index, which will make information retrieval easier later.
node_info = pd.read_csv('./data/node_information.csv', header= None)
node_info.columns = ['id', 'pub_year', 'title', 'authors', 'journal_name', 'abstract']
node_info = node_info.set_index('id')
node_info.head(3)
pub_year | title | authors | journal_name | abstract | |
---|---|---|---|---|---|
id | |||||
1001 | 2000 | compactification geometry and duality | Paul S. Aspinwall | NaN | these are notes based on lectures given at tas... |
1002 | 2000 | domain walls and massive gauged supergravity p... | M. Cvetic, H. Lu, C.N. Pope | Class.Quant.Grav. | we point out that massive gauged supergravity ... |
1003 | 2000 | comment on metric fluctuations in brane worlds | Y.S. Myung, Gungwon Kang | NaN | recently ivanov and volovich hep-th 9912242 cl... |
Now that we have everything, we can start to build our network!
We take the IDs from node_info
, that contains every node of the entire graph (while the other files are just node pairs and would exclude unlinked nodes). We then add the edges from our training set.
A good library to deal with networks is the python package NetworkX. Important for the graph creation: the correct function for this usecase is DiGraph
and not Graph
as the network is directed.
import networkx as nx
IDs = [node_id for node_id in node_info.id]
training_list = training.values.tolist() # training dataframe convertion for easy edges list comprehension below
edges = [(node_pair[0], node_pair[1]) for node_pair in training_list if node_pair[2] == 1]
G = nx.DiGraph()
G.add_nodes_from(IDs)
G.add_edges_from(edges)
print("Number of nodes : " + str(G.number_of_nodes()))
print("Number of edges : " + str(G.number_of_edges()))
Number of nodes : 27770
Number of edges : 335130
Compute the predictors
The graph contains a lot of nodes and edges. Let’s create a subset to keep the learning time of our model more laptop-suitable, it shouldn’t impact the performance.
import random
training_reduced = training.sample(frac=0.05) # We keep 5%
training_reduced.columns = ['source', 'target', 'Y']
training_reduced.head(3)
source | target | Y | |
---|---|---|---|
348093 | 110261 | 9612133 | 1 |
496903 | 9712054 | 9602174 | 1 |
337809 | 1202 | 9903249 | 1 |
Now that the training dataset is ready, time to create our future predictors. We will use the following node descriptive metrics:
- Degree centrality: The fraction of nodes of the total graph each node is connected to. We are interested by the out-degree centrality of the source node (percentage of papers the article is quoting), and the in-degree centrality of the target node (percentage of papers the article is being quoted by).
- Page ranking: The ranking of the node in the graph based on the structure of the incoming links. We are interested in the page ranking of our target.
- Preferential attachment: Express the likeliness of a connection between two nodes based on their existing connectiveness. The function is not implemented in NetworkX for directed graph so we define it.
- HITS algorithm: Define a Hub and Authority value. A good Hub points to many other nodes, a good Authority is pointed by many Hubs. We are interested in the Hub score for the source, and the Authority score for the target.
# Degree Centrality features
out_degree_centrality = nx.out_degree_centrality(G)
in_degree_centrality = nx.in_degree_centrality(G)
training_reduced['source_out_centrality'] = training_reduced.apply(lambda row: out_degree_centrality[row.source],axis=1)
training_reduced['target_in_centrality'] = training_reduced.apply(lambda row: in_degree_centrality[row.target],axis=1)
# Page rank
page_rank = nx.pagerank_scipy(G)
training_reduced['target_pagerank'] = training_reduced.apply(lambda row: page_rank[row.target],axis=1)
# Preferential Attachment
# For a directed graph, is equal to K_out_source * K_in_target with K the number of neighbors. Which is equivalent to multiply the available centralities.
training_reduced['preferencial_attachment'] = training_reduced.apply(lambda row: row.source_out_centrality * row.target_in_centrality,axis=1)
# HITS algorithm
hub_score, authority_score = nx.hits(G)
training_reduced['source_hub_score'] = training_reduced.apply(lambda row: hub_score[row.source],axis=1)
training_reduced['target_authority_score'] = training_reduced.apply(lambda row: authority_score[row.target],axis=1)
training_reduced.head(3)
source | target | Y | source_out_centrality | target_in_centrality | target_pagerank | preferencial_attachment | source_hub_score | target_authority_score | |
---|---|---|---|---|---|---|---|---|---|
348093 | 110261 | 9612133 | 1 | 0.001260 | 0.000180 | 0.000015 | 2.269434e-07 | 7.067421e-07 | 4.779942e-08 |
496903 | 9712054 | 9602174 | 1 | 0.001008 | 0.001765 | 0.000064 | 1.779236e-06 | 5.649810e-05 | 7.506221e-05 |
337809 | 1202 | 9903249 | 1 | 0.000360 | 0.000612 | 0.000050 | 2.204593e-07 | 4.574718e-07 | 9.512625e-07 |
We can also extract a lot of information from the features of node_info
:
- Publication year: A paper cannot quote a more recent paper. We compute the difference between the publication year of the target and the publication year of the source.
- Title: Title similarity induce similar topics, and enhance the chances of articles being linked. We can use Sentence2Vec (word embedding) similarity with SpaCy.
- Abstract: We follow the same reasoning that for the title.
- Authors: Papers sharing authors should have higher chances of being connected. We count the number of common authors.
# Publication year
training_reduced['pub_year_difference'] = training_reduced.apply(lambda row: node_info.pub_year[row.source] - node_info.pub_year[row.target] ,axis=1)
training_reduced['pub_year_difference']=training_reduced['pub_year_difference'].where(training_reduced['pub_year_difference'] < 0, -1)
# Title
import spacy
nlp = spacy.load('en_core_web_lg')
training_reduced['title_similarity'] = training_reduced.apply(lambda row: nlp(node_info.title[row.source]).similarity(nlp(node_info.title[row.target])) ,axis=1)
# Abstract
training_reduced['abstract_similarity'] = training_reduced.apply(lambda row: nlp(node_info.abstract[row.source]).similarity(nlp(node_info.abstract[row.target])) ,axis=1)
# Authors
node_info['authors'] = node_info['authors'].fillna(value='')
training_reduced['common_authors'] = training_reduced.apply(lambda row: len(set(node_info.authors[row.source].split(",")).intersection(set(node_info.authors[row.target].split(",")))) ,axis=1)
training_reduced.head(3)
source | target | Y | source_out_centrality | target_in_centrality | target_pagerank | preferencial_attachment | source_hub_score | target_authority_score | pub_year_difference | title_similarity | abstract_similarity | common_authors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
348093 | 110261 | 9612133 | 1 | 0.001260 | 0.000180 | 0.000015 | 2.269434e-07 | 7.067421e-07 | 4.779942e-08 | -1 | 0.781564 | 0.972482 | 0 |
496903 | 9712054 | 9602174 | 1 | 0.001008 | 0.001765 | 0.000064 | 1.779236e-06 | 5.649810e-05 | 7.506221e-05 | -1 | 0.839936 | 0.657183 | 0 |
337809 | 1202 | 9903249 | 1 | 0.000360 | 0.000612 | 0.000050 | 2.204593e-07 | 4.574718e-07 | 9.512625e-07 | -1 | 0.547695 | 0.848389 | 0 |
We now have 10 predictors! Some of them express a certain part of common information, we can see it on the correlation matrix of features as well:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(14,12))
sns.heatmap(training_reduced.corr(),
vmax=0.5,
square=True,
annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x14b527dd8>
However, it cannot hurt our model performance so let’s keep them for now.
Build the model
We first want to create a train,test split to have an idea of the performance of the classifier.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(training_reduced.drop(['source', 'target', 'Y'], axis= 1), training_reduced.Y, test_size=0.2)
A Random Forest is often a good idea in this kind of prediction task. Indeed, Random Forest does not require a lot of specific feature engineering, and gives a good idea of important features. We use the scikit-learn library.
from sklearn.ensemble import RandomForestClassifier
RF_classifer = RandomForestClassifier(n_estimators= 1000)
RF_classifer.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
We now test on our freshly created X_test:
RF_classifer.score(X_test, y_test)
0.9074074074074074
A score of .9 accuracy is really good here! There is also other interesting metrics we could look at, such as the F1 score for example. It all depends on the priorities in predictions.
In case of prediction on new data, keep in mind that all the previously defined preprocessing tasks need to be done on the new set. The code would look like this:
testing.columns = ['source', 'target']
# Degree Centrality features
testing['source_out_centrality'] = testing.apply(lambda row: out_degree_centrality[row.source],axis=1)
testing['target_in_centrality'] = testing.apply(lambda row: in_degree_centrality[row.target],axis=1)
# Page rank
testing['target_pagerank'] = testing.apply(lambda row: page_rank[row.target],axis=1)
# Preferential Attachment
# For a directed graph, is equal to K_out_source * K_in_target with K the number of neighbors. Which is equivalent to multiply the available centralities.
testing['preferencial_attachment'] = testing.apply(lambda row: row.source_out_centrality * row.target_in_centrality,axis=1)
# HITS algorithm
testing['source_hub_score'] = testing.apply(lambda row: hub_score[row.source],axis=1)
testing['target_authority_score'] = testing.apply(lambda row: authority_score[row.target],axis=1)
# Publication year
testing['pub_year_difference'] = testing.apply(lambda row: node_info.pub_year[row.source] - node_info.pub_year[row.target] ,axis=1)
testing['pub_year_difference']=testing['pub_year_difference'].where(testing['pub_year_difference'] < 0, -1)
# Title
testing['title_similarity'] = testing.apply(lambda row: nlp(node_info.title[row.source]).similarity(nlp(node_info.title[row.target])) ,axis=1)
# Abstract
testing['abstract_similarity'] = testing.apply(lambda row: nlp(node_info.abstract[row.source]).similarity(nlp(node_info.abstract[row.target])) ,axis=1)
# Authors
testing['common_authors'] = testing.apply(lambda row: len(set(node_info.authors[row.source].split(",")).intersection(set(node_info.authors[row.target].split(",")))) ,axis=1)
predictions = RF_classifer.predict(testing.drop(['source', 'target'], axis = 1))
predictions_df = pd.DataFrame(predictions)
predictions_df.to_csv('predictions.csv')
Here we are! We have created a model to predict links between nodes, based on both article information and network attributes. We also saw how to test the performance of that model, and make predictions on new data.
Here, we did not actually represent our network, but if you are interested in such a thing, take a look at my article on the topic!