diff --git a/images/GAT_result1.png b/images/GAT_result1.png new file mode 100644 index 0000000..9ce50ff Binary files /dev/null and b/images/GAT_result1.png differ diff --git a/images/GAT_result2.png b/images/GAT_result2.png new file mode 100644 index 0000000..2559315 Binary files /dev/null and b/images/GAT_result2.png differ diff --git a/images/GAT_result3.png b/images/GAT_result3.png new file mode 100644 index 0000000..9547c7d Binary files /dev/null and b/images/GAT_result3.png differ diff --git a/images/GCN_results1.png b/images/GCN_results1.png new file mode 100644 index 0000000..899bdd7 Binary files /dev/null and b/images/GCN_results1.png differ diff --git a/images/GCN_results2.png b/images/GCN_results2.png new file mode 100644 index 0000000..3424bea Binary files /dev/null and b/images/GCN_results2.png differ diff --git a/images/GCN_results3.png b/images/GCN_results3.png new file mode 100644 index 0000000..8a2456e Binary files /dev/null and b/images/GCN_results3.png differ diff --git a/images/GCN_results4.png b/images/GCN_results4.png new file mode 100644 index 0000000..1fd5771 Binary files /dev/null and b/images/GCN_results4.png differ diff --git a/summaries/GraphAttentionNetwork.md b/summaries/GraphAttentionNetwork.md new file mode 100644 index 0000000..10f9e0b --- /dev/null +++ b/summaries/GraphAttentionNetwork.md @@ -0,0 +1,58 @@ +# GRAPH ATTENTION NETWORKS +Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, +Yoshua Bengio, **ICLR** **2018** + +## Summary + +The paper introduces a novel, efficient, and parallelizable architecture based on spatial methods, termed the Graph Attention Network (GAT). This architecture draws significant inspiration from the seminal work "Attention is All You Need" by Vaswani et al. The central concept of GAT is to compute the hidden representations of each node in the graph by attending to its neighbors through a self-attention mechanism. 
Unlike earlier spatial methods such as GCN and GraphSAGE, which weight neighbors with fixed, non-learned coefficients, GAT learns how much weight each neighbor receives. The model also applies to inductive learning problems, including generalization to entirely unseen graphs. + +## Contributions + +These are the major contributions of this paper: + +1.) It introduces the attention mechanism to Graph Neural Networks. Learned attention weights allow each node to weigh the influence of its neighbors differently, thereby capturing more complex relationships within the graph. + +2.) To stabilize the learning process and improve performance, the Graph Attention Network (GAT) employs multi-head attention. Multiple attention mechanisms are applied independently, and their outputs are either concatenated or averaged to form the final node representations. The attention computations are independent across heads and across edges, so the model parallelizes well and is efficient. + +## Working + + + +1.) Encoding Features: In the paper, a single linear transformation, without any activation function, is used to encode the feature vector of each node into a low-dimensional vector space. In general, a feed-forward neural network could be used for this purpose. + +2.) Attention Coefficient: For each node and its one-hop neighbors, attention coefficients are calculated using a shared attention mechanism that operates on the linearly transformed node features. The attention mechanism employed in the paper differs from the one originally proposed by Vaswani et al. The details are given in the paper and summarized later in this section. + +3.) Softmax Normalization: The attention coefficients are normalized over each node's neighborhood using the softmax function, so that they sum to one. 
+ +4.) Feature Aggregation: Each node's new feature representation is computed as a weighted sum of its neighbors' transformed low-dimensional feature vectors, using the attention coefficients as weights. + +5.) Non-Linearity: A non-linear activation function, such as ReLU, is applied to the aggregated features to obtain the final representation of each node. This process is repeated layer by layer, typically for two or three layers, so the neighborhood a node can see is bounded above by the network's depth. + +Important details:-> + +The paper focuses on the node classification task. It uses 8 attention heads, i.e. 8 shared learnable weight matrices that generate 8 aggregated feature representations of a node before step 5. LeakyReLU is applied inside the attention scoring function. The head outputs are concatenated in hidden layers and averaged in the final (prediction) layer, where concatenation would change the output dimension. Additionally, the paper does not use two separate learnable weight matrices, namely query and key, to obtain the attention coefficients; a single shared weight vector is applied to the concatenated transformed features of the two nodes. + +## Results + + + + + +The exact training setup is given in the paper, including parameters such as dropout probability, strength of L2 regularization, number of layers, and the patience value for early stopping. The key takeaways are as follows: + + +1.) The proposed GAT model matched or exceeded the state of the art on the benchmarks considered. It significantly outperformed the then-recent GraphSAGE model and was at least on par with the GCN model. + +2.) The results also underscore the importance of learned attention weights: GAT was more accurate than a similar architecture that assigned constant and equal attention weights to all neighbors. + +## Two-cents + +The paper is not heavily mathematical and is written in simple, easy-to-understand language. The implementation details are presented straightforwardly, and the approach can be viewed as a combination of GCN and attention mechanisms. 
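As a rough illustration of the "GCN plus attention" view, steps 1 to 4 of the Working section can be sketched for a single attention head in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `gat_head`, `matvec`, and `leaky_relu`, the list-based data layout, and the choice to include each node in its own neighbor list are all illustrative, and the final non-linearity and multi-head combination are left out.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU used inside the attention score (slope is an assumed default)."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply a matrix W (list of rows) by a vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_head(features, neighbors, W, a):
    """One attention head (hypothetical helper, not from the paper's code).

    features:  list of node feature vectors
    neighbors: neighbors[i] lists the nodes that node i attends to
               (assumed here to include i itself)
    W:         shared weight matrix, as a list of rows
    a:         shared attention vector, twice as long as a transformed feature
    Returns one (attention_weights, aggregated_features) pair per node.
    """
    z = [matvec(W, h) for h in features]  # step 1: linear encoding
    out = []
    for i, nbrs in enumerate(neighbors):
        # step 2: score each neighbor from the concatenation [z_i || z_j]
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # step 3: softmax over the neighborhood (shifted for numerical stability)
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # step 4: attention-weighted sum of the neighbors' transformed features
        agg = [
            sum(w * z[j][k] for w, j in zip(alpha, nbrs))
            for k in range(len(z[i]))
        ]
        out.append((alpha, agg))
    return out
```

With an all-zero attention vector `a`, every neighbor receives the same weight, which corresponds to the constant-weight baseline mentioned in the results; a non-zero `a` lets the weights differ per neighbor.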
+ +The paper also discusses possible future work, with the most significant suggestion being the incorporation of edge features along with node-level features into the model. This enhancement could result in a more powerful model, as current models like GAT, GraphSAGE, and GCN do not utilize edge features beyond their binary presence or absence. + +## Resources + +Paper link: https://arxiv.org/pdf/1710.10903 + +Video link: https://youtu.be/VyIOfIglrUM?si=o4NHnET8-zbQGkA1 \ No newline at end of file diff --git a/summaries/GraphConvolutionNetwork.md b/summaries/GraphConvolutionNetwork.md new file mode 100644 index 0000000..3956732 --- /dev/null +++ b/summaries/GraphConvolutionNetwork.md @@ -0,0 +1,81 @@ +# SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS +Thomas N. Kipf, Max Welling, **ICLR** **2017** + +## Summary + +The paper introduces a new architecture, namely the Graph Convolutional Network (GCN). It is a scalable approach for semi-supervised learning on graph-structured data, essentially an efficient variant of convolutional neural networks that operates directly on graphs. The paper is motivated by the node-level classification task and justifies its architecture as a localized first-order approximation of spectral graph convolutions. Furthermore, the model is computationally very cheap compared to traditional spectral methods. The paper is highly mathematical and provides a theoretical explanation for the results. There are three perspectives from which you can understand this paper: + +a) You could see it as a novel architecture which is simple and scalable to large graphs. +b) The analogy between the model and the Weisfeiler-Lehman algorithm related to graph isomorphism. +c) How it can be derived/motivated from traditional spectral methods. 
+ +## Contributions + +The main contribution of the paper is that it introduces a simplified and computationally efficient variant of Graph Convolutional Networks (GCNs) for semi-supervised learning on graph-structured data. + +It is one of the most cited papers in the field of Graph Neural Networks (GNNs). It has inspired a wide range of subsequent research and applications in this domain, leading to the development of more advanced GCN variants and other types of graph neural networks like GAT and GraphSAGE. + +## Working + + + +1.) Encoding Features: The model encodes the feature vector of each node into a low-dimensional vector space. This is usually done with a single linear layer, i.e. a feed-forward layer without any activation function. + +2.) A CNN-inspired aggregation and update: + +Aggregation: Aggregate the encoded feature vectors of all 1-hop neighbors of the target node and of the node itself. + +Update: Pass this aggregated feature vector through an activation function like ReLU. + +3.) Recursion: Repeat this process for every node and stack the GNN layers, so that a node's representation after k layers depends on its k-hop neighborhood. + +4.) Final Representation: Finally, we get an encoded feature vector representation of each node which can be used for classification or regression tasks. + +Important details regarding the working of the architecture:-> + +a) Various types of aggregation functions have been proposed since this paper, such as max-pooling and LSTM. This paper simply sums the hidden vectors with normalization, so the aggregation in step 2 is a normalized sum. + +b) The update equation uses the matrix A+I, which adds self-loops so that the target node itself is included in the aggregation step, and D^(-1/2), the inverse square root of the degree matrix of A+I. D^(-1/2) is a diagonal matrix that normalizes the aggregated vector representation of each node. + +c) The loss function used in the paper is cross-entropy loss. 
The setup is semi-supervised and transductive: the loss is computed on the labeled nodes only, but the whole graph, including the unlabeled nodes, is used during training so that the model can incorporate the structural details of the graph. + +d) Differences between GCN and CNN: + +1) The unlabeled nodes might include nodes that are used at test time, so their features and edges (but not their labels) are seen during training. This is very different from computer vision tasks, where the test dataset is not used during the training process. + +2) Unlike CNNs, we don't have very deep models for GCNs. Most models stack only 2 layers, meaning their receptive field includes at most 2-hop neighbors and is very small compared to modern CNN architectures. + +e) The model does not use mini-batches: each epoch performs full-batch gradient descent on the entire training dataset. It also uses dropout. Full-batch training scales poorly to very large graphs, so later architectures such as GraphSAGE rely on neighbor sampling to train in mini-batches. + +## Results + + + +The key takeaways are :-> + +1.) The proposed architecture outperforms the methods it is compared against and is now a standard baseline for modern GNN models. It also proves to be computationally efficient while remaining architecturally simple. + +2.) They used dropout in both layers but L2 regularization in the first layer only. The exact details of dropout probability and regularization strength are given in the paper. + +3.) In the appendix section they showed that deeper models perform poorly. A depth of 2 or 3 layers is optimal, and depths greater than 7 work only with skip connections. Therefore, most GCN models use only two layers and hence two weight matrices. + + + +4.) Another important result discussed in the appendix is the analogy of the architecture with the Weisfeiler-Lehman algorithm. The main outcome of this analogy is that the model is a powerful feature extractor even without any training, i.e. with randomly initialized weights. The results of this are discussed in great detail in the appendix section. 
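The normalized aggregation from the Working section (A+I with D^(-1/2) on both sides) and the untrained feature-extractor observation in takeaway 4 can be sketched together in plain Python. This is a minimal sketch, not the paper's code or its experiment: the helper names, the 4-node path graph, the one-hot input features, the layer sizes, and the random seed are all illustrative assumptions.

```python
import math
import random


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense 0/1 adjacency matrix."""
    n = len(adj)
    # A + I: add self-loops so each node also aggregates its own features
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # D^(-1/2): inverse square root of each node's degree in A + I
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def gcn_layer(a_norm, H, W, activation=True):
    """One layer: normalized sum over neighbors, linear map, optional ReLU."""
    Z = matmul(matmul(a_norm, H), W)
    if activation:
        return [[max(0.0, v) for v in row] for row in Z]
    return Z


# Untrained two-layer model on an assumed 4-node path graph 0-1-2-3.
rng = random.Random(0)
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
W0 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # random, untrained
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # random, untrained

a_norm = normalized_adjacency(adj)
hidden = gcn_layer(a_norm, features, W0)
embeddings = gcn_layer(a_norm, hidden, W1, activation=False)
```

Each row of `embeddings` is a 2-dimensional representation of one node, computed from the graph structure alone with no training step. With two layers, a node's embedding depends only on nodes within two hops, which matches the receptive-field remark above.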
+ + + +## Two-cents + +The appendix of the paper is very insightful and discusses another perspective from which you could view the paper: the Weisfeiler-Lehman analogy mentioned in the summary section. It is very mathematical, however, and requires a deeper understanding of the Weisfeiler-Lehman test and graph isomorphism, which I have not covered in this summary. + +The paper overall is highly mathematical but provides a clear implementation method and code. It is therefore good for theoretical work, as it provides every mathematical detail, and also quick to use in applications related to node classification, thanks to the clear implementation given in the paper. + + +## Resources + +Paper link: https://arxiv.org/pdf/1609.02907 + +Video link: https://youtu.be/uFLeKkXWq2c?si=03-MxYlP00qOcXLk + +