Traditional rules-based and machine learning methods typically operate on individual transactions or entities in isolation. This limitation fails to account for how transactions are connected within the wider network. Because fraudsters often operate across multiple transactions or entities, fraud may go undetected.

By analyzing graphs, we can capture dependencies and patterns between both direct neighbors and more distant connections. This is especially important for detecting money laundering, where funds are moved through multiple transactions to obscure their origin. GNNs can illuminate the dense subgraphs that laundering schemes construct.

An example of a related-party transfer network: using a GNN to detect financial fraud based on a network of related-party transactions.

Message passing frameworks

As with other deep learning methods, the goal is to learn a representation, or embedding, of the data. In GNNs, these node embeddings are constructed using a message passing framework: messages are passed iteratively between nodes, enabling the model to learn both the local and global structure of the graph. Each node's embedding is updated based on an aggregation of its neighbors' features.

In its general form, the framework works like this:

  • Initialization: Each node's embedding h_v(0) is initialized with feature-based embeddings, random embeddings, or pre-trained embeddings (such as word embeddings of account names).
  • Message passing: At every layer t, nodes exchange messages with their neighbors. A message is a function of the sender node's features, the receiver node's features, and the features of the edge connecting them. The combination can use a fixed weighting scheme (as in Graph Convolutional Networks, GCNs) or attention weights learned from sender and receiver properties, and optionally edge properties (as in Graph Attention Networks, GATs).
  • Aggregation: After the message passing phase, each node aggregates the messages it received, typically with a simple function such as mean, max, or sum.
  • Update: The aggregated messages update the node's embedding via an update function, for example an MLP (multilayer perceptron) with a nonlinearity such as ReLU, a GRU (gated recurrent unit), or an attention mechanism.
  • Finalization: As with other deep learning methods, embeddings are finalized when the representations stabilize or a maximum number of iterations is reached. (A minimal code sketch of one message-passing round follows the figure below.)
Update of node representation in a Message Passing Neural Network (MPNN) layer. A node receives messages from all of its nearest neighbors; messages are computed via the message function, which accounts for properties of both the sender and the receiver. Graph neural network. (2024, November 14). In Wikipedia. https://en.wikipedia.org/wiki/Graph_neural_network
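To make this concrete, here is a minimal sketch of a single message-passing round in plain NumPy, assuming a toy directed edge list and randomly initialized weights. The weight matrices (W_msg, W_upd) and the mean aggregation are illustrative choices, not from any particular library or paper.

```python
# One round of message passing on a toy graph, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

num_nodes, dim = 5, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # directed sender -> receiver
H = rng.normal(size=(num_nodes, dim))               # h_v(0): initial node embeddings

W_msg = rng.normal(size=(2 * dim, dim))             # message function weights
W_upd = rng.normal(size=(2 * dim, dim))             # update function weights

def relu(x):
    return np.maximum(x, 0.0)

# Message: a function of sender and receiver features (edge features omitted here).
messages = {v: [] for v in range(num_nodes)}
for sender, receiver in edges:
    m = relu(np.concatenate([H[sender], H[receiver]]) @ W_msg)
    messages[receiver].append(m)

# Aggregation: mean of incoming messages (sum or max would work as well).
# Update: combine the old embedding with the aggregate through a small MLP.
H_next = H.copy()
for v in range(num_nodes):
    if messages[v]:
        agg = np.mean(messages[v], axis=0)
        H_next[v] = relu(np.concatenate([H[v], agg]) @ W_upd)
```

Stacking several such rounds is what lets each node's embedding absorb information from progressively larger neighborhoods.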

After learning the node embeddings, the fraud score can be calculated in a few different ways:

  • Classification: the final embedding is passed to a classifier such as a multilayer perceptron, which requires a comprehensively labeled historical training set.
  • Anomaly detection: an embedding is scored by how much it deviates from the others. Distance-based scores or reconstruction errors can be used here for unsupervised approaches (see the sketch after this list).
  • Graph-level scoring: node embeddings are pooled over subgraphs and then fed into classifiers to detect fraud rings (again, a labeled historical dataset is required).
  • Label propagation: a semi-supervised approach where label information is propagated along edge weights or graph connectivity to make predictions for unlabeled nodes.
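As a concrete illustration of the unsupervised route, here is a minimal sketch of distance-based anomaly scoring over learned embeddings. The distance-to-centroid score and the three-sigma threshold are illustrative assumptions, not a prescription from any specific fraud system.

```python
# Unsupervised anomaly scoring on node embeddings produced by a trained GNN.
import numpy as np

def anomaly_scores(embeddings: np.ndarray) -> np.ndarray:
    """Distance-to-centroid score; larger means more anomalous."""
    centroid = embeddings.mean(axis=0)
    return np.linalg.norm(embeddings - centroid, axis=1)

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 16))          # stand-in for embeddings from a trained GNN
H[7] += 6.0                             # inject one obvious outlier account

scores = anomaly_scores(H)
threshold = scores.mean() + 3 * scores.std()   # simple three-sigma cutoff
flagged = np.where(scores > threshold)[0]
print("flagged accounts:", flagged)     # should include account 7
```

In practice, more robust variants (k-nearest-neighbor distances, reconstruction error from a graph autoencoder) follow the same pattern: score each node by how far it sits from the rest of the embedding space.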

Now that we have a basic understanding of GNNs for a familiar problem, we can turn to another application of GNNs: predicting protein functions.

We have seen great progress in protein structure prediction with AlphaFold 2 and 3, and in protein design with RFdiffusion. However, the prediction of protein function remains challenging. Function prediction is important for many reasons, but it is particularly important in biosecurity for predicting whether a DNA sequence is pathogenic before it is synthesized. Traditional methods like BLAST rely on sequence similarity searches and do not exploit any structural data.

Today, GNNs are beginning to make significant progress in this area by exploiting graph representations of proteins to model the relationships between residues and their interactions. Beyond predicting protein function, they are well suited to identifying binding sites for small molecules or other proteins and to classifying enzyme families based on active-site geometry.

In a typical formulation:

  • Nodes represent amino acid residues.
  • Edges represent the interactions between them.

The rationale behind this approach is a graph's inherent ability to capture long-range interactions between residues that are distant in sequence but close in the folded structure. This is similar to why the transformer architecture was so helpful for AlphaFold 2: it allows parallel computation across all residue pairs in a single pass.

To make the graph information-rich, each node can be enriched with attributes such as residue type, chemical properties, or evolutionary conservation scores. Edges can optionally be enriched with properties such as the type of chemical bond, proximity in 3D space, and electrostatic or hydrophobic interactions.
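As a rough sketch of what such node enrichment might look like, the snippet below builds a per-residue feature matrix from a one-hot residue type plus a conservation score. The toy sequence and the conservation values are made up purely for illustration.

```python
# Assembling per-residue node features: one-hot amino acid type + conservation.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def node_features(sequence: str, conservation: np.ndarray) -> np.ndarray:
    """One-hot residue type concatenated with a per-residue conservation score."""
    one_hot = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for i, aa in enumerate(sequence):
        one_hot[i, AA_INDEX[aa]] = 1.0
    return np.concatenate([one_hot, conservation[:, None]], axis=1)

seq = "MKTAYIA"                                # toy 7-residue sequence
cons = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.5, 0.7])   # toy conservation scores
X = node_features(seq, cons)                   # shape (7, 21)
```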

DeepFRI is a GNN approach to predicting protein function from structure (specifically, a graph convolutional network (GCN)). A GCN is a variant of GNN that extends the idea of convolution (as used on grid data in CNNs) to graphs.

DeepFRI diagram: (A) an LSTM language model, pre-trained on ~2 million Pfam protein sequences, is used to extract residue-level features of PDB sequences. (B) a GCN with three graph convolutional layers learns complex structure-to-function relationships. From Structure-based protein function prediction using graph convolutional networks.

In DeepFRI, each amino acid residue is a node enriched with attributes such as:

  • Amino acid type
  • Physicochemical properties
  • Evolutionary information from multiple sequence alignments (MSAs)
  • Sequence embedding from a pre-trained LSTM
  • Structural context, such as solvent accessibility

Each edge is defined to capture a spatial relationship between amino acid residues in the protein structure: an edge exists between two nodes (residues) if the distance between them is below a threshold, usually 10 Å. In this application, edges have no properties of their own, acting as unweighted connections.
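A minimal sketch of this edge definition might look as follows, computing a binary contact map from, for example, C-alpha coordinates with a 10 Å cutoff. The random coordinates are a stand-in for a structure parsed from a PDB file.

```python
# Building unweighted contact-map edges from 3D residue coordinates.
import numpy as np

def contact_map(coords: np.ndarray, cutoff: float = 10.0) -> np.ndarray:
    """Binary adjacency matrix: 1 if residues i and j are within `cutoff` Å."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)       # pairwise residue distances
    A = (dist < cutoff).astype(float)
    np.fill_diagonal(A, 0.0)                   # no self-edges
    return A

rng = np.random.default_rng(2)
coords = rng.uniform(0, 30, size=(7, 3))       # toy C-alpha coordinates in Å
A = contact_map(coords)                        # unweighted edges, as described above
```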

The graph's node features are initialized with the sequence embeddings generated by the LSTM, residue-specific features, and connectivity information derived from the residue contact map.

Once the graph is defined, message passing takes place along the adjacency structure at each of the three layers: node features are aggregated from neighbors using the graph's adjacency matrix. Stacking multiple GCN layers allows the embeddings to progressively incorporate information from larger neighborhoods, starting with direct neighbors and spreading to neighbors of neighbors, and so on.
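The propagation rule being described is essentially the standard GCN update in the style of Kipf and Welling, H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W). A minimal NumPy sketch of three stacked layers, with random placeholder weights rather than DeepFRI's trained parameters, might look like this:

```python
# Three stacked GCN layers with symmetric adjacency normalization.
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(3)
A = (rng.random((7, 7)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric toy adjacency matrix
H = rng.normal(size=(7, 21))                   # node features (e.g., X from above)

# Each stacked layer widens the receptive field by one hop in the graph.
for W in [rng.normal(size=(21, 32)),
          rng.normal(size=(32, 32)),
          rng.normal(size=(32, 32))]:
    H = gcn_layer(A, H, W)
```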

The final node embeddings are globally pooled to form a protein-level embedding, which is then used to classify proteins into hierarchically organized functional classes (GO terms). Classification is performed by passing the protein-level embedding through fully connected (dense) layers with sigmoid activation functions, optimized with a binary cross-entropy loss. The models are trained on protein structures (e.g., from the Protein Data Bank) and functional annotations from databases such as UniProt or the Gene Ontology.
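A minimal sketch of this readout, with global sum pooling, a single dense layer, sigmoid outputs (one per GO term), and a binary cross-entropy loss, might look like the following. The shapes, weights, and label vector are illustrative, not DeepFRI's actual configuration.

```python
# Protein-level readout: pool node embeddings, then multi-label classification.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y_true, y_pred, eps=1e-9):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

rng = np.random.default_rng(4)
H = rng.normal(size=(7, 32))                   # final node embeddings from the GCN
protein_emb = H.sum(axis=0)                    # global pooling to one vector

num_go_terms = 5                               # toy number of GO term labels
W_out = rng.normal(size=(32, num_go_terms))
probs = sigmoid(protein_emb @ W_out)           # independent per-term probabilities

y_true = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # toy multi-label GO annotation
loss = binary_cross_entropy(y_true, probs)
```

Sigmoid outputs with binary cross-entropy are the natural choice here because a protein can carry several GO terms at once, so each label is predicted independently rather than via a softmax.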

  • Graphs are useful for modeling many nonlinear systems.
  • GNNs capture the relationships and patterns that traditional methods struggle to model by incorporating both local and global information.
  • There are many variations of GNNs but the most important (currently) are graph convolutional networks and graph attention networks.
  • GNNs can be efficient and effective in identifying multi-hop relationships in money laundering schemes using supervised and unsupervised methods.
  • GNNs can improve on sequence-based protein function prediction tools such as BLAST by incorporating structural data. This enables researchers to predict the functions of novel proteins with minimal sequence similarity to known ones, an important step in understanding biosecurity risks and enabling drug discovery.

Cheers, and if you liked this post, check out my other articles on machine learning and biology.


