Analysis of Sampling Algorithms

Sampling algorithms are often used to reduce the complexity of analyzing very large graphs, when crawling subsets of large datasets, or to determine the fastest way to propagate information in a network (the 'maximum coverage problem').

To analyze and compare different sampling algorithms, we translate the process of sampling a (complete) network over time into a growing dynamic graph, i.e., starting from an empty graph, visited nodes as well as their edges to already visited nodes are added. Thereby, a dynamic graph is grown that equals the original network after visiting all nodes.

General Sampling Approach

When sampling from an existing graph, the procedure is determined by the following aspects: Where to start, how to proceed, how long to proceed, and when to stop.

  • start node selection: how the first (bootstrap) node is selected
  • neighbor selection: which nodes to visit next (a.k.a. the sampling algorithm)
  • total cost: total number of visited nodes (cost of 1 per visit)
  • cost per batch: step size of visited nodes during analysis
  • stop condition: stop when all nodes are visited or seen
Basic Sampling Algorithms

In our analysis, we considered the following 7 basic sampling algorithms. We call them basic because they are rather simple, well known, and have no parameters.

  • BFS: breadth-first sampling
  • DFS: depth-first sampling
  • RANDOM_WALK: allneighbors have same probability to be selected (nodes can be revisited)
  • RANDOM_WALK_NR: random walk, non-revisiting
  • UNIFORM: nodes are sampled Uniformly at random
  • GREEDY_ORACLE: greedy oracle
  • MOD: maximum observed degree
Extended Basic Sampling Algorithms

We extended some basic algorithms to investigate the influence of small changes to their behavior (e.g., when hitting a dead-end).

  • DFS_JUMP: jump to the next node when hitting a dead-end
  • DFS_RANDOM: randomized neighborhood order
  • DFS_RANDOM_JUMP: randomized neighborhood order with jumping
  • RANDOM_WALK_NR_JUMP: random walk non-revisiting jumping at dead-ends
Parametrized Sampling Algorithms

Other sampling algorithms have parameters and thereby a more complex behavior depending on the sampled network and the selected parameters.

  • FOREST_FIRE: neighbors selected (FIFO) with probability p (BFS: p=1)
  • FOREST_FIRE_NR: forest fire, non-revisiting
  • FRONTIER_SAMPLING: random walk with l walkers, each round highest degree neighbor is selected
  • RESPONDENT_DRIVEN: put l neighbors in FIFO queue, non-revisiting (RWnr: l=1)
  • SNOWBALL: visit max of l neighbors in FIFO order, non-revisiting (BFS: l=high, RWnr: l=1)
Analysis results:

As examples, we investigated the following scenarios:

  • Maximum Connected Network Coverage Problem
  • Visualization of Sampling algorithms
  • Graph-theoretic Properties during Sampling
Maximum Connected Network Coverage ProblemMaximum Connected Network Coverage Problem
The Maximum Connected Network Coverage (MCNC) Problem is to find a way (a sampling algorithm) to explore a network such that all nodes are seen as soon as possible.
Visualization of Sampling algorithmsVisualization of Sampling algorithms
To understand how sampling algorithms work, a visualization of the network they create is very helpful and help to distinguish the different algorithms.
Walking TypeWalking Type
MCNC results depending on walking type
Graph Properties measured during SamplingGraph Properties measured during Sampling
Monotone Sampling of NetworksMonotone Sampling of Networks
Tim Grube, Benjamin Schiller, Thorsten Strufe
In: Proceedings of the 2nd International Workshop on Dynamic Networks and Knowledge Discovery (DyNaK 2014)
Dynamic Properties of Network SamplesDynamic Properties of Network Samples
(Bachelor Thesis) Benedict Jahn
Sampling-based Network AnalysisSampling-based Network Analysis
(Master Thesis) Tim Grube
Subsampling of Complex NetworksSubsampling of Complex Networks
(Master Thesis) Kai Rathmann