headtail demo

We will demonstrate example usage of the headtail algorithm implementation. We'll start by importing the module.

In [1]:
import headtail as ht
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

We will use the as-Skitter graph from the SNAP database as an example. Since we will be computing over the graph as a stream of edges, we'll provide the number of nodes, n, and the number of edges, m, to set our probabilities for sampling.

In [3]:
#as_Skitter size
n, m = 1696415, 11095298

And run the algorithm. Here we will allow about 17K samples for each each of the head and the tail estimator, for a total sample space of less than $2\%$ of the graph nodes.

In [4]:
ccdh_est = ht.headtail_stream('as-skitter.txt', n, m, eps=0.5, shead=0.01*n, stail=0.01*n)

Now we will compare to the true ccdh.

In [5]:
ccdh_true = np.loadtxt('skitter_ccdh.txt')
ccdh_true = ht.list_to_dic(ccdh_true)
In [6]:
fig = plt.figure(figsize=(12,9))
plt.loglog([d for d in ccdh_est], [ccdh_est[d] for d in ccdh_est], 'rx')
plt.loglog([d for d in ccdh_true], [ccdh_true[d] for d in ccdh_true], 'k-')
plt.xlabel('Degree', fontsize=22)
plt.ylabel('ccdh', fontsize=22)
plt.title('ccdh estimate for the skitter graph', fontsize=22)
plt.legend(('Estimated ccdh values', 'Actual ccdh values'), fontsize=22)
plt.xticks(fontsize=22)
plt.yticks(fontsize=22)
Out[6]:
(array([  1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
          1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
          1.00000000e+05,   1.00000000e+06,   1.00000000e+07,
          1.00000000e+08]), <a list of 10 Text yticklabel objects>)

We will also compute the Relative Hausdorff distance.

In [10]:
min_err = 100
for degerr in np.arange(0.01, 0.1, 0.01):
    freq_err = ht.rel_hausdorff(ccdh_true, ccdh_est, degerr, pointwise_max=False)
    max_err = max( degerr, max([point[1] for point in freq_err[0]]), max([point[1] for point in freq_err[1]]) )
    ## min of the max
    print 'epsilon:', degerr, 'max(epsilon, delta):', max_err
    if max_err < min_err:
        min_err = max_err
print '\n\tRH distance:', min_err
summative RH distance: 1.0
epsilon: 0.01 max(epsilon, delta): 1.0
summative RH distance: 1.0
epsilon: 0.02 max(epsilon, delta): 1.0
summative RH distance: 0.5
epsilon: 0.03 max(epsilon, delta): 0.5
summative RH distance: 0.5
epsilon: 0.04 max(epsilon, delta): 0.5
summative RH distance: 0.0902302426882
epsilon: 0.05 max(epsilon, delta): 0.0902302426882
summative RH distance: 0.0754511512134
epsilon: 0.06 max(epsilon, delta): 0.0754511512134
summative RH distance: 0.07
epsilon: 0.07 max(epsilon, delta): 0.07
summative RH distance: 0.08
epsilon: 0.08 max(epsilon, delta): 0.08
summative RH distance: 0.09
epsilon: 0.09 max(epsilon, delta): 0.09

	RH distance: 0.07

In [11]:
err = ht.rel_hausdorff(ccdh_true, ccdh_est, 0.07)
summative RH distance: 0.07

Finally, we can plot the distance at each degree.

In [12]:
fig = plt.figure(figsize=(12,9))
plt.semilogx([d for d in err], [err[d] for d in err], 'rx', markersize=10, markeredgewidth=2)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$\delta$-distance for $\epsilon = 0.1$', fontsize=22)
plt.ylim(ymax=0.2)
plt.xticks(fontsize=22)
plt.yticks(fontsize=22)
Out[12]:
(array([ 0.  ,  0.05,  0.1 ,  0.15,  0.2 ]),
 <a list of 5 Text yticklabel objects>)
In []: