The model
The PageRank approach is based on the following model: an Internet user has a complete list of all web pages and visits them using one of two possible methods:
- he randomly chooses any available web page
- he follows one of the links available on the page he is currently visiting
Note that the first method prevents the user from getting stuck whenever he visits a web page without any links, and from being confined to a fixed set of web pages that do not provide links to pages outside that family. Hence the user can continue to "surf" the Internet indefinitely and is able, at any moment, to decide to visit any existing web page.
To build a mathematical model to describe this "surfing" of the Internet, we need some assumptions:
- The probability that a user follows one of the links available on the currently visited page, instead of just randomly choosing some existing web page, is constant and denoted by \(p\), where, of course, \(0<p<1\). A common choice for this parameter is \(p=0.85\). In order to draw useful conclusions from this model, we have to assume that the choice of the next web page to visit is, with large probability, not random; that is, it is convenient to pick \(p\) relatively close to \(1\), but not too close. Otherwise the usefulness of the model could be compromised and, moreover, the computational part of the method would become too slow (see "convergence").
- If the user decides to randomly visit any existing web page, he chooses any one of them with equal probability. Hence, assuming the Internet has \(n\) available pages, the probability of picking any one of them is \(\frac{1}{n}\).
- If the user decides to follow one of the links available in his current web page, he chooses among them with equal probability. Hence, assuming the current web page, identified as \(j\), includes \(s_{j}\) links, the probability of selecting any of those links is \(\frac{1}{s_{j}}\).
So, assuming the user is visiting the web page identified as \(j\), what is the probability, denoted by \(p_{ij}\), that the next visited page is the one identified as \(i\)? If page \(j\) includes no links, then the next page is chosen randomly among all the available pages, so the probability of choosing page \(i\) is \[p_{ij}=\frac{1}{n}.\]
However, if page \(j\) includes some links, then the next page may either be chosen randomly among all existing pages or be a page to which one of the existing links points. Thus, we have:
- if page \(i\) is not the destination of any of the links on page \(j\), the only way to get to \(i\) is through random choice, hence \[p_{ij}=\left(1-p\right)\frac{1}{n}.\]
- if page \(i\) is the destination of one of the links included in page \(j\), it may be reached in either of the two ways, hence \[p_{ij}=\left(1-p\right)\frac{1}{n}+p\frac{1}{s_{j}}.\]
For a general representation, we may write
\[p_{ij}=\left(1-p\right)\frac{1}{n}+p\frac{g_{ij}}{s_{j}},\] where \(g_{ij}\) is equal to \(1\) if there exists a link from page \(j\) pointing towards page \(i\), and equal to \(0\) otherwise. (For a page \(j\) without links, \(s_{j}=0\), this expression does not apply and we keep \(p_{ij}=\frac{1}{n}\), as above.) It is clear that, keeping the index \(j\) fixed and letting the index \(i\) run from \(1\) to \(n\), the sum of all the corresponding values of \(g_{ij}\) equals the total number of links that page \(j\) includes, that is, \({\displaystyle \sum_{i=1}^{n}g_{ij}=s_{j}}\).
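To make the construction concrete, here is a minimal Python sketch (the names `build_transition_matrix` and `G`, and the default value of `p`, are illustrative choices, not part of the original description) of how the probabilities \(p_{ij}\) could be assembled from a 0/1 link matrix, including the fallback \(p_{ij}=\frac{1}{n}\) for pages without links:

```python
# Illustrative sketch: build P = (p_ij) from a 0/1 link matrix G,
# where g_ij = 1 means page j links to page i.
import numpy as np

def build_transition_matrix(G, p=0.85):
    """Return P with p_ij = (1 - p)/n + p * g_ij / s_j.

    Columns j with no outgoing links (s_j = 0) fall back to the purely
    random choice, p_ij = 1/n, as described above.
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    s = G.sum(axis=0)                 # s_j = number of links on page j
    P = np.full((n, n), (1 - p) / n)  # contribution of the random choice
    linked = s > 0                    # pages that do have outgoing links
    P[:, linked] += p * G[:, linked] / s[linked]
    P[:, ~linked] = 1.0 / n           # pages without links: uniform choice
    return P

# Tiny example: page 1 links to pages 2 and 3, page 2 links to page 3,
# page 3 has no links at all.
G = [[0, 0, 0],
     [1, 0, 0],
     [1, 1, 0]]
P = build_transition_matrix(G)
print(P.sum(axis=0))  # every column sums to 1, as required
```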
Mathematicians call a system with finitely many possible states, which at any moment can only be in one state, and in which changing from one state to another depends exclusively on those two states, a Markov chain. For the PageRank algorithm, we may take as possible states the family of available web pages on the Internet, the changing mechanism being the "surfing" through pages according to the model described above. Hence, we are facing a Markov chain, as the probability of moving from state \(j\) (visiting page \(j\)) to state \(i\) (visiting page \(i\)) is given by \(p_{ij}\) defined above, which obviously depends only on the indices \(i\) and \(j\).
We should note that, even knowing which web page the user used to start his Internet "surfing", it is not possible to determine the page he will be visiting at some given moment. However, we can describe the probability that, after visiting \(m\) pages, he will be visiting page \(i\). We shall denote this probability by \(x_{i}^{\left(m\right)}\). Moreover, knowing these probabilities for a given \(m\), we can compute the probability that the \((m+1)\)-th visited page is \(i\): \[x_{i}^{\left(m+1\right)}=p_{i1}x_{1}^{\left(m\right)}+p_{i2}x_{2}^{\left(m\right)}+\dots+p_{in}x_{n}^{\left(m\right)}=\sum_{j=1}^{n}p_{ij}x_{j}^{\left(m\right)},\quad\forall m\in\mathbb{N}_{0},\] as each term \(p_{ij}x_{j}^{\left(m\right)}\) represents the probability that, after having visited \(m\) pages, the user is on page \(j\) and decides to move to page \(i\). Define now:
- \(x^{\left(m\right)}=\left(x_{1}^{\left(m\right)},x_{2}^{\left(m\right)},...,x_{n}^{\left(m\right)}\right)\)
- \(P=\left(p_{ij}\right)_{1\leq i,j\leq n}\).
The above formula may then be written as \[x^{\left(m+1\right)}=Px^{\left(m\right)},\quad\forall m\in\mathbb{N}_{0},\] hence it follows that \[x^{\left(m\right)}=P^{m}x^{\left(0\right)},\quad\forall m\in\mathbb{N}_{0},\] where \(x^{\left(0\right)}=\left(x_{1}^{\left(0\right)},x_{2}^{\left(0\right)},...,x_{n}^{\left(0\right)}\right)\) is a vector whose coordinate \(x_{i}^{\left(0\right)}\) represents the probability that the user starts his "surfing" on web page \(i\). Thus \(x^{\left(0\right)}\) is a probability vector, that is, one whose entries are nonnegative and sum up to \(1\).
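As an illustration, here is a short Python sketch of the iteration \(x^{\left(m+1\right)}=Px^{\left(m\right)}\), reusing the matrix `P` built in the previous sketch (the function name `surf` and the starting vector are, again, just illustrative assumptions):

```python
# Illustrative sketch of the iteration x^(m+1) = P x^(m);
# P is the matrix built by build_transition_matrix above.
import numpy as np

def surf(P, x0, m):
    """Return x^(m) = P^m x^(0), computed step by step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(m):
        x = P @ x                 # one more page visit
    return x

x0 = np.array([1.0, 0.0, 0.0])    # the user surely starts on page 1
print(surf(P, x0, 1))             # distribution after one step
print(surf(P, x0, 50))            # already very close to the limiting vector
```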
The entries of the matrix \(P\) are all positive (each one is at least \(\left(1-p\right)\frac{1}{n}>0\)), so the matrix is regular and, as a consequence of the Perron-Frobenius Theorem, the sequence of probability vectors \(x^{\left(m\right)}\) converges to a probability vector \(x=\left(x_{1},x_{2},...,x_{n}\right)\) that is independent of the initial state probability vector \(x^{\left(0\right)}\). Looking at \(P\) as a mapping, it defines a continuous transformation, which implies that \(x\) is a fixed point of \(P\); in fact, it is the unique probability vector solving the equation \(Pz=z\).
Remark that, even without knowing the initial state probability vector \(x^{\left(0\right)}\) (that is, having no information about the web page where the user started "surfing" the Internet), we know that the probability of the user being on page \(i\) gets closer and closer to \(x_{i}\), since \(x_{i}^{\left(m\right)}\rightarrow x_{i}\) for every \(i\in\left\{ 1,2,...,n\right\} \). This means that we may look at \(x_{i}\) as the probability that, in the long run, the user will be visiting web page \(i\).
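This independence from the starting point can be checked numerically. The following sketch (again assuming the matrix `P` from the snippets above) iterates from two randomly chosen initial probability vectors and verifies that both approach the same vector \(x\), which satisfies \(Px=x\):

```python
# Numerical check (illustrative only): two different initial probability
# vectors lead to the same limit x, and that limit is a fixed point of P.
import numpy as np

rng = np.random.default_rng(0)
n = P.shape[0]

for _ in range(2):
    x = rng.random(n)
    x /= x.sum()                  # a random initial probability vector x^(0)
    for _ in range(200):
        x = P @ x                 # x^(m+1) = P x^(m)
    print(x)                      # the same limit, whatever the start

print(np.allclose(P @ x, x))      # x solves Pz = z: prints True
```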
Is there any connection between \(x_{i}\) and the PageRank value for the web page \(i\)?