Friendster.com's algorithm Hello. I would like to ask if anyone can tell me how friendster's search algorithm works? If everyone had only 20 contacts, it would take 20^4 searches to find someone 4 levels deep using the simplest algorithm.  Are there better ways to do that?  How should the database be designed? Thanks Alex Alex Wednesday, September 10, 2003 http://www.zephoria.org/ , that should get you started... Prakash S Wednesday, September 10, 2003 Be aware that 20^4 = 160,000 is a very small number; Assuming an in-memory database, and without a single memory cache hit (i.e., 200 cycles per memory access),  that would put you at 32 million cycles. Your 1.6Ghz Pentium can do 50 of these per second, give or take a few. A disk database would be much, much slower - but a properly configured database on decent hardware can deliver tens of thousands of results per second. You can do much better if your data is layed out coherently in memory, so that there are not many cache misses. And if you want to go to any depth, you can use e.g. an offline Floyd-Warshall algorithm that computes distances from any person to any person; How you'd store that result is a different question. Ori Berger Wednesday, September 10, 2003 Off the top of my head, here: you make a square 2d matrix of ones and zeros where 1 means "connected" and 0 means "not connected". Then you multiply this matrix by itself to get all the people connected by one degree of separation. Do it again to get 2 degress of separation, and so on. It's actuall quite basic Linear Algebra... pick up a Linear Algebra textbook and read a chapter or two and you'll get it. Joel Spolsky Wednesday, September 10, 2003 Joel. Thanks for your suggestion. Alex Wednesday, September 10, 2003 The discussion of shortest-path algorithms in "Introduction to Algorithms" by Cormen, Leiserson and Rivest is very good, though possibly rather heavy going for those who don't enjoy mathematics. Floyd-Warshall (mentioned above, and closely related to the matrix-multiplication approach Joel suggests) will get you the shortest paths from everywhere to everywhere else. I think that's overkill in this application. F-W takes time of order n^3 where n is the total number of vertices in the graph -- in this case, the total number of Friendster users. That sounds prohibitive to me. Dijkstra (not too different from breadth-first search) will get you the shortest paths from one place to all others; it takes time of order n log n + m where n is the number of vertices (users) and m is the number of edges (pairs of people who are friends). That might be too slow for typical queries, still, in which case you want to move to the straightforward fixed-depth search obviously envisioned by the original poster. You can trade off space against time a little. Store everyone's "second-order neighbours" (i.e., friends of friends) and then the depth you need to search to is halved in exchange for what's probably a modest cost in space and a slight slowing-down of (presumably relatively rare) modifications to the friendship graph. Of course the branching factor is increased, so this isn't as good as it might sound; but if it's common for people to be friend-of-friend in multiple ways then it might be worth while. The key point, I think, is Ori's: the space you're searching just isn't all that big :-). Gareth McCaughan Wednesday, September 10, 2003 I should mention that getting a running time of n log n + m from Dijkstra's algorithm requires some rather hairy data structures. A simpler approach will give you something more like m log n, which is worse by a factor of about log n when m >> n. (But the implied constant of proportionality will be less, so in realistic cases the simpler approach may be better.) Gareth McCaughan Wednesday, September 10, 2003 Maybe the Friendster people should be reading this thread. Based on my observations of how quickly Friendster.com operates, I assume their algorithm is some sort of linear search, with moderatly intelligent chimpanzees recording the results. The chimps then schedule a meeting to compile their results. These results are then FedEx'd to my ISP, which sends me the page. !El gato es muy gordo! Wednesday, September 10, 2003 My question would be whether you're trying to verify connected(A,B) or to find all B where connected(A,B)? In the former case, you can represent this as an N-way join where N is the depth of connectedness.  In other words, if you're looking for 1 degree of separation, it's a single table.  If you want two degrees, it's a two-table join, and so on. If "connected()" is a symmetrical relationship (i.e. connected(A,B) == connected(B,A)), then you probably want to standardize the direction of links in the DB so that all links go one direction, i.e. from the lower userid to the higher userid.  Doing this cuts down on the space used and the query time. If you're trying to find all B where connected(A,B), the same approach would work, but you might want to implement some of the other tricks people here have mentioned, such as pre-computing and caching.  One way to do this is to have a table with three columns: from, to, and depth.  If A is connected to B at depth 1, then you have (A,B,1).  If B is connected to C at depth 1, you add (B,C,1) and (A,B,2).  This increases your table by a factor of N, where N is the average depth value.  In other words, if you have 100,000 basic entries (at depth 1), and the average depth is 3, then you'll need 300K rows. Notice that you can't do this caching approach properly unless you remove the symmetry as mentioned earlier.  That is, given a pair of ID's, entries are only made in one direction (e.g. low ID -> high ID). This approach requires only linear space per edge in the overall graph, and with a properly indexed table can check for a given connected(A,B) relationship with a single BTree access.  It can also perform two index scans to list all people related to a given person, at any desired depth(s).  (One scan to list those with lower IDs, and another to list those with higher IDs.) Phillip J. Eby Wednesday, September 10, 2003 Actually, what I suggested for the multilevel caching doesn't really work.  A fourth column would be needed, a "reference count" of sorts, to indicate the number of routes between two points that are of the same length.  Otherwise, you have no way to garbage collect when entries are removed. Further, my way of estimating the number of rows required was based on a "tree" mental model rather than a "graph" mental model; the actual number of entries is actually more like N x M x P where N is the average depth and M is the average total number of connections a person has, and P is the number of people in the DB.  In other words, a *lot* of rows.  :) Phillip J. Eby Thursday, September 11, 2003   Fog Creek Home