I am Yuanyuan Tian from IBM Research. Today, in this talk I will focus on one recent project on scalable topic-specific influence analysis. I chose to talk about this project because it touches on both the analytics and the systems aspects of research, so it is a good representative of the kind of work we are doing in the Information Management group at IBM Almaden.
Microblogging services, such as Twitter (twitter.com), have gained tremendous popularity in recent years, and a large amount of microblog data has accumulated over time. According to a March 2012 report, Twitter had over 500 million users creating over 340 million tweets daily. Note that the data contains both textual content, reflected in the tweets, and social relationship information, reflected in the follower/followee relationships. The rich text and social information in microblogs has become a popular resource for marketing campaigns to monitor the opinions of consumers on particular products and to launch viral advertising. Identifying key influencers in microblogs is required for such marketing activities.
Although a lot of work has been done on social influence analysis, most of these studies infer influence only from the network structure, ignoring the valuable text content that the users created. One of the best-known studies is the influence maximization work by Jon Kleinberg et al. As a result, the learned influence of each user is only global, with no way to assess influence in a particular aspect of life (topic). For example, no one can deny that President Obama is a key influencer in general, but if you want to advertise a database product, he is unlikely to be influential, whereas a database expert, who would not be identified as a general key influencer, probably has more say on the subject. So, what we want is topic-specific influence analysis that can differentiate influence across different aspects of life, or different topics. To do that, we need to analyze not just the network structure but also the valuable textual content.
A number of PageRank-based methods, such as Topic-Sensitive PageRank [13] and TwitterRank [25], can compute per-topic rank scores, but they require a separate process to first create the topics from the text and then, for each topic, apply influence analysis using the network structure. Since content and links in a microblog network are related to each other, separating the analysis of content from the analysis of the network structure usually leads to inferior performance. A few existing approaches, such as Link-LDA, can analyze content and network together. However, they were all designed for citation and hyperlink networks and assume that links are solely caused by content. This assumption clearly does not apply to microblogs, since it is prevalent for a user to follow celebrities simply because of their fame and stardom, with nothing to do with what he/she actually tweets about.
The goal of this work is to support searching for topic-specific key influencers on microblogs. Ultimately, we want to provide a search framework where a user can simply type in keywords to express a topic of interest, or a combination of topics, and the search engine returns a ranked list of users who are influential in the corresponding subjects. To do that, we need to first correctly model topic-specific influence in microblog networks, and then learn the influence efficiently. These two problems are the focus of this talk. I will also briefly describe how to put everything together in a search framework for topic-specific influencers.
To meet the computational challenge posed by rapidly growing microblog data, we propose a distributed Gibbs sampling algorithm for the FLDA model. We then incorporate the proposed method in a general search framework for topic-specific influencers. After that, I will present some experimental results. Now, let's first talk about the new FLDA model.
Before I go into the details of the topic-specific model, let me first provide the intuition behind it. In a microblog network, each user has both content, which is a bag of words, and link structure, which is a set of followees. A user tweets in multiple topics: based on her content, it looks like Alice likes to tweet about technology and food. A topic, in turn, can be viewed as a mixture of words; e.g., the words "web" and "cloud" are likely to appear in the technology topic. As for the relationships among users in a microblog network, there are different reasons why a user follows another. Sometimes a followship is content-based, because the followee tweets on similar topics. Other times it is completely content-independent: it is very prevalent for a user to follow a celebrity just because of fame and stardom, with nothing to do with what the follower tweets about. Finally, each topic has a mixture of followees. In other words, given a topic, some users are more likely to be followed than others; for example, Mark Zuckerberg is more likely to be followed for the technology topic. We measure topic-specific influence by the probability of a user u being followed for a given topic: if a user has a higher probability of being followed given a topic t, he/she is more influential in that topic.
To correctly model topic-specific influence on microblogs, we propose a new Bayesian model, called Followship-LDA (FLDA). It is a Bayesian generative model that extends Latent Dirichlet Allocation (LDA). We call it a generative model because it specifies a probabilistic procedure, based on the intuition I just described, by which the content and links of each user in a microblog network are generated. To explain the creation of content and links, we introduce some hidden structure, or latent variables, into the generative process, including the topics, the reasons for followships, and the topic-specific influence. Finally, given the model and the observed data, the goal is to reverse the generative process and find out what hidden structure is most likely to have generated the observed data.
Now let's look at the hidden structures we have introduced in the generative model. Each user tweets in different topics, so each user has a topic distribution indicating how likely he/she is to tweet in each topic. Suppose there are 3 latent topics: tech, food, and politics. For example, user Alice tweets about tech 80% of the time, food 19% of the time, and politics the remaining 1% of the time. Next, each topic is a mixture of words, so each topic has a per-topic word distribution, indicating how likely different words are to be used in that topic. For example, in the tech topic the word "web" is used 30% of the time, "cookie" 10% of the time, etc. As I mentioned before, there are different reasons why a user follows another, so each user has a followship preference. For example, Alice follows for content 75% of the time, and the other 25% of the time she follows for popularity. Since some followships are content-based, each topic has a followee distribution, indicating the probability of a user being followed for that topic. For example, for the tech topic, Mark Zuckerberg is followed 70% of the time. Each number in this table is the probability of a user being followed by someone given a topic; this is exactly the topic-specific influence score we are after. If a user has a higher probability of being followed given a topic, then this user has higher influence on that topic. Finally, we also have a global followee distribution, indicating the probability of a user being followed for content-independent reasons. For example, if a followship is totally content-independent, then 50% of the time Obama is the followee. It measures the global popularity of each user.
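To make these structures concrete, here is a tiny illustrative instantiation in Python. The specific numbers for Alice, the tech topic, and the global distribution come from the examples just given; everything else (the second user, the reduced vocabulary, the other rows) is made up for illustration.

```python
import numpy as np

users = ["Alice", "Bob"]
topics = ["Tech", "Food", "Politics"]
vocab = ["web", "cookie", "other"]          # "other" stands in for the rest of the vocabulary
followees = ["Zuckerberg", "Obama"]

# Per-user topic distribution: how likely each user tweets in each topic.
theta = np.array([[0.80, 0.19, 0.01],       # Alice: tech 80%, food 19%, politics 1%
                  [0.10, 0.30, 0.60]])      # Bob (made up)

# Per-topic word distribution: how likely each word is used in each topic.
phi = np.array([[0.30, 0.10, 0.60],         # Tech: "web" 30%, "cookie" 10%
                [0.05, 0.55, 0.40],         # Food (made up)
                [0.05, 0.05, 0.90]])        # Politics (made up)

# Per-user followship preference: P(content-based), P(popularity-based).
pi = np.array([[0.75, 0.25],                # Alice follows for content 75% of the time
               [0.50, 0.50]])               # Bob (made up)

# Per-topic followee distribution: P(user is followed | topic).
# These entries are exactly the topic-specific influence scores.
sigma = np.array([[0.70, 0.30],             # Tech: Zuckerberg followed 70% of the time
                  [0.40, 0.60],             # Food (made up)
                  [0.10, 0.90]])            # Politics (made up)

# Global followee distribution: P(user is followed) for content-independent follows.
g = np.array([0.50, 0.50])                  # Obama gets 50% of content-independent follows
```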
Now let's describe the generative process of FLDA. This figure shows the plate notation of the FLDA model. (The boxes represent replication: the outer box represents repetition over the users, the inner right box represents the repeated generation of words, and the left inner box represents the repeated creation of links.) Don't worry if you don't know plate notation; you can still follow the generative process. The process repeats for each user. For the m-th user, say Alice, it first picks a per-user topic distribution from a Dirichlet prior. For example, …. It also picks a per-user followship preference for this user. We first generate the content of this user. To generate each word, we first choose a topic based on the topic distribution, say tech, and then pick a word to represent this topic from the per-topic word distribution; in our example, "web" is chosen. The process continues for the remaining words. After content generation, we generate the followees. For each followee, we first choose the reason for the followship based on the per-user followship preference. Say this followship is content-based: then we pick a topic based on the same topic distribution as in the content generation, followed by picking a followee who addresses the picked topic well, from the per-topic followee distribution. For a different link, the generative process may decide that the followship is content-independent; in this case, a followee is chosen based on the global popularity distribution. So, this is how content and links are created in the generative process of FLDA.
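The generative story above can be sketched in code. This is an illustrative simulation for a single user, not the authors' implementation; the parameter names (theta_m, pi_m, phi, sigma, g) simply follow the distributions described earlier.

```python
import numpy as np

def generate_user(theta_m, pi_m, phi, sigma, g, n_words, n_links, rng):
    """Simulate FLDA's generative story for one user (illustrative sketch).

    theta_m : per-user topic distribution, shape (T,)
    pi_m    : followship preference (P(content-based), P(popularity-based))
    phi     : per-topic word distributions, shape (T, V)
    sigma   : per-topic followee distributions, shape (T, U)
    g       : global content-independent followee distribution, shape (U,)
    """
    words, links = [], []
    for _ in range(n_words):
        z = rng.choice(len(theta_m), p=theta_m)              # choose a topic
        words.append(rng.choice(phi.shape[1], p=phi[z]))     # pick a word for it
    for _ in range(n_links):
        if rng.random() < pi_m[0]:                           # content-based followship
            z = rng.choice(len(theta_m), p=theta_m)          # same topic distribution
            links.append(rng.choice(sigma.shape[1], p=sigma[z]))
        else:                                                # content-independent
            links.append(rng.choice(len(g), p=g))            # global popularity
    return words, links
```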
Now that we have the topic-specific influence model, let's look at how to learn it from the observed data.
FLDA specifies a probabilistic procedure, with introduced latent variables, that generates the content and links of each user in a microblog network. Now, given the observed text and links, we want to find the various distributions of the latent variables. We use Gibbs sampling for the inference because it is the most widely used approach for approximating the distributions of latent variables, especially for high-dimensional data.
To learn the various distributions in the FLDA model, we use Gibbs sampling. Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm for approximating the distributions of latent variables given the observed data. The process starts with some initial value for each variable, then iteratively samples each variable from its distribution conditioned on the current values of all the remaining variables, updating the variable with its new value as soon as it has been sampled. Observed data is incorporated by fixing the corresponding variables to their observed values rather than sampling them, so the distribution of the remaining variables is effectively a posterior conditioned on the observed data. This process repeats for hundreds of iterations, and at the end the produced samples approximate the distributions of the latent variables. A collapsed Gibbs sampler additionally integrates out (marginalizes over) one or more variables when sampling the others. The key, and also the most challenging, part of Gibbs sampling is deriving the conditional distribution of each latent variable. Here are the derived conditional probabilities for the FLDA model. They look very complicated, but essentially, what we derived are..
Now I will give an overview of the Gibbs sampling for FLDA. After initializing all the latent variables, the algorithm executes in iterations, making one pass over the data per iteration. During each pass, for each user, we first sample all the words: for the n-th word of the m-th user, we have the observed value of the word, and we sample a new topic assignment t'. After all the words, we sample the followees. For each followee, we first sample a new followship preference; if the preference is content-based, we then sample a topic for the link. Throughout the sampling process we keep track of a number of counters, because they appear in the definitions of the conditional distributions used for sampling. After the sampling process, we again use the counters to estimate the posterior distributions of the latent variables, such as the per-user topic distributions.
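As a sketch of what one such counter-based sampling step looks like, here is a standard LDA-style collapsed-Gibbs update for a single word. The exact FLDA conditionals (which also involve the link counters) are in the paper, so treat the formula, counter names, and hyperparameters below as illustrative assumptions.

```python
import numpy as np

def sample_word_topic(w, old_z, user_topic_cnt, topic_word_cnt, topic_cnt,
                      alpha, beta, V, rng):
    """Resample the topic of one word using count-based conditionals.

    user_topic_cnt[t]    : #tokens of this user currently assigned topic t
    topic_word_cnt[t, w] : #times word w is assigned topic t (all users)
    topic_cnt[t]         : #tokens assigned topic t (all users)
    """
    # Remove the word's current assignment from the counters.
    user_topic_cnt[old_z] -= 1
    topic_word_cnt[old_z, w] -= 1
    topic_cnt[old_z] -= 1

    # Conditional probability of each topic given everything else
    # (standard collapsed-LDA form; FLDA's actual conditionals are richer).
    p = (user_topic_cnt + alpha) * (topic_word_cnt[:, w] + beta) / (topic_cnt + beta * V)
    p = p / p.sum()
    new_z = rng.choice(len(p), p=p)

    # Record the new assignment in the counters.
    user_topic_cnt[new_z] += 1
    topic_word_cnt[new_z, w] += 1
    topic_cnt[new_z] += 1
    return new_z
```

Note how the counters double as sufficient statistics: at the end of sampling, the same counts (plus the priors) yield the posterior estimates of the per-user topic distributions.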
The Gibbs sampling process I have described so far is inherently sequential: each sampling step relies on the most recent values of all the other variables. The sequential algorithm does not scale. For example, it would run for 21 days on a high-end server (192 GB RAM, 3.2 GHz processor) for a Twitter data set with 1.8M users, 2.4B words, and 183M links. So we definitely need to parallelize the computation. How can we parallelize an inherently sequential process? We observe that, for our problem, there are a huge number of words and a large number of links, so the dependency between different variables is relatively weak. We therefore relax the sequential constraint and propose a distributed Gibbs sampling method. We implemented our distributed algorithm on top of Spark, a distributed processing framework for iterative machine learning workloads developed by Berkeley AMPLab.
Here is an overview of how it works. In a Spark cluster, there is a master and a number of workers. We partition the set of users across the workers. For each user, we hold the last topic assignment of each word, and the last preference assignment and last topic assignment for each link. Each user also holds user-local counters, such as the number of times a word is assigned to a topic for this user. The master is responsible for keeping track of the global counters, e.g., the number of times a word is assigned to a topic across all users. The algorithm executes in a number of iterations. At the beginning of each iteration, the master first broadcasts the global counters to all workers. Then each worker uses the global counters to sample and update the data of all its users. As the process goes on, the global counters become out of date, so at the end of the iteration the workers send their new local counters to the master, and the master uses them to update the global counters. Then the next iteration begins.
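The broadcast/update cycle can be sketched as follows. This simulates the master/worker pattern sequentially in plain Python; in the real system each worker runs as a Spark task, and the counter layout here is an assumption for illustration.

```python
import numpy as np

def distributed_gibbs(partitions, global_counters, n_iters, local_pass):
    """Simulate the master/worker loop of distributed FLDA (sketch).

    partitions      : list of per-worker user data
    global_counters : global counts, e.g. topic-word counts
    local_pass      : function(worker_data, global_snapshot) -> counter delta
    """
    for _ in range(n_iters):
        # Master broadcasts a snapshot of the global counters.
        snapshot = global_counters.copy()
        # Each worker samples its own users against the (possibly stale)
        # snapshot and returns the change to its local counters.
        deltas = [local_pass(part, snapshot) for part in partitions]
        # Master folds the workers' deltas back into the global counters.
        for delta in deltas:
            global_counters = global_counters + delta
    return global_counters
```

The staleness of the snapshot within an iteration is exactly the relaxation of the sequential constraint mentioned above.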
The distributed algorithm seems pretty simple, but there are a number of issues that need special care. First, we need to ensure fault tolerance: Gibbs sampling requires hundreds of iterations, and we don't want to restart whenever a failure occurs. At the time we developed the distributed algorithm, Spark relied on lineage for fault tolerance, so if a worker failed, it had to restart from the very beginning. This is not desirable for an algorithm that runs hundreds of iterations, so we implemented a checkpointing mechanism in Spark. (We later found that a checkpointing mechanism was introduced in the latest Spark release.) The second issue is the frequency of global synchronization. As I mentioned, we synchronize all the global counters every iteration. We also tried various other frequencies and found that even when synchronizing every 10 iterations, the quality of the result is not affected. Another, more subtle, issue is the random number generator. Because workers are spawned at roughly the same time, if we just use the Java random number generators, we get correlations between the pseudo-random numbers generated across the workers, which jeopardizes the quality of the returned results. To guarantee the correctness of the distributed Monte Carlo simulation, we need provably independent multiple streams of uniform random numbers; we used a long-period, jump-ahead random number generator. The last issue is efficient local computation: we took extra care to exploit the local memory hierarchy and avoid random memory access by sampling in a particular order.
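One way to see what "provably independent streams" buys you: NumPy's SeedSequence mechanism derives non-overlapping per-worker streams from a single root seed. This is purely an illustration of the idea; the paper used a long-period, jump-ahead generator, which solves the same problem differently.

```python
import numpy as np

# Spawn one independent child seed per worker from a single root seed.
root = np.random.SeedSequence(12345)
worker_rngs = [np.random.default_rng(s) for s in root.spawn(4)]

# Even though all workers start at the same time, their streams are
# statistically independent, not correlated copies of each other.
draws = [rng.random(5) for rng in worker_rngs]
```

Contrast this with naively seeding each worker from the wall clock, where simultaneously spawned workers can end up with identical or correlated streams.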
Finally, to put everything together, we incorporate the FLDA model into a search framework for topic-specific influencers, called SKIT. Using SKIT, a user simply enters a set of keywords to express his/her interest, and SKIT returns a list of key influencers that satisfy the user's intent. SKIT is a general search framework: besides FLDA, it can also plug in other key influencer methods, such as Link-LDA, Topic-Sensitive PageRank, and TwitterRank. Here, I will focus only on how it is implemented using FLDA. To support the search, SKIT first needs to derive the topics of interest from the keywords. In FLDA, we can simply treat the keywords as the content of a new user and use the "folding-in" approach to quickly compute the topic distribution of this new user; each value indicates the probability that the query is on a particular topic. From FLDA, we also have the per-topic influence score for each user. So, to compute the influence score of a user on the keyword query, we simply compute the weighted sum across all topics. Finally, we sort the users by their influence scores and return the top influencers.
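The scoring step can be sketched directly. Assuming we already have the query's topic distribution from folding-in and FLDA's per-topic influence matrix, ranking is just a weighted sum and a sort (the function and variable names below are my own, not SKIT's API):

```python
import numpy as np

def rank_influencers(query_topic_dist, per_topic_influence, top_k=10):
    """Rank users for a keyword query, FLDA-style (sketch).

    query_topic_dist    : shape (T,), P(topic | query keywords) via folding-in
    per_topic_influence : shape (T, U), P(user is followed | topic) from FLDA
    """
    # Influence of each user on the query: weighted sum across topics.
    scores = query_topic_dist @ per_topic_influence
    order = np.argsort(scores)[::-1]        # sort users by score, descending
    return order[:top_k], scores
```

For a query that is entirely about one topic, this reduces to ranking users by their influence scores in that single topic.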
Now I will present some performance numbers.
We first check whether the top influencers returned by our method make sense. Here, we use a Twitter dataset crawled in 2010. It contains …. In this table, we show some example topics with their top keywords and top influencers produced by FLDA (we named the topics for better presentation). Intuitively, it is clear that the influencers are very relevant to the corresponding topics. For example, one would expect O'Reilly publishers, Gartner research, and popular software bloggers to be influential in an IT-related topic, and pro-cycling athletes, a pro-cycling team, and a team director to be influential in a topic related to cycling and running. FLDA separated the "globally" popular users from the content-specific influencers, and determined that 15% of all links were content-independent. In other words, 15% of the time these popular users were followed regardless of what people tweet about. The top-5 globally popular users are singers, actors, and a talk show host.
Anecdotal evidence is hard to generalize and quantify. Luckily, the 2012 KDD Cup provided us with the data needed to objectively measure the quality of FLDA and other approaches. This Tencent Weibo dataset contains ….. A very nice feature of the Weibo dataset is the provided set of VIP users. These VIP users are organized into categories; one example category is …. These VIP users are manually labeled "key influencers" in their corresponding category, so we can use them as ground truth to evaluate result quality. Note that the categories don't necessarily align with the topics we detect in FLDA; they represent combinations of topics. If we use all the content of a VIP user as the search input, we can check how many of the top-k results are other VIP users in the same category, and use that percentage as the precision. Overall, we compute the mean average precision over all VIP users across all categories. This chart compares FLDA with existing approaches: Link-LDA, TSPR, and TwitterRank. As we can see, FLDA produces significantly better results than the existing approaches: over 2x better than TSPR and TwitterRank, and 1.6x better than Link-LDA. We also compared the results of the sequential algorithm against the distributed algorithm; they produce comparable results, which confirms our intuition.
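The evaluation metric can be sketched as follows. This is a standard average-precision-at-k computation (my own sketch, not the KDD Cup scoring script), where the relevant set for a query is the other VIP users in the same category:

```python
def average_precision_at_k(ranked_users, relevant, k):
    """Average precision at cutoff k for one query (standard AP sketch)."""
    hits, ap = 0, 0.0
    for rank, user in enumerate(ranked_users[:k], start=1):
        if user in relevant:
            hits += 1
            ap += hits / rank          # precision at this relevant rank
    denom = min(len(relevant), k)
    return ap / denom if denom else 0.0

def mean_average_precision(queries, k):
    """Mean of per-query APs; queries is a list of (ranked_users, relevant)."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in queries]
    return sum(aps) / len(aps) if aps else 0.0
```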
Now that we know the sequential and distributed algorithms produce comparable results, we compare their execution times. The sequential algorithm was run on a high-end server … and the distributed algorithm was run on a Spark cluster with 27 servers. Again, we ran on the Weibo and Twitter datasets. Note that the Twitter dataset is the larger one: although it has fewer users, it has significantly more words and links. This table shows the execution times of the sequential and distributed FLDA algorithms for 500 iterations. For the Weibo dataset, the distributed algorithm reduces the runtime from 4.6 days to 8 hours; for the Twitter dataset, it reduces the runtime from 21 days to 1.5 days, more than an order of magnitude faster.
We now evaluate the scalability of the distributed algorithm along three dimensions: data size, number of topics, and number of concurrent workers. We explore a wide range of data sizes (from 12.5% all the way up to 100%), numbers of topics (from 25 to 200), and numbers of workers (from 25 to 200). The figure shows that distributed FLDA scales well along all dimensions.
To summarize, we proposed a novel FLDA model for topic-specific influence analysis. This model combines content and link structure in the same generative process and is able to differentiate the different reasons why one user follows another. To apply FLDA to a web-scale microblog network, we designed a distributed Gibbs sampling algorithm for FLDA. Finally, we incorporated the FLDA model into a proposed general search framework for topic-specific key influencers. Through experiments on two real-world microblog datasets, we demonstrated that FLDA significantly outperforms state-of-the-art methods in terms of precision. Furthermore, the distributed Gibbs sampling algorithm for FLDA provides excellent speed-up to hundreds of workers.