“I love Salt!” a user enthusiastically tweeted. Within seconds, the tiny tweet had arrived at @WalmartLabs, where it was analyzed in lightning-fast fashion. A few minutes later, a message arrived in the inbox of a close friend of hers: “Good morning, Juliana! You asked us to remind you that Hanna’s birthday is coming up. She’s just tweeted positively about ‘Salt,’ a new Angelina Jolie movie. Would you like to buy something related for her? We have a few suggestions.”
At the same time, another message was sent from @WalmartLabs to a Facebook user: “Hello David! Today’s deal is the fabulous new DeLonghi EC266 coffee maker, only $109 (a 45% discount!) if at least 50 customers sign up. Based on your recent Facebook postings on gourmet coffee, we thought you might like this deal – take advantage of it now!”
What is @WalmartLabs doing to realize such scenarios? How can we tell that a user meant “Salt” the movie, and not the condiment? And can we tell that a person loves gourmet coffee even if the word “coffee” has never been mentioned?
Yes, we can. The secret lies in something we call “The Social Genome.” The Social Genome is a giant knowledge base that captures interesting entities and relationships in the social world.
Examples of entities include:
- People (e.g., Susan Boyle)
- Events (e.g., a public appearance)
- Topics (e.g., gourmet coffee)
- Products (e.g., the movie “Salt” or a coffee maker)
- Organizations
Example relationships include:
- A person being interested in a topic
- A person attending an event
- An event about a certain topic
- An organization’s association with a product
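For concreteness, here is a minimal sketch in Python of one way such entities and relationships could be represented. The class names, fields, and example values below are our own illustration, not the actual @WalmartLabs schema.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A node in the knowledge base: a person, event, topic, product, etc."""
    entity_id: str
    kind: str                 # e.g. "person", "event", "topic", "product", "organization"
    name: str
    aliases: set = field(default_factory=set)  # surface forms seen in social media

@dataclass
class Relationship:
    """A directed, labeled edge between two entities."""
    subject_id: str           # e.g. the person
    predicate: str            # e.g. "interested_in", "attended", "about"
    object_id: str            # e.g. the topic, event, or product

# A tiny excerpt in this representation:
salt_movie = Entity("e1", "product", "Salt", aliases={"Salt", "salt movie"})
coffee = Entity("e2", "topic", "gourmet coffee")
david = Entity("e3", "person", "David")
edges = [Relationship("e3", "interested_in", "e2")]
```

At this level the Social Genome is simply a very large, constantly updated graph of such nodes and edges.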
[Figure: the kinds of entities and relationships captured in the Social Genome]

[Figure: a small part of the Social Genome itself]
In a sense, the social world (all the millions and billions of tweets, Facebook messages, blog postings, YouTube videos, and more) is a living organism, constantly pulsating and evolving. The Social Genome is the genome of this organism, distilling it down to its most essential aspects.
At @WalmartLabs, we have spent the past few years building and maintaining the Social Genome. We do this using public data from the Web, proprietary data and a lot of social media. From this data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, and then add them to the Social Genome.
For example, when Susan Boyle was first mentioned on the Web, we quickly detected that she was becoming an interesting person in the world of social media. So we added her to the Social Genome, and then monitored social media to collect more information about her. Her public appearances became events, and the more significant ones were added to the Social Genome as well. Another example: when a new coffee maker was mentioned on the Web, we detected it and added it to the Social Genome. We strive to keep the Social Genome up to date: typically, we detect and add information from a tweet into the Social Genome within two seconds of that tweet arriving in our labs.
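Detecting that a name such as “Susan Boyle” is suddenly becoming interesting can be framed as spike detection over mention counts. The sketch below is a toy version of that idea (our own simplification, not the production detector): it keeps a sliding window of recent mentions per name and flags names whose current rate far exceeds their long-run baseline.

```python
from collections import defaultdict, deque
import time

class TrendDetector:
    """Flag names whose recent mention rate spikes above their baseline.

    A toy windowed spike detector; illustrative only.
    """
    def __init__(self, window_secs=300, spike_factor=5.0, min_mentions=20):
        self.window_secs = window_secs
        self.spike_factor = spike_factor
        self.min_mentions = min_mentions
        self.recent = defaultdict(deque)     # name -> mention timestamps in the window
        self.baseline = defaultdict(float)   # name -> long-run mentions per window

    def observe(self, name, now=None):
        """Record one mention of `name` and report whether it is trending."""
        now = now if now is not None else time.time()
        q = self.recent[name]
        q.append(now)
        while q and q[0] < now - self.window_secs:   # drop mentions outside the window
            q.popleft()
        # Slowly decay the baseline toward the current windowed count (an EMA).
        self.baseline[name] = 0.99 * self.baseline[name] + 0.01 * len(q)
        return self.is_trending(name)

    def is_trending(self, name):
        rate = len(self.recent[name])
        base = max(self.baseline[name], 1.0)
        return rate >= self.min_mentions and rate >= self.spike_factor * base
```

A real detector would also have to deduplicate retweets, normalize name variants, and separate genuine spikes from spam, but the windowed-rate idea is the core.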
As a result of our effort, the Social Genome is a vast, constantly changing, up-to-date knowledge base with hundreds of millions of entities and relationships. We then use the Social Genome to perform semantic analysis of social media and to power a broad array of e-commerce applications.
Even if David the Facebook user never mentions the word “coffee,” he has mentioned many gourmet coffee brands (such as “Kopi Luwak”) in his status updates, so we can use the Social Genome to detect those brands and infer that he is interested in gourmet coffee. Similarly, using the Social Genome we may find that a user frequently mentions movies in her tweets. As a result, when she tweets “I love Salt!” we can infer that she is probably talking about the movie “Salt” and not the condiment (although both appear as entities in the Social Genome).
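To make the disambiguation step concrete, here is a minimal Python sketch of how one might score the candidate entities for an ambiguous mention against the user’s topical habits and the other words in the tweet. The candidate list, cue words, and weights are illustrative assumptions, not @WalmartLabs’ actual algorithm.

```python
def disambiguate(mention, candidates, tweet_words, user_topic_profile):
    """Pick the most likely entity for an ambiguous mention.

    candidates: (entity, topic) pairs sharing the surface form.
    tweet_words: the other words in the tweet (context clues).
    user_topic_profile: topic -> how often the user discusses it.
    Illustrative scoring only.
    """
    context_cues = {"movies": {"watch", "trailer", "jolie"},
                    "food": {"recipe", "taste", "cook"}}

    def score(entity, topic):
        s = user_topic_profile.get(topic, 0.0)               # the user's habits
        s += 0.5 * len(context_cues.get(topic, set()) & set(tweet_words))
        return s

    return max(candidates, key=lambda c: score(*c))

best = disambiguate("Salt",
                    [("Salt (movie)", "movies"), ("salt (condiment)", "food")],
                    ["i", "love", "salt"],
                    {"movies": 0.7, "food": 0.1})
# -> ("Salt (movie)", "movies"), because this user tweets about movies often
```

A production system would draw on far richer signals (entity popularity, the user’s social graph, current trends), but the core idea is the same: context resolves ambiguity.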
Building and using the Social Genome raised numerous interesting technical challenges. At the lowest level, we must process thousands of pieces of data (tweets, Facebook updates, blog posts) per second. The data streams in so fast that we call this the “Fast Data” problem (as opposed to the well-known “Big Data” problem). We found that the MapReduce/Hadoop framework typically used to solve Big Data problems was not well suited to Fast Data problems. Instead, we developed an in-house solution called Muppet, which processes fast streaming data in near real time over large clusters of machines. On top of Muppet, we employ a broad range of semantic analysis techniques, including information extraction and integration, natural language processing, and machine learning. While these techniques are well known, they have had to be significantly adapted or extended to deal with the peculiarities of social media. Finally, we have also developed techniques to effectively use crowdsourcing and human computation in building and maintaining the Social Genome.
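The contrast with batch-oriented MapReduce is easiest to see in code. The toy sketch below processes each event the moment it arrives, applying a “map” step that spots entity mentions and an “update” step that folds each mention into persistent per-key state. Muppet itself is a distributed system running over large clusters, so everything here (the function names, the dictionary-based entity spotter, the single-process loop) is our own simplification of the general streaming pattern, not Muppet’s API.

```python
from collections import defaultdict

KNOWN_ENTITIES = {"salt", "kopi luwak", "susan boyle"}   # toy entity dictionary

def extract_entity_mentions(text):
    """Stub entity spotter: dictionary lookup over known surface forms."""
    t = text.lower()
    return [name for name in KNOWN_ENTITIES if name in t]

def map_event(event):
    """'Map' step: turn one raw tweet into (key, value) pairs."""
    for name in extract_entity_mentions(event["text"]):
        yield name, event

def update_state(state, key, event):
    """'Update' step: fold one event into the persistent per-key state."""
    state[key]["mentions"] += 1
    state[key]["last_seen"] = event["timestamp"]

def run(stream):
    """Drive the map/update loop over an unbounded stream of events."""
    state = defaultdict(lambda: {"mentions": 0, "last_seen": None})
    for event in stream:                      # one event at a time, no batches
        for key, value in map_event(event):
            update_state(state, key, value)
    return state

# Example: two tweets flowing through the pipeline.
tweets = [{"text": "I love Salt!", "timestamp": 1},
          {"text": "Kopi Luwak is amazing", "timestamp": 2}]
print(dict(run(tweets)))   # per-entity mention counts and last-seen times
```

The key design difference from MapReduce is that state is updated incrementally as events arrive, rather than recomputed in periodic batch jobs over stored data.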
In summary, the Social Genome is indeed a crown jewel at @WalmartLabs. It is a giant entity-relationship knowledge base built using a wide range of cutting-edge data management, semantic analysis and human computation techniques. It is used to perform deep semantic analysis of social media, the result of which is leveraged to power a broad array of social commerce applications.