NLP Technique: Topic Modeling Is the Key to Gaining Rich Insights

If you’re like most brands, you have access to an abundance of data, whether it’s first-party data, data from trusted data providers, or third-party cookie data. But to access the insights buried deep within your data treasure chests, you need much better tools than the manual analysis methods from way back when: You need a natural language processing (NLP) toolbox.

And within this toolbox there’s a technique that’s both easy to use and quick to surface results. Its name is topic modeling, and it has a singular goal: To extract topics from piles of textual data, and then sort this data into groups based on these topics. In this guide, we’ll show you how topic modeling works, and how it can dig into your data to power some very common business use cases. 

What Is Topic Modeling?

Topic modeling is an NLP technique that uses pattern recognition and machine learning to:

  • identify topics within each text or document it analyzes
  • infer topic clusters from the text data overall
  • group together texts or documents containing similar topic clusters

When compared to manual analysis, topic modeling lets you quickly analyze a large collection of documents (for example, web pages, individual survey responses, or online reviews) in one go.

Let’s say, for example, you need to sort and organize 500,000 documents containing approximately 750 words each. Using topic modeling, you’re able to determine that your collection of documents contains 12 topic clusters overall. Your model then groups the documents according to their topic clusters. The result? Instead of needing to process and analyze 375 million words (500,000 documents × 750 words), you’re able to base your analysis on these topic clusters. This reduces your analysis to roughly the equivalent of one document per cluster, a quicker-to-analyze 9,000 words (12 topic clusters × 750 words).
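The arithmetic in this example can be checked in a few lines of Python. This is only a sketch of the calculation; the variable names are illustrative, and the 9,000-word figure assumes you review about one document's worth of text per cluster.

```python
# Figures taken from the example above.
num_documents = 500_000
words_per_document = 750
num_topic_clusters = 12

# Words to read if every document is analyzed manually.
words_manual = num_documents * words_per_document

# Words to read if roughly one document's worth of text is reviewed
# per cluster (an assumption that matches the 12 x 750 figure).
words_with_topics = num_topic_clusters * words_per_document

print(f"{words_manual:,}")       # 375,000,000
print(f"{words_with_topics:,}")  # 9,000
```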

Unsupervised Learning vs. Supervised Learning

Unlike sentiment analysis and named entity recognition (NER), two supervised learning NLP techniques discussed in depth here in previous posts, topic modeling is an unsupervised learning technique. Unsupervised techniques are typically quicker and easier to use because there’s no need to first train the model you’re using on labeled examples.

Trained models do have their advantages, though. While you end up investing more time in preparing the training data for supervised learning techniques, this training means you’ll get a more accurate classification of topics within your text that better matches the topics you’re looking for. And in fact, the supervised learning version of topic modeling is called topic classification.

How Topic Modeling Works

Topic modeling analyzes both word patterns and word frequencies within a document to identify a list of topics or topic clusters in that document. It’s useful for analyzing and sorting a large collection of documents or texts based on the topics extracted.

Here’s how the following (fictitious) reviews of the ShareThis button might be grouped into topic clusters:

  • “I like ShareThis’s ease of use, and the simplicity of its dashboard. It’s super flexible and gives me a lot of options.” Topic modeling might use ease of use and simplicity to group this review with reviews about how easy it is to use ShareThis.
  • “ShareThis gives me the ability to see user engagement with my content, as well as other analytical data.” Topic modeling might use engagement and analytical data to group this review with reviews about ShareThis’s analytics tools.

There are several topic modeling methods in use today, but the two most popular techniques are Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Both techniques are “bag of words” models—they treat documents as collections of words—that rely on the following hypotheses:

  • the distributional hypothesis, which assumes that words or expressions refer to similar topics if they occur in similar contexts
  • the statistical mixture hypothesis, which assumes that documents contain a variety of topics
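To make the “bag of words” idea concrete, here is a minimal Python sketch. The `bag_of_words` function and the small stop-word list are assumptions made for illustration, not part of any particular library. The point it shows is that only word counts are kept, while word order is discarded.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list (not a standard one).
STOP_WORDS = {"i", "the", "of", "and", "its", "it", "a", "to", "is"}

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

counts = bag_of_words("I like the ease of use, and the ease of setup.")
print(counts["ease"])  # 2

# Two texts with the same words in a different order get the same bag.
same = bag_of_words("dashboard easy use") == bag_of_words("use easy dashboard")
print(same)  # True
```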

Latent Dirichlet Allocation (LDA). LDA is a probabilistic model that assumes that words in a document can each be associated with a topic within the document. It calculates the probability of a topic generating certain words as well as the frequency with which these words are distributed. This in turn enables it to determine the words that are associated with the cluster of topics in a document, and then group the document with other documents containing a similar topic cluster.
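As a rough illustration, the sketch below fits a tiny LDA model with collapsed Gibbs sampling, which is one common way to estimate LDA (production libraries often use other inference methods). The function name `lda_gibbs`, the hyperparameter values, and the toy documents are all assumptions chosen for this example.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a small LDA model; docs is a list of token lists.

    Returns (doc_topic, topic_word, vocab): per-document topic proportions,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Start by assigning a random topic to every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic_counts[d][t] += 1
            topic_word_counts[t][word_id[w]] += 1
            topic_totals[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                t = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic_counts[d][t] -= 1
                topic_word_counts[t][wid] -= 1
                topic_totals[t] -= 1
                # Weight of each topic: how common it is in this document
                # times how likely it is to generate this word.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][wid] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Draw a new topic in proportion to the weights.
                r = rng.random() * sum(weights)
                t = 0
                while t < num_topics - 1 and r >= weights[t]:
                    r -= weights[t]
                    t += 1
                assignments[d][i] = t
                doc_topic_counts[d][t] += 1
                topic_word_counts[t][wid] += 1
                topic_totals[t] += 1

    # Turn the final counts into smoothed probabilities.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta)
         for c in topic_word_counts[k]]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab

docs = [
    "easy simple dashboard easy setup".split(),
    "simple setup easy options".split(),
    "analytics engagement data reports".split(),
    "engagement data analytics dashboard".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for proportions in doc_topic:
    print([round(p, 2) for p in proportions])
```

Each printed row is one document's estimated mix of the two topics. Documents whose rows look alike would be grouped together.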

Latent Semantic Analysis (LSA). Unlike LDA models, LSA models are based only on the frequency of words within textual data, and do not take into account the probabilities of a topic generating specific words. They use these frequencies to group a document with other documents containing a similar distribution of these words.
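In practice, LSA is usually implemented as a truncated singular value decomposition of a word-count matrix. The sketch below is a simplified, standard-library-only version that uses raw counts and power iteration. The function names and toy documents are assumptions for illustration, and real projects would typically rely on a numerical library instead.

```python
import math
import random

def top_eigenpairs(matrix, k, iterations=300, seed=0):
    """Largest k eigenpairs of a symmetric matrix via power iteration."""
    rng = random.Random(seed)
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        pairs.append((eigenvalue, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

def lsa_document_vectors(docs, k):
    """Project documents (token lists) onto k latent dimensions."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [[doc.count(w) for w in vocab] for doc in docs]
    # Document-by-document matrix of shared word counts.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in counts] for r1 in counts]
    pairs = top_eigenpairs(gram, k)
    return [
        [math.sqrt(max(ev, 0.0)) * vec[i] for ev, vec in pairs]
        for i in range(len(docs))
    ]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

docs = [
    "easy simple dashboard easy".split(),
    "simple easy setup".split(),
    "analytics engagement data".split(),
    "engagement analytics reports data".split(),
]
vectors = lsa_document_vectors(docs, k=2)
print(round(cosine(vectors[0], vectors[1]), 2))  # ease-of-use pair
print(round(cosine(vectors[0], vectors[2]), 2))  # ease-of-use vs. analytics
```

Documents with a similar distribution of words end up close together in the reduced space, which is what allows them to be grouped.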

Limitations of Topic Modeling

While topic modeling is a popular NLP technique, its drawbacks can limit its use cases. For example:

Short vs. long texts. Topic models such as LDA and LSA tend to work best on longer texts. Short texts contain too few co-occurring words for a model to learn reliable patterns from, which reduces the accuracy of any analysis you perform on, for example, social media text.

Topics. The topics generated by topic modeling will not be as accurate as the topics produced by a supervised learning model such as topic classification, which means you often can’t use the results for a more fine-grained analysis.

Topic number. Topic models must be given the number of topics to look for. This means their results depend directly on how closely the number you enter matches the actual number of topics in the dataset being analyzed.

Large datasets. To get the most accurate results, topic modeling needs a large volume of quality data to work with. This means a brand may not be able to collect enough first-party data to run a topic modeling analysis. (However, data such as ShareThis data can be used to enhance a too-small first-party dataset.)

Despite these limitations, topic modeling can be effectively applied to a number of marketing use cases.

Recommendation system. On publisher sites, topic modeling can be used to provide recommendations of articles similar to the one on the page a visitor is currently on. For example, on a pet food site, an article about feeding small mammals might be accompanied by links to recommended articles about hamsters and rabbits, but not cats or dogs.
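One way such a recommendation step might work is to compare the topic proportions of the current article with those of every other article and suggest the closest matches. In the sketch below, the article titles, the topic proportions, and the `recommend` function are all hypothetical; in a real system the proportions would come from a fitted topic model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def recommend(current, articles, top_n=2):
    """Return the top_n article titles most similar to the current one."""
    scores = [
        (cosine_similarity(articles[current], vec), title)
        for title, vec in articles.items()
        if title != current
    ]
    scores.sort(reverse=True)
    return [title for _, title in scores[:top_n]]

# Hypothetical topic proportions: [small mammals, cats, dogs]
articles = {
    "Feeding small mammals": [0.90, 0.05, 0.05],
    "Hamster diet basics": [0.85, 0.10, 0.05],
    "Rabbit nutrition": [0.80, 0.05, 0.15],
    "Choosing cat food": [0.05, 0.90, 0.05],
    "Dog treats guide": [0.05, 0.05, 0.90],
}
recommended = recommend("Feeding small mammals", articles)
print(recommended)  # ['Hamster diet basics', 'Rabbit nutrition']
```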

Customer support ticket routing and triage. Topic modeling can automatically send tickets matching specific topics directly to the relevant department, reducing support staff’s ticket processing time. Topic modeling can also prioritize the urgency of incoming support tickets so staff can tackle more urgent issues first. For example, tickets in a “credit card refunds” group could be automatically sent to accounting and billing, while tickets containing words like “crash” or “won’t start” could be flagged as urgent.
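The routing logic in this example could be sketched as follows. This assumes a topic model has already assigned each ticket to a topic group upstream; the `route_ticket` function, the department mapping, and the "general support" fallback are hypothetical names used only for illustration.

```python
# Words that flag a ticket as urgent, taken from the example above.
URGENT_TERMS = ("crash", "won't start")

# Hypothetical mapping from topic group to department.
DEPARTMENT_BY_TOPIC = {"credit card refunds": "accounting and billing"}

def route_ticket(text, topic_group):
    """Pick a department from the topic group and flag urgent wording."""
    department = DEPARTMENT_BY_TOPIC.get(topic_group, "general support")
    # Normalize case and curly apostrophes before matching.
    lowered = text.lower().replace("\u2019", "'")
    urgent = any(term in lowered for term in URGENT_TERMS)
    return {"department": department, "urgent": urgent}

refund = route_ticket("Please refund my card.", "credit card refunds")
broken = route_ticket("The app won't start after the update.", "app problems")
print(refund)  # {'department': 'accounting and billing', 'urgent': False}
print(broken)  # {'department': 'general support', 'urgent': True}
```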

Customer review analysis. With the advent of social media, as well as the popularity of review sites such as Google’s Business Profile, most companies have access to customer reviews about their brands. Topic modeling can be a quick way to analyze what improvements your product or service might need. For example, by using topic modeling to analyze customer reviews, a home goods store might discover its customers are dissatisfied with its weekend hours of operation.

Targeting/audience creation. Topic modeling can help you target or create new audiences, by distilling information you can use to define previously hidden audience segments. ShareThis does this by bundling together visitors’ actions on websites based on specific topics. So, for example, it can create a pet segment for marketers to tap into, by clustering website actions that are related to pets. 

Trends analysis. With topic modeling, you can detect new trends from within text data, providing information that can drive strategies such as product improvement or development, or content creation. For example, topic modeling analysis of social media data might reveal a trending use of words such as “succulents” or “cactus,” indicating a need for a gardening center to expand its inventory or publish more educational content about desert plants.
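Once a topic model has surfaced candidate terms, a simple follow-up check is to compare how often they appear in an earlier period and a recent one. The sketch below is plain keyword counting rather than topic modeling itself, and the sample posts, the `trending_terms` function, and the growth threshold are assumptions for illustration.

```python
def posts_mentioning(posts, term):
    """Count how many posts contain the term (case-insensitive)."""
    return sum(1 for post in posts if term in post.lower())

def trending_terms(earlier_posts, recent_posts, terms, min_growth=2.0):
    """Return terms whose mentions grew by at least min_growth times."""
    trending = []
    for term in terms:
        before = posts_mentioning(earlier_posts, term)
        after = posts_mentioning(recent_posts, term)
        # Treat zero earlier mentions as one to avoid dividing by zero.
        if after >= min_growth * max(before, 1):
            trending.append(term)
    return trending

earlier = ["New roses in stock", "Succulents are cute"]
recent = [
    "Love my succulents",
    "Succulents everywhere",
    "Cactus care tips",
    "Cactus and succulents",
    "Roses on sale",
]
trending = trending_terms(earlier, recent, ["succulents", "cactus", "roses"])
print(trending)  # ['succulents', 'cactus']
```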

Conclusion

In today’s digital world, your brand has access to an abundance of text data, from your own data to the treasure trove of data available from providers like ShareThis. But you need a tool that can readily dig into all that data for the invaluable information contained within. ShareThis, for example, uses NLP tools like topic modeling to cluster its data and build out its own rich insights. With its ease of use and quick results, topic modeling is a tool that can extract the useful information hidden in your text data, making it an ideal fit for your NLP toolbox.

About the author
ShareThis

ShareThis has unlocked the power of global digital behavior by synthesizing social share, interest, and intent data since 2007. Powered by consumer behavior on over three million global domains, ShareThis observes real-time actions from real people on real digital destinations.
