Building a Q&A machine learning model, for answering White House Press Briefing questions about Brexit, using Latent Dirichlet Allocation.
Introduction
There is a lot of fuss about lifelike chat AI these days. The truth is that we are still not yet there and it will likely be a while before the technology exists. Fundamentally a lifelike chat AI, that is genuinely lifelike, would be able to pass the Turing test, which would be a massive milestone in AI development.
As the technology does not yet exist, it is hard to say precisely what is missing to make a lifelike chat AI possible. It is likely that a lifelike chat AI would need 3 main components: language understanding, longterm contextual awareness, and a general knowledgebase. Each of these parts is on their own somewhat possible today, but the technology to combine them doesn’t yet exist. For example, language understanding allows us to create a translation AI, which in theory could be used to "translate" from questions to answers. However, the amount of different questions one can ask makes it hard to embed that amount of information in a neural network. Hence, we likely need an external knowledgebase that it can access, and that connection hasn't been made efficiently yet. Another example is that we can teach AIs to play Go and Chess which requires longterm contextual awareness. However, it is only possible to do with reinforcement learning which heavily relies on training with the help of a simulation system. Chats are not something we can efficiently simulate, because the rules are not explicitly known.
Not all aspects of a chat AI are out of reach today. In this article it is shown how to implement a Q&A "AI", using a machine learning technique for natural language processing.
A Q&A "AI" can be useful in many cases, such as aiding helpdesk support. The base technology also extends to smarter and more flexible searching through documents, or even discovering categories in your documents that you weren't aware of. If your users produce your documents, this could reveal user groups that you weren't aware you had.
Brexit Questions and Answers in the White House
In this example, we use the press briefings Q&A sessions from the White House during the Obama Presidency. They are in the public domain documents and are available here: https://obamawhitehouse.archives.gov/briefingroom/pressbriefings.
We build a semantic model from the press briefing questions. This means the "AI" understands the political language typically used in those press briefing sessions. To ensure we're not providing outdated or offensive answers, we create a curated list of mostly Brexit questions. This means that only questions related to Brexit can be answered, but you can still add answers to questions about other subjects.
Please follow the walkthrough guide below and feel free play around with training it using your own questions and answers to see how well the "AI" understands them on a general level. All the evaluation is done locally on your machine using TensorFlow.js, so don't worry about sending us inappropriate questions.
Semantic Search Machine
The core concept of semantic search is to have an algorithm that can represent a document of words as a mixture of topics. If there are 3 topics, this could be 20% of topic 1, 80% of topic 2, and 0% of topic 3.
The intrusion here is that a specific topic is more likely to contain some words compared to others. The classic example of this is that a topic about cats is more likely to contain words such as milk, meow, and kitten. While a topic about dogs is more likely to contain words such as puppy, bark, and bone. However dogs are also mammals, and cat also has an endoskeleton. Therefore, a document about dogs could also contain the word milk. Likewise, a document about cats can also contain the word bone. It is just less likely. In the end, the topic mixture of a document should be the mixture that is most likely.
The exact algorithm of how this topic mixture is identified is explained later. For now, think of it as a blackbox algorithm.
Similarity measure
To compute the similarity between two documents, where one is the question
asked by the user and the other being a reference question with a
predefined answer, the cosine similarity measure is used. Consider,
The cosine similarity is
Indexing
The cosine similarity is rather cheap to compute, but even so, it does
need to be recomputed for all queryreference question pairs. A simple
solution to speed this up, is not to store
This linear time search algorithm is usually sufficient for even a reasonably sized Q&A database. If not, there are easy ways to turn the cosine similarity into a proper distance metric which enables logarithmic time search algorithms.
LDA as a Generative Model
The easiest way to understand how Latent Dirichlet Allocation identifies topics in documents, is to ask the opposite question, "given a set of topics what document can one expect to see?".
Given a model that can answer this question, one can then flip the model to answer the following questions, "given a document which of the defined topics are most likely to generate that document?".
Finally, with the answer to that question. One can solve the problem of, "given a corpus of documents, what kinds of topics are most likely?".
The LDA parameters
The Latent Dirichlet Allocation consists of two groups of parameters:
: Controls the likelihood of the topic being used. : Controls the likelihood of the word being used in topic .
Topic composition
The first step is to use the
Such a composition of topics, or probabilities to be more general, can
be sampled using a
Dirichlet distribution. Intuitively, the Dirichlet distribution uses the

The higher the
is, the higher the probability for topic is. 
If the
parameters are less than . Then only a few topics are selected, and the rest gets assigned 0% probability.
For 3 topics, the "composition of topics" can be visualised as points
on a triangle.
If the point is close to
In Latent Dirichlet Allocation, the
Selecting words
With a known topic composition for the document. The next step is to generate the words one by one, this is done by repeating the following three steps:
 Using a known topic composition for the given document a specific topic is sampled uniform randomly.

Once a specific topic
is sampled, the word probabilities for that topic are also sampled, again using a Dirichlet distribution with the parameters . It is essential to not sample new word probabilities for the same topic for than once, as this ruins the meaning of a topic. 
Finally, once the word probabilities for the topic
are known, a specific word is sampled uniformly from the word probabilities.
The generative algorithm
In mathematical terms, the algorithm can be expressed as:
 Choose document length
.  Sample topic composition:
. 
For each word
, in a document with words in total:  Sample a topic:
.  Sample the word composition:
.  Sample a word:
.
 Sample a topic:
This algorithm completes the answer to "given a set of topics what document can one expect to see?". As promised from this answer, one can derive math to answer the questions "given a document which of the defined topics are most likely to generate that document?" and "given a corpus of documents, what kinds of topics are most likely?".
Digging Into the Math
This section is where things get tough. So feel free to skip this. Deriving these results is not incredibly complex, but involves advanced mathematical concepts such as solving a constrained optimisation problem with some probability theory mixed in. For this reason, only the highlights are covered here, for more thorough explaining please see the original Latent Dirichlet Allocation paper.
Notation and the generative distribution
Continuing with the generative model as described before. The
topic composition distribution is denoted
Inference of the topics
To infer the topics gives a document Bayes theorem is used:
The primary reason why this is impossible to simplify to something more
computationally feasible is that
The trick to solving this problem is to create an approximation to the
desired distribution
In the end,
To find the
One metric for the difference between two random distributions is the
Kullback–Leibler divergence. The goal is then to minimise this
difference.
The highlight of this, is that the
 Initialize
 Initialize
 repeat until convergence of
:  for each word at position
:  for each topic
:  normalize:
 for each topic
 for each topic
:
 for each word at position
Parameter estimation
The inference part of the LDA is by far the most complex part. Once this
algorithm is implemented it is fairly straight forward to optimize the
Optimizing
Estep: Compute
Mstep: Update
Conclusion
While Latent Dirichlet Allocation is not a very new approach to language understanding, it is powerful when the Q&A dataset isn’t massive. The White House press briefings contain about 60,000 questions and answers. Such size may sound like a lot, but a modern translation AI that would use neural networks would typically depend on millions of translated sentences. Because Latent Dirichlet Allocation is built on strong mathematical assumptions about how topics and documents relate, it can achieve much more than a neural network would be able to do when the dataset is small. On the other hand, these mathematical assumptions are also what limits it. Therefore, if the dataset were much more massive, a neural network would typically be superior.
In the end, machine learning is not always about reaching for some ultraadvanced neural network but instead about achieving meaningful results given the data and resources that are available.
Contact
If you think your business could benefit from a Q&A solution like this, or something entirely different, then please feel free to contact us at https://www.nearform.com/contact/.