Building a Questions and Answers machine learning model

26 Jul 2018

Building a Q&A machine learning model, for answering White House Press Briefing questions about Brexit, using Latent Dirichlet Allocation.

Introduction

There is a lot of fuss about life-like chat AI these days. The truth is that we are still not yet there and it will likely be a while before the technology exists. Fundamentally a life-like chat AI, that is genuinely life-like, would be able to pass the Turing test, which would be a massive milestone in AI development.

As the technology does not yet exist, it is hard to say precisely what is missing to make a life-like chat AI possible. It is likely that a life-like chat AI would need 3 main components: language understanding, long-term contextual awareness, and a general knowledge-base. Each of these parts is on their own somewhat possible today, but the technology to combine them doesn’t yet exist. For example, language understanding allows us to create a translation AI, which in theory could be used to "translate" from questions to answers. However, the amount of different questions one can ask makes it hard to embed that amount of information in a neural network. Hence, we likely need an external knowledge-base that it can access, and that connection hasn't been made efficiently yet. Another example is that we can teach AIs to play Go and Chess which requires long-term contextual awareness. However, it is only possible to do with reinforcement learning which heavily relies on training with the help of a simulation system. Chats are not something we can efficiently simulate, because the rules are not explicitly known.

Not all aspects of a chat AI are out of reach today. In this article it is shown how to implement a Q&A "AI", using a machine learning technique for natural language processing.

A Q&A "AI" can be useful in many cases, such as aiding help-desk support. The base technology also extends to smarter and more flexible searching through documents, or even discovering categories in your documents that you weren't aware of. If your users produce your documents, this could reveal user groups that you weren't aware you had.

Brexit Questions and Answers in the White House

In this example, we use the press briefings Q&A sessions from the White House during the Obama Presidency. They are in the public domain documents and are available here: https://obamawhitehouse.archives.gov/briefing-room/press-briefings .

We build a semantic model from the press briefing questions. This means the "AI" understands the political language typically used in those press briefing sessions. To ensure we're not providing outdated or offensive answers, we create a curated list of mostly Brexit questions. This means that only questions related to Brexit can be answered, but you can still add answers to questions about other subjects.

Edit: The original version of this post contained an interactive Q&A model which we are unable to replicate on this version of the website. We are including screenshots of the walkthrough that was previously a part of this article.

Semantic Search Machine

The core concept of semantic search is to have an algorithm that can represent a document of words as a mixture of topics. If there are 3 topics, this could be 20% of topic 1, 80% of topic 2, and 0% of topic 3.

The intrusion here is that a specific topic is more likely to contain some words compared to others. The classic example of this is that a topic about cats is more likely to contain words such as milk, meow, and kitten. While a topic about dogs is more likely to contain words such as puppy, bark, and bone. However dogs are also mammals, and cat also has an endoskeleton. Therefore, a document about dogs could also contain the word milk. Likewise, a document about cats can also contain the word bone. It is just less likely. In the end, the topic mixture of a document should be the mixture that is most likely.

The exact algorithm of how this topic mixture is identified is explained later. For now, think of it as a black-box algorithm.

Similarity measure

To compute the similarity between two documents, where one is the question asked by the user and the other being a reference question with a predefined answer, the cosine similarity measure is used. Consider, [katex]q[/katex] being the topic representation for the question, while [katex]r[/katex] is the topic representation for the reference question. Then the cosine similarity can be computed as [katex]s = \frac{q \bullet r}{||q||_2 ||r||_2}[/katex]

The cosine similarity is [katex]1[/katex] if the topic mixture is the same. Likewise, it is [katex]-1[/katex] if the topics are the opposite.

Indexing

The cosine similarity is rather cheap to compute, but even so, it does need to be recomputed for all query-reference question pairs. A simple solution to speed this up, is not to store [katex]r[/katex] but rather [katex]\frac{r}{||r||_2}[/katex]. The same should be done for the query question. This way, only the dot product [katex]\tilde{q} \bullet \tilde{r}[/katex] needs to be computed, and this can typically be computed very fast on a GPU, even an integrated GPU is usually sufficient for this.

This linear time search algorithm is usually sufficient for even a reasonably sized Q&A database. If not, there are easy ways to turn the cosine similarity into a proper distance metric which enables logarithmic time search algorithms.

LDA as a Generative Model

The easiest way to understand how Latent Dirichlet Allocation identifies topics in documents, is to ask the opposite question, "given a set of topics what document can one expect to see?".

Given a model that can answer this question, one can then flip the model to answer the following questions, "given a document which of the defined topics are most likely to generate that document?".

Finally, with the answer to that question. One can solve the problem of, "given a corpus of documents, what kinds of topics are most likely?".

The LDA parameters

The Latent Dirichlet Allocation consists of two groups of parameters:

[katex]\alpha_{t}[/katex]: Controls the likelihood of the topic [katex]t[/katex] being used.
[katex]\beta_{t,w}[/katex]: Controls the likelihood of the word [katex]w[/katex] being used in topic [katex]t[/katex].

Topic composition

The first step is to use the [katex]\alpha_{t}[/katex] parameters to select a composition of topics randomly. Let us say there are three topics, a "composition of topics" could then be: 20% probability of topic [katex]1[/katex], 80% probability of topic [katex]2[/katex], and 0% probability of topic [katex]3[/katex].

Such a composition of topics, or probabilities to be more general, can be sampled using a Dirichlet distribution . Intuitively, the Dirichlet distribution uses the [katex]\alpha_{t}[/katex] parameters, that must all be greater than [katex]0[/katex], the following way:

The higher the [katex]\alpha_{t}[/katex] is, the higher the probability for topic [katex]t[/katex] is.
If the [katex]\alpha_{t}[/katex] parameters are less than [katex]1[/katex]. Then only a few topics are selected, and the rest gets assigned 0% probability.

For 3 topics, the "composition of topics" can be visualised as points on a triangle.

If the point is close to [katex]1[/katex] then there is a high probability of topic [katex]1[/katex]:

Similarly, if a point is a very far way the probability is small. Consequently, this means that if a point is on the edge then at least one of the topics gets 0% probability.

Dirichlet distribution: shows sampling from a 3 dimensional Dirichlet distribution. In Latent Dirichlet Allocation, the [katex]\alpha_{t}[/katex] should be close to zero, because often a document is only composed of a small set of topics. With the [katex]\alpha_{t}[/katex] close to zeros, topics often get 0% probability of being selected when generating a document. Although, other documents may be composed of a different set of topics.

Selecting words

With a known topic composition for the document. The next step is to generate the words one by one, this is done by repeating the following three steps:

Using a known topic composition for the given document a specific topic is sampled uniform randomly.
Once a specific topic [katex]t[/katex] is sampled, the word probabilities for that topic are also sampled, again using a Dirichlet distribution with the parameters [katex]\beta_{t, w}[/katex]. It is essential to not sample new word probabilities for the same topic more than once, as this ruins the meaning of a topic.
Finally, once the word probabilities for the topic [katex]t[/katex] are known, a specific word is sampled uniformly from the word probabilities.

The generative algorithm

In mathematical terms, the algorithm can be expressed as:

Choose document length [katex]N[/katex].
Sample topic composition: [katex]\theta \sim Dir(\alpha)[/katex].
For each word [katex]n[/katex], in a document with [katex]N[/katex] words in total:
1. Sample a topic: [katex]t_n \sim Cat(\theta)[/katex].
2. Sample the word composition: [katex]\phi_{t_n} \sim Dir(\beta_{t_n})[/katex].
3. Sample a word: [katex]w_n \sim Cat(\phi_{t_n})[/katex].

This algorithm completes the answer to "given a set of topics what document can one expect to see?". As promised from this answer, one can derive math to answer the questions "given a document which of the defined topics are most likely to generate that document?" and "given a corpus of documents, what kinds of topics are most likely?".

Digging Into the Math

This section is where things get tough. So feel free to skip this. Deriving these results is not incredibly complex, but involves advanced mathematical concepts such as solving a constrained optimisation problem with some probability theory mixed in. For this reason, only the highlights are covered here, for more thorough explaining please see the original Latent Dirichlet Allocation paper .

Notation and the generative distribution

Continuing with the generative model as described before. The topic composition distribution is denoted [katex]p(\theta|\alpha)[/katex]. The sampling of the topic for word [katex]n[/katex] is denoted [katex]p(z_n|\theta)[/katex]. Finally, the distribution of the word given the topic is denoted [katex]p(w_n|z_n, \beta)[/katex]. Note here that the distribution [katex]p(w_n|z_n, \beta)[/katex] is not explicitly split up into the sampling of the word composition and the word itself, as this is not directly relevant for the LDA model. All this gives the following generative distribution: [katex display=true]p(\theta, z, w|\alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^N p(z_n|\theta)p(w_n|z_n, \beta)[/katex]

Inference of the topics

To infer the topics gives a document Bayes theorem is used: [katex display=true]p(\theta, z|w, \alpha, \beta) = \frac{p(\theta, z, w|\alpha, \beta)}{p(w|\alpha, \beta)}[/katex] However, computing this distribution is computationally infeasible as the [katex]p(w|\alpha, \beta)[/katex] becomes: [katex display=true]p(w | \alpha, \beta) = \frac{\Gamma(\sum_t \alpha_t)}{\prod_t \Gamma(\alpha_t)} \int \left(\prod_k \theta_k^{\alpha_k - 1}\right) \left( \prod_{n=1}^N \sum_{k=1}^K \prod_{c=1}^V (\theta_k \beta_{k,w})^{w_n^c} \right) \partial \theta[/katex]

The primary reason why this is impossible to simplify to something more computationally feasible is that [katex]z[/katex] depends on both [katex]\theta[/katex] and [katex]\beta[/katex]. And simultaneously [katex]\theta[/katex] depends on [katex]\alpha[/katex].

The trick to solving this problem is to create an approximation to the desired distribution [katex]p(\theta, z, w|\alpha, \beta)[/katex], where these dependencies don't exist. That distribution looks like this: [katex display=true]q(\theta, z|\gamma, \phi) = q(\theta|\gamma) \prod_{n=1}^N q(z_n|\phi_n)[/katex] In this distribution [katex]\gamma[/katex] is related to [katex]\alpha[/katex] and [katex]\phi[/katex] is related to [katex]\theta[/katex]. Note that [katex]\theta[/katex] itself also appears in the distribution however that is as an output variable. The purpose of [katex]\phi[/katex] is to remove the coupling where [katex]\theta[/katex] would otherwise be used both as an output variable and an input variable.

In the end, [katex]\gamma[/katex] is the most relevant output. As this informs us of the Dirichlet parameters, that were most likely to generate a topic composition that generated the input document. Hence, it will indirectly, but directly enough, inform us what topics the document are composed of.

To find the [katex]\gamma[/katex] and [katex]\phi[/katex] parameters, the parameters will be optimized such that the {katex]q(\theta, z|\gamma, \phi)[/katex] distribution matches the [katex]p(\theta, z|w, \alpha, \beta)[/katex] the best.

One metric for the difference between two random distributions is the Kullback–Leibler divergence . The goal is then to minimise this difference. [katex display=true](\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D_{KL}(q(\theta, z|\gamma, \phi)\ ||\ p(\theta, z|w, \alpha, \beta))[/katex]

Solving this optimisation problem is where it gets tough. Thus that won't be discussed here. Instead, see Appendix A.3 in original Latent Dirichlet Allocation paper for the details.

The highlight of this, is that the [katex]\gamma[/katex] and [katex]\phi[/katex] parameters can be found using the following algorithm:

Initialize [katex]\phi_{n,k}^0 = \frac{1}{K}[/katex]
Initialize [katex]\gamma_{k}^0 = \alpha_k + \frac{N}{K}[/katex]
repeat until convergence of [katex]\gamma[/katex]:
1. for each word at position [katex]n[/katex]:
  1. for each topic [katex]k[/katex]: [katex]\phi_{n,k}^{t+1} = \beta_{k, w_n} \exp(\Psi(\gamma_k^t))[/katex]
  2. normalize: [katex]\phi_n^{t+1}[/katex]
2. for each topic [katex]k[/katex]: [katex]\gamma_k^{t+1} = \alpha_k + \sum_{n=1}^N \phi_{n,k}^{t+1}[/katex]

Parameter estimation

The inference part of the LDA is by far the most complex part. Once this algorithm is implemented it is fairly straight forward to optimize the [katex]\beta[/katex] parameters. The [katex]\alpha[/katex] parameter is a Dirichlet prior and is thus a hyper parameter that doesn't need optimization. The original paper do suggests a Newton-Raphson approach for optimizing [katex]\alpha[/katex], but in practice this is not necessary.

Optimizing [katex]\beta[/katex] is done with the following EM-algorithm:

E-step:

Compute [katex]\gamma_d^*[/katex], as described in the previous section.

M-step:

Update [katex]\beta[/katex] with: [katex display=true]\beta_{k,c} \propto \sum_{d=1}^D \sum_{n=1}^{N_d} \phi_{d,n,k}^* w_{d,n}^j[/katex]

Conclusion

While Latent Dirichlet Allocation is not a very new approach to language understanding, it is powerful when the Q&A dataset isn’t massive. The White House press briefings contain about 60,000 questions and answers. Such size may sound like a lot, but a modern translation AI that would use neural networks would typically depend on millions of translated sentences. Because Latent Dirichlet Allocation is built on strong mathematical assumptions about how topics and documents relate, it can achieve much more than a neural network would be able to do when the dataset is small. On the other hand, these mathematical assumptions are also what limits it. Therefore, if the dataset were much more massive, a neural network would typically be superior.

In the end, machine learning is not always about reaching for some ultra-advanced neural network but instead about achieving meaningful results given the data and resources that are available.

Contact

If you think your business could benefit from a Q&A solution like this, or something entirely different, then please feel free to contact us at https://www.nearform.com/contact/ .

Image: palesa

Insight, imagination and expertly engineered solutions to accelerate and sustain progress.

Contact