Document Based Question Answering Heuristics

The task of Document-Based Question Answering (DBQA) is to answer questions by selecting one or more sentences from a set of unstructured documents as answers. Formally, given an utterance $Q$ and a document set $D$, the document-based chatbot engine retrieves a response $R$ in the following three steps:

  1. Response Retrieval, which retrieves response candidates $C$ from $D$ based on $Q$:

    $$C = \mathrm{Retrieve}(Q, D)$$

    Each $S \in C$ is a sentence existing in $D$.
  2. Response Ranking, which ranks all response candidates in $C$ and selects the most likely response candidate as $\hat{S}$:

    $$\hat{S} = \arg\max_{S \in C} \mathrm{Rank}(S, Q)$$
  3. Response Triggering, which decides whether the engine is confident enough to respond to $Q$ using $\hat{S}$:

    $$I = \mathrm{Trigger}(\hat{S}, Q)$$

    where $I$ is a binary value. When $I$ is true, let the response $R = \hat{S}$ and output $R$; otherwise, output nothing.
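The three steps can be sketched as one small function. This is only an illustration under stated assumptions: the token-overlap retrieval, the `overlap_rank` scorer, and the `threshold` parameter are hypothetical stand-ins, not the components of any particular system.

```python
from typing import Callable, List, Optional


def overlap_rank(sentence: str, query: str) -> float:
    """Hypothetical scorer: fraction of query tokens found in the sentence."""
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q) if q else 0.0


def answer(
    query: str,
    documents: List[List[str]],
    rank: Callable[[str, str], float] = overlap_rank,
    threshold: float = 0.5,  # assumed confidence cut-off for triggering
) -> Optional[str]:
    """Retrieve, rank, and trigger; return one sentence or None."""
    # Step 1, retrieval: keep sentences sharing at least one token with the query.
    q_tokens = set(query.lower().split())
    candidates = [
        s for doc in documents for s in doc
        if q_tokens & set(s.lower().split())
    ]
    if not candidates:
        return None
    # Step 2, ranking: pick the highest-scoring candidate.
    best = max(candidates, key=lambda s: rank(s, query))
    # Step 3, triggering: answer only when the score is high enough.
    return best if rank(best, query) >= threshold else None


docs = [["the cat sat on the mat", "dogs bark loudly"], ["cats like milk"]]
print(answer("where is the cat", docs))
```

Returning `None` corresponds to the "output nothing" branch of the triggering step.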

Response Retrieval

Search Engine 101

Index

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
The most commonly used indexing method is Inverted Indexing. An inverted index consists of a list of all the unique words that appear in any document, and for each word, we attach a list of the documents in which it appears.

[Figure: inverted index]
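A minimal sketch of building such an index in memory is shown here. It assumes whitespace tokenization and integer document ids; the function name `build_inverted_index` is made up for this example.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every unique word to the set of ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


docs = {1: "the cat sat", 2: "the dog barked", 3: "a cat and a dog"}
index = build_inverted_index(docs)
print(sorted(index["cat"]))  # [1, 3]
```

Looking up a word is then a single dictionary access instead of a scan over every document.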

Given a user query $q$, we first tokenize it into words, $q = \{w_1, w_2, \dots, w_n\}$. For each word $w_i$, we read its entry in the inverted index to get the list of documents $D_i$ that contain $w_i$. The final candidate document set is $\bigcup_{i=1}^{n} D_i$.
We can place further constraints on the candidate document set. For example, a Boolean operator (should or must) can be attached to each word to indicate whether every final candidate document must contain that word. The operator and the word together constitute a Boolean clause, and clauses can be combined into more complex clauses.
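Candidate retrieval with should and must clauses might look like the sketch below. The semantics chosen here are an assumption for illustration: when any must words are given, a document has to contain all of them and should words no longer restrict the set; with only should words, a document qualifies if it contains at least one. The `retrieve` function and the literal `index` are hypothetical.

```python
from typing import Dict, Iterable, Set


def retrieve(
    index: Dict[str, Set[int]],
    should: Iterable[str] = (),
    must: Iterable[str] = (),
) -> Set[int]:
    """Return candidate document ids for a simple Boolean query."""
    must = list(must)
    if must:
        # Every must word has to appear: intersect the posting sets.
        postings = [index.get(w, set()) for w in must]
        return set.intersection(*postings)
    # Only should words: any match is enough, so take the union.
    result: Set[int] = set()
    for w in should:
        result |= index.get(w, set())
    return result


index = {"cat": {1, 3}, "dog": {2, 3}, "the": {1, 2}}
print(retrieve(index, should=["cat", "dog"]))  # {1, 2, 3}
print(retrieve(index, must=["cat", "dog"]))    # {3}
```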

Ranking

The candidate document set can contain as many as 10,000 documents, so we need a way to score these documents and find the most relevant ones.

Vector Space Model and TF-IDF statistics

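Say we have a vocabulary $V$ that represents the unique terms in the corpus (all documents in the search engine). Each query and each document can be represented as a vector whose length is the vocabulary size, so that we can calculate a similarity score between the query $q$ and a candidate document $d$, for example the cosine similarity of their vectors:

$$\mathrm{score}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$$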

Each dimension $w_i$ of the vector is the weight of the $i$-th term $t_i$ in the vocabulary. The simplest weighting strategy is Boolean weighting: if term $t_i$ appears in the document, set $w_i = 1$; otherwise set $w_i = 0$. Documents that share more terms with the query receive a higher score.
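Boolean weighting can be illustrated with a tiny vocabulary. The sketch assumes cosine similarity as the score, which is one common choice and not necessarily the one intended here; the vocabulary and token lists are invented for the example.

```python
import math
from typing import List, Sequence


def boolean_vector(tokens: Sequence[str], vocab: Sequence[str]) -> List[float]:
    """1.0 where the vocabulary term occurs in tokens, else 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocab]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab = ["baby", "cold", "fever", "sleep"]
q = boolean_vector(["baby", "fever"], vocab)
d1 = boolean_vector(["baby", "fever", "cold"], vocab)  # shares two terms with q
d2 = boolean_vector(["baby", "sleep"], vocab)          # shares one term with q
print(cosine(q, d1), cosine(q, d2))
```

Here `d1` scores higher than `d2` because it shares two terms with the query rather than one.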

Boolean weighting has a weakness: the terms in a query are not equally important. TF-IDF (Term Frequency–Inverse Document Frequency) weighting holds that the more often a term $t$ appears in a document, and the fewer other documents contain it, the more important $t$ is to that document.
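One simple variant of TF-IDF is sketched below. It assumes raw counts for term frequency and $\log(N / \mathrm{df})$ for inverse document frequency; many other variants (smoothing, length normalization) exist, and the toy corpus is invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


docs = [["baby", "fever", "fever"], ["baby", "sleep"], ["baby", "cold"]]
w = tf_idf(docs)
print(w[0])
```

In this toy corpus "baby" occurs in every document, so its weight is zero, while "fever" occurs twice in one document and nowhere else, so it gets the largest weight.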

Response Ranking

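Given a user utterance $Q$ and a response candidate $S$, the ranking function $\mathrm{Rank}(S, Q)$ is designed as an ensemble of individual matching features:

$$\mathrm{Rank}(S, Q) = \sum_{k} \lambda_k \cdot h_k(S, Q)$$

where $h_k(\cdot)$ denotes the $k$-th feature function and $\lambda_k$ denotes the corresponding weight of $h_k(\cdot)$.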
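The weighted ensemble can be written directly as a sum over feature functions. The two features below (`word_overlap` and `length_ratio`) and the weights are hypothetical placeholders chosen only to make the sketch runnable; real systems would use richer features and learned weights.

```python
from typing import Callable, Sequence

Feature = Callable[[str, str], float]


def word_overlap(s: str, q: str) -> float:
    """Fraction of query tokens that also occur in the candidate."""
    q_tokens, s_tokens = set(q.lower().split()), set(s.lower().split())
    return len(q_tokens & s_tokens) / len(q_tokens) if q_tokens else 0.0


def length_ratio(s: str, q: str) -> float:
    """Ratio of the shorter token count to the longer one."""
    ls, lq = len(s.split()), len(q.split())
    return min(ls, lq) / max(ls, lq) if max(ls, lq) else 0.0


def rank(s: str, q: str, features: Sequence[Feature], weights: Sequence[float]) -> float:
    """Weighted sum of feature scores for the pair (s, q)."""
    return sum(w * h(s, q) for h, w in zip(features, weights))


features = [word_overlap, length_ratio]
weights = [0.8, 0.2]  # assumed values; normally tuned on training data
print(rank("the cat sat", "the cat", features, weights))
```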

Framework

We need to design multi-level (paraphrase, causality, discourse relationship, etc.) semantic and syntactic features to rank $(Q, S)$ pairs precisely, which is often laborious. Deep learning approaches have gained a lot of attention from the research community and industry for their ability to automatically learn feature representations for a given task, while claiming state-of-the-art performance in many tasks.

Deep QA

Data

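Positive examples

  1. Q: 吃 了 罗 红霉素 分散 片 可以 怀孕 吗 ? (Can I get pregnant after taking roxithromycin dispersible tablets?)
    A: 怀孕 三十八 周 可以 口服 罗 红霉素 分散 片 的 , 但 在 怀孕 期间 不 能 盲目 用药 , 盲目 用药 是 会 影响 胎儿 发育 的 , 建议 你 在 医生 的 指导 下用药 为宜 。 (At 38 weeks of pregnancy you can take roxithromycin dispersible tablets orally, but you must not take medication blindly during pregnancy, since that can affect fetal development; it is advisable to take medication under a doctor's guidance.)
  2. Q: 拍 胸片 后 多久 可以 怀孕 ? (How long after a chest X-ray can I get pregnant?)
    A: 因此 , 妇女 平时 应 尽量 减少 x光 的 照射 机会 , 怀孕 前 4 周 内 必须 禁忌 照射 X光 (Therefore, women should minimize their exposure to X-rays in general, and X-ray exposure must be avoided within the 4 weeks before pregnancy.)
  3. Q: 剖腹产 后 多久 可以 做 瑜伽 ? (How long after a cesarean section can I do yoga?)
    A: 那剖腹产 后 多久 可以 做 瑜伽 ? (So how long after a cesarean section can one do yoga?)

The positive examples are very noisy: the answers in the QA corpus are not necessarily all relevant to the question.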

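Negative examples

  1. Q: 孕前 优生优育 如何 检查 ? (What pre-pregnancy checkups are there for a healthy birth?)
    A: 宝宝 补 锌 吃 什么 好 ? (What should a baby eat to supplement zinc?)
  2. Q: 哺乳期 用 什么 护肤品 好 ? (What skincare products are good to use while breastfeeding?)
    A: 头 围 是 指 绕 胎 头 一 周 的 最 大 长度 。 (Head circumference refers to the maximum length around the fetal head.)

The negative examples are too negative and do not match the real data distribution. The real data consists of individual articles, and the sentences within an article are all related to one another. What we actually need are negative examples that look similar to the question. For example, the question asks about cause and effect while the answer explains why.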
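One way to obtain negatives that are closer to the real distribution is to draw them from the same article as the positive answer. The sketch below is only an illustration of that idea, not a description of how the data above was built; `sample_hard_negatives`, its parameters, and the toy article are hypothetical.

```python
import random
from typing import List


def sample_hard_negatives(
    article: List[str], positive: str, k: int = 2, seed: int = 0
) -> List[str]:
    """Sample up to k sentences from the same article, excluding the positive."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pool = [s for s in article if s != positive]
    return rng.sample(pool, min(k, len(pool)))


article = [
    "Babies often suck their fingers in the first year.",
    "It usually stops without intervention.",
    "See a doctor if the skin becomes irritated.",
    "Offer a toy as a distraction.",
]
negatives = sample_hard_negatives(article, positive=article[1], k=2)
print(negatives)
```

Because the sampled sentences share a topic with the question, they are harder to tell apart from the true answer than sentences drawn from unrelated articles.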

Demo


Cases:

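  1. 怀孕可以锻炼吗 (Can I exercise while pregnant?) (G)
  2. 宝宝吃手指 (Baby sucks fingers) (G)
  3. 宝宝吃手指怎么办 (What to do when the baby sucks fingers) (G)
  4. 宝宝发烧身上出红疹子怎么办 (What to do when the baby has a fever and a red rash on the body) (N)
  5. 宝宝晚上不睡觉 (Baby does not sleep at night) (N)
  6. 宝宝感冒怎么办 (What to do when the baby has a cold) (B)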

Reference

  1. A first take at building an inverted index
  2. Search engine indexing
  3. Vector Space Model
  4. TF-IDF
  5. DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents
  6. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks