The task of Document-Based Question Answering (DBQA) is to answer questions by selecting one or more sentences from a set of unstructured documents as answers. Formally, given an utterance $Q$ and a document set $D$, the document-based chatbot engine retrieves a response $R$ based on the following three steps:
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
The most commonly used indexing method is Inverted Indexing. An inverted index consists of a list of all the unique words that appear in any document, and for each word, we attach a list of the documents in which it appears.
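As a minimal sketch of the idea (the whitespace tokenizer and dict-of-lists layout are simplifying assumptions, not a production design), an inverted index can be built in one pass over the corpus:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():  # naive whitespace tokenization
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
index = build_inverted_index(docs)
# e.g. index["cat"] -> [0, 2], index["dog"] -> [1, 2]
```

Looking up a word is then a single dictionary access instead of a scan over every document.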
Given a user query $Q$, we first tokenize it into words $w_1, w_2, \dots, w_n$. For each word $w_i$, we read its inverted index entry and obtain the list $D_i$ of documents that contain $w_i$. The final candidate document set is $D_c = D_1 \cup D_2 \cup \dots \cup D_n$.
We can place further constraints on the candidate document set. For example, a binary Boolean operator (should, must) can be specified on each word to indicate whether that word must be contained in every final candidate document. An operator and a word together constitute a Boolean clause, and clauses can be composed into more complex clauses.
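A simplified version of this retrieval step can be sketched as follows; here "must" postings are intersected and "should" postings are unioned, a deliberately reduced model of the full clause semantics (the toy `index` and the `(operator, word)` clause encoding are assumptions for illustration):

```python
# postings for a tiny corpus: word -> sorted ids of documents containing it
index = {"cat": [0, 2], "dog": [1, 2], "sat": [0]}

def retrieve_candidates(index, clauses):
    """clauses: list of (operator, word) with operator 'must' or 'should'.
    'must' words are ANDed; without any 'must', 'should' words are ORed."""
    must = [set(index.get(w, [])) for op, w in clauses if op == "must"]
    should = [set(index.get(w, [])) for op, w in clauses if op == "should"]
    if must:
        result = set.intersection(*must)
    else:
        result = set.union(*should) if should else set()
    return sorted(result)

retrieve_candidates(index, [("must", "cat"), ("must", "dog")])    # -> [2]
retrieve_candidates(index, [("should", "cat"), ("should", "sat")])  # -> [0, 2]
```

In full-featured engines such as Lucene, "should" clauses alongside "must" clauses only affect scoring rather than filtering; the sketch above ignores that subtlety.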
The candidate document set can be as large as 10,000 documents, so we need a facility to score them and find the most relevant ones.
Say we have a vocabulary $V$ representing the unique terms in the corpus (all documents in the search engine). Each query $Q$ and document $d$ can be represented as a $|V|$-dimensional vector, so that we can calculate a similarity score between $Q$ and a candidate document $d$.
Each dimension $w_i$ of the vector is the weight of the $i$-th term $t_i$ in the vocabulary. The simplest weighting strategy is Boolean weighting: if term $t_i$ appears in the document, we set $w_i = 1$, otherwise $w_i = 0$. Documents that share more terms with the query receive higher scores.
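Under Boolean weighting, the similarity score reduces to counting shared vocabulary terms, which a dot product of the two 0/1 vectors computes directly (the four-word `vocab` below is a made-up example):

```python
def boolean_vector(text, vocab):
    """0/1 vector: dimension i is 1 iff vocab[i] occurs in the text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocab]

vocab = ["cat", "dog", "sat", "barked"]
q = boolean_vector("cat dog", vocab)         # [1, 1, 0, 0]
d = boolean_vector("the dog barked", vocab)  # [0, 1, 0, 1]

# dot product = number of terms the query and document share
score = sum(qi * di for qi, di in zip(q, d))  # -> 1 (only "dog" is shared)
```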
Boolean weighting has its weaknesses: the terms in a query are not equally important. TF-IDF (Term Frequency–Inverse Document Frequency) weighting assumes that the more often a term $t$ appears in a document, and the fewer other documents contain $t$, the more important $t$ is.
Given a user utterance $Q$ and a response candidate $R$, the ranking function is designed as an ensemble of individual matching features:

$$\mathrm{score}(Q, R) = \sum_{i} \lambda_i \, f_i(Q, R)$$

where $f_i$ denotes the $i$-th feature function and $\lambda_i$ denotes $f_i$'s corresponding weight.
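In code, such an ensemble is just a weighted sum over per-pair feature values; the feature names and numbers below are hypothetical placeholders:

```python
def rank_score(features, weights):
    """score(Q, R) = sum_i lambda_i * f_i(Q, R), a weighted sum of
    matching-feature values computed for one (Q, R) pair."""
    return sum(lam * f for lam, f in zip(weights, features))

# hypothetical feature values for one (Q, R) pair:
# word overlap, paraphrase score, discourse-relation score
features = [0.8, 0.1, 0.5]
weights  = [0.5, 0.3, 0.2]
score = rank_score(features, weights)  # -> 0.53
```

The weights $\lambda_i$ would typically be tuned on held-out labeled pairs rather than set by hand as here.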
We need to design multi-level semantic and syntactic features (paraphrase, causality, discourse relationships, etc.) to rank $(Q, R)$ pairs precisely, which is often laborious. Deep learning approaches have gained much attention from the research community and industry for their ability to automatically learn an optimal feature representation for a given task, achieving state-of-the-art performance on many tasks.
The positive examples are very noisy: the answers in the QA corpus are not necessarily all relevant to their questions.
The negative examples are too obviously wrong, which is inconsistent with the real data distribution. Real data consists of whole articles whose sentences are all related to one another. What we actually need are negatives that look close to the positives; for example, the question asks about cause and effect while the answer explains why.
Cases: