Question generation using NLP

Data Science Milan
3 min read · Jun 7, 2021

“From NLP, a helpful tool for teaching”

On 20th May 2021, Data Science Milan organized a web meetup hosting Ramsri Goutham to talk about Questgen, an open-source library for Multiple Choice Question generation.

“Question generation using NLP by QuestGen.AI”, by Ramsri Goutham, CTO of QuestGen.AI

Ramsri presented Questgen, an open-source library used to generate questions automatically from text. The idea comes from the need for a tool that automates the assessment process, helping teachers in their job. From an article or text, the tool can generate Multiple Choice Questions (MCQs), true-or-false questions, FAQs, paraphrases, and question-answer pairs. Ramsri showed a use case on how to generate multiple-choice questions using the T5 Transformer.
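As a quick, hedged illustration, generating MCQs with Questgen looks roughly like this; the call names follow the project’s README and may differ across versions:

```python
# Sketch of MCQ generation with Questgen (github.com/ramsrigouthamg/Questgen.ai).
# API names follow the project README; newer versions may differ slightly.
from pprint import pprint
from Questgen import main

qg = main.QGen()  # loads the underlying T5-based question generation model

payload = {
    "input_text": (
        "The Stanford Question Answering Dataset (SQuAD) is a reading "
        "comprehension dataset consisting of questions posed by crowd "
        "workers on a set of Wikipedia articles."
    )
}

output = qg.predict_mcq(payload)  # dict with questions, answers and distractor options
pprint(output)
```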

Given an article, for instance, the process follows these steps:

-Extractive summarization (see the sketch after this list);

-Identify key sentences/concepts;

-Identify keywords from sentences;

-Form multiple-choice questions;

-Distractor generation.
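A minimal sketch of the first step, using the bert-extractive-summarizer package; the package choice is an assumption for illustration, not necessarily what Questgen uses internally:

```python
# Extractive summarization sketch with the bert-extractive-summarizer package.
# Package choice is illustrative; any extractive summarizer can play this role.
from summarizer import Summarizer

article = """Long article text goes here. The summarizer scores each sentence
and keeps only the most representative ones, so the question generator
works on the key sentences instead of the whole document."""

model = Summarizer()                 # BERT-based sentence scorer
summary = model(article, ratio=0.3)  # keep roughly the top 30% of sentences
print(summary)
```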

The T5 Transformer reframes all NLP tasks into a text-to-text format, where the input and output are always text strings, unlike BERT-style models whose output is either a class label or a span of the input. It is an encoder-decoder Transformer model: given an input text, it learns to generate an output text; more precisely, given a context and an answer, it automatically generates a question that ideally suits both.
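A hedged sketch of this text-to-text setup with the Hugging Face transformers library; the checkpoint name and the "context: … answer: …" prompt format are assumptions based on common SQuAD-style question-generation fine-tunes, not necessarily the exact model used in the talk:

```python
# Question generation with a T5 checkpoint fine-tuned for SQuAD-style QG.
# Model name and prompt format are assumptions; substitute any compatible checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "ramsrigouthamg/t5_squad_v1"  # assumed checkpoint name

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

context = ("The Stanford Question Answering Dataset (SQuAD) is a reading "
           "comprehension dataset built from Wikipedia articles.")
answer = "SQuAD"

# Text-to-text format: the task input and the generated question are plain strings.
input_text = f"context: {context} answer: {answer}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)

outputs = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)
question = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(question)
```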

The model was trained on the SQuAD dataset (Stanford Question Answering Dataset), a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles. The answers to the questions are essentially keywords extracted from the context, and they can be obtained with several Python keyword extraction libraries.
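For example, with KeyBERT, one of several possible keyword extraction libraries, chosen here only for illustration:

```python
# Keyword extraction sketch with KeyBERT; any keyword extraction library
# (YAKE, pke, RAKE, ...) could be used instead.
from keybert import KeyBERT

context = (
    "The Stanford Question Answering Dataset is a reading comprehension "
    "dataset built from questions posed by crowd workers on Wikipedia articles."
)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    context,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (phrase, relevance score) pairs; top phrases become candidate answers
```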

The next step is to understand the right contextual meaning of a keyword and to generate distractors (wrong answer choices) that match that meaning of the word in the sentence. Several algorithms can be used to generate distractors; Ramsri showed WordNet and Sense2Vec.

The first is a large lexical database of English that captures broader relationships between words; the second captures contextual information about a word and generates similar words and phrases.
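A minimal WordNet-based sketch of distractor generation, using NLTK’s WordNet interface; the naive "first sense" choice below is exactly the ambiguity that Sense2Vec-style contextual similarity helps to resolve:

```python
# Distractor generation sketch with WordNet via NLTK: pick the co-hyponyms
# ("siblings") of a word's sense as plausible wrong answer choices.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wordnet_distractors(word, limit=3):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    sense = synsets[0]  # naive sense choice; contextual methods help pick the right one
    distractors = []
    for hypernym in sense.hypernyms():        # broader concept, e.g. "feline" for "cat"
        for hyponym in hypernym.hyponyms():   # other members of that broader concept
            name = hyponym.lemmas()[0].name().replace("_", " ")
            if name.lower() != word.lower() and name not in distractors:
                distractors.append(name)
    return distractors[:limit]

print(wordnet_distractors("cat"))  # sibling animals that make plausible wrong options
```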

References

Colab notebooks

Recording & Slides:

video

slides

Written by Claudio G. Giancaterino
