CNeRG

Datasets

Title	Short Dataset Description	Tag(s)	Link(s)	Contact
EvalSan: Evaluation Toolkit for Sanskrit Embeddings	A suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in Sanskrit word embeddings.	Word embeddings, intrinsic evaluation, Sanskrit, WSC23	Link	jivneshsandhan@gmail.com
FinRED: A Dataset for Relation Extraction in Financial Domain	Relation extraction dataset created from Webhose financial news articles and earnings call transcripts. Contains a total of 7,775 sentences.	financial, relation extraction, www22	Link	soumyasharma20@gmail.com
ILSI: Indian Legal Statute Identification Dataset	English language dataset for the Legal Statute Identification (LSI) task based on Indian court case documents. The LSI task requires one to identify the relevant statutes (written laws) given the facts/evidence of a situation/legal case. Contains fact portions from ~66k legal case documents from the Supreme Court and 6 High Courts of India. The label set contains 100 Sections (statutes) of the Indian Penal Code.	multi-label classification, nlp, legal, aaai22	Link	shounakpaul95@kgpian.iitkgp.ac.in
Placing (Historical) Facts on a Timeline: A Classification cum Coref Resolution Approach	Curated sentences from two historical corpora: the Collected Works of Mahatma Gandhi and the Collected Works of Abraham Lincoln.	sentence classification, event extraction, event coreference resolution, ECML-PKDD 2022	Link	sayantanadak.skni@gmail.com
Winds of Change: Impact of COVID-19 on Vaccine-Related Opinions of Twitter Users	Dataset of 1,600 COVID-19 vaccine-related tweets, labelled as Anti-Vax, Pro-Vax or Neutral. Also contains our classifier trained on multiple datasets for vaccine stance detection.	vaccine stance, social, icwsm2022	Link	sohampoddar@kgpian.iitkgp.ac.in
CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines	Benchmark dataset for explainable classification of concerns that people have towards COVID-19 vaccines. The dataset contains 9,921 anti-vax tweets labelled into 12 concerns in a multi-label setting. Each label is provided with an explanation. The dataset also includes class-wise summaries that can be used to evaluate multi-document summarization algorithms.	anti-vax concerns, explainability, nlp, social, sigir2022	Link	sohampoddar@kgpian.iitkgp.ac.in
Identification of Rhetorical Roles of Sentences in Indian Legal Judgments	Dataset containing 50 documents of the Indian Supreme Court. Each sentence is labelled with one of the 7 rhetorical roles.	semantic segmentation, sentence classification, legal	Link	paheli@iitkgp.ac.in
Zalando products	Dataset of Zalando female fashion products, metadata, types and styles, images	image processing, outfit recommendation, ecir2022	Link	harshm121@gmail.com
MTLTS: A Multi-Task Framework To Obtain Trustworthy Summaries From Crisis-Related Microblogs	We include the pre-processed PHEME dataset consisting of 4,659 Twitter conversation threads, each with a source tweet and its replies, posted during four breaking-news events related to man-made disasters. For each thread, the source tweet is labeled as either rumour or non-rumour by a team of journalists. Stance labels for a subset of rumourous threads were obtained from the RumourEval 2019 dataset. For each of the four events, we also include an expert-curated extractive summary of around 250 words. The dataset is used for evaluating the trustworthiness of model-generated summaries.	trustworthy-summarization, disaster-tweets, ai-for-social-good, nlp, ir, wsdm22	Link	rajdeep1989.iitkgp@gmail.com
A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit	We release datasets for context-sensitive compound type identification, a multi-class classification problem. The dataset is available for multiple languages.	Context-sensitive task, multi-class classification, Sanskrit, low-resource NLP, COLING22	Link	jivneshsandhan@gmail.com
Dataset of Annotated Intent Phrases on Indian and Australian Legal Case Proceedings	Contains the main Indian and Australian data with NER labels as well as sentence classification labels. The folders named "ind_phrases_2" and "aus_phrases" contain the extracted intent phrases (Indian phrases were manually annotated; Australian phrases were automatically extracted using JointBERT). The files "ind_labels.csv" and "aus_labels.csv" contain the different types of labels used for our experiments.	nlp, summarization, legal, evaluation metrics, lrec2022	Link 1, Link 2	aankanmullick@gmail.com, nandyabhilash@gmail.com
Two-Face: Adversarial Audit of Commercial Face Recognition Systems	Dataset of adversarial images generated from standard face image datasets: CelebSET, CFD, and FairFace.	face recognition, adversarial audit, fairness and bias, icwsm22	Link	siddsjaiswal@kgpian.iitkgp.ac.in
DriBe: On-road Mobile Telemetry for Locality-Neutral Driving Behavior Annotation	Dataset for driving containing road-view video clips and IMU/GPS data	Driving behavior, Accident prediction, Driving safety, PerCom Workshops'22, MDM'22	Link 1, Link 2	debasreedas1994@gmail.com
Joint Autoregressive and Graph Models for Software and Developer Social Networks	The dataset contains multiple packages with their dependency links, maintained by over 3800 developers, with over 280k bug reports.	Ubuntu packages, software dependency network, bug urgency prediction, developer recommendation.	Link	to_rima@iitkgp.ac.in
Reproducibility, Replicability and Beyond: Assessing Production Readiness of Aspect Based Sentiment Analysis	We include Aspect-based Sentiment Analysis datasets corresponding to reviews from four different domains: Laptop (Source: SemEval 2014), Restaurant (Source: SemEval 2014), Men's T-Shirt (new), and Television (new).	absa, reproducibility, nlp, ecir22	Link	rajdeep1989.iitkgp@gmail.com
Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages	We introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages, covering ten language families, containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language.	Code-mixed, multilingual dataset, speech translation, automatic speech recognition, machine translation.	Link	jivneshsandhan@gmail.com
When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia	The dataset contains two separate lists of prolific editors in the Wikipedia community, missing and active, as well as different features representing the editors' editing activity.	Wikipedia, Missing editor, Prolific contributor, Platform moderation	Link	dasparamita1708@gmail.com
Quality Change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia	This work shows a dynamic portfolio of quality changes in English Wikipedia articles and further proposes a novel unsupervised page-level approach to detect quality changes in advance. The dataset contains human-assessed quality labels as well as a set of features representing the revisions for 33k English Wikipedia articles.	Wikipedia, Quality classes, Change point detection, Unsupervised approach	Link	dasparamita1708@gmail.com
Is this bug severe? A text-cum-graph based model for bug severity prediction	The dataset consists of 280K bugs along with their metadata (e.g., the textual description of the bug). The ground-truth severity scores of these bugs were collected at two different time points to facilitate the prediction experiments.	Ubuntu Bugs, Bug severity prediction, Package-affect graph	Link	to_rima@iitkgp.ac.in
Autonomous driving data from CARLA simulator	Contains 262 episodes of driving data collected using the CARLA simulator, along with annotations for training the Conditional Affordance Learning (CAL) [http://proceedings.mlr.press/v87/sauer18a/sauer18a.pdf] driving model.	self-driving, CARLA, autonomous driving	Link	soumid.04@gmail.com
"Short is the Road that Leads from Fear to Hate" :Fear Speech in Indian WhatsApp Groups	First ever dataset for fear speech detection from public political Whatsapp groups. Contains 1000 unique fear speech and around 4000 normal posts. users posting fear speech use violent events, emojis and controversial websites to create fear about community.	fear speech, nlp, social, www2021	Link	punyajoys@iitkgp.ac.in
Dataset on Question Answering over Electronic Devices	Benchmark QA datasets which include a pre-training corpus of E-Manuals, question-answer pairs curated by experts based upon two E-manuals, real user questions from a Community Question Answering forum pertaining to E-manuals, etc.	nlp, question-answering, emnlp2021	Link	nandyabhilash@gmail.com
HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection	HateXplain is the first benchmark dataset for hate speech with word and phrase level span annotations that capture human rationales for the labeling. Using MTurk, a large dataset of around 20K posts was collected and annotated to cover three aspects of each post: label, rationales and target.	hate speech, explainability, nlp, aaai2021	Link	punyajoys@iitkgp.ac.in
You too Brutus! Trapping Hateful Users in Social Media: Challenges, Solutions & Insights	Benchmark dataset for hateful user detection on Gab. Contains 423 hateful and 375 non-hateful users along with their posts and user networks (available on request).	hate speech, user level detection, ht2021	Link	mithundas@iitkgp.ac.in
Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task Approach	We include three datasets: (1) EmoBank: a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme, (2) SenWave: a public sentiment analysis dataset for COVID-19 research, and (3) Affect In Tweets: created as part of SemEval 2018 Task 1, it consists of around 10K tweets labeled with the presence/absence of a total of 11 emotions.	multi-task learning, emotion-analysis, sentiment-analysis, valence-arousal-dominance, nlp, sigir21	Link	rajdeep1989.iitkgp@gmail.com
PASTE: A Tagging-Free Decoding Framework Using Pointer Networks for Aspect Sentiment Triplet Extraction	We preprocess and include in this repository the latest version (Version 2) of the benchmark dataset for the task of Aspect-Sentiment-Triplet-Extraction (the most challenging subtask under the umbrella of ABSA)	aspect-sentiment-triplet-extraction, nlp, emnlp21	Link	rajdeep1989.iitkgp@gmail.com
IndoRE: Relation extraction dataset in Indian languages	Relation extraction dataset created from Wikipedia and web crawling for Bengali, Hindi, Telugu and English. In each sentence two entities are marked and the corresponding NER tags are also provided.	Relation extraction, CoNLL2021, Indic language	Link	arijitnag.iitkgp@gmail.com
Local Recommendations	Dataset of local point-of-interest recommendations on Yelp and Google Local (Google Maps)	web, recommendation, RecSys2020	Link	patrogourab@gmail.com
Bias Stance	A Dataset to Assert the Role of Target Entities for Detecting Stance	nlp, stance detection, naacl2020	Link	Ayushk4@gmail.com
Thou Shalt Not Hate: Countering Online Hate Speech	First ever large dataset on counter-speech. The dataset is based on counterspeech targeted to three different communities: Jews, Blacks, and LGBT. It consists of 6,898 comments annotated as counterspeech and an additional 7,026 comments tagged as non-counterspeech.	counter speech, nlp, social, icwsm2019	Link	punyajoys@iitkgp.ac.in
Tweets on Disaster Events and Ground Truth Summaries	Code and datasets related to sub-event identification and summarizing information during disasters	nlp, summarization, sigir2018	Link	koustav@iitism.ac.in
Coauthorship Network and Social Circles	One of the largest publicly available datasets from Microsoft Academic Search (MAS), which houses over 4.1 million publications and 2.7 million authors. All the papers were published in the computer science domain and are indexed by MAS.	network, social, kdd2015	Link	animeshm@cse.iitkgp.ac.in