Datasets


Title | Short Dataset Description | Tag(s) | Link(s) | Contact
EvalSan: Evaluation Toolkit for Sanskrit Embeddings We include a suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in Sanskrit word embeddings. Word embeddings, intrinsic evaluation, Sanskrit, WSC23 Link jivneshsandhan@gmail.com
FinRED: A Dataset for Relation Extraction in Financial Domain Relation extraction dataset created from Webhose financial news articles and earnings call transcripts. Contains a total of 7,775 sentences. financial, relation extraction, www22 Link soumyasharma20@gmail.com
ILSI: Indian Legal Statute Identification Dataset English-language dataset for the Legal Statute Identification (LSI) task based on Indian court case documents. The LSI task requires one to identify the relevant statutes (written laws) given the facts/evidence of a situation/legal case. Contains fact portions from ~66k legal case documents from the Supreme Court and 6 High Courts of India. The label set contains 100 Sections (statutes) of the Indian Penal Code. multi-label classification, nlp, legal, aaai22 Link shounakpaul95@kgpian.iitkgp.ac.in
Placing (Historical) Facts on a Timeline: A Classification cum Coref Resolution Approach Curated sentences from two historical corpora: the Collected Works of Mahatma Gandhi and the Collected Works of Abraham Lincoln. sentence classification, event extraction, event coreference resolution, ECML-PKDD 2022 Link sayantanadak.skni@gmail.com
Winds of Change: Impact of COVID-19 on Vaccine-Related Opinions of Twitter Users Dataset of 1,600 COVID-19 vaccine-related tweets, labelled as Anti-Vax, Pro-Vax, or Neutral. Also contains our classifier trained on multiple datasets for vaccine stance detection. vaccine stance, social, icwsm2022 Link sohampoddar@kgpian.iitkgp.ac.in
CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines Benchmark dataset for explainable classification of concerns that people have towards COVID-19 vaccines. The dataset contains 9,921 anti-vax tweets labelled with 12 concerns in a multi-label setting. Each label is provided with an explanation. The dataset also includes class-wise summaries that can be used to evaluate multi-document summarization algorithms. anti-vax concerns, explainability, nlp, social, sigir2022 Link sohampoddar@kgpian.iitkgp.ac.in
Identification of Rhetorical Roles of Sentences in Indian Legal Judgments Dataset containing 50 documents of the Indian Supreme Court. Each sentence is labelled with one of the 7 rhetorical roles. semantic segmentation, sentence classification, legal Link paheli@iitkgp.ac.in
Zalando products Dataset of Zalando female fashion products, metadata, types and styles, images image processing, outfit recommendation, ecir2022 Link harshm121@gmail.com
MTLTS: A Multi-Task Framework To Obtain Trustworthy Summaries From Crisis-Related Microblogs We include the pre-processed PHEME dataset consisting of 4,659 Twitter conversation threads, each with a source tweet and its replies, posted during four breaking-news events related to man-made disasters. For each thread, the source tweet is labeled as either rumour or non-rumour by a team of journalists. Stance labels for a subset of rumourous threads were obtained from the RumourEval 2019 dataset. For each of the four events, we also include an expert-curated extractive summary of around 250 words. The dataset is used for evaluating the trustworthiness of model-generated summaries. trustworthy-summarization, disaster-tweets, ai-for-social-good, nlp, ir, wsdm22 Link rajdeep1989.iitkgp@gmail.com
A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit We release datasets for context-sensitive compound type identification, which is a multi-class classification problem. The dataset is available for multiple languages. Context-sensitive task, multi-class classification, Sanskrit, low-resource NLP, COLING22 Link jivneshsandhan@gmail.com
Dataset of Annotated Intent Phrases on Indian and Australian Legal Case Proceedings Contains the main Indian and Australian data with NER labels as well as sentence classification labels. The folders named "ind_phrases_2" and "aus_phrases" contain the extracted intent phrases (the Indian phrases were manually annotated; the Australian phrases were automatically extracted using JointBERT). The files "ind_labels.csv" and "aus_labels.csv" contain the different types of labels used for our experiments. nlp, summarization, legal, evaluation metrics, lrec2022 Link 1, Link 2 aankanmullick@gmail.com, nandyabhilash@gmail.com
Two-Face: Adversarial Audit of Commercial Face Recognition Systems Dataset of adversarial images generated from standard face image datasets: CelebSET, CFD, and FairFace. face recognition, adversarial audit, fairness and bias, icwsm22 Link siddsjaiswal@kgpian.iitkgp.ac.in
DriBe: On-road Mobile Telemetry for Locality-Neutral Driving Behavior Annotation Driving dataset containing road-view video clips and IMU/GPS data. Driving behavior, Accident prediction, Driving safety, PerCom Workshops'22, MDM'22 Link 1, Link 2 debasreedas1994@gmail.com
Joint Autoregressive and Graph Models for Software and Developer Social Networks The dataset contains multiple packages with their dependency links, maintained by over 3,800 developers, with over 280k bug reports. Ubuntu packages, software dependency network, bug urgency prediction, developer recommendation Link to_rima@iitkgp.ac.in
Reproducibility, Replicability and Beyond: Assessing Production Readiness of Aspect Based Sentiment Analysis We include Aspect-based Sentiment Analysis datasets corresponding to reviews from four different domains: Laptop (Source: SemEval 2014), Restaurant (Source: SemEval 2014), Men's T-Shirt (new), and Television (new). absa, reproducibility, nlp, ecir22 Link rajdeep1989.iitkgp@gmail.com
Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages We introduce Prabhupadavani, a multilingual code-mixed speech translation (ST) dataset for 25 languages, covering ten language families and containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. Code-mixed, multilingual dataset, speech translation, automatic speech recognition, machine translation Link jivneshsandhan@gmail.com
When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia The dataset contains two separate lists of prolific editors in the Wikipedia community (missing and active), as well as different features representing the editors' editing activity. Wikipedia, Missing editor, Prolific contributor, Platform moderation Link dasparamita1708@gmail.com
Quality Change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia This work shows a dynamic portfolio of quality changes in English Wikipedia articles and further proposes a novel unsupervised page-level approach to detect quality changes in advance. The dataset contains human-assessed quality labels as well as a set of features representing the revisions for 33k English Wikipedia articles. Wikipedia, Quality classes, Change point detection, Unsupervised approach Link dasparamita1708@gmail.com
Is this bug severe? A text-cum-graph based model for bug severity prediction The dataset consists of 280K bugs along with their metadata (e.g., the textual description of the bug). The ground-truth severity scores of these bugs were collected at two different time points to facilitate the prediction experiments. Ubuntu Bugs, Bug severity prediction, Package-affect graph Link to_rima@iitkgp.ac.in
Autonomous driving data from CARLA simulator Contains 262 episodes of driving data collected using the CARLA simulator, along with annotations for training the Conditional Affordance Learning (CAL) [http://proceedings.mlr.press/v87/sauer18a/sauer18a.pdf] driving model. self-driving, CARLA, autonomous driving Link soumid.04@gmail.com
"Short is the Road that Leads from Fear to Hate": Fear Speech in Indian WhatsApp Groups First ever dataset for fear speech detection from public political WhatsApp groups. Contains 1,000 unique fear speech posts and around 4,000 normal posts. Users posting fear speech use violent events, emojis, and controversial websites to create fear about a community. fear speech, nlp, social, www2021 Link punyajoys@iitkgp.ac.in
Dataset on Question Answering over Electronic Devices Benchmark QA datasets that include a pre-training corpus of E-manuals, question-answer pairs curated by experts based on two E-manuals, real user questions from a Community Question Answering forum pertaining to E-manuals, etc. nlp, question-answering, emnlp2021 Link nandyabhilash@gmail.com
HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection The HateXplain dataset is the first benchmark dataset for hate speech with word- and phrase-level span annotations that capture human rationales for the labeling. Using MTurk, a large dataset of around 20K posts was collected and annotated to cover three aspects of each post: label, rationales, and target. hate speech, explainability, nlp, aaai2021 Link punyajoys@iitkgp.ac.in
You too Brutus! Trapping Hateful Users in Social Media: Challenges, Solutions & Insights Benchmark dataset for hateful users detection on Gab. Contains 423 hateful and 375 non-hateful users along with their posts and user networks (available on request). hate speech, user level detection, ht2021 Link mithundas@iitkgp.ac.in
Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task Approach We include three datasets: (1) EmoBank: a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme; (2) SenWave: a public sentiment analysis dataset for COVID-19 research; and (3) Affect in Tweets: created as part of SemEval 2018 Task 1, it consists of around 10K tweets labeled with the presence/absence of a total of 11 emotions. multi-task learning, emotion-analysis, sentiment-analysis, valence-arousal-dominance, nlp, sigir21 Link rajdeep1989.iitkgp@gmail.com
PASTE: A Tagging-Free Decoding Framework Using Pointer Networks for Aspect Sentiment Triplet Extraction We preprocess and include in this repository the latest version (Version 2) of the benchmark dataset for the task of Aspect-Sentiment-Triplet-Extraction (the most challenging subtask under the umbrella of ABSA) aspect-sentiment-triplet-extraction, nlp, emnlp21 Link rajdeep1989.iitkgp@gmail.com
IndoRE: Relation extraction dataset in Indian languages Relation extraction dataset created from Wikipedia and web crawling for Bengali, Hindi, Telugu, and English. In each sentence, two entities are marked and the corresponding NER tags are also provided. Relation extraction, CoNLL2021, Indic language Link arijitnag.iitkgp@gmail.com
Local Recommendations Dataset of local point-of-interest recommendations on Yelp and Google Local (Google Maps) web, recommendation, RecSys2020 Link patrogourab@gmail.com
Bias Stance A Dataset to Assert the Role of Target Entities for Detecting Stance nlp, stance detection, naacl2020 Link Ayushk4@gmail.com
Thou Shalt Not Hate: Countering Online Hate Speech First ever large dataset on counter-speech. The dataset is based on counterspeech targeted to three different communities: Jews, Blacks, and LGBT. It consists of 6,898 comments annotated as counterspeech and an additional 7,026 comments tagged as non-counterspeech. counter speech, nlp, social, icwsm2019 Link punyajoys@iitkgp.ac.in
Tweets on Disaster Events and Ground Truth Summaries Code and datasets related to sub-event identification and summarizing information during disasters. nlp, summarization, sigir2018 Link koustav@iitism.ac.in
Coauthorship Network and Social Circles One of the largest publicly available datasets from Microsoft Academic Search (MAS), which houses over 4.1 million publications and 2.7 million authors. All the papers were published specifically in the computer science domain and indexed by MAS. network, social, kdd2015 Link animeshm@cse.iitkgp.ac.in