EvalSan: Evaluation Toolkit for Sanskrit Embeddings | A suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in Sanskrit word embeddings. | Word embeddings, intrinsic evaluation, Sanskrit, WSC23 | Link | jivneshsandhan@gmail.com |
FinRED: A Dataset for Relation Extraction in Financial Domain | Relation extraction dataset created from Webhose financial news articles and earnings call transcripts. Contains a total of 7,775 sentences. | financial, relation extraction, www22 | Link | soumyasharma20@gmail.com |
ILSI: Indian Legal Statute Identification Dataset | English-language dataset for the Legal Statute Identification (LSI) task based on Indian court case documents. The LSI task requires one to identify the relevant statutes (written laws) given the facts/evidence of a situation/legal case. Contains fact portions from ~66k legal case documents from the Supreme Court and 6 High Courts of India. The label set contains 100 Sections (statutes) of the Indian Penal Code. | multi-label classification, nlp, legal, aaai22 | Link | shounakpaul95@kgpian.iitkgp.ac.in |
Placing (Historical) Facts on a Timeline: A Classification cum Coref Resolution Approach | Curated sentences from two historical corpora: the Collected Works of Mahatma Gandhi and the Collected Works of Abraham Lincoln. | sentence classification, event extraction, event coreference resolution, ECML-PKDD 2022 | Link | sayantanadak.skni@gmail.com |
Winds of Change: Impact of COVID-19 on Vaccine-Related Opinions of Twitter Users | Dataset of 1600 COVID-19 vaccine related tweets, labelled into Anti-Vax, Pro-Vax and Neutral. Also contains our classifier trained on multiple datasets for vaccine stance detection. | vaccine stance, social, icwsm2022 | Link | sohampoddar@kgpian.iitkgp.ac.in |
CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines | Benchmark dataset for explainable classification of concerns that people have towards COVID-19 vaccines. The dataset contains 9,921 anti-vax tweets labelled into 12 concerns in a multi-label setting. Each label is provided with an explanation. The dataset also includes class-wise summaries that can be used to evaluate multi-document summarization algorithms. | anti-vax concerns, explainability, nlp, social, sigir2022 | Link | sohampoddar@kgpian.iitkgp.ac.in |
Identification of Rhetorical Roles of Sentences in Indian Legal Judgments | Dataset containing 50 documents of the Indian Supreme Court. Each sentence is labelled with one of the 7 rhetorical roles. | semantic segmentation, sentence classification, legal | Link | paheli@iitkgp.ac.in |
Zalando products | Dataset of Zalando female fashion products, metadata, types and styles, images | image processing, outfit recommendation, ecir2022 | Link | harshm121@gmail.com |
MTLTS: A Multi-Task Framework To Obtain Trustworthy Summaries From Crisis-Related Microblogs | We include the pre-processed PHEME dataset consisting of 4,659 Twitter conversation threads, each with a source tweet and its replies, posted during four breaking-news events related to man-made disasters. For each thread, the source tweet is labeled as either rumour or non-rumour by a team of journalists. Stance labels for a subset of rumourous threads were obtained from the RumourEval 2019 dataset. For each of the four events, we also include an expert-curated extractive summary of around 250 words. The dataset is used for evaluating the trustworthiness of model-generated summaries. | trustworthy-summarization, disaster-tweets, ai-for-social-good, nlp, ir, wsdm22 | Link | rajdeep1989.iitkgp@gmail.com |
A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit | We release datasets for context-sensitive compound type identification, a multi-class classification problem. The dataset is available for multiple languages. | Context-sensitive task, multi-class classification, Sanskrit, low-resource NLP, COLING22 | Link | jivneshsandhan@gmail.com |
Dataset of Annotated Intent Phrases on Indian and Australian Legal Case Proceedings | Contains the main Indian and Australian data with NER labels as well as sentence classification labels. The folders "ind_phrases_2" and "aus_phrases" contain the extracted intent phrases (Indian phrases were manually annotated; Australian phrases were automatically extracted using JointBERT). The files "ind_labels.csv" and "aus_labels.csv" contain the different types of labels used for our experiments. | nlp, summarization, legal, evaluation metrics, lrec2022 | Link 1, Link 2 | aankanmullick@gmail.com, nandyabhilash@gmail.com |
Two-Face: Adversarial Audit of Commercial Face Recognition Systems | Dataset of adversarial images generated from standard face image datasets: CelebSET, CFD, and FairFace. | face recognition, adversarial audit, fairness and bias, icwsm22 | Link | siddsjaiswal@kgpian.iitkgp.ac.in |
DriBe: On-road Mobile Telemetry for Locality-Neutral Driving Behavior Annotation | Driving dataset containing road-view video clips and IMU/GPS data. | Driving behavior, Accident prediction, Driving safety, PerCom Workshops'22, MDM'22 | Link 1, Link 2 | debasreedas1994@gmail.com |
Joint Autoregressive and Graph Models for Software and Developer Social Networks | The dataset contains multiple packages with their dependency links, maintained by over 3800 developers, with over 280k bug reports. | Ubuntu packages, software dependency network, bug urgency prediction, developer recommendation | Link | to_rima@iitkgp.ac.in |
Reproducibility, Replicability and Beyond: Assessing Production Readiness of Aspect Based Sentiment Analysis | We include Aspect-based Sentiment Analysis datasets corresponding to reviews from four different domains: Laptop (Source: SemEval 2014), Restaurant (Source: SemEval 2014), Men's T-Shirt (new), and Television (new). | absa, reproducibility, nlp, ecir22 | Link | rajdeep1989.iitkgp@gmail.com |
Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages | We introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages, covering ten language families and containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. | Code-mixed, multilingual dataset, speech translation, automatic speech recognition, machine translation | Link | jivneshsandhan@gmail.com |
When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia | The dataset contains two separate lists of prolific editors in the Wikipedia community (missing and active), as well as different features representing the editors' editing activity. | Wikipedia, Missing editor, Prolific contributor, Platform moderation | Link | dasparamita1708@gmail.com |
Quality Change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia | This work shows a dynamic portfolio of quality changes in English Wikipedia articles and further proposes a novel unsupervised page-level approach to detect quality changes in advance. The dataset contains human-assessed quality labels as well as a set of features representing the revisions for 33k English Wikipedia articles. | Wikipedia, Quality classes, Change point detection, Unsupervised approach | Link | dasparamita1708@gmail.com |
Is this bug severe? A text-cum-graph based model for bug severity prediction | The dataset consists of 280K bugs along with their metadata (e.g., the textual description of the bug). The ground-truth severity scores of these bugs were collected at two different time points to facilitate the prediction experiments. | Ubuntu Bugs, Bug severity prediction, Package-affect graph | Link | to_rima@iitkgp.ac.in |
Autonomous driving data from CARLA simulator | Contains 262 episodes of driving data from the CARLA simulator, along with annotations for training the Conditional Affordance Learning (CAL) [http://proceedings.mlr.press/v87/sauer18a/sauer18a.pdf] driving model. | self-driving, CARLA, autonomous driving | Link | soumid.04@gmail.com |
"Short is the Road that Leads from Fear to Hate" :Fear Speech in Indian WhatsApp Groups | First ever dataset for fear speech detection from public political Whatsapp groups. Contains 1000 unique fear speech and around 4000 normal posts. users posting fear speech use violent events, emojis and controversial websites to create fear about community. | fear speech, nlp, social, www2021 | Link | punyajoys@iitkgp.ac.in |
Dataset on Question Answering over Electronic Devices | Benchmark QA datasets that include a pre-training corpus of E-Manuals, question-answer pairs curated by experts based upon two E-Manuals, real user questions from a Community Question Answering forum pertaining to E-Manuals, etc. | nlp, question-answering, emnlp2021 | Link | nandyabhilash@gmail.com |
HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection | HateXplain is the first benchmark dataset for hate speech with word- and phrase-level span annotations that capture human rationales for the labeling. Using MTurk, a large dataset of around 20K posts was collected and annotated to cover three aspects of each post: label, rationales and target. | hate speech, explainability, nlp, aaai2021 | Link | punyajoys@iitkgp.ac.in |
You too Brutus! Trapping Hateful Users in Social Media: Challenges, Solutions & Insights | Benchmark dataset for hateful user detection on Gab. Contains 423 hateful and 375 non-hateful users along with their posts and user networks (available on request). | hate speech, user level detection, ht2021 | Link | mithundas@iitkgp.ac.in |
Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task Approach | We include three datasets: (1) EmoBank: a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme; (2) SenWave: a public sentiment analysis dataset for COVID-19 research; and (3) Affect In Tweets: created as part of SemEval 2018 Task 1, it consists of around 10K tweets labeled with the presence/absence of a total of 11 emotions. | multi-task learning, emotion-analysis, sentiment-analysis, valence-arousal-dominance, nlp, sigir21 | Link | rajdeep1989.iitkgp@gmail.com |
PASTE: A Tagging-Free Decoding Framework Using Pointer Networks for Aspect Sentiment Triplet Extraction | We preprocess and include in this repository the latest version (Version 2) of the benchmark dataset for the task of Aspect-Sentiment-Triplet-Extraction (the most challenging subtask under the umbrella of ABSA) | aspect-sentiment-triplet-extraction, nlp, emnlp21 | Link | rajdeep1989.iitkgp@gmail.com |
IndoRE: Relation extraction dataset in Indian languages | Relation extraction dataset created from Wikipedia and web crawling for Bengali, Hindi, Telugu and English. In each sentence, two entities are marked and the corresponding NER tags are also provided. | Relation extraction, CoNLL2021, Indic language | Link | arijitnag.iitkgp@gmail.com |
Local Recommendations | Dataset of local point-of-interest recommendations on Yelp and Google Local (Google Maps) | web, recommendation, RecSys2020 | Link | patrogourab@gmail.com |
Bias Stance | A Dataset to Assert the Role of Target Entities for Detecting Stance | nlp, stance detection, naacl2020 | Link | Ayushk4@gmail.com |
Thou Shalt Not Hate: Countering Online Hate Speech | First ever large dataset on counter-speech. The dataset is based on counterspeech targeted to three different communities: Jews, Blacks, and LGBT. It consists of 6,898 comments annotated as counterspeech and an additional 7,026 comments tagged as non-counterspeech. | counter speech, nlp, social, icwsm2019 | Link | punyajoys@iitkgp.ac.in |
Tweets on Disaster Events and Ground Truth Summaries | Code and datasets related to sub-event identification and summarizing information during disasters. | nlp, summarization, sigir2018 | Link | koustav@iitism.ac.in |
Coauthorship Network and Social Circles | One of the largest publicly available datasets from Microsoft Academic Search (MAS), which houses over 4.1 million publications and 2.7 million authors. All the papers were published in the computer science domain and indexed by MAS. | network, social, kdd2015 | Link | animeshm@cse.iitkgp.ac.in |