publications
publications by category in reverse chronological order, generated by jekyll-scholar.
2025
- [Under review] Personal Information Parroting in Language Models. Nishant Subramani, Kshitish Ghate, and Mona Diab. Preprint, 2025.
Modern language models are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which language models (LMs) memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization and parroting, finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to models of varying sizes (160M-6.9B) and pretraining timesteps (70k-143k iterations) on the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
@article{Subramani2025personalinfoparroting, title = {Personal Information Parroting in Language Models}, author = {Subramani, Nishant and Ghate, Kshitish and Diab, Mona}, journal = {Preprint}, year = {2025}, url = {}, paper_link = {}, }
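The parroting check above is mechanically simple. A minimal sketch, assuming the public Pythia checkpoints on the Hugging Face hub and plain greedy decoding; the paper's exact prompting and tokenization details may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # smallest model in the suite studied
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def is_parroted(prefix: str, pi_span: str) -> bool:
    """Prompt with the tokens preceding the PI span and check whether
    greedy decoding reproduces the span exactly."""
    prompt_ids = tokenizer(prefix, return_tensors="pt").input_ids
    span_len = len(tokenizer(pi_span).input_ids)
    with torch.no_grad():
        out = model.generate(
            prompt_ids,
            max_new_tokens=span_len,
            do_sample=False,  # greedy decoding, as in the paper
        )
    continuation = tokenizer.decode(out[0, prompt_ids.shape[1]:])
    return continuation.startswith(pi_span)

# Hypothetical example (not from the paper's dataset):
# is_parroted("For support, email us at ", "jane.doe@example.com")
```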
- [NAACL] MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools. Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, and Sam Thomson. NAACL, 2025.
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logit lens (nostalgebraist, 2020) and then computes similarity scores between each layer’s generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
@article{Subramani2025mice4cat, title = {MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools}, author = {Subramani, Nishant and Eisner, Jason and Svegliato, Justin and Durme, Benjamin Van and Su, Yu and Thomson, Sam}, journal = {NAACL}, year = {2025}, url = {https://nishantsubramani.github.io/assets/pdf/mice4cats_paper.pdf}, paper_link = {https://nishantsubramani.github.io/assets/pdf/mice4cats_paper.pdf}, }
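A minimal sketch of the model-internal feature extraction MICE describes, reduced here to a single next-token step: read each intermediate layer through the final norm and unembedding (the logit lens) and score its agreement with the model's final prediction. The paper instead compares whole generations with similarity scores before feeding features to a learned classifier; the checkpoint name and agreement score below are illustrative assumptions. See the open-source repository linked above for the real implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The paper uses Llama3 models; this checkpoint is gated, and any
# Llama-architecture model exposes the same norm/lm_head modules.
name = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def layer_agreement_features(prompt: str) -> list[float]:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    final_token = out.logits[0, -1].argmax()
    feats = []
    for h in out.hidden_states[1:-1]:  # each intermediate layer
        # Logit lens: decode the layer through the final norm + unembedding.
        logits = model.lm_head(model.model.norm(h[0, -1]))
        # Agreement feature: probability this layer assigns to the token
        # the full model ultimately predicts.
        feats.append(torch.softmax(logits, -1)[final_token].item())
    return feats  # MICE feeds such features to a probabilistic classifier
```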
2024
- [ACL] OLMo: Accelerating the Science of Language Models. Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, and 31 more authors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024.
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.
@inproceedings{groeneveld-etal-2024-olmo, title = {{OLM}o: Accelerating the Science of Language Models}, author = {Groeneveld, Dirk and Beltagy, Iz and Walsh, Evan and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and Arora, Shane and Atkinson, David and Authur, Russell and Chandu, Khyathi and Cohan, Arman and Dumas, Jennifer and Elazar, Yanai and Gu, Yuling and Hessel, Jack and Khot, Tushar and Merrill, William and Morrison, Jacob and Muennighoff, Niklas and Naik, Aakanksha and Nam, Crystal and Peters, Matthew and Pyatkin, Valentina and Ravichander, Abhilasha and Schwenk, Dustin and Shah, Saurabh and Smith, William and Strubell, Emma and Subramani, Nishant and Wortsman, Mitchell and Dasigi, Pradeep and Lambert, Nathan and Richardson, Kyle and Zettlemoyer, Luke and Dodge, Jesse and Lo, Kyle and Soldaini, Luca and Smith, Noah and Hajishirzi, Hannaneh}, editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek}, booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = aug, year = {2024}, address = {Bangkok, Thailand}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.acl-long.841}, paper_link = {https://arxiv.org/abs/2402.00838}, }
- [ACL] Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, and 24 more authors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024.
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
@inproceedings{soldaini-etal-2024-dolma, title = {Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}, author = {Soldaini, Luca and Kinney, Rodney and Bhagia, Akshita and Schwenk, Dustin and Atkinson, David and Authur, Russell and Bogin, Ben and Chandu, Khyathi and Dumas, Jennifer and Elazar, Yanai and Hofmann, Valentin and Jha, Ananya and Kumar, Sachin and Lucy, Li and Lyu, Xinxi and Lambert, Nathan and Magnusson, Ian and Morrison, Jacob and Muennighoff, Niklas and Naik, Aakanksha and Nam, Crystal and Peters, Matthew and Ravichander, Abhilasha and Richardson, Kyle and Shen, Zejiang and Strubell, Emma and Subramani, Nishant and Tafjord, Oyvind and Walsh, Evan and Zettlemoyer, Luke and Smith, Noah and Hajishirzi, Hannaneh and Beltagy, Iz and Groeneveld, Dirk and Dodge, Jesse and Lo, Kyle}, booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year = {2024}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.acl-long.840}, paper_link = {https://arxiv.org/abs/2402.00159}, }
- [TrustNLP @ NAACL] Evaluating Personal Information Parroting in Language Models. Nishant Subramani, Kshitish Ghate, and Mona Diab. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), Jun 2024.
@inproceedings{subramani-etal-2024-evaluating, title = {Evaluating Personal Information Parroting in Language Models}, author = {Subramani, Nishant and Ghate, Kshitish and Diab, Mona}, booktitle = {Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, paper_link = {https://nishantsubramani.github.io}, }
2023
- [GEM @ EMNLP] Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk. Kyle Richardson, Ian Magnusson, Oyvind Tafjord, Akshita Bhagia, Iz Beltagy, Arman Cohan, Pradeep Dasigi, Jesse Dodge, Dirk Groeneveld, Yuling Gu, Ananya Harsh Jha, Tushar Khot, and Nishant Subramani. In Proceedings of the 3rd Workshop on Generation, Evaluation and Metrics (GEM 2023), Dec 2023.
@inproceedings{richardson-etal-2023-robust-tooling, title = {Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk}, author = {Richardson, Kyle and Magnusson, Ian and Tafjord, Oyvind and Bhagia, Akshita and Beltagy, Iz and Cohan, Arman and Dasigi, Pradeep and Dodge, Jesse and Groeneveld, Dirk and Gu, Yuling and Jha, Ananya Harsh and Khot, Tushar and Subramani, Nishant}, booktitle = {Proceedings of the 3rd Workshop on Generation, Evaluation and Metrics (GEM 2023)}, month = dec, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, paper_link = {https://nishantsubramani.github.io}, }
- [TrustNLP @ ACL] Detecting Personal Information in Training Corpora: an Analysis. Nishant Subramani, Sasha Luccioni, Jesse Dodge, and Margaret Mitchell. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), 2023.
Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively under-explored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.
@inproceedings{subramani-etal-2023-detecting, title = {Detecting Personal Information in Training Corpora: an Analysis}, author = {Subramani, Nishant and Luccioni, Sasha and Dodge, Jesse and Mitchell, Margaret}, editor = {Ovalle, Anaelia and Chang, Kai-Wei and Mehrabi, Ninareh and Pruksachatkun, Yada and Galystan, Aram and Dhamala, Jwala and Verma, Apurv and Cao, Trista and Kumar, Anoop and Gupta, Rahul}, booktitle = {Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)}, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.trustnlp-1.18}, paper_link = {https://aclanthology.org/2023.trustnlp-1.18/}, }
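A minimal sketch of regex-based detection of the kind the paper compares against; these patterns are illustrative assumptions, not the paper's rules, and real detectors need far more care (obfuscated addresses, international formats, Luhn checksum validation for card numbers):

```python
import re

# Deliberately simple patterns for two of the highest-risk PI types.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pi(text: str) -> dict[str, list[str]]:
    """Return every match of each PI pattern found in the text."""
    return {kind: pat.findall(text) for kind, pat in PATTERNS.items()}

print(detect_pi("Reach jane.doe@example.com, card 4111 1111 1111 1111"))
```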
2022
- [GEM @ EMNLP] Don’t Say What You Don’t Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search. Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, and Doug Downey. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), Dec 2022.
Abstractive summarization systems today produce fluent and relevant output, but often “hallucinate” statements not supported by the source text. We analyze the connection between hallucinations and training data, and find evidence that models hallucinate because they train on target summaries that are unsupported by the source. Based on our findings, we present PINOCCHIO, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations. Given the model states and outputs at a given step, PINOCCHIO detects likely model hallucinations based on various measures of attribution to the source text. PINOCCHIO backtracks to find more consistent output, and can opt to produce no summary at all when no consistent generation can be found. In experiments, we find that PINOCCHIO improves the consistency of generation by an average of 67% on two abstractive summarization datasets, without hurting recall.
@inproceedings{king-etal-2022-dont, title = {Don{'}t Say What You Don{'}t Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search}, author = {King, Daniel and Shen, Zejiang and Subramani, Nishant and Weld, Daniel S. and Beltagy, Iz and Downey, Doug}, editor = {Bosselut, Antoine and Chandu, Khyathi and Dhole, Kaustubh and Gangal, Varun and Gehrmann, Sebastian and Jernite, Yacine and Novikova, Jekaterina and Perez-Beltrachini, Laura}, booktitle = {Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)}, month = dec, year = {2022}, address = {Abu Dhabi, United Arab Emirates (Hybrid)}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.gem-1.51}, paper_link = {https://arxiv.org/abs/2203.08436/}, }
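A toy sketch of the backtracking control flow PINOCCHIO describes: reject decoding steps whose consistency score falls below a threshold, back up to the next-best candidate, and give up (emit no summary) when no consistent path exists. The consistency function is a placeholder assumption; the paper derives it from measures of attribution to the source text.

```python
def backtracking_decode(step_fn, consistency_fn, max_len=50, k=5, tau=0.5):
    """step_fn(prefix) -> top-k candidate tokens at this step;
    consistency_fn(prefix, token) -> score in [0, 1]."""
    prefix = []
    stack = [list(step_fn(prefix))[:k]]   # untried candidates per depth
    while stack:
        if len(prefix) == max_len:
            return prefix
        if not stack[-1]:                 # no consistent continuation here:
            stack.pop()                   # backtrack to the previous step
            if prefix:
                prefix.pop()
            continue
        tok = stack[-1].pop(0)            # next-best untried candidate
        if consistency_fn(prefix, tok) >= tau:
            prefix.append(tok)
            if tok == "<eos>":
                return prefix
            stack.append(list(step_fn(prefix))[:k])
    return None  # opt to produce no summary when nothing consistent exists
```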
- [EMNLP] GEMv2: Multilingual NLG Benchmarking in a Single Line of Code. Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina Mcmillan-major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, and 65 more authors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Dec 2022.
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
@inproceedings{gehrmann-etal-2022-gemv2, title = {{GEM}v2: Multilingual {NLG} Benchmarking in a Single Line of Code}, author = {Gehrmann, Sebastian and Bhattacharjee, Abhik and Mahendiran, Abinaya and Wang, Alex and Papangelis, Alexandros and Madaan, Aman and Mcmillan-major, Angelina and Shvets, Anna and Upadhyay, Ashish and Bohnet, Bernd and Yao, Bingsheng and Wilie, Bryan and Bhagavatula, Chandra and You, Chaobin and Thomson, Craig and Garbacea, Cristina and Wang, Dakuo and Deutsch, Daniel and Xiong, Deyi and Jin, Di and Gkatzia, Dimitra and Radev, Dragomir and Clark, Elizabeth and Durmus, Esin and Ladhak, Faisal and Ginter, Filip and Winata, Genta Indra and Strobelt, Hendrik and Hayashi, Hiroaki and Novikova, Jekaterina and Kanerva, Jenna and Chim, Jenny and Zhou, Jiawei and Clive, Jordan and Maynez, Joshua and Sedoc, Jo{\~a}o and Juraska, Juraj and Dhole, Kaustubh and Chandu, Khyathi Raghavi and Beltrachini, Laura Perez and Ribeiro, Leonardo F . R. and Tunstall, Lewis and Zhang, Li and Pushkarna, Mahim and Creutz, Mathias and White, Michael and Kale, Mihir Sanjay and Eddine, Moussa Kamal and Daheim, Nico and Subramani, Nishant and Dusek, Ondrej and Liang, Paul Pu and Ammanamanchi, Pawan Sasanka and Zhu, Qi and Puduppully, Ratish and Kriz, Reno and Shahriyar, Rifat and Cardenas, Ronald and Mahamood, Saad and Osei, Salomey and Cahyawijaya, Samuel and {\v{S}}tajner, Sanja and Montella, Sebastien and Jolly, Shailza and Mille, Simon and Hasan, Tahmid and Shen, Tianhao and Adewumi, Tosin and Raunak, Vikas and Raheja, Vipul and Nikolaev, Vitaly and Tsai, Vivian and Jernite, Yacine and Xu, Ying and Sang, Yisi and Liu, Yixin and Hou, Yufang}, editor = {Che, Wanxiang and Shutova, Ekaterina}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations}, month = dec, year = {2022}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.emnlp-demos.27}, paper_link = {https://arxiv.org/abs/2206.11249/}, }
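The advertised single line of code refers to loading GEM tasks through the Hugging Face hub; a minimal sketch, where the specific dataset name and field are assumptions for illustration:

```python
from datasets import load_dataset

xsum = load_dataset("GEM/xsum")         # the single line: one GEM task
print(xsum["validation"][0]["target"])  # fields follow GEM's data schema
```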
- [ACL] Extracting Latent Steering Vectors from Pretrained Language Models. Nishant Subramani, Nivedita Suresh, and Matthew Peters. In Findings of the Association for Computational Linguistics: ACL 2022, May 2022.
Prior work on controllable text generation has focused on learning how to control language models through trainable decoding, smart-prompt design, or fine-tuning based on a desired objective. We hypothesize that the information needed to steer the model to generate a target sentence is already encoded within the model. Accordingly, we explore a different approach altogether: extracting latent vectors directly from pretrained language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly (> 99 BLEU) for English sentences from a variety of domains. We show that vector arithmetic can be used for unsupervised sentiment transfer on the Yelp sentiment benchmark, with performance comparable to models tailored to this task. We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark (STS-B), outperforming pooled hidden states of models. Finally, we present an analysis of the intrinsic properties of the steering vectors. Taken together, our results suggest that frozen LMs can be effectively controlled through their latent steering space.
@inproceedings{subramani-etal-2022-extracting, title = {Extracting Latent Steering Vectors from Pretrained Language Models}, author = {Subramani, Nishant and Suresh, Nivedita and Peters, Matthew}, editor = {Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2022}, month = may, year = {2022}, address = {Dublin, Ireland}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.findings-acl.48}, paper_link = {https://arxiv.org/abs/2205.05124/}, }
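A minimal sketch of the extraction procedure, assuming GPT-2 as the frozen decoder and an arbitrary injection layer (the paper studies several injection sites and models): optimize a single vector, added to the hidden states, until the target sentence becomes maximally likely under teacher forcing.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)              # the LM stays frozen throughout

target = "The quick brown fox jumps over the lazy dog."
ids = tok(target, return_tensors="pt").input_ids
z = torch.zeros(1, 1, model.config.n_embd, requires_grad=True)

def add_z(module, args, output):
    # Inject the steering vector into this layer's hidden states.
    return (output[0] + z,) + output[1:]

handle = model.transformer.h[4].register_forward_hook(add_z)  # site assumed
opt = torch.optim.Adam([z], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = model(ids, labels=ids).loss   # teacher-forced NLL of the target
    loss.backward()
    opt.step()
# With the hook still active, greedy decoding should now reproduce the
# target; handle.remove() restores the unsteered model. Arithmetic between
# such vectors underlies the unsupervised sentiment-transfer experiments.
```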
- [BigScience Workshop] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, and 379 more authors. ArXiv, 2022.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
@article{Scao2022BLOOMA1, title = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model}, author = {Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili'c, Suzana and Hesslow, Daniel and Castagn'e, Roman and Luccioni, Alexandra Sasha and Yvon, François and Gall{\'e}, Matthias and Tow, Jonathan and Rush, Alexander M. and Biderman, Stella and Webson, Albert and Ammanamanchi, Pawan Sasanka and Wang, Thomas and Sagot, Beno{\i}t and Muennighoff, Niklas and del Moral, Albert Villanova and Ruwase, Olatunji and Bawden, Rachel and Bekman, Stas and McMillan-Major, Angelina and Beltagy, Iz and Nguyen, Huu and Saulnier, Lucile and Tan, Samson and Suarez, Pedro Ortiz and Sanh, Victor and Laurenccon, Hugo and Jernite, Yacine and Launay, Julien and Mitchell, Margaret and Raffel, Colin and Gokaslan, Aaron and Simhi, Adi and Etxabe, Aitor Soroa and Aji, Alham Fikri and Alfassy, Amit and Rogers, Anna and Nitzav, Ariel Kreisberg and Xu, Canwen and Mou, Chenghao and Emezue, Chris C. and Klamm, Christopher and Leong, Colin and van Strien, Daniel Alexander and Adelani, David Ifeoluwa and Radev, Dragomir R. and Ponferrada, Eduardo Gonz'alez and Levkovizh, Efrat and Kim, Ethan and Natan, Eyal and Toni, Francesco De and Dupont, G{\'e}rard and Kruszewski, Germ{\'a}n and Pistilli, Giada and ElSahar, Hady and Benyamina, Hamza and Tran, Hieu Trung and Yu, Ian and Abdulmumin, Idris and Johnson, Isaac and Gonzalez-Dios, Itziar and de la Rosa, Javier and Chim, Jenny and Dodge, Jesse and Zhu, Jian and Chang, Jonathan and Frohberg, Jorg and Tobing, Josephine and Bhattacharjee, Joydeep and Almubarak, Khalid and Chen, Kimbo and Lo, Kyle and von Werra, Leandro and Weber, Leon and Phan, Long and Allal, Loubna Ben and Tanguy, Ludovic and Dey, Manan and Mu{\~n}oz, Manuel Romero and Masoud, Maraim and Grandury, Mar'ia and vSavsko, Mario and Huang, Max and Coavoux, Maximin and Singh, Mayank and Jiang, Mike Tian-Jian and Vu, Minh Chien and Jauhar, Mohammad A. and Ghaleb, Mustafa and Subramani, Nishant and Kassner, Nora and Khamis, Nurulaqilla and Nguyen, Olivier and Espejel, Omar and de Gibert, Ona and Villegas, Paulo and Henderson, Peter and Colombo, Pierre and Amuok, Priscilla and Lhoest, Quentin and Harliman, Rheza and Bommasani, Rishi and L'opez, Roberto and Ribeiro, Rui and Osei, Salomey and Pyysalo, Sampo and Nagel, Sebastian and Bose, Shamik and Muhammad, Shamsuddeen Hassan and Sharma, Shanya and Longpre, S. and Nikpoor, Somaieh and Silberberg, S. and Pai, Suhas and Zink, Sydney and Torrent, Tiago Timponi and Schick, Timo and Thrush, Tristan and Danchev, Valentin and Nikoulina, Vassilina and Laippala, Veronika and Lepercq, Violette and Prabhu, Vrinda and Alyafeai, Zaid and Talat, Zeerak and Raja, Arun and Heinzerling, Benjamin and Si, Chenglei and Salesky, Elizabeth and Mielke, Sabrina J. and Lee, Wilson Y. and Sharma, Abheesht and Santilli, Andrea and Chaffin, Antoine and Stiegler, Arnaud and Datta, Debajyoti and Szczechla, Eliza and Chhablani, Gunjan and Wang, Han and Pandey, Harshit and Strobelt, Hendrik and Fries, Jason Alan and Rozen, Jos and Gao, Leo and Sutawika, Lintang and Bari, M Saiful and Al-Shaibani, Maged S. and Manica, Matteo and Nayak, Nihal V. and Teehan, Ryan and Albanie, Samuel and Shen, Sheng and Ben-David, Srulik and Bach, Stephen H. 
and Kim, Taewoon and Bers, Tali and F{\'e}vry, Thibault and Neeraj, Trishala and Thakker, Urmish and Raunak, Vikas and Tang, Xiang and Yong, Zheng-Xin and Sun, Zhiqing and Brody, Shaked and Uri, Y and Tojarieh, Hadar and Roberts, Adam and Chung, Hyung Won and Tae, Jaesung and Phang, Jason and Press, Ofir and Li, Conglong and Narayanan, Deepak and Bourfoune, Hatim and Casper, Jared and Rasley, Jeff and Ryabinin, Max and Mishra, Mayank and Zhang, Minjia and Shoeybi, Mohammad and Peyrounette, Myriam and Patry, Nicolas and Tazi, Nouamane and Sanseviero, Omar and von Platen, Patrick and Cornette, Pierre and Lavall'ee, Pierre Franccois and Lacroix, R{\'e}mi and Rajbhandari, Samyam and Gandhi, Sanchit and Smith, Shaden and Requena, St{\'e}phane and Patil, Suraj and Dettmers, Tim and Baruwa, Ahmed and Singh, Amanpreet and Cheveleva, Anastasia and Ligozat, Anne-Laure and Subramonian, Arjun and N'ev'eol, Aur'elie and Lovering, Charles and Garrette, Daniel H and Tunuguntla, Deepak R. and Reiter, Ehud and Taktasheva, Ekaterina and Voloshina, Ekaterina and Bogdanov, Eli and Winata, Genta Indra and Schoelkopf, Hailey and Kalo, Jan-Christoph and Novikova, Jekaterina and Forde, Jessica Zosa and Tang, Xiangru and Kasai, Jungo and Kawamura, Ken and Hazan, Liam and Carpuat, Marine and Clinciu, Miruna and Kim, Najoung and Cheng, Newton and Serikov, Oleg and Antverg, Omer and van der Wal, Oskar and Zhang, Rui and Zhang, Ruochen and Gehrmann, Sebastian and Mirkin, Shachar and Pais, S. Osher and Shavrina, Tatiana and Scialom, Thomas and Yun, Tian and Limisiewicz, Tomasz and Rieser, Verena and Protasov, Vitaly and Mikhailov, Vladislav and Pruksachatkun, Yada and Belinkov, Yonatan and Bamberger, Zachary and Kasner, Zdenvek and Kasner, Zdeněk and Pestana, Amanda and Feizpour, Amir and Khan, Ammar and Faranak, Amy and Santos, Ananda Santa Rosa and Hevia, Anthony and Unldreaj, Antigona and Aghagol, Arash and Abdollahi, Arezoo and Tammour, Aycha and HajiHosseini, Azadeh and Behroozi, Bahareh and Ajibade, Benjamin Ayoade and Saxena, Bharat Kumar and Ferrandis, Carlos Mu{\~n}oz and Contractor, Danish and Lansky, David M. and David, Davis and Kiela, Douwe and Nguyen, Duong Anh and Tan, Edward and Baylor, Emi and Ozoani, Ezinwanne and Mirza, Fatim Tahirah and Ononiwu, Frankline and Rezanejad, Habib and Jones, H.A. and Bhattacharya, Indrani and Solaiman, Irene and Sedenko, Irina and Nejadgholi, Isar and Passmore, Jan and Seltzer, Joshua and Sanz, Julio Bonis and Fort, Karen and Dutra, L{\'i}via and Samagaio, Mairon and Elbadri, Maraim and Mieskes, Margot and Gerchick, Marissa and Akinlolu, Martha and McKenna, Michael and Qiu, Mike and Ghauri, Muhammed and Burynok, Mykola and Abrar, Nafis and Rajani, Nazneen and Elkott, Nour and Fahmy, Nourhan and Samuel, Olanrewaju and An, Ran and Kromann, R. P. and Hao, Ryan and Alizadeh, Samira and Shubber, Sarmad and Wang, Silas L. and Roy, Sourav and Viguier, Sylvain and Le, Thanh-Cong and Oyebade, Tobi and Le, Trieu Nguyen Hai and Yang, Yoyo and Nguyen, Zach and Kashyap, Abhinav Ramesh and Palasciano, Alfredo and Callahan, Alison and Shukla, Anima and Miranda-Escalada, Antonio and Singh, Ayush Kumar and Beilharz, Benjamin and Wang, Bo and de Brito, Caio Matheus Fonseca and Zhou, Chenxi and Jain, Chirag and Xu, Chuxin and Fourrier, Cl{\'e}mentine and Perin'an, Daniel Le'on and Molano, Daniel and Yu, Dian and Manjavacas, Enrique and Barth, Fabio and Fuhrimann, Florian and Altay, Gabriel and Bayrak, Giyaseddin and Burns, Gully and Vrabec, Helena U. and Bello, Iman I.B. 
and Dash, Isha and Kang, Ji Soo and Giorgi, John and Golde, Jonas and Posada, Jose David and Sivaraman, Karthi and Bulchandani, Lokesh and Liu, Lu and Shinzato, Luisa and de Bykhovetz, Madeleine Hahn and Takeuchi, Maiko and P{\`a}mies, Marc and Castillo, Mar{\'i}a Andrea and Nezhurina, Marianna and Sanger, Mario and Samwald, Matthias and Cullan, Michael and Weinberg, Michael and Wolf, M and Mihaljcic, Mina and Liu, Minna and Freidank, Moritz and Kang, Myungsun and Seelam, Natasha and Dahlberg, Nathan and Broad, Nicholas Michio and Muellner, Nikolaus and Fung, Pascale and Haller, Patricia and Haller, Patrick and Eisenberg, Renata and Martin, Robert and Canalli, Rodrigo and Su, Rosaline and Su, Ruisi and Cahyawijaya, Samuel and Garda, Samuele and Deshmukh, Shlok S and Mishra, Shubhanshu and Kiblawi, Sid and Ott, Simon and Sang-aroonsiri, Sinee and Kumar, Srishti and Schweter, Stefan and Bharati, Sushil Pratap and Laud, Tanmay and Gigant, Th{\'e}o and Kainuma, Tomoya and Kusa, Wojciech and Labrak, Yanis and Bajaj, Yashasvi and Venkatraman, Y. and Xu, Yifan and Xu, Ying and Xu, Yu and Tan, Zhee Xao and Xie, Zhongli and Ye, Zifan and Bras, Mathilde and Belkada, Younes and Wolf, Thomas}, journal = {ArXiv}, year = {2022}, volume = {abs/2211.05100}, url = {https://api.semanticscholar.org/CorpusID:253420279}, paper_link = {https://arxiv.org/abs/2211.05100/}, }
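The released checkpoints are public on the Hugging Face hub under the bigscience organization; a small variant keeps this usage sketch runnable on modest hardware (the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# bigscience/bloom is the full 176B model; the 560M variant shares its
# tokenizer and architecture and fits on a single GPU or CPU.
tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
inputs = tok("Le modèle BLOOM est", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0]))
```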
- [FAccT] Data Governance in the Age of Large-Scale Data-Driven Language Technology. Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Isaac Johnson, Gerard Dupont, Jesse Dodge, and 8 more authors. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
@inproceedings{10.1145/3531146.3534637, author = {Jernite, Yacine and Nguyen, Huu and Biderman, Stella and Rogers, Anna and Masoud, Maraim and Danchev, Valentin and Tan, Samson and Luccioni, Alexandra Sasha and Subramani, Nishant and Johnson, Isaac and Dupont, Gerard and Dodge, Jesse and Lo, Kyle and Talat, Zeerak and Radev, Dragomir and Gokaslan, Aaron and Nikpoor, Somaieh and Henderson, Peter and Bommasani, Rishi and Mitchell, Margaret}, title = {Data Governance in the Age of Large-Scale Data-Driven Language Technology}, year = {2022}, isbn = {9781450393522}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3531146.3534637}, booktitle = {Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency}, series = {FAccT '22}, paper_link = {https://arxiv.org/abs/2206.03216/}, }
- [TACL] Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, and 40 more authors. Transactions of the Association for Computational Linguistics, 2022.
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
@article{kreutzer-etal-2022-quality, title = {Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets}, author = {Kreutzer, Julia and Caswell, Isaac and Wang, Lisa and Wahab, Ahsan and van Esch, Daan and Ulzii-Orshikh, Nasanbayar and Tapo, Allahsera and Subramani, Nishant and Sokolov, Artem and Sikasote, Claytone and Setyawan, Monang and Sarin, Supheakmungkol and Samb, Sokhar and Sagot, Beno{\\i}t and Rivera, Clara and Rios, Annette and Papadimitriou, Isabel and Osei, Salomey and Suarez, Pedro Ortiz and Orife, Iroro and Ogueji, Kelechi and Rubungo, Andre Niyongabo and Nguyen, Toan Q. and M{\"u}ller, Mathias and M{\"u}ller, Andr{\'e} and Muhammad, Shamsuddeen Hassan and Muhammad, Nanda and Mnyakeni, Ayanda and Mirzakhalov, Jamshidbek and Matangira, Tapiwanashe and Leong, Colin and Lawson, Nze and Kudugunta, Sneha and Jernite, Yacine and Jenny, Mathias and Firat, Orhan and Dossou, Bonaventure F. P. and Dlamini, Sakhile and de Silva, Nisansa and {\c{C}}abuk Ball{\i}, Sakine and Biderman, Stella and Battisti, Alessia and Baruwa, Ahmed and Bapna, Ankur and Baljekar, Pallavi and Azime, Israel Abebe and Awokoya, Ayodele and Ataman, Duygu and Ahia, Orevaoghene and Ahia, Oghenefego and Agrawal, Sweta and Adeyemi, Mofetoluwa}, editor = {Roark, Brian and Nenkova, Ani}, journal = {Transactions of the Association for Computational Linguistics}, volume = {10}, year = {2022}, address = {Cambridge, MA}, publisher = {MIT Press}, url = {https://aclanthology.org/2022.tacl-1.4}, paper_link = {https://arxiv.org/abs/2103.12028/}, }
2021
- [GEM @ ACL] The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, and 44 more authors. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), Aug 2021.
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.
@inproceedings{gehrmann-etal-2021-gem, title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and Metrics}, author = {Gehrmann, Sebastian and Adewumi, Tosin and Aggarwal, Karmanya and Ammanamanchi, Pawan Sasanka and Aremu, Anuoluwapo and Bosselut, Antoine and Chandu, Khyathi Raghavi and Clinciu, Miruna-Adriana and Das, Dipanjan and Dhole, Kaustubh and Du, Wanyu and Durmus, Esin and Du{\v{s}}ek, Ond{\v{r}}ej and Emezue, Chris Chinenye and Gangal, Varun and Garbacea, Cristina and Hashimoto, Tatsunori and Hou, Yufang and Jernite, Yacine and Jhamtani, Harsh and Ji, Yangfeng and Jolly, Shailza and Kale, Mihir and Kumar, Dhruv and Ladhak, Faisal and Madaan, Aman and Maddela, Mounica and Mahajan, Khyati and Mahamood, Saad and Majumder, Bodhisattwa Prasad and Martins, Pedro Henrique and McMillan-Major, Angelina and Mille, Simon and van Miltenburg, Emiel and Nadeem, Moin and Narayan, Shashi and Nikolaev, Vitaly and Niyongabo Rubungo, Andre and Osei, Salomey and Parikh, Ankur and Perez-Beltrachini, Laura and Rao, Niranjan Ramesh and Raunak, Vikas and Rodriguez, Juan Diego and Santhanam, Sashank and Sedoc, Jo{\~a}o and Sellam, Thibault and Shaikh, Samira and Shimorina, Anastasia and Sobrevilla Cabezudo, Marco Antonio and Strobelt, Hendrik and Subramani, Nishant and Xu, Wei and Yang, Diyi and Yerukola, Akhila and Zhou, Jiawei}, editor = {Bosselut, Antoine and Durmus, Esin and Gangal, Varun Prashant and Gehrmann, Sebastian and Jernite, Yacine and Perez-Beltrachini, Laura and Shaikh, Samira and Xu, Wei}, booktitle = {Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)}, month = aug, year = {2021}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.gem-1.10}, pages = {96--120}, paper_link = {https://arxiv.org/abs/2102.01672/}, }
- [DataCentricAI @ NeurIPS] Natural Adversarial Objects. Felix Lau, Nishant Subramani, Sasha Harrison, Aerin Kim, Elliot Branson, and Rosanne Liu. ArXiv, 2021.
Although state-of-the-art object detection methods have shown compelling performance, models often are not robust to adversarial attacks and out-of-distribution data. We introduce a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, but cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet-D7 drops 74.5% when evaluated on NAO compared to the standard MSCOCO validation set. Moreover, by comparing a variety of object detection architectures, we find that better performance on MSCOCO validation set does not necessarily translate to better performance on NAO, suggesting that robustness cannot be simply achieved by training a more accurate model. We further investigate why examples in NAO are difficult to detect and classify. Experiments of shuffling image patches reveal that models are overly sensitive to local texture. Additionally, using integrated gradients and background replacement, we find that the detection model is reliant on pixel information within the bounding box, and insensitive to the background context when predicting class labels. NAO can be downloaded at https://drive.google.com/drive/folders/15P8sOWoJku6SSEiHLEts86ORfytGezi8.
@article{Lau2021NaturalAO, title = {Natural Adversarial Objects}, author = {Lau, Felix and Subramani, Nishant and Harrison, Sasha and Kim, Aerin and Branson, Elliot and Liu, Rosanne}, journal = {ArXiv}, year = {2021}, volume = {abs/2111.04204}, url = {https://api.semanticscholar.org/CorpusID:243848218}, paper_link = {https://arxiv.org/abs/2111.04204/}, }
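A minimal sketch of the evaluation protocol implied above, assuming NAO's annotations are exported in COCO format (the file names and detections file are hypothetical placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("nao_annotations.json")         # hypothetical local export of NAO
dt = gt.loadRes("model_detections.json")  # standard COCO results format
ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # reports mAP; compare against MSCOCO val to see the drop
```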
2020
- [MLRSA @ NeurIPS] A Survey of Deep Learning Approaches for OCR and Document Understanding. Nishant Subramani, Alexandre Matton, Malcolm Greaves, and Adrian Lam. ArXiv, 2020.
Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. In this survey paper, we review different techniques for document understanding for documents written in English and consolidate methodologies present in literature to act as a jumping-off point for researchers exploring this area.
@article{Subramani2020ASO, title = {A Survey of Deep Learning Approaches for OCR and Document Understanding}, author = {Subramani, Nishant and Matton, Alexandre and Greaves, Malcolm and Lam, Adrian}, journal = {ArXiv}, year = {2020}, volume = {abs/2011.13534}, url = {https://api.semanticscholar.org/CorpusID:227209404}, paper_link = {https://arxiv.org/abs/2011.13534/}, }
- [arXiv] Discovering Useful Sentence Representations from Large Pretrained Language Models. Nishant Subramani and Nivedita Suresh. ArXiv, 2020.
Despite the extensive success of pretrained language models as encoders for building NLP systems, they haven’t seen prominence as decoders for sequence generation tasks. We explore the question of whether these models can be adapted to be used as universal decoders. To be considered "universal," a decoder must have an implicit representation for any target sentence s, such that it can recover that sentence exactly when conditioned on its representation. For large transformer-based language models trained on vast amounts of English text, we investigate whether such representations can be easily discovered using standard optimization methods. We present and compare three representation injection techniques for transformer-based models and three accompanying methods which map sentences to and from this representation space. Experiments show not only that representations exist for sentences from a variety of genres, but also that, without complex optimization algorithms, our methods recover these sentences almost perfectly without fine-tuning the underlying language model at all.
@article{Subramani2020DiscoveringUS, title = {Discovering Useful Sentence Representations from Large Pretrained Language Models}, author = {Subramani, Nishant and Suresh, Nivedita}, journal = {ArXiv}, year = {2020}, volume = {abs/2008.09049}, url = {https://api.semanticscholar.org/CorpusID:221186910}, paper_link = {https://arxiv.org/abs/2008.09049/}, }
- [AAAI] Learning Efficient Representations for Fake Speech Detection. Nishant Subramani and Delip Rao. In AAAI Conference on Artificial Intelligence, 2020.
Synthetic speech or “fake speech” which matches personal vocal traits has become better and cheaper due to advances in deep learning-based speech synthesis and voice conversion approaches. This increased accessibility of synthetic speech systems and the growing misuse of them highlights the critical need to build countermeasures. Furthermore, new synthesis models evolve all the time and the efficacy of previously trained detection models on these unseen attack vectors is poor. In this paper, we focus on: 1) How can we build highly accurate, yet parameter and sample-efficient models for fake speech detection? 2) How can we rapidly adapt detection models to new sources of fake speech? We present four parameter-efficient convolutional architectures for fake speech detection with best detection F1 scores of around 97 points on a large dataset of fake and bonafide speech. We show how the fake speech detection task naturally lends itself to a novel multi-task problem further improving F1 scores for a mere 0.5% increase in model parameters. Our multi-task setting also helps in data-sparse situations, commonplace in adversarial settings. We investigate an alternative approach to the data-sparsity problem using transfer learning and show that it is possible to meet purely supervised detection performance for unseen attack vectors with as little as 6.25% of the training data. This is the first known application of transfer learning in adversarial settings for speech. Finally, we show how well our transfer learning approach adapts in an instance-efficient way to new attack vectors using the Real-Time Voice Cloning toolkit. We exceed the purely supervised detection performance (99.18 F1) with as little as 6.25% of the data.
@inproceedings{Subramani2020LearningER, title = {Learning Efficient Representations for Fake Speech Detection}, author = {Subramani, Nishant and Rao, Delip}, booktitle = {AAAI Conference on Artificial Intelligence}, year = {2020}, url = {https://api.semanticscholar.org/CorpusID:213460309}, paper_link = {https://ojs.aaai.org/index.php/AAAI/article/view/6044/}, }
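A minimal sketch of a parameter-efficient convolutional detector of the kind the abstract describes, operating on log-mel spectrograms; the paper's four architectures differ, so this is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SmallSpoofCNN(nn.Module):
    """Tiny binary classifier over (batch, 1, n_mels, time) spectrograms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # pool away frequency and time
        )
        self.head = nn.Linear(32, 2)        # bonafide vs. fake

    def forward(self, x):
        return self.head(self.net(x).flatten(1))

model = SmallSpoofCNN()
print(sum(p.numel() for p in model.parameters()))  # roughly 5k parameters
```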
2019
- [NeurIPS] Can unconditional language models recover arbitrary sentences? Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. Advances in Neural Information Processing Systems, 2019.
Neural network-based generative language models like ELMo and BERT can work effectively as general purpose sentence encoders in text classification without further fine-tuning. Is it possible to adapt them in a similar way for use as general-purpose decoders? For this to be possible, it would need to be the case that for any target sentence of interest, there is some continuous representation that can be passed to the language model to cause it to reproduce that sentence. We set aside the difficult problem of designing an encoder that can produce such representations and, instead, ask directly whether such representations exist at all. To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space. We then investigate the conditions under which a language model can be made to generate a sentence through the identification of a point in such a space and find that it is possible to recover arbitrary sentences nearly perfectly with language models and representations of moderate size without modifying any model parameters.
@article{subramani2019can, title = {Can unconditional language models recover arbitrary sentences?}, author = {Subramani, Nishant and Bowman, Samuel and Cho, Kyunghyun}, journal = {Advances in Neural Information Processing Systems}, volume = {32}, year = {2019}, url = {https://proceedings.neurips.cc/paper_files/paper/2019/file/48c8c3963853fff20bd9e8bee9bd4c07-Paper.pdf}, paper_link = {https://arxiv.org/abs/1907.04944/}, }
2018
- [CausalML @ ICML] Pag2admg: An Algorithm for the Complete Causal Enumeration of a Markov Equivalence Class. Nishant Subramani. International Conference on Machine Learning CausalML Workshop, 2018.
@article{subramani2018pag2admg, title = {Pag2admg: An Algorithm for the Complete Causal Enumeration of a Markov Equivalence Class}, author = {Subramani, Nishant}, journal = {International Conference on Machine Learning CausalML Workshop}, year = {2018}, url = {https://arxiv.org/abs/1612.00099/}, paper_link = {https://arxiv.org/abs/1612.00099/}, }
2017
- [AAAI] PAG2ADMG: A Novel Methodology to Enumerate Causal Graph Structures. Nishant Subramani and Doug Downey. Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
@article{Subramani_Downey_2017, title = {PAG2ADMG: A Novel Methodology to Enumerate Causal Graph Structures}, author = {Subramani, Nishant and Downey, Doug}, volume = {31}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/11121}, number = {1}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, year = {2017}, paper_link = {https://ojs.aaai.org/index.php/AAAI/article/view/11121}, }