Datasets of language data

Many language corpora are available to researchers. Each one provides details on its levels of access. Many of the corpora also have associated corpus tools for analysis.

A collection of hundreds of datasets of spoken language in over 34 languages. Includes adult and child language, conversation and more formal genres, and multilingual interactions. All transcribed. Carnegie Mellon University.

The Language Archive. An archive of audio and video spoken language from around the world, including many languages with small numbers of speakers. Includes naturalistic and elicited production recordings. Max Planck Institute for Psycholinguistics in Nijmegen.

Endangered Languages Archive. Contains audio and video recordings of many endangered languages around the world. SOAS University of London.

Spoken language corpora from languages around the world. Often transcribed and with English glosses. ARC Centre of Excellence for the Dynamics of Language. 

An archive of spoken language from many small languages around the world. Includes digitizations of data recorded in analog formats. Over 1,200 languages represented.

Scripture resources in Australian Indigenous languages, including full texts of Bibles. 

https://www.english-corpora.org

Spoken and written corpora in English in many different genres. Includes American and British varieties of English. 

https://www.corpusdelespanol.org

Corpus of historical and contemporary Spanish in many genres as spoken in many different countries.

https://slaap.chass.ncsu.edu

An interactive web-based archive of sociolinguistic recordings, with integrated media playing and annotation features, plus other corpus tools. North Carolina State University.

https://buckeyecorpus.osu.edu

Recordings of 40 speakers in Columbus, Ohio, conversing freely with an interviewer. Transcribed.

Contains 500 ready-to-use written text corpora in over 90 languages. Includes corpus tools.

https://multicast.aspra.uni-bamberg.de

A collection of annotated texts from a typologically diverse array of languages. Time-aligned annotations with audio recordings, in formats suitable for cross-corpus typological research.

http://doreco.info/

Collection of spoken language corpora from about 50 languages, extracted from documentations of small and often endangered languages. Transcribed with time-aligned annotations.