Corpus

The SpeechReporting corpus

The SpeechReporting corpus contains corpora of traditional folk stories, annotated for a number of discourse phenomena using the ELAN-CorpA software and tools (Chanard 2015; Nikitina et al. 2019). It is updated regularly with newly available data, including data from new languages. All texts are transcribed, glossed, translated, and annotated.

The corpus was developed as part of the project “Discourse Reporting in African Storytelling”, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 758232, PI Tatiana Nikitina). In addition to standard glosses and part of speech tags, it is annotated for instances of reported discourse; see our Annotation Guide.

Languages and their corpora

Bashkir
Bandial (Jóola Eegimaa)
Chuvash
Ginyanga
Gizey
Guro
Kafire
Macedonian
Mwan
Udihe
U̱t-Ma’in
Wan

A narrative corpus of Bashkir

Language and storytelling tradition

Bashkir is a Turkic language spoken in Bashkortostan (Russia) by approximately 1,2 million people. Nowadays, the traditional way of storytelling no longer exists, and people do not gather to tell stories. However, there are people who still remember tales that were told to them by older people in their childhood. Moreover, a lot of tales are published and people might retell these written tales.

Corpus composition

The current version of the corpus contains about two hours of texts (10.500 words) and corresponding video files. 1 h 20 minutes were collected by Ekaterina Aplonova during fieldwork in 2019; the rest comes from the Spoken Corpus of Bashkir (without video). The texts were recorded in the following villages: Abzanovo, Aslayevo, Baimovo, Kyzgy, Mullakaevo, Rakhmetovo, Tuishevo, and Usakly.

Orthography

The texts are transcribed in the Latin script with some extensions, unlike standard Bashkir orthography, which is based on the Cyrillic script. The system employed here is mainly a transliteration of the standard Bashkir orthography, but it has some features of phonological transcription. Correspondences between the standard orthography and the transliteration system employed here are listed on the Spoken Corpus of Bashkir website.

List of parts of speech

adj – adjective
adv – adverb
conj – conjunction
cop – copula
intj – interjection
n – noun

n.prop – proper noun
nsuf – noun suffix
num – numeral
numsuf – suffix of numerals
onomat – onomatopoeia
post – postposition

pron – pronoun
part – particle
suf – suffix
v – verb
vsuf – suffix of verbs
word – unclassified

List of glosses

A.CV – converb (-a)
ABL – ablative
ACC – accusative
ADJ – adjectivizer
ADV – adverbializer
AFF – affective
AG – agent nominalization
B.CV – converb (-p)
CAUS – causative
COND – conditional
CV.ANT – anterior converb
CV.TERM – terminative converb
DAT – dative
FUT – future
GEN – genitive

HORT – hortative
IMP – imperative
IMP.EMPH – emphatic imperative
IPFV – imperfective
JUSS – jussive
LOC – locative
NEG – negative suffix
NEG.CV.ATT – negative form of converb
NEG.POT – negative form of the potential
NMLZ – nominalization
NUM.SUBST – substantivizer of numeral
ORD – ordinal numeral
P.1PL – possessive suffix of 1PL
P.1SG – possessive suffix of 1SG
P.2PL – possessive suffix of 2PL

P.2SG – possessive suffix of 2SG
P.3 – possessive suffix of 3 person
PASS – passive
PC.PST – past participle
PL – plural
PLPF – pluperfect
POSS.SUBST – substantivizer of possessor
POT – potential
PST – past tense
PTCP.FUT – future participle
Q – question marker
RECP – reciprocal
REFL – reflexive
\RUS – Russian borrowing

Acknowledgements

I would like to thank my Bashkir language assistants who told me their stories and Fizaliya Makhanova for her help with transcription, as well as Lilya Buskunbaeva and Ramilya Karimova from Ufa Institute of Linguistics for their precious help during data collection and analysis.

Citing the sub-corpus:
Aplonova, Ekaterina. 2021. A narrative corpus of Bashkir. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Bandial (Jóola Eegimaa)

Language and storytelling tradition

Jóola Eegimaa is an Atlantic language spoken in Senegal by approximately 15.700 people. The practice of storytelling is largely extinct in Senegal; even in remote areas, the only people capable of telling stories are 'rememberers' of their grandparents' generation. Among the Jóola, many tales are likely borrowed from other linguistic-ethnic groups such as the Mandinka.

Corpus composition

The current version of the corpus contains about two hours of texts (about 12.000 words) and corresponding video files. The data was collected during a three month period in 2018-2019 by Abbie Hantgan-Sonko in the area known as The Kingdom in Casamance.

Orthography

The transcription is made according to the official Jóola (otherwise written as Jola or Diola) orthography which indicates long vowels by doubling the letter: aa ee ii oo uu , and advanced tongue root ([+ATR]) by an acute accent over the first vowel of a stem: á é í ó ú . Retracted or [-ATR] roots are unmarked.

List part of speech tags and glosses (adapted from Sagna 2008: 18)

adj – adjective
adp – adposition
adv – adverb
aux – auxiliary
comp – complementizer
coordconn – coordinative conjunction
cop – copula
def – definite article
dem – demonstrative
emph – emphatic

idph – ideophone
indf – indefinite article
intj – interjection
n – noun
n: – nominal morpheme
n>v – noun to verb conversion
nprop – proper noun
num – numeral
part – particle
post – postposition

pn – proper noun
pref – prefix
prep – preposition
pro – pronoun
q – question mark
rel – relative marker
subordconn – subordinating conjunction
v – verb
v: – verbal morpheme
v>n – verb to noun conversion

ABSTR – abstract
AGT – agentive
ASSOC – associative
CAUS – causative
CD(3, 6, 13) – concord/ agreement marker
CENT – centrifugal
COMP – complementizer
DEF – definite
DEM – demonstrative
DEM.PROX – proximal demonstrative
DEP – dependent
DIR – directional
DO – direct object
EMPH – emphatic
EQUAT – equative
EXCL – exclusive

\FR – French borrowing
FUT – future
GEN – genitive
GER – gerund
HAB.NEG – habitual negative
IMP.NEG – imperative Negative
INACT – inactualis
INCL – inclusive
INDF – indefinite
INF – infinitive
LOC – location marker
MED – medial
MID – middle voice
NC(1-14) – noun class marker
NEG – negation
NEG.COP – negative Copula

NEG.FUT – negative Future
NMLZ – nominalizer
PFV – perfective
PL – plural
PLUR – pluractional
POSS – possessive
PRO – pronoun
PROH – prohibitive
QUANT – quantifier
REFL – reflexive
REL – relative
REV – reversive
SG – singular
SBJ – subject

Citing the Jóola Eegimaa corpus:
Hantgan-Sonko, Abbie. 2021. A narrative corpus of Jóola Eegimaa. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Chuvash

Language

Chuvash is a Turkic language spoken in Russia by approximately 1 million people

Corpus composition

The current version of the corpus contains examples of traditional storytelling recorded by A.K. Salmin in the 1980s (the original recordings are available in the archival collections of the Chuvash State Institute of Humanities) and folktales recorded by Elena Fedotova during her fieldwork in the summer of 2019. The total length of the currently available portion of the corpus is about 5 hours; the total number of words is about 25.000. The corpus is under construction, with new texts due to become available soon.

List of parts of speech

adj – adjective
adv – adverb
cnj – conjunction
idph – ideophone
intj – interjection

n – noun
np – proper noun
num – numeral
onomat – onomatopoeia
pp – postposition

pro – pronoun
prt – particle
suff – suffix
v – verb
vn – non-verbal predicate

List of glosses

ABL - ablative
ACC/DAT - accusative/dative
ADVBZ - adverbializer
ANAPH - anaphoric marker
ANTR - anterior
APPR.ALL - approximative allative
CAR - caritive
CAUS - causative
CMPR - comparative
COLL - collective
COND - conditional
CSL - causal
CV - converb
CV_ANT - anterior converb
CV_COORD - coordinate converb
CV_POST - posterior converb
DEST - destinative
\DIAL - dialectal
DISTR - distributive
EMPH - emphatic
EX - existential
EX.NEG - negative existential

GEN - genitive
HORT - hortative
IDPH - ideophone
IMP - imperative
IMPF - imperfective
INF - infinitive
INF2 - second infinitive
INSTR - instrumental
INTJ - interjection
INTENS - intensifier
INTS - intensifying particle
ITER - iterative
JUSS - jussive
LOC - locative
NEG - negative
NEG.PRS - negative present
NMLZ - nominalizer
OBL - oblique
ONOMAT - onomatopoeia
ORD - ordinal
PC_DEBT - debitative participle
PC_FUT - future participle

PC_PRS - present participle
PC_PST - past participle
PL - plural
POSS - possessive
POT - potential
PROH - prohibitive
PROPR - property marker
PRS - present
PRT - particle
PST - past
Q - question marker
RECIPR - reciprocal
REFL - reflexive
REL.POSS - relational possessor
RETR - retrospective
\RUS - Russian borrowing or codeswitching with Russian
SG - singular
SUBST - substantivizer
TEMP_ATTR - marker of temporal attribute
VBLZ - verbalizer

Citing the Chuvash corpus:
Nikitina, Tatiana. 2022. A narrative corpus of Chuvash. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Ginyanga

Language and storytelling tradition

Ginyanga is a North Guang language belonging to the Kwa language family spoken by a population of 16500 people in Togo and Ghana. Stories are still occasionally told during night gatherings, but they are less and less frequent. However, even young people generally know a couple of tales.

Corpus composition

The data were collected in 2019-2023 by Katya Aplonova and Viktoria Pozdnyakova in Agbandi and Diguina in Blitta prefecture in Togo. The data were transcribed and glossed with a precious help of our language assistants: Yao Amédomé, Evégénou Tchala, Emmanuel Kontré and Atchou Edoh. The available part of the corpus corresponds to approximately 1h.

List of parts of speech

adj - adjective
adv - adverb
aux - auxiliary
conj - conjunction
det - determinative

idph - ideophone
intj - interjection
n - noun
num - numeral
part - particle

pp - preposition
pron - pronoun
refr - refrain
v - verb

List of glosses

ADJ - adjectivizer
AG - agent suffix
ART - article
ASS.PL - assosiative plural
CL1-8 - noun class prefix
CMPL - complementizer (also a quotative marker)
COND - conditional marker
COP - copula

CPL - completive
DIM - diminutive
DISC - discursive marker
FOC - focalizator
FUT - future
IPFV - imperfective
LOG - logophoric pronoun/prefix
NEG - negation

PFV - perfective
Q - question particle
RECP - reciprocal pronoun
SUB - subordinator
\ASH - Ashanti borrowing
\ENG - English borrowing
\EWE - Ewe borrowing

Citing the Ginyanga corpus:
Aplonova, Ekaterina & Pozdniakova, Viktoria. 2023. A narrative corpus of Ginyanga. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Gizey

Language and storytelling tradition

Gizey is a Masa languoid (Chadic < Afro-Asiatic) spoken by approximately 19.000 people across Cameroon and Chad. Stories are still occasionally told during night gatherings. People generally know a couple of tales, which they modify freely to suit the storytelling context.

Corpus composition

The data were collected in 2019 by Guillaume Guitang and Dieudonné Soupoursou in Djougoumta and Maïda (Mayo-Danay, Cameroon). The data were partly transcribed by Victor Marheyna. In total, the corpus contains around 210 minutes. Only part of this corpus (around 30 minutes) is currently available.

List of parts of speech

adj - adjective
adv - adverb
afx - affix
conj - conjunction
cop - copula
dem - demonstrative

det - determiner
idm - idiom
idph - ideophone
interj - interjection
locu - locution
n - noun

num - numeral
poss - possessive
prep - preposition
pro - pronoun
prt - particle
v - verb

List of glosses

1plexcl - 1st plural exclusive
1plincl - 1st plural inclusive
1s - 1st singular
2pl - 2nd plural
2sf - 2nd singular feminine
2sm - 2nd singular masculine
3do - 3rd direct object
3pl - 3rd plural
3sf - 3rd singular feminine
3sm - 3rd singular masculine
abess - abessive
adv - adverb
anaphdem.pl - anaphoric demonstrative plural
anaphdem.sf - anaphoric demonstrative singular feminine
anaphdem.sm - anaphoric demonstrative singular masculine
compl - completive
conj - conjunction
cop - copula

def.sf - definite singular feminine
dem - demonstrative
dest - destinative
distr - distributive
exist - existential
exist.neg - negative existential
exist.pst - past existential
fv - final vowel
idm - idiom
idph - ideophone
indf.sf - indefinite singular feminine
indf.sm - indefinite singular masculine
interj - interjection
itv - itive
mod - modality
neg - negator
onom - onomatopoeia
our.incl - our.inclusive
p.mkr - pause marker

pl - plural
pl1 - plural 1
pl2 - plural 2
pro - pronoun
prt - particle
q - question word
quot1 - quotative 1
quot2 - quotative 2
rel.pl - relative plural
rel.sf - relative singular feminine
rel.sm - relative singular masculine
res - resultative
rev - reversative
seq - sequential
stat - static
your.pl - your.plural
your.sf - your.singular feminine
your.sm - your.singular masculine

Citing the Gizey corpus:
Guitang, Guillaume. 2022. A narrative corpus of Gizey. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Guro

Language

Guro is a South Mande language spoken in Ivory Coast by approximately 500.000 people. The practice of storytelling is endangered and undergoing some transformations. Even in remote areas it is largely abandoned as a family practice. Instead, in the later decades there emerged semi-professionalised groups performing traditional tales accompanied with instrumental music and singing. They are invited to weddings, funerals and similar occasions. In any kind of performance the main narrator is usually supported by a second narrator, who animates the storytelling by short interventions.

Corpus composition

The data was collected in 2019 by Olga Kuznetsova during two expeditions to the Zuenoula and Oume regions (Ivory Coast). Recordings from the Zuenoula region (over 10 hours) mostly represent group performances, and recordings from Oume region (about 2 hours) are performances in pairs (the main and a second narrators), sparsely accompanied by songs. Only parts of this collection are currently available; the corpus contains two tales from the Zuenoula region, around 45 minutes (around 6500 words) in total.

List of parts of speech

adj - adjective
adv - adverb
art - article
conj - conjunction
cop - copula
det - determiner

id - ideophone
intj - interjection
mrph - morpheme
n - noun
num - numeral
part - particle

pn - proper noun
pp - postposition
prep - preposition
pm - predicative marker
pron - pronoun
v - verb

List of glosses

-/ - hightening tonal morpheme
-\ - lowering tonal morpheme
\FR - in French
ADVZ - adverbializer
ART - article
COMP - complementizer
COMPL - completive
COP - copula
DEF - definite article
DIM - diminutive
DISTR - distributive numeral
EMPH - emphatic particle
EX - exclusive pronoun
FOC - focalized pronoun
GER - gerund
H - high tone (pronoun)
INC - inclusive pronoun

INDF - indefinite article
INTJ - interjection
IPFV - imperfective
JNT - joint pronoun
LOG - logophoric pronoun
NEG - negation
NMLZ - nominalization
NREF - non-referential
NSBJ - non-subject pronoun
OPT - optative
ORD - ordinal numeral
PCOP - presentative copula
PFV - perfective
PL - plural
PNEG - negative presentative copula
POSS - possessive
PREP - preposition

PROG - progressive
PROH - prohibitive
Q - question
QUOT - quotative
RECP - reciprocal pronoun
REFL - reflexive pronoun
REL - relativizer
RES - resultative
RESP - respective
RET - retrospective
SBJ - subject pronoun
SCOP - subordinated copula
SG - singular
SIM - simultaneity
SUP - supine
SURP - surprise marker
TOP - topic

Citing the Guro corpus:
Kuznetsova, Olga. 2022. A narrative corpus of Guro. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Kafire

Language

Kafire is a Senufo language (Gur, Niger-Congo) spoken in Northern Côte d’Ivoire. Its speakers live in an area called ‘Kafigue’ which comprises the subprefectures of Sirasso, Nafoun and Kanoroba in the department of Korhogo. Kafire, like other Senufo languages, has noun classes / genders and it has a S(ubject)-Aux(iliary)-O(bject)-V(erb)-X (obliques, adverbials) word order. Verbs are marked for aspect (perfective vs imperfective) which is expressed through preverbal auxiliaries and on the verb as well.

Traditional narratives

In the community of Kafire speakers, i.e., the Kafibele, the narration activity is practiced by everyone (women, men, children…). It is learned during narration sessions which are held only at night (during the farm-keeping activity and around the fire in the village). Traditional narration requires at least two participants: a narrator (who narrates) and a responder (who assists him). But those roles are most of the time shifted during the session since people take turns narrating. Nowadays, the narration practice is endangered because people have changed their agricultural habits (they cultivate more products that do not require the farm-keeping activity) and children go to school (they do not gather with their elders to learn narration).

Corpus composition

This corpus is a portion of the data collected by Silué Songfolo Lacina from 2019 to 2021 in the community of Kafibele (Côte d’Ivoire, in the subprefectures of Sirasso, Nafoun and Kanoroba). The length of the total corpus is more than 10 hours. But the current annotated portion is 1h30 and it contains what the Kafibele narrate during a storytelling session, namely riddles, dilemma tales and tales. More annotated data will be uploaded progressively. In the ELAN files, morphemes are represented in two forms: the surface form (at tier mb) and the underlying form (at tier cf).

List of parts of speech

adj – adjective
adp – adposition
adv – adverb
art – article
aux – auxiliary
conj – conjunction
ideo – ideophone
inf – infinitive marker

intj – interjection
n – noun
num – numeral
onom – onomatopoeia
pref – prefix
post – postposition
pn – proper noun

pro – pronoun
prt – particle
q – question marker
quantifier – quantifier
stat – stative verb
suf – suffix
v – verb

List of glosses

1PL - 1st person plural
1SG - 1st person singular
2PL - 2nd person plural
2SG - 2nd person singular
3PL - 3rd person plural
3SG - 3rd person singular
ABESS - abessive
ADVS - adversative
AFF - affirmative marker
ARCH - archaic form
APPEL - appellative
ASSOC - associative marker
CAUS - causative
CIPRT - clause initial particle
CFPRT - clause final particle
CON - consecutive connective
COND - conditional marker
COP - copula
DEF - definite
DEM - demonstrative

DIST - distant demonstrative
DISTR - distributive
DM - discourse marker
\DY - Dyula origin
EMPH - emphatic
EXCL - exclamation marker
EXP - locution
\FR - French origin
FUT - future marker
(G)1 - gender 1
(G)2 - gender 2
(G)3 - gender 3
(G)4 - gender 4
(G)5 - gender 5
IDEO - ideophone
IDEN - identificational marker
IMPO - impossible
INCIT - incitative
INDF - indefinite
IPFV - imperfective

-MID - middle voice suffix
NPRS - non-present
NEG - negation marker
OBLIG – obligation marker
ONOM - onomatopoeia
PFV – perfective
POSS – possessive
PRES – presentative
PRF – perfect
PROG – progressive
PROH – prohibitive
PRS – present
PST – past
-RECP – reciprocal suffix
-REFL – reflexive suffix
REL – relative clause marker
SBJV – subjunctive
SIM – similative marker
VOC – vocative

Citing the Kafire corpus:
Silué, Songfolo Lacina. 2022. A narrative corpus of Kafire. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Macedonian

Language

Macedonian is a South Slavic language spoken in North Macedonia, adjacent territories in Greece and Bulgaria, and diaspora communities, the largest of which are in Australia, the United States and Canada. The total number of speakers is about 2 million. Macedonian dialects are standardly divided into three major groups: Western, Northern, and Southeastern.

Storytelling tradition

Traditional storytelling is no longer practiced in modern Macedonian society, but the stories are remembered by some people. Furthermore, many Macedonian folktales have been written down and published (most notably by Marko Cepenkov). Folktales have also been dramatized and televised and some are currently available on YouTube. Archived recordings and texts can also be found at the Marko Cepenkov Institute of Folklore and at the Macedonian Academy of Arts and Sciences in Skopje. Some of this archival material is digitalised and uploaded on YouTube by Dimitrije Buzarovski.

Corpus composition

The current version of the corpus contains 26 fairytales in either audio-only or audio and video format, recorded in 2016 and 2021 over a combined period of four weeks by Izabela Jordanoska. The total length comprises about 2 hours and it contains about 15.000 words.

The majority of data in this corpus is from the West Central dialect group and mostly recorded in the capital, Skopje, with speakers from Skopje itself and from: Veles, Sekirci (a village near Prilep) and Kičevo. Some recordings were also made in Prilep with local speakers. Only one fairytale is from a different dialectal group, the Northern group (recorded in Skopje with a speaker from Kumanovo), which contains major lexical, phonological and morphosyntactic differences from the Western Central dialects. Lexical differences are marked in the glosses. Recordings were also made in Štip, but are to date not annotated.

Most speakers were in their eighties at the time of recording.

Orthography

Standard Macedonian is codified in the Cyrillic script. The corpus contains both Cyrillic transcriptions and their corresponding romanization. For interlinearized words, morphemes and glosses, only Latin script is used.

List of parts of speech

adj – adjective
adv – adverb
art – article
cnj – conjunction
case – case
comp – complementizer
det – determiner
dem – demonstrative

idph – ideophone
intj – interjection
n – noun
np – proper noun
num – numeral
onom – onomatopoeia
pref – prefix

prep – preposition
prn – pronoun
prt – particle
quant – quantifier
suff – suffix
v – verb
wh – constituent question word

List of glosses

1, 2, 3 - first, second, third person
ACC - accusative
ADJZ - adjectivizer
AG - agentive nominalizer
AOR - aorist
AUG - augmentative
CMPL - completive Aktionsart
CMPR - comparative
COLL - collective number
DAT - dative
DEF - definite article
DEL - delimitative Aktionsart
DEM - demonstrative
DEP - dependent
DIAL - dialectal word (from Kumanovo dialect)
DIM - diminutive
DIST - distributive Aktionsart
DMN - demonymic nominalizer

DS - distal
EMPH - emphatic pronoun
EXCS - excessive
F - feminine
FOC.Q - focus particle within questions
FUT - future particle
GER - gerund
\GER - word borrowed from German
IMP - imperative
IMPF - imperfect
INCH - inchoative Aktionsart
INTS - intensive particle or Aktionsart
IPFV - imperfective
LPTCP - l-participle
M - masculine
N - neuter
NMLZ - nominalizer
NOM - nominative

OBL - oblique
OPT - optative particle
POSS - possessive
PL - plural number
PRS - present
PTCP - participle
PROH - prohibitive particle
PX – proximal
Q.TAG - question tag
REFL – reflexive pronoun
\SER – word borrowed from Serbian
SBJ – subjunctive particle
SG – singular number
SUP – superlative
TERM – terminative Aktionsart
\TUR – word borrowed from or through Ottoman Turkish
VOC – vocative

Citing the Macedonian corpus:
Jordanoska, Izabela. 2023. A narrative corpus of Macedonian. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Mwan

Language

Mwan, ISO 639-3 {moa}, Glottocode mwan1250 (South Mande ‹ Mande ‹ Niger Congo) is a minor language of the republic of Côte d'Ivoire (West Africa). It is spoken in a small compact area in the central part of the country (Congasso and Counahiri sub-prefectures), and also in urban centers (mainly, Abidjan, Bouaflé).

The Mwan (also called Mona by the Jula) live in about 20 villages and number more than 20.000 people (my estimation based on the data published in Ethnologue-14 which estimated the number of Mwan speakers as 17.000 people in 1993). Most of the Mwan speak Jula, those living in the cities speak French. A small number of the Mwan are monolinguals.

Typologically, Mwan displays many properties characteristic for isolating languages: tones, tendency to single metric foot words, a fuzzy border between inflection and derivation, as well as between compound words and combinations of free words, and so on. There are three level tones in Mwan. A significant part of the inflectional morphology is tonal. Mwan grammar uses mostly analytical technics, however, significant elements of the agglutination are also present. There are instances of fusion.

Mwan has no official status. It is not used at school either as a subject or as a language of instruction. It is used partially in the church service, in communication with the family and neighbors. In other areas of life, French is used (school, administration) or Jula (trade, market). In the villages, Mwan is not endangered.

Corpus composition

The present corpus of interlinearized Mwan texts is based on a collection of texts which were gathered in the period 2004-2020 by Elena Perekhvalskaya. The texts in the corpus belong to the following genres: 1) oral transcribed texts (folk tales; tales of witchcraft; oral history; dialogues); 2) texts taken from published sources (folk tales; funny stories; proverbs; translations from French). Oral texts, especially dialogues, contain a large amount of incomplete sentences, hesitation pauses, discursive markers and loans from other languages, mainly, from French and Jula.

List of parts of speech

adj – adjective
adv – adverb
art – article
conj – conjunction
cop – copula
det – determiner
foc – focus marker

ideof – ideophone
intrj – interjection
loc – locative
mrph – derivational or inflectional morpheme
n – noun
num – numeral
onomat – onomatopoeia

prep – preposition
pron – pronoun
part – particle
pstp – postposition
v – verb
word – word

List of glosses

1, 2, 3 - first, second, third person
ANAPH - anaphoric marker
ABSTR - abstract
ADMR - admirative
AGNT - agentive
ART - article
CONJ - conjunction
COP - copula
DIM - diminutive
EMPH - emphatic
EXCL - exclusive
FOC - focus particle

\FR - word borrowed from French
FUT - future
GER - gerund
HAB - habitual
IMP - imperative
INCL - inclusive
INSTR - instrumental
IRR - irrealis
NEG - negative
NSBJ - non-subject pronoun
PCOP - presentative copula
POSS - possessive

PL - plural number
PREF - prefix
PRF - perfect
PROG - progressive
PST - past
Q - question mark
QUOT - quotative
REFL - reflexive
REL - relativizer
SG - singular number
SPN - supine
SUF.ADJ - adjectival suffix

Citing the Mwan corpus:
Perekhvalskaya, Elena. 2023. A narrative corpus of Mwan. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Udihe

Language

Udihe is a Tungusic language spoken in the Russian Far East. The language is critically endangered; currently there are no more than 40 native speakers.

Corpus composition

The corpus contains written data that was collected in 1936 by E.N. Baskakova from the archives from the Peter the Great Museum of Anthropology and Ethnography (Kunstkamera, Russian Academy of Sciences). There is one text recorded in 2006 by the author of the corpus. The total number of tokens of the currently available corpus is about 9K words.

List of parts of speech

AD - actant derivation
adj - adjective
adv - adverb
akz - Aktionsart (verb derivation)
be - existential verb
case - case marker
conv - converbializer
cop - copula
dem - demonstrative
det - determinative
ideof - ideophone
intj - interjection

mood - suffix that encodes mood (verb derivation)
mrph - morpheme
n - noun
neg - negation (verb)
Nmprh - different nominal suffixes
num - numeral
onomat - onomatopoeia
part - particle
pers - personal pronoun
pers.n - proper noun
pl - plural suffix
poss - possessive suffix

postp - postposition
pp - past participle
pron - pronoun
refr - refrain
Q - question marker
quant - quantifier
TAM - tense-aspect-mood-marker
v - verb
Vmrph - different verbal morphemes (such as aspect markers)
Vneg - negative copula
voice - suffix that encodes voice (verb derivation)

List of glosses

ABL – ablative
ACC – accusative
ADM – admirative
ALIEN – alienable possession
ANDAT - andative
ASS – associative plural
ATTR – attributive
AUG – augmentative
CAUS – causative
CC.DS - conditional converb different subjects
CC.SS - conditional converb same subject
COL – collective numeral
COMIT – comitative
COND – conditional
CONJ – conjunction
CONTR – contrastive
CP - perfective converb
CV.PST – past converb
CV.RES – resultative converb
DAT – dative
DEC - decausative
DEL - deliberative future
DEST – destinative case suffix
DIM – diminutive
DIR – directive case suffix
DISC - discursive particle
DIST - distributive

DIV - diversative
EMPH – emphatic
EVID – evidential
EXCL – exclusive first person pronoun
FOC – focus particle
FP - future participle
FUT – future
GER – gerund
HORT – hortative
IC - simultaneous converb
IM - imperfective
IMP – imperative
IMPRS – impersonal
INC - inchoative
INCL – inclusive first person pronoun
INDEF – indefinite
INEXP - unexpected modality
INS – instrumental
INTS – intensifier
LIMIT – limitative
LOC – locative
MOM - instant action converb
NECES - necessity modality
OPER - operation suffix
OPT - optative
ORN - ornative suffix
PASS - passive

PC.DS - perfective converb, different subjects
PC.SS - perfective converb, same subject
PP - past participle
PP.PASS - past participle passive
PRF – perfect
PRIV - privative
PRO - prospective
PROL - prolative case suffix
PRP - present participle
PRP.PASS - present participle passive
PST – past
PURP – purposive
Q – question mark
RECP – reciproc
REFL – reflexive
REGR – regressive
REV - reversive
RUSSIAN - Russian borrowing
SEM - semelfactive
SING – singulative
SS.PL - same subject suffix plural
SS.SG - same subject suffix singular
TOP – topic
VOC – vocative
VQ - question verb

Citing the Udihe corpus:
Perekhvalskaya, Elena. 2021. A narrative corpus of Udihe. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Ut-ma'in

Language and its speakers

U̠t‑Ma'in is a Kainji (Benue-Congo, Niger-Congo) language spoken in northwestern Nigeria (Gerhardt 1989; Blench 2018). Most speakers live in Fakai Local Government Area near the border of Kebbi State and Niger State. The principal town of U̠t-Ma'in speakers is Mahuta, located on the road that runs from Dabai, near the town of Zuru, to the town of Koko on the main road to the Kebbi State capital Birnin Kebbi. The United Nations Office for the Coordination of Humanitarian Affairs (2017) present a projected 2016 population of Fakai Local Government Area as 161,365.

U̠t-Ma’in speakers comprise a minority group among minority language groups in the area. The U̠t-Ma’in language area is bordered by Hausa ([hau], Chadic, Afroasiatic) spoken to the north and west, by C’Lela [dri] and Gwamhi-Wuri-Mba [bga] spoken to the northeast and east, by U̠t-Hun [uth] to the south, and U̠s-Saare [uss] to the southeast. With the exception of Hausa, all other neighboring languages belong to the Northwest Kainji Cluster. Most native speakers of all Northwest Kainji languages are bilingual in Hausa; Hausa is used between speakers who do not share a common first language.

A more detailed description of the language can be found here.

Language name

U̱t‑Maꞌin is a recently applied cover term for seven mutually‑intelligible varieties previously known in the literature by their individual varietal endonyms Kag-Fer-Jiir-Koor-Ror-Us-Zuksun (e.g., Ror for U̱t‑MaꞌRor), by the label Kag Cluster (Gerhardt 1989: 362-363), and by exonyms Puku-Geeri-Keri-Wipsi from the neighboring C’Lela [dri] language, as well as Fakkanci, Gelanci, etc., in the regional language Hausa [hau]. U̠t-Ma'in roughly translates as ‘our (incl.) language’.

Corpus composition

Compiled and annotated by Rebecca Paterson. Texts contributed by Ibrahim Tume Ushe, Mama Iliya, and Ibrahim Yohanna. Free translations completed in consultation with Sunday John.

Select U̠t‑Ma'in texts annotated for speech reports are taken from approximately seven and a half hours of recorded and translated data; the larger corpus contains texts in various genre including folk narratives, personal narratives, pear story retellings, and conversational data. These texts were collected during twelve months of fieldwork conducted in 2005–2007, 2013 and 2017. The annotated texts come from the Ror and Jiir varieties.

List of parts of speech

Adj - adjective
Adv - adverb
Advlizer - adverbializer
CardNum - cardinal numeral
Class - noun class
Comp - complementizer
Conn - connective
CoordConn - coordinating connective
DefPro - definite pronoun
Dem - demonstrative
DerivedN - derived noun
DerivedV - derived Verb
Det - determiner
Ideo - ideophone

Imp - imperative
IndefPro - indefinite pronoun
Interrog - interrogative
Intj - interjection
N - noun
Neg - negative
Nom - nominal
NomPrt - nominal particle
NProp - proper noun
Num - numeral
OrdNum - ordinal numeral
PersPro - personal pronoun
PossPro - possessive pronoun
Prep - preposition

Preverb - preverbal particle
Pro - pronoun
Pro-Form - pro-form
Prt - particle
Q - question
Quant - quantifier
ReflexPro - reflexive pronoun
Rel - relativizer
V - verb
Vi - intransitive verb
Vi>Vt - derived transitive verb
Vt - transitive verb

List of glosses

AG1 - class 1 agreement
AG2 - class 2 agreement
AG3 - class 3 agreement
AG4 - class 4 agreement
AG5 - class 5 agreement
AG6 - class 6 agreement
AG6b - class 6b agreement
AG7 - class 7 agreement
AGDim - class Diminutive agreement
AGAug - class Augmentative agreement
AGT - agent
APPL - applicative
ASSOC - associative
AUG - augmentative
CAUS - causative

C1 - class 1
C2 - class 2
C3 - class 3
C4 - class 4
C5 - class 5
C6 - class 6
C6b - class 6b
C7 - class 7
CDim - class Dim
CAug - class Aug
FUT - future
GOAL - goal
HAU - Hausa loan
IDEO - ideophone
LOC - locative

NEG - negative
NPERS - nonpersonal
OBJ - object
PFT - perfect
PL - plural
POSS - possessive
PROG - progressive
PST - past
PURP - purpose
Q - question
REL - relativizer
SBJ - subject
SG - singular
SOURCE - source

Citing the Ut‑Ma'in corpus:
Paterson, Rebecca, comp. 2022. A collection of U̱t-Maꞌin narrative texts annotated for reported speech. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

A narrative corpus of Wan

Language

Wan is a South Mande language spoken in central Côte d'Ivoire. It is traditionally divided into two dialects: Mian and Ken. Other languages spoken in the same area with which Wan is in contact are Mwan, Guro, Baule, and Dyula. Wan is not endangered and has a thriving storytelling tradition.

Corpus composition

The corpus contains traditional narratives from a variety of sources: audiorecordings made by Tatiana Nikitina in the village of Kounahiri and in Abidjan in the period of 2001-2016; videorecordings made by Alexandre Djamala in a number of different villages in 2019; as well as transcriptions from Philip L. Ravenhill’s archive, of recordings made in several different villages in 1973-1974 (Philip L. Ravenhill papers, National Anthropological Archives, Smithsonian Instituion).

List of parts of speech

*** - unanalyzable word or word from another language
adv - adverb
adv.adj - adverb from the adjectival subclass
aux - auxiliary
cnj - conjunction
cop - copula
dtm - determiner
idph - ideophone
intj - interjection

mrph - morpheme
n - noun
n.adj - property noun
n.loc - locative noun
n.num - numeral
np - personal noun
np.loc - toponym
onom - onomatopoeia
pp - postposition

pred - predicative marker
prn - pronoun
prt - particle
prt.asp - aspectual particle (post-verbal)
prt.fin - clause-final particle
v - verb
vn - formulae, greetings, signalling words

List of glosses

1, 2, 3, 1+2 - 1st, 2nd, 3rd person, first person dual
adj.foc - adjunct focus
AGNT - agentive
cop - copula
def - definite marker
dtm - determiner
emph - emphatic
excl - exclamative
excl - exclusive
foc - focus
hab - habitual
idph - ideophone
imper - imperative
imper.neg - negative imperative
imperf - imperfective

incl - inclusive
indep - independent pronominal series
intj - interjection
log - logophoric pronoun
neg - negation
nmlz - nominalization
onom - onomatopoeia
past - past tense
perf - perfect
pl - plural
poss.aln - alienable possessor
pres - presentative
prog - progressive
prosp - prospective
prt - particle

prtc.reslt - resultative participle
prtr - preterit
prtr2 - second preterit
ps.refl - pseudo-reflexive
purp - purpose
q - question marker
quot - quotative marker
recipr - reciprocal pronoun
refl - reflexive pronoun
reslt - resultative
sg - singular
subj.foc - subject focus
subj>obj - transitivity marker
supn - supine

Citing the Wan corpus:
Nikitina, Tatiana. 2023. A narrative corpus of Wan. In Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

Citing the corpus and tools

Corpus:

Nikitina, Tatiana, Ekaterina Aplonova, Izabela Jordanoska, Ekaterina Biteeva, Abbie Hantgan-Sonko, Guillaume Guitang, Olga Kuznetsova, Rebecca Paterson, Elena Perekhvalskaya & Lacina Silué (eds.) 2022. The SpeechReporting Corpus: Discourse Reporting in Storytelling. CNRS-LLACAN & LACITO, http://discoursereporting.huma-num.fr/index.html

Template:

Nikitina, Tatiana, Hantgan-Sonko Abbie & Chanard Christian. 2019. Reported speech annotation template for ELAN (The SpeechReporting Corpus). Villejuif-Paris: LLACAN.

Elan-CorpA:

Chanard, Christian. 2015. ELAN-CorpA: Lexicon-aided annotation in ELAN. In Amina Mettouchi, Martine Vanhove & Dominique Caubet (eds.), Corpus-based Studies of Lesser-described Languages: The CorpAfroAs corpus of spoken AfroAsiatic languages (Studies in Corpus Linguistics 68), 311–332. Amsterdam: John Benjamins Publishing
Company. https://doi.org/10.1075/scl.68.10cha.
https://benjamins.com/catalog/scl.68.10cha (2 October, 2020).