Why do wordlists matter?

Some context

Priorities: With limited recording capacity, it is better to use frequency lists so that the most frequent words are recorded first. With unlimited recording capacity, the order matters little, since we assume that all target words will eventually be recorded.
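As a rough illustration of this prioritisation, the sketch below counts words in a text and returns them from most to least frequent, optionally cut off at a recording limit. The function name `recording_queue`, the `capacity` parameter and the naive tokenizer are assumptions made for this example, not part of any existing tool.

```python
import re
from collections import Counter


def recording_queue(corpus_text, capacity=None):
    """Return the words of corpus_text ordered from most to least frequent.

    capacity is a hypothetical limit on how many words can be recorded;
    None means unlimited, in which case every word is returned.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters,
    # optionally joined by apostrophes.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", corpus_text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked if capacity is None else ranked[:capacity]


sample = "the cat saw the dog and the dog saw the bird"
print(recording_queue(sample, capacity=3))  # ['the', 'dog', 'saw']
```

With `capacity=None` the same call returns all six distinct words, which matches the unlimited case where the order matters little.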

Corpus purpose: For language learning, written transcripts of spoken language, such as film subtitles, are known to be better source material (see the SUBTLEX studies, 2007). Other corpora can also serve well as a basis for audio recordings. For lexicographic purposes, as on Wiktionary, rare words are as interesting as frequent ones, and the aim is to provide audio for every item.

Consistency: It is best to provide consistent audio data, with the same tone (neutral or enthusiastic) and the same speaker.

Lexicon range for learners: For language learners, and assuming the most frequent words are learnt first, a minimum vocabulary of 2,000 to 2,500 base words is required to bring the learner to an autonomous level. Language-teaching academics call this the “threshold level”. The CEFR (Common European Framework of Reference for Languages: Learning, Teaching, Assessment) (doc), the Chinese HSK and some academic statements suggest the following relation between lexicon size, CEFR level and competence:

Lexicon (*) Level CEFR descriptor
600 A1 “Basic user. Breakthrough or beginner”. Survival communication, expressing basic needs.
1,200 A2 “Basic user. Waystage or elementary”.
2,500 B1 “Independent user. Threshold or intermediate”.
5,000 B2 “Independent user. Vantage or upper intermediate”.
20,000 C2 “Mastery or proficiency”. Native speaker after graduating from high school.

(*) Assuming the most frequent word families are learnt first.
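The table above can be read as a simple lookup from vocabulary size to the highest level whose threshold is reached. The following is a minimal sketch under that reading; `THRESHOLDS` and `estimated_level` are hypothetical names, and C1 is left out because the table gives no figure for it, so any size between 5,000 and 19,999 is reported as B2.

```python
# Thresholds copied from the table above: (minimum lexicon size, CEFR level).
# C1 is absent because the table gives no figure for it.
THRESHOLDS = [
    (600, "A1"),
    (1200, "A2"),
    (2500, "B1"),
    (5000, "B2"),
    (20000, "C2"),
]


def estimated_level(known_words):
    """Return the highest level whose threshold is reached, or None below A1."""
    level = None
    for minimum, name in THRESHOLDS:
        if known_words >= minimum:
            level = name
    return level


print(estimated_level(3000))  # B1
print(estimated_level(500))   # None
```

The estimate only holds under the table's own assumption that the most frequent word families are learnt first.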

See also CEFR (image), with the most relevant section cited below:

Vocabulary range

C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning.
C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms.
B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and circumlocution.
B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events.
A2 Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics. Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs.
A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations.


Vocabulary control

C2 Consistently correct and appropriate use of vocabulary.
C1 Occasional minor slips, but no significant vocabulary errors.
B2 Lexical accuracy is generally high, though some confusion and incorrect word choice does occur without hindering communication.
B1 Shows good control of elementary vocabulary but major errors still occur when expressing more complex thoughts or handling unfamiliar topics and situations.
A2 Can control a narrow repertoire dealing with concrete everyday needs.
A1 No descriptor available.

Users of the Framework may wish to consider and where appropriate state:

• which lexical elements (fixed expressions and single word forms) the learner will need/be equipped/be required to recognise and/or use;
• how they are selected and ordered.