Chat room

Welcome to the Chat room!

This is the place to discuss any and all aspects of Lingua Libre: the project itself, its operations, policy and proposals, technical issues, etc. Other forums include LinguaLibre:Technical board for code-oriented issues and LinguaLibre:Administrators' noticeboard.

Feel free to participate in any language you want to.

Chatroom FAQ

How do I download all audio files for one language? By speaker?
- Languages are available at https://lingualibre.fr/datasets/. A short server-side script runs automatically every 2 days, itself using lingua-libre/CommonsDownloadTool. For more, see Help:Download from LinguaLibre.

How do I add a missing language?
- Administrators can add new languages and usually do so within a few days. As a user, please provide your language's ISO 639-3 code and a link to its en.wikipedia.org article. Optional information: the common English name and the Wikidata ID. For more, see Help:Add a new language.

How do I archive sections that have been answered?
- After reviewing the section, add `{{done}} -- can be closed ~~~~` to the top of the section. After a few days to 2 weeks, move the section's code to LinguaLibre:Chat_room/Archives/2018.

How do I keep my Wikimedia project up to date?
- Contact User:0x010C, the botmaster of Lingua Libre Bot. For more, see Help:Bots.

Which IRL events are coming up? When? Where?
- The Paris LinguaLibre:Hackathon_15-16_Décembre just finished. More events to come. For more, see LinguaLibre:Events.

Using the Lingua Libre Bot on incubator:shy

Is it possible to do the same thing for the Chaoui Wiktionary? I mean, could your bot also be used on our Wiktionary? I can provide the algorithm for the test wiki. Regards. -Reda Kerbouche (talk) 12:32, 8 July 2018 (UTC)

Yes, of course! Do you have a bistro / village pump / ... where we could discuss it over there? -- 0x010C ^~talk~ 15:24, 8 July 2018 (UTC)

Yes, there is an empty bistro on the Chaoui Wiktionary that you can activate. Or the Incubator's one, where we can discuss authorizing the bot with the administrators. Regards. -Reda Kerbouche (talk) 18:26, 8 July 2018 (UTC)

I am currently on my way to Wikimania, so I will have very little time until then, but I will start the discussion when I get back. Regards -- 0x010C ^~talk~ 11:43, 11 July 2018 (UTC)

Have a good trip.--Reda Kerbouche (talk) 21:48, 11 July 2018 (UTC)

0x010C I hope you haven't forgotten me =) In September we are launching a contest for the Chaoui Wiktionary, and if we can record words that go straight to the Incubator, I will promote Lingua Libre along with the contest.--Reda Kerbouche (talk) 14:01, 16 August 2018 (UTC)

Reda Kerbouche, 0x010C, is this {{done}}? --Yug (talk) 11:04, 15 December 2018 (UTC)

Bot-related documentation could be gathered in Help:Bots. Yug (talk) 11:01, 31 December 2018 (UTC)

A PetScan-style list

Hi, would it be possible to build an on-the-fly list along the lines of what PetScan can do? Here we have the list of all Wiktionary lemmas that lack the category "Prononciations audio en français", which means they do not have the "écouter" template that adds entries to that category. I think generating such a list would be really nice for the Wiktionaries. Pamputt (talk) 06:07, 12 July 2018 (UTC)

It is indeed a good idea, but integrating it into Lingua Libre is a big job. I think it would be worth discussing it a bit and drawing up a short specification of what we want to be able to do (not everything in PetScan is useful here). -- 0x010C ^~talk~ 22:00, 14 July 2018 (UTC)

0x010C, do you think the example I gave above (French lemmas that have no pronunciation) could be implemented starting from MediaWiki:Gadget-Demo.js? Pamputt (talk) 14:23, 14 October 2018 (UTC)

Yes, that's exactly it: it has to be done by creating a new word generator. In my initial thinking above, I was wondering how to implement PetScan's features in a generator. Except that, in terms of performance and speed, we could never build something usable with categories as big as "Lemmes en français". Let me explain. PetScan does its searching and cross-referencing server-side, directly on a copy of the wikis' database (so it can explore all records in one go). Here, however, we have no access to the database and the computation has to be done client-side, in JavaScript. We therefore depend on the API of the wikis in question to fetch the data, an API that is not at all designed to work on very large categories (it can return only 500 members per request, etc.).

In short, it is not possible. However, we could rely on PetScan to do the tedious work for us (this generator would become completely dependent on that external tool; an outage of the latter would block the former). I see three options:

1. the generator reuses a number of PetScan's fields and builds a PetScan query from the values provided (complex for the average user, flexible for the experienced user);
2. the generator lets the user choose among a number of PetScan queries that we prepared in advance (for example, clicking "French words with no pronunciation on the French Wiktionary" would use your query shown above), or paste the URL / identifier of a query they prepared or found (simpler to implement, forces us to create lots of queries to support different languages, fairly flexible);
3. we build a specialized "words with missing pronunciation" generator that automatically forges the PetScan query to do as in your example for the selected language (easy to use, very specific but potentially very useful, would force us to enter the corresponding Wiktionary categories by hand, because I see no way to guess a language's category name from its code or Wikidata ID...).

What do you think?

-- 0x010C ^~talk~ 02:53, 16 October 2018 (UTC)
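
To make the 500-members-per-request ceiling mentioned above concrete, here is a minimal Python sketch of the paging loop a client has to run against the MediaWiki `categorymembers` list. It is only an illustration, not the gadget's code (which is JavaScript): `fetch` is a hypothetical callable standing in for the HTTP request, and `make_fake_fetch` is a made-up in-memory stand-in so the loop can be tried without a network.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 500  # per-request ceiling mentioned in the discussion


def iter_category_members(category: str,
                          fetch: Callable[[Dict[str, str]], dict]) -> Iterator[str]:
    """Yield every page title in `category`, one API request per batch."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(PAGE_SIZE),
        "format": "json",
    }
    while True:
        reply = fetch(params)
        for member in reply["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in reply:
            break
        # Merge the continuation token into the next request.
        params = {**params, **reply["continue"]}


def make_fake_fetch(titles: List[str]):
    """Hypothetical stand-in for the HTTP call; also counts the requests made."""
    calls = {"n": 0}

    def fetch(params: Dict[str, str]) -> dict:
        calls["n"] += 1
        start = int(params.get("cmcontinue", "0"))
        limit = int(params["cmlimit"])
        batch = titles[start:start + limit]
        reply = {"query": {"categorymembers": [{"title": t} for t in batch]}}
        if start + limit < len(titles):
            reply["continue"] = {"cmcontinue": str(start + limit), "continue": "-||"}
        return reply

    return fetch, calls
```

With a fake category of 1200 titles the loop needs three round trips; a real category with hundreds of thousands of lemmas would need hundreds of sequential requests before any cross-referencing could even start, which is the cost described above.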

The first proposal seems too convoluted to me and, although powerful, I don't think it is aimed at Lingua Libre's audience. Between proposals 2 and 3, I prefer 2 because it is simple to use at first glance (pre-built queries) while still allowing advanced use (the advantage of solution 1). And compared with solution 3, it avoids the maintenance needed to match categories to languages, so it is more maintainable in the long run in my opinion. Pamputt (talk) 06:23, 17 October 2018 (UTC)

@Pamputt: Between two flights, I have just finished a first version of the PetScan generator, which can be enabled via Preferences > Gadgets. Could you take a look and tell me what you think before I go further and announce it more widely?

Thanks -- 0x010C ^~talk~ 08:39, 22 October 2018 (UTC)

0x010C, I enabled the gadget and I do see PetScan in the list. I ran a few tests and it works well. I tried with the URL from the first message and it works perfectly. However, I tried with this one and it tells me "Petscan output something weired with this URL, check it and come back afterwards.". If I add "&doit=" at the end, though, it works correctly (is that really necessary?).

Another point: is it already possible to prepare ready-made queries ("French words with no pronunciation on the French Wiktionary", ...) or not yet? As it stands, it is already super cool. Pamputt (talk) 17:04, 22 October 2018 (UTC)
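
For readers wondering what the "&doit=" workaround amounts to, here is a small Python sketch that appends the parameter only when it is missing. It is an assumption about how one could normalize a pasted URL, not the gadget's actual fix; `ensure_autorun` is a made-up name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_autorun(url: str) -> str:
    """Return `url` with an (empty) `doit` parameter present in its query string."""
    parts = urlsplit(url)
    # keep_blank_values keeps an already-present empty "doit=" from being dropped.
    query = parse_qsl(parts.query, keep_blank_values=True)
    if "doit" not in {key for key, _ in query}:
        query.append(("doit", ""))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling it twice gives the same result as calling it once, so it is safe to apply to every pasted URL, whether or not it already carries the parameter.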

I had forgotten that some URLs might lack the auto-run; that's fixed. I'm currently thinking about the best way to do this, actually. My concern is that a query like "French words with no pronunciation on the French Wiktionary" will only interest those recording in French; if a German speaker has to scroll past 25 queries that don't concern them (and that they probably don't understand) before finding one in German, that's not great for them.

From there, three ideas come to mind as I write these lines:

1. One page per language, in the List namespace (List:fra ? List:fra-external ? List:fra-examples ? ...), gathering all the URLs available for a language in a bulleted list;
2. Once that work is done, it is not very complicated to support other external tools that can be called via a URL and return the result as JSON; I'm thinking in particular of query.wikidata.org;
3. And here, more of an open question: once it is stable, would it make sense to integrate it into the current "lists" generator (e.g. with two tabs inside, "static lists" and "dynamic/external/... lists"), or to integrate it as a full new generator in the RecordWizard core? (And in that case, what should it be called?)

An outside opinion would be very helpful to settle all this :) -- 0x010C ^~talk~ 19:52, 22 October 2018 (UTC)

Geographic variations

Hello,

Congratulations on this very interesting project.

I have a question about pronunciations. I am from the south of France and, unlike much of the rest of France, we make heavy use of the tonic stress (Italian and Spanish influence, I guess). As a result, the pronunciation of some words, and especially of phrases, has a different rhythm where I live.

How should these pronunciation variations be handled? Are they welcome, or, like the Québécois, should we favour a neutral "international French"?

To finish on the subject, some words are pronounced differently here: lait, mas, moins (with an s!), etc. How can that be integrated into Wiktionary or Wikipedia?

Jpgibert (talk) 12:02, 13 July 2018 (UTC)

Hello,

Thanks for your interest!

No, you definitely should not favour a "neutral" French. Every local variation / accent is interesting. In fact, just before you start recording you are asked to fill in your speaker profile, where you can indicate where you live / where you learned a language.

When a recording is later added to Wiktionary, for example, that information is included. If several people have recorded the same words, we will be able to hear how the pronunciation of "lait" differs in Alsace, Quebec, Occitanie, Île-de-France, Mali, ... And that is cool :)

Does that answer your questions?

Regards -- 0x010C ^~talk~ 21:55, 14 July 2018 (UTC)

Hello User:0x010C

Thanks for the reply. I was worried about this because, while there are language codes for the varieties of French in Quebec (fr-CA) or Belgium (fr-BE), accent is not taken into account.

Glad to learn that despite my accent I will be welcome. For now I need to buy a good microphone before doing anything, but as soon as I have one I will try to share my southern accent.

Jpgibert (talk) 12:31, 23 July 2018 (UTC)

Thesaurus

Hello,

During the project presentation video by Lyokoï (LetsContribute6), I learned that word lists can be generated from categories. Would it be possible to do the same kind of thing from a thesaurus? Follow-up question: would that be of any use?

Jpgibert (talk) 12:39, 23 July 2018 (UTC)

That could indeed be interesting, even if it is more complicated to code (I imagine). Just to give an example for those who don't see what this is about, have a look here. Pamputt (talk) 21:30, 23 July 2018 (UTC)

@Jpgibert The simplest way to do that is to copy-paste the content of the thesaurus and separate the words with a #. It should take a few minutes to format, but it's no big deal either. Lyokoï (talk) 14:22, 15 December 2018 (UTC)

General issues + issues with Odia and Asian writing systems

Done, all issues tracked on Phabricator or explained below. Ready to archive. Yug (talk) 23:22, 23 December 2018 (UTC)

I loved the current version! Truly admire the changes you all have made over time. I have also done a few recordings in my own language Odia to check for any error. Below are a few:

1. Tag already recorded items (T212580): When a word has already been recorded and uploaded to Commons, does it not make sense to show it with a flag instead of letting any user upload it directly?
2. Add custom Commons categories (T201135): Also, different languages have different additional categories which Lingua Libre does not let one add. For instance, I generally add a user category to know how many audio files I have uploaded. For the files recorded using Lingua Libre, I don't see an option to add that optional category.
3. Remove duplicated words (in same session: explanation below; across time: T212580): If I am adding a wordlist before recording, is it possible to keep only one word if the same word is used multiple times? This would save some time for the uploader.
4. Monitor suspect crackling sound in audios (T201136): There is a bit of crackling sound that is heard while monitoring the recorded words. Any particular reason?
5. Some words fail anyway (T212584): Even though I am correctly pronouncing every word, I see a lot of red-labelled words.
6. Allow click-play-listen while recording (T212583): While recording, I cannot check how the recording sounds. I can only choose to re-record after hearing the recorded sound. Otherwise even having that option is of no use.
7. Remove underline (done): While recording, each word is shown as a green button, and during the recording the word is underlined. This works well for Latin-based scripts. However, for my script, Odia, and even many other Asian languages, this is a problem as we have diacritics and accent marks below the character. It becomes too hard at times to read when underlined. Also, the light green color on a white background is not accessible to people with corrections or color blindness. Maybe a black background with white text would create more contrast and make it easier to read.
8. Last word cannot be re-recorded (explanation below): When you reach the last word of a batch and want to re-record that word, it doesn't allow you to click on the word button and re-record.

Also, requesting the addition of the Warang Citi (used for the Ho language) and Ol Chiki (used for the Santali language) scripts.

Thank you very much again. I would really love to contribute more myself, and involve other community members. --Psubhashish (talk) 07:21, 26 July 2018 (UTC)

Hi!

First of all, thanks for your feedback, that's really helpful. Here are some details about your remarks:

1. In my opinion, it is interesting to have several records of the same word by different users; the naming convention takes this into account to prevent records from being overwritten by another user. But as I'm not sure I understood this point very well, don't hesitate to clarify it if my answer is mistaken.
2. T201135
3. If I have correctly understood your point, that's already the case. You can't add duplicate words in the same record batch (if you try to do so, the second one will be dismissed).
4. It's just a small file-loading issue, it will be fixed soon, see T201136
5. This is a major issue I'm already aware of. In some cases (~ 1 word out of 100), for some unknown reason, MediaWiki mistakes WAV files for executable files, so it refuses them...
6. I'll try to add a way to listen to the records while still in the recording studio.
7. I wasn't aware of those particularities, I'll remove the underline. I'm not so fond of the white text on black, but I'll try to find something more accessible.
8. Hum, this works well for me. When you have recorded the last word, the recording automatically stops; did you click on the big red button to enable it again?

I've imported the Ho language, which was missing from Lingua Libre, but the two writing systems you mentioned are part of Unicode and should work, am I wrong?

Best regards — 0x010C ^~talk~ 08:37, 3 August 2018 (UTC)

+1 for point 7, the underline is also troublesome for Chinese. Yug (talk) 13:08, 6 August 2018 (UTC)

Hi! Continuing the cleaning effort and tracking of issues, also to stay short and concise, I enhanced the initial post with title and status (phabricator issue). Sorry for that, just cleaner. Yug (talk) 11:33, 24 December 2018 (UTC)

First use: a few questions

Hello!

First of all, thank you very much for this great tool!

I noticed a few difficulties in use. Maybe it's just because I'm new and not aware of all the options, but here is my list:

1. On a list of 20 words, I usually have to restart the recording manually three or four times because the tool suddenly decides to stop recording. When I select a word, even when clicking the big red button, there is roughly a one-in-two chance that the recording starts.
2. My words are very often cut off at the beginning and at the end (especially proper nouns of two or three words): maybe it would make sense to have a small "next" button to mark word endings manually? Out of 20 recorded words, between those the tool won't let me record (see #1) and those that are cut off, few remain. On 3 lists of about twenty names, I got 2, 5 and 7 usable ones.
3. On a recording page such as Q44570, the link to the Wikipedia page puts a + instead of a _ between the words, so you land on a red link on Wikipedia.

In case it helps, I am on the latest version of Firefox as of 10/10/18 & Windows 10.

As for the rest: it's really great, congratulations on all this work! I will keep playing with the tool until I'm comfortable with it.

Exilexi (talk) 06:22, 10 October 2018 (UTC)

Problems 1 and 2 are in fact almost solved with a better microphone. Lingua Libre asks permission for a microphone that is not my default microphone, for some unknown reason.

New problem with uploading: all words but 1 are uploaded fine. The Commons button is greyed out and nothing happens if I click the small cross next to a word: apparently it's all or nothing for publishing to Commons, so I just lost 29 words because a single one refused to upload. Exilexi (talk) 06:44, 10 October 2018 (UTC)

Adding one more: I had recorded 20 words "around me". Just now I launched 20 more... and they are the same ones. It could be worth adding an option to avoid recording the same thing several times (my accent doesn't change from one day to the next). Exilexi (talk) 05:36, 11 October 2018 (UTC)

Hi Exilexi, a few remarks or partial answers to your comments:

- When you describe the tool stopping the recording, I think the problem comes from the microphone quality. That seems to be what you concluded as well.
- Lingua Libre splits words automatically as soon as it detects a silence. For very long names, we could consider adding a button to move manually to the next word. That said, it would defeat part of the tool's purpose, because it becomes much slower.
- Regarding the link to Wikipedia (with a "+"), it does look like a bug. I opened a ticket on Phabricator.
- For the upload problems, when an upload fails, a ticket already exists on this subject.
- For word lists, you can create your own. There are already several in French (a few dozen) and fewer in other languages. How to proceed is explained here. If you need help, let us know. Pamputt (talk) 20:13, 11 October 2018 (UTC)
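
On the "+" versus "_" link problem, here is a small Python sketch of how a Wikipedia article URL can be built from a title with spaces. The actual cause of the bug in Lingua Libre is not stated in this thread, so this only illustrates the expected result; `wikipedia_url` is a made-up helper and the default `fr` subdomain is an assumption.

```python
from urllib.parse import quote


def wikipedia_url(title: str, lang: str = "fr") -> str:
    """Build an article URL: spaces become underscores, the rest is percent-encoded."""
    # Form-style encoders (e.g. urllib.parse.quote_plus) turn spaces into "+",
    # which MediaWiki reads as a literal plus sign in the page title.
    path = quote(title.strip().replace(" ", "_"), safe="_:()',")
    return f"https://{lang}.wikipedia.org/wiki/{path}"
```

For example, `wikipedia_url("Tour Eiffel")` yields a link ending in `Tour_Eiffel` rather than `Tour+Eiffel`.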

Formosan languages workshop

Hi there, I had an email exchange with Vicky, the NCCU language researcher involved in Formosan languages protection. Some of her questions are beyond my skills :

1. I couldn't find ais(Sakizaya), ami(Amis), trv(Truku) in the language list. Please add, thanks!
2. Can I add the dialect information in the speaker file? 
Because there are 42 dialects under 16 aboriginal languages, I had record Squliq dialect not C’uli’ dialect of Atayal language today.
3. I had add the Chinese translation after the aboriginal languages, is that ok for lingua libre? 
Or I only can type in aboriginal languages?

I broke the questions into several subsections so that a quick discussion can take place for each. Please note that Vicky's workshop is coming this week, so it would be good to forward her practical solutions early. Yug (talk) 09:38, 29 November 2018 (UTC)

1) Requesting languages additions

- Amis_language (iso: ami; wikidata: Q35132).
- Sakizaya has no ISO 639 code, from my understanding. Sakizaya_language (iso: none, wikidata: Q718269), Nataoran_language (iso: ais, wikidata: Q42508148).
- Truku (no iso, no wikidata): described in Wikipedia as the main component of the Seediq language (iso: trv, wikidata: Q716686), already in LinguaLibre. Taiwanese linguists, the most experienced in the matter, make a distinction.

If I understand correctly, LL only requires a Wikidata ID. If so, I would recommend adding Q35132 (Amis) and Q718269 (Sakizaya). Q42508148 (Nataoran) and Q716686 (Seediq) are already in, I think. Truku may require creating a Wikidata item, then integrating it into LL. Yug (talk) 09:38, 29 November 2018 (UTC)

The four languages have been imported here: Q51311 Seediq, Q51870 Amis, Q51871 Sakizaya and Q51872 Nataoran and can be used for recording. Pamputt (talk) 04:15, 30 November 2018 (UTC)

2) "There are 42 dialects under 16 aboriginal languages".

We previously added 15 or 16 of these recognized languages to LinguaLibre (thanks x0 and Pamputt). Again, Taiwanese linguists are the experts on the matter, so what can we (LL) recommend for these 42 variants? Two ideas came to me.

- Add the information in the speaker name or place of learning. For example: Paul Martin (Breton north); Paul Martin (Breton south).
- Add the Wikidata items following the Taiwanese linguists' recommendations, even though no Wikipedia article or ISO 639 code exists.

What do you think ? Yug (talk) 09:38, 29 November 2018 (UTC)

As far as I understand, if no Wikidata item exists for a given language, we have two options: create it on Wikidata (if it is notable) and import it here afterwards, or create it by hand directly here. So for dialects, I would say they are notable enough to be created on Wikidata, but I have no time to do it myself before the end of the year (I have no regular Internet connection for now). Pamputt (talk) 04:18, 30 November 2018 (UTC)

In fact, the second option mentioned above by Pamputt won't work. For a language to be recognised by the RecordWizard, it has to have a Wikidata ID. The right way to do it imho is (as also suggested by Pamputt) to create the corresponding item on Wikidata, and then ask for an import here. -- 0x010C ^~talk~ 14:46, 3 December 2018 (UTC)

3) "Is it ok to use `mhway su (谢谢)` ?" (target word + translation)

Technically, you can type both the aboriginal language and Chinese, i.e. de facto the target word together with its translation in the closest macro-language, here Chinese.
Be extremely consistent in your practice, so as to ease later uses (learning apps). If the rule is

{aboriginal}{white_space}{opening_round_bracket_(}{Chinese}{closing_round_bracket_)}

stick to it, and avoid round brackets elsewhere in your element. Early consistency makes later uses easier. Yug (talk) 09:38, 29 November 2018 (UTC)
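
To show why consistency pays off later, here is a minimal Python sketch of a checker for the rule above (target word, one space, translation in round brackets). The regular expression and the name `split_entry` are illustrative assumptions, not something Lingua Libre provides.

```python
import re

# One target word or phrase, one space, one parenthesised translation, nothing else.
ENTRY = re.compile(r"^(?P<target>[^()]+?) \((?P<translation>[^()]+)\)$")


def split_entry(line: str):
    """Return (target, translation), or None when the line breaks the convention."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    return match.group("target"), match.group("translation")
```

An entry such as `mhway su (谢谢)` splits cleanly, while entries with no brackets, several bracket pairs, or full-width brackets are reported as `None`, which makes inconsistent lines easy to spot before a workshop.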

@x0, devs, here again we have the question of wordlists with translations. I previously suggested that word lists support an ISO 639 or Wikidata ID syntax so as to push the translation into a different metadata field. Example of a list:

mhway su [cmn:谢谢,eng:Thank]

Here "mhway su" is the target recorded word, "谢谢" is the translation stored in the metadata field "cmn" (Chinese), and "Thank" is the translation stored in the metadata field "eng" (English). I guess I should open a ticket on Phabricator. Yug (talk) 10:19, 29 November 2018 (UTC)
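
A rough Python sketch of how a line in the syntax proposed above could be parsed. This syntax is only a suggestion in this thread and is not supported by Lingua Libre, so `parse_line` and its regular expression are purely hypothetical.

```python
import re
from typing import Dict, Tuple

# Word to record, optionally followed by [code:translation,code:translation].
LINE = re.compile(r"^(?P<word>[^\[\]]+?)\s*(?:\[(?P<meta>[^\[\]]*)\])?$")


def parse_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a list line into the word to record and its per-language translations."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised list line: {line!r}")
    translations: Dict[str, str] = {}
    # A line without a bracketed part has no metadata; skip empty items.
    for item in filter(None, (match.group("meta") or "").split(",")):
        code, _, text = item.partition(":")
        translations[code.strip()] = text.strip()
    return match.group("word"), translations
```

One limitation of this simple split: a translation that itself contains a comma would be cut in two, so a real implementation would need an escaping rule or a different separator.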

Multilingual wordlists (wordlists including the translation of target words) are not supported at the moment. An issue has been opened on LinguaLibre's development and bug tracking system (T211086). Yug (talk) 09:29, 4 December 2018 (UTC)

Thesaurus (2)

I archived the core of the discussion between Benoit & 0x010C, but this other topic deserves a section:

"Unrelated. I was thinking that a small tool for generating lists from a fr.wikt thesaurus would be great. Instead of choosing a category from a wikiproject, one would choose a thesaurus. Just an idea. --Benoît 21:36, 20 December 2018 (UTC)"

--Yug (talk) 10:41, 24 December 2018 (UTC)

Feature request: ask to reuse existing identical audio if available

Done, can be archived. 12:08, 31 December 2018 (UTC)

I waste a lot of time because Lingua Libre Bot requires new audio for every lexeme form. For example, I had to record this audio https://commons.wikimedia.org/wiki/File:LL-Q809_(pol)-KaMan-Bizancjum.wav 10 times (https://lingualibre.fr/index.php?title=Q55850&action=history). Many forms in the Polish language are duplicated across different cases. It would be great if the word generator (+ExternalTools) in the Record Wizard could ask whether a duplicate should be recorded (identical speaker, language and lexeme), and if Lingua Libre Bot could propagate the existing audio. It would save time. KaMan (talk) 14:28, 25 December 2018 (UTC)

KaMan, where do your wordlists come from? How are they created? Do you use the LinguaLibre word generator? Yug (talk) 00:12, 27 December 2018 (UTC)

If I understand correctly, you are essentially facing the same issue as raised in "Warn the user when they try to record a file that they already made". Namely, you keep encountering words that you have already recorded. If this is correct, then we have started to look for technical solutions (T212580). As of now, for long series, it is important to stick to a large frequency list, so as not to re-record similar words multiple times. Yug (talk) 00:17, 27 December 2018 (UTC)

I looked online for available frequency lists in Polish.

Subtlex-pl : article, http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-pl data, available but "for research usage".
Worldlex : article, data, available but unstated license
Hermit Dave, 2016 : page, data, CC-by-sa

So Hermit Dave's data would do. We have tutorials on how to clean up frequency lists, how to split such a long file, other tricks, and how to create a list on LinguaLibre to help.

Some commands will need minor changes if your input differs. If you have some basic shell skills, you can do it and quickly learn the exact commands needed. Yug (talk) 01:30, 27 December 2018 (UTC)

No, Yug. He's talking about word lists generated with a SPARQL query from Lexemes on Wikidata, and about the fact that Lingua Libre Bot only associates audio recordings with a Lexeme when there is a direct link, which forces him to re-record many times words that are both homographs and homophones.

But the main issue I pointed out in T212580 applies here too: I don't have any idea for an easy and effective implementation right now.

(And no, Yug, it is not "important to stick to large frequency list"; we already have other, simpler solutions, such as Wikimedia categories or external tool imports.)

Best regards — 0x010C ^~talk~ 11:10, 27 December 2018 (UTC)

0x010C is right. It's not a problem of a wrong list; the list of words is correct. If there is no easy solution I can work with it as is, but I admit I feel pain ;) before recording 14 identical forms of https://www.wikidata.org/wiki/Lexeme:L19356 :) KaMan (talk) 13:22, 27 December 2018 (UTC)

"Who doesn't try cannot be wrong." One really needs to read between the lines to find the Wikidata reference. "Lexeme" is a lexicology term before being a Wikidata item type. The current SPARQL query doesn't seem time-savvy.

And yes, generally speaking, frequency lists of unique words save our speakers' energy. First, each form is recorded only once: this is what human speakers are for, and they shouldn't have to record the same form multiple times. Second, in natural language, word frequency follows Zipf's law. Thus, the 135 most frequent English items represent 50% coverage of written text. Conversely, recording Wikipedia categories is not representative of human language and thus not time-efficient. Even if one volunteer audio-records 2000 categories, it will still barely account for 1% of the language. This only has internal value, by Wikipedians for Wikipedians, which is positive but sub-optimal.

As for KaMan's case, I would still recommend using a frequency list: it would save valuable human time. A later bot could dispatch the audios to the various Wikidata items of this language and form. So I just used Hermit Dave's CC-by-sa data to create Polish frequency lists on LinguaLibre for the first 20k words; they are now available in the Record Studio > Details step: Local list > "pol". Yug (talk) 13:51, 27 December 2018 (UTC)
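
The coverage argument above can be checked on any frequency list with a few lines of Python. This is only a sketch: `words_needed` is a made-up helper, and the four-word `sample` is toy data, not taken from the Polish lists.

```python
from typing import List, Tuple


def words_needed(freqs: List[Tuple[str, int]], target: float) -> int:
    """Count how many top-ranked words are needed to cover `target` of all tokens."""
    ranked = sorted(freqs, key=lambda pair: pair[1], reverse=True)
    total = sum(count for _, count in ranked)
    covered = 0
    for rank, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(ranked)


# Toy data: four words with 50, 30, 15 and 5 occurrences (100 tokens in total).
sample = [("a", 50), ("b", 30), ("c", 15), ("d", 5)]
```

On the toy data, one word already covers half of the tokens and three words cover 90%; running the same function on a real frequency list shows how many recordings a given coverage would actually take.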

Yug, it's not a problem of frequency lists but a feature of the language. I record all FORMS of words. Every noun in Polish has at least 14 forms, every adjective has 30-80 forms, and the same goes for verbs. Every form has an entry in Wikidata and needs a recording. But many of these forms are identical, so in the end I have to record the same audio several times. This is independent of whether the word comes from a frequency list. In other words, a word from a frequency list has the same problem in Wikidata. BTW: I already follow a frequency list when creating lexemes in Wikidata, but thanks :) KaMan (talk) 16:27, 27 December 2018 (UTC)
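
One possible shape for the requested feature, as a hedged Python sketch: group the forms returned by the query by spelling, record each distinct spelling once, and let a bot attach that one file to every form in the group. Nothing here is existing Lingua Libre code; `group_forms` is a made-up name and the form IDs in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_forms(forms: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct spelling to the IDs of all forms written that way."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form_id, spelling in forms:
        groups[spelling].append(form_id)
    return dict(groups)
```

Given `[("L0-F1", "Bizancjum"), ("L0-F2", "Bizancjum"), ("L0-F3", "Bizancja")]`, only two recordings would be needed instead of three. The catch is that identical spelling does not guarantee identical pronunciation, so the speaker would still need a way to opt out of the merge for a given group.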

I think I get your process now. Still learning! It still seems odd that you are recording the same form 14 times. Yug (talk) 16:58, 28 December 2018 (UTC)

Homonymy

How are homonyms treated? Will they be overwritten by new recordings? Infovarius (talk) 17:42, 27 December 2018 (UTC)

Yes, if a new word has the same transcription, the same language and the same speaker as an old one, it will be overwritten. If you want to record two homonyms that have different pronunciations, you can add a small qualifier in brackets just after the word when you type it in the 3rd step of the RecordWizard. Everything inside the brackets will be set aside, as on this record: File:LL-Q150 (fra)-0x010C-fils (enfant).wav. -- 0x010C ^~talk~ 21:26, 27 December 2018 (UTC)
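
As an illustration of the qualifier convention just described, here is a Python sketch that separates a trailing bracketed qualifier from the word and rebuilds a file name shaped like the example above. The name pattern is inferred only from the two file names quoted on this page, so treat both helpers as assumptions rather than the RecordWizard's real logic.

```python
import re

# A trailing "(...)" group, with optional surrounding whitespace.
QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")


def split_qualifier(entry: str):
    """Return (transcription, qualifier); qualifier is None when absent."""
    match = QUALIFIER.search(entry)
    if match is None:
        return entry.strip(), None
    return entry[:match.start()].strip(), match.group(1).strip()


def record_filename(lang_qid: str, iso: str, speaker: str, entry: str) -> str:
    """Build a name shaped like the example quoted above (pattern inferred)."""
    return f"LL-{lang_qid} ({iso})-{speaker}-{entry}.wav"
```

Because the qualifier stays in the file name, "fils (enfant)" and a second "fils" entry with another qualifier end up as two distinct files instead of one overwriting the other.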

It is good that this is possible in principle. But how can I know that I am recording a homonym of something already recorded? Infovarius (talk) 21:51, 27 December 2018 (UTC)

How to properly credit lists

Done: no built-in solution as of now, issue opened (T212671), current hack: put the source on the talk page. Yug (talk) 10:53, 31 December 2018 (UTC)

(T212671) I attempted this List:Pol/words-by-frequency-2001-to-4000#Source, but loading the list in the Record Studio keeps the source section as a word to record. Is there a known trick to hide this source section in the Record Studio ? Yug (talk) 16:56, 28 December 2018 (UTC)

Upload errors

Hi, I'm running into a rather odd problem. When I have finished recording, I choose to publish to Commons; some of my recordings get published and then it crashes. After that, I cannot upload again for a certain period of time. What should I do? Lepticed7 (talk) 21:17, 29 December 2018 (UTC)

Feature idea: table tracking existing languages on LinguaLibre.fr

I have difficulty keeping track of all the languages I helped add to LinguaLibre. Taiwan has 16 languages and 42 local variations. Maybe it already exists... If not, it would be good to have a sortable table such as the one below:

Wikidata qid	LinguaLibre qid	English name	Language group	Active?	Number of recordings
Q718269	Q51871	Sakizaya	Taiwanese	Low	6
...	...	...	...	...	...

Yug (talk) 12:16, 31 December 2018 (UTC)

Revision as of 12:19, 31 December 2018 by Yug (talk | contribs) (→‎Feature idea : table tacking existing languages on LinguaLibre.fr)