Difference between revisions of "Psubhashish"

Latest revision as of 18:29, 19 November 2023

Rewards

This user made more than 50,000 recordings on Lingua Libre

SPEAKER OF THE MONTH - 07/2021

This user was the most active speaker in 07/2021, with 5960 recordings made during the month.

SPEAKER OF THE MONTH - 08/2021

This user was the most active speaker in 08/2021, with 8381 recordings made during the month.

SPEAKER OF THE MONTH - 10/2021

This user was the most active speaker in 10/2021, with 5464 recordings made during the month.

SPEAKER OF THE MONTH - 11/2021

This user was the most active speaker in 11/2021, with 5536 recordings made during the month.

SPEAKER OF THE MONTH - 01/2022

This user was the most active speaker in 01/2022, with 7259 recordings made during the month.

SPEAKER OF THE MONTH - 08/2022

This user was the most active speaker in 08/2022, with 5003 recordings made during the month.

SPEAKER OF THE MONTH - 03/2023

This user was the most active speaker in 03/2023, with 3346 recordings made during the month.

SPEAKER OF THE MONTH - 06/2023

This user was the most active speaker in 06/2023, with 3974 recordings made during the month.

Babel user information

ori

This user contributes to Lingua Libre in Odia.

I AM ON A BREAK to recalibrate, focus on other life priorities, regain energy to come back to this beautiful project soon.

I am a Wikimedian, documentary filmmaker and National Geographic Explorer. I am interested in studying access, decolonization of knowledges and the free-culture movement. I have been active in language documentation with a focus on endangered languages and the use of multimedia as a democratic tool. I have also been an organizational leader and have served in both professional and volunteer-advisory roles at the Internet Society, Wikimedia Foundation, Mozilla, Centre for Internet and Society, Creative Commons, Digital Language Diversity Project (DLDP), Wikitongues and the now-defunct ScholarlyHub.

I am very interested in publicly owned and publicly governed multimedia archives, and I volunteer some of my time to building them. I have contributed over 68,000 pronunciation recordings on Lingua Libre and over 4,000 sentence recordings on Mozilla Common Voice. My primary contribution to Lingua Libre is in the Central (Mugalbandi) and Baleswari dialects of the Odia language.

LinguaLibre/other pronunciation-related publications

Subhashish Panigrahi (2022), Building a Public Domain Voice Database for Odia, Companion Proceedings of the Web Conference 2022, Virtual Event, Lyon, France, pp. 1331–1338, DOI: 10.1145/3487553.3524931, ISBN: 978-1-4503-9130-6
Subhashish Panigrahi (2022), Building a 50,000 pronunciation data repository in the Odia language, Diff

Things I have made/broken

Prepare words for Lingua Libre: a tool to copy text from any source and clean it up to create a list of words ready to be used in the RecordWizard of Lingua Libre.
Kathabhidhana: an open-source toolkit to record a large number of words in any language (inspired by another open project by T. Shrinivasan) (see tweet thread, coverage on Rising Voices, blog, selected talk at Wikimania 2017 and coverage in the French Wikipedia newsletter RAW)
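
The clean-up step that "Prepare words for Lingua Libre" describes can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function names `strip_edges` and `prepare_word_list` are hypothetical, and the rules (split on whitespace, trim punctuation, symbols and digits from both ends of each token, drop duplicates) are assumptions about what "clean up" could mean here.

```python
import unicodedata


def _is_word_char(ch: str) -> bool:
    """Letters (L*) and combining marks (M*) count as part of a word."""
    return unicodedata.category(ch)[0] in ("L", "M")


def strip_edges(token: str) -> str:
    """Trim non-word characters (punctuation, digits, symbols) from both ends."""
    start, end = 0, len(token)
    while start < end and not _is_word_char(token[start]):
        start += 1
    while end > start and not _is_word_char(token[end - 1]):
        end -= 1
    return token[start:end]


def prepare_word_list(text: str) -> list[str]:
    """Return the unique words of `text` in first-seen order."""
    text = unicodedata.normalize("NFC", text)
    seen: set[str] = set()
    words: list[str] = []
    for token in text.split():
        word = strip_edges(token)
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


if __name__ == "__main__":
    sample = "ଉଦ୍ଦେଶ୍ୟରେ, hello। hello (world) 123"
    # One word per line, ready to paste into a local list.
    print("\n".join(prepare_word_list(sample)))
```

Characters inside a word, such as the Odia virama or a zero-width joiner, are left untouched because only the edges of each token are trimmed.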

Personal lists

All lexeme forms in Odia missing a pronunciation (inspired by Adithya K's query)
All standard Odia (all lists)
List:Ory/Baleswaria (Baleswaria dialect of Odia; ongoing, 1,046 words as of June 1, 2020; words with recording completed)
TBD: Baleswari words from Ordia Purnachandra Bhashakosha
List of Places in Odisha (villages, towns, administrative blocks, etc.). Words collected from:
- Kalahandi district official site
- Malkangiri (village names missing)
- Koraput (village names exist)
- Bhadrak (village names exist)
- Rayagada (village names missing)

Potential bugs or required features

Kind (issue/new feature request)	Summary	Context/Steps to reproduce	Response
Suspected issue	Words already uploaded using LL do not get removed while creating a new list	1. Included the word "ଉଦ୍ଦେଶ୍ୟରେ" in a new batch and selected "Remove words already recorded" while loading words from a local list. 2. Even though the word already exists in two places (first, second; both uploaded using LL) on Commons, it does not appear as a duplicate in the LL Record Wizard.	Hi @Psubhashish could you please try to reproduce this issue with recordings that were not renamed? Just to be sure: the Record wizard can only remove words that the current speaker already recorded, for the moment it can't remove words recorded by other speakers (there is a ticket on phabricator asking for this feature). — WikiLucas (🖋️) 12:13, 18 August 2021 (UTC)
Feature	LL helps remove words that are already recorded. But there is no way to download those words. This would help a lot in creating a list locally.		Could you develop a little bit your idea please? You would like to export a textual file containing the words that you already recorded? Or you would like to download the sound files you uploaded? — WikiLucas (🖋️) 12:15, 18 August 2021 (UTC) @WikiLucas00 ha ha you caught me off guard! I was trying to make a rough list as I am discovering new things here first before fleshing out suggestions for improvement. By listing, I mean a text file containing the words recorded, not the audio file. I guess one can download audio files from Commons in bulk too. But that's another question and do share if there is a way that you might know. --Subhashish (talk) 09:32, 21 August 2021 (UTC) @Psubhashish Using Petscan and your Lingua Libre category on Commons, you can export the text list of all your recorded files. Here is the query. You can change the output to plain text, wikicode, json etc if you want to (in the Output tab). I hope this fits to your needs. All the best — WikiLucas (🖋️) 15:17, 21 August 2021 (UTC)
Feature	Number counter while reviewing recorded audio	While reviewing recorded audio it is not possible to see the change in the counter at the bottom. For instance, I am reviewing the recorded audio number 10 and the total number of recorded sounds is 300. I cannot see the exact number of a particular sound in the counter.
Issue	RecordWizard field "Spoken languages" is confusing.	Should one add all the languages/dialects they know or the one they are going to speak in the next step in a particular batch? If I am a speaker who is multilingual (which is the case for most people in South Asia), I'd prefer that the form asks me the specific dialect/language I am going to speak in a batch. I might speak six languages but they are not relevant for each word in a particular batch.
Issue	"Place of residence" is meaningless without the "place of language learning".	One might have learned a language in one place but might be living in another. The latter might or might not have an impact on the language that they speak. However, where they learned the language is very important (in most cases).
Feature	Need an option to record offline and upload/sync when connected to the internet	I am planning a workshop to record the pronunciation of words in an indigenous language in a remote place. This would mean traveling to places with probably no internet connectivity, recording there offline, and uploading to LL later when connected to the internet. It might be possible to run a MediaWiki + Wikibase environment locally by forking LL. There are two challenges: a. I don't know yet how to set up such an environment locally. b. I don't know how to enable the local wiki to talk to LL once it is connected to the internet.
Potential feature	How to record words in a language with no writing system/script?	When a language is only oral and has no formal writing system/script of its own, the International Phonetic Alphabet (IPA) is often used by linguists to "write" the pronunciations. Will IPA-based word listing work on LL? Another possibility in such a case is the speaker's familiarity with a neighboring dominant script. This can be problematic on many levels (for starters, colonization by users of dominant scripts) but can be a temporary fix just for the field recording. If such recordings are made and uploaded, how can they be converted into IPA later so that the file names do not show the dominant script?
Feature	Parsing words from any public web page	Legally and technically, words per se are not copyrighted. Hence, parsing a page and creating a list of words is a great way to prepare recordings of words from different topics. Wikipedia categories or Wiktionary entries are not always diverse, because their scope is limited to the personal interests of active Wikimedians, and a good amount of content doesn't make its way to these projects because of citation issues (not everything that is public is citable, but such sources might have many words on a particular topic and hence are of interest to LL).
Bug	All words under a dialect (e.g. Baleswari-Odia) should be listed under the language (e.g. Odia) in Statistics	A language being a superset of its dialects, all words recorded under a dialect should be listed under the language as well. Right now each dialect has its own category on the Statistics page, which is great. But these words do not appear in the total number of recordings under their respective language name.
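
The roll-up asked for in the last row (dialect recordings also counted under their language) can be sketched in a few lines of Python. This is only an illustration, not how the Lingua Libre Statistics page is implemented: the function name `roll_up_dialect_counts`, the `parent_of` mapping and the numbers in the usage example are all hypothetical, and the sketch assumes a single dialect-to-language level.

```python
from collections import Counter


def roll_up_dialect_counts(
    counts: dict[str, int], parent_of: dict[str, str]
) -> dict[str, int]:
    """Add each dialect's recording count to its parent language's total.

    `counts` maps a category name to its own number of recordings;
    `parent_of` maps a dialect name to the language it belongs to.
    Only one level of nesting is handled (assumption).
    """
    totals: Counter[str] = Counter()
    for name, n in counts.items():
        totals[name] += n            # the category keeps its own count
        parent = parent_of.get(name)
        if parent is not None:
            totals[parent] += n      # and also contributes to its language
    return dict(totals)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    own_counts = {"Odia": 100, "Baleswari Odia": 20}
    parents = {"Baleswari Odia": "Odia"}
    print(roll_up_dialect_counts(own_counts, parents))
```

Each dialect keeps its own row, so the per-dialect view that already exists is preserved while the language total grows to include it.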

@@ Line 1: / Line 1: @@
 {{Userboxtop|Rewards}}
+{{50k barnstar}}
 {{Speaker of the month|07/2021|5960}}
+{{Speaker of the month|08/2021|8381}}
+{{Speaker of the month|10/2021|5464}}
+{{Speaker of the month|11/2021|5536}}
+{{Speaker of the month|01/2022|7259}}
+{{Speaker of the month|08/2022|5003}}
+{{Speaker of the month|03/2023|3346}}
+{{Speaker of the month|06/2023|3974}}
 {{Userboxbottom}}
 {{#babel:records-ori}}
-[[:w:Odia language|Odia-language]] [[:w:Odia Wikipedia|Wikipedian]], documentary filmmaker, [https://www.nationalgeographic.org/find-explorers/subhashish-panigrahi National Geographic Explorer] and former community manager in noted nonprofits including Wikimedia Foundation, Mozilla, Centre for Internet and Society and Internet Society; Open Culture and Creative Commons advocate. I have contributed in recording over 4,000 pronunciations using Lingua Libre and even before LinguaLibre was launched in the Odia language (both standard pronunciation and the [[:w:Baleswari Odia|Baleswaria]] dialect)
-== Lists ==
+:: ''I AM ON A BREAK to recalibrate, focus on other life priorities, regain energy to come back to this beautiful project soon.''
-* [[List:Ory/All standard Odia|All standard Odia]]
+I am a Wikimedian, documentary filmmaker and [https://www.nationalgeographic.org/find-explorers/subhashish-panigrahi National Geographic Explorer]. I am interested in studying access, decolonization of knowledges and the free-culture movement. I have been active in language documentation with a focus on endangered languages and the use of multimedia as a democratic tool. I also have been an organizational leader and have served both in professional and volunteer-advisory roles at the Internet Society, Wikimedia Foundation, Mozilla, Centre for Internet Society, Creative Commons, Digital Language Diversity Project (DLDP), Wikitongues and now defunct ScholarlyHub.
+I am very interested in publicly-owned and public-governed multimedia archives, and I put some volunteer time into action. I have contributed over [https://lingualibre.org/wiki/LinguaLibre:Stats/Speakers 68,000 pronunciation recordings On Lingua Libre] and over 4,000 sentence recordings on Mozilla Common Voice. My primary contribution to Lingua Libre is in the [[:w:Odia_language#Standardization_and_dialects|Central]] (''Mugalbandi'') and [[:w:Baleswari Odia|Baleswari]] dialects of the Odia language.
+== LinguaLibre/other pronunciation-related publications ==
+* Subhashish Panigrahi (2022), [https://dl.acm.org/doi/10.1145/3487553.3524931 Building a Public Domain Voice Database for Odia], Companion Proceedings of the Web Conference 2022, Virtual Event, Lyon, France, pp. 1331–1338, DOI: 10.1145/3487553.3524931, ISBN: 978-1-4503-9130-6
+* Subhashish Panigrahi (2022), [https://diff.wikimedia.org/2022/03/10/building-a-50000-pronunciation-data-repository-in-the-odia-language/ Building a 50,000 pronunciation data repository in the Odia language], Diff
+== Things I have made/broken ==
+* [[/tools/Prepare words for Lingua Libre|Prepare words for Lingua Libre]]: a tool to copy text from any source and clean up to create a list of words ready to be used in RecordWizard of Lingua Libre.
+* [https://github.com/ofdn/Kathabhidhana Kathabhidhana]: an open-source toolkit to record a large number of words in any language (inspired from another open project by T. Shrinivasan) (see [https://twitter.com/i/events/898061810217213956 tweet thread], [https://rising.globalvoices.org/blog/2017/03/28/a-new-audio-uploading-tool-for-crowdsourced-wiktionary-project-in-odia-language/ coverage on Rising Voices], [https://opensource.com/article/17/5/simple-command-line-tool-recording-audio blog], [https://wikimania2017.wikimedia.org/wiki/Submissions/Kathabhidhana:_Recording_words_for_Wiktionary_and_preparing_for_an_AI_assistant selected talk at Wikimania 2017] and coverage on [https://fr.m.wikipedia.org/wiki/Wikip%C3%A9dia:RAW/2017-05-25 French Wikipedia newsletter RAW])
+== Personal lists ==
+* [https://w.wiki/77Hs All lexeme forms in Odia missing a pronunciation] (inspired by Adithya K's [https://w.wiki/77Hu query])
+* [[List:Ory/All standard Odia|All standard Odia]] ([https://lingualibre.org/index.php?search=&search=List%3AOry all lists])
 * [[List:Ory/Baleswaria]] ([[:w:Baleswari Odia|Baleswaria]] dialect of Odia; ongoing, total #words: 1046 by June 1, 2020; [[List:Ory/Baleswaria/recording_complete|words with recording completed]])
 * TBD: [[:or:wikt:ଶ୍ରେଣୀ:ବାଲେଶ୍ୱରୀ ଶବ୍ଦ|Baleswari words from Ordia Purnachandra Bhashakosha]]
+* [[List:Ori/Places of Odisha|List of Places in Odisha]] (villages, Towns, Administrative blocks, etc.)। Words collected from:
+** [https://kalahandi.nic.in/od/%e0%ac%97%e0%ad%8d%e0%ac%b0%e0%ac%be%e0%ac%ae-%e0%ac%93-%e0%ac%aa%e0%ac%9e%e0%ad%8d%e0%ac%9a%e0%ac%be%e0%ad%9f%e0%ac%a4/ Kalahandi district official site]
+** [https://malkangiri.nic.in/od/ Malkangiri] (village names missing)
+** [https://koraput.nic.in/od/ Koraput] (village names exist)
+** [https://bhadrak.nic.in/od/ Bhadrak] (village names exist)
+** [https://rayagada.nic.in/od/ Rayagada] (village names missing)
-== Issues to report ==
+== Potential bugs or required features ==
-=== Words already uploaded using LL does not get removed while creating a new list ===
+{| class="wikitable"
-Steps:
+|-
+! Kind (issue/new feature request) !! Summary !! Context/Steps to reproduce !! Response
+|-
+| Suspected issue
+|| Words already uploaded using LL does not get removed while creating a new list
+||
 # Included a word "ଉଦ୍ଦେଶ୍ୟରେ" in a new batch and selected "Remove words already recorded" while loading words from a [[List:Ory/All standard Odia|local list]]
 # Even though the word already exists in two places ([[:commons:File:Or-ଉଦ୍ଦେଶ୍ୟରେ.wav|first]], [[:commons:File:Or-ଉଦ୍ଦେଶ୍ୟରେ 01.wav|second]] -- both uploading using LL) on Commons, it does not appear as a duplicate on LL Record Wizard
-: Hi {{ping|Psubhashish}} could you please try to reproduce this issue with recordings that were not renamed? Just to be sure: the Record wizard can only remove words that the current speaker already recorded, for the moment it can't remove words recorded by other speakers (there is a [[phabricator:T231559|ticket on phabricator]] asking for this feature). — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 12:13, 18 August 2021 (UTC)
+||
+Hi {{ping|Psubhashish}} could you please try to reproduce this issue with recordings that were not renamed? Just to be sure: the Record wizard can only remove words that the current speaker already recorded, for the moment it can't remove words recorded by other speakers (there is a [[phabricator:T231559|ticket on phabricator]] asking for this feature). — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 12:13, 18 August 2021 (UTC)
-=== Technical: Feature request ===
+|-
-* LL helps remove words recorded already. But there is no way to download that word. This would help a lot in creating a list locally.
+| Feature
+|| LL helps remove words recorded already. But there is no way to download that word. This would help a lot in creating a list locally.
+||
+||
 :Could you develop a little bit your idea please? You would like to export a textual file containing the words that you already recorded? Or you would like to download the sound files you uploaded? — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 12:15, 18 August 2021 (UTC)
+:: {{ping|WikiLucas00}} ha ha you caught off guard! I was trying to make a rough list as I am discovering new things here first before fleshing out suggestions for improvement. By listing, I mean a text file containing the words recorded, not the audio file. I guess one can download audio files from Commons in bulk too. But that's another question and do share if there is a way that you might know. --[[User:Psubhashish|Subhashish]] ([[User talk:Psubhashish|talk]]) 09:32, 21 August 2021 (UTC)
+:::{{ping|Psubhashish}} Using Petscan and your Lingua Libre category on Commons, you can export the text list of all your recorded files. [https://petscan.wmflabs.org/?psid=19878687 Here is the query]. You can change the output to plain text, wikicode, json etc if you want to (in the Output tab). I hope this fits to your needs. All the best — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 15:17, 21 August 2021 (UTC)
+|-
+| Feature
+|| Number counter while reviewing recorded audio
+|| While reviewing recorded audio it is not possible to see the change in the counter at the bottom. For instance, I am reviewing the recorded audio number 10 and the total number of recorded sounds is 300. I cannot see the exact number of a particular sound in the counter.
+||
+|-
+| Issue
+|| RecordWizard field "Spoken languages" is confusing.
+|| ''Should one add all the languages/dialects they know or the one they are going to speak in the next step in a particular batch? If I am a speaker who is multilingual (which is the case for most people in South Asia), I'd prefer that the form asks me the specific dialect/language I am going to speak in a batch. I might speak six languages but they are not relevant for each word in a particular batch.''
+||
+|-
+| Issue
+|| "Place of residence" is meaningless without the "place of language learning".
+|| ''One might have learned a language in one place but might be living in another. The latter might or might not have impact on the language that they speak. However, where they learned the language is very important (in most cases).''
+||
+|-
+| Feature
+|| Need an option to record offline and upload/sync when connected to the internet
+||
+* I am planning for a workshop to record pronunciation of words in an indigenous language in a remote place. This would mean traveling to places with probably no internet connectivity, and then recording there offline, and uploading to LL later when connected to the internet.
+* This might be possible to have a MediaWiki + Wikibase environment locally by forking LL. There are two challenges:
+:: a. I don't know yet how to set up one such environment locally.
+:: b. I don't know how to enable the local wiki to speak to LL when connecting to the internet
+||
+|-
+|| Potential feature
+|| How to record words in a language with no writing system/script?
+||
+* When a language is only oral and has no formal writing system/script of its own, International Phonetic Alphabet (IPA) is often used by linguists to "write" the pronunciations. Will IPA-based word listing work on LL?
+* Another possibility in such a case is the speaker's familiarity of a neighboring dominant script. This can be problematic in many levels (for starters, colonization by users of dominant scripts) but can be a temporary fix just for the field recording. If such recordings are made and uploaded, how can they be converted into IPA later so that the file names do not show the dominant script?
+||
+|-
+|| Feature
+|| Parsing words from any public web page
+|| Legally and technically, words per se are not copyrighted. Hence, parsing and creating a list of words is a great way to make way for recording words from different topics. Wikipedia categories or Wiktionary entries are not always diverse, considering their diversity scope is limited to the personal interest of active Wikimedians and/or a good amount of content don't make their way to these projects because of citation issues (not everything that is public is citable -- they might have many words in a particular topic though and hence are of interest to LL).
+||
+|-
+|| Bug
+|| All words under a dialect (e.g. Baleswari-Odia) should be listed under the language (e.g. Odia) in Statistics
+|| A language being a superset of a dialect, all words recorded under a dialect should be listed under a language as well. Right now each dialect has its own category in the Statistics page which is great. But these words do not appear in the total number of recordings its respective language name.
+||
+|}