Sunday 16 August 2020

On spelling and transliteration

When you write a blog on translations of songs between languages, spelling and transliteration issues will definitely come to bite. Let me comment on the spelling/transliteration I use for each language I've worked with.
  1. Languages with standard orthographies in the Latin alphabet: Italian, English, French, German, Spanish, Portuguese, Galician, Swedish, Indonesian, Xhosa, Zulu, Haitian Creole, Latin, Venetian, Friulian, Occitan, Middle English, Swahili, Finnish, Albanian, Czech, Slovak, Croatian, Vietnamese, Turkish, Irish, Romanian, Danish, Hungarian, Lingála; for these, there isn't really much of an issue; just a couple of special mentions: for Vietnamese, I decided to link syllables of single words with dashes, whereas the normal orthography separates all syllables with spaces (so I write "cô-bé" for "girl" while it's normally written "cô bé"), and for Irish I use both the standard and my own creation, which was conceived to be easier for me to crack, but I don't really remember exactly how it was supposed to work; oh, and in Indonesian I try to mark all /e/ as é to distinguish them from schwas, which the standard orthography doesn't do; and I try to mark tones and the e/ɛ and o/ɔ distinctions in Lingála, which isn't always done;
  2. Languages spelled with the Latin alphabet for which I either know of no standard, or of multiple competing standards: Romagnolo, Neapolitan, Sicilian, Mende; let's look at these one by one:
    1. For Romagnolo, I'm transcribing a specific variety, and I devised my own orthography; there is a standard orthography, but I don't know how it works; judging by a dictionary I consult, I'm really not too far off, the main difference being that the dictionary transcribes a dialect with /e/ and /e:/~/ej/ distinct, so it uses ē for the long vowel / diphthong and é for the short one, whereas I, not having the short vowel, just use é for the diphthong;
    2. For Sicilian, I know of two ways to spell autochthonous sounds, but I decided to devise my own conventions;
    3. For Neapolitan, there are three conventions regarding the spelling of the schwa sound: Neapolitan proper spells schwas as what they were before vowel reduction; Abruzzese dialects (as well as the San Benedetto del Tronto dialect, and perhaps others as well) use e for the schwa; Facebook illiterate dialect speakers don't write the schwa at all; as for me, I usually stick to whatever convention the dialect I'm working with uses, perhaps marking schwas with some diacritic; I guess it's typically using overdotted non-reduced vowels in Neapolitan proper, and ë or ė for other "nap-code" dialects;
    4. For Mende, there are distinct conventions as regards: vowel length, either using macrons or doubling vowels; and e/ɛ and o/ɔ, where these symbols are used sometimes, the distinction is sometimes ignored, and sometimes you find ẹ/e (or e/ẹ?) and o/ọ (or vice versa), and then there is o̱ which I assumed was I-don't-recall-which of these sounds; I use e/ɛ, o/ɔ, and macrons;
  3. Languages with a standard orthography that uses a non-Latin alphabet: Russian, Ukrainian, Mandarin, Cantonese, Japanese, Modern and Ancient Greek, Korean, Arabic, Hindi, Urdu, Persian, Bulgarian, Hebrew; for the native orthography, no issue; for the transliteration, as with any transliteration, I strive for back-transliteratability as well as phonetic transparency; let's discuss each language:
    1. My romanization of Russian is a modification of some "standard" systems; here's how I transliterate each Cyrillic letter:
      Cyrillic | Latin
      Аа | Aa
      Бб | Bb
      Вв | Vv
      Гг | Gg
      Дд | Dd
      Ее | Ĵe ĵe
      Ёё | Ĵo ĵo
      Жж | Žž
      Зз | Zz
      Ии | Ii
      Йй | Jj
      Кк | Kk
      Лл | Ll
      Мм | Mm
      Нн | Nn
      Оо | Oo
      Пп | Pp
      Рр | Rr
      Сс | Ss
      Тт | Tt
      Уу | Uu
      Фф | Ff
      Хх | Kh kh
      Цц | Ts ts
      Чч | Ćć (except when it's pronounced š, in which case it's Čč)
      Шш | Šš
      Щщ | Śś
      Ъъ | `
      Ыы | Yy
      Ьь | '
      Ээ | Ee
      Юю | Ĵu ĵu
      Яя | Ĵa ĵa

      So there is one case where I privileged phonetic transparency over back-transliteratability: the choice not to use c for the ts sound; note that the ĵ has a circumflex, which was placed there specifically to have back-transliteratability; I originally used carons instead, but ě́ and similar didn't render well in the captions for this video (or was it the title?), so I switched; NOTE: I was checking up transliteration explanations when I went «Oh, so I use ` for the tvĵórdyj znak?», so I may have transliterated that differently or not at all in some places;
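A minimal Python sketch of the Russian table, in case it helps to see the mapping applied mechanically. The name `romanize_russian` is made up for illustration, and the sketch rests on two simplifying assumptions: Ч always becomes ć (the "pronounced š" exception needs a human), and no stress accents are added (so it gives "tvĵordyj", not "tvĵórdyj").

```python
import unicodedata

# Lowercase Cyrillic -> Latin, following the table for Russian.
RU_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "\u0135e", "ё": "\u0135o",            # ĵe, ĵo
    "ж": "\u017e", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "kh", "ц": "ts",
    "ч": "\u0107",                             # ć (the š-pronounced case is not detected)
    "ш": "\u0161", "щ": "\u015b",              # š, ś
    "ъ": "`", "ы": "y", "ь": "'", "э": "e",
    "ю": "\u0135u", "я": "\u0135a",            # ĵu, ĵa
}


def romanize_russian(text: str) -> str:
    """Romanize letter by letter; characters outside the table pass through."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        latin = RU_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                     # spaces, punctuation, digits
        elif ch.isupper():
            out.append(latin.capitalize())     # Х -> Kh, Ё -> Ĵo
        else:
            out.append(latin)
    return unicodedata.normalize("NFC", "".join(out))


print(romanize_russian("твёрдый знак"))
```

Because every Cyrillic letter gets its own Latin spelling (ĵe vs e, ` vs '), the output can in principle be mapped back; the one place that breaks is ts, which is the trade-off mentioned above.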
    2. I honestly haven't put much thought into how I transliterate Ukrainian;
    3. For Mandarin, the native orthography is well-standardized, and the transliteration I use is Pinyin; there may be some oscillation on whether I mark "neutered" tones as neuter or as what the character's root tone is, for example 个 is gè, but is often pronounced ge; I suspect I normally mark that because Google does, and I fix Google's Pinyin when I transliterate stuff here rather than transliterating from scratch; spaces are mostly placed according to dictionary words, save for what I do with aspect particles guò le zhe, which I usually stick only to verbs; I use dashes for doubled verbs and adjectives, I think, and perhaps for some other 4-char phrases too;
    4. For Cantonese, the orthography is well-established, and the transliteration is Jyutping with tone numbers and dashes for syllable division within words; when I worked on my first Cantonese song, I was transliterating with Wiktionary which uses Yale, then I discovered CantoDict, a much more complete (at least at the time) dictionary using Jyutping; the switch may have left hybrids in my old file, so some transliterations may still be hybrids;
    5. For Japanese, the orthography is… fuzzy; I probably tend to over-kanji-ize, since I use furigana on every kanji, though this trend may have reduced in recent times; the romanization is Hepburn in any case, modulo using ou and ei as distinct from ō and ē which are reserved for おお and ええ and not applied to おう and えい; I wonder if I've ever attempted to distinguish ou and ei pronounced ō and ē from those pronounced ou and ei;
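The ō/ē versus ou/ei rule for Japanese can be sketched in a couple of lines of Python. This assumes the input is already a naive kana-by-kana romanization in lowercase (so おお arrives as "oo" and おう as "ou"); the function name `mark_long_vowels` is hypothetical, and the sketch does not try to detect morpheme boundaries where two o's merely happen to meet.

```python
def mark_long_vowels(kana_romaji: str) -> str:
    """Turn oo/ee (from おお/ええ) into ō/ē and leave ou/ei (おう/えい) as they are."""
    return kana_romaji.replace("oo", "\u014d").replace("ee", "\u0113")


for word in ("toori", "oneesan", "toukyou", "sensei"):
    print(word, "->", mark_long_vowels(word))
```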
    6. For Modern and Ancient Greek, the orthography is well-established (except for some Modern words like μαντήλι/μαντίλι, though I believe the rule is now no eta in loanwords? In any case, in the specific example, I used the eta because my lyrics used it); as for transliteration, again, a compromise between back-transliterability and phonetic transparency; I'm too lazy to make another table, so here's the Greek alphabet as transliterated for Modern Greek: a-b-g-d-e-z-i̱-th-i-k-l-m-n-x-o-p-r-s-t-y-f-kh-ps-ō, but then you have αι ει οι υι = ä ë ö ÿ and αυ ευ = ay̆ ey̆ or af̆/av̆ ef̆/ev̆, and that should be all; as for Ancient, a-b-g-d-e-z-ē-th-i-k-l-m-n-x-o-p-r-s-t-y-ph-kh-ps-ō;
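Since there is no table for Greek, here is a rough Python sketch that zips the Modern Greek alphabet with the hyphen-separated list above. Everything beyond that list is an assumption made for illustration: the function name `romanize_modern_greek`, mapping final ς like σ, and stripping accents before lookup (which also loses diaereses). The context-dependent αυ/ευ cases are not handled.

```python
import unicodedata

GREEK = "αβγδεζηθικλμνξοπρστυφχψω"
# The Modern Greek list from the text; i with macron below is written as an escape.
LATIN = "a-b-g-d-e-z-i\u0331-th-i-k-l-m-n-x-o-p-r-s-t-y-f-kh-ps-\u014d".split("-")
LETTERS = dict(zip(GREEK, LATIN))
LETTERS["ς"] = "s"  # assumption: final sigma is treated like σ

# αι ει οι υι = ä ë ö ÿ
DIGRAPHS = {"αι": "\u00e4", "ει": "\u00eb", "οι": "\u00f6", "υι": "\u00ff"}


def romanize_modern_greek(text: str) -> str:
    # Decompose and drop combining marks so that ή is looked up as η.
    decomposed = unicodedata.normalize("NFD", text.lower())
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    out = []
    i = 0
    while i < len(bare):
        pair = bare[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(LETTERS.get(bare[i], bare[i]))
            i += 1
    return "".join(out)


print(romanize_modern_greek("μαντήλι"), romanize_modern_greek("μαντίλι"))
```

The two spellings of the handkerchief word come out differently (i̱ for eta, plain i for iota), which is the back-transliteratability the scheme is after.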
    7. For Korean, I use the Wiktionary's Revised transliterations; I don't really take cases of non-phonetic Hangul into account, but I haven't really thought much about this anyway;
    8. For Arabic, the standard transliteration is just outright stupid; I mean, are you seriously having us distinguish LEFT AND RIGHT QUOTES? You gotta be kidding me, right? Since ayn sounds kind of like a nonstandard pronunciation of r I've heard in Italy, I decided that ř would be ayn (I mean, řayn), so the apostrophe is definitely 'ālif; for a few more remarks, I'll quote this post:
      The transliteration scheme is my own. In particular, I differ from the "standard" transliteration in the following respects: I use ŕ for the left-quote transliterating ŕayn (ع), because using left-quote and right-quote for two different sounds sounds like complete madness to me, and because the sound of ŕayn sounds like a particular kind of "r moscia" (nonstandard pronunciation of the Italian phoneme /r/); I prefer using th and dh to ṯ and ḏ for the dental fricatives (ث and ذ), because it is quicker to type and easier to read; I use gh for ghayn (غ), for similar reasons to th and dh; I guess I did not have to think about what to do with kha (خ), but either kh or x are fine: x is IPA, kh is just as easy to type and read; standard probably has ḵ, which I doff for the same reasons as ṯ and ḏ; should ta-ha, dal-ha or kaf-ha ever occur, to avoid confusion with tha, dhal and kha, I'd use t.h, d.h and k.h respectively; I doffed š for šin in favor of sh, with the same solution to sin-ha combinations, for the same reasons.
      And then I promptly forget about everything said above in my other Arabic post Problems, where the romanization is compatible with the following table:
      Letter | Transliteration
      ا | '/ā
      ب | b
      ت | t
      ث |
      ج | j
      ح |
      خ | x
      د | d
      ذ |
      ر | r
      ز | z
      س | s
      ش | š
      ص |
      ض |
      ط |
      ظ |
      ع | `
      غ | ğ
      ف | f
      ق | q
      ک | k
      ل | l
      م | m
      ن | n
      ه | h
      و | w/ū
      ی | y/ī
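The dot trick from the quoted scheme (th, dh, kh, gh, sh as single letters, with t.h, d.h, k.h, s.h for real two-letter sequences) is what keeps that romanization back-transliteratable. Here is a small Python sketch of how a reader or a script could split such a romanization into letters; `split_letters` is a made-up name, and the sketch only deals with the digraphs and the dot, nothing else in the scheme.

```python
DIGRAPHS = ("th", "dh", "kh", "gh", "sh")


def split_letters(roman: str) -> list[str]:
    """Split a romanized string into one token per Arabic letter."""
    tokens = []
    i = 0
    while i < len(roman):
        if roman[i] == ".":
            i += 1                      # the dot only separates, it is not a letter
        elif roman[i:i + 2] in DIGRAPHS:
            tokens.append(roman[i:i + 2])
            i += 2
        else:
            tokens.append(roman[i])
            i += 1
    return tokens


print(split_letters("th"))   # one letter
print(split_letters("t.h"))  # two letters
```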

    9. For Hindi, I use the Devanagari standard orthography as a reference for back-transliteratability; now, Hindi has this perfect system which could be fully phonetic… and then it decides to not use it as such; more specifically, it mutes a bunch of schwas, and virtually regularly turns "aha" into /ɛh(ɛ)/, like in the word पहले, which looks like "pahale" but is actually "peh(e)le", and "ahu" and "uha" into /ɔhɔ/, as in बहुत "bahut", pronounced "bohot"; for back-transliteratability, I use ä for /e/s that are spelled a, å and ů for a and u pronounced o (so "båhůt", "můhåbbat"), and ' to indicate mute schwas, so the word from before is "päh'le"; then it happens that some normally-muted schwas resurface in singing, in which case I use a literal schwa, ǝ; for example, एक looks like eka, but is actually pronounced ek', except in "ekǝ din" found in this video; I use ai and au for ऐ and औ, as is standard; then there is the matter of anusvāra and candrabindu; normally, these are nasalizing diacritics: put one on इ i, and it becomes nasalized; in that case, I use ṅ for the anusvāra and ṃ for the candrabindu; for example, the verb "to be" (honā) conjugates हूँ, है, है, हैं, हो, हैं, a.k.a. hūṃ, hai, hai, haiṅ, ho, haiṅ; then there's words like चांद or संबंध, which look like cāṅd' and saṅbaṅdh', but where the anusvāras are pronounced as nasals that assimilate to the next consonant; I use ń in those cases, so cāńd' and sańbańdh'; I don't know if candrabindus can do that, but if they can and I ever find one that does, I'll definitely use ḿ; I believe that's all; note that this was conceived over a long time, so there may be leftover errors that I missed when correcting posts – especially in the one I haven't corrected yet [as of the last edit to this post, I wonder if by 1/7/21 I had done that…]; this also means that the same word in Hindi or Urdu spelling would be transliterated differently;
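To make the Hindi back-transliteratability idea concrete, here is a hedged Python sketch that takes a word in the scheme above and recovers both readings: what the Devanagari spelling "looks like" and roughly how it is pronounced. The function names are hypothetical, only the marks discussed above are covered (ä, å, ů, the mute-schwa apostrophe, the resurfaced schwa, ń), and the "pronounced" output is just the rough respelling used in the text, not IPA.

```python
import unicodedata

# What each special mark stands for in the Devanagari spelling.
AS_SPELLED = str.maketrans({
    "\u00e4": "a",       # ä: /e/ spelled a
    "\u00e5": "a",       # å: a pronounced o
    "\u016f": "u",       # ů: u pronounced o
    "'": "a",            # mute schwa, still present in the spelling
    "\u01dd": "a",       # ǝ: schwa resurfacing in singing
    "\u0259": "a",       # same, if typed as U+0259
    "\u0144": "\u1e45",  # ń: assimilated anusvāra, spelled like ṅ
})

# Rough pronunciation respelling.
AS_PRONOUNCED = str.maketrans({
    "\u00e4": "e",
    "\u00e5": "o",
    "\u016f": "o",
    "'": "",             # mute schwa is not pronounced
})


def as_spelled(word: str) -> str:
    return unicodedata.normalize("NFC", word).translate(AS_SPELLED)


def as_pronounced(word: str) -> str:
    return unicodedata.normalize("NFC", word).translate(AS_PRONOUNCED)


for w in ("p\u00e4h'le", "b\u00e5h\u016ft", "ek'"):
    print(w, "| spelled:", as_spelled(w), "| pronounced:", as_pronounced(w))
```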
    10. Speaking of, let's deal with the ABSOLUTE MESS that the Urdu script is; let's make a table of letters, transliterations, and letter names:
      Letter | Transliteration | Name
      ا | ', â | 'alif
      ب | b |
      پ | p |
      ت | t |
      ٹ | | ṭê
      ث | | s̱ê
      ج | j | jîm
      چ | c |
      ح | h | baṛî hê, huttî hê
      خ | x |
      د | d | dâl
      ڈ | | ḍâl
      ذ | | ẕâl
      ر | r |
      ڑ | | ṛê
      ز | z |
      ژ | ž | žê
      س | s | sîn
      ش | š | šîn
      ص | | ṡẃâd
      ض | | ẑẃâd
      ط | | ṫôê
      ظ | ż | żôê
      ع | ` | `ě̱n
      غ | ğ | ğě̱n
      ف | f |
      ق | q | qâf
      ک | k | kâf
      گ | g | gâf
      ل | l | lâm
      م | m | mîm
      ن | n | nûn
      ں | ṅ, ṇ | nûn ğunnaĥ
      و | w, û, ô, ô̱, ŭ | wâŵo
      ه | ḥ, ĥ | gôl ḥê, cḫôṭî ḥê
      ھ | | dô cašmî ḥê
      ی | y, î, ě, ě̱, ẹ̌, ĭ? | cḫôṭî yê
      ے | ê, ê̱ | baṛî yê
      أ/إ | '' | 'alif hamzaĥ
      ئ | ŷ | yê hamzaĥ
      ؤ | ŵ | wâŵo hamzaĥ

      If you're going «What the heck?!», that's exactly what I thought; a few remarks:
      • First of all, ẃ is probably ONLY in the names ṡẃâd and ẑẃâd, because there is a w in the pronunciation, but it is not written; and I thought the consonants were the strong point of Urdu…;
      • Next up, yes, s̱ê and sîn and ṡẃâd, as well as zê and ẕâl and ẑẃâd and żôê, as well as tê and ṫôê, are pronounced exactly the same; they are purely etymological distinctions for Arabic terms;
      • ḫ just indicates the aspiration of the previous consonant;
      • As for baṛî hê and cḫôṭî ḥê, they are pronounced the same, except when the latter is silent, in which case I write ĥ;
      • Nûn ğunnaĥ was apparently created to nasalize vowels, hence my ṅ, the same as for the anusvār' in Hindi; they even created ڻ for the retroflex nasal ṇ, it would seem; and then I find दर्पण darpaṇ' spelled with a nûn ğunnaĥ; like WTF? So I guess they ditched the specific character, and used nûn ğunnaĥ for ṇ too, hence my double transliteration; also, I think medial and initial forms of nûn ğunnaĥ and of plain nûn coincide?
      • For `ě̱n, I always make it a silent `; some argue it represents vowels in some places, but I bet it's only ever found in (ultimately) Arabic loans, where it was an actual /ʕ/, hence my `;
      • And speaking of vowels, they are a complete mess; so, 'alif is used either as etymological, in which case I use ', or to represent a long a (the आ ā of Hindi), in which case I use â; I haven't developed a convention for 'alif madda (whatever that is spelled like), i.e. آ, but I guess it would have to be 'â;
      • Wâŵo is either w, its root sound, or used to represent the sounds ओ o and औ au and उ u and ऊ ū of Hindi, which I render respectively as ô, ô̱ (note the underline), ŭ, and û; because not writing short vowels was bad enough, now we have some of them written and some not, and I think there are also some unwritten ऊ ū, and am pretty sure some ओ o and औ au aren't written either;
      • Yê is the biggest mess of all; well, *"the yê's are"; so, yê is of course used for y, and for long ई ī, which I render y and î respectively; all good there; then you have ए e and ऐ ai in Hindustani, right? What do you do for those? Apparently they created baṛî yê ad hoc… and then restricted it to word end; yes, you heard that right; they create a glyph for two sounds, and then restrict it to the end of words, and use, guess what, yê (I mean cḫôṭî yê) for those same sounds in other positions; like, seriously? Anyway, ě = ए e spelled with cḫôṭî yê, ê = ए e spelled with baṛî yê, ě̱ = ऐ ai spelled with cḫôṭî yê, ê̱ = ऐ ai spelled with baṛî yê, and then since I'm crazy I decided I would specify when something is an ezafe (you know that Persian grammar construct which Urdu borrowed? Yeah, that one), and went with ẹ̌; that is AFAIK pronounced as Hindi ए e, never ऐ ai, and that is a blessing because I couldn't render such a difference, that would be too many diacritics; finally, as with w, I suppose y could also happen to write short इ i at times, which is why I have "ĭ?";
      • Then we have the hamza carriers, which are their consonant representations with a circumflex, except for 'alif hamzaĥ where apostrophe-circumflex '̂ would look terrible, so I just gave up some degree of back-transliteratability – or did I? You judge if '' can be mistaken for double 'alif; actually, I think that's disallowed, and the madda ˜ was invented in Arabic just to avoid that combo; btw, WTH is up with the keyboard layout that doesn't have any of these except for ŷ, which it has on layer 1? I don't think a lone hamzaĥ is allowed in Urdu, so that is not a problem luckily;
      • I may have to come back to this because apparently ezafe can also be written with heh (see Persian below), but I haven't seen that yet in Urdu.
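The wâŵo and yê vowel symbols from the remarks above can be collected into one lookup table, which makes it easier to check what a given diacritic stack is supposed to mean. This is only a sketch in Python: the names `URDU_VOWEL_SYMBOLS` and `decode_vowel` are made up, the uncertain "ĭ?" is left out, and input is normalized so that the same stack typed in a different order still matches.

```python
import unicodedata

# symbol -> (Hindi vowel it stands for, Urdu letter that writes it)
URDU_VOWEL_SYMBOLS = {
    "\u00fb": ("\u016b", "wâŵo"),               # û = ū
    "\u00f4": ("o", "wâŵo"),                    # ô = o
    "\u00f4\u0331": ("au", "wâŵo"),             # ô with underline = au
    "\u016d": ("u", "wâŵo"),                    # ŭ = u
    "\u00ee": ("\u012b", "cḫôṭî yê"),           # î = ī
    "\u011b": ("e", "cḫôṭî yê"),                # ě = e
    "\u011b\u0331": ("ai", "cḫôṭî yê"),         # ě with underline = ai
    "\u1eb9\u030c": ("e (ezafe)", "cḫôṭî yê"),  # ě with underdot = ezafe
    "\u00ea": ("e", "baṛî yê"),                 # ê = e
    "\u00ea\u0331": ("ai", "baṛî yê"),          # ê with underline = ai
}
# Normalize the keys once so lookups do not depend on how they were typed.
URDU_VOWEL_SYMBOLS = {
    unicodedata.normalize("NFC", k): v for k, v in URDU_VOWEL_SYMBOLS.items()
}


def decode_vowel(symbol: str) -> tuple[str, str]:
    """Return (Hindi vowel, Urdu letter) for one transliteration symbol."""
    return URDU_VOWEL_SYMBOLS[unicodedata.normalize("NFC", symbol)]


print(decode_vowel("\u011b\u0331"))
```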
    11. For Persian, I don't exactly remember what conventions I developed; looking at this video, it seems I used ë for /e/ "written as heh" (including ezafes, transliterated -e when unwritten), separated plural marker hâ with a dash, used ř for ayn, č for če, ǧ for qaf, ' for alif (except for when I used ʔ instead, as in ʔAgë), š for shin, â for "long a", x for kh, ẃ for the mute w in bexẃâd which is pronounced bexâd, ṣ for ṣad (as opposed to s for sin), presumably underdots for all unmentioned emphatics, and ḥ for ḥe, and marked all written vowels with macrons (except of course for â);
      That was my original comment; I have since decided to actually develop a scheme to apply to posts and translations from after 1/7/21; this will transliterate the letters as follows:
      Letter | Transliteration
      ا | '/â
      ب | b
      پ | p
      ت | t
      ٹ |
      ج | j
      چ | č
      ح | h/ë
      خ | x
      د | d
      ذ |
      ر | r
      ز | z
      ژ | ž
      س | s
      ش | š
      ص |
      ض |
      ط |
      ظ |
      ع | `
      غ | ğ
      ف | f
      ق | q
      ک | k
      گ | g
      ل | l
      م | m
      ن | n
      و | w/û/ô
      ھ |
      ی | y/î/ê


      When I use ë, it means "vowel e spelled with he"; note that ezafes, progressive(?) mî's, and possessives are separated with dashes, so mî-konî, beh-et, etc;
    12. For Bulgarian, I didn't put much thought into it, I probably followed Wiktionary; look at this post and infer what you can;
    13. For Hebrew, I have two systems: one fully back-transliteratable, and one less cluttered and more readable; let's see a table of the transliterations of letters in the first system, make a couple remarks, and then see what changes in the second system:
      Letter | Transliteration
      א | '
      ב | v/b
      ג | g
      ד | d
      ה |
      ו | v/ų/ǫ
      ז | z
      ח | ħ
      ט |
      י | j/į/ę
      כך | kh/k
      ל | l
      מם | m
      נן | n
      ס |
      ע | `
      פף | f/p
      צץ | tz
      ק | q
      ר | r
      ש | s/sh
      ת | t
      If dageshes were used more often, most pairs of transliterations would be no dagesh / dagesh, the exception being s/sh which depends on the also-omitted top dot; yod and vav are often used to mark vowels, hence the multiple transliterations, the ogonek showing the vowel in question is marked; I also separate articles from nouns and conjunctions and prepositions from whatever follows them with dashes, so you see things like l-i for "in me", shel-i for "of me" (=my, mine), and ḥa-kavód for "the respect"; the transcription was made a bit carelessly at times, but it should drop the ogoneks (ǫ ų į ę), the ḥ (meaning ḥeḥ is effectively not transliterated in the transcription), and the ' and ` (so 'alef and `ayn are also untransliterated); I will make sure that is all the differences;
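Going from the first Hebrew system to the second can be sketched as a mechanical clean-up. The Python below rests on one reading of the rule above, which is an assumption: "drop the ogoneks" is taken to mean removing the ogonek mark and keeping the vowel (ǫ becomes o), while ḥ, ' and ` are deleted outright. The name `simplify_hebrew` is hypothetical.

```python
import unicodedata

OGONEK = "\u0328"                          # combining ogonek
DROPPED = {"\u1e25", "\u1e24", "'", "`"}   # ḥ, Ḥ, 'alef mark, `ayn mark


def simplify_hebrew(full: str) -> str:
    """Reduce the back-transliteratable system to the more readable one."""
    text = unicodedata.normalize("NFC", full)
    kept = "".join(ch for ch in text if ch not in DROPPED)
    # Decompose so the ogonek becomes a separate mark, remove it, recompose.
    decomposed = unicodedata.normalize("NFD", kept)
    return unicodedata.normalize("NFC", decomposed.replace(OGONEK, ""))


print(simplify_hebrew("\u1e25a-kav\u00f3d"))
```

The reverse direction is not possible from the second system alone, which is the whole reason for keeping the first one around.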
  4. Languages without a standard orthography or a standard script: Min Nan, Hakka, Teochew; I've covered all of these in this post.
