Friday, March 17, 2017

An index & a count of Fulfulde words used in Kaïdara

Last year I dusted off an old sub-project idea to index words used in Amadou Hampâté Bâ's Kaïdara, a Fulani initiation tale originally published in parallel Fulfulde and French text. I've brought that to a level of completion with a list of occurrences of all Fulfulde words in Kaïdara ("Kaydara" in Fulfulde), each one tagged with the "stanza" (actually just a set of 10 numbered lines) in which it appears. This is complemented by a word frequency count using an online utility designed for such work.

The original idea goes back to a project proposal in the early 1990s for a follow-on phase to an original US Department of Education materials grant to produce a lexicon for the Maasina variety of Fula. That phase would have included on the one hand field research and on the other "mining" of various Fulfulde texts for vocabulary and word forms. The Kaïdara idea fit under the latter.

At the time there were ASCII texts (with markup for accents and extended characters) of this and a few other texts available from an FTP site. The plan was to use a series of macros in WordPerfect to substitute characters as needed in such text, then to tag each word with the number of the line in which it appeared - tag meaning simply to affix the number in a manner similar to what I have just done with the index I'm making available. The resulting index could then be used to identify terms missing from the lexicon, and to look up how they and other words were used along with their translations in context. (Kaïdara of course is in verse, so the usage is stylized but still of interest.)

Ultimately the follow-on project was not funded, so the Fulfulde lexicon completed for the original project was further edited and slightly expanded for publication in 1993. And the idea of indexing Fulfulde texts in the manner described was shelved. In the intervening quarter century, a considerable amount of work has been done on corpus development for many languages, but not to my knowledge including Kaïdara (or other bilingual works in the "Classiques africaines" series).

In January 2016 I decided to make an index, using the digital copy of Kaïdara from That resource is very helpful, but I did find a number of small errors, which to me looked like scanos (these were most easily identifiable at the stage when the words were sorted alphabetically). This was a manual process, with some search & replaces: a set of 10 lines is copied (lines ending in 0-9, so numbering indicates 10s), and spaces are searched and replaced with the appropriate number and a hard return. A difference between this and the original concept is that the words are not tagged with their exact line, but rather with the set of ten lines within which they occur (still more exact than a page number would be).
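
For anyone wanting to automate the tagging step, here is a minimal sketch in Python. It assumes the text is available as a list of lines numbered from 0, and that each set of ten lines is labeled by its tens value (0, 10, 20, ...). The function names and the placeholder sample lines are hypothetical, not taken from the actual index or from the book.

```python
from collections import defaultdict


def tag_words_by_stanza(lines, stanza_size=10, first_line_number=0):
    """Pair every word with the label of the set of ten lines it occurs in."""
    tagged = []
    for offset, line in enumerate(lines):
        line_number = first_line_number + offset
        # Lines 0-9 get label 0, lines 10-19 get label 10, and so on.
        stanza = (line_number // stanza_size) * stanza_size
        for word in line.split():
            tagged.append((word, stanza))
    return tagged


def build_index(tagged):
    """Map each word to the ordered list of stanza labels where it appears."""
    index = defaultdict(list)
    for word, stanza in tagged:
        if stanza not in index[word]:
            index[word].append(stanza)
    return dict(index)


# Placeholder lines, not actual text from the book.
sample_lines = ["alpha beta"] * 10 + ["beta gamma"]
sample_index = build_index(tag_words_by_stanza(sample_lines))
```

Tagging by exact line, as in the original concept, would only require setting `stanza_size=1`.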

At the end of that process, punctuation was stripped out of the complete list, again by search & replace, and then the list was sorted. It was at that point that the whole list had to be scanned visually for anomalies - for example several words repeating but one with a regular d instead of hooked ɗ, or what looks like a plural ending in -be when -ɓe is intended. And for single words, occasionally something doesn't look right and needs to be checked against what was printed in the book.
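
The clean-up and checking described above might be sketched as follows, again in Python. This is an illustration under assumptions rather than the process actually used: the choice to keep the apostrophe, the table of hooked letters, and the sample words (chosen only to show a d/ɗ and a -be/-ɓe mismatch) are all hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

# Hooked letters mapped to their plain look-alikes (assumed list, lowercase only).
HOOKED_TO_PLAIN = str.maketrans({"ɓ": "b", "ɗ": "d", "ƴ": "y"})


def strip_punctuation(word, keep="'"):
    """Drop Unicode punctuation, except characters listed in `keep`."""
    return "".join(
        ch for ch in word
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


def variant_groups(words):
    """Group spellings that differ only by hooked versus plain letters or by case."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[lowered.translate(HOOKED_TO_PLAIN)].add(lowered)
    # Only groups with more than one spelling need a visual check.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Illustrative words only, not taken from the index.
raw_words = ["ɗum,", "dum", "yimɓe.", "yimbe", "Kaydara"]
cleaned = sorted(strip_punctuation(w) for w in raw_words)
frequencies = Counter(cleaned)
flagged = variant_groups(cleaned)
```

Note that a plain code point sort places ɗ after z, so the suspect pairs do not end up next to each other in the sorted list; grouping by the "de-hooked" form brings them together. A flagged pair is only a candidate for checking against the printed book, since both spellings could be legitimate words.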

It is entirely possible that I (1) missed errors, or (2) introduced errors. Ideally an automated process (that could be run more than once) could do such work. But for the moment, here is a way of searching Fulfulde words in Kaïdara, and a different way to look at its contents.

Monday, February 27, 2017

About African languages on Wikipedia & on PanAfriL10n

Am overdue for an update on Wikipedias in African languages, but in the meantime, here's a quick suggestion concerning articles in any Wikipedia editions about African languages. That is, incorporate contributions to Wikipedia as assignments in African languages and linguistics classes.

This is actually a variation on a theme previously discussed on this blog. It is prompted by an observation made by Michael Everson comparing treatment in the English Wikipedia of the Irish language and the Wolof language (as an example). The latter is not bad but has gaps, and nowhere near the detail one finds on the Irish language (which extends to other articles).

What would it take to set up an experiment in a university-level African language program - in Africa or elsewhere - where a prof would institute this idea? The experience could be shared and developed with other objects in mind, such as contributions to African language Wikipedias. In a few cases like Wolof, which has its own edition of Wikipedia, one could write more about the language in the language itself.

Information on African language orthographies

Michael's comment came in the context of what he sees as the difficulty getting "decent grammatical and orthographic information on most major African languages." This was in a discussion on Facebook following a post by Charles Riley (of Yale University Library) about the "Garay script," which was invented in the 1960s as an alternative to the dominant Latin-based script and the traditional Arabic-based Wolofal for writing Wolof. (Africa is a continent of many alphabets - another topic to which I hope to return soon.)

ANLoc's logo for the PanAfriL10n wiki, 2008
With regard to information on orthographies of African languages, collecting such information was one of the mandates of the PanAfrican Localisation project (2005-2008). At one point early on, the possibility of setting up a database on character requirements for diverse African language orthographies was seriously considered, but the quality of available information was not deemed to be sufficient for that investment (see discussion and diagrams of evolution). Ultimately the PanAfriL10n wiki (hosted by since 2015) had pretty good coverage of what was available, language by language and by script, but unfortunately updates have been spotty.

So, another possibility might be for African language and linguistics students to also help update this resource intended as an aid for localization and other language and technology efforts.

Tuesday, February 21, 2017

IMLD 2017 & the Linguapax Prize

The theme of this year's International Mother Language Day (IMLD) celebration is again related to the importance of languages in education: "Towards Sustainable Futures through Multilingual Education." (See also the posting on this blog about IMLD 2016.)

Held annually on February 21, this is the 18th IMLD observance. IMLD is coordinated on the international level by UNESCO, but countries, communities, and associations organize local observances around the world. There have also been initiatives online, such as the Rising Voices "Mother Language Meme Challenge."

In a letter marking IMLD 2017, the African Academy of Languages is requesting information about local observances and initiatives in Africa after they are held.

Linguapax Prize

The Linguapax Institute announces the winner of the Linguapax Prize annually on IMLD. This year's award went to Dr. Matthias Brenzinger, who is described on the Linguapax site as:
A linguist of German origin, expert in African languages, pioneer in the study of endangered languages and linguistic revitalisation who stands out both for his theoretical contributions and his commitment to the field. He is an activist and promotes the training of linguists among the speakers of endangered languages. ...
Dr. Brenzinger is currently a professor at the University of Cape Town, where he founded the Centre for African Linguistic Diversity (CALDi) and The African Language Archive (TALA). The African languages he has worked on include (certainly an incomplete list): Borana, Khwe, Nǀuu, and non-Bantu click languages in general. For more information see his bio on LinguistList and his article "African language studies on the African continent."

Tuesday, February 14, 2017

2 CFPs re African languages: Agency & the Production of Knowledge, and Disciplines & Professions (ALDP8)

Here are calls for participation (CFPs) in two more conferences, one at Columbia University in New York in March on "African Languages, Agency, and the Production of Knowledge," and the other being the 8th edition of Harvard University's African Languages in the Disciplines and Professions Conference, to be held in Conakry, Guinea in April. The deadline for both CFPs is 1 March 2017 (apologies for the late notice).

African Languages, Agency, and the Production of Knowledge

This "mini-conference," to be held on 24-25 March 2017, is jointly sponsored by the Department of Middle Eastern, South Asian and African Studies (MESAAS) and the Institute of African Studies (IAS) at Columbia University.

"The main objective ... is to assemble scholars from diverse disciplines and engage them in dialogue on the current status of African languages as conveyors of knowledge, their relevance in knowledge production and sharing, and their role in the future of knowledge construction." It is also planned to publish the proceedings.

"Some relevant questions to ask include: What has Africa lost due to the disuse of African languages in education? What is the relevance of African languages in knowledge production and sharing? What has been achieved so far towards the development of African languages and indigenous knowledge? What are the future prospects for African languages? How can African languages contribute in the construction of knowledge through literature, translation, poetry, and fiction? And what is the role of African language writers, translators, researchers, and teachers?"

For further details on submissions, click on image for full text of CFP. Abstracts should be submitted by 1 March, to sms2168 (at) columbia (dot) edu.

African Languages in the Disciplines and Professions (ALDP8)

The 8th ALDP conference, to be held on 21-23 April 2017 in Conakry, is the first in the series to take place in Africa. The series is run by the African Language Program at Harvard and co-sponsored by the Department of African and African American Studies and the Harvard Committee on African Studies. This year it is being co-organized with Université Kofi Annan de Guinée.

For a general description and some background, see also last year's post on this blog about ALDP7.

The theme of this year's conference is “Progress of African Languages in Disciplines and Professions.” It is planned that the "plenary sessions will include scholars' and activists' presentations from broad areas of disciplines and professions." They are "seeking scholars to give papers and serve on panels for the conference." "The conference's languages of communication are French, English and African languages."

For further details on submissions, see the ALDP page (which also has versions of the CFP in French and N'Ko). Abstracts should be submitted by 1 March, to harvardalp (at) gmail (dot) com.

Thursday, February 09, 2017

Health info in African languages, on 2 non-African sites

Here are quick reviews of two websites - one Australian, the other American - that have health information in numerous world languages, including a number from Africa. Both are primarily intended to serve immigrant communities. This post will then return briefly to the theme of the benefits of systematically sharing and improving health-related information composed in or translated into African languages.

Health Translations

Health Translations is a website maintained by the government of the state of Victoria in Australia. It has information (mainly documents such as fact sheets and flyers, from what I can tell, some with illustrations) on over 80 topics in a total of almost 100 languages or language varieties (although not all information is in every language, and some languages have few items).

The African languages for which there are materials include: Afrikaans; Akan; Amharic; Arabic; Bemba; Dinka; Juba Arabic; Kirundi; Krio; Lingala; Nuer; Oromo; Shona; Somali; Sudanese Arabic (also listed as Sudanese); Swahili (Congolese); Swahili (Kenyan); and Tigrinya.

This is an impressive collection of materials from various sources, apparently all Australian, and in different formats. They appear to all be translations - a given material may be available in a few or quite a number of language versions. Navigating from a click on the desired language in the list of languages (which helpfully includes both English & native names/scripts) to a particular topical resource requires interacting with screens in English. That is not surprising, but when one gets to the list of topics and resources in a particular language, the titles are only in English, and on the list of languages into which a material is translated (this is a typical navigation sequence), the language names are in English (no native scripts used). So the resource appears intended to be used by or with the help of professionals or others who can read English.

Source: Bushfire smoke & your health [am]
I'm not able to evaluate the quality of the translations, but noted French in Lingala (which may simply be typically used loanwords) and by chance an anomalous English word in an Amharic text (image).

All the several documents I viewed were PDFs, mainly text but some image (meaning the text cannot be searched or copied out for editing into other materials). Spot checking some non-Latin text, specifically Ethiopic/Ge'ez used for Amharic and Tigrinya, and complex Latin, specifically for Dinka, there were some issues with the text that would interfere with searches or copying out passages (such problems are not uncommon with PDF rendering, even when visually the PDF presents everything correctly and in its intended place).

Some Amharic text when copied out and pasted showed capital A for አ and E for እ in initial position (for example, here). The corresponding characters are appropriate, interestingly, but this makes search or reuse of such text problematic.

From original (l.); copy-pasted out (r.)
Source: Bushfire smoke ... [din]

The Dinka text sampled showed some typical problems with complex Latin in PDFs. Dinka is written with what I call in ALDA a "category 4" Latin orthography in that it includes extended Latin characters (aka modified letters) plus combining diacritics, sometimes together as in the open-o with diaeresis in the word "daiɣɔ̈kthai" (dioxide) featured on the left side of the image. Copying that word from the PDF and pasting it in a word processor or advanced text editor yielded the results on the right, missing one extended character and the combining diacritic on the other. This complicates any potential re-use of this text, but also means that document, folder, or web searches will not pick up words with such character combinations.
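
To make the problem concrete, here is a small Python sketch that lists the non-ASCII code points in that word and reports which ones are lost in a copy-pasted version. The "pasted" string below is my reconstruction from the description above (one extended character gone, the combining diacritic gone from the other), not a verbatim copy of what the PDF produced, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter


def describe_non_ascii(text):
    """List (code point, Unicode name) for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def lost_code_points(original, pasted):
    """Code points present in the original but missing after copy-paste."""
    before = Counter(unicodedata.normalize("NFD", original))
    after = Counter(unicodedata.normalize("NFD", pasted))
    return sorted(f"U+{ord(ch):04X}" for ch in (before - after).elements())


# "daiɣɔ̈kthai": gamma (U+0263), open o (U+0254), combining diaeresis (U+0308).
original_word = "dai\u0263\u0254\u0308kthai"
# Assumed damaged form: gamma dropped, open o left without its diaeresis.
pasted_word = "dai\u0254kthai"

report = describe_non_ascii(original_word)
lost = lost_code_points(original_word, pasted_word)
```

A check like this, run over passages copied out of a PDF and compared with a trusted source text, would show quickly whether extended letters or combining marks are being dropped.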

Will return to these issues, why they're important, and what to do about them in the last section of this post.



HealthReach: Health Information in Many Languages is a program of the US National Library of Medicine of the National Institutes of Health. It includes translations in 46 languages maintained on the MedlinePlus site. A total of almost 350 topics are listed (although here too, not all information is in every language, and some languages have fewer items than others).

The African languages for which there are materials include: Amharic; Arabic; Oromo; Somali; Swahili; and Tigrinya. The native names of languages are featured on the list of languages, except oddly for Amharic and Tigrinya, which are transliterated into Latin ("amarunya" instead of አማርኛ, and "tigrinya" instead of ትግርኛ).

This also is an impressive collection from diverse sources, in this case American, but it is longer on topics and shorter on languages covered. The list of topics for each language also includes the titles in the language and its script - except again for Amharic and Tigrinya (not even transliterations) - as well as in English.

All materials checked were PDFs. There are no materials for African languages with complex Latin scripts.

As for non-Latin scripts, text in Arabic seems to behave as intended, from small samples. On the other hand, some Amharic text when copied out and pasted showed the same capital A for አ and E for እ observed above, plus O for ኦ (see here). A Tigrinya document had a similar issue. So this issue may have to do with a problem in PDFs for handling a particular set of characters - አኡኢኣኤእኦኧ (representing glottal stop plus the range of vowels) - or a subset of them, which might be helpful to know when troubleshooting.
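
One way to screen copied-out text for this problem is to flag words that mix Ethiopic characters with ASCII Latin letters. The Python sketch below is only a rough heuristic under stated assumptions: it checks the basic Ethiopic block (U+1200 to U+137F), the function names are hypothetical, and the damaged sample word is constructed from the substitution described above rather than copied from either site.

```python
# The eight characters listed above occupy U+12A0 through U+12A7.
GLOTTAL_SERIES = "".join(chr(cp) for cp in range(0x12A0, 0x12A8))


def is_ethiopic(ch):
    """True for characters in the basic Ethiopic block."""
    return "\u1200" <= ch <= "\u137F"


def mixed_script_words(text):
    """Return words containing both Ethiopic and ASCII Latin letters."""
    flagged = []
    for word in text.split():
        has_ethiopic = any(is_ethiopic(ch) for ch in word)
        has_latin = any(ch.isascii() and ch.isalpha() for ch in word)
        if has_ethiopic and has_latin:
            flagged.append(word)
    return flagged


intact = "አማርኛ"
# Constructed example: initial አ replaced by Latin capital A.
damaged = "A" + intact[1:]
suspects = mixed_script_words(f"{intact} {damaged} English")
```

A legitimately mixed token (a Latin abbreviation attached to an Ethiopic word, say) would also be flagged, so the output is a list of places to look at rather than a list of errors.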

Health education materials and the "2Ds & 4Rs"

In highlighting aspects of public health messaging during the ebola epidemic in West Africa (2014-15), this blog suggested a systematic approach to sharing and improving materials that were developed and used in that context (with primary attention to text and images). A mnemonic - 2Ds & 4Rs - was put forth in October 2014, initially to explain the rationale for reposting and discussing various ebola education materials, but also as a way to capture the ideal cycle of utility of such production. Too often, materials are developed, used for a particular purpose, and then forgotten, when they could add to a growing living corpus of resources to tap for future work. This is important in any field and language, but arguably especially important in health, and for languages that have fewer resources and emerging terminologies / technical lexicons, such as many in Africa.

In that context I propose to use the 2Ds & 4Rs to consider the efforts represented by the two sites discussed above. Of the 6 elements of this model, the first three have to do more with the sharing and use of materials, and the last three with their longer term development and potential re-use. These are listed with brief explanations and what I see as relevance to the two sites:
  • Dissemination (making materials available, including via multiple sources)
    • Both sites bring together and post materials from diverse sources, increasing their exposure and access to them.
  • Demonstration (showing how materials in African languages can be presented, including in cases where complex scripts are involved)
    • Both sites show that African language materials can be presented on the same footing as other world languages.
    • However, the HealthReach presentation does not use available technology to present the native names of Amharic and Tigrinya, or titles of materials in those languages.
  • Reading (creating or translating text materials with attention to how they may be read aloud in groups or over local radio, which may be more likely scenarios for their use than the typical Western expectation of silent reading by individuals)
    • It appears that all or most of the materials from diverse sources compiled on the two sites are translations from English of technical descriptions and advice. It is not clear how well adapted they are for the range of uses and audiences they might serve.
  • Review (written material - text - is well suited for review, comparison, and analysis; such material, especially in less resourced languages and on issues of public importance like health, should undergo such treatment)
    • No information on how any of the materials may be or have been reviewed, either in the diverse organizations where they originated, or in the projects hosting the two websites. 
    • Image PDFs, where these occur, do not lend themselves to processes of review.
    • Text PDFs with problems in their encoding of non-Latin or complex Latin scripts present problems for review.
  • Revision (after review of materials, and in response to other information and feedback relevant to them, materials should undergo appropriate revisions in content, form of language, copyediting, and presentation)
    • No information on any revisions of any of the materials.
    • Issues cited under "Review" with image PDFs and with text PDFs that have encoding problems also hinder revision work.
  • Re-use or re-purposing (text materials can be re-used or sections re-purposed)
    • No information on re-use or re-purposing of any of the materials.

The two sites profiled above and the various health and medical education materials presented on them represent an important resource for fifteen African languages (and some varieties of two of those).

One additional question is whether such materials, intended primarily to serve needs of immigrants in Australia and the US, might be useful as is or with modifications, for speakers of the same languages in relevant African countries. Or in the reverse sense, whether any health extension materials from Africa might inform revision of these materials and development of new ones. A next step could be for a site to begin to collect health materials in African languages from all sources.

There are many directions in which this could be taken, with the goals of improving availability, quality and utility of health education information in a range of African languages. One, for example, is linking with the longstanding WikiProject Med's Translation Task Force for development of articles in those African languages that have Wikipedias (such as Afrikaans, Akan, Amharic, Arabic, Kirundi, Lingala, Oromo, Shona, Somali, Swahili, and Tigrinya). Another might be connecting with efforts to advance development of standard terminologies. Still another might be to bring in human language technology, such as text to speech, so that materials designed and disseminated in text form could be accessible in audio via mobile devices.

Thanks to Charles Riley of Yale University for calling our attention to these two websites.

Tuesday, January 31, 2017

Conferences: ACAL 2017 & two BAAL events

Here is some information on three upcoming conferences that deal in one way or another with African language topics. The first, the 48th Annual Conference on African Linguistics (ACAL), will be held at Indiana University in Bloomington, Indiana, US on March 30 - April 2, 2017 (CFP already passed). The other two are organized by groups connected with the British Association for Applied Linguistics (BAAL): the BAAL - Cambridge University Press Seminar Series (BAAL-CUP) will be held at Aston University in Birmingham, UK on 27-28 April 2017 (CFP deadline Feb. 20); and the BAAL - Language in Africa Special Interest Group (BAAL-LIASIG) Annual Conference will be held at the University of Reading in Reading, UK on May 12, 2017 (CFP deadline March 31).


The ACAL, run since 2013 by an organization with the same acronym, the Association of Contemporary African Linguistics, is a major annual academic meeting on African languages and linguistics. I personally had the chance to present at ACAL 35 in 2004, and to attend ACAL 38 ten years ago.

This year's conference is hosted by the Department of Linguistics at Indiana University. "The conference will focus on all aspects of African linguistics, from linguistic description to theoretical analysis and sociolinguistics."

Links to registration and the program page (though the latter was blank at the time of this writing).


The BAAL-CUP seminar has as its theme "Minority Languages in New Media: Towards language revitalisation in Europe and Africa." The term "minority language" is a little tricky, but in the African context, I read it as applying to most if not all African languages.

Per the seminar information, it "is intended to identify and discuss emerging trends in the study of minority languages in new media and technology. This includes the ways in which minority languages are supported through their presence in new media, and how minority language users are making use of their languages in digital landscapes traditionally dominated by global languages such as English."

"New Media refers to digital communication platforms such as online news sites, blogs, wikis, Facebook, Twitter, Snapchat, Instagram, and other Social Media."

Links to the CFP and registration.


The LIA SIG conference this year has the theme "Language without Borders: Multilingual Communication in Africa and the Diaspora."

Per the CFP, topics addressing the theme may include for example: oral communication; multilingualism; language choice; codeswitching; translanguaging in education; and translation.

Links to the CFP and registration.

Saturday, December 31, 2016

Revisiting an African language content strategy

While in Mali in late 1999 and early 2000, inspired in part by early research by FUNREDES on Languages and Cultures of the Internet, I began thinking about strategies for increasing African language web content. While recognizing of course that such content would come primarily from communities of speakers of African languages, as well as internationally funded projects which at the time were beginning to think about how to use the internet for development, the motivation was to facilitate creation of an environment favorable to its creation and use.

ISOC's report, 8/2016
Recent discussion of one project and reading about another, each of which deals in different ways with content and communication, and reading the Internet Society's (ISOC) August 2016 report on "Promoting Content in Africa," prompt me to revisit this early effort and look at how things are playing out.

Looking forward from 1999

The basic idea in 1999 was to disaggregate approaches to internet content development in African languages, and consider how each could optimally contribute to the overall goal of greater presence of those languages in cyberspace. In 2003 I reworked that schema to share more widely - for example on the short-lived Africa Web Content Owner email list.1 Elements of this strategy were incorporated in different ways in later work, such as African Languages in a Digital Age (ALDA).

The main approaches were:
  1. Composition of text-based content (including where possible, digitization of works previously published in African languages)
  2. Translation of text-based content from other languages (leveraging the then emerging machine translation [MT] technology)
  3. Development of content in non-text formats (with specific reference to audio2)
It was recognized that production of text-based content from scratch, by composing material (#1 above), is an incremental and artisanal process. In other words, it takes time and effort to achieve modest results. This is especially the case for languages with younger written traditions that are not well supported in education or even sometimes fully supported in software (fonts were then a problem for certain writing systems, for example, and keyboards for them still are). No matter how fundamental text content was, it would be hard to keep pace with content creation in many non-African languages, leaving speakers of African languages with few opportunities to see their language online.

Therefore the possibility of taking existing texts in African languages - from published books or other printed materials - and putting them on the web was suggested as a way to give a small but significant boost to efforts to generate African language internet content. Such texts often have historical or cultural value, and may already be in a standard orthography (or in transcriptions that could be easily converted into one).3 A sustained effort to "weblish" these materials, according to this thinking, could quickly add quality material to what is available on the web in a number of languages, and more importantly, make many materials that are accessible only in university libraries more readily available to speakers of those languages. However, copyright protections limit the potential of this tactic (although there have been some sites that appear to have made some of these materials available online without permission).

Therefore, an emphasis was put on alternative ways to create new content, especially translation (#2 above) of various relevant, useful, and interesting material aided by MT, and content built around the spoken voice (#3), responding to oral dimensions of African cultures, as well as the low literacy rates in African languages.

MT in that era especially was mainly a hope for the future, and aside from a few experiments most advances in the 2000s were for pairs of major (mainly Europhone) languages. Nowadays the technology has improved, but the statistical methods that have been key in that evolution require language resources that do not exist for many languages (at least yet). As such, the contribution of MT to content development in African languages is still in the future. 

As for audio content, this did not emerge as a significant component on the internet (unless one counts songs or the sending of audio files as email attachments, both of which were transferred over the internet rather than presented as part of web content). But see the discussion regarding video sharing below.

Just how much African language content?

Through the 2000s, African language content seemed to grow only marginally and unevenly. A pair of studies published by Rifal in 2003 provided perspectives on this subject that are still useful:
I'm not aware of any more recent studies along these lines, but it may still be the case that there is a significant number of sites with at least some African language content, but that these are still mostly descriptions.

New kinds of content

The rise of social media, video sharing, and mobile devices over the last decade or so has changed how we think of and produce web content, opening new possibilities for African languages in cyberspace.

Social media, including blogs and wikis, makes the creation of content in any language easier. But in the case of many African languages, it also brings us face-to-face with other limitations in input systems (for extended Latin and non-Latin scripts), education (where schools use only Europhone languages so that people aren't familiar with writing their first languages), and incentive (where the audience for text-based content in less widely spoken African languages is perceived to be small).

Video in a way fulfills the old idea of audio content on the web, but with the obvious advantage of visuals (though there are at least a few YouTube videos with static presentation - a picture or line of text - and full audio in one or another African language). What's missing as far as I can tell is a way to find videos in specific languages that does not rely on the producer having tagged them appropriately (which may not happen).

Mobile devices have changed how we access and interact with content, and consequently how content is designed and even conceived. They also have become the most common way for Africans in general to access the internet - proportionately more important I believe than in any other continent. What I don't have a sense of is how much content in African languages is developed with mobile devices in mind. On the other hand the input limitations for some writing systems would certainly be an issue for use of some African languages in messaging for example.

"Promoting Content in Africa," 2016

ISOC's recent report includes a look at structures to support development of content in Africa, including in African languages. It is encouraging to note the attention ISOC is giving in this report to the importance of African language content for internet use in Africa.

One of the recommendations ISOC has for promoting local content in Africa, including that in African languages, is to promote development of local infrastructure, including data centers, Content Delivery Networks, and Internet Exchange Points. This idea to in effect create a facilitating environment for creation of African language content is an interesting strategy, and would complement other efforts such as mentioned above.

1. The direct link to the post in the AWCO archives is apparently accessible only to subscribed group members. I've created an alternative presentation of it on my website. That post has more background.
2. An early consideration of audio and web content on AWCO mentions Native American interest in the topic, as well as a project in Mauritania (also available on my website).
3. Numerous transcriptions of histories and tales from before the adoption of current orthographies used systematic notation that generally corresponds directly to characters used today (1:1 or occasionally 2:1). I have encountered this for example in various older materials on Fula and Bambara. Cheikh Anta Diop's famous translations of scientific and European cultural texts into Wolof (1955) similarly used a regular transcription that predated the current standard orthography in Senegal.