
Mandarin is replacing Cantonese. Offbeat AI fights back as Big Tech looks away

English · South China Morning Post


Preserving Cantonese has been challenging due to the dominance of Mandarin, limited learning resources and the lack of a standard written form.

With a declining number of young learners, the language faces an uncertain future.

Artificial intelligence (AI) – seen by some as an existential threat to humanity – may become the hope for saving the language, and many others, along with the distinct cultures they embody.

This is the mission of Hong Kong-based deep-tech company Votee AI: to use large language models (LLMs) to preserve languages, especially those overlooked by tech giants.

Leo Ma, Votee’s chief scientist for the Asia-Pacific, said in an interview that AI could create living records of many languages beyond English – the primary language for most LLMs today. “While mainstream AI models excel in English, they remain ‘functionally illiterate’ for 99 per cent of the world’s languages,” Ma said, adding that this limited AI access for the billions of non-English speakers.

Unlike Mandarin, English or Spanish, which have vast troves of digital text for AI training, Cantonese lacks written text that accurately reflects how it is spoken. “Cantonese is a ‘low-resource language’ because there is not much written text that we can use for training,” Ma said.

Votee AI, which says it is the first company to open source a Cantonese LLM, went through the exacting challenges of recording the language spoken by 85 million people worldwide.

“Neglected languages should not be gated. Open source is a critical path to ensuring linguistic diversity survives in the age of AI,” Ma said. “Neglected languages are constantly disappearing.

“For us in the technology industry, we look to the future – but at the same time, we hope to preserve our past, which is the foundation of the present.”

To collect training data, the team built a digital dictionary by gathering data from the internet, such as social media and forums, and from partners such as universities, then annotating the cultural context of words.

“Words can have many meanings in different contexts. Every word, corresponding to a specific context, will be represented by its own independent item in the data set we create,” Ma said.

For example, the word for “water” is often used to refer to money, in addition to its literal meaning as
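One way to picture the "one item per word per context" idea Ma describes is the short Python sketch below. It is an illustration only, not Votee AI's actual data format: the `SenseEntry` class, its field names and the context labels "literal" and "slang" are all assumptions made for this example, and the two entries are hand-written rather than drawn from the company's data set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SenseEntry:
    """One dictionary item: a single word in a single context (hypothetical schema)."""
    word: str      # written form
    jyutping: str  # romanised pronunciation (assumed field)
    context: str   # assumed context label, e.g. "literal" or "slang"
    gloss: str     # English meaning in that context


# Illustrative entries only; not taken from Votee AI's data set.
ENTRIES = [
    SenseEntry("水", "seoi2", "literal", "water"),
    SenseEntry("水", "seoi2", "slang", "money"),
]


def senses(word: str, entries: List[SenseEntry] = ENTRIES) -> List[SenseEntry]:
    """Return every context-specific item recorded for a word."""
    return [e for e in entries if e.word == word]


def gloss_in(word: str, context: str,
             entries: List[SenseEntry] = ENTRIES) -> Optional[str]:
    """Look up the meaning of a word in one context, or None if no item exists."""
    for e in entries:
        if e.word == word and e.context == context:
            return e.gloss
    return None
```

Because each meaning is stored as its own item, `senses("水")` returns two separate entries, and `gloss_in("水", "slang")` picks out the money sense without disturbing the literal one.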
