Challenges & Methods for Multilingual Sentiment Analysis in 2024

Sentiment analysis has become an invaluable tool for businesses seeking to understand customer opinions and gauge reaction to their products and services. With globalization and the rapid growth of international markets, multilingual sentiment analysis capabilities are more important than ever. However, analyzing sentiment across different languages comes with some unique challenges. In this post, we’ll explore these challenges and the main methods used today to enable multilingual sentiment analysis.

Why Sentiment Analysis Across Languages is Difficult

While sentiment analysis in English is now quite robust, expanding to other languages brings new obstacles:

Pre-Processing Differences

The first step in any natural language processing task is text pre-processing: cleaning and preparing the data for analysis. This becomes more complex when source text arrives in legacy, non-Unicode encodings. For instance, older text in languages written in Cyrillic script, such as Russian, is often stored in KOI8-R, while most NLP tools expect Unicode text (typically UTF-8). If the input is not decoded correctly, the text turns into unreadable characters and the analysis process grinds to a halt.
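As a minimal sketch of the encoding step described above, the Python snippet below decodes legacy bytes into Unicode before any further processing. The helper name `to_unicode`, the sample sentence, and the order of candidate encodings are illustrative assumptions, not part of any particular NLP package.

```python
# Sample Russian text, encoded as KOI8-R here to simulate a legacy input file.
original = "Отличный продукт"
legacy_bytes = original.encode("koi8_r")


def to_unicode(raw: bytes, candidate_encodings=("utf-8", "koi8_r")) -> str:
    """Return the first decode that succeeds, trying encodings in order.

    Assumption: UTF-8 is tried first because it rejects most non-UTF-8 input.
    Single-byte encodings such as KOI8-R accept almost any byte sequence, so
    their position in the list is a guess about where the data came from.
    """
    for encoding in candidate_encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


text = to_unicode(legacy_bytes)      # Unicode string, ready for NLP tools
utf8_bytes = text.encode("utf-8")    # re-encoded for storage as UTF-8
```

When the source encoding is known in advance, passing it explicitly is safer than relying on a fallback order.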

Text encoding differences example

Pre-processing must be tailored to handle the distinct encodings, hyphenation rules, contractions, abbreviations and other conventions that vary across languages. As an example, German contractions like "zur" (zu der) may need to be expanded so that the underlying words are tokenized properly. ThePrepare package (https://github.com/Aulia122/ThePrepare) highlights some of these key differences across languages that impact pre-processing. Ignoring them will undermine later analysis.
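The contraction example above can be sketched in a few lines of Python. The mapping below is a deliberately small, hypothetical table written for illustration (it is not taken from ThePrepare or any other package), and a production system would need a much fuller list plus handling for cases where expansion is not wanted.

```python
import re

# Hypothetical, incomplete mapping of German preposition + article contractions.
GERMAN_CONTRACTIONS = {
    "zur": "zu der",
    "zum": "zu dem",
    "im": "in dem",
    "ins": "in das",
    "vom": "von dem",
    "beim": "bei dem",
}

# Match any listed contraction as a whole word, ignoring case.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(GERMAN_CONTRACTIONS) + r")\b", re.IGNORECASE
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        expanded = GERMAN_CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. at the start of a sentence.
        return expanded.capitalize() if word[0].isupper() else expanded

    return _CONTRACTION_RE.sub(replace, text)


expanded_sentence = expand_contractions("Ich gehe zur Schule")
```

Whether expansion actually helps depends on the downstream model; subword-based neural models often cope with contractions without this step.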

Lack of Resources

NLP packages rely on built-in lexicons and word lists to understand sentiment. These include stop word lists, used to filter out words that carry little meaning, and sentiment lexicons that label words as positive, negative or neutral. However, these resources are still predominantly focused on English. Supporting other languages requires manual effort to build and expand equivalent word lists.

For instance, the popular NLTK package for Python ships stop word lists for only a few dozen languages (https://www.nltk.org/nltk_data/). For multilingual analysis, this limited coverage leads to noisy results: without a stop word list for the target language, frequent function words can drown out the terms that actually carry sentiment.
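To make the stop word problem concrete, here is a minimal, self-contained Python sketch. The tiny word lists and the function name `remove_stop_words` are assumptions made for illustration; in practice the lists would come from a resource such as NLTK's stopwords corpus, where one exists for the language.

```python
# Tiny illustrative stop word lists; real lists are far longer.
STOP_WORDS = {
    "en": {"the", "is", "a", "and", "this"},
    "de": {"der", "die", "das", "ist", "und", "ein"},
}


def remove_stop_words(tokens, lang):
    """Drop stop words for `lang`; leave tokens untouched if no list exists."""
    stop_words = STOP_WORDS.get(lang)
    if stop_words is None:
        # No resource for this language: nothing is filtered, so the
        # downstream analysis sees every function word as potential signal.
        return list(tokens)
    return [token for token in tokens if token.lower() not in stop_words]


german_tokens = remove_stop_words("Das Produkt ist hervorragend".split(), "de")
turkish_tokens = remove_stop_words("bu ürün harika".split(), "tr")
```

The German sentence is reduced to its content words, while the Turkish one passes through unfiltered because this sketch has no list for it. That unfiltered case is the gap described above.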

Structural Differences

Languages have different grammatical structures, conventions and nuances that must be considered. Steps like tokenization, which separates text into logical units, need to be tailored for each language. Carrying processes over directly from English will fail to properly handle differences like word order, suffixes, and more.

For example, focusing only on individual words ignores important phrases and multi-word expressions, such as French "au contraire" or German "zum Beispiel", whose meaning comes from the phrase as a whole. Rule-based tokenization alone cannot handle these structural differences; smarter statistical or neural approaches are needed.
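The multi-word expression issue can be illustrated with a simple lookup-based tokenizer in Python. This is only a baseline sketch: the expression list and the function name `tokenize_with_mwes` are hypothetical, and a fixed list like this is exactly the kind of rule-based approach that statistical or neural tokenizers aim to replace.

```python
# Hypothetical list of multi-word expressions, stored as lowercase word tuples.
MULTIWORD_EXPRESSIONS = {
    ("au", "contraire"),
    ("zum", "beispiel"),
}
MAX_EXPRESSION_LENGTH = max(len(expr) for expr in MULTIWORD_EXPRESSIONS)


def tokenize_with_mwes(text):
    """Split on whitespace, merging known multi-word expressions into one token."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest possible expression first, down to two words.
        for n in range(MAX_EXPRESSION_LENGTH, 1, -1):
            candidate = tuple(w.lower().strip(".,!?;:") for w in words[i:i + n])
            if len(candidate) == n and candidate in MULTIWORD_EXPRESSIONS:
                tokens.append("_".join(candidate))
                i += n
                break
        else:
            # No expression starts here: keep the single word as-is.
            tokens.append(words[i])
            i += 1
    return tokens


french_tokens = tokenize_with_mwes("Au contraire, le service est excellent")
```

Here "Au contraire," becomes the single token `au_contraire`, so later steps can treat it as one unit instead of two unrelated words.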

These challenges mean multilingual sentiment analysis requires more customization and special handling than monolingual English analysis. So how are these challenges currently being addressed?

Main Approaches for Multilingual Sentiment Analysis

Broadly speaking, there are two main approaches in use today:

1. Native Language Support

Some NLP packages, such as spaCy, now include native support for 60+ languages, using statistical and neural network models like BERT. This allows the original text to be analyzed directly, without translation. The upside is that the original meaning and nuances are preserved. The downside is that each language requires much more extensive training data and resources.

For instance, Google's multilingual BERT model (mBERT) was trained on Wikipedia data across 104 languages, allowing it to learn subtle features of each language directly. In comparison, a translation approach must first map the text into English through another fallible model.

Recent research suggests sentiment analysis performs better on original-language text than on English translations. For example, a 2020 study from the University of Wolverhampton (https://eprints.wlv.ac.uk/id/eprint/17415/) reported F1 scores for German sentiment analysis improving from 0.57 using English translation to 0.76 when analyzing the German text natively with BERT.
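For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The Python sketch below shows how it is computed for a single class. The labels and predictions are made-up toy data for illustration only and are unrelated to the study cited above.

```python
def f1_score(true_labels, predicted_labels, positive_label="positive"):
    """Compute the F1 score for one class from paired gold and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    false_pos = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    false_neg = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy data: 2 true positives, 1 false positive, 1 false negative.
gold = ["positive", "negative", "positive", "positive", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative"]
score = f1_score(gold, predicted)  # precision = recall = 2/3, so F1 = 2/3
```

Running the same function on the output of a native model and of a translate-then-classify pipeline, against the same gold labels, gives a like-for-like comparison of the two approaches.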

2. Translation to English

The other approach is to translate the input text to English before applying sentiment analysis. This leverages existing English resources. Translation quality is constantly improving but can still lose some meaning, especially for low-resource languages.

Google Cloud, Microsoft, and AWS all offer cloud-based machine translation services (Cloud Translation, Azure Translator, and Amazon Translate, respectively) that can enable this approach.

So which is better? Recent research indicates native language support provides more accurate results when available. However, English translation is a pragmatic option for expanding to new languages quickly. The ideal solution likely combines both approaches.
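One way to combine the two approaches is a simple router: use a native model when one exists for the language, and fall back to translation otherwise. The Python sketch below uses toy keyword classifiers and a dictionary-based stand-in for a translation service. Every function and name here is hypothetical and exists only to show the control flow.

```python
from typing import Callable, Dict, Tuple


def analyze_sentiment(
    text: str,
    lang: str,
    native_models: Dict[str, Callable[[str], str]],
    translate_to_english: Callable[[str, str], str],
    english_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (label, route), preferring a native model over translation."""
    native_model = native_models.get(lang)
    if native_model is not None:
        return native_model(text), "native"
    english_text = translate_to_english(text, lang)
    return english_model(english_text), "translated"


# Toy stand-ins (assumptions): real systems would call trained models
# and a machine translation service here.
def toy_german_model(text: str) -> str:
    return "positive" if "gut" in text.lower() else "negative"


def toy_english_model(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


def toy_translate(text: str, source_lang: str) -> str:
    return {"muy bueno": "very good"}.get(text.lower(), text)


native_models = {"de": toy_german_model}
german_result = analyze_sentiment(
    "Sehr gut", "de", native_models, toy_translate, toy_english_model
)
spanish_result = analyze_sentiment(
    "Muy bueno", "es", native_models, toy_translate, toy_english_model
)
```

Returning the route alongside the label makes it easy to track later how much traffic each path handles and how accurate each one is.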

Hybrid Approach Case Study

Hong Kong-based CompareAsiaGroup provides financial services comparison platforms across 7 markets in Asia. They leverage both translation and native language analysis to gather consumer sentiment insights from reviews, surveys, and more (https://towardsdatascience.com/heres-how-compareasias-data-science-team-uses-ai-and-nlp-to-drive-business-solutions-8de89c1af0b).

Their hybrid approach allows CompareAsiaGroup to balance speed to market with tailored analysis per language. As their data scientist Nikhil Krishnan explains:

"Sentiment analysis results are slightly better in native languages rather than translating to English first. But translation gives us a workable baseline for new languages quickly. Combining both gives us flexibility."

This showcases the pragmatic benefits of combining both translation and native language support, allowing broader coverage without sacrificing nuance.

Key Trends and Future Outlook

  • Wider adoption of multilingual models like mBERT shows promise for native language support. According to Statista, investments in these types of AI models are projected to reach $20B by 2024.

Multilingual AI model revenue forecast

  • Continued growth of international markets will drive urgency for multilingual capabilities. Non-English internet users are expected to double in the next 6 years (https://www.internetworldstats.com/stats7.htm).

  • Major cloud providers are expanding the machine translation services that enable translation approaches. Microsoft now supports over 90 languages in its Azure Translator service.

  • More open source multilingual datasets and benchmarks support model training. Multilingual BERT itself was trained across 104 languages using Wikipedia data.

  • We expect a hybrid approach combining native support and translation to dominate. Certain languages will warrant custom models trained on sufficient in-language data to preserve nuance. But the flexibility of translation will fill gaps where data is sparse.

Recommendations for Practitioners

  • Assess your target languages – What are your top priorities? Focus native language support on those first. Seek languages with both sufficient user base and training data.

  • Leverage cloud services – Cloud translation services from AWS, Microsoft and Google provide a quick path to enable translation approaches.

  • Evaluate tradeoffs – Determine whether a translation approach is "good enough" for now or whether its inaccuracies will cause problems. Strike the right balance for each language.

  • Contribute data – Help grow multilingual capabilities by open sourcing annotated datasets in new languages. Target languages with potential for high ROI.

  • Consider hybrid approaches – Combine translation flexibility with selective use of custom native language models where they deliver the most value.

Multilingual sentiment analysis brings new intricacies but is a necessary capability for global organizations. With a smart roadmap leveraging hybrid approaches, robust multilingual analysis is within reach. Careful planning and evaluation of tradeoffs will pave the way for more nuanced understanding of global customer sentiment.