Machine Translation & Text Analytics: Friends or Foes?
Government agencies face similar challenges when it comes to understanding, and gaining intelligence from, foreign language content. They need to process, manage and gain insight from large volumes of content locked away in different formats, often across multiple languages. And they need to do all of this as quickly as possible. It's no mean feat when you consider the mind-boggling amounts of content being generated: by some estimates, 90% of the world's content was created over the past two years alone.
Machine translation and text analytics have long been regarded as the two main ways for organizations and agencies to tackle this challenge. Machine translation converts foreign language content into a language you can read and understand, while text analytics lets you understand the context of content without having to read it at all.
Same problem, different approaches
Over the past decade, text analytics and machine translation have gone hand in hand in data analytics, whether for document exploitation, open source intelligence or eDiscovery. Strangely enough, most companies have focused their efforts and investments on one or the other. In the rare cases when software vendors offered both disciplines as a combined natural language processing application, research and development investment was typically weighted one way or the other, with one application treated as a supporting technology for the other, which received the bulk of the funding and promotion.
Back then, the balance, at least in terms of language support, was clearly in favor of machine translation. This growing technology, capable of instantly translating large volumes of content, supported more languages than any text analytics platform on the market: most machine translation vendors offered around 40 languages, while the leading text analytics vendor supported just 10.
There’s a valid reason why…
Translations are virtually everywhere, and the parallel corpora required to build machine translation engines could be had for free, harvested from multilingual websites or simply generated by the world's translator community. Text analytics, by contrast, required a more specialized, and more expensive to produce, corpus before machine learning could generate production-ready engines.
Building standard text analytics engines such as language detection, named entity extraction, sentiment analysis or content classification requires an extensive effort to create the manually annotated corpora needed for the machine learning process.
And annotation is not a task like translation, which is taught in universities and schools and is a natural capability of any bilingual; it requires specialized, trained resources.
For classification, annotators have to plow through tens of thousands of paragraphs and assign each a category from a taxonomy. For named entity extraction, names of people, locations, organizations, titles, currencies and so on have to be identified in large corpora of text, extracted and categorized, so that the statistical text analytics engines will work properly.
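To make the annotation effort concrete, a single training record might look like the sketch below. The field names, character offsets and taxonomy label are purely illustrative assumptions, not any vendor's actual format; the point is that tens of thousands of such records are needed, per engine and per language.

```python
# A hypothetical annotated training record for named entity extraction and
# classification. Field names and the taxonomy label are illustrative only.
record = {
    "text": "Acme Corp opened a new office in Berlin in January.",
    "entities": [
        {"span": (0, 9), "type": "ORGANIZATION"},
        {"span": (33, 39), "type": "LOCATION"},
    ],
    "category": "Business/Expansion",  # one label from a predefined taxonomy
}

# Sanity check: every annotated span must point at exactly the text it labels.
for entity in record["entities"]:
    start, end = entity["span"]
    print(record["text"][start:end], "->", entity["type"])
```

Multiply the care needed to get each span and label right by the size of a production corpus, and the cost gap with freely harvested parallel text becomes obvious.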
That is a lot of work, for each engine, for each language. The fact that text analytics lagged behind machine translation some years ago was really no surprise.
What was the impact?
When large amounts of unstructured, multilingual data had to be processed and triaged, the process relied heavily on machine translation. The source data was automatically translated into a target language, typically one supported by a text analytics engine or the language of the end users, then structured using a monolingual text analytics engine, and finally triaged. The caveat was that huge amounts of data, basically all the data, had to be machine translated. Of that data, only a small portion would make it through the triage process and become useful, generating actionable insight, and an even smaller portion would actually be read, reviewed or corrected by a human analyst.
Now, with the adoption of neural networks in the past couple of years, both machine translation and text analytics require less training data while producing better results. For machine translation, neural or deep learning algorithms mostly mean less effort and significantly higher translation quality for a large set of languages already supported by the statistical engines. For text analytics, they make it possible to develop analysis engines for more languages, at a much lower cost. As a result, data intelligence projects rely less on machine translation and more and more on text analytics for triage and information extraction.
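The shift can be sketched as two pipeline orderings. The functions below are stand-in stubs, not real machine translation or text analytics components; the only point is where the expensive translation step sits relative to triage.

```python
translation_calls = 0

def machine_translate(doc: str) -> str:
    """Stub standing in for a (costly) machine translation engine."""
    global translation_calls
    translation_calls += 1
    return "[EN] " + doc

def is_relevant(doc: str) -> bool:
    """Stub standing in for a text analytics triage engine that can
    work directly in the source language."""
    return "priority" in doc

docs = ["priority report", "routine memo", "priority alert", "routine log"]

# Old pipeline: machine translate everything, then triage the translations.
translation_calls = 0
mt_first = [t for t in (machine_translate(d) for d in docs) if is_relevant(t)]
print("MT-first:", translation_calls, "translations")  # one per document

# Neural-era pipeline: triage in the source language, translate survivors only.
translation_calls = 0
analytics_first = [machine_translate(d) for d in docs if is_relevant(d)]
print("Analytics-first:", translation_calls, "translations")
```

Both orderings surface the same relevant documents, but the analytics-first flow only pays for translating the portion that survives triage, which is why richer language coverage in text analytics shifts the workload balance.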
Aren’t machine translation and text analytics competing even more?
Well, not really. Even if one natural language processing application might be used more than the other in a given data analytics project, machine translation and text analytics are really two sides of the same coin. Consider an ideal scenario for text analytics, where all the content to be processed and analyzed arrives in languages the text analytics platform supports. At first glance, machine translation would seem unnecessary: all the triage, classification and information extraction can be done relying solely on text analytics. Even so, a classified data set with entities extracted in all the source languages would not be of much use to a human analyst or a decision-support system unless the information were presented in the end user's target language. Imagine you are an English-speaking analyst working with Arabic data. Even with a text analytics engine that handles most natural language processing tasks directly in Arabic, reading the content the engine has flagged as important would still be impossible without translating it into English.
So even with minimal usage in this scenario, machine translation is still indispensable. And with most text analytics vendors currently supporting 10, or at most 20, languages, the scenario above is quite rare anyway. In real-world use cases, machine translation and text analytics complement one another and share the data processing responsibilities equally.
The complementary nature of machine translation and text analytics is not only observed operationally in deployed systems; it stems from the fact that both are natural language processing applications tackling automated content understanding by machines, with language at the center. Machine translation normalizes content from one language to another, while text analytics extracts information from content in a specific language. On its own, each application is limited in use; combined, they provide an end-to-end approach to automated content understanding.
Machine translation and text analytics have competed in the past for resources and investments, and this competition fed the belief that the two areas were at odds rather than complementary. In reality, either one on its own will struggle to deliver the full benefits the two provide when combined and treated as equal parts of the overall solution. At some point in the future, machine translation and text analytics will merge into one solution, because together they are critical to delivering the full and complete understanding that the use cases they serve require. Until then, organizations would be wise to treat the two as complementary and both necessary to achieve the desired end results.
George Bara is the director of strategic accounts, Government Solutions, SDL.