Interlingua (ISO 639 language codes ia, ina) is a naturalistic planned Italic international auxiliary language (IAL), developed between 1937 and 1951 by the International Auxiliary Language Association (IALA). Its vocabulary and grammar are derived from a wide range of western European natural languages. Interlingua was developed to combine a simple, mostly regular grammar with a vocabulary common to English, French, Italian, Spanish and Portuguese. These characteristics make it especially easy to learn for those whose native languages were sources of Interlingua's vocabulary and grammar. Interlingua can also be used as a rapid introduction to many natural languages. Written Interlingua is largely comprehensible to the hundreds of millions of people who speak Romance languages.
Despite its increasing popularity, Interlingua is not supported by online translation tools such as Google Translate. The purpose of the Interlingua-English Translator is to bridge this gap by providing a computer program that translates text between English and Interlingua. The hope is that this tool will serve as a valuable educational resource for Interlingua learners and help spread awareness of Interlingua to a broader audience.
Yes, it is! You can find a link to the data (the Interlingua Corpus Project), as well as source code, in the "more resources" tab.
Written by Jason Ding on August 14th, 2021
I perform all work on this research project. I am directly supervised by Dr. Todd Mockler, a Principal Investigator at the Danforth Science Center.
The project was started in May of 2020 when I was an incoming junior in high school. The final goal of the project was to create an Interlingua-English Translator.
I came into the project with no knowledge of either Interlingua or Neural Networks. However, I devised and followed a plan to make the translator come to life.
The project has three main phases.
First, I needed to construct a large collection of Interlingua sentences. To do this, I learned how to create a web crawler (i.e., a computer program that automatically searches through the Internet) that extracts any sentences written in Interlingua from both a website's HTML and its downloadable documents. I taught myself to use various Python modules, such as BeautifulSoup, requests, and os, while also learning and inventing techniques for tasks such as accurately separating sentences from paragraphs. The final version of my web crawler program visited 6,373,297 websites and collected over 1,200,000 unique Interlingua sentences.
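As a rough illustration of the sentence-extraction step, here is a minimal sketch. It is not the project's actual crawler: it uses Python's standard-library html.parser in place of BeautifulSoup so that it runs without extra installs, it does not fetch pages or detect the language, and the names TextExtractor, split_sentences, extract_sentences, and the ABBREVIATIONS list are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block-level tags start a new paragraph

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "sr.", "etc."}


def split_sentences(paragraph):
    """Split after . ! or ? when followed by whitespace and a capital letter,
    then re-join pieces that were split right after a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-ZÀ-Ý])", paragraph.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return [s for s in sentences if s]


def extract_sentences(html):
    """Return the unique sentences found in an HTML string, in page order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    seen, result = set(), []
    for block in "".join(parser.parts).split("\n"):
        block = " ".join(block.split())  # collapse runs of whitespace
        if not block:
            continue
        for sentence in split_sentences(block):
            if sentence not in seen:
                seen.add(sentence)
                result.append(sentence)
    return result
```

The abbreviation check is one example of the edge cases that make sentence splitting harder than it looks: without it, a title such as "Dr." followed by a name would be cut into two sentences.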
Second, I needed to collect as many matched pairs of Interlingua-English sentences as I could. To do this, I created a parallel sentence extractor program. The program takes as input a pair of texts that are near translations of each other and outputs the individual pairs of parallel sentences between the two texts. For example, I have used my program to extract parallel sentences from the Bible and the Book of Mormon. The key challenge in building the program was identifying and rectifying edge cases that cause false positives and negatives, such as when one of the parallel texts skips a certain sentence and the other doesn’t. Thus far, I have used my program to extract over 80,000 parallel English-Interlingua sentences. (Note: the data collected in these first two steps is available for free through the Interlingua Corpus Project, which is linked in the "More Resources" section below.)
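To make the skipped-sentence problem concrete, here is a minimal sketch of one common way to align two sentence lists. It is a simplified length-based dynamic-programming alignment, loosely in the spirit of Gale and Church, and is not necessarily how the project's extractor works. The function names, the length_cost measure, and the skip_penalty value of 0.4 are all assumptions chosen for illustration.

```python
def length_cost(a, b):
    """Relative difference in character length: 0.0 for equal lengths,
    approaching 1.0 as the lengths diverge."""
    longer = max(len(a), len(b))
    return abs(len(a) - len(b)) / longer if longer else 0.0


def align_sentences(source, target, skip_penalty=0.4):
    """Pair up sentences from two lists, allowing either side to skip a
    sentence at a fixed cost (skip_penalty is an assumed value)."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: cheapest way to align the first i source sentences
    # with the first j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = cost[i][j]
            if here == inf:
                continue
            moves = []
            if i < n and j < m:  # pair source[i] with target[j]
                moves.append((i + 1, j + 1, here + length_cost(source[i], target[j])))
            if i < n:            # source[i] has no counterpart
                moves.append((i + 1, j, here + skip_penalty))
            if j < m:            # target[j] has no counterpart
                moves.append((i, j + 1, here + skip_penalty))
            for ni, nj, c in moves:
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk back from the end, keeping only the diagonal (paired) moves.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((source[pi], target[pj]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Sentence length alone is a weak signal, so a sketch like this will produce false positives on texts where neighboring sentences have similar lengths. A practical extractor would need additional evidence, such as shared names, numbers, or verse markers.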
Third, I used the data gathered from the first two steps to train a neural machine translation (NMT) system, or more specifically, a recurrent neural network (RNN) translator, that can translate between English and Interlingua. I taught myself how to use PyTorch and CUDA, and I learned, all from scratch, to use git, terminal shell commands, Jupyter, HTML, and Google Colab, to deploy Python on Heroku, to use the Google and Dropbox APIs, and to connect to remote servers.
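For readers unfamiliar with RNN translators, the following is a minimal sketch of an encoder-decoder model in PyTorch. The page does not describe the translator's actual architecture, so everything here is an assumption made for illustration: the use of GRU layers, a single shared hidden size, teacher forcing, and the names Encoder, Decoder, and training_step. Tokenization, attention, and decoding at inference time are left out.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Reads a source sentence and summarizes it as a hidden state."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, src_len) of token ids
        _, hidden = self.gru(self.embedding(tokens))
        return hidden  # (1, batch, hidden_size)


class Decoder(nn.Module):
    """Predicts each target token from the previous tokens and the
    encoder's hidden state."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden):
        output, hidden = self.gru(self.embedding(tokens), hidden)
        return self.out(output), hidden  # logits: (batch, tgt_len, vocab_size)


def training_step(encoder, decoder, optimizer, src, tgt):
    """One teacher-forced update. tgt is assumed to start with a
    start-of-sentence id and end with an end-of-sentence id."""
    optimizer.zero_grad()
    hidden = encoder(src)
    # Feed all target tokens except the last; predict all except the first.
    logits, _ = decoder(tgt[:, :-1], hidden)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1)
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```

Moving the models and tensors to a GPU with .to("cuda") is what makes training on a corpus of tens of thousands of sentence pairs practical, which is presumably where CUDA and Google Colab come in.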
In August of 2021, after around 16 months of work, I released the first version of both the translator and the Interlingua corpus.
The data collected in the corpus and translator projects is free for anyone to use, for any purpose. However, it is requested that credit or a link to the project pages be given.