Researchers from DeepMind and Oxford University have presented PYTHIA, an epigraphic algorithm that reconstructs possible variants of lost text in ancient Greek inscriptions. The algorithm, built on an encoder and a decoder with long short-term memory, analyzes the surviving text and fills in the missing characters in context, using a known vocabulary. On average, the system makes fewer mistakes when predicting lost fragments than specialists in Greek epigraphy. A preprint describing the algorithm has been published on arXiv.org, and a brief account appears on the DeepMind blog.
A dedicated scientific discipline, epigraphy, deals with deciphering inscriptions on hard materials such as stone or marble. Because most monuments have not survived intact, specialists in this field need to restore the lost fragments of text. If only a few characters are missing, deciphering the text is not very difficult given knowledge of the original language and the historical context (monuments are often well dated, and many ancient languages have been studied in sufficient detail). The problem becomes harder when there are many gaps: resolving the ambiguity then requires using the context of the fragments that survive on the monument.
The new algorithm, developed by researchers led by Yannis Assael of DeepMind, is well suited to cases where restoring lost fragments of text can take a long time simply because the writing is ambiguous and there are many candidate readings. To train the algorithm, they used the PHI corpus of ancient Greek texts, which covers texts dating from the seventh century BC to the fifth century AD.
Based on PHI, the scientists assembled a new corpus, PHI-ML. For it, the researchers compiled a frequency dictionary of all characters that occur in the texts and used it to define a basic "alphabet" of 147 characters. It includes all letters of the alphabet, punctuation marks and other auxiliary marks (for example, the marker of vowel length) and, in addition, a dash to indicate a missing character and a question mark to denote the characters that the model must then predict. The linguistic annotations made by the compilers of the original corpus were also removed. In total, PHI-ML contains 3.2 million words.
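The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `build_alphabet` and `mask_span`, the toy texts, and the choice to reserve the last two alphabet slots for the dash and the question mark are all hypothetical.

```python
from collections import Counter

MISSING = "-"   # a character lost on the monument
PREDICT = "?"   # a character the model is asked to restore


def build_alphabet(texts, max_size=147):
    """Build a character alphabet from a frequency count over all texts.

    The two special symbols are always appended at the end, so at most
    max_size - 2 ordinary characters are kept (an assumption of this sketch).
    """
    counts = Counter(ch for text in texts for ch in text)
    # The special symbols are added explicitly, so exclude them from the count.
    counts.pop(MISSING, None)
    counts.pop(PREDICT, None)
    frequent = [ch for ch, _ in counts.most_common(max_size - 2)]
    return frequent + [MISSING, PREDICT]


def mask_span(text, start, length):
    """Replace text[start:start+length] with question marks.

    Returns the masked text and the hidden characters, which can serve
    as a (model input, expected output) training pair.
    """
    target = text[start:start + length]
    masked = text[:start] + PREDICT * length + text[start + length:]
    return masked, target


# Toy usage with unaccented Greek text (not taken from PHI-ML).
texts = ["μηνιν αειδε θεα", "ανδρα μοι εννεπε"]
alphabet = build_alphabet(texts)
masked, target = mask_span(texts[0], 6, 5)
print(masked)  # μηνιν ????? θεα
print(target)  # αειδε
```

Hiding a known span in this way yields training pairs for which the correct restoration is available, which is one plausible way to train and evaluate a model on text that is already damaged.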
The PYTHIA algorithm (named after the ancient Greek priestess Pythia, who, according to legend, possessed the gift of divination) consists of an encoder and a decoder, each based on a neural network with long short-term memory (LSTM). The algorithm receives input text in which missing characters are replaced by dashes and the characters to be predicted are replaced by question marks. Initially, the required characters are predicted using a table of their vector representations: roughly speaking, the empty positions are filled with the most probable letters according to their frequency. In addition, to improve quality, the system was given a dictionary of the 100,000 most frequent lemmas in the corpus, which the algorithm also consults when making the final prediction.
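To give a feel for how a frequency dictionary can narrow down the candidates for positions marked with question marks, here is a deliberately simplified sketch. It is not the LSTM encoder-decoder described above and does not reproduce PYTHIA's scoring: it ranks dictionary words that fit a masked pattern by raw frequency. The names `fits` and `rank_candidates` and the toy counts are hypothetical.

```python
PREDICT = "?"  # a character to be restored


def fits(pattern, word):
    """True if word has the pattern's length and matches every known character."""
    return len(pattern) == len(word) and all(
        p == PREDICT or p == w for p, w in zip(pattern, word)
    )


def rank_candidates(pattern, word_counts, top_k=3):
    """Return up to top_k dictionary words that fit the pattern.

    Candidates are sorted by descending frequency; ties are broken
    alphabetically so the result is deterministic.
    """
    candidates = [(w, c) for w, c in word_counts.items() if fits(pattern, w)]
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in candidates[:top_k]]


# Toy frequency dictionary (made-up counts, unaccented forms).
word_counts = {"θεος": 50, "θεοι": 30, "θεας": 20, "δημος": 40}
print(rank_candidates("θε??", word_counts))  # ['θεος', 'θεοι', 'θεας']
```

A neural model differs from this baseline in that it scores candidates using the surrounding context rather than frequency alone, which is what allows it to resolve ambiguous gaps.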