Prof. Dr. Hannah Bast, Albert-Ludwigs-Universität Freiburg
Within the DFG priority programme "Algorithm Engineering" (2007 - 2013), we have developed semantic full-text search, a deep integration of full-text and ontology search. This semantic search can cope with queries of the kind "german researchers who work on algorithms", which work even when part of the required information is contained in an ontology (e.g. profession and nationality) and part only in text documents (e.g. research interests).
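The principle behind such queries can be sketched as follows. This is a deliberately simplified toy model, not the actual system: the data, the entity names, and the function `semantic_search` are purely illustrative. It shows how ontology facts (profession, nationality) and a full-text index (research interests) can be combined to answer the example query.

```python
# Toy sketch of combining ontology facts with a full-text index.
# All data and names below are illustrative, not from the real system.

# Ontology facts: entity -> set of (predicate, value) pairs.
ontology = {
    "Researcher A": {("profession", "researcher"), ("nationality", "german")},
    "Researcher B": {("profession", "researcher"), ("nationality", "american")},
}

# Inverted index over text documents: word -> set of entities
# mentioned in documents containing that word.
text_index = {
    "algorithms": {"Researcher A", "Researcher B"},
    "typography": {"Researcher B"},
}

def semantic_search(facts, words):
    """Entities satisfying all ontology facts AND co-occurring
    with all query words in the text collection."""
    candidates = {e for e, fs in ontology.items() if facts <= fs}
    for w in words:
        candidates &= text_index.get(w, set())
    return candidates

# "german researchers who work on algorithms":
result = semantic_search(
    {("profession", "researcher"), ("nationality", "german")},
    ["algorithms"],
)
print(result)  # -> {'Researcher A'}
```

The actual system replaces both dictionaries with compressed index data structures to reach the query times reported below; the set intersection at query time is the part that must scale.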
So far, we achieve query times on the order of 100 milliseconds for an integration of the English Wikipedia (20 GB of text) with the YAGO ontology (20 million facts). Within the new priority programme "Algorithms for Big Data", we want to bring this semantic search to the level of big data, with up to 100 times larger data sets, that is, text collections of up to 2 TB and ontologies with up to 2 billion facts.

The four core aspects of semantic search are: ontologies, entity recognition, natural language processing, and indexing. To cope with big data, a proportional increase in computing power combined with basic parallelization is not sufficient for any of these four aspects. Instead, all four require new approaches and new algorithmic ideas. This will be explained in detail in the proposal.

The goals of our project are threefold: new approaches and algorithms, extensive experimental evaluation, and high-quality software. In particular, we will, as in the previous priority programme, provide a fully functional prototype of our search that demonstrates the feasibility and practicability of our new approaches. We will also address the issue of reproducibility. This is particularly challenging in the big data scenario: the large amounts of data cannot easily be transferred, and reproducing the necessary precomputations requires a huge (and usually unacceptable) effort on the part of a third party.