The Leipzig Corpora Collection uses the NoSketch Engine with a custom FCS endpoint implementation to connect corpora in various languages to the CLARIN Federated Content Search (FCS). The endpoint supports both basic full-text searches and advanced searches in FCS-QL for building complex queries. Our implementation also supports multilingual resource metadata and search result links.
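To illustrate the two search modes, the following minimal sketch builds SRU `searchRetrieve` URLs for a basic full-text query and for an FCS-QL query. The base URL is a placeholder, the helper name `build_search_url` is hypothetical, and the parameter names (`queryType`, `x-fcs-context`) reflect our reading of the SRU 2.0 and CLARIN FCS 2.0 specifications; check them against the endpoint you are querying.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace it with the address of a real FCS endpoint.
ENDPOINT = "https://example.org/fcs"


def build_search_url(query, query_type="cql", resource_pid=None, maximum_records=10):
    """Build an SRU searchRetrieve URL for an FCS endpoint.

    query_type "cql" requests a basic full-text search,
    query_type "fcs" requests an advanced FCS-QL search.
    """
    if query_type not in ("cql", "fcs"):
        raise ValueError("query_type must be 'cql' or 'fcs'")
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,
        "query": query,
        "maximumRecords": str(maximum_records),
    }
    if resource_pid is not None:
        # Assumption: x-fcs-context restricts the search to the given resource PID.
        params["x-fcs-context"] = resource_pid
    return ENDPOINT + "?" + urlencode(params)


# Basic search for a word form.
basic_url = build_search_url("Haus")
# FCS-QL search: the lemma "Haus" followed by a verb.
advanced_url = build_search_url('[lemma = "Haus"] [pos = "VERB"]', query_type="fcs")

print(basic_url)
print(advanced_url)
```

`urlencode` percent-encodes the brackets and quotes of the FCS-QL query, so the resulting URL can be passed directly to an HTTP client.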
Custom NoSketch Engine
The NoSketch Engine has been modified to better support FCS endpoints by providing more corpus information for the FCS. Custom configuration options have been added:
- `HANDLE` to support handles/PIDs/DOIs/…
- `FCSREFS` for custom result information (e.g., allowing results to link back to sentences in the NoSketch Engine, or supporting differing corpus structures)
- `FCSINFOS` to configure the FCS endpoint with multilingual resource metadata (title, description, institution, and landing page) that may differ from the names and descriptions used in the NoSketch Engine
All changes have been documented (see list of changes) so that they can be identified quickly and compared easily with the official sources and the upstream git repository. Besides the changes for the FCS endpoint, there are various others (primarily to the UI).
FCS Endpoint
The FCS endpoint implementation supports some default layers (e.g., word (required), lemma, pos (with pos_ud17), and lc (required) / lemma_lc). Support for additional layers can be added; however, corpora in the NoSketch Engine that use completely different word/sentence structures may require deeper changes.
The endpoint has some basic update logic to detect changes to PIDs and update the FCS Endpoint Description accordingly. Various other configuration options allow you to fine-tune which corpora of the NoSketch Engine are exposed via the FCS endpoint (e.g., whether to restrict them to corpora with a `HANDLE` configuration).
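The idea behind the update logic and the `HANDLE` restriction can be sketched as follows. This is not the endpoint's actual code (which builds on CLARIN's Java libraries) but a minimal illustration under stated assumptions: the corpus names, handle values, the dictionary layout, and the helpers `exposed_corpora` and `diff_pids` are all hypothetical.

```python
def exposed_corpora(corpora, require_handle=True):
    """Map corpus name -> handle for the corpora that should be exposed.

    With require_handle=True, corpora without a HANDLE option are skipped.
    """
    return {
        name: options.get("HANDLE")
        for name, options in corpora.items()
        if options.get("HANDLE") or not require_handle
    }


def diff_pids(previous, current):
    """Return (added, removed, changed) corpus names between two snapshots."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(
        name
        for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return added, removed, changed


# Hypothetical corpus options at two points in time.
before = {
    "deu_news_2020": {"HANDLE": "hdl:example/1"},
    "eng_news_2020": {},
}
after = {
    "deu_news_2020": {"HANDLE": "hdl:example/2"},
    "eng_news_2020": {"HANDLE": "hdl:example/3"},
}

added, removed, changed = diff_pids(exposed_corpora(before), exposed_corpora(after))
# Any difference means the Endpoint Description should be regenerated.
needs_refresh = bool(added or removed or changed)
print(added, removed, changed, needs_refresh)
```

Comparing snapshots of the exposed corpora keeps the check cheap: the Endpoint Description only needs to be rebuilt when a corpus appears, disappears, or changes its PID.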
The source code has documentation and example configuration files to quickly bootstrap your own setup.
Resources
- NoSketch Engine - dockerized deployment, with custom configuration options
- FCS Endpoint implementation - based on CLARIN’s Java libraries
- FCS Endpoint (live)