LCC's FCS Endpoint for NoSketchEngine

The Leipzig Corpora Collection uses the NoSketch Engine with a custom FCS Endpoint implementation to connect corpora in various languages to the CLARIN Federated Content Search (FCS). The endpoint supports both basic full-text search and advanced search with FCS-QL for building complex queries. Our implementation also supports multilingual resource metadata and search result links.
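To illustrate the difference between the two search modes, here is a minimal sketch of how a client might build request URLs for an FCS endpoint. It assumes the usual SRU parameter names used by CLARIN FCS (`operation`, `queryType`, `query`, `maximumRecords`); the endpoint address and the helper `search_url` are hypothetical, and a real endpoint may expect further parameters.

```python
from urllib.parse import urlencode

# Hypothetical address; substitute the URL of a real FCS endpoint.
ENDPOINT = "https://fcs.example.org/endpoint"


def search_url(query: str, advanced: bool = False, max_records: int = 10) -> str:
    """Build a searchRetrieve URL for a basic (CQL) or advanced (FCS-QL) search."""
    params = {
        "operation": "searchRetrieve",
        # "cql" selects basic full-text search, "fcs" selects FCS-QL.
        "queryType": "fcs" if advanced else "cql",
        "query": query,
        "maximumRecords": max_records,
    }
    # urlencode percent-encodes brackets, quotes and "&" inside the query.
    return f"{ENDPOINT}?{urlencode(params)}"


basic = search_url("Leipzig")
advanced = search_url('[lemma = "walk" & pos = "VERB"]', advanced=True)
print(basic)
print(advanced)
```

The basic query is a plain search term, while the FCS-QL query constrains individual annotation layers of a token.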

Custom NoSketch Engine

The NoSketchEngine has been modified to better support FCS endpoints by providing more corpus information for the FCS. Custom configuration options have been added:

  • HANDLE to support handles/PIDs/DOIs/…,
  • FCSREFS for custom result information (e.g., allowing results to link back to sentences in the NoSketch Engine, or supporting differing corpus structures),
  • FCSINFOS to configure the FCS endpoint with multilingual resource metadata (title, description, institution, and landing page) that might differ from the names and descriptions used in the NoSketch Engine.

All changes have been documented (see list of changes) to enable easy comparison with the official sources and upstream git repository, and to quickly identify changes. There are various other changes (primarily UI) besides those for the FCS endpoint.

FCS Endpoint

The FCS endpoint implementation supports some default layers, e.g., word (required), lemma, pos (with pos_ud17), and lc (required) / lemma_lc. Adding support for additional layers is possible; however, using completely different word/sentence structures for corpora in the NoSketch Engine may require deeper changes.
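As a rough illustration of what such layer support involves, the sketch below maps a single-token constraint onto a NoSketch Engine attribute. The mapping table and the function `to_cql_token` are assumptions made for this example and are not taken from the endpoint's source; in particular, which attribute backs which FCS layer is a guess based on the attribute names listed above.

```python
# Hypothetical mapping from query layer to NoSketch Engine attributes:
# (case-sensitive attribute, lowercased attribute or None if there is none).
LAYER_ATTRS = {
    "text": ("word", "lc"),
    "lemma": ("lemma", "lemma_lc"),
    "pos": ("pos_ud17", None),
}


def to_cql_token(layer: str, value: str, ignore_case: bool = False) -> str:
    """Render one token constraint as a NoSketch Engine corpus query."""
    if layer not in LAYER_ATTRS:
        raise ValueError(f"unsupported layer: {layer}")
    if '"' in value:
        # Quoting and regular expressions are out of scope for this sketch.
        raise ValueError("values containing double quotes are not handled")
    attr, lower_attr = LAYER_ATTRS[layer]
    if ignore_case and lower_attr is not None:
        # Use the lowercased attribute and lowercase the literal value to match it.
        attr, value = lower_attr, value.lower()
    return f'[{attr}="{value}"]'


print(to_cql_token("lemma", "walk"))
print(to_cql_token("text", "Haus", ignore_case=True))
```

A corpus with different attribute names would need a different table, which is one reason why differing corpus structures can require deeper changes.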

The endpoint has some basic update logic to detect changes to PIDs and update the FCS Endpoint Description. There are various other configuration options to fine-tune which corpora of the NoSketch Engine are exposed via the FCS endpoint (e.g., whether to restrict to corpora with `HANDLE` configuration).
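The idea behind this update logic can be sketched as a comparison of two snapshots of the exposed corpora. This is only an illustration under assumptions: the `Corpus` class, the `require_handle` switch, and both functions are hypothetical names, not the endpoint's actual code or configuration keys.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

_MISSING = object()  # marks a corpus that is absent from a snapshot


@dataclass(frozen=True)
class Corpus:
    name: str
    handle: Optional[str] = None  # PID from the HANDLE option, if configured


def exposed(corpora: Iterable[Corpus], require_handle: bool = True) -> Dict[str, Optional[str]]:
    """Map corpus name to PID for every corpus that should be exposed."""
    return {c.name: c.handle for c in corpora if c.handle or not require_handle}


def changed_corpora(previous: Dict[str, Optional[str]], current: Dict[str, Optional[str]]) -> List[str]:
    """List corpora that were added, removed, or whose PID changed."""
    names = previous.keys() | current.keys()
    return sorted(
        n for n in names
        if previous.get(n, _MISSING) != current.get(n, _MISSING)
    )


old = exposed([Corpus("deu_news", "hdl:1"), Corpus("eng_news")])
new = exposed([Corpus("deu_news", "hdl:2"), Corpus("fra_news", "hdl:3")])
print(changed_corpora(old, new))
```

A non-empty result would be the signal to rebuild the Endpoint Description; an empty list means the cached description is still valid.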

The source code has documentation and example configuration files to quickly bootstrap your own setup.

Resources

Thanks for sharing - this is very useful information. We’re planning to set up a NoSketchEngine instance here in Oxford, and to promote it as a solution to other CLARIN-UK partners, and FCS integration gives lots of added value.

If there are questions, please don’t hesitate to reach out. :slight_smile:

Our setup might not work for everyone, as the structure of corpora in the (No)SketchEngine can differ a lot; there is a lot of freedom. But hearing about other setups would also be interesting for us, as it might help us improve or update our endpoint implementation to take other use cases into account.

We also modified our NoSketchEngine deployment to provide a bit more metadata about the resources, since we wanted to keep data and metadata in one place. The NoSketchEngine provides the list of resources and their descriptions, and the endpoint more or less only forwards this information; it has no static list of “known” corpora. However, the FCS endpoint could also be set up to provide its own list of resources and their descriptions instead of retrieving them dynamically from the NoSketchEngine. (That is not yet implemented because our requirements were different, but it would not be too difficult to change.)