Returning to our presentation at the Annual Conference 2025: “Towards FAIR Metadata for Specialised Corpora: A Community-Informed Empirical Study of Schema Development in Two Communities” (slides), I wanted to link relevant Topics (from this forum):
but also emphasise that the usability of the annotated data is unclear. Regarding community-specific metadata, with its domain-specific complexity:
should it be integrated into existing infrastructures like the VLO?
(how) can it be integrated into/connected to the Resource Families?
how can it be made accessible (displayed/searched/browsed/…) to end users without reducing it to a common denominator?
and since this will (potentially) become relevant for a few more communities (~= K-centres, and likely C-centres that would host the metadata, or B-centres), the question still remains:
How can CLARIN (ERIC / B,C,K-centres) support each other and the community efforts [for community-specific metadata]?
Dear Egon, to my knowledge there is indeed “only” CMDI as the commonly agreed metastandard, but profiles (essentially XML schemas) vary. In Finland we use profiles derived from the META-SHARE schema in COMEDI. COMEDI currently exports them as clarin.eu:cr1:p_1361876010571. So for CLARIN you should use CMDI. But the world is bigger than CLARIN. I understood your talk as touching on the question: if we have metadata at different levels of granularity, how do we make sure that we are dealing with variants of the same thing? Example (PID-1 is here the placeholder for a Handle used in CLARIN):
Very detailed metadata of dataset with “PID-1”: Contains names of subjects, etc. Sensitive.
Pseudonymized version of the detailed metadata of the dataset with “PID-1” above. Less sensitive.
CMDI of dataset with PID-1: Describes the dataset, where and when and how it was created. Public, shown in VLO.
EOSC-compatible HTML metadata of PID-1 in the VLO landing page. Subset of CMDI. Increases the FAIR score reported by FAIR assessment tools.
Subset of dataset with PID-1 exported to national service, like etsin.fairdata.fi. (The Language Bank data is available there for search)
Reference to the dataset with PID-1 in an article. Has author, year, name, repository, PID-1 (a very small subset of the metadata).
My suggestion would now be the following: PID-1 points to the CMDI descriptive metadata at the repository’s metadata service (COMEDI in our case). This is the “master metadata”; subsets and supersets must be in sync with the data provided there. So if the superset of very detailed metadata mentions the name of the dataset and it is not identical to the name in the CMDI, the CMDI is the authoritative source.
Supersets should therefore not copy too much of the authoritative metadata, since it can always be found behind the PID.
The same holds true for subsets, like reference instructions. All subsets and supersets need to contain the dataset PID (“PID-1”) as a clear link between them. The CMDI metadata points to the data (via resource proxy). These pointers are also authoritative.
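To make the principle a bit more concrete, here is a minimal Python sketch of such a consistency check. It is only an illustration under assumptions: the records are plain dictionaries with hypothetical field names (`pid`, `name`, `year`, `repository`), whereas real CMDI records are XML and would first have to be mapped to something like this. The rule it encodes is the one above: every derived record must carry the PID, and wherever it repeats a field of the master metadata, the master value wins.

```python
# Hypothetical flat view of the authoritative CMDI record behind PID-1.
MASTER = {
    "pid": "PID-1",
    "name": "Example Corpus",
    "year": 2025,
    "repository": "Example Repository",
}


def check_derived(master, derived):
    """Return a list of problems found in a subset or superset record."""
    # Without the shared PID there is no link to the master metadata at all.
    if derived.get("pid") != master["pid"]:
        return ["missing or wrong PID"]

    problems = []
    for field, value in derived.items():
        # Only fields that also exist in the master are compared;
        # extra fields of a superset (e.g. subject details) are left alone.
        if field in master and master[field] != value:
            problems.append(
                f"{field}: {value!r} differs from authoritative {master[field]!r}"
            )
    return problems


citation = {"pid": "PID-1", "name": "Example Corpus", "year": 2025}
detailed = {"pid": "PID-1", "name": "Example corpus v2", "subjects": ["S01", "S02"]}
no_pid = {"name": "Example Corpus"}

print(check_derived(MASTER, citation))  # []
print(check_derived(MASTER, detailed))  # one problem: the name differs
print(check_derived(MASTER, no_pid))    # ['missing or wrong PID']
```

The check deliberately ignores fields that exist only in the superset, which fits the idea that supersets should add detail rather than repeat the authoritative metadata.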
If we can agree on this principle, we can think about how to implement it. Descriptive metadata does not change extremely often, but it does change, for example when an error made at creation is detected later. So mechanisms should be in place to deal with such changes.
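One possible mechanism, sketched here purely as an assumption and not as something COMEDI or the VLO does today, would be for each derived record to store a fingerprint of the master metadata it was generated from. When the master changes, the fingerprint no longer matches and the derived copy can be flagged for regeneration. The field name `master_fingerprint` and the dictionary representation are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable SHA-256 hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(master, derived):
    """True if the derived record was built from a different master version."""
    return derived.get("master_fingerprint") != fingerprint(master)


master = {"pid": "PID-1", "name": "Example Corpus", "year": 2024}

# A citation record generated from the current master metadata.
citation = {
    "pid": master["pid"],
    "name": master["name"],
    "master_fingerprint": fingerprint(master),
}
print(is_stale(master, citation))  # False

# The creation year turns out to be wrong and is corrected in the master.
master["year"] = 2025
print(is_stale(master, citation))  # True
```

This only detects that something changed, not what changed; deciding whether a derived record actually needs updating would still require a field-by-field comparison against the master.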