
kirunews

Király Péter, search, Lucene, Solr, Java, Perl, PHP, OAI-PMH, web development, digital library, MARC, FRBR, RDA, Drupal, EAD, EAC, Europeana, eXtensibleCatalog.org, MEK, and much more.

Solr query facets in Europeana

2014.02.13. 00:56 kirunews

At Europeana we use Apache Solr for searching. Our data model is called EDM (Europeana Data Model), in which a real record* has two main parts: the metadata object, which contains information about an object stored in one of the 2400 cultural heritage institutions all over Europe, and the contextual entities, which store information about the agents, places, concepts and timespans that occur in that metadata object. The model has almost 200 fields, and we index all of them in Solr. We also have some special fields for facets, and some aggregated fields that combine other fields. For example, the "who" field contains the metadata object's dc:creator and dc:contributor, and the agent object's skos:prefLabel, skos:altLabel and foaf:name fields, in order to give the user a single field for searching personal names. For more information please consult our EDM and Europeana API documentation.

One of Europeana's important aims is to make the rights statements of records clear and straightforward. As you can imagine, the 2400 partners have different approaches to licensing their objects, and right now we have 60+ different licence types in the database; in other words, the RIGHTS facet has 60+ individual values. Some of them are language or version variations of the same CC licence. It turned out that most users don't want to select from that range of options. Fortunately, we can group these rights statements into 3 main categories:

  • freely reusable with attribution (CC0, CC BY, CC BY SA)
  • reusable with some restrictions (CC BY NC, CC BY NC SA, CC BY NC ND, CC BY ND, OOC NC)
  • reusable only with permissions (licences of the Europeana Rights Framework)

What we wanted to achieve was to form a new facet from these options. The most straightforward solution, i.e. creating a new field in Solr, was not easy to implement, because it would require a full reindexing (explaining why that was not possible would take another blog entry), so we had to look for another solution. Counting the records that belong to the individual rights statements in the RIGHTS facet would work, but that is only good for display, and it doesn't cover the problem of user interaction. Using the RIGHTS field for search turned out to be risky, because it interferes with the RIGHTS facet, so that did not work either. Finally we came up with a fake facet, which has two sides: one on the display side, and one on the search side.

[Figure: Facets including the new reusability ("Can I use it?") facet on Europeana.eu]

To count the numbers we use a special Solr facet type: the query facet. It is a simple and at the same time powerful solution. Unlike a normal facet, it doesn't give you a list of existing field values, each with a number (which tells you how many records have that term given the main query). In a query facet the input is a query, and the return value is a number, which tells you how many records match the combination of the main query and the query specified in the facet's parameter. Since we don't need to know the list of items in the categories, that's enough for us. We defined three queries:

  • RIGHTS:("CC0" OR "CC BY" OR "CC BY SA")
  • RIGHTS:("CC BY NC" OR "CC BY NC SA" OR "CC BY NC ND" OR "CC BY ND" OR "OOC NC")
  • RIGHTS:(NOT(
          "CC0" OR "CC BY" OR "CC BY SA"
    OR "CC BY NC" OR "CC BY NC SA" OR "CC BY NC ND" OR "CC BY ND" OR "OOC NC"))
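The three queries above follow one pattern, so they could be generated from the two label lists instead of being written by hand. The following is only a sketch in Python: the helper names (`any_of`, `category_queries`) are made up for illustration, and the category names "open", "restricted" and "permission" are the ones used later in this post.

```python
# Rights labels per category, copied from the lists above.
OPEN = ["CC0", "CC BY", "CC BY SA"]
RESTRICTED = ["CC BY NC", "CC BY NC SA", "CC BY NC ND", "CC BY ND", "OOC NC"]


def any_of(values):
    """Join quoted values with OR, e.g. '"CC0" OR "CC BY"'."""
    return " OR ".join(f'"{value}"' for value in values)


def category_queries(field="RIGHTS"):
    """Return the query string for each reusability category."""
    return {
        "open": f"{field}:({any_of(OPEN)})",
        "restricted": f"{field}:({any_of(RESTRICTED)})",
        # Everything that is neither open nor restricted needs permission.
        "permission": f"{field}:(NOT({any_of(OPEN + RESTRICTED)}))",
    }


if __name__ == "__main__":
    for name, query in category_queries().items():
        print(name, "->", query)
```

Keeping the labels in two lists means the third (negated) query can never drift out of sync with the first two.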

In reality we use URLs, not string literals, in the database, but the logic is the same. At the end of the blog entry I'll show you the real queries as well. There is a not so well known gem in Solr: you can tag your parameters, and those tags will be included in the return value. Some tags have predefined meanings, but you can also add custom tags, which Solr ignores operationally, so they don't affect the search itself. We use two attributes in our tag, id and ex:

&facet.query={!id=REUSABILITY:open ex=REUSABILITY}
   RIGHTS:("CC0" OR "CC BY" OR "CC BY SA")
  • ex - this is a standard tag, and stands for "exclude". It means that this query will ignore the filter tagged as REUSABILITY. This makes it possible for users who filter on one of these 3 categories to still see the correct numbers for all of them.
  • id - a custom tag that we use as an identifier. It helps us identify the query when we retrieve the result: it is easier to find than the rather complicated Solr query. With a simple regex we can parse the query facets in the response and link each number to the category it belongs to.
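The regex step mentioned for the id tag could look roughly like this in Python. It is a sketch, not our production code: it assumes the query facet counts arrive as a mapping from the full facet.query string (local parameters included) to a count, as in the facet_queries part of Solr's JSON response, and the function name and the counts in the sample are made up.

```python
import re

# Captures the category name from a key that starts with
# "{!id=REUSABILITY:<category> ...}".
ID_PATTERN = re.compile(r"^\{!id=REUSABILITY:(\w+)")


def reusability_counts(facet_queries):
    """Map category names to counts, skipping unrelated query facets."""
    counts = {}
    for key, count in facet_queries.items():
        match = ID_PATTERN.match(key)
        if match:
            counts[match.group(1)] = count
    return counts


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = {
        '{!id=REUSABILITY:open ex=REUSABILITY}RIGHTS:("CC0" OR "CC BY")': 10,
        '{!id=REUSABILITY:restricted ex=REUSABILITY}RIGHTS:("CC BY NC")': 5,
    }
    print(reusability_counts(sample))
```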

When the user selects an item in this reusability facet, the same query runs, but now as a filter. It affects the whole result set: the number of records, and the real facets. Its format is something like this:

&fq={!tag=REUSABILITY}RIGHTS:("CC0" OR "CC BY" OR "CC BY SA")
  • tag has the same role as id in the query facet. (The difference is that it is a standard Solr tag, while id is our custom solution. Unfortunately the query facet doesn't support the tag attribute, so we had to find a custom one.) Here we identify this filter, and it will be ignored by those queries that refer to it through the ex attribute.
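Putting the two sides together, the facet.query and fq parameters could be assembled like this. This is a minimal Python sketch with hypothetical names (`reusability_params`, `queries`, `selected`); only the {!id=... ex=...} and {!tag=...} prefixes come from the examples above.

```python
from urllib.parse import urlencode


def reusability_params(queries, selected=None):
    """Build Solr parameters for the fake facet.

    queries  -- mapping of category name to Solr query string
    selected -- category chosen by the user, or None if no filter is active
    """
    params = []
    for name, query in queries.items():
        # Display side: one query facet per category; ex makes it ignore
        # the filter tagged REUSABILITY.
        prefix = "{!id=REUSABILITY:" + name + " ex=REUSABILITY}"
        params.append(("facet.query", prefix + query))
    if selected is not None:
        # Search side: the same query, now as a tagged filter.
        params.append(("fq", "{!tag=REUSABILITY}" + queries[selected]))
    return params


if __name__ == "__main__":
    demo = {"open": 'RIGHTS:("CC0" OR "CC BY" OR "CC BY SA")'}
    # urlencode accepts a list of pairs, so repeated keys are preserved.
    print(urlencode(reusability_params(demo, selected="open")))
```

A list of pairs is used rather than a dict because facet.query appears several times in one request.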

All these Solr parameters run in the background. On the Europeana portal we use a fake facet called "REUSABILITY", and we use it in our filtering parameter (&qf) as REUSABILITY:open, REUSABILITY:restricted or REUSABILITY:permission. It is a shortcut for the lengthy query, which keeps the interface (and the URL) clean. In the API we introduced the "reusability" parameter with the same options as in the portal: "open", "restricted" and "permission" denote the categories mentioned above:

http://europeana.eu/api/v2/search.json?wskey=[YOUR API KEY]&query=*:*&reusability=open
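A client could build such an API call as follows. This is a Python sketch: the endpoint and the parameter names are taken from the URL above, while the function name and the validation step are assumptions added for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://europeana.eu/api/v2/search.json"  # from the example URL
REUSABILITY_VALUES = ("open", "restricted", "permission")


def search_url(api_key, query="*:*", reusability=None):
    """Return a search URL, optionally restricted to one reusability category."""
    if reusability is not None and reusability not in REUSABILITY_VALUES:
        raise ValueError(f"unknown reusability value: {reusability!r}")
    params = [("wskey", api_key), ("query", query)]
    if reusability is not None:
        params.append(("reusability", reusability))
    # urlencode percent-encodes the query, so "*:*" is sent in escaped form.
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("YOUR_API_KEY", reusability="open"))
```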

For those who are interested, here is a real Solr query (slightly formatted for the sake of readability):

q=*:*
&fq={!tag=REUSABILITY}RIGHTS:(
     http\:\/\/creativecommons.org\/licenses\/by-nc\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nc-sa\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nc-nd\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nd\/*
  OR http\:\/\/www.europeana.eu\/rights\/out-of-copyright-non-commercial\/*)
&rows=12
&start=0
&sort=score desc
&timeAllowed=30000
&facet.mincount=1
&facet=true
&facet.field=UGC
&facet.field=LANGUAGE
&facet.field=TYPE
&facet.field=YEAR
&facet.field=PROVIDER
&facet.field=DATA_PROVIDER
&facet.field=COUNTRY
&facet.field=RIGHTS
&facet.limit=750
&facet.query={!id=REUSABILITY:open ex=REUSABILITY}RIGHTS:(
     http\:\/\/creativecommons.org\/publicdomain\/mark\/*
  OR http\:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/*
  OR http\:\/\/creativecommons.org\/licenses\/by\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-sa\/*)
&facet.query={!id=REUSABILITY:restricted ex=REUSABILITY}RIGHTS:(
     http\:\/\/creativecommons.org\/licenses\/by-nc\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nc-sa\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nc-nd\/*
  OR http\:\/\/creativecommons.org\/licenses\/by-nd\/*
  OR http\:\/\/www.europeana.eu\/rights\/out-of-copyright-non-commercial\/*)
&facet.query={!id=REUSABILITY:permission ex=REUSABILITY}RIGHTS:(
  NOT(
        http\:\/\/creativecommons.org\/publicdomain\/mark\/*
     OR http\:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/*
     OR http\:\/\/creativecommons.org\/licenses\/by\/*
     OR http\:\/\/creativecommons.org\/licenses\/by-sa\/*
     OR http\:\/\/creativecommons.org\/licenses\/by-nc\/*
     OR http\:\/\/creativecommons.org\/licenses\/by-nc-sa\/*
     OR http\:\/\/creativecommons.org\/licenses\/by-nc-nd\/*
     OR http\:\/\/creativecommons.org\/licenses\/by-nd\/*
     OR http\:\/\/www.europeana.eu\/rights\/out-of-copyright-non-commercial\/*))
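The escaped URL prefixes in this query could be generated rather than typed by hand. The Python sketch below shows one way to do it; the function names are hypothetical, and it only escapes the backslash, colon and slash characters, which is all these rights URLs need.

```python
def rights_prefix_clause(url_prefix):
    """Escape a rights URL prefix and append the trailing wildcard."""
    escaped = (
        url_prefix.replace("\\", "\\\\")  # escape backslashes first
        .replace(":", "\\:")
        .replace("/", "\\/")
    )
    return escaped + "*"


def rights_query(url_prefixes, field="RIGHTS"):
    """OR together the wildcard clauses for a list of URL prefixes."""
    clauses = " OR ".join(rights_prefix_clause(prefix) for prefix in url_prefixes)
    return f"{field}:({clauses})"


if __name__ == "__main__":
    print(rights_query([
        "http://creativecommons.org/licenses/by/",
        "http://creativecommons.org/licenses/by-sa/",
    ]))
```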

See it in action at Europeana.eu.

Notes

* Strictly speaking, EDM is based on the linked data paradigm, so we don't have records in the same way as in a relational database. It is rather a named graph, but that's too technical, so we refer to it as a "record" or "object".

Tags: europeana solr code4lib #AllezCulture
