
Federated search engine of European Poetical databases

2014.10.26. 14:53 kirunews

- A gentle proposal, v2.0 -

by Levente Seláf [1] and Péter Király [2]

[1] levente.selaf (.) elte.hu, ELTE, Budapest
[2] peter.kiraly (.) gwdg.de, The Göttingen Society for Scientific Data Processing

INTRODUCTION

This is a technical suggestion for implementing a federated search engine that gives researchers a tool for querying multiple poetical databases simultaneously. The suggestion is based on the experience of a pilot project, MegaRep (http://rpha.elte.hu/megarep/search.do), which queries two such databases: Le Nouveau Naetebus – Répertoire des poèmes strophiques non-lyriques en langue française d'avant 1400 (http://nouveaunaetebus.elte.hu/) and Répertoire de la poésie hongroise ancienne (abbreviated as RPHA, http://rpha.elte.hu/), both created at Eötvös Loránd University, Budapest.

The main usage scenario of the tool is the following. The end user (the researcher) creates a query in a user interface. The user interface hides the technical, formal details and provides human-readable dropdown lists, radio buttons and similar standard web interface elements. When the user submits the form, the tool creates a more-or-less language-independent formal query and sends it to the individual databases. Each database receives the query, transforms it into its own query language, runs the search, transforms the hit list into a common XML-based format, and sends it back to the caller, the federated search engine. The tool collects the results, transforms the XML to HTML, and displays the merged list to the end user.
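The flow described above can be sketched in a few lines. This is only an illustration, not the MegaRep code (which is written in Java): the function name federated_search, the two stand-in backends and the hit dictionaries are all hypothetical, and a real backend would issue an HTTP request instead of returning canned data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A backend is any callable that takes the common query string and
# returns a list of hits. Real backends would call a remote endpoint.
Backend = Callable[[str], List[dict]]


def federated_search(query: str, backends: Dict[str, Backend]) -> List[dict]:
    """Send the same query to every backend and merge the hit lists."""
    merged: List[dict] = []
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        # Start all searches at once, then collect them in a fixed order.
        futures = {name: pool.submit(search, query) for name, search in backends.items()}
        for name, future in futures.items():
            for hit in future.result():
                # Tag each hit with the repository it came from.
                merged.append({"repository": name, **hit})
    return merged


# Hypothetical stand-ins for two poetical databases.
def fake_rpha(query: str) -> List[dict]:
    return [{"id": "0373", "query_seen": query}]


def fake_naetebus(query: str) -> List[dict]:
    return [{"id": "12", "query_seen": query}]


hits = federated_search("meter:02", {"RPHA": fake_rpha, "NN": fake_naetebus})
for hit in hits:
    print(hit["repository"], hit["id"])
```

Running the searches in parallel means the slowest database, not the sum of all databases, determines the response time.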

The communication is based on the OpenSearch protocol. It is a widely accepted industry standard; among others, web browsers use it to communicate with custom search engines. You can read more at http://www.opensearch.org/. The standard is pretty straightforward: we send a URL in a specific format to the server, which sends back a hit list in RSS or Atom format.

THE FORMAT OF THE REQUEST URL

OpenSearch is simple because it does not specify the format of the query itself. Because of its limitations we cannot use custom URL parameters (such as &meter=hexameter); instead we have to send our complex query in a single parameter (called searchTerms). The solution is to use a popular and well-documented formal query language: Lucene's query syntax.

The request URL which should be implemented by all participants:

[base URL]
?searchTerms=[query string]
&startIndex=[the index of first hit (default is 1)]
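A query that contains spaces and colons has to be percent-encoded before it is placed in the URL. The following sketch shows one way to build the request; the helper name build_request_url is an assumption, and the base URL is the RPHA endpoint mentioned later in this post.

```python
from urllib.parse import quote, urlencode


def build_request_url(base_url: str, query: str, start_index: int = 1) -> str:
    """Build an OpenSearch request URL from a Lucene-style query string."""
    params = {"searchTerms": query, "startIndex": start_index}
    # quote_via=quote encodes a space as %20 instead of '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)


url = build_request_url(
    "http://rpha.elte.hu/rpha/openSearch.do",
    "meter:01 AND meter_qualifier:01-01",
)
print(url)
```

The examples in the rest of this post show the queries unencoded for readability.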

You can find the details of Lucene's query syntax here: http://lucene.apache.org/java/2_4_1/queryparsersyntax.html

This proposal suggests implementing only a limited subset of the whole grammar, namely:

  • simple field-value pair
    [field]:[value]
    meaning: the record has a field named field whose value is value
    SQL equivalent: field = "value"

  • boolean AND, OR, NOT between field-value pairs
    [field1]:[value1] AND [field2]:[value2]
    meaning: the record has a field1 field whose value is value1, and a field2 field whose value is value2
    SQL equivalent: field1 = "value1" AND field2 = "value2"

  • boolean AND, OR, NOT within one field
    [field]:([value1] AND [value2])
    meaning: the record has a (repeatable) field named field that holds both value1 and value2
    SQL equivalent: field = "value1" AND field = "value2" (the OR form corresponds to field IN ("value1", "value2"))
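To illustrate how a participating database might translate this subset into its own query language, here is a minimal sketch that turns such a query into a parameterised SQL condition. It is an assumption-laden illustration, not the RPHA implementation: the function name lucene_subset_to_sql is hypothetical, it uses "?" placeholders, and it treats a repeatable field as a plain column, which a real schema would probably model with a join.

```python
import re
from typing import List, Tuple

# One token is a boolean operator, a grouped field (field:(a AND b))
# or a simple field:value pair.
TOKEN = re.compile(
    r"\s*(?:(?P<op>AND|OR|NOT)\b"
    r"|(?P<gfield>\w+):\((?P<group>[^)]*)\)"
    r"|(?P<field>\w+):(?P<value>[^\s()]+))"
)
OPERATORS = ("AND", "OR", "NOT")


def lucene_subset_to_sql(query: str) -> Tuple[str, List[str]]:
    """Translate the restricted grammar into a SQL condition and its parameters."""
    query = query.strip()
    sql_parts: List[str] = []
    params: List[str] = []
    pos = 0
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}: {query[pos:]!r}")
        pos = match.end()
        if match.group("op"):
            sql_parts.append(match.group("op"))
        elif match.group("gfield"):
            field = match.group("gfield")
            inner: List[str] = []
            for word in match.group("group").split():
                if word in OPERATORS:
                    inner.append(word)
                else:
                    inner.append(f"{field} = ?")
                    params.append(word)
            sql_parts.append("(" + " ".join(inner) + ")")
        else:
            # Field names match \w+ only; values always travel as parameters.
            sql_parts.append(f"{match.group('field')} = ?")
            params.append(match.group("value"))
    return " ".join(sql_parts), params


print(lucene_subset_to_sql("meter:01 AND meter_qualifier:01-01"))
print(lucene_subset_to_sql("language:(la OR hu)"))
```

Passing the values as parameters rather than pasting them into the SQL text keeps the translation safe against injection.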

All of this concerns the formal structure of the query, but we also have to define a semantic structure: an initial set of fields and their possible values, as a kind of common vocabulary for the concepts described in poetic databases.

VOCABULARY OF THE POETIC CONCEPTS

We defined an initial structure, which can be extended in a later phase of the project. We tried to find the concepts that are common to the databases used in our pilot. In designing the vocabulary we followed two rules: 1) it should be language agnostic wherever possible, so where we applied categories, we denoted them by numerical values; 2) it should be extensible later. We have a two-level hierarchy: some elements have qualifiers, so that, for example, we can distinguish subcategories of Graeco-Roman metrical versification.

In the tables the header contains the field names. In the body of each table the first column or the first two columns contain the possible values, and the last column contains the meaning of the field value.

Metrics

meter  meter_qualifier  meaning
01                      Graeco-Roman Metrical Versification
       01-01-01         hexameter – one verse
       01-01-02         hexameter – several verses
       01-02-01         distich – one
       01-02-02         distichs – several
       01-03            Graeco-Roman metrical poetry (classical meter, different from hexameter or pentameter)
       01-04            Graeco-Roman metrical versification – new meters without classical antecedents
02                      syllabic
03                      tendency to be syllabic
04                      tonic
05                      each word is a foot
06                      free verse
07                      syllabo-tonic
       07-01            German or English syllabo-tonic versification
       07-02            Graeco-Roman Metrical Versification combined with strict syllabism
08                      Mixed Compositions (different parts of the text in different metrical systems)

Examples:

?searchTerms=meter:02&startIndex=1
?searchTerms=meter:01 AND meter_qualifier:01-01&startIndex=1

Segments

segmentation  segmentation_qualifier  meaning
01                                    strophic – more than one stanza
              01                      isostrophic
              02                      heterostrophic
02                                    strophic – one strophe
03                                    rhyming couplets
04                                    laisses
05                                    rimes couées, serventese
06                                    terza rima

Examples:

?searchTerms=segmentation:01&startIndex=1
?searchTerms=segmentation:01 AND segmentation_qualifier:02&startIndex=1

Rhymes

rhyme  rhyme_qualifier  meaning
01                      no end-rhymes
       01               alliterating, non-rhyming
       02               non-alliterating, non-rhyming
02                      rhyming
03                      assonanced
04                      word-refrain rhyming

Examples:

?searchTerms=rhyme:01&startIndex=1
?searchTerms=rhyme:01 AND rhyme_qualifier:02&startIndex=1

Rhyming Structure of the Stanza

The field name is rhyme_scheme. It contains the rhyming structure as free text in a scholarly accepted notation (such as AABA).

Example:

?searchTerms=rhyme_scheme:AAAB&startIndex=1

Metrical Structure (verse length)

The field name is metrical_scheme. It contains the metrical structure as free text in a scholarly accepted notation (such as 12 16).

Example:

?searchTerms=metrical_scheme:12 16&startIndex=1

Declination of line

declination_line  meaning
01                descending verse rhythm
02                ascending verse rhythm
03                criterion not applicable

Example:

?searchTerms=declination_line:01&startIndex=1

Gonic Structure – level of the poem

declination_strophe
01 homogonical
02 heterogonical

Example:

?searchTerms=declination_strophe:01&startIndex=1

Gonic Structure

The field name is declination_scheme. It contains the gonic structure as free text in a scholarly accepted notation, i.e. a sequence of one or more 'M', 'm', 'F', or 'f' characters, where 'M' and 'm' mean masculine rhyme, 'F' and 'f' mean feminine rhyme, and an uppercase character denotes the beginning of a strophe.

Example:

?searchTerms=declination_scheme:MmMmMfMfMmMfMmMmMfMmFmMfMmMmFfMfFmMfFfMm&startIndex=1
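Since the notation is regular, an implementor can validate it and split it into strophes with a short regular expression. The sketch below assumes that a valid scheme always starts with an uppercase letter (the text only says that uppercase marks the beginning of a strophe); the function names are hypothetical.

```python
import re
from typing import List

# A strophe starts with an uppercase letter and continues with lowercase ones.
STROPHE = re.compile(r"[MF][mf]*")
WHOLE_SCHEME = re.compile(r"(?:[MF][mf]*)+")


def split_gonic_scheme(scheme: str) -> List[str]:
    """Split a declination_scheme value into one string per strophe."""
    if not WHOLE_SCHEME.fullmatch(scheme):
        raise ValueError(f"not a valid gonic scheme: {scheme!r}")
    return STROPHE.findall(scheme)


def line_genders(strophe: str) -> List[str]:
    """Map each character of a strophe to the gender of the rhyme."""
    return ["masculine" if char in "Mm" else "feminine" for char in strophe]


strophes = split_gonic_scheme("MmMfFm")
print(strophes)
print(line_genders(strophes[1]))
```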

Number of lines

The field name is number_of_lines. It contains the number of lines as a number.

Example:

?searchTerms=number_of_lines:8&startIndex=1

Number of strophes

The field name is number_of_strophes. It contains the number of strophes as a number.

Example:

?searchTerms=number_of_strophes:20&startIndex=1

Author

The field name is author. It is a free-text field that names the author of the poem.

Example:

?searchTerms=author:Shakespeare&startIndex=1

Date

date date_qualifier
[ISO date format] ['before'|'after'|'circa'|'between']

Examples:

?searchTerms=date:1321-00-00&startIndex=1 (the year 1321)
?searchTerms=date:1321-01-00&startIndex=1 (January, 1321)
?searchTerms=date:1321-01-01&startIndex=1 (1st of January, 1321)
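In these examples a 00 component stands for "unknown or not specified". Strict ISO 8601 would express reduced precision by omitting the component instead (1321 or 1321-01), so a receiving database needs a small parser of its own. The following is a sketch under that reading; the function name parse_partial_date is hypothetical.

```python
from typing import Optional, Tuple


def parse_partial_date(value: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Parse YYYY-MM-DD where MM or DD may be 00, meaning 'not specified'."""
    parts = value.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    year, month, day = (int(part) for part in parts)
    if not 0 <= month <= 12 or not 0 <= day <= 31:
        raise ValueError(f"month or day out of range in {value!r}")
    if month == 0 and day != 0:
        raise ValueError(f"a day without a month makes no sense: {value!r}")
    # Zero components become None so callers cannot mistake them for real values.
    return year, (month or None), (day or None)


print(parse_partial_date("1321-00-00"))
print(parse_partial_date("1321-01-00"))
print(parse_partial_date("1321-01-01"))
```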

Melody

melody  melody_qualifier  meaning
01                        poem was sung
        01                has musical notation
        02                has no musical notation
02                        poem was not sung
03                        undecidable

Examples:

?searchTerms=melody:02&startIndex=1
?searchTerms=melody:01 AND melody_qualifier:01&startIndex=1

Genre

The field name is genre. It contains the genre of the poem. It should refer to a genre classification that is yet to be elaborated.

Caesuras

The field name is caesuras. It contains the free text description of caesuras in the poem.

Language

language                                          language_qualifier  meaning
[ISO 639-1, 639-2 or 639-3 code] (repeatable)                         one language
                                                  01                  sporadic bilingualism
                                                  02                  change language by verses
                                                  03                  change language by strophes
                                                  04                  the refrain and body of the strophe are in different languages

Interstrophical relations – level of rhymes

interstrophical_relations_level1
01 coblas singulars
02 coblas unissonans
03 coblas doblas
04 coblas ternas
05 coblas alternas

Interstrophical relations - primary level note

The field name is interstrophical_relations_level1_note. It contains the free text note related to the previous field.

Interstrophical relations - secondary level

interstrophical_relations_level2
01 coblas capcaudadas
02 coblas capfinidas
03 coblas capdenals (strophe level)
04 rimes constantes
05 acrostichon
06 telestichon
07 prayer with glosses
08 alphabetical poem
09 coblas retrogradadas
10 dialogue (the participants recite the strophe in alternance)
11 cantio cum auctoritate

Interstrophical relations - secondary level note

The field name is interstrophical_relations_level2_note. It contains the free text note related to the previous field.

Refrain

refrain  refrain_qualifier  meaning
01                          without refrain
02                          with refrain
         02-01-01           identical refrain
         02-01-02           variation at the beginning
03                          with (a joint) refrain
         03-01-01           initial refrain
         03-01-02           not initial refrain
04                          multiple refrains

OPENSEARCH DESCRIPTOR FILE

Each OpenSearch implementor should publish a descriptor file so that the search engine can discover the implementation details the implementor supports. The descriptor file is described in detail in the OpenSearch standard. Here is an example, RPHA's description file (you can access it at http://rpha.elte.hu/rpha/opensearchdescription.xml):

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
   <ShortName>RPHA Web Search</ShortName>
   <LongName>RPHA Web Search</LongName>
   <Description>RPHA OpenSearch interface</Description>
   <Developer>Seláf Levente, Király Péter</Developer>
   <Tags>rpha poems web</Tags>
   <Contact>kirunews@gmail.com</Contact>
   <Attribution>Creative Commons</Attribution>
   <Url type="application/atom+xml"
      template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=atom"/>
   <Url type="application/rss+xml"
      template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=rss"/>
   <Url type="text/html"
      template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}"/>
   <Image height="64" width="64" type="image/png">
      http://example.com/websearch.png</Image>
   <Image height="16" width="16" type="image/vnd.microsoft.icon">
      http://example.com/websearch.ico</Image>
   <Query role="example" searchTerms="cat" />
   <SyndicationRight>open</SyndicationRight>
   <AdultContent>false</AdultContent>
   <Language>en-us</Language>
   <OutputEncoding>UTF-8</OutputEncoding>
   <InputEncoding>UTF-8</InputEncoding>
</OpenSearchDescription>
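A client can read such a descriptor and fill in the URL template itself. The sketch below uses only the Python standard library and a trimmed copy of the descriptor above; the function names are hypothetical, and the template handling is simplified (it ignores namespaced parameters, and an optional parameter such as {startPage?} is replaced by an empty string when no value is given).

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

OS_NS = {"os": "http://a9.com/-/spec/opensearch/1.1/"}

# Trimmed copy of the RPHA descriptor, kept inline so the sketch is self-contained.
DESCRIPTOR = """
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
   <ShortName>RPHA Web Search</ShortName>
   <Url type="application/rss+xml"
      template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=rss"/>
</OpenSearchDescription>
"""


def url_template(descriptor_xml: str, mime_type: str) -> str:
    """Return the URL template that the descriptor offers for a MIME type."""
    root = ET.fromstring(descriptor_xml)
    for url in root.findall("os:Url", OS_NS):
        if url.get("type") == mime_type:
            return url.get("template")
    raise LookupError(f"no Url element for {mime_type}")


def fill_template(template: str, **values: object) -> str:
    """Replace {name} and {name?} placeholders with percent-encoded values."""
    def substitute(match: "re.Match[str]") -> str:
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameter without a value
        raise KeyError(name)
    return re.sub(r"\{(\w+)(\??)\}", substitute, template)


template = url_template(DESCRIPTOR, "application/rss+xml")
print(fill_template(template, searchTerms="meter:02"))
```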

RESPONSE FORMAT

The base structure of the response follows RSS 2.0, extended with elements from the OpenSearch and Atom namespaces. Inside the <channel> element there are some header fields, which carry information relevant to the whole response, and a number of <item> elements for the individual results. The header part of the response contains some important elements:

  • <totalResults>: the total number of results
  • <startIndex>: the position of the first returned hit within the whole hit list (important: counting starts at 1, not 0)
  • <itemsPerPage>: the number of records in one response
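These three values are enough for the caller to page through a hit list. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
from typing import Optional


def next_start_index(total_results: int, start_index: int, items_per_page: int) -> Optional[int]:
    """Return the startIndex of the next page, or None when there is no next page."""
    candidate = start_index + items_per_page
    # startIndex is 1-based, so the last valid index equals total_results.
    return candidate if candidate <= total_results else None


print(next_start_index(16, 1, 10))
print(next_start_index(16, 11, 10))
```

With 16 results and 10 records per page, the first call yields 11 and the second yields None, because the second page already contains the last hit.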

In the <item> elements the implementors should provide three elements in a project-specific way:

  1. the <title> element should contain the identifier of the repository, and the identifier of the record separated by a space character. For example: <title>RPHA 0373</title>
  2. the <link> element should contain the URL of the record
  3. the <description> element should make use of fields defined inside the project's own namespace (which is http://www.megarep.org in the sample implementation). The fields are the same as those used in the query terms.

An example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- application/rss+xml -->
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>RPHA results</title>
  <link>http://rpha.elte.hu/rpha/openSearch.do?language:lat</link>
  <description>RPHA results</description>
  <opensearch:totalResults>16</opensearch:totalResults>
  <opensearch:startIndex>1</opensearch:startIndex>
  <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
  <atom:link rel="search" type="application/opensearchdescription+xml" href="http://rpha.elte.hu/rpha/opensearchdescription.xml"/>
  <opensearch:Query role="request" searchTerms="language:lat" startPage="1" />
  <item>
    <title>RPHA 0373</title>
    <link>http://rpha.elte.hu/rpha/id/0373</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="0373">
       <mr:id>0373</mr:id>
       <mr:incipit>Emlékezzünk, én uraim, régen lett dologról</mr:incipit>
       <mr:title>Rusztán császár históriája</mr:title>
       <mr:author>Drávamelléki Névtelen</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1600</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>01</mr:melody>
       <mr:genre>048,049,051,057</mr:genre>
       <mr:number_of_strophes>226</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
       <mr:metrical_scheme>14141414</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 0381</title>
    <link>http://rpha.elte.hu/rpha/id/0381</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="0381">
       <mr:id>0381</mr:id>
       <mr:incipit>Én lelkecském, búdosócskám, hízelkedőcském</mr:incipit>
       <mr:title>Idézet</mr:title>
       <mr:author>Magyari István</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1595-1600</mr:date>
       <mr:date_qualifier>6</mr:date_qualifier>
       <mr:melody>02</mr:melody>
       <mr:genre>001,003,025,101</mr:genre>
       <mr:genre>048,050,053,061,092</mr:genre>
       <mr:number_of_lines>5</mr:number_of_lines>
       <mr:meter>01-04</mr:meter>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 2052</title>
    <link>http://rpha.elte.hu/rpha/id/2052</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="2052">
       <mr:id>2052</mr:id>
       <mr:incipit>Az elefánt nagy, mégis megöletik</mr:incipit>
       <mr:title>Idézet</mr:title>
       <mr:author>Bornemisza Péter?</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1578</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>02</mr:melody>
       <mr:genre>048,050,053,061,092</mr:genre>
       <mr:number_of_strophes>1</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>AAAX</mr:rhyme_scheme>
       <mr:metrical_scheme>11121216</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 2053</title>
    <link>http://rpha.elte.hu/rpha/id/2053</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="2053">
       <mr:id>2053</mr:id>
       <mr:incipit>Én császár nem lennék</mr:incipit>
       <mr:title>Idézet</mr:title>
       <mr:author>Bornemisza Péter?</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1578</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>02</mr:melody>
       <mr:genre>048,050,053,061,092</mr:genre>
       <mr:number_of_strophes>2</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
       <mr:metrical_scheme>6 6 6 7</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 1340</title>
    <link>http://rpha.elte.hu/rpha/id/1340</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="1340">
       <mr:id>1340</mr:id>
       <mr:incipit>Szólok szerelem dolgáról nektek</mr:incipit>
       <mr:title>Paris és Görög Ilona históriája</mr:title>
       <mr:author>Lévai Névtelen</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1570</mr:date>
       <mr:date_qualifier>1</mr:date_qualifier>
       <mr:melody>01</mr:melody>
       <mr:genre>048,049,051,057</mr:genre>
       <mr:number_of_strophes>289</mr:number_of_strophes>
       <mr:number_of_strophes>290</mr:number_of_strophes>
       <mr:number_of_strophes>291</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
       <mr:metrical_scheme>11111111</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 2054</title>
    <link>http://rpha.elte.hu/rpha/id/2054</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="2054">
       <mr:id>2054</mr:id>
       <mr:incipit>Bújdosó édes lelkecském</mr:incipit>
       <mr:title>Idézet</mr:title>
       <mr:language>la</mr:language>
       <mr:date>1578</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>02</mr:melody>
       <mr:genre>048,050,053,061,092</mr:genre>
       <mr:number_of_strophes>1</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>XXAAA</mr:rhyme_scheme>
       <mr:metrical_scheme>8 8 910 9</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 3216</title>
    <link>http://rpha.elte.hu/rpha/id/3216</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="3216">
       <mr:id>3216</mr:id>
       <mr:incipit>Ó, Istennek teste édesség, e világnak oltalma</mr:incipit>
       <mr:title>Könyörgés a kenyér színe alatt jelenlévő Krisztushoz</mr:title>
       <mr:language>la</mr:language>
       <mr:date>1433</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>01</mr:melody>
       <mr:genre>001,003,008,102</mr:genre>
       <mr:number_of_lines>5</mr:number_of_lines>
       <mr:meter>01-04</mr:meter>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 3211</title>
    <link>http://rpha.elte.hu/rpha/id/3211</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="3211">
       <mr:id>3211</mr:id>
       <mr:incipit>Krisztus feltámada menten nagy kínjából</mr:incipit>
       <mr:title>Húsvéti népének</mr:title>
       <mr:language>la</mr:language>
       <mr:date>1401-1450</mr:date>
       <mr:date_qualifier>6</mr:date_qualifier>
       <mr:melody>01</mr:melody>
       <mr:genre>001,003,008,200</mr:genre>
       <mr:number_of_strophes>1</mr:number_of_strophes>
       <mr:meter>02</mr:meter>
       <mr:rhyme_scheme>XXAAX</mr:rhyme_scheme>
       <mr:metrical_scheme>6 7 7 7 4</mr:metrical_scheme>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 3209</title>
    <link>http://rpha.elte.hu/rpha/id/3209</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="3209">
       <mr:id>3209</mr:id>
       <mr:title>Jephtes históriája</mr:title>
       <mr:author>Balassi Bálint</mr:author>
       <mr:language>la</mr:language>
       <mr:date>1589</mr:date>
       <mr:date_qualifier>4</mr:date_qualifier>
       <mr:melody>03</mr:melody>
       <mr:genre>001,002,004,009</mr:genre>
       <mr:genre>048,049,051,057</mr:genre>
       <mr:number_of_strophes>0</mr:number_of_strophes>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
  <item>
    <title>RPHA 3202</title>
    <link>http://rpha.elte.hu/rpha/id/3202</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="3202">
       <mr:id>3202</mr:id>
       <mr:incipit>Az újesztendő kezdessék tőled, Úristen</mr:incipit>
       <mr:title>Naptárvers</mr:title>
       <mr:language>la</mr:language>
       <mr:date>1582</mr:date>
       <mr:date_qualifier>2</mr:date_qualifier>
       <mr:melody>02</mr:melody>
       <mr:genre>048,050,053,059,074</mr:genre>
       <mr:number_of_lines>24</mr:number_of_lines>
       <mr:meter>01-02-01</mr:meter>
       <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
</channel>
</rss>

In these items you can see that the field names are the same as those described in the vocabulary section of this paper. There are some minor differences, however: the federated genre classification has not been created yet, so RPHA uses its own classification, and the date does not fully conform to the ISO date standard.
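On the federated side, such a response has to be parsed before the hit lists can be merged. The following sketch reads a shortened version of the response above with Python's standard XML parser; the function name parse_response and the shape of the returned dictionaries are assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple

NS = {
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
    "mr": "http://www.megarep.org",
}

# Shortened copy of the sample response, with one record and three fields.
SAMPLE = """
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<channel>
  <opensearch:totalResults>16</opensearch:totalResults>
  <opensearch:startIndex>1</opensearch:startIndex>
  <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
  <item>
    <title>RPHA 0373</title>
    <link>http://rpha.elte.hu/rpha/id/0373</link>
    <description>
      <mr:record xmlns:mr="http://www.megarep.org" id="0373">
        <mr:language>la</mr:language>
        <mr:meter>02</mr:meter>
        <mr:language>hu</mr:language>
      </mr:record>
    </description>
  </item>
</channel>
</rss>
"""


def parse_response(rss_xml: str) -> Tuple[Dict[str, int], List[dict]]:
    """Return the paging header and the list of hits of one response."""
    channel = ET.fromstring(rss_xml).find("channel")
    header = {
        name: int(channel.findtext(f"opensearch:{name}", namespaces=NS))
        for name in ("totalResults", "startIndex", "itemsPerPage")
    }
    hits = []
    for item in channel.findall("item"):
        # <title> holds the repository identifier and the record identifier.
        repository, record_id = item.findtext("title").split(" ", 1)
        fields = defaultdict(list)  # fields such as language are repeatable
        record = item.find("description/mr:record", NS)
        if record is not None:
            for child in record:
                local_name = child.tag.rpartition("}")[2]
                fields[local_name].append(child.text)
        hits.append({
            "repository": repository,
            "record_id": record_id,
            "link": item.findtext("link"),
            "fields": dict(fields),
        })
    return header, hits


header, hits = parse_response(SAMPLE)
print(header)
print(hits[0]["fields"])
```

Collecting every field into a list keeps repeated elements, such as the two language values of a record, from overwriting each other.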

ABOUT THE PILOT IMPLEMENTATION

The MegaRep source code is available as open source software at http://github.org/pkiraly/megarep. The working implementation is available at http://rpha.elte.hu/megarep/search.do. The RPHA source code is also available, at http://github.org/pkiraly/rpha; its OpenSearch endpoint is http://rpha.elte.hu/rpha/openSearch.do. Both implementations were written in Java using the Apache Struts 1.0 framework. MegaRep contains translation files for English, French and Hungarian, so both search and record retrieval are available in all three languages.

AFTERWORD

The background of this proposal, i.e. the services and the data dictionary, was created 5 years ago, but it was never documented other than in a bunch of spreadsheets and readme files. Recently I had to find the documentation of my work on RPHA, and unexpectedly I also found MegaRep's files, so I thought it was high time to write up this proposal, even if I think some parts are outdated in the light of the advances of TEI and Linked Data. There is a Hungarian proverb: better late than never. So until we come up with an up-to-date proposal, you can read and use this one. If you have any suggestions, please write to us.

Tags: code4lib

Comments:

andrisi 2014.10.29. 19:13:56

Without going into the technical details, or reading the entire post, I must say that from my experience, it's better - or I must say it's the only viable way most of the time - if you translate your queries to the target systems' language yourself, and send them along that way. Asking other systems to understand a common query format seems a nice idea from a standardization viewpoint, but in such a specific field, it only limits your possibilities and/or creates work overhead. So you'd better study the target systems, come up with the best interface to query all they know, and get down to speaking their languages as well as you can.