- A gentle proposal, v2.0 -
by Levente Seláf1 and Péter Király2
1levente.selaf (.) elte.hu, ELTE, Budapest
2peter.kiraly (.) gwdg.de, The Göttingen Society for Scientific Data Processing
INTRODUCTION
This is a technical proposal for implementing a federated search engine that gives researchers a tool for querying multiple poetical databases simultaneously. The proposal is based on the experience of a pilot project, MegaRep (http://rpha.elte.hu/megarep/search.do), which queries two such databases: Le Nouveau Naetebus – Répertoire des poèmes strophiques non-lyriques en langue française d'avant 1400 (http://nouveaunaetebus.elte.hu/) and Répertoire de la poésie hongroise ancienne (abbreviated as RPHA, http://rpha.elte.hu/), both created at Eötvös Loránd University, Budapest.
The main usage scenario of the tool is the following. The end user (the researcher) creates a query in a user interface. The user interface hides the technical, formal details and provides human-readable dropdown lists, radio buttons and similar standard web user interface elements. When the user submits the form, the tool creates a more-or-less language-independent formal query and sends it to the individual databases. Each database receives the query, transforms it into its own query language, runs the search, transforms the hit list into a common XML-based format, and sends it back to the caller, the federated search engine. The tool collects the results, transforms the XML to HTML, and displays the merged list to the end user.
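The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MegaRep implementation: the function name federated_search, the stub backends and the second stub's record are hypothetical, and a real backend would wrap an HTTP call to one database.

```python
from typing import Callable, Dict, List

# A backend is any callable that takes the formal query and returns a list of
# hits. In a real deployment it would wrap an HTTP request to one database.
Backend = Callable[[str], List[dict]]


def federated_search(query: str, backends: Dict[str, Backend]) -> List[dict]:
    """Send the same query to every backend and merge the hit lists."""
    merged: List[dict] = []
    for name, search in backends.items():
        try:
            hits = search(query)
        except Exception:
            # One failing database should not break the whole search.
            continue
        for hit in hits:
            # Remember which repository each hit came from.
            merged.append({"source": name, **hit})
    return merged


# Stub backends standing in for two poetical databases (hypothetical data).
def fake_rpha(query: str) -> List[dict]:
    return [{"id": "0373", "title": "Rusztán császár históriája"}]


def fake_naetebus(query: str) -> List[dict]:
    return [{"id": "12", "title": "(hypothetical record)"}]


def broken_backend(query: str) -> List[dict]:
    raise RuntimeError("database is down")


results = federated_search(
    "meter:02",
    {"RPHA": fake_rpha, "NN": fake_naetebus, "down": broken_backend},
)
for hit in results:
    print(hit["source"], hit["id"], hit["title"])
```

Skipping a failing backend is a design choice of this sketch; an implementation could just as well report the failure to the user next to the merged list.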
The communication is based on the OpenSearch protocol. It is a widely accepted and widely used industry standard; among others, web browsers use it to communicate with custom search engines. You can read more at http://www.opensearch.org/. The standard is fairly straightforward: the client sends a URL in a specific format to the server, which sends back a hit list in Atom or RSS format.
THE FORMAT OF THE REQUEST URL
OpenSearch is simple partly because it does not specify the format of the query itself. Because of this limitation we cannot use custom URL parameters (such as &meter=hexameter); instead we have to send our complex query in a single parameter (called searchTerms). Our solution is to use a popular and well-documented formal query language, Lucene's query syntax.
The request URL which should be implemented by all participants:
[base URL]
?searchTerms=[query string]
&startIndex=[the index of first hit (default is 1)]
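A request URL of this form can be assembled as in the following minimal sketch. The helper name build_request_url is our assumption; the base URL is the RPHA endpoint mentioned later in this document. Note that the query string has to be percent-encoded, which the standard library does for us.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(base_url: str, query: str, start_index: int = 1) -> str:
    """Build [base URL]?searchTerms=[query string]&startIndex=[index]."""
    params = {"searchTerms": query, "startIndex": start_index}
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return base_url + "?" + urlencode(params)


url = build_request_url(
    "http://rpha.elte.hu/rpha/openSearch.do",
    "meter:01 AND meter_qualifier:01-01",
)
print(url)

# The receiving side gets the original query back after decoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["searchTerms"][0])
```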
You can find the details of Lucene's query syntax here: http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
This proposal suggests implementing only a limited subset of the whole grammar, namely:
- simple field-value pair
[field]:[value]
meaning: the record has a field named field whose value is value
SQL equivalent: field = "value"
- boolean AND, OR, NOT between field-value pairs
[field1]:[value1] AND [field2]:[value2]
meaning: the record has a field1 field with value1 as its value, and a field2 field with value2 as its value
SQL equivalent: field1 = "value1" AND field2 = "value2"
- boolean AND, OR, NOT within one field
[field]:([value1] AND [value2])
meaning: the record has a field named field with both value1 and value2 among its values
SQL equivalent: field = "value1" AND field = "value2" for a repeatable field (the OR form, [field]:([value1] OR [value2]), corresponds to field IN ("value1", "value2"))
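On the database side, the three forms listed above have to be translated into the local query language. The following is a minimal sketch of such a translation into a parameterized SQL WHERE clause. It is not the RPHA code; the function name to_sql_where, the flat column-per-field layout and the "?" placeholder style are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# One token is an operator, a grouped term field:(v1 AND v2), or field:value.
TOKEN = re.compile(
    r"\s*(?:(?P<op>AND|OR|NOT)\b"
    r"|(?P<gfield>\w+):\((?P<group>[^()]*)\)"
    r"|(?P<field>\w+):(?P<value>[^\s()]+))"
)


def to_sql_where(query: str) -> Tuple[str, List[str]]:
    """Translate the limited query grammar into (where_clause, parameters)."""
    parts: List[str] = []
    params: List[str] = []
    query = query.strip()
    pos = 0
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f"cannot parse query at position {pos}: {query[pos:]!r}")
        pos = m.end()
        if m.group("op"):
            parts.append(m.group("op"))
        elif m.group("gfield"):
            # field:(v1 AND v2) becomes (field = ? AND field = ?)
            field = m.group("gfield")
            inner = []
            for word in m.group("group").split():
                if word in ("AND", "OR", "NOT"):
                    inner.append(word)
                else:
                    inner.append(f"{field} = ?")
                    params.append(word)
            parts.append("(" + " ".join(inner) + ")")
        else:
            # Values are passed as parameters, never pasted into the SQL text.
            parts.append(f"{m.group('field')} = ?")
            params.append(m.group("value"))
    return " ".join(parts), params


print(to_sql_where("meter:01 AND meter_qualifier:01-01"))
print(to_sql_where("language:(la AND hu)"))
```

A real implementation should additionally check every field name against the vocabulary before using it as a column name, and a repeatable field stored in a separate table would need a join or subquery instead of a plain comparison.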
All of this concerns the formal structure of the query. We also have to define a semantic structure: an initial set of fields and their possible values, as a kind of common vocabulary for the concepts described in poetic databases.
VOCABULARY OF THE POETIC CONCEPTS
We defined an initial structure, which can be extended in a later phase of the project. We tried to find the concepts that are common to the databases used in our pilot. In designing the vocabulary we followed two rules: 1) it should be language-agnostic wherever possible, so where we applied categories, we denoted them by numerical values; 2) it should be extendable later. We have a two-level hierarchy: some elements have qualifiers; for example, we can distinguish between subcategories of Graeco-Roman metrical versification.
In the tables, the header contains the field names. In the body of each table, the first column or first two columns contain the possible values, and the last column contains the meaning of the field value.
Metrics
| meter | meter_qualifier | meaning |
| 01 | | Graeco-Roman metrical versification |
| | 01-01-01 | hexameter – one verse |
| | 01-01-02 | hexameter – several verses |
| | 01-02-01 | distich – one |
| | 01-02-02 | distichs – several |
| | 01-03 | Graeco-Roman metrical poetry (classical meter, different from hexameter or pentameter) |
| | 01-04 | Graeco-Roman metrical versification – new meters without classical antecedents |
| 02 | | syllabic |
| 03 | | tendency to be syllabic |
| 04 | | tonic |
| 05 | | each word is a foot |
| 06 | | free verse |
| 07 | | syllabo-tonic |
| | 07-01 | German or English syllabo-tonic versification |
| | 07-02 | Graeco-Roman metrical versification combined with strict syllabism |
| 08 | | mixed compositions (different parts of the text in different metrical systems) |
Examples:
?searchTerms=meter:02&startIndex=1
?searchTerms=meter:01 AND meter_qualifier:01-01&startIndex=1
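Because the values are numeric codes, the user interface needs a lookup from code to human-readable label, and a way to turn the user's choice back into a query. A minimal sketch follows; the names METER_LABELS, dropdown_options and meter_query are hypothetical, and the labels are copied from the Metrics table (top-level codes only).

```python
from typing import Dict, List, Optional, Tuple

# Top-level meter codes and their English labels from the Metrics table.
# A French or Hungarian user interface would use another dictionary with
# the same codes.
METER_LABELS: Dict[str, str] = {
    "01": "Graeco-Roman metrical versification",
    "02": "syllabic",
    "03": "tendency to be syllabic",
    "04": "tonic",
    "05": "each word is a foot",
    "06": "free verse",
    "07": "syllabo-tonic",
    "08": "mixed compositions",
}


def dropdown_options(labels: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (code, label) pairs sorted by code, ready for a dropdown list."""
    return sorted(labels.items())


def meter_query(code: str, qualifier: Optional[str] = None) -> str:
    """Build the formal query for a chosen meter and optional qualifier."""
    if code not in METER_LABELS:
        raise ValueError(f"unknown meter code: {code}")
    query = f"meter:{code}"
    if qualifier is not None:
        query += f" AND meter_qualifier:{qualifier}"
    return query


print(dropdown_options(METER_LABELS)[0])
print(meter_query("02"))
print(meter_query("01", "01-01"))
```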
Segments
| segmentation | segmentation_qualifier | meaning |
| 01 | | strophic – more than one stanza |
| | 01 | isostrophic |
| | 02 | heterostrophic |
| 02 | | strophic – one strophe |
| 03 | | rhyming couplets |
| 04 | | laisses |
| 05 | | rimes couées, serventese |
| 06 | | terza rima |
Examples:
?searchTerms=segmentation:01&startIndex=1
?searchTerms=segmentation:01 AND segmentation_qualifier:02&startIndex=1
Rhymes
| rhyme | rhyme_qualifier | meaning |
| 01 | | no end-rhymes |
| | 01 | alliterating, non-rhyming |
| | 02 | non-alliterating, non-rhyming |
| 02 | | rhyming |
| 03 | | assonanced |
| 04 | | word-refrain rhyming |
Examples:
?searchTerms=rhyme:01&startIndex=1
?searchTerms=rhyme:01 AND rhyme_qualifier:02&startIndex=1
Rhyming Structure of the Stanza
The field name is rhyme_scheme. It contains a free-text description of the rhyming structure in a scholarly accepted notation (such as AABA).
Example:
?searchTerms=rhyme_scheme:AAAB&startIndex=1
Metrical Structure (verse length)
The field name is metrical_scheme. It contains a free-text description of the metrical structure in a scholarly accepted notation (such as 12 16).
Example:
?searchTerms=metrical_scheme:"12 16"&startIndex=1
Declination of line
| declination_line | meaning |
| 01 | descending verse rhythm |
| 02 | ascending verse rhythm |
| 03 | criterion not applicable |
Example:
?searchTerms=declination_line:01&startIndex=1
Gonic Structure – level of the poem
| declination_strophe | meaning |
| 01 | homogonical |
| 02 | heterogonical |
Example:
?searchTerms=declination_strophe:01&startIndex=1
Gonic Structure
The field name is declination_scheme. It contains a free-text description of the gonic structure in a scholarly accepted notation, i.e. a sequence of one or more 'M', 'm', 'F', or 'f' characters, where 'M' and 'm' mean masculine rhyme, 'F' and 'f' mean feminine rhyme, and uppercase characters denote the beginning of a strophe.
Example:
?searchTerms=declination_scheme:MmMmMfMfMmMfMmMmMfMmFmMfMmMmFfMfFmMfFfMm&startIndex=1
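The notation described above is regular enough to be validated and split into strophes mechanically. The following is a minimal sketch; the function names are hypothetical, and we assume that a valid value always starts with an uppercase letter, since every strophe begins with one.

```python
import re
from typing import List

# One strophe: an uppercase M or F followed by any number of lowercase m or f.
STROPHE = re.compile(r"[MF][mf]*")
WHOLE_SCHEME = re.compile(r"(?:[MF][mf]*)+")

GENDER = {"m": "masculine", "f": "feminine"}


def split_strophes(scheme: str) -> List[str]:
    """Validate a declination_scheme value and split it into strophes."""
    if not WHOLE_SCHEME.fullmatch(scheme):
        raise ValueError(f"not a valid gonic scheme: {scheme!r}")
    return STROPHE.findall(scheme)


def rhyme_genders(strophe: str) -> List[str]:
    """Spell out the rhyme gender of each line of one strophe."""
    return [GENDER[char.lower()] for char in strophe]


scheme = "MmMmMfMfMmMfMmMmMfMmFmMfMmMmFfMfFmMfFfMm"
strophes = split_strophes(scheme)
print(len(strophes), strophes[:3])
print(rhyme_genders(strophes[2]))
```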
Number of lines
The field name is number_of_lines. It contains the number of lines.
Example:
?searchTerms=number_of_lines:8&startIndex=1
Number of strophes
The field name is number_of_strophes. It contains the number of strophes.
Example:
?searchTerms=number_of_strophes:20&startIndex=1
Author
The field name is author. It is a free-text field that denotes the author of the poem.
Example:
?searchTerms=author:Shakespeare&startIndex=1
Date
date | date_qualifier |
[ISO date format] | ['before'|'after'|'circa'|'between'] |
Examples:
?searchTerms=date:1321-00-00&startIndex=1 (the year 1321)
?searchTerms=date:1321-01-00&startIndex=1 (January, 1321)
?searchTerms=date:1321-01-01&startIndex=1 (1st of January, 1321)
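In the examples above an unknown month or day is written as 00. A minimal sketch of reading and writing such values follows; the function names are hypothetical, and the handling of date_qualifier is left out.

```python
import re
from typing import Optional, Tuple

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_date(value: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Split YYYY-MM-DD into (year, month, day); a 00 part becomes None."""
    m = DATE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a YYYY-MM-DD date: {value!r}")
    year, month, day = (int(part) for part in m.groups())
    return year, month or None, day or None


def format_date(year: int, month: Optional[int] = None,
                day: Optional[int] = None) -> str:
    """Write a date in the query format, using 00 for unknown parts."""
    return f"{year:04d}-{month or 0:02d}-{day or 0:02d}"


print(parse_date("1321-00-00"))
print(parse_date("1321-01-00"))
print(format_date(1321, 1, 1))
```

Standard date libraries reject a month or day of 00, which is why this sketch works on the string form directly.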
Melody
| melody | melody_qualifier | meaning |
| 01 | | poem was sung |
| | 01 | has musical notation |
| | 02 | has no musical notation |
| 02 | | poem was not sung |
| 03 | | undecidable |
Examples:
?searchTerms=melody:02&startIndex=1
?searchTerms=melody:01 AND melody_qualifier:01&startIndex=1
Genre
The field name is genre. It contains the genre of the poem. It should refer to a genre classification that is yet to be elaborated.
Caesuras
The field name is caesuras. It contains a free-text description of the caesuras in the poem.
Language
| language | language_qualifier | meaning |
| [text: ISO 639-1, 639-2, and 639-3 language codes] (repeatable) | | one language |
| | 01 | sporadic bilingualism |
| | 02 | language changes by verse |
| | 03 | language changes by strophe |
| | 04 | the refrain and the body of the strophe are in different languages |
Interstrophical relations – level of rhymes
| interstrophical_relations_level1 | meaning |
| 01 | coblas singulars |
| 02 | coblas unissonans |
| 03 | coblas doblas |
| 04 | coblas ternas |
| 05 | coblas alternas |
Interstrophical relations - primary level note
The field name is interstrophical_relations_level1_note. It contains a free-text note related to the previous field.
Interstrophical relations - secondary level
| interstrophical_relations_level2 | meaning |
| 01 | coblas capcaudadas |
| 02 | coblas capfinidas |
| 03 | coblas capdenals (at the level of strophes) |
| 04 | rimes constantes |
| 05 | acrostichon |
| 06 | telestichon |
| 07 | prayer with glosses |
| 08 | alphabetical poem |
| 09 | coblas retrogradadas |
| 10 | dialogue (the participants recite the strophes in alternation) |
| 11 | cantio cum auctoritate |
Interstrophical relations - secondary level note
The field name is interstrophical_relations_level2_note. It contains a free-text note related to the previous field.
Refrain
| refrain | refrain_qualifier | meaning |
| 01 | | without refrain |
| 02 | | with refrain |
| | 02-01-01 | identical refrain |
| | 02-01-02 | variation at the beginning |
| 03 | | with (a joint) refrain |
| | 03-01-01 | initial refrain |
| | 03-01-02 | not initial refrain |
| 04 | | multiple refrains |
OPENSEARCH DESCRIPTOR FILE
Each OpenSearch implementor should publish a descriptor file so that the search engine can discover the implementation details it supports. The descriptor file is described in detail in the OpenSearch standard. Here we show an example, RPHA's descriptor file (you can access it at http://rpha.elte.hu/rpha/opensearchdescription.xml):
<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>RPHA Web Search</ShortName>
  <LongName>RPHA Web Search</LongName>
  <Description>RPHA OpenSearch interface</Description>
  <Developer>Seláf Levente, Király Péter</Developer>
  <Tags>rpha poems web</Tags>
  <Contact>kirunews@gmail.com</Contact>
  <Attribution>Creative Commons</Attribution>
  <Url type="application/atom+xml" template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=atom"/>
  <Url type="application/rss+xml" template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=rss"/>
  <Url type="text/html" template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}"/>
  <Image height="64" width="64" type="image/png">http://example.com/websearch.png</Image>
  <Image height="16" width="16" type="image/vnd.microsoft.icon">http://example.com/websearch.ico</Image>
  <Query role="example" searchTerms="cat" />
  <SyndicationRight>open</SyndicationRight>
  <AdultContent>false</AdultContent>
  <Language>en-us</Language>
  <OutputEncoding>UTF-8</OutputEncoding>
  <InputEncoding>UTF-8</InputEncoding>
</OpenSearchDescription>
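A federated search engine can read such a descriptor to find the URL template of each participant and fill it in. The following minimal sketch uses only the Python standard library; the function names are hypothetical, and the embedded descriptor is a shortened copy of the RPHA example in which the ampersands are written as &amp;amp;, as well-formed XML requires.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

OS_NS = {"os": "http://a9.com/-/spec/opensearch/1.1/"}

# Shortened copy of the RPHA descriptor shown above.
DESCRIPTOR = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>RPHA Web Search</ShortName>
  <Url type="application/rss+xml"
       template="http://rpha.elte.hu/rpha/openSearch.do/?searchTerms={searchTerms}&amp;startPage={startPage?}&amp;format=rss"/>
</OpenSearchDescription>"""


def url_template(descriptor_xml: str, mime_type: str) -> str:
    """Return the template of the first <Url> with the requested type."""
    root = ET.fromstring(descriptor_xml)
    for url in root.findall("os:Url", OS_NS):
        if url.get("type") == mime_type:
            return url.get("template")
    raise LookupError(f"no Url element of type {mime_type}")


def fill_template(template: str, values: dict) -> str:
    """Replace {name} and {name?} placeholders with percent-encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            # Optional parameters without a value become an empty string.
            return ""
        raise KeyError(name)
    return re.sub(r"\{(\w+)(\?)?\}", substitute, template)


template = url_template(DESCRIPTOR, "application/rss+xml")
print(fill_template(template, {"searchTerms": "meter:02"}))
print(fill_template(template, {"searchTerms": "meter:02", "startPage": 2}))
```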
RESPONSE FORMAT
The base structure of the response follows RSS, extended with OpenSearch elements. The <channel> element contains some header fields, which carry information relevant to the whole response, and a number of <item> elements for the individual results. The header part of the response contains some important elements:
- <totalResults>: the total number of results
- <startIndex>: the ordinal number of the first hit in the returned part of the hit list (important: numbering starts at 1, not 0)
- <itemsPerPage>: the number of records in one response
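These three values are enough for the caller to page through a hit list. A minimal sketch of the arithmetic, with hypothetical function names and 1-based indexes as stated above:

```python
from typing import List, Optional


def next_start_index(total_results: int, start_index: int,
                     items_per_page: int) -> Optional[int]:
    """Return the startIndex of the next page, or None on the last page."""
    following = start_index + items_per_page
    return following if following <= total_results else None


def page_start_indexes(total_results: int, items_per_page: int) -> List[int]:
    """List the startIndex value of every page of a hit list."""
    return list(range(1, total_results + 1, items_per_page))


# With 16 hits and 10 records per response there are two pages.
print(next_start_index(16, 1, 10))
print(next_start_index(16, 11, 10))
print(page_start_indexes(16, 10))
```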
In the <item> elements the implementors should provide three elements in a project-specific way:
- the <title> element should contain the identifier of the repository, and the identifier of the record separated by a space character. For example: <title>RPHA 0373</title>
- the <link> element should contain the URL of the record
- the <description> element should make use of fields defined inside the project's own namespace (which is http://www.megarep.org in the sample implementation). The fields are the same as those we use in the query terms.
An example:
<?xml version="1.0" encoding="UTF-8"?>
<!-- application/rss+xml -->
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>RPHA results</title>
    <link>http://rpha.elte.hu/rpha/openSearch.do?language:lat</link>
    <description>RPHA results</description>
    <opensearch:totalResults>16</opensearch:totalResults>
    <opensearch:startIndex>1</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
    <atom:link rel="search" type="application/opensearchdescription+xml" href="http://rpha.elte.hu/rpha/opensearchdescription.xml"/>
    <opensearch:Query role="request" searchTerms="language:lat" startPage="1" />
    <item>
      <title>RPHA 0373</title>
      <link>http://rpha.elte.hu/rpha/id/0373</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="0373">
          <mr:id>0373</mr:id>
          <mr:incipit>Emlékezzünk, én uraim, régen lett dologról</mr:incipit>
          <mr:title>Rusztán császár históriája</mr:title>
          <mr:author>Drávamelléki Névtelen</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1600</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>01</mr:melody>
          <mr:genre>048,049,051,057</mr:genre>
          <mr:number_of_strophes>226</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
          <mr:metrical_scheme>14141414</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 0381</title>
      <link>http://rpha.elte.hu/rpha/id/0381</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="0381">
          <mr:id>0381</mr:id>
          <mr:incipit>Én lelkecském, búdosócskám, hízelkedőcském</mr:incipit>
          <mr:title>Idézet</mr:title>
          <mr:author>Magyari István</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1595-1600</mr:date>
          <mr:date_qualifier>6</mr:date_qualifier>
          <mr:melody>02</mr:melody>
          <mr:genre>001,003,025,101</mr:genre>
          <mr:genre>048,050,053,061,092</mr:genre>
          <mr:number_of_lines>5</mr:number_of_lines>
          <mr:meter>01-04</mr:meter>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 2052</title>
      <link>http://rpha.elte.hu/rpha/id/2052</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="2052">
          <mr:id>2052</mr:id>
          <mr:incipit>Az elefánt nagy, mégis megöletik</mr:incipit>
          <mr:title>Idézet</mr:title>
          <mr:author>Bornemisza Péter?</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1578</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>02</mr:melody>
          <mr:genre>048,050,053,061,092</mr:genre>
          <mr:number_of_strophes>1</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>AAAX</mr:rhyme_scheme>
          <mr:metrical_scheme>11121216</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 2053</title>
      <link>http://rpha.elte.hu/rpha/id/2053</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="2053">
          <mr:id>2053</mr:id>
          <mr:incipit>Én császár nem lennék</mr:incipit>
          <mr:title>Idézet</mr:title>
          <mr:author>Bornemisza Péter?</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1578</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>02</mr:melody>
          <mr:genre>048,050,053,061,092</mr:genre>
          <mr:number_of_strophes>2</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
          <mr:metrical_scheme>6 6 6 7</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 1340</title>
      <link>http://rpha.elte.hu/rpha/id/1340</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="1340">
          <mr:id>1340</mr:id>
          <mr:incipit>Szólok szerelem dolgáról nektek</mr:incipit>
          <mr:title>Paris és Görög Ilona históriája</mr:title>
          <mr:author>Lévai Névtelen</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1570</mr:date>
          <mr:date_qualifier>1</mr:date_qualifier>
          <mr:melody>01</mr:melody>
          <mr:genre>048,049,051,057</mr:genre>
          <mr:number_of_strophes>289</mr:number_of_strophes>
          <mr:number_of_strophes>290</mr:number_of_strophes>
          <mr:number_of_strophes>291</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
          <mr:metrical_scheme>11111111</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 2054</title>
      <link>http://rpha.elte.hu/rpha/id/2054</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="2054">
          <mr:id>2054</mr:id>
          <mr:incipit>Bújdosó édes lelkecském</mr:incipit>
          <mr:title>Idézet</mr:title>
          <mr:language>la</mr:language>
          <mr:date>1578</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>02</mr:melody>
          <mr:genre>048,050,053,061,092</mr:genre>
          <mr:number_of_strophes>1</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>XXAAA</mr:rhyme_scheme>
          <mr:metrical_scheme>8 8 910 9</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 3216</title>
      <link>http://rpha.elte.hu/rpha/id/3216</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="3216">
          <mr:id>3216</mr:id>
          <mr:incipit>Ó, Istennek teste édesség, e világnak oltalma</mr:incipit>
          <mr:title>Könyörgés a kenyér színe alatt jelenlévő Krisztushoz</mr:title>
          <mr:language>la</mr:language>
          <mr:date>1433</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>01</mr:melody>
          <mr:genre>001,003,008,102</mr:genre>
          <mr:number_of_lines>5</mr:number_of_lines>
          <mr:meter>01-04</mr:meter>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 3211</title>
      <link>http://rpha.elte.hu/rpha/id/3211</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="3211">
          <mr:id>3211</mr:id>
          <mr:incipit>Krisztus feltámada menten nagy kínjából</mr:incipit>
          <mr:title>Húsvéti népének</mr:title>
          <mr:language>la</mr:language>
          <mr:date>1401-1450</mr:date>
          <mr:date_qualifier>6</mr:date_qualifier>
          <mr:melody>01</mr:melody>
          <mr:genre>001,003,008,200</mr:genre>
          <mr:number_of_strophes>1</mr:number_of_strophes>
          <mr:meter>02</mr:meter>
          <mr:rhyme_scheme>XXAAX</mr:rhyme_scheme>
          <mr:metrical_scheme>6 7 7 7 4</mr:metrical_scheme>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 3209</title>
      <link>http://rpha.elte.hu/rpha/id/3209</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="3209">
          <mr:id>3209</mr:id>
          <mr:title>Jephtes históriája</mr:title>
          <mr:author>Balassi Bálint</mr:author>
          <mr:language>la</mr:language>
          <mr:date>1589</mr:date>
          <mr:date_qualifier>4</mr:date_qualifier>
          <mr:melody>03</mr:melody>
          <mr:genre>001,002,004,009</mr:genre>
          <mr:genre>048,049,051,057</mr:genre>
          <mr:number_of_strophes>0</mr:number_of_strophes>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
    <item>
      <title>RPHA 3202</title>
      <link>http://rpha.elte.hu/rpha/id/3202</link>
      <description>
        <mr:record xmlns:mr="http://www.megarep.org" id="3202">
          <mr:id>3202</mr:id>
          <mr:incipit>Az újesztendő kezdessék tőled, Úristen</mr:incipit>
          <mr:title>Naptárvers</mr:title>
          <mr:language>la</mr:language>
          <mr:date>1582</mr:date>
          <mr:date_qualifier>2</mr:date_qualifier>
          <mr:melody>02</mr:melody>
          <mr:genre>048,050,053,059,074</mr:genre>
          <mr:number_of_lines>24</mr:number_of_lines>
          <mr:meter>01-02-01</mr:meter>
          <mr:language>hu</mr:language>
        </mr:record>
      </description>
    </item>
  </channel>
</rss>
As these items show, the field names are the same as those described in the vocabulary section of this paper. There are some minor differences, however: the federated genre classification has not been created yet, so RPHA uses its own classification, and the date does not fully conform to the ISO date standard.
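On the receiving side, the federated search engine has to read the header fields and the project-specific record fields from such a response. The following is a minimal sketch with the Python standard library, not the MegaRep code. The function name parse_response is hypothetical, the sample is a shortened copy of the first item above, and we assume that the record is embedded in <description> as real XML elements, as in the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
    "mr": "http://www.megarep.org",
}

# Shortened copy of the example response shown above.
SAMPLE = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<channel>
<opensearch:totalResults>16</opensearch:totalResults>
<opensearch:startIndex>1</opensearch:startIndex>
<opensearch:itemsPerPage>10</opensearch:itemsPerPage>
<item>
<title>RPHA 0373</title>
<link>http://rpha.elte.hu/rpha/id/0373</link>
<description>
<mr:record xmlns:mr="http://www.megarep.org" id="0373">
<mr:meter>02</mr:meter>
<mr:rhyme_scheme>AAAA</mr:rhyme_scheme>
<mr:language>la</mr:language>
<mr:language>hu</mr:language>
</mr:record>
</description>
</item>
</channel>
</rss>"""


def parse_response(xml_text: str):
    """Return (header, hits) extracted from an OpenSearch RSS response."""
    channel = ET.fromstring(xml_text).find("channel")
    header = {
        name: int(channel.findtext(f"opensearch:{name}", namespaces=NS))
        for name in ("totalResults", "startIndex", "itemsPerPage")
    }
    hits = []
    for item in channel.findall("item"):
        # The title holds the repository and record identifiers.
        repository, record_id = item.findtext("title").split(" ", 1)
        fields = {}
        record = item.find("description/mr:record", NS)
        if record is not None:
            for child in record:
                name = child.tag.split("}", 1)[1]  # drop the namespace part
                # Fields are repeatable, so every field maps to a list.
                fields.setdefault(name, []).append(child.text)
        hits.append({
            "repository": repository,
            "id": record_id,
            "link": item.findtext("link"),
            "fields": fields,
        })
    return header, hits


header, hits = parse_response(SAMPLE)
print(header)
print(hits[0]["repository"], hits[0]["id"], hits[0]["fields"]["language"])
```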
ABOUT THE PILOT IMPLEMENTATION
The MegaRep source code is available as open source software at http://github.org/pkiraly/megarep. The working implementation is available at http://rpha.elte.hu/megarep/search.do. The RPHA source code is also available, at http://github.org/pkiraly/rpha; the OpenSearch endpoint is http://rpha.elte.hu/rpha/openSearch.do. Both implementations were written in Java using the Apache Struts 1.0 framework. MegaRep contains translation files for English, French and Hungarian, so both search and record retrieval are available in all three languages.
AFTERWORDS
The background of this proposal, i.e. the services and the data dictionary, was created 5 years ago, but it was never documented other than in a handful of spreadsheets and readme files. Recently I had to find the documentation of my work on RPHA, and unexpectedly I also found MegaRep's files, so I thought it was high time to write this proposal, even if I think some parts are outdated in the light of the advances of TEI and Linked Data. There is a Hungarian proverb: better late than never. So until we come up with an up-to-date proposal, here is this one to read and use. If you have any suggestions, please write to us.