Flamenco Metadata Construction Guide

This document outlines the steps required to set up the metadata for a Flamenco database and interface.

In this document, we do not discuss how to assign metadata descriptors to a collection; we assume that the basic classification has already occurred. We do, however, offer some guidance on editing the metadata structure to make it serve most effectively in the browser interface.

Note that Flamenco can be used with text or image collections. The examples below are drawn mainly from image collections.

1 Determine facets and attributes

When converting a dataset, you need to classify each feature in the dataset as either an attribute, which will appear only in the metadata displayed alongside the item, or a facet, which users can browse by to move through the dataset.

For example, in the Fine Arts Museum collection, "Media," "Location," and "Date" are facets because different photos can share the same location, media type or date, and the user may want to search for all photographs in a certain medium (such as drawing or sculpture). On the other hand, the image record number is an attribute because few users will want to search for all photographs with a certain record number, though they are likely to want that information once they locate a useful photograph.

Note that facets are browsable item characteristics. Attributes, in contrast, are shown only after an image is found, although they can be shown with thumbnails/titles and used for sorting.

2 Create tab-delimited (tsv) files

You will need to create the following tab-delimited text files:

  • attrs.tsv
  • facets.tsv
  • items.tsv
  • [facetname]_hierarchy.tsv (for every facet you decide to have)
  • [facetname]_item_mapping.tsv (for every facet you decide to have)
  • fulltext.tsv
  • sortkeys.tsv

(Note: Large tab-delimited files can be easily manipulated using Excel. Also, samples of all files you're required to generate can be found in the Resources section below.)

attrs.tsv and facets.tsv

attrs.tsv should be a list of the attributes you've decided on for your system. Each row in this file should represent a single attribute. For each of these rows, the first column should be the underlying system name for this attribute. The second column should be the display name you'd like users to see. A simplified attrs.tsv file for a collection of articles might look something like:

item PMID
title Title

This file is most easily generated by hand.

Similarly, facets.tsv should have a row for each of the facets you'd like your system to use. As in attrs.tsv, the first two columns of this file should be the underlying system name and the display name, respectively. The third column is a textual description of, or comment on, the facet; it is not used by the system but should be included for later reference. A simplified facets.tsv file might look something like:

journal Journal Short name of the journal in which article appears
date_created Date Date the article was created

The only thing to note is that the identifiers in the first column of both files must be a single word. In the attrs.tsv file, this identifier should be consistent with the column names of the items table. Remember, attributes (attrs) show up only in the endgame view, whereas the facets listed in facets.tsv are used for navigation.
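The single-word rule can be checked mechanically before running the installer. The following is a minimal sketch, not part of Flamenco itself: the helper names `read_tsv` and `bad_identifiers` are hypothetical, and the sample data simply repeats the rows shown above.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-delimited text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def bad_identifiers(rows):
    """Return the first-column identifiers that are not exactly one word."""
    return [row[0] for row in rows if len(row[0].split()) != 1]


# Sample contents mirroring the attrs.tsv and facets.tsv examples above.
attrs = read_tsv("item\tPMID\ntitle\tTitle\n")
facets = read_tsv(
    "journal\tJournal\tShort name of the journal in which article appears\n"
    "date_created\tDate\tDate the article was created\n"
)

print(bad_identifiers(attrs + facets))  # prints []
```

An identifier such as "date created" (with a space) would be reported by `bad_identifiers`, while "date_created" passes.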

items.tsv

This file should contain one line for each item in your collection. For every row, values should exist for every attribute your system is using. (Note: Column headings are not included in the actual file). A collection with attribute fields RecordID, color, and date might look something like:

568945 blue 02-03-2001
938932 red 04-30-1999
934983 green 02-22-2000

The thing to note here is that the first column of every single row should be a unique identifier for the item.
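A quick sanity check for items.tsv might look like the sketch below. It assumes the rows have already been split on tabs and that the number of attributes is known from attrs.tsv; `check_items` is a hypothetical helper, not a Flamenco function.

```python
from collections import Counter


def check_items(rows, n_attrs):
    """Return (duplicate ids, ids of rows whose column count is not n_attrs)."""
    counts = Counter(row[0] for row in rows)
    duplicates = sorted(item_id for item_id, n in counts.items() if n > 1)
    wrong_width = [row[0] for row in rows if len(row) != n_attrs]
    return duplicates, wrong_width


# The three sample rows from above, split on tabs.
sample = "568945\tblue\t02-03-2001\n938932\tred\t04-30-1999\n934983\tgreen\t02-22-2000"
rows = [line.split("\t") for line in sample.splitlines()]

print(check_items(rows, n_attrs=3))  # prints ([], [])
```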

[facetname]_hierarchy.tsv (for every facet you decide to have)

Each row in these files should represent a node in the facet hierarchy. The first column should be an identifier for that node, to be used in [facetname]_item_mapping.tsv. Subsequent columns should be the values for that node, listed general to specific from left to right. A portion of this file for the location facet, location_hierarchy.tsv, might look something like:

1 United States California San Francisco
2 United States California Berkeley
3 United States Washington Seattle

The thing to note here is that the first column of every single row should be a unique identifier for that node.
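If the facet values already exist as general-to-specific paths, the hierarchy rows can be generated rather than typed. This is only a sketch under that assumption; `hierarchy_rows` is a hypothetical helper that numbers each distinct path in order of first appearance.

```python
def hierarchy_rows(paths):
    """Assign a unique integer id to each distinct general-to-specific path."""
    rows = []
    seen = set()
    for path in paths:
        key = tuple(path)
        if key in seen:
            continue  # the same path must not receive a second id
        seen.add(key)
        rows.append([str(len(rows) + 1), *key])
    return rows


paths = [
    ("United States", "California", "San Francisco"),
    ("United States", "California", "Berkeley"),
    ("United States", "Washington", "Seattle"),
    ("United States", "California", "Berkeley"),  # duplicate, ignored
]

# Each row becomes one tab-delimited line of location_hierarchy.tsv.
lines = ["\t".join(row) for row in hierarchy_rows(paths)]
print(len(lines))  # prints 3
```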

[facetname]_item_mapping.tsv (for every facet you decide to have)

Each row in these files should map an item to a node in the facet hierarchy. This is perhaps best described with an example. Consider, once again, the location facet. If we are using the location_hierarchy.tsv file from above, our location_item_mapping.tsv file might look something like:

75635 1
434543 1
645654 3
534454 2

This would indicate that item 75635 has location values "United States->California->San Francisco." Likewise, item 645654 would have location values "United States->Washington->Seattle." For facets where items might have multiple values ("multi-valued facets"), simply include multiple rows assigning values to that item.
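To see how the two files fit together, the mapping can be resolved against the hierarchy. The sketch below assumes both files have been read into lists of string columns; `item_paths` is a hypothetical helper. Multi-valued facets fall out naturally because each item collects a list of paths.

```python
from collections import defaultdict


def item_paths(hierarchy, mapping):
    """Resolve each (item id, node id) mapping row to its full facet path."""
    nodes = {row[0]: row[1:] for row in hierarchy}
    resolved = defaultdict(list)
    for item_id, node_id in mapping:
        resolved[item_id].append("->".join(nodes[node_id]))
    return dict(resolved)


hierarchy = [
    ["1", "United States", "California", "San Francisco"],
    ["2", "United States", "California", "Berkeley"],
    ["3", "United States", "Washington", "Seattle"],
]
mapping = [["75635", "1"], ["434543", "1"], ["645654", "3"], ["534454", "2"]]

print(item_paths(hierarchy, mapping)["645654"])
# prints ['United States->Washington->Seattle']
```

A mapping row that names a node id missing from the hierarchy raises a KeyError here, which is a convenient way to catch dangling references early.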

fulltext.tsv

This file can be generated by hand. It supports full-text searching and is only necessary if you plan to choose MySQL full-text searching later, as opposed to Lucene. For every item in your collection, provide any text associated with that item. The format of the file should be as follows:

001 all the text associated with item 001
002 all the text associated with item 002
003 all the text associated with item 003
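For larger collections, the lines can also be assembled by a script. The sketch below assumes that each item's text must fit on a single line with no embedded tabs, since the file is tab-delimited; `fulltext_line` is a hypothetical helper that collapses all whitespace accordingly.

```python
def fulltext_line(item_id, *texts):
    """Build one fulltext.tsv line, collapsing tabs and newlines to spaces."""
    body = " ".join(" ".join(texts).split())
    return f"{item_id}\t{body}"


line = fulltext_line("001", "all the\ttext", "associated\nwith item 001")
print(repr(line))  # prints '001\tall the text associated with item 001'
```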

sortkeys.tsv

This file tells the system which attributes or facets you want to sort by. The first column should provide the display value for the sorting option. The second column should provide the name of the facet or attribute in the underlying system. That is, every value in the second column should be found in either facets.tsv or attrs.tsv. A sortkeys.tsv file for a collection of publications might look something like this:

Journal journal
Date date_created

For more details see 3.6
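The requirement that every sort key name a known facet or attribute can be verified with a few lines. This is a sketch only; `unknown_sort_keys` is a hypothetical helper, and the name lists stand in for the first columns of facets.tsv and attrs.tsv.

```python
def unknown_sort_keys(sortkeys, facet_names, attr_names):
    """Return sort key names that appear in neither facets.tsv nor attrs.tsv."""
    known = set(facet_names) | set(attr_names)
    return [name for _display, name in sortkeys if name not in known]


sortkeys = [("Journal", "journal"), ("Date", "date_created")]
facet_names = ["journal", "date_created"]
attr_names = ["item", "title"]

print(unknown_sort_keys(sortkeys, facet_names, attr_names))  # prints []
```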

3 Examine output files

The Flamenco installation script will produce the following files as output. Although you need not manipulate them, they can be helpful in verifying that the instance was created accurately:
  • [facetname].tsv
  • item_[facetname].tsv
  • [facetname].txt
  • [facetname].sql
  • log.txt

(Note: Large tab-delimited files can be easily manipulated using Excel. Also, samples of all files generated can be found in the Resources section below.)

[facetname].tsv

These .tsv files represent the hierarchy of the given facet. For each node in the hierarchy, a name, parent, depth, and chain of ancestors are listed. For example, consider the following hypothetical .tsv generated for a location hierarchy represented by the tree above:

id  name         parent  level  p1  p2  p3
1   USA          0       0      1   0   0
2   California   1       1      1   2   0
3   Nevada       1       1      1   3   0
4   Berkeley     2       2      1   2   4
5   Los Angeles  2       2      1   2   5

This table shows that the node "USA" has an id of 1, which will be used to represent it in the item_[facetname].tsv file described below. Since "USA" is the top of this location tree structure, it has no parent, so its parent value is 0. As the root node, it has a level of 0. Its p1, or first parent, is itself. Since the path from the root ends there, p2 and p3 are 0.

Moving down the tree, consider "California." This node has "USA" as its direct parent, so its parent value is 1, USA's id. Its level is 1, since it sits at the first level of the hierarchy (as opposed to the 0th as before). Since California has a parent above it in the hierarchy, namely "USA," its p1 is 1; this is also its highest parent. The next node down the path is California itself, so p2 has a value of 2. The rest of the nodes can be described similarly.

For all nodes, count indicates the number of items having this node as an assignment, that is, the number of rows for the node in item_[facetname]. This file is used in generating some of the tables for the backend.
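The columns described here can be reproduced from a list of general-to-specific paths. The following is an illustrative sketch only: it is not the installation script, `node_table` is a hypothetical helper, and it assumes ids are handed out level by level in order of first appearance, which happens to match the sample table.

```python
def node_table(paths, max_depth=3):
    """Return rows (id, name, parent, level, p1..pN) for every node on the paths."""
    ids = {}  # path prefix (tuple of names) -> node id
    for level in range(max_depth):
        for path in paths:
            prefix = tuple(path[:level + 1])
            if len(prefix) == level + 1 and prefix not in ids:
                ids[prefix] = len(ids) + 1
    rows = []
    for prefix, node_id in ids.items():
        level = len(prefix) - 1
        parent = ids[prefix[:-1]] if level > 0 else 0
        # p1..pN: ids along the path from the root down to this node, padded with 0.
        chain = [ids[prefix[:i + 1]] for i in range(len(prefix))]
        chain += [0] * (max_depth - len(chain))
        rows.append((node_id, prefix[-1], parent, level, *chain))
    return rows


paths = [
    ("USA", "California", "Berkeley"),
    ("USA", "California", "Los Angeles"),
    ("USA", "Nevada"),
]
for row in node_table(paths):
    print(*row, sep="\t")
```

With these three paths, the sketch yields the same five rows as the sample table, for example `(2, "California", 1, 1, 1, 2, 0)`.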

item_[facetname].tsv

Using the [facetname].tsv as a definition of the hierarchy, these files describe the actual mapping of the facet values to each individual item. Consider the following possible item_location.tsv file generated using the previously defined location.tsv hierarchy.

item id leaf
100 1 0
100 2 0
100 4 1

The rows above are what the portion of the table describing item 100 might look like. They say that item 100 has nodes 1, 2, and 4 associated with it, and that its leaf node is node 4. Looking these up in [facetname].tsv (the location.tsv from above), you see that nodes 1, 2, and 4 correspond to "USA," "California," and "Berkeley," respectively, with Berkeley being the leaf node. This file is used in generating some of the tables for the backend.
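The relationship between a node's p1..p3 chain and the item rows can be sketched as follows. This assumes, as in the sample, that an item is listed once for every node on the path to its assigned node and that only the last of these is flagged as the leaf; `item_rows` is a hypothetical helper, not the installer's code.

```python
def item_rows(item_id, chain):
    """Expand a p1..pN chain into (item, id, leaf) rows, skipping 0 padding."""
    path = [node_id for node_id in chain if node_id != 0]
    last = len(path) - 1
    return [(item_id, node_id, 1 if i == last else 0) for i, node_id in enumerate(path)]


# Item 100 is assigned to Berkeley, whose chain in location.tsv is 1, 2, 4.
print(item_rows(100, [1, 2, 4]))
# prints [(100, 1, 0), (100, 2, 0), (100, 4, 1)]
```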

 


Questions? Comments? Contact Kevin Li (kevinli@sims.berkeley.edu)