Flamenco Metadata Construction Guide
This document outlines the steps
required to set up the metadata for a Flamenco database and interface.
In this document, we do not discuss
how to assign metadata descriptors to a collection; we assume that the basic
classification has already occurred. We do, however, offer some guidance on
editing the metadata structure to make it serve most effectively in the browser
interface.
Note that Flamenco can be used with text or image collections. The examples
below use image collections.
1 Determine facets and attributes
When converting a
dataset, you need to classify each feature in the dataset as either an
attribute, which appears only in the metadata displayed alongside the item,
or a facet, which users can browse the dataset by.
For example, in the Fine Arts
Museum collection, "Media," "Location," and "Date" are facets because different
photos can share the same location, media type or date, and the user may want to
search for all photographs in a certain medium (such as drawing or sculpture).
On the other hand, the image record number is an attribute because few users
will want to search for all photographs with a certain record number, though
they are likely to want that information once they locate a useful photograph.
Note that facets
are browsable item characteristics. Attributes, by contrast, are shown only
after an image is found, although they can also be displayed with thumbnails/titles
and used for sorting.
2 Create tab-delimited (tsv) files
You will need to create the following tab-delimited text
files:
- attrs.tsv
- facets.tsv
- items.tsv
- [facetname]_hierarchy.tsv (for every facet you decide to have)
- [facetname]_item_mapping.tsv (for every facet you decide to have)
- fulltext.tsv
- sortkeys.tsv
(Note: Large tab-delimited files can be easily manipulated
using Excel. Also, samples of all files you're required to generate can be
found in the Resources section below.)
attrs.tsv and facets.tsv
attrs.tsv should be a list of
the attributes you've decided on for your system. Each row in this file
should represent a single attribute. For each of these rows, the first column
should be the underlying system name for this attribute. The second column
should be the display name you'd like users to see. A simplified attrs.tsv
file for a collection of articles might look something like:
This file is most easily generated by hand.
Similarly, facets.tsv should have a row for each of the
facets you'd like your system to use. As in attrs.tsv, the first two columns of
this file should be the underlying system name and the display name,
respectively. The third column is a textual description of or comment on the
facet; it is not used by the system but should be included for later
reference. A simplified facets.tsv file might look something like:
journal | Journal | Short name of the journal in which article appears
date_created | Date | Date the article was created
Note that the identifiers in the first column of both files
should be a single word. In the attrs.tsv file, this identifier should
match the corresponding column name in the items table. Remember that
attributes (attrs) appear only in the endgame view, whereas the facets
listed in facets.tsv are used for navigation.
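As a rough illustration, the sketch below writes facet rows as tab-delimited text and rejects identifiers that are not a single word. The helper names (`check_identifier`, `write_tsv`) are hypothetical, and treating "one word" as letters, digits, and underscores is an assumption, not a documented Flamenco rule.

```python
import csv
import io
import re

# Rows taken from the facets.tsv example: system name, display name, comment.
FACETS = [
    ("journal", "Journal", "Short name of the journal in which article appears"),
    ("date_created", "Date", "Date the article was created"),
]

def check_identifier(name):
    """Return True if name is one 'word' (assumed: letters, digits, underscores)."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None

def write_tsv(rows, handle):
    """Write rows as tab-delimited text, refusing multi-word identifiers."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if not check_identifier(row[0]):
            raise ValueError(f"identifier must be one word: {row[0]!r}")
        writer.writerow(row)

# Write to an in-memory buffer; a real script would open facets.tsv instead.
buffer = io.StringIO()
write_tsv(FACETS, buffer)
facets_text = buffer.getvalue()
```

The same `write_tsv` helper could be reused for attrs.tsv, which has only the first two columns.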
items.tsv
This file should contain one line
for each item in your collection. For every row, values should exist for every
attribute your system is using. (Note:
Column headings are not included in the actual file). A collection with
attribute fields RecordID, color, and date might look something like:
568945 | blue | 02-03-2001
938932 | red | 04-30-1999
934983 | green | 02-22-2000
The thing to note here is that the first column of every
single row should be a unique identifier for the item.
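A quick check before installation can catch duplicate identifiers or rows with missing values. The following is a minimal sketch, assuming plain tab-separated lines with no quoting; `validate_items` is a hypothetical helper, not part of Flamenco.

```python
def validate_items(lines, n_columns):
    """Return a list of problems found in tab-delimited item rows."""
    problems = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != n_columns:
            problems.append(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        item_id = fields[0]
        if item_id in seen:
            problems.append(f"line {line_no}: duplicate item id {item_id}")
        seen.add(item_id)
    return problems

# The three sample rows from the example above (RecordID, color, date).
sample = [
    "568945\tblue\t02-03-2001\n",
    "938932\tred\t04-30-1999\n",
    "934983\tgreen\t02-22-2000\n",
]
problems = validate_items(sample, n_columns=3)
```

An empty list means every row has the expected number of columns and every first-column identifier is unique.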
[facetname]_hierarchy.tsv (for every facet you decide to have)
Each row in these files should
represent a node in the facet hierarchy. The first column should be an
identifier for that node, to be used in [facetname]_item_mapping.tsv.
Subsequent columns should be the values for that node, listed general to
specific from left to right. A portion of this file for the location facet,
location_hierarchy.tsv, might look something like:
1 | United States | California | San Francisco
2 | United States | California | Berkeley
3 | United States | Washington | Seattle
The thing to note here is that the first column of every
single row should be a unique identifier for that node.
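To make the row layout concrete, here is a small sketch that reads hierarchy rows into a dictionary keyed by node identifier and rejects repeated identifiers. `parse_hierarchy` is a hypothetical helper, and the input is assumed to be plain tab-separated lines.

```python
def parse_hierarchy(lines):
    """Map each node identifier to its path, listed general to specific."""
    nodes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        node_id = fields[0]
        # Drop empty trailing cells so short paths are not padded with "".
        path = tuple(value for value in fields[1:] if value)
        if node_id in nodes:
            raise ValueError(f"duplicate node identifier: {node_id}")
        nodes[node_id] = path
    return nodes

# The three sample rows from the location hierarchy example.
sample = [
    "1\tUnited States\tCalifornia\tSan Francisco\n",
    "2\tUnited States\tCalifornia\tBerkeley\n",
    "3\tUnited States\tWashington\tSeattle\n",
]
hierarchy = parse_hierarchy(sample)
```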
[facetname]_item_mapping.tsv (for every facet you decide to have)
Each row in these files maps
one item to a node in the facet hierarchy. This
is perhaps best described with an example. Consider once again the location
facet. If we are using the location_hierarchy.tsv file from above, our
location_item_mapping.tsv file might look something like:
75635 | 1
434543 | 1
645654 | 3
534454 | 2
This would indicate that item 75635 has location values
"United States->California->San Francisco." Likewise, item 645654 would have
location values "United
States->Washington->Seattle." For facets where items might have multiple
values ("multi-valued facets"), simply include multiple rows for that item,
one per value.
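The lookup described above can be sketched in a few lines. This is an illustration only: `resolve_mapping` is a hypothetical helper, and the hierarchy dictionary simply restates the sample location rows.

```python
from collections import defaultdict

# Node identifiers and paths copied from the location hierarchy example.
HIERARCHY = {
    "1": ("United States", "California", "San Francisco"),
    "2": ("United States", "California", "Berkeley"),
    "3": ("United States", "Washington", "Seattle"),
}

def resolve_mapping(lines, hierarchy):
    """Return {item_id: [path, ...]}; repeated item ids yield several paths."""
    values = defaultdict(list)
    for line in lines:
        item_id, node_id = line.rstrip("\n").split("\t")
        if node_id not in hierarchy:
            raise KeyError(f"item {item_id} refers to unknown node {node_id}")
        values[item_id].append("->".join(hierarchy[node_id]))
    return dict(values)

mapping_lines = ["75635\t1\n", "434543\t1\n", "645654\t3\n", "534454\t2\n"]
resolved = resolve_mapping(mapping_lines, HIERARCHY)
```

A multi-valued facet needs no special handling here: an item that appears on two rows simply ends up with two paths in its list.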
fulltext.tsv
This file can be generated by hand.
It supports full-text searching and is only necessary if you plan to choose
MySQL full-text searching later, as opposed to Lucene. For every item in your
collection, provide any text associated with that item. The format of the file
should be as follows:
001 | all the text associated with item 001
002 | all the text associated with item 002
003 | all the text associated with item 003
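Free text often contains tabs and line breaks, which would break a one-row-per-item tab-delimited layout. The sketch below flattens the text first; that this cleanup is needed is an assumption about the file format rather than a documented Flamenco requirement, and `fulltext_row` is a hypothetical helper.

```python
import re

def fulltext_row(item_id, text):
    """Collapse all whitespace runs so the text fits on one tab-delimited line."""
    flat = re.sub(r"\s+", " ", text).strip()
    return f"{item_id}\t{flat}\n"

row = fulltext_row("001", "all the text\nassociated with\titem 001")
```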
sortkeys.tsv
This file lets the system know
which attributes or facets you want to sort by. The first column should provide
the display value for the sorting option. The second column should provide the
name of the facet or attribute in the underlying system. That is, all the values
of the second column should be found in either facets.tsv or attrs.tsv. The file
for a system cataloging publications might look something like this:
Journal | journal
Date | date_created
For more details see 3.6
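The requirement that every second-column value appear in facets.tsv or attrs.tsv is easy to check mechanically. A minimal sketch, with `check_sortkeys` as a hypothetical helper and the name lists assumed to have been read from those two files already:

```python
def check_sortkeys(sortkey_rows, facet_names, attr_names):
    """Return the sort-key system names that match no facet or attribute."""
    known = set(facet_names) | set(attr_names)
    return [name for _display, name in sortkey_rows if name not in known]

# Rows from the sortkeys.tsv example: display value, then system name.
sortkeys = [("Journal", "journal"), ("Date", "date_created")]
missing = check_sortkeys(
    sortkeys, facet_names=["journal", "date_created"], attr_names=[]
)
```

An empty result means every sort key refers to a known facet or attribute.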
3 Examine output files
The Flamenco installation
script will produce the following files as output. Although you need not
manipulate them, they can be helpful in verifying that the instance was
created accurately:
- [facetname].tsv
- item_[facetname].tsv
- [facetname].txt
- [facetname].sql
- log.txt
(Note: Large tab-delimited files can
be easily manipulated using Excel. Also, samples of all files generated can
be found in the Resources section below.)
[facetname].tsv
These .tsv files represent the
hierarchy of the given facet. For each node in the hierarchy, a name, parent,
depth, and chain of ancestors are listed. For example, consider the following
hypothetical .tsv generated for a location hierarchy represented by the tree
above:
id | name | parent | level | p1 | p2 | p3
1 | USA | 0 | 0 | 1 | 0 | 0
2 | California | 1 | 1 | 1 | 2 | 0
3 | Nevada | 1 | 1 | 1 | 3 | 0
4 | Berkeley | 2 | 2 | 1 | 2 | 4
5 | Los Angeles | 2 | 2 | 1 | 2 | 5
This table shows that the node
"USA" has an id of 1, which will be used to represent it in the
item_[facetname].tsv file described below. Since "USA" is the
top of this location tree structure, it has no parent, so its parent value
is 0. Also, since it is the root node, it has a level of 0. Its p1, or first
parent, is itself. Since there are no further nodes on its path, p2 and p3
are 0. Moving down the tree,
consider "California." This node has the node "USA"
as a direct parent, so its parent value is 1, indicating USA's id. The level
is now 1 since it is at the first level of the hierarchy (as opposed
to the 0th as before). Since California has a parent higher than itself
in the hierarchy, namely "USA," its p1 is 1. This is also its
highest parent. Its next parent down the hierarchy is itself, so p2 has
a value of 2. For all nodes, count indicates the number of items having
this node as an assignment, that is, the number of rows in item_[facetname]. The
rest of the nodes can be described similarly. This file is used in generating
some of the tables for the backend.
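The parent, level, and p1 to p3 values described above can be derived from a list of (id, name, parent) triples. The sketch below reproduces the sample rows under that assumption; it is not the installation script's actual code, `describe_nodes` is a hypothetical name, and the count column is left out.

```python
def describe_nodes(nodes, max_depth=3):
    """nodes: list of (node_id, name, parent_id); parent_id 0 marks the root.

    Returns rows of (id, name, parent, level, p1, ..., p<max_depth>).
    """
    parent_of = {node_id: parent for node_id, _name, parent in nodes}
    rows = []
    for node_id, name, parent in nodes:
        # Walk up to the root, collecting this node and its ancestors.
        chain = [node_id]
        while parent_of[chain[-1]] != 0:
            chain.append(parent_of[chain[-1]])
        chain.reverse()  # root first, the node itself last
        level = len(chain) - 1
        # Pad the unused deeper slots with 0, as in the sample table.
        padded = chain + [0] * (max_depth - len(chain))
        rows.append((node_id, name, parent, level, *padded))
    return rows

tree = [
    (1, "USA", 0),
    (2, "California", 1),
    (3, "Nevada", 1),
    (4, "Berkeley", 2),
    (5, "Los Angeles", 2),
]
rows = describe_nodes(tree)
```

Each node's own id lands in the p column matching its level, which is why California (level 1) has p2 = 2 and Berkeley (level 2) has p3 = 4.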
item_[facetname].tsv
Using the [facetname].tsv as a
definition of the hierarchy, these files describe the actual mapping of
the facet values to each individual item. Consider the following possible
item_location.tsv file generated using the previously defined location.tsv
hierarchy.
item | id | leaf
100 | 1 | 0
100 | 2 | 0
100 | 4 | 1
This is what
the portion of the file describing item 100 might look like. It says that item 100
has nodes 1, 2, and 4 associated with it, of which node 4 is its leaf
node. Looking these up in [facetname].tsv (the location.tsv
from above), you see that nodes 1, 2, and 4 correspond to "USA,"
"California," and "Berkeley" respectively, with Berkeley
being the leaf node. This file is used in generating some of the tables
for the backend.
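One way to picture how these rows arise is to start from an item's most specific node and walk up the hierarchy. The following sketch does that for item 100; the `PARENT_OF` dictionary restates the sample location hierarchy, and `item_rows` is a hypothetical helper rather than Flamenco's own code.

```python
# Parent id for each node in the sample location hierarchy; 0 marks the root.
PARENT_OF = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

def item_rows(item_id, leaf_id, parent_of):
    """Return (item, id, leaf) rows for a leaf node and all of its ancestors."""
    chain = [leaf_id]
    while parent_of[chain[-1]] != 0:
        chain.append(parent_of[chain[-1]])
    chain.reverse()  # root first, leaf last
    return [
        (item_id, node_id, 1 if node_id == leaf_id else 0)
        for node_id in chain
    ]

rows = item_rows(100, 4, PARENT_OF)
```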
Questions? Comments? Contact Kevin
Li (kevinli@sims.berkeley.edu)