This document is under construction. It will describe how to integrate previously generated metadata with Clowder.
Background
At the time of writing, about 171,000 images from the Library of Congress Farm Security Administration/Office of War Information Photograph Collection have been processed to extract features such as faces, eyes, facial profiles, closeups, printed text, presence of a Stryker hole, presence of a border, mean and standard deviation of grayscale values, subject details, photographer details, and category details. These features were computed on XSEDE Comet using stripped-down versions of Clowder extractors or, in certain cases, new standalone programs. Integrating this information with Clowder makes it possible to use Clowder features such as its RESTful API, authentication and authorization, and available visualizations.
...
The following tables describe the database tables that hold the extracted metadata. Each database table description is followed by an example JSON document that the Extractor Integration Script will generate for that table.
CategoryInfo
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | category | category | LOC Category number (other_number field in the image JSON document) | String; |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | CategoryInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1997018591#F 665
{
"loc_id": "fsa1997018591",
"category": "F 665"
} |
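As a minimal sketch of the mapping shown above, the following Python function converts one '#'-delimited CategoryInfo row into the JSON document. The function name is hypothetical and is not part of the Extractor Integration Script; the '#' delimiter is taken from the sample input row.

```python
import json


def category_row_to_json(row):
    """Convert one '#'-delimited CategoryInfo row into a metadata dict.

    Hypothetical helper for illustration; the actual integration script
    may be organized differently.
    """
    # Split only on the first '#' so a '#' inside the category survives.
    loc_id, category = row.rstrip("\n").split("#", 1)
    return {"loc_id": loc_id, "category": category}


doc = category_row_to_json("fsa1997018591#F 665")
print(json.dumps(doc, indent=2))
```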
...
NOTE: Some creator values are empty strings, so this table might need some refinement.
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | name | Creator | Creator name loosely in the format: <last name>, <first name>, <birth year> - <death year>. One or all of these sub parts could be missing. If the creator name is blank the value is NULL. | String; |
3 | year_mon | Year | Year and month (abbreviated in certain cases) in which the photograph was taken in the format: <year> - <month | month1 - month2 | season> (null if not available). | String; Some year - month values are like '[between 1940 and 1946]'; The format mentioned in the left cell may not be strictly followed. Need to look into this in detail when doing the transformation. |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | CreatorInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1997018591#Shahn, Ben, 1898-1969#1938 Aug.
{
"loc_id": "fsa1997018591",
"Creator": "Shahn, Ben, 1898-1969",
"Year": "1938"
} |
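The sketch below shows one possible way to produce the CreatorInfo document. It rests on two assumptions that the table does not confirm: a blank creator becomes null, and "Year" is the first four-digit number in year_mon (which matches the sample, where "1938 Aug." becomes "1938"). For values such as '[between 1940 and 1946]' this rule keeps only "1940", so it is lossy and would need review. The function name is hypothetical.

```python
import re


def creator_row_to_json(row):
    """Convert one '#'-delimited CreatorInfo row into a metadata dict.

    Hypothetical helper; the year rule below is an assumption based on
    the single sample row, not a documented transformation.
    """
    loc_id, name, year_mon = row.rstrip("\n").split("#", 2)
    name = name.strip()
    # Assumption: take the first 4-digit number as the year.
    match = re.search(r"\d{4}", year_mon)
    return {
        "loc_id": loc_id,
        "Creator": name or None,          # blank creator -> null
        "Year": match.group(0) if match else None,
    }


doc = creator_row_to_json("fsa1997018591#Shahn, Ben, 1898-1969#1938 Aug.")
```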
FacesInfo
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | imght | image_height | Image height | Float; |
3 | imgwid | image_width | Image width | Float; |
4 | dumb1 | N/A | The letter F, it's only there to help browse raw data | String; |
5 | num_faces | Number of People | Number of faces found | Integer; |
6 | face_segs | face_segments | Bounding box location of faces | String; each face segment lists the face index i, followed by the x, y, width, and height of the segment. Face segments are separated by semicolons. |
7 | dumb2 | N/A | The letter P, it's only there to help browse raw data | String; |
8 | num_profiles | num_profiles | Number of profiles found | Integer; |
9 | prof_segs | profile_segments | Bounding box location of profiles | String; |
10 | dumb3 | N/A | The letter Y, it's only there to help browse raw data | String; |
11 | num_eyes | num_eyes | Number of eyes found | Integer; |
12 | eye_segs | eye_segments | Bounding box location of eyes | String; |
13 | dumb4 | N/A | The letter C, it's only there to help browse raw data | String; |
14 | num_fullcls | num_full_closeups | Number of face full closeups | Integer; 'FULL' is relative to image size |
15 | num_midcls | num_mid_closeups | Number of face mid closeups | Integer; 'MID' is relative to image size |
16 | num_fullprof | num_full_profiles | Number of profile full closeups | Integer; 'FULL' is relative to image size |
17 | num_midprof | num_mid_profiles | Number of profile mid closeups | Integer; 'MID' is relative to image size |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | FacesInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1997018503#723#1024#F#1#1,185,210,113,113;#P#0#-1#Y#1#1,55,31,26,26;#C#0#0#0#0
{
"loc_id": "fsa1997018503",
"image_height": 723.0,
"image_width": 1024.0,
"Number of People": 1,
"face_segments":
[
{
"x": 185,
"y": 210,
"width": 113,
"height": 113
}
],
"num_profiles": 0,
"profile_segments": [],
"num_eyes": 1,
"eye_segments":
[
{
"x": 55,
"y": 31,
"width": 26,
"height": 26
}
],
"num_full_closeups": 0,
"num_mid_closeups": 0,
"num_full_profiles": 0,
"num_mid_profiles": 0
} |
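The segment strings are the only nested part of a FacesInfo row, so a sketch of the conversion may help. It assumes, based on the sample row alone, that "-1" means no segments were found and that the leading index of each segment is dropped in the JSON output. Both function names are hypothetical.

```python
def parse_segments(field):
    """Parse 'i,x,y,w,h;i,x,y,w,h;' into a list of bounding-box dicts."""
    segments = []
    for part in field.split(";"):
        part = part.strip()
        # Assumption: '-1' marks "none found", as in the sample row.
        if not part or part == "-1":
            continue
        _index, x, y, width, height = (int(v) for v in part.split(","))
        segments.append({"x": x, "y": y, "width": width, "height": height})
    return segments


def faces_row_to_json(row):
    """Convert one '#'-delimited FacesInfo row into a metadata dict."""
    f = row.rstrip("\n").split("#")
    if len(f) != 17:
        raise ValueError("expected 17 fields, got %d" % len(f))
    # Fields 3, 6, 9 and 12 are the marker letters F, P, Y, C (skipped).
    return {
        "loc_id": f[0],
        "image_height": float(f[1]),
        "image_width": float(f[2]),
        "Number of People": int(f[4]),
        "face_segments": parse_segments(f[5]),
        "num_profiles": int(f[7]),
        "profile_segments": parse_segments(f[8]),
        "num_eyes": int(f[10]),
        "eye_segments": parse_segments(f[11]),
        "num_full_closeups": int(f[13]),
        "num_mid_closeups": int(f[14]),
        "num_full_profiles": int(f[15]),
        "num_mid_profiles": int(f[16]),
    }


doc = faces_row_to_json(
    "fsa1997018503#723#1024#F#1#1,185,210,113,113;#P#0#-1"
    "#Y#1#1,55,31,26,26;#C#0#0#0#0"
)
```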
ImageFilesList
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | fileid | N/A | File ID (Serial number) | Integer; |
2 | id | loc_id | LOC Index | String; |
3 | cometfn | N/A | Filename in Comet | String; |
4 | locurl | loc_url | URL of the photograph in LOC website | String; |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | ImageFilesList Sample JSON Document |
---|
|
// Input CSV Data: 1#fsa1997018564#/oasis/projects/nsf/vlp101/sandeeps/complete_fsa_owi_data/186/fsa1997018564.PP_large.jpg#//hdl.loc.gov/loc.pnp/fsa.8a18634
{
"loc_id": "fsa1997018564",
"loc_url": "http://hdl.loc.gov/loc.pnp/fsa.8a18634"
}
|
ImageProperties
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | hole | Stryker Punch | Presence of Stryker hole | Boolean; |
3 | border | Border | Presence of border | Boolean; |
4 | meangray | mean_grayscale_value | Mean of grayscale values (not including hole and border) | Float; |
5 | stdgray | std_grayscale_value | Standard deviation of grayscale values (not including hole and border) | Float; |
...
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | ImageProperties Sample JSON Document |
---|
|
// Input CSV Data: fsa1997018564#True#True#139.37#42.41
{
"loc_id": "fsa1997018564",
"Stryker Punch": true,
"Border": true,
"mean_grayscale_value": 139.37,
"std_grayscale_value": 42.41
} |
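A minimal sketch of the type conversions for ImageProperties follows. It assumes the boolean columns are stored as the literal strings "True" and "False", as in the sample row; the function name is hypothetical.

```python
def image_properties_row_to_json(row):
    """Convert one '#'-delimited ImageProperties row into a metadata dict."""
    loc_id, hole, border, meangray, stdgray = row.rstrip("\n").split("#")
    return {
        "loc_id": loc_id,
        # Assumption: booleans are written as 'True' / 'False'.
        "Stryker Punch": hole.strip() == "True",
        "Border": border.strip() == "True",
        "mean_grayscale_value": float(meangray),
        "std_grayscale_value": float(stdgray),
    }


doc = image_properties_row_to_json("fsa1997018564#True#True#139.37#42.41")
```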
LocationInfo
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | lat | latitude | Latitude coordinate | Float; |
3 | long | longitude | Longitude coordinate | Float; |
4 | state | State | State abbreviation, or the state name if no abbreviation is found, or "United States" if no state is available. Other country names are kept as is. | String; |
5 | city | city | City name, when available | String; |
6 | county | county | County name, when available (this value is mostly unavailable) | String; |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | LocationInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1997018591#40.106639#-83.767142#OH#Urbana#nop
{
"loc_id": "fsa1997018591",
"latitude": 40.106639,
"longitude": -83.767142,
"State": "OH",
"city": "Urbana",
"county": null
} |
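In the sample above, the county value "nop" in the input becomes null in the JSON document. The sketch below applies that rule to all text columns and also treats empty strings as null; both choices are assumptions drawn from the single sample row, and the function names are hypothetical.

```python
def _text_or_none(value):
    """Map '' and the placeholder 'nop' to None (assumed convention)."""
    value = value.strip()
    return None if value in ("", "nop") else value


def _float_or_none(value):
    """Convert to float, or None when the value is missing."""
    value = _text_or_none(value)
    return None if value is None else float(value)


def location_row_to_json(row):
    """Convert one '#'-delimited LocationInfo row into a metadata dict."""
    loc_id, lat, lon, state, city, county = row.rstrip("\n").split("#")
    return {
        "loc_id": loc_id,
        "latitude": _float_or_none(lat),
        "longitude": _float_or_none(lon),
        "State": _text_or_none(state),
        "city": _text_or_none(city),
        "county": _text_or_none(county),
    }


doc = location_row_to_json("fsa1997018591#40.106639#-83.767142#OH#Urbana#nop")
```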
OCRInfo
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | ocr_pred | Text | Overall prediction of whether text is present in the image. If any one box is predicted to contain text, the final prediction is set to T. | String; nop means OCR found no text; F means possible text regions were found but were ruled out as text; |
3 | scores | ocr_scores | Prediction scores. A string of semicolon-separated sets of 3 numbers, one set per OCR text box found: a 0/1 classification value (no-text/text predicted), followed by 2 floats giving the classification scores for no-text and text. | String; |
4 | box_sum | num_text_predictions | Count of 1's found across the text box score sets | Integer; |
5 | box_cnt | num_text_boxes | Number of text boxes. Note that box_sum / box_cnt is another possible score instead of the T/F above. | Integer; box_sum is the count of '1's while box_cnt is the count of 'T's. |
6 | box_txt | ocr_results | Set of strings separated by semicolon. One string for each text box found in OCR process | String; |
7 | box_locs | ocr_text_boxes | A string that consists of sets of 4 numbers (separated by semicolon) one set for each text box, where the numbers are upper left x coordinate, upper left y coordinate, box width, box height. | String; |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | OCRInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1997000226#T#1,0.49,0.51;1,0.06,0.94#2#2#Ironnzny;GARAGE#703,264,71,60;713,277,55,14
{
"loc_id": "fsa1997000226",
"Text": "true",
"ocr_scores":
[
{
"prediction_bit": 1,
"prediction_score": 0.51
},
{
"prediction_bit": 1,
"prediction_score": 0.94
}
],
"num_text_predictions": 2,
"num_text_boxes": 2,
"ocr_results": ["Ironnzny", "GARAGE"],
"ocr_text_boxes":
[
{
"x": 703,
"y": 264,
"width": 71,
"height": 60
},
{
"x": 713,
"y": 277,
"width": 55,
"height": 14
}
]
} |
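The nested OCRInfo fields can be sketched as follows. Two points are assumptions: "prediction_score" is taken to be the score of the predicted class (the sample only shows boxes predicted as text, so the no-text case is a guess), and the row is assumed to contain no '#' inside the OCR text. The mapping of ocr_pred to "Text" is left out because the sample shows only the T case. Function names are hypothetical.

```python
def _split_sets(field):
    """Split a ';'-separated field, dropping empty pieces."""
    return [p.strip() for p in field.split(";") if p.strip()]


def parse_ocr_scores(field):
    """Parse 'bit,no_text,text;...' into prediction dicts."""
    scores = []
    for part in _split_sets(field):
        bit, no_text_score, text_score = part.split(",")
        bit = int(bit)
        # Assumption: keep the score of the predicted class.
        score = float(text_score) if bit == 1 else float(no_text_score)
        scores.append({"prediction_bit": bit, "prediction_score": score})
    return scores


def parse_boxes(field):
    """Parse 'x,y,w,h;x,y,w,h' into bounding-box dicts."""
    boxes = []
    for part in _split_sets(field):
        x, y, width, height = (int(v) for v in part.split(","))
        boxes.append({"x": x, "y": y, "width": width, "height": height})
    return boxes


def ocr_row_to_json(row):
    """Convert one '#'-delimited OCRInfo row (without the 'Text' field)."""
    loc_id, _ocr_pred, scores, box_sum, box_cnt, box_txt, box_locs = (
        row.rstrip("\n").split("#")
    )
    return {
        "loc_id": loc_id,
        "ocr_scores": parse_ocr_scores(scores),
        "num_text_predictions": int(box_sum),
        "num_text_boxes": int(box_cnt),
        "ocr_results": _split_sets(box_txt),
        "ocr_text_boxes": parse_boxes(box_locs),
    }


doc = ocr_row_to_json(
    "fsa1997000226#T#1,0.49,0.51;1,0.06,0.94#2#2"
    "#Ironnzny;GARAGE#703,264,71,60;713,277,55,14"
)
```

Since box_sum is described as the count of 1's across the score sets, comparing it with the sum of the parsed prediction bits is a cheap consistency check on each row.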
SubjectInfo
Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks |
---|
1 | id | loc_id | LOC Index | String; |
2 | subject | subject | Subject information | String; |
Code Block |
---|
language | js |
---|
theme | Eclipse |
---|
title | SubjectInfo Sample JSON Document |
---|
|
// Input CSV Data: fsa1998017950#Farms, rural scenes--Vermont
{
"loc_id": "fsa1998017950",
"subject": "Farms, rural scenes--Vermont"
} |
PyClowder2
Write a short note about PyClowder2, the latest version of the Python library for writing Clowder extractors.
...