This document is under construction. It will describe how to integrate already generated metadata with Clowder.

Background

At the time of writing, 171,021 images from the Library of Congress Farm Security Administration/Office of War Information Photograph Collection have been processed to extract features such as faces, eyes, facial profiles, closeups, printed text, presence of a Stryker hole, presence of a border, mean and standard deviation of grayscale values, subject details, photographer details, and category details. These features were computed on XSEDE Comet using stripped-down versions of Clowder Extractors or, in certain cases, new standalone programs. Integrating this information with Clowder makes it possible to use Clowder features such as the RESTful API, authentication and authorization, and the available visualizations.

Database Table Descriptions and Converted Sample JSON Documents

The following tables describe the database tables that hold the extracted metadata. Each table description is followed by an example JSON document that the Extractor Integration Script will generate for that table.

CategoryInfo

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | category | category | LOC Category number (other_number field in the image JSON document) | String

CategoryInfo Sample JSON Document
// Input CSV Data: fsa1997018591#F 665
{
	"loc_id": "fsa1997018591",
	"category": "F 665"
}

CreatorInfo

NOTE: Some creators are empty strings, so this table might need some refinement.

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | name | Creator | Creator name, loosely in the format: <last name>, <first name>, <birth year> - <death year>. Any or all of these parts may be missing. If the creator name is blank, the value is NULL. | String
3 | year_mon | Year | Year in which the photograph was taken in the format (null if not available). | String. Some year values look like '[between 1940 and 1946]', so the format mentioned in the Field Description column may not be strictly followed. This needs a detailed look when doing the transformation.

CreatorInfo Sample JSON Document
// Input CSV Data: fsa1997018591#Shahn, Ben, 1898-1969#1938 Aug.
{
	"loc_id": "fsa1997018591",
	"Creator": "Shahn, Ben, 1898-1969",
	"Year": "1938"
}
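
The CreatorInfo conversion could be sketched in Python as below. This is only an illustration, not the Extractor Integration Script itself: the function name `creator_row_to_json` is hypothetical, and the rule for irregular year values (keep them verbatim until the transformation is decided) is an assumption made here.

```python
import re


def creator_row_to_json(row):
    """Convert one '#'-delimited CreatorInfo row into a metadata dict."""
    loc_id, name, year_mon = row.split("#", 2)

    # Blank creator names become null in the JSON document.
    creator = name.strip() or None

    raw_year = year_mon.strip()
    # Assumption: values that start with four digits (e.g. '1938 Aug.')
    # yield that four-digit year as a string.
    match = re.match(r"(\d{4})\b", raw_year)
    if match:
        year = match.group(1)
    else:
        # Assumption: irregular values such as '[between 1940 and 1946]'
        # are kept verbatim; an empty value becomes null.
        year = raw_year or None

    return {"loc_id": loc_id, "Creator": creator, "Year": year}
```

Calling `creator_row_to_json("fsa1997018591#Shahn, Ben, 1898-1969#1938 Aug.")` yields a dict with the same three fields as the sample document above.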

FacesInfo

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | imght | image_height | Image height | Float
3 | imgwid | image_width | Image width | Float
4 | dumb1 | N/A | The letter F, it's only there to help browse raw data | String
5 | num_faces | Number of People | Number of faces found | Integer
6 | face_segs | face_segments | Bounding box location of faces | String; this is a text string that has the ith face, x, y, width, height of face segment. Each face segment is separated by a semicolon.
7 | dumb2 | N/A | The letter P, it's only there to help browse raw data | String
8 | num_profiles | num_profiles | Number of profiles found | Integer
9 | prof_segs | profile_segments | Bounding box location of profiles | String
10 | dumb3 | N/A | The letter Y, it's only there to help browse raw data | String
11 | num_eyes | num_eyes | Number of eyes found | Integer
12 | eye_segs | eye_segments | Bounding box location of eyes | String
13 | dumb4 | N/A | The letter C, it's only there to help browse raw data | String
14 | num_fullcls | num_full_closeups | Number of face full closeups | Integer; 'FULL' is relative to image size
15 | num_midcls | num_mid_closeups | Number of face mid closeups | Integer; 'MID' is relative to image size
16 | num_fullprof | num_full_profiles | Number of profile full closeups | Integer; 'FULL' is relative to image size
17 | num_midprof | num_mid_profiles | Number of profile mid closeups | Integer; 'MID' is relative to image size

FacesInfo Sample JSON Document
//Input CSV Data: fsa1997018503#723#1024#F#1#1,185,210,113,113;#P#0#-1#Y#1#1,55,31,26,26;#C#0#0#0#0
{
	"loc_id": "fsa1997018503",
	"image_height": 723.0,
	"image_width": 1024.0,
	"Number of People": 1,
	"face_segments": 
		[
			{
				"x": 185,
				"y": 210,
				"width": 113,
				"height": 113
			}
		],
	"num_profiles": 0,
	"profile_segments": [],
	"num_eyes": 1,
	"eye_segments": 
		[
			{
				"x": 55,
				"y": 31,
				"width": 26,
				"height": 26
			}
		],
	"num_full_closeups": 0,
	"num_mid_closeups": 0,
	"num_full_profiles": 0,
	"num_mid_profiles": 0
}
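
A minimal Python sketch of the FacesInfo conversion follows. The function names `parse_segments` and `faces_row_to_json` are hypothetical, and two things are assumptions drawn only from the sample row: segment values are integers, and a segment field of -1 means that nothing was found.

```python
def parse_segments(field):
    """Parse 'i,x,y,w,h;i,x,y,w,h;' into a list of bounding-box dicts."""
    boxes = []
    for chunk in field.split(";"):
        parts = chunk.strip().split(",")
        if len(parts) != 5:
            # Skips the empty chunk after a trailing ';' and the '-1'
            # placeholder (assumed to mean "no segments found").
            continue
        _index, x, y, width, height = (int(p) for p in parts)
        boxes.append({"x": x, "y": y, "width": width, "height": height})
    return boxes


def faces_row_to_json(row):
    """Convert one '#'-delimited FacesInfo row into a metadata dict."""
    cols = row.split("#")
    # cols[3], cols[6], cols[9] and cols[12] hold the marker letters
    # F, P, Y and C; they have no JSON field and are dropped.
    return {
        "loc_id": cols[0],
        "image_height": float(cols[1]),
        "image_width": float(cols[2]),
        "Number of People": int(cols[4]),
        "face_segments": parse_segments(cols[5]),
        "num_profiles": int(cols[7]),
        "profile_segments": parse_segments(cols[8]),
        "num_eyes": int(cols[10]),
        "eye_segments": parse_segments(cols[11]),
        "num_full_closeups": int(cols[13]),
        "num_mid_closeups": int(cols[14]),
        "num_full_profiles": int(cols[15]),
        "num_mid_profiles": int(cols[16]),
    }
```

The face index at the start of each segment is discarded because the sample document keeps only x, y, width, and height; the position in the resulting list preserves the order.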

ImageFilesList

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | fileid | N/A | File ID (Serial number) | Integer
2 | id | loc_id | LOC Index | String
3 | cometfn | N/A | Filename in Comet | String
4 | locurl | loc_url | URL of the photograph in LOC website | String

ImageFilesList Sample JSON Document
// Input CSV Data: 1#fsa1997018564#/oasis/projects/nsf/vlp101/sandeeps/complete_fsa_owi_data/186/fsa1997018564.PP_large.jpg#//hdl.loc.gov/loc.pnp/fsa.8a18634
{
	"loc_id": "fsa1997018564",
	"loc_url": "http://hdl.loc.gov/loc.pnp/fsa.8a18634"
}
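
The ImageFilesList conversion drops the two columns that have no JSON field and turns the scheme-relative URL into an absolute one. A possible Python sketch is shown below; the function name `image_file_row_to_json` and its `scheme` parameter are hypothetical, and defaulting to `http` is an assumption based only on the sample document.

```python
def image_file_row_to_json(row, scheme="http"):
    """Convert one '#'-delimited ImageFilesList row into a metadata dict."""
    # fileid and cometfn have no JSON field (N/A), so they are discarded.
    _file_id, loc_id, _comet_fn, loc_url = row.split("#", 3)

    url = loc_url.strip()
    if url.startswith("//"):
        # Scheme-relative URL: prepend the scheme (assumed to be http).
        url = f"{scheme}:{url}"

    return {"loc_id": loc_id, "loc_url": url}
```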
 

ImageProperties

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | hole | Stryker Punch | Presence of Stryker hole | Boolean
3 | border | Border | Presence of border | Boolean
4 | meangray | mean_grayscale_value | Mean of grayscale values (not including hole and border) | Float
5 | stdgray | std_grayscale_value | Standard deviation of grayscale values (not including hole and border) | Float

ImageProperties Sample JSON Document
//  Input CSV Data: fsa1997018564#True#True#139.37#42.41
{
	"loc_id": "fsa1997018564",
	"Stryker Punch": true,
	"Border": true,
	"mean_grayscale_value": 139.37,
	"std_grayscale_value": 42.41
}
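
For ImageProperties, the main work is type conversion: the text values True/False become JSON booleans and the grayscale statistics become floats. A hedged Python sketch follows; `parse_bool` and `image_properties_row_to_json` are hypothetical names, and rejecting any value other than True/False is a choice made here, not documented behavior.

```python
def parse_bool(text):
    """Map the strings 'True'/'False' (any letter case) to booleans."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    # Assumption: anything else is a data error worth surfacing.
    raise ValueError(f"not a boolean value: {text!r}")


def image_properties_row_to_json(row):
    """Convert one '#'-delimited ImageProperties row into a metadata dict."""
    loc_id, hole, border, meangray, stdgray = row.split("#")
    return {
        "loc_id": loc_id,
        "Stryker Punch": parse_bool(hole),
        "Border": parse_bool(border),
        "mean_grayscale_value": float(meangray),
        "std_grayscale_value": float(stdgray),
    }
```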

LocationInfo

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | lat | latitude | Latitude coordinate | Float
3 | long | longitude | Longitude coordinate | Float
4 | state | State | State abbreviation, or state name if no abbreviation is found, or "United States" if neither is found. Other country names are kept as they are. | String
5 | city | city | City name, when available | String
6 | county | county | County name, when available (this value is mostly unavailable) | String

LocationInfo Sample JSON Document
//  Input CSV Data: fsa1997018591#40.106639#-83.767142#OH#Urbana#nop
{
	"loc_id": "fsa1997018591",
	"latitude": 40.106639,
	"longitude": -83.767142,
	"State": "OH",
	"city": "Urbana",
	"county": null
}
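
In the sample above, the county value nop in the input becomes null in the JSON document. A Python sketch of that handling is given below; the names `optional` and `location_row_to_json` are hypothetical, and treating both nop and an empty string as "not available" in every column is an assumption.

```python
def optional(text, convert=str):
    """Return None for missing values, otherwise the converted value."""
    value = text.strip()
    # Assumption: 'nop' and the empty string both mean "not available".
    if value == "" or value == "nop":
        return None
    return convert(value)


def location_row_to_json(row):
    """Convert one '#'-delimited LocationInfo row into a metadata dict."""
    loc_id, lat, lon, state, city, county = row.split("#")
    return {
        "loc_id": loc_id,
        "latitude": optional(lat, float),
        "longitude": optional(lon, float),
        "State": optional(state),
        "city": optional(city),
        "county": optional(county),
    }
```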

OCRInfo

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | ocr_pred | Text | Overall prediction of whether or not text is present in the image. 'nop' means OCR found nothing. If any one box predicted text, the final prediction is set to T. | String; nop means no text was found; F means possible text regions were found but were then ruled out as not being text.
3 | scores | ocr_scores | Prediction scores. A string that consists of sets of 3 numbers (sets separated by semicolons), one set for each OCR text box found: a 0/1 classification value indicating no-text/text predicted, followed by 2 floats indicating the classification scores for no-text/text. | String
4 | box_sum | num_text_predictions | A count of the number of 1's found across the text box score sets | Integer
5 | box_cnt | num_text_boxes | Number of text boxes. Note that box_sum / box_cnt is another possible score instead of the T/F above. | Integer; box_sum is the count of '1's while box_cnt is the count of 'T's.
6 | box_txt | ocr_results | Set of strings separated by semicolons, one string for each text box found in the OCR process | String
7 | box_locs | ocr_text_boxes | A string that consists of sets of 4 numbers (sets separated by semicolons), one set for each text box, where the numbers are upper left x coordinate, upper left y coordinate, box width, box height. | String

OCRInfo Sample JSON Document
//  Input CSV Data: fsa1997000226#T#1,0.49,0.51;1,0.06,0.94#2#2#Ironnzny;GARAGE#703,264,71,60;713,277,55,14
{
	"loc_id": "fsa1997000226",
	"Text": "true",
	"ocr_scores": 
	[
		{
			"prediction_bit": 1,
			"prediction_score": 0.51
		},
		{
			"prediction_bit": 1,
			"prediction_score": 0.94
		}
	],
	"num_text_predictions": 2,
	"num_text_boxes": 2,
	"ocr_results": ["Ironnzny", "GARAGE"],
	"ocr_text_boxes": 
	[
		{
			"x": 703,
			"y": 264,
			"width": 71,
			"height": 60
		},
		{
			"x": 713,
			"y": 277,
			"width": 55,
			"height": 14
		}
	]
}
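
The OCRInfo row has two structured fields, so a Python sketch of the conversion may help. All function names here are hypothetical. Several points are assumptions inferred only from the sample document: the kept prediction_score is the second float of each set (the text score), T and F map to the strings "true" and "false", any other prediction value such as nop is passed through unchanged, and the OCR text itself contains no '#' or ';' characters.

```python
def parse_scores(field):
    """Parse 'bit,no_text,text;...' into a list of score dicts."""
    scores = []
    for chunk in field.split(";"):
        parts = chunk.strip().split(",")
        if len(parts) != 3:
            continue  # skip empty or malformed sets
        bit, _no_text_score, text_score = parts
        # Assumption: only the text score is kept, as in the sample.
        scores.append({"prediction_bit": int(bit),
                       "prediction_score": float(text_score)})
    return scores


def parse_boxes(field):
    """Parse 'x,y,w,h;x,y,w,h' into a list of bounding-box dicts."""
    boxes = []
    for chunk in field.split(";"):
        parts = chunk.strip().split(",")
        if len(parts) != 4:
            continue  # skip empty or malformed sets
        x, y, width, height = (int(p) for p in parts)
        boxes.append({"x": x, "y": y, "width": width, "height": height})
    return boxes


def ocr_row_to_json(row):
    """Convert one '#'-delimited OCRInfo row into a metadata dict."""
    loc_id, pred, scores, box_sum, box_cnt, box_txt, box_locs = row.split("#")
    # Assumption: T/F become "true"/"false"; other values stay as they are.
    text = {"T": "true", "F": "false"}.get(pred.strip(), pred.strip())
    return {
        "loc_id": loc_id,
        "Text": text,
        "ocr_scores": parse_scores(scores),
        "num_text_predictions": int(box_sum),
        "num_text_boxes": int(box_cnt),
        "ocr_results": [s for s in box_txt.split(";") if s],
        "ocr_text_boxes": parse_boxes(box_locs),
    }
```

If OCR output can contain '#' or ';', the plain `split` calls above would mis-split the row, and the export format would need an escaping rule first.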

SubjectInfo

Sl. No. | Database Column Name | JSON Metadata Field Name | Field Description | Remarks
1 | id | loc_id | LOC Index | String
2 | subject | subject | Subject information | String

SubjectInfo Sample JSON Document
// Input CSV Data: fsa1998017950#Farms, rural scenes--Vermont
{
	"loc_id": "fsa1998017950",
	"subject": "Farms, rural scenes--Vermont"
}

PyClowder2

TODO: Write a short note about PyClowder2, the latest version of the Python library for writing Clowder extractors.

 

 
