Feature #1848
API: fetch data from ARCHE (closed)
Description
For the INDIGO project, we will import data from ARCHE to OpenAtlas.
Update
Base functionality is implemented. It is possible to fetch metadata from ARCHE and import it to OpenAtlas. The feature is still experimental and will be further expanded (e.g. Creation event to track photographers) and generalized in the near future.
CIDOC mapping (in progress)
More information available at ARCHE import
INDIGO test collection (ARCHE)
A test collection with data provided by Geert Verhoeven and Benjamin Wild was imported into the staging instance of ARCHE, hosted on Minerva. The test collection has identifier https://id.acdh.oeaw.ac.at/indigo_test, which automatically resolves to the page with the details of the collection on ARCHE staging (i.e., https://arche-curation.acdh-dev.oeaw.ac.at/browser/oeaw_detail/1390136).
Collection arrangement
The main collection INDIGO Test Collection is what ARCHE calls a Top Collection, i.e. the main folder that contains all the data related to the collection: https://id.acdh.oeaw.ac.at/indigo_test.
This contains two Collections, i.e. two "sub-folders", which correspond to the two batches of data sent by Benjamin and Geert:
- Large Ortophotos (https://id.acdh.oeaw.ac.at/indigo_test/large_orthos): four large TIFF files sent by Benjamin Wild, with sizes ranging from 270 MB to 4 GB.
- Test Photos (https://id.acdh.oeaw.ac.at/indigo_test/test_photos): eight test photos sent by Geert Verhoeven, including two color checkers, in different formats and with accompanying metadata files.
Each file contained in these Collections is called a Resource in the ARCHE ontology.
Test Photos
I would suggest starting with the Test Photos collection, since it is currently the most complete, with different formats and metadata.
More precisely, each picture was provided by Geert in both JPG and NEF format (Nikon proprietary RAW format) and is accompanied by an XMP sidecar file containing metadata about the picture. More information about the different metadata formats can be found in Geert's info document.
In addition, each picture was processed by means of ExifTool. All the metadata contained in the JPG file, NEF file, and XMP file were combined into one single JSON file, where each line contains a specific property with a tag identifying its metadata schema. For example: "IPTC:Sub-location": "Donaukanal". These metadata files are identified by the suffix _metadata. When specific metadata properties coming from different files (JPG, NEF, XMP) did not have the same value in each of the sources, they were moved to a different metadata file, identified by the suffix _not_unique_values.
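To illustrate the structure of the _metadata files described above, here is a minimal sketch that groups the properties of such a JSON object by their schema prefix. It assumes each file holds a flat JSON object with keys of the form Schema:Property. The function name group_by_schema and the second sample entry are made up for illustration and are not part of ARCHE or OpenAtlas.

```python
import json
from collections import defaultdict


def group_by_schema(metadata: dict) -> dict:
    """Group flat 'Schema:Property' keys by their schema prefix."""
    grouped = defaultdict(dict)
    for key, value in metadata.items():
        # Split at the last colon; keys without a colon end up under ''
        schema, _, prop = key.rpartition(':')
        grouped[schema][prop] = value
    return dict(grouped)


# The first entry is taken from the description above,
# the second one is a made-up placeholder.
sample = json.loads(
    '{"IPTC:Sub-location": "Donaukanal", "EXIF:Placeholder": "value"}')
grouped = group_by_schema(sample)
print(grouped['IPTC']['Sub-location'])
```

Grouping by prefix could make it easier to decide which schema to trust when the same property appears in more than one of them.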
Each of these metadata files is of class Metadata in the ARCHE ontology and is linked to the original file through the property acdh:isMetadataFor. You can see the relationship in the GUI too, by viewing the Details page of a metadata file (e.g., https://arche-curation.acdh-dev.oeaw.ac.at/browser/oeaw_detail/1390181; see attached GUI_isMetadataFor.png).
Otherwise, if you view the Details of the original file (e.g., https://arche-curation.acdh-dev.oeaw.ac.at/browser/oeaw_detail/1390166), you can find the info by switching to the Expert-View (which is in general very useful for viewing more metadata about a resource; see attached GUI_ExpertView.png) and then scrolling to the Inverse Data section (see attached GUI_InverseData.png).
Therefore, given INDIGO_2022-07-22_Z7II-A_0007 as the name of one picture, in the Test Photos collection you can find five different resources about this picture:
- INDIGO_2022-07-22_Z7II-A_0007.jpg
- INDIGO_2022-07-22_Z7II-A_0007.nef
- INDIGO_2022-07-22_Z7II-A_0007.xmp
- INDIGO_2022-07-22_Z7II-A_0007_metadata.json
- INDIGO_2022-07-22_Z7II-A_0007_not_unique_values.json
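The naming convention above can be sketched in a few lines of Python. This is only an illustration of the pattern: the function name related_resources is hypothetical, and the suffix list is simply copied from the five resources listed above.

```python
# Suffixes of the five resources that belong to one picture
RESOURCE_SUFFIXES = [
    '.jpg',
    '.nef',
    '.xmp',
    '_metadata.json',
    '_not_unique_values.json']


def related_resources(base_name: str) -> list[str]:
    """Return the expected resource names for one picture."""
    return [f'{base_name}{suffix}' for suffix in RESOURCE_SUFFIXES]


for name in related_resources('INDIGO_2022-07-22_Z7II-A_0007'):
    print(name)
```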
Updated by Alexander Watzinger about 2 years ago
- Assignee changed from Bernhard Koschiček-Krombholz to Massimiliano Carloni
After a test ARCHE collection is provided please reassign to Bernhard.
Updated by Bernhard Koschiček-Krombholz about 2 years ago
- Example for thumbnail: https://arche-thumbnails.acdh.oeaw.ac.at/?id=https%3A%2F%2Farche.acdh.oeaw.ac.at%2Fapi%2F4812&width=500&height=500
- https://app.swaggerhub.com/apis/zozlak/arche/3.5#/default/get__resourceId__metadata
- GET {resourceId}/metadata/ with format application/ld+json will output JSON-LD
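As a rough sketch of the two calls noted above, the following Python snippet builds a thumbnail URL and a metadata request using only the standard library. The function names are hypothetical, and it is an assumption here that the JSON-LD format is requested through the Accept header; the snippet only constructs the URL and the request and does not contact ARCHE.

```python
from urllib.parse import urlencode
from urllib.request import Request

THUMBNAIL_SERVICE = 'https://arche-thumbnails.acdh.oeaw.ac.at/'


def thumbnail_url(
        resource_url: str,
        width: int = 500,
        height: int = 500) -> str:
    """Build a thumbnail URL like the example above."""
    # urlencode percent-encodes the resource URL for use as a query value
    query = urlencode({'id': resource_url, 'width': width, 'height': height})
    return f'{THUMBNAIL_SERVICE}?{query}'


def metadata_request(resource_url: str) -> Request:
    """Prepare a GET request for the metadata of one resource."""
    # Assumption: the output format is selected via the Accept header
    return Request(
        f"{resource_url.rstrip('/')}/metadata",
        headers={'Accept': 'application/ld+json'})


resource = 'https://arche.acdh.oeaw.ac.at/api/4812'
print(thumbnail_url(resource))
request = metadata_request(resource)
print(request.full_url)
```

Passing the prepared request to urllib.request.urlopen and reading the response with json.load would then perform the actual call.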
Updated by Alexander Watzinger about 2 years ago
- Target version set to Wishlist
Because there was no target version set, I put this on the wishlist for now.
Updated by Alexander Watzinger almost 2 years ago
Just a few notes for our OpenAtlas/ARCHE meeting tomorrow. The implementation would be:
- Fetch ARCHE IDs, compare them with the already imported ones and only get the new ones
- Fetch the thumbnail and a metadata file
- Create new entries with metadata and thumbnail in OpenAtlas
- Push a new, different metadata file back to ARCHE after the INDIGO team has finished working on it in OpenAtlas
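The first step of the list above, getting only the new IDs, could look roughly like this. It is a sketch, not the OpenAtlas implementation: the function name is hypothetical and the IDs in the usage example are made-up placeholders.

```python
from typing import Iterable


def new_arche_ids(
        fetched_ids: Iterable[int],
        imported_ids: Iterable[int]) -> list[int]:
    """Return fetched IDs that were not imported yet, keeping their order."""
    imported = set(imported_ids)  # set lookup keeps the comparison fast
    return [id_ for id_ in fetched_ids if id_ not in imported]


# Placeholder IDs, for illustration only
print(new_arche_ids([101, 102, 103], [102]))
```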
Updated by Alexander Watzinger almost 2 years ago
- Status changed from Assigned to In Progress
- Assignee changed from Massimiliano Carloni to Bernhard Koschiček-Krombholz
- Target version changed from Wishlist to 7.9.0
After looking at the test data in ARCHE, thanks a lot Massimiliano, we will continue building our API.
New ideas:
- also check for identifiers that may not exist in ARCHE
- some values should perhaps be taken directly from the ARCHE metadata
Updated by Massimiliano Carloni almost 2 years ago
- File GUI_InverseData.png added
- File GUI_ExpertView.png added
- File GUI_isMetadataFor.png added
- Description updated (diff)
Updated by Alexander Watzinger almost 2 years ago
- Description updated (diff)
Fixed size of way too large images in description
Updated by Massimiliano Carloni almost 2 years ago
Alexander Watzinger wrote:
Fixed size of way too large images in description
Thanks Alex! I wanted to ask how to do this..
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
For development purposes only, add this to production.py:
ARCHE_ID = 1390136
ARCHE_COLLECTION_IDS = [1390141]
ARCHE_BASE_URL = 'https://arche-curation.acdh-dev.oeaw.ac.at/'
Updated by Alexander Watzinger almost 2 years ago
I just tested it and it worked great. Thank you Bernhard and Massimiliano, this was a very important first step.
Next step would be to add a thumbnail image as a file entity in OpenAtlas (if it isn't already in the system) and add information provided.
I'm kind of busy at the moment with other stuff so feel free Bernhard to continue if you like, or just assign it to me and I will take care of it later.
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
Thank you, Alex, I will continue and assign the ticket to you, when I'm away.
One question is how we keep track of whether the images already exist. Should we use origin_id from import to add the original ARCHE image ID?
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
In the feature_arche_import branch a rudimentary import is available.
If you go to Admin→Data→ARCHE, there is a rough overview of the ARCHE IDs. If you click on Fetch data, a list of all data in the collection is presented. In the import data tab, you can import the data. If the filenames already exist, it will throw a warning and list which entities are already in the database. We definitely have to refine the duplicate check and think about how/where to store the original ARCHE IDs.
When data is imported, a new file entity will be created with the complete filename (e.g. INDIGO_2022-07-22_Z7II-A_1021.jpg) and the related file (max width=1200) will be stored in upload and renamed, so it will be displayed in OpenAtlas. Additionally, an artefact entity is created, named without the file extension (e.g. INDIGO_2022-07-22_Z7II-A_1021), with the geographic locations, creation date and also linked to the file entity.
One issue I noticed is the image rotation. Some images are upside down and some are rotated sideways.
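The filename-based duplicate warning and the artefact naming described above can be sketched as follows. This is not the code from the feature_arche_import branch; both function names are hypothetical and the lists stand in for what would come from ARCHE and from the database.

```python
from pathlib import Path


def artifact_name(filename: str) -> str:
    """Name of the artefact entity: the filename without its extension."""
    return Path(filename).stem


def split_duplicates(
        incoming: list[str],
        existing: list[str]) -> tuple[list[str], list[str]]:
    """Split incoming filenames into new ones and already imported ones."""
    known = set(existing)
    new = [name for name in incoming if name not in known]
    duplicates = [name for name in incoming if name in known]
    return new, duplicates


incoming = [
    'INDIGO_2022-07-22_Z7II-A_0007.jpg',
    'INDIGO_2022-07-22_Z7II-A_1021.jpg']
new, duplicates = split_duplicates(
    incoming, ['INDIGO_2022-07-22_Z7II-A_0007.jpg'])
print(artifact_name(new[0]))
print(duplicates)
```

Comparing by filename is fragile if files are renamed, which is why storing the original ARCHE IDs is discussed as the next refinement.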
Updated by Alexander Watzinger almost 2 years ago
Thanks Bernhard for the good progress. Next step would be to connect the artifact entities to an external reference system. One way to do it would be to look for a reference system called ARCHE and show a notice if none with this name exists, before allowing these functions to be used. Once this is done, we can use the ARCHE identifiers to check what is already imported and avoid duplicates.
After that, we can see what other information we can extract and map in OpenAtlas.
About the image rotation issues, in a similar project I'm using the code below which fixes rotation issues of photographs (exiftran package needed for that), not sure if this is helpful for these issues too but worth a try:
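from subprocess import call

# ext, path and app come from the surrounding upload code
if f'.{ext}' in app.config['IMAGE_EXTENSIONS']:
    # Fix rotation in place; the list form avoids shell quoting issues
    call(['exiftran', '-ai', str(path)])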
But depending on the specifics, maybe this should be done already in ARCHE before? But it can't hurt to have these correction functions in OpenAtlas as well.
Anyway, thanks again for the good progress and as said before, you can assign this issue to me before leaving us for your true calling :)
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Assignee changed from Bernhard Koschiček-Krombholz to Alexander Watzinger
I heed the Christmas call and am assigning this issue to Alex.
Artifacts are connected to an external reference system, if there is a system called 'ARCHE' with a default precision. Next step is to check for duplications.
Updated by Alexander Watzinger almost 2 years ago
- Target version changed from 7.9.0 to 7.10.0
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Assignee changed from Alexander Watzinger to Bernhard Koschiček-Krombholz
Updated by Alexander Watzinger almost 2 years ago
- Description updated (diff)
Status report:
Already implemented:
- Connect to ARCHE API
- Check for new files
- Fetch metadata and a thumbnail
- Insert an artifact with the following details:
  - Name
  - Geolocation
  - Creation date (with time)
  - Reference system link back to ARCHE
  - Thumbnail (file entity linked to the artifact)
- Creator
Updated by Alexander Watzinger almost 2 years ago
- as persons (E21) - if no person with the provided name exists it will be created
- with a type (E55), so we need a type for persons to differentiate between graffito artists and photographers, maybe something like "project reference"
- as creator at a production event (E12) of the image (with the photo timestamp as begin, same location as graffito), we should also add a type for the production event, e.g. "photograph"
As soon as this is implemented, we will deploy it on the INDIGO instance so it can be demonstrated/tested.
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
Added license type and creator type to files
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
Alexander Watzinger wrote:
- as persons (E21) - if no person with the provided name exists it will be created
What if two or more persons with the same name exist? Should we just take the first occurrence?
Updated by Alexander Watzinger almost 2 years ago
I was thinking of that too, in that case I would say it's ok to just take the first one.
Persons can only exist if they were either created automatically (because they did not exist when importing data) or created manually. I think it is unlikely that a duplicate person name is created manually, but the responsibility not to create duplicates lies with the people entering the data.
Even if this happens, there is not much we can do about it within the automated script. In case we later notice that this happened we can try to merge these persons (manually) and updating links from one to another is relatively trivial.
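The agreed behaviour (reuse the first person with a matching name, otherwise create one) can be sketched like this. The Person class, the function name and the sample name are hypothetical stand-ins and do not reflect the OpenAtlas data model.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator


@dataclass
class Person:
    id: int
    name: str


def get_or_create_person(
        name: str,
        persons: list[Person],
        next_id: Iterator[int]) -> Person:
    """Return the first person with this name, creating one if needed."""
    for person in persons:
        if person.name == name:
            return person  # With duplicates, the first occurrence wins
    person = Person(next(next_id), name)
    persons.append(person)
    return person


persons: list[Person] = []
ids = count(1)
first = get_or_create_person('Example Photographer', persons, ids)
second = get_or_create_person('Example Photographer', persons, ids)
print(first is second, len(persons))
```

Since the match is by name only, duplicates that were entered manually would still need to be merged by hand, as noted above.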
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
Ok, it is done. Let's have a look over it together.
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- File ARCHE_import_OpenAtlas.jpg added
- Description updated (diff)
Added first draft of CIDOC mapping.
Currently, there is a firewall issue to ARCHE. As soon as this is solved, we will provide an online test version for the import.
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Related to Feature #1934: New creation event class added
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Description updated (diff)
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- File ARCHE_import_OpenAtlas.jpg added
- File deleted (ARCHE_import_OpenAtlas.jpg)
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Status changed from In Progress to Resolved
Changes are live on the INDIGO OpenAtlas instance and test data was imported.
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- File deleted (ARCHE_import_OpenAtlas.jpg)
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- File ARCHE_import_OpenAtlas.jpg added
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- File deleted (ARCHE_import_OpenAtlas.jpg)
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Related to Feature #1943: Auto rotate image added
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Related to Feature #1944: Manual: ARCHE import added
Updated by Bernhard Koschiček-Krombholz almost 2 years ago
- Description updated (diff)
- Status changed from Resolved to Closed