Backup File format: JSON compliant + Only includes set attributes + One line per record #75

Closed
porg opened this issue Nov 10, 2022 · 7 comments
@porg

porg commented Nov 10, 2022

.osxmeta JSON v0.99.37

.osxmeta JSON v1.0.0 — This update brought 1 improvement but 2 disadvantages

  • ✅ Compliant JSON — Properly read/rendered in all apps
  • ❌ Each record contains ALL 188 attributes; usually only a handful are set and the vast majority are unset ("null")
  • ❌ Every attribute is put on its own line, and the tag attribute is saved as an array with each item on its own row.

Which together cause these downsides:

  • Increases file size tremendously (bad in LAN use cases)
  • Terrible human readability:
    • Hundreds of rows with just "null" attributes
    • Cannot compare items to each other line by line, as each item takes ca. 180-250 lines

Proposed next JSON format

  • ✅ Compliant JSON
  • ✅ Only contains attributes whose value is set
  • ✅ One record takes only one row; arrays also fit in there without any line breaks
@RhetTbull
Owner

JSON is a format for interoperability with computers, not humans. There are many tools such as jq (brew install jq) and json_pp for working with JSON files. I'm using the json library from the Python standard library to output the JSON using standard JSON indentation rules. The backup file is currently output with an indentation of 2 spaces to make it at least human inspectable (but the goal is machine readability, not human). There are already tools such as jd (brew install jd) for doing a structural diff of JSON. I do not plan to change the format. Similar to the discussion on #73, I believe that command line tools should make use of other specialized command line tools rather than implement all functionality in a single tool.

The suggestion to back up only attributes that are non-null is worth considering. Some file formats such as TOML do this by default. I'll take a look at adding a filter for null values before outputting the JSON, but I first need to ensure that doing so has no second-order effects on the restore process.
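A minimal sketch of such a null filter, assuming each backup record is a flat dict mapping attribute names to values. The helper name `strip_nulls` and the sample record are hypothetical and are not taken from osxmetadata's code:

```python
import json


def strip_nulls(record: dict) -> dict:
    """Return a copy of record without keys whose value is None (JSON null)."""
    return {key: value for key, value in record.items() if value is not None}


# Hypothetical record; real backup records carry many more attributes.
record = {
    "_filepath": "/tmp/example.mp4",
    "kMDItemFinderComment": "",
    "kMDItemAuthors": None,
}

# Empty strings are kept: only None is treated as "unset" in this sketch.
print(json.dumps(strip_nulls(record), indent=2))
```

With a filter like this, a restore step would presumably have to treat a missing key the same way as an explicit null, which is the kind of second-order effect worth checking.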

@RhetTbull RhetTbull added the enhancement New feature or request label Nov 10, 2022
RhetTbull added a commit that referenced this issue Nov 11, 2022
@RhetTbull
Owner

Version v1.1.0 strips null values from the backup JSON file. This fixes the issue of lots of null fields and improves readability somewhat. As stated above, I don't intend to fix the other formatting issues as I think there are other ways to address this (using 3rd party tools) and I'm content to use the python standard JSON library.

@porg
Author

porg commented Nov 11, 2022

  • At least the size optimization was achieved then.
  • On the issue of formatting as human readable as possible if it comes at almost no extra development/maintenance/performance cost, I don't agree, but I respect.
    • Of course there are tools for many things (thanks for your hints on jd and jq) but nothing beats having good native readability.
    • A quick check of the backup file by any average user, with no reliance on extra software, is valuable!
      • It assures a user: "Did the backup work? Were all files included by my shell pattern?"
      • Especially for users who don't otherwise deal with specialized software.
    • And especially if the development/maintenance/performance cost is negligible, because I suspect most serializing frameworks have formatting/beautifying options, so it's just a matter of setting a few flags.
      • But if the Python standard JSON library doesn't, and it's not worth going the extra mile, then I very much respect that of course, you are the expert!

@RhetTbull
Owner

On the issue of formatting as human readable as possible if it comes at almost no extra development/maintenance/performance cost, I don't agree, but I respect.

From your original issue, I think what you want is every record on a single line but this is not human readable. The only alternative is to use JSON indentation, which osxmetadata now does, but that also means that arrays are broken across lines -- this is standard formatting for JSON. So I don't really understand the alternative you want (other than writing a custom JSON parser that doesn't follow JSON conventions).

for every average user with no reliance on extra software

The average user shouldn't be inspecting the backup file. It's an internal thing used by osxmetadata and the format is subject to change. I considered using sqlite as the format, which would have been even more difficult for the average user to inspect. But my point is that the backup file is designed for the use of osxmetadata, not the user. The fact that it's JSON, and that I made it well-formed in #57 so it could be inspected easily, is a bonus, not a feature.

Ascertains a user "Did the backup work? Were all files included by my shell pattern?"

You can use osxmetadata with the --verbose flag which prints out the name of each file being processed.

You could do one of the following:

grep _filepath .osxmetadata.json

or, slightly prettier:

jq '.[] | ._filepath' .osxmetadata.json

Both of these print the names (with full path) of the files in the backup file.

The .[] is needed in jq to tell jq to operate on the values of the array that are in the backup file (the backup file is actually one JSON object (array) with a bunch of values representing each file). I changed this in #57 to make the JSON well formed (though the previous "malformed" format was much easier to work with in osxmetadata).

If you wanted to inspect the filename and a specific key, for example, kMDItemFinderComment which is the Finder comment, you could do this:

jq '.[] | ._filepath, .kMDItemFinderComment' .osxmetadata.json | paste -d, - -

This command extracts the _filepath and kMDItemFinderComment keys and then prints them as comma separated values (CSV) that could then easily be imported into some other tool. The - - in paste tells it to take two values at a time then join them with , (specified by -d = delimiter).
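For comparison, a rough Python equivalent of the jq and paste pipeline, using only the standard library. It assumes the backup file is a JSON array of objects with the key names used in the jq example; the inline sample data stands in for .osxmetadata.json and is hypothetical. Unlike the plain paste approach, the csv module quotes values that themselves contain commas:

```python
import csv
import io
import json

# Hypothetical stand-in for the contents of .osxmetadata.json.
backup_text = """[
  {"_filepath": "/tmp/a.mp4", "kMDItemFinderComment": "first, draft"},
  {"_filepath": "/tmp/b.mp4"}
]"""

records = json.loads(backup_text)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for rec in records:
    # .get() yields None for a missing key; csv writes None as an empty field.
    writer.writerow([rec.get("_filepath"), rec.get("kMDItemFinderComment")])

csv_text = out.getvalue()
print(csv_text, end="")
```

To run this against a real backup, `json.loads(backup_text)` would be replaced by `json.load()` on the opened file.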

most serializing frameworks have formatting/beautifying options,

The python JSON library allows you to set the indentation, or no indentation. I use indentation of 2 for osxmetadata. The alternative would be using no indentation and that results in one file record per line (this is what earlier versions of osxmetadata did) but that is completely unreadable without using a 3rd party tool. For my purposes, I use the bat command (a replacement for cat) to view JSON files and many other things because it uses colorized output and automatic paging. This makes it easy to quickly inspect the backup file which would not be possible if no indentation was used.
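The two options can be compared directly with the standard json module. This is only an illustration with made-up records, not osxmetadata's actual output code:

```python
import json

# Hypothetical records with a nested array, similar in shape to tag lists.
records = [
    {"_filepath": "/tmp/a.mp4", "tags": ["Technology", "Done"]},
    {"_filepath": "/tmp/b.mp4", "tags": ["Clothing"]},
]

# indent=2: every key and every array item goes on its own line.
indented = json.dumps(records, indent=2)

# Default (indent=None): the whole array is emitted on a single line.
compact = json.dumps(records)

print(indented)
print(compact)
```

Both strings parse back to the same data; only the whitespace differs. Note that `json.dumps` has no built-in mode that puts each record of an array on its own line.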

@porg
Author

porg commented Nov 12, 2022

Assuring the user that the backup goes / went well

You convinced me that these are indeed sufficient:

  • --verbose during backup
  • or grep _filepath .osxmetadata.json for inspecting a backup file in retrospect

Readability

  • One file record per line is what I meant and deem as very well human readable
    • If the filepaths and the tags are somehow of similar length
    • Then one can read that almost like a spreadsheet

Excerpt .osxmetadata v0.99.37

{"_version": "0.99.37", "_filepath": "/Volumes/Shared/Videos/Creative-Ads/Apple iPhone 4 (2010).mp4", "_filename": "Apple iPhone 4 (2010)", "com.apple.FinderInfo": {"color": 2, "stationarypad": false}, "com.apple.metadata:_kMDItemUserTags": [["Technology", 0], ["Entertainment", 0], ["\u2022Done", 2]], "com.apple.metadata:kMDItemFinderComment": ""}
{"_version": "0.99.37", "_filepath": "/Volumes/Shared/Videos/Creative-Ads/Benneton (1998).avi", "_filename": "Benneton (1998).avi", "com.apple.FinderInfo": {"color": 4, "stationarypad": false}, "com.apple.metadata:_kMDItemUserTags": [["Clothing", 0], ["Anti-Racism", 0], ["Peace", 0], ["Cooperation", 0], ["\u2022Done but unimportant", 4]], "com.apple.metadata:kMDItemFinderComment": ""}
{"_version": "0.99.37", "_filepath": "/Volumes/Shared/Videos/Creative-Ads/Hugo Boss (1995).mp4", "_filename": "Hugo Boss (1995).mp4", "com.apple.FinderInfo": {"color": 2, "stationarypad": false}, "com.apple.metadata:_kMDItemUserTags": [["Clothing", 0], ["Youth", 0], ["\u2022Done", 2]], "com.apple.metadata:kMDItemFinderComment": ""}
  • For human readable JSON this was already as close as it gets
  • If only there were a way to keep this one record per row while staying standards-compliant (I don't know the syntax well enough to reshape that v0.99.37 sample accordingly)

As I see it TSV would be quite efficient in terms of storage

  • The keys which come with values most probably filled can be leftmost, and those less likely filled more on the right.
  • That way you have the "essence" on the left, and to the right the tendency to emptiness (only one TAB per unfilled attribute, or even just a linebreak right away if empty until the rightmost column).
  • QuickLookCSV.qlgenerator shows TSVs very nicely like a spreadsheet, with sticky header, then you can totally inspect it like a spreadsheet.
  • But yeah, the application then is not so much a backup/restore, but more an offline catalog (keeping the osxmeta file of a remote directory or offline disk locally for quick lookup). I just wanted to bring TSV into the discussion; maybe it's worth considering.
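The column ordering described above (most frequently filled keys on the left) could be sketched as follows. This assumes flat records with string values; the sample keys are hypothetical, and nested values such as tag arrays would need their own encoding:

```python
import csv
import io
from collections import Counter

# Hypothetical flat records; nested values such as tags are not handled here.
records = [
    {"_filepath": "/tmp/a.mp4", "comment": "keep", "rating": "5"},
    {"_filepath": "/tmp/b.mp4", "comment": "draft"},
    {"_filepath": "/tmp/c.mp4"},
]

# Count how often each key is filled, then order columns from most to least filled.
fill_counts = Counter(key for rec in records for key in rec)
columns = sorted(fill_counts, key=lambda key: -fill_counts[key])

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
)
writer.writeheader()
writer.writerows(records)  # missing keys are written as empty fields (restval)

tsv_text = out.getvalue()
print(tsv_text, end="")
```

In this sketch every row still carries one TAB per empty trailing column; cutting rows short at the last filled column would need custom writing.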

@RhetTbull
Owner

RhetTbull commented Nov 12, 2022

I personally don't find the excerpt from v0.99.37 very readable because I don't like horizontal scrolling. However, the point is moot. That format is not valid JSON (each line is valid JSON, but the file is not), and there was an issue (#57) requesting that the file be valid JSON, hence the current format. Valid JSON does not support one record per line in the form that osxmetadata needs the data. It would be one line with all the records or the current format.
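The difference between "each line is valid JSON" and "the file is valid JSON" can be demonstrated with the standard json module. The two records below are hypothetical and only mimic the old one-object-per-line layout:

```python
import json

# Two records in the old one-object-per-line layout (hypothetical values).
old_style = (
    '{"_filepath": "/tmp/a.mp4", "_filename": "a.mp4"}\n'
    '{"_filepath": "/tmp/b.mp4", "_filename": "b.mp4"}\n'
)

# Parsing the whole text as one JSON document fails at the second object.
try:
    json.loads(old_style)
    whole_file_is_valid = True
except json.JSONDecodeError:
    whole_file_is_valid = False

# Parsing line by line works, because each line is a complete JSON value.
records = [json.loads(line) for line in old_style.splitlines() if line.strip()]

print(whole_file_is_valid, len(records))
```

A reader of the old layout therefore has to split the file into lines itself, whereas the current layout can be parsed with a single `json.load` call.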

TSV might be more human readable, but the backup format is meant to be read by osxmetadata and TSV would require more work. As it is, I can read the backup file and get it into the format osxmetadata needs in 3 lines of code:

with open(backup_file, mode="r") as fp:
    backup_records = json.load(fp)
backup_data = {data["_filename"]: data for data in backup_records}

I think you're trying to use the backup file for something it's not intended for. A better solution would be to add an option to osxmetadata -list to output the data in json, tsv, or csv and then you could create whatever report you wanted in whatever format. I'll open a separate issue for this feature.

@porg
Author

porg commented Nov 12, 2022

osxmetadata -list output to json, tsv, or csv. Yes! Fine!
