Arq stores backup data in S3 in a format similar to that of the open-source version control system 'git'. When you first run Arq and give it your public and private S3 keys, it creates an S3 "bucket" with the name ".com.haystacksoftware.arq". It also creates a "universally unique identifier" (UUID) for your computer. Folder Configuration Files -------------------------- Each time you add a folder for backup, Arq creates a UUID for it and stores 2 objects in your S3 "bucket": S3 object: //buckets/ This file contains a "plist"-format XML document that looks like this: AWSRegionName us-east-1 BucketUUID 408E376B-ECF7-4688-902A-1E7671BC5B9A BucketName company ComputerUUID 600150F6-70BB-47C6-A538-6F3A2258D524 LocalPath /Users/stefan/src/company LocalMountPoint / StorageType 1 VaultName arq_408E376B-ECF7-4688-902A-1E7671BC5B9A VaultCreatedTime 12345678.0 Excludes Enabled MatchAny Conditions Only Glacier-backed folders have "VaultName" and "VaultCreatedTime" keys. NOTE: The folder's UUID and name are called "BucketUUID" and "BucketName" in the plist; this is a holdover from previous iterations of Arq and is not to be confused with S3's "bucket" concept. As of Arq 3, this plist document is encrypted. The file starts with the plain-text string "encrypted", followed by the above plist encrypted with the user's encryption password. Commits, Trees and Blobs ------------------------ When Arq backs up a folder, it creates 3 types of objects in S3, "commits", "trees" and "blobs". Each backup that you see in Arq corresponds to a "commit" object in the backup data. Its name is the SHA1 hash of its contents. The commit contains the SHA1 of a "tree" object in the backup data. This tree corresponds to the folder you're backing up. Each tree contains "nodes"; each node has either the SHA1 of another tree, or the SHA1 of a file (or multiple SHA1s, see "Tree format" below). S3 Storage Basics ----------------- At the most basic level, Arq stores name/blob pairs in S3. The name of each blob is its SHA1 hash. Because of this naming, each unique blob is stored only once. If 2 files on your system have the same contents, only 1 copy of the contents will be stored in S3. If the contents of a file change, the SHA1 hash is different and the file is stored as a different blob. Files are blobs, and commits and trees are blobs as well. Encryption ---------- Each blob is encrypted using AES-256 with the encryption key you chose when you first ran Arq. The SHA1 used for the blob's name is the SHA1 of the *encrypted* blob. Arq version 1 used your encryption password as the encryption key. Arq version 2 derives a key from your encryption password using the PKCS5_PBKDF2_HMAC_SHA1() function. See the CryptoKey.m source code for more details: https://github.com/sreitshamer/arq_restore/blob/master/io/CryptoKey.m Compression ----------- Starting with version 2, Arq now compresses all data chunks, extended attributes, and ACLs. Compression is gzip format. In the Tree and Node formats below, the fields "xattrs_are_compressed", "acl_is_compressed", and "data_are_compressed" indicate whether the data referred to by the SHA1 needs to be unzipped (after it's decrypted). Commit Format ------------- A commit contains the following bytes (see "Data Format Documentation" below for explanation of [String], [UInt32], [Date], etc): 43 6f 6d 6d 69 74 56 30 30 36 "CommitV009" [String:""] [String:""] [UInt64:num_parent_commits] ( [BlobKey:parent_commit_blob_key] ) /* repeat num_parent_commits times */ [BlobKey:tree_blob_key] [String:"file://"] [String:""] /* only present for Commit version 7 or *older* (was never used) */ [Bool:is_merge_common_ancestor_encryption_key_stretched] /* only present for Commit version 4 to 7 */ [Date:creation_date] [UInt64:num_failed_files] /* only present for Commit version 3 or later */ ( [String:""] /* only present for Commit version 3 or later */ [String:""] /* only present for Commit version 3 or later */ ) /* repeat num_failed_files times */ [Bool:has_missing_nodes] /* only present for Commit version 8 or later */ [Bool:is_complete] /* only present for Commit version 9 or later */ [Data:config_plist_xml] /* a copy of the XML file as described above */ Tree Format ----------- A tree contains the following bytes: 54 72 65 65 56 30 31 35 "Treev018" [Bool:xattrs_are_compressed] /* only present for Tree version 12 or later */ [Bool:acl_is_compressed] /* only present for Tree version 12 or later */ [BlobKey:xattrs_blob_key] [UInt64:xattrs_size] [BlobKey:acl_blob_key] [Int32:uid] [Int32:gid] [Int32:mode] [Int64:mtime_sec] [Int64:mtime_nsec] [Int64:flags] [Int32:finderFlags] [Int32:extendedFinderFlags] [Int32:st_dev] [Int32:st_ino] [UInt32:st_nlink] [Int32:st_rdev] [Int64:ctime_sec] [Int64:ctime_nsec] [Int64:st_blocks] [UInt32:st_blksize] [UInt64:aggregate_size_on_disk] /* only present for Tree version 11 to 16 (never used) */ [Int64:create_time_sec] /* only present for Tree version 15 or later */ [Int64:create_time_nsec] /* only present for Tree version 15 or later */ [UInt32:missing_node_count] /* only present for Tree version 18 or later */ ( [String:""] /* only present for Tree version 18 or later */ ) [UInt32:node_count] ( [String:""] [Node] ) /* repeat times */ Each [Node] contains the following bytes: [Bool:isTree] [Bool:data_are_compressed] /* only present for Tree version 12 or later */ [Bool:xattrs_are_compressed] /* only present for Tree version 12 or later */ [Bool:acl_is_compressed] /* only present for Tree version 12 or later */ [Int32:data_blob_keys_count] ( [BlobKey:data_blob_key] ) /* repeat times */ [UIn64:data_size] [String:""] /* only present for Tree version 18 or earlier (never used) */ [Bool:is_thumbnail_encryption_key_stretched] /* only present for Tree version 14 to 18 */ [String:""] /* only present for Tree version 18 or earlier (never used) */ [Bool:is_preview_encryption_key_stretched] /* only present for Tree version 14 to 18 */ [BlobKey:xattrs_blob_key] [UInt64:xattrs_size] [BlobKey:acl_blob_key] [Int32:uid] [Int32:gid] [Int32:mode] [Int64:mtime_sec] [Int64:mtime_nsec] [Int64:flags] [Int32:finderFlags] [Int32:extendedFinderFlags] [String:""] [String:""] [Bool:is_file_extension_hidden] [Int32:st_dev] [Int32:st_ino] [UInt32:st_nlink] [Int32:st_rdev] [Int64:ctime_sec] [Int64:ctime_nsec] [Int64:create_time_sec] [Int64:create_time_nsec] [Int64:st_blocks] [UInt32:st_blksize] Notes: - A Node can have multiple data SHA1s if the file is very large. Arq breaks up large files into multiple blobs using a rolling checksum algorithm. This way Arq only backs up the parts of a file that have changed. - "" is currently unused. - "" is currently unused. - "" is the SHA1 of a blob containing the sorted extended attributes of the file (see "XAttrSet Format" below). Note this means extended-attribute sets are "de-duplicated". - "" is the SHA1 of the blob containing the result of acl_to_text() on the file's ACL. Note this means the ACLs are "de-duplicated". - "create_time_sec" and "create_time_nsec" contain the value of the ATTR_CMN_CRTIME attribute of the file XAttrSet Format --------------- Each XAttrSet blob contains the following bytes: 58 41 74 74 72 53 65 74 56 30 30 32 "XAttrSetV002" [UInt64:xattr_count] ( [String:""] [Data:xattr_data] ) More on S3 Storage ------------------ In general, each blob is stored as an S3 object with a path of the form: //objects/ But for small files, the overhead associated with putting and getting the objects to/from S3 makes backing them up very inefficient. So, small files (files under 64KB in length) are stored in "packs", which are explained below. Packs ----- Each folder configured for backup maintains 2 "packsets", one for trees and commits, and one for all other small files. The packsets are named: -trees -blobs Small files are separated into 2 packsets because the trees and commits are cached locally (so that Arq gives reasonable performance for browsing backups); all other small blobs don't need to be cached. A packset is a set of "packs". When Arq is backing up a folder, it combines small files into a single larger packfile; when the packfile reaches 10MB, it is stored on S3. Also, when Arq finishes backing up a folder it stores its unsaved packfiles no matter their sizes. When storing a pack, Arq stores the packfile as: //packsets/-(blobs|trees)/.pack It also stores an index of the SHA1s contained in the pack as: //packsets/-(blobs|trees)/.index Pack Index Format ----------------- magic number ff 74 4f 63 version (2) 00 00 00 02 network-byte-order fanout[0] 00 00 00 02 (4-byte count of SHA1s starting with 0x00) ... fanout[255] 00 00 f0 f2 (4-byte count of total objects == count of SHA1s starting with 0xff or smaller) object[0] 00 00 00 00 (8-byte network-byte-order offset) 00 00 00 00 00 00 00 00 (8-byte network-byte-order data length) 00 00 00 00 00 xx xx xx (sha1 starting with 00) xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx 00 00 00 00 (4 bytes for alignment) object[1] 00 00 00 00 (8-byte network-byte-order offset) 00 00 00 00 00 00 00 00 (8-byte network-byte-order data length) 00 00 00 00 00 xx xx xx (sha1 starting with 00) xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx 00 00 00 00 (4 bytes for alignment) object[2] 00 00 00 00 (8-byte network-byte-order offset) 00 00 00 00 00 00 00 00 (8-byte network-byte-order data length) 00 00 00 00 00 xx xx xx (sha1 starting with 00) xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx 00 00 00 00 (4 bytes for alignment) ... object[f0f1] 00 00 00 00 (8-byte network-byte-order offset) 00 00 00 00 00 00 00 00 (8-byte network-byte-order data length) 00 00 00 00 ff xx xx xx (sha1 starting with ff) xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx 00 00 00 00 (4 bytes for alignment) Glacier archiveId not null 01 (1 byte) /* Glacier only */ Glacier archiveId strlen 00 00 00 00 (network-byte-order 8 bytes) /* Glacier only */ 00 00 00 08 /* Glacier only */ Glacier archiveId string xx xx xx xx (n bytes) /* Glacier only */ xx xx xx xx /* Glacier only */ Glacier pack size 00 00 00 00 (8-byte network-byte-order data length) /* Glacier only */ 00 00 00 00 /* Glacier only */ 20-byte SHA1 of all of the xx xx xx xx above xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx Pack File Format ---------------- signature 50 41 43 4b ("PACK") version (2) 00 00 00 02 (network-byte-order 4 bytes) object count 00 00 f0 f2 (network-byte-order 4 bytes) object[0] mimetype not null 01 (1 byte) object[0] mimetype strlen 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 08 object[0] mimetype string xx xx xx xx (n bytes) xx xx xx xx object[0] name not null 01 (1 byte) object[0] name strlen 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 08 object[0] name string xx xx xx xx (n bytes) xx xx xx xx object[0] data length 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 06 object[0] data xx xx xx xx (n bytes) xx xx ... object[f0f2] mimetype not null 01 (1 byte) object[f0f2] mimetype len 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 08 object[f0f2] mimetype str xx xx xx xx (n bytes) xx xx xx xx object[f0f2] name not null 01 (1 byte) object[f0f2] name strlen 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 08 object[f0f2] name string xx xx xx xx (n bytes) xx xx xx xx object[f0f2] data length 00 00 00 00 (network-byte-order 8 bytes) 00 00 00 04 object[f0f2] data 12 34 12 34 20-byte SHA1 of all of the xx xx xx xx above xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx Data Format Documentation Conventions ------------------------------------- We used a few shortcuts in some of the data format explanations above: [BlobKey:value] A [BlobKey] is stored as: [String:sha1] [Bool:is_encryption_key_stretched] /* only present for Tree version 14 or later, Commit version 4 or later */ [UInt32:storage_type] /* 1==S3, 2==Glacier; only present for Tree version 17 or later */ [String:archive_id] /* only present for Tree version 17 or later */ [UInt64:archive_size] /* only present for Tree version 17 or later */ [Date:archive_upload_date] /* only present for Tree version 17 or later */ [Bool:value] A [Bool] is stored as 1 byte, either 00 or 01. [String:""] A [String] is stored as: 00 or 01 isNull flag if not null: 00 00 00 00 8-byte network-byte-order length 00 00 00 0c xx xx xx xx UTF-8 string data xx xx xx xx xx xx xx xx [UInt32:] A [UInt32] is stored as: 00 00 00 00 network-byte-order uint32_t [Int32:] An [Int32] is stored as: 00 00 00 00 network-byte-order int32_t [UInt64:] A [UInt64] is stored as: 00 00 00 00 network-byte-order uint64_t 00 00 00 00 [Int64:] An [Int64] is stored as: 00 00 00 00 network-byte-order int64_t 00 00 00 00 [Date:] A [Date] is stored as: 00 or 01 isNull flag if not null: 00 00 01 26 8-byte network-byte-order milliseconds a8 79 09 48 since the first instant of 1 January 1970, GMT. [Data:] A [Data] is stored as: [UInt64:] data length xx xx xx xx bytes xx xx xx xx xx xx xx xx ...