@FIXME{need an intro here}
Making tar Archives More Portable

Creating a tar archive on a particular system that is meant to
be useful later on many other machines and with other versions of
tar is more challenging than you might think. tar
archive formats have been evolving since the first versions of Unix. Many such
formats are around, and are not always compatible with each other. This section
discusses a few problems, and gives some advice about making tar
archives more portable.
One golden rule is simplicity. For example, limit your tar
archives to contain only regular files and directories, avoiding other kinds of
special files. Do not attempt to save sparse files or contiguous files as such.
Let's discuss a few more problems, in turn.
Use straight file and directory names, made up of printable ASCII characters, avoiding colons, slashes, backslashes, spaces, and other dangerous characters. Avoid deep directory nesting. To account for oldish System V machines, limit your file and directory names to 14 characters or fewer.
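As a rough pre-flight check, the naming rules above can be verified with a small shell loop before archiving. This is only an illustrative sketch, not part of tar; the "safe" character set below is an assumption, chosen deliberately conservatively.

```shell
# Illustrative check: warn about file names that old tars or old
# System V file systems may mishandle.
check_portable() {
  find "$1" -print | while read -r path; do
    base=`basename "$path"`
    case "$base" in
      *[!A-Za-z0-9._-]*) echo "unsafe characters: $path" ;;
    esac
    if test ${#base} -gt 14; then
      echo "name too long for old System V: $path"
    fi
  done
}
```

Running `check_portable subdir` before `tar --create` then lists the names worth renaming first.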
If you intend your tar archives to be read under MSDOS,
you should not rely on case distinctions in file names, and you might use the
GNU doschk program to help diagnose illegal MSDOS
names, which are even more limited than System V's.
Normally, when tar archives a symbolic link, it writes a block
to the archive naming the target of the link. In that way, the tar
archive is a faithful record of the filesystem contents.
The --dereference (-h) option is used with --create
(-c), and causes tar to archive the files symbolic links
point to, instead of the links themselves. When this option is used and
tar encounters a symbolic link, it archives the linked-to file
instead of simply recording the presence of a symbolic link.
The name under which the file is stored in the file system is not recorded in
the archive. To record both the symbolic link name and the file name in the
system, archive the file under both names. If all links were recorded
automatically by tar, an extracted file might be linked to a file
name that no longer exists in the file system.
If a linked-to file is encountered again by tar while creating
the same archive, an entire second copy of it will be stored. (This
might be considered a bug.)
So, for portable archives, do not archive symbolic links as such, and use --dereference (-h): many systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.
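The difference can be seen in a short sketch; the tree and file names here are illustrative.

```shell
# Build a tree containing a symbolic link, then archive it with
# --dereference so the linked-to file's contents are stored instead
# of the link itself.
rm -rf demo
mkdir demo
echo 'hello' > demo/real
ln -s real demo/alias
tar --create --dereference --file demo.tar demo
```

After extraction, `demo/alias' is a regular file with the same contents as `demo/real', usable even on systems without symbolic link support.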
Certain old versions of tar cannot handle additional information
recorded by newer tar programs. To create an archive in V7 format
(not ANSI), which can be read by these old versions, specify the
--old-archive (-o) option in conjunction with the
--create (-c). tar also accepts
`--portability' for this option. When you specify it,
tar leaves out information about directories, pipes, fifos,
contiguous files, and device files, and specifies file ownership by group and
user IDs instead of group and user names.
When updating an archive, do not use --old-archive (-o) unless the archive was created using this option.
In most cases, a new format archive can be read by an old
tar program without serious trouble, so this option should seldom
be needed. On the other hand, most modern tars are able to read old
format archives, so it might be safer for you to always use
--old-archive (-o) for your distributions.
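In practice, creating such an archive looks like this (a minimal sketch; `subdir' is a stand-in for your distribution directory):

```shell
# Create an archive in old V7 format, readable by very old versions
# of tar; ownership is recorded numerically and special files are
# left out.
mkdir -p subdir
echo 'sample' > subdir/file
tar --create --old-archive --file old-style.tar subdir
```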
tar and POSIX tar

GNU tar was based on an early draft of the POSIX 1003.1
ustar standard. GNU extensions to tar, such as the
support for file names longer than 100 characters, use portions of the
tar header record which were specified in that POSIX draft as
unused. Subsequent changes in POSIX have allocated the same parts of the header
record for other purposes. As a result, GNU tar is incompatible
with the current POSIX spec, and with tar programs that follow it.
We plan to reimplement these GNU extensions in a new way which is upward
compatible with the latest POSIX tar format, but we don't know when
this will be done.
In the meantime, there is simply no telling what might happen if you read a
GNU tar archive, which uses the GNU extensions, using some other
tar program. So if you want to read the archive with another
tar program, be sure to write it using the
`--old-archive' option (`-o').
@FIXME{is there a way to tell which flavor of tar was used to write a particular archive before you try to read it?}
Traditionally, old tars limit file names to 100 characters. GNU
tar attempted two different approaches to overcome this limit,
using and extending a format specified by a draft of P1003.1. The first
approach, involving `@MaNgLeD@' file names and the like, was not very
successful; the second, using `././@LongLink' and other tricks, works
better. In theory, GNU tar should be able to handle file
names of practically unlimited length. So, if GNU tar fails to dump
and retrieve files having names longer than 100 characters, then there is a bug
in GNU tar, indeed.
Strictly following POSIX, however, the limit was still 100 characters. For
various other purposes, GNU tar used areas left unassigned in the
POSIX draft. POSIX later revised the P1003.1 ustar format by
assigning previously unused header fields, in such a way that the upper limit
for file name length was raised to 256 characters. However, the actual POSIX
limit oscillates between 100 and 256, depending on the precise location of
slashes in the full file name (this is rather ugly). Since GNU
tar uses the same fields for quite other purposes, it became
incompatible with the latest POSIX standards.
For longer or non-fitting file names, we plan to use yet another set of GNU
extensions, but this time complying with the provisions POSIX offers for
extending the format, rather than conflicting with it. Whenever an archive uses
the old GNU tar extension format or POSIX extensions, whether for
very long file names or other specialties, that archive becomes non-portable to
other tar implementations. In fact, anything can happen. The most
forgiving tars will merely unpack the file using a wrong name, and
maybe create another file named something like `@LongName', with the
true file name in it. tars not protecting themselves may even
crash with a segmentation violation!
Compatibility concerns make all this more difficult, as we will have to
support all these things together, for a while. GNU tar
should be able to produce and read true POSIX format files, while being able to
detect old GNU tar formats, besides old V7 format, and process them
conveniently. It would take years before this whole area stabilizes...
There are plans to raise this 100 limit to 256, and yet produce POSIX
conformant archives. Past 256, I do not know yet if GNU tar will go
non-POSIX again, or merely refuse to archive the file.
There are plans for GNU tar to support more fully the latest POSIX
format, while remaining able to read old V7 format, GNU format (semi-POSIX plus
extensions), as well as full POSIX. One may ask if there is a part of the POSIX
format that we still cannot support. This simple question has a complex answer.
Perhaps, on closer examination, some strong limitations will pop up, but until
now, nothing sounds too difficult (but see below). I only have these few pages
of POSIX telling about `Extended tar Format' (P1003.1-1990 -- section 10.1.1),
and there are references to other parts of the standard I do not have, which
should normally enforce limitations on stored file names (I suspect things like
fixing what / and NUL mean). There are also some
points which the standard does not make clear; existing practice will then
drive what I should do.
POSIX mandates that, when a file name cannot fit within 100 to 256 characters
(the variance comes from the fact that a / is ideally needed as the
156th character), or a link name cannot fit within 100 characters, a warning
should be issued and the file not be stored. Unless some --posix
option is given (or POSIXLY_CORRECT is set), I suspect that GNU
tar should disobey this specification, and automatically switch to
using GNU extensions to overcome file name or link name length limitations.
There is a problem, however, which I have not studied closely yet. Given a
truly POSIX archive with names having more than 100 characters, I guess that GNU
tar up to 1.11.8 will process it as if it were an old V7 archive,
and be fooled by some fields which are coded differently. So, the question is
whether the next generation of GNU tar should produce POSIX
format by default, whenever possible, producing archives older versions of GNU
tar might not be able to read correctly. I fear that we will have
to face such a choice one of these days, if we want GNU tar to go
closer to POSIX. We need not rush it. Another possibility is to produce the current
GNU tar format by default for a few years, but have GNU
tar versions from some 1.POSIX and up able to recognize
all three formats, and let older GNU tar fade out slowly. Then, we
could switch to producing POSIX format by default, with not much harm to those
still having (very old at that time) GNU tar versions prior to
1.POSIX.
POSIX format cannot represent very long names, volume headers, splitting of
files into multi-volume archives, sparse files, or incremental dumps; these
would all be disallowed if --posix is given or
POSIXLY_CORRECT is set. Otherwise, if
tar is given long names, or `-[VMSgG]', then it should
automatically go non-POSIX. I think this is easily granted without much
discussion.
Another point is that only mtime is stored in POSIX archives,
while GNU tar currently also stores atime and
ctime. If we want GNU tar to go closer to POSIX, my
choice would be to drop atime and ctime support by
default. On the other hand, I perceive that full dumps or incremental dumps need
atime and ctime support, so for those special
applications, POSIX has to be avoided altogether.
A few users requested that --sparse (-S) always be
active by default. I think that before replying to them, we have to decide if we
want GNU tar to go closer to POSIX, on the average, when producing
files. My choice would be to go closer to POSIX in the long run. Besides
possible double reading, I do not see any point in not trying to save files as
sparse when creating archives which are neither POSIX nor old-V7, so
--sparse (-S) would become selected by default when
producing such archives, whatever the reason. So, --sparse
(-S) alone might be redefined to force GNU-format archives, and
recover its previous meaning from this fact.
GNU format as it exists now can easily fool other POSIX tars, as
it uses fields which POSIX considers to be part of the file name prefix. I
wonder if it would not be a good idea, in the long run, to try changing
the GNU format so that any added field (like ctime, atime, file
offset in subsequent volumes, or sparse file descriptions) is wholly and always
pushed into an extension block, instead of using space in the POSIX header
block. I could manage to do that portably between future GNU tars.
Then other POSIX tars might at least be able to provide
reasonably correct listings for the archives produced by GNU tar,
if not able to process them otherwise.
Using these projected extensions might induce older tars to
fail. We would use the same approach as for POSIX: I'll put out a
tar capable of reading POSIXier, yet extended archives, but will
not produce this format by default, in GNU mode. In a few years, when newer GNU
tars have displaced tar 1.11.X and previous, we
could switch to producing POSIXier extended archives, with no real harm to
users, as almost all existing GNU tars will be ready to read
POSIXier format. In fact, I'll do both changes at the same time, in a few years,
and just prepare tar for both changes, without effecting them, from
1.POSIX. (Both changes: 1--using the POSIX convention for getting past
100 characters; 2--avoiding mangling POSIX headers for GNU extensions, using
only POSIX-mandated extension techniques.)
So, a future tar will have a --posix flag forcing the
use of truly POSIX headers, and thus producing archives that previous GNU
tar versions will not be able to read. So, once a pretest
announces that feature, it would be particularly useful for users to test how
exchangeable archives are between GNU tar with
--posix and other POSIX tars.
In a few years, when GNU tar produces POSIX headers by
default, --posix will have a strong meaning and will disallow GNU
extensions. But in the meantime, for a long while, --posix in GNU tar
will not disallow GNU extensions like
--label=archive-label (-V
archive-label), --multi-volume (-M),
--sparse (-S), or very long file or link names. However,
--posix with GNU extensions will use POSIX headers with
reserved-for-users extensions to headers, and I will be curious to know how well
or badly POSIX tars react to these.
GNU tar prior to 1.POSIX, and after
1.POSIX without --posix, generates and checks for the magic string
`ustar ', with two suffixed spaces. This is sufficient for older GNU
tar not to recognize POSIX archives, and consequently to
decide, wrongly, that those archives are in old V7 format. It is a useful bug
for me, because GNU tar has other POSIX incompatibilities, and I
need to segregate GNU tar semi-POSIX archives from truly POSIX
archives, for GNU tar should be somewhat compatible with itself,
while migrating closer to the latest POSIX standards. So, I'll be very careful
about how and when I make the correction.
SunOS and HP-UX tar fail to accept archives created using GNU
tar and containing non-ASCII file names, that is, file names having
characters with the eighth bit set, because they use signed checksums, while
GNU tar uses unsigned checksums when creating archives, as per
POSIX standards. On reading, GNU tar computes both checksums and
accepts either. It is somewhat worrying that a lot of people may go around
doing backups of their files using faulty (or at least non-standard) software,
not learning about it until it's time to restore their missing files with an
incompatible file extractor, or vice versa.
GNU tar computes checksums both ways, and accepts either on read,
so GNU tar can read Sun tapes even with their wrong checksums. GNU
tar produces the standard checksum, however, raising
incompatibilities with Sun. That is to say, GNU tar has not been
modified to produce incorrect archives to be read by buggy tars.
I've been told that more recent Sun tar now reads standard
archives, so maybe Sun did a similar patch, after all?
The story seems to be that when Sun first imported tar sources
onto their system, they recompiled it without realizing that the checksums were
now computed differently, because of a change in the default signedness of
char in their compiler. So they started computing checksums
wrongly. When they later realized their mistake, they merely decided to stay
compatible with it, and with themselves afterwards. Presumably, but I do not
really know, HP-UX chose to make their tar archives
compatible with Sun's. The current standards do not favor the Sun
tar format. In any case, it now falls on the shoulders of SunOS
and HP-UX users to get a tar able to read the good archives they
receive.
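The unsigned checksum can be verified by hand. The following sketch, using od and awk rather than tar itself, recomputes the checksum of the first header block of an archive and compares it with the stored octal value; per the ustar format, the 8-byte chksum field at offset 148 is summed as if it held spaces.

```shell
# Recompute a header checksum the POSIX (unsigned) way and compare it
# with the value stored in the archive.  Illustrative sketch only.
echo 'hello' > f
tar --create --file t.tar f
computed=`dd if=t.tar bs=512 count=1 2>/dev/null | od -An -v -tu1 |
  awk '{ for (i = 1; i <= NF; i++) {
           n++; s += (n >= 149 && n <= 156) ? 32 : $i } }
       END { print s }'`
stored=`dd if=t.tar bs=1 skip=148 count=8 2>/dev/null | tr -cd '0-7'`
stored=`awk -v o="$stored" 'BEGIN { for (i = 1; i <= length(o); i++)
  n = n * 8 + substr(o, i, 1); print n }'`
test "$computed" -eq "$stored" && echo 'checksum matches'
```

A tar with the Sun bug would instead sum the bytes as signed characters, giving a different value whenever the header contains bytes with the eighth bit set.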
gzip

@FIXME{ach; these two bits orig from "compare" (?). where to put?} Some format parameters must be taken into consideration when modifying an archive: @FIXME{???}. Compressed archives cannot be modified.
You can use `--gzip' and `--gunzip' on physical
devices (tape drives, etc.) and remote files as well as on normal files; data to
or from such devices or remote files is reblocked by another copy of the
tar program to enforce the specified (or default) record size. The
default compression parameters are used; if you need to override them, avoid the
--gzip (--gunzip, --ungzip, -z)
option and run gzip explicitly. (Or set the `GZIP'
environment variable.)
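For example, to get maximum compression, one might run gzip explicitly on tar's standard output (a sketch; `subdir' and the `-9' level are illustrative):

```shell
# Bypass -z and run gzip by hand so its parameters can be chosen.
mkdir -p subdir
echo 'sample' > subdir/file
tar --create --file - subdir | gzip -9 > archive.tar.gz
```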
The --gzip (--gunzip, --ungzip, -z) option does not work with the --multi-volume (-M) option, or with the --update (-u), --append (-r), --concatenate (--catenate, -A), or --delete operations.
It is not quite accurate to say that GNU tar works in concert with
gzip in a way similar to zip, say. It is true that
tar and gzip can be invoked in a single command,
as in:
$ tar cfz archive.tar.gz subdir
to save all of `subdir' into a gzip'ed archive.
Later you can do:
$ tar xfz archive.tar.gz
to explode and unpack.
The difference is that the whole archive is compressed. With
zip, archive members are archived individually. tar's
method yields better compression. On the other hand, one can view the contents
of a zip archive without having to decompress it. As for the
tar and gzip tandem, you need to decompress the
archive to see its contents. However, this may be done without needing disk
space, by using pipes internally:
$ tar tfz archive.tar.gz
About corrupted compressed archives: gzip'ed
files have no redundancy, for maximum compression. The adaptive nature of the
compression scheme means that the compression tables are implicitly spread all
over the archive. If you lose a few blocks, the dynamic construction of the
compression tables becomes unsynchronized, and there is little chance that you
could recover later in the archive.
There are pending suggestions for having a per-volume or per-file compression
in GNU tar. This would allow for viewing the contents without
decompression, and for resynchronizing decompression at every volume or file, in
case of corrupted archives. Doing so, we might lose some compressibility. But
it would make recovery easier. So, there are pros and cons. We'll see!
compress

Otherwise like
--gzip (--gunzip, --ungzip, -z).
--compress (--uncompress, -Z) stores an
archive in compressed format. This option is useful in saving time over networks
and space in pipes, and when storage space is at a premium.
--compress (--uncompress, -Z) causes
tar to compress when writing the archive, or to uncompress when
reading the archive.
To perform compression and uncompression on the archive, tar
runs the compress utility. tar uses the default
compression parameters; if you need to override them, avoid the
--compress (--uncompress, -Z) option and run
the compress utility explicitly. It is useful to be able to call
the compress utility from within tar because the
compress utility by itself cannot access remote tape drives.
The --compress (--uncompress, -Z) option
will not work in conjunction with the --multi-volume (-M)
option or the --update (-u), --append
(-r), --concatenate (--catenate, -A),
and --delete
operations. See section The Five
Advanced tar Operations, for more information on these
operations.
If there is no compress utility available, tar will report an
error. Please note that the compress program may
be covered by a patent, and therefore we recommend you stop using it.
tar will compress (when
writing an archive), or uncompress (when reading an archive), when used in
conjunction with the --create (-c), --extract
(--get, -x), --list (-t) and
--compare (--diff, -d) operations. You can have archives compressed by using the --gzip
(--gunzip, --ungzip, -z) option. This
arranges for tar to use the gzip program to
compress or uncompress the archive when writing or reading it.
To use the older, obsolete compress program, use the
--compress (--uncompress, -Z) option. The GNU
Project recommends against using compress, because there is a
patent covering the algorithm it uses. You could be sued for patent
infringement merely by running compress.
I have one question, or maybe it's a suggestion if there isn't a way to do it
now. I would like to use --gzip (--gunzip,
--ungzip, -z), but I'd also like the output to be fed
through a program like GNU ecc (actually, right now that's
`exactly' what I'd like to use :-)), basically adding ECC
protection on top of compression. It seems as if this should be quite easy to
do, but I can't work out exactly how to go about it. Of course, I can pipe the
standard output of tar through ecc, but then I lose
(though I haven't started using it yet, I confess) the ability to have
tar use rmt for its I/O (I think).
I think the most straightforward thing would be to let me specify a general set of filters outboard of compression (preferably ordered, so the order can be automatically reversed on input operations, and with the options they require specifiable), but beggars shouldn't be choosers and anything you decide on would be fine with me.
By the way, I like ecc but if (as the comments say) it can't
deal with loss of block sync, I'm tempted to throw some time at adding that
capability. Supposing I were to actually do such a thing and get it (apparently)
working, do you accept contributed changes to utilities like that? (Leigh
Clayton `loc@soliton.com', May 1995).
Isn't that exactly the role of the --use-compress-prog=program option? I never tried it myself, but I suspect you may want to write a prog script or program able to filter stdin to stdout the way you want. It should recognize the `-d' option, for when extraction is needed rather than creation.
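A minimal such filter might look as follows. This is a hypothetical sketch: gzip stands in here for whatever filter (such as an ECC coder) is wanted, and the handling of `-d' matches what tar passes when it needs decompression rather than compression.

```shell
# Hypothetical prog for --use-compress-prog: filter stdin to stdout,
# compressing by default and decompressing when tar passes -d.
cat > myfilter <<'EOF'
#!/bin/sh
case "$1" in
  -d) exec gzip -d ;;   # tar asks for decompression
  *)  exec gzip ;;      # default: compression
esac
EOF
chmod +x myfilter
```

It would then be used as `tar --create --use-compress-prog=./myfilter --file archive subdir', and the same option given again on extraction or listing.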
It has been reported that if one writes compressed data (through the --gzip (--gunzip, --ungzip, -z) or --compress (--uncompress, -Z) options) to a DLT and tries to use the DLT compression mode, the data will actually get bigger and one will end up with less space on the tape.
This option causes all files to be put in the archive to be tested for
sparseness, and handled specially if they are. The --sparse
(-S) option is useful when many dbm files, for example,
are being backed up. Using this option dramatically decreases the amount of
space needed to store such a file.
In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically without any special GNU options. For now, it is an option needing to be specified on the command line when creating or updating an archive.
Files in the filesystem occasionally have "holes." A hole in a file is a
section of the file's contents which was never written. The contents of a hole
read as all zeros. On many operating systems, actual disk storage is not
allocated for holes, but they are counted in the length of the file. If you
archive such a file, tar could create an archive longer than the
original. To have tar attempt to recognize the holes in a file, use
--sparse (-S). When you use the --sparse
(-S) option, then, for any file using less disk space than would be
expected from its length, tar searches the file for consecutive
stretches of zeros. It then records in the archive for the file where the
consecutive stretches of zeros are, and only archives the "real contents" of the
file. On extraction (using --sparse (-S) is not needed on
extraction), any such files have holes created wherever the consecutive
stretches of zeros were found. Thus, if you use --sparse
(-S), tar archives won't take more space than the
original.
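The effect can be observed directly with a small sketch; the sizes assume the file system actually supports holes, which most modern Unix file systems do.

```shell
# Create a file that is almost entirely hole, then archive it with
# and without --sparse and compare the resulting archive sizes.
dd if=/dev/zero of=sparsefile bs=1 count=1 seek=1048576 2>/dev/null
tar --create --file plain.tar sparsefile
tar --create --sparse --file sparse.tar sparsefile
ls -l plain.tar sparse.tar
```

The plain archive is a little over a megabyte, while the sparse one holds only the single non-zero byte plus bookkeeping.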
A file is sparse if it contains blocks of zeros whose existence is recorded,
but that have no space allocated on disk. When you specify the
--sparse (-S) option in conjunction with the
--create (-c) operation, tar tests all files
for sparseness while archiving. If tar finds a file to be sparse,
it uses a sparse representation of the file in the archive. See section How to Create
Archives, for more information about creating archives.
--sparse (-S) is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.
Please Note: Always use --sparse (-S) when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.
Even if your system has no sparse files currently, some may be created in the future. If you use --sparse (-S) while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). @FIXME-xref{incremental when node name is set.}
tar ignores the --sparse (-S) option when
reading an archive.
However, users should be well aware that at archive creation time, GNU
tar still has to read the whole disk file to locate the holes,
and so, even if sparse files use little space on disk and in the archive, they
may sometimes require an inordinate amount of time for reading and examining
all-zero blocks of a file. Although it works, it's painfully slow for a large
(sparse) file, even though the resulting tar archive may be small. (One user
reports that dumping a `core' file of over 400 megabytes, but with only
about 3 megabytes of actual data, took about 9 minutes on a Sun SPARCstation
ELC, with full CPU utilization.)
This reading is required in all cases, whether or not the --sparse (-S) option is used, so by merely not using the option, you are not saving time(6).
Programs like dump do not have to read the entire file; by
examining the file system directly, they can determine in advance exactly where
the holes are and thus avoid reading through them. The only data they need to
read are the actual allocated data blocks. GNU tar uses a more
portable and straightforward archiving approach; it would be fairly difficult
for it to do otherwise. Elizabeth Zwicky writes to
`comp.unix.internals', on 1990-12-10:
What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks.
st_blocks at best tells you how many holes there are; it doesn't tell you where. Just as programs may, conceivably, care what st_blocks is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable).

I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.
@UNREVISED
When tar reads files, it updates their access times.
To have tar attempt to set the access times back to what
they were before they were read, use the --atime-preserve option.
This doesn't work for files that you don't own, unless you're root, and it
doesn't interact with incremental dumps nicely (see section Performing
Backups and Restoring Files), but it is good enough for some purposes.
Handling of file attributes
The --touch (-m) option causes tar to leave
the modification times of the files it extracts as the
time when the files were extracted, instead of setting them to the times
recorded in the archive. This option is meaningless with --list
(-t).
tar is
executed on those systems able to give files away. This is considered a
security flaw by many people, at least because it makes it quite difficult to
correctly account for the disk space users occupy. Also, the
suid or sgid attributes of files are easily and
silently lost when files are given away. When writing an archive,
tar writes the user id and user name separately. If it can't find
a user name (because the user id is not in `/etc/passwd'), then it
does not write one. When restoring, and doing a chmod like when
you use --same-permissions (--preserve-permissions,
-p) (@FIXME{same-owner?}), it tries to look the name (if one was
written) up in `/etc/passwd'. If it fails, then it uses the user id
stored in the archive instead.
tar archives. The identifying names are added at create time when
provided by the system, unless --old-archive (-o) is
used. Numeric ids could be used when moving archives between a collection of
machines using centralized management for the attribution of numeric ids to
users and groups. This is often done using the NIS capabilities. When
making a tar file for distribution to other sites, it is
sometimes cleaner to use a single owner for all files in the distribution, and
nicer to specify the write permission bits of the files as stored in the
archive independently of their actual values on the file system. The way to
prepare a clean distribution is usually to have some Makefile rule create a
directory, copy all needed files into that directory, then set ownership
and permissions as wanted (there are a lot of possible schemes), and only then
make a tar archive out of this directory, before cleaning
everything out. Of course, we could add a lot of options to GNU
tar for fine-tuning permissions and ownership. This is not the
right way, I think. GNU tar is already crowded with options, and
moreover, the approach just explained gives you a great deal of control
already.
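A sketch of that staging approach, with all names illustrative:

```shell
# Stage the files to distribute in a scratch directory, normalize
# permissions there, archive the directory, then clean up.
echo 'sample' > README              # stand-in for real distribution files
rm -rf mypkg-1.0
mkdir mypkg-1.0
cp README mypkg-1.0/
chmod -R u+rwX,go+rX,go-w mypkg-1.0
tar --create --file mypkg-1.0.tar mypkg-1.0
rm -rf mypkg-1.0
```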
The --same-permissions (--preserve-permissions,
-p) option causes tar to
set the modes (access permissions) of extracted files exactly as recorded in
the archive. If this option is not used, the current umask
setting limits the permissions on extracted files. This option is meaningless
with --list (-t).
@UNREVISED
While an archive may contain many files, the archive itself is a single
ordinary file. Like any other file, an archive file can be written to a storage
device such as a tape or disk, sent through a pipe or over a network, saved on
the active file system, or even stored in another archive. An archive file is
not easy to read or manipulate without using the tar utility or Tar
mode in GNU Emacs.
Physically, an archive consists of a series of file entries terminated by an
end-of-archive entry, which consists of 512 zero bytes. A file entry usually
describes one of the files in the archive (an archive member), and
consists of a file header and the contents of the file. File headers contain
file names and statistics, checksum information which tar uses to
detect file corruption, and information about file types.
Archives are permitted to have more than one member with the same member name. One way this situation can occur is if more than one version of a file has been stored in the archive. For information about adding new versions of a file to an archive, see section Updating an Archive, and to learn more about having more than one archive member with the same name, see @FIXME-xref{-backup node, when it's written}.
In addition to entries describing archive members, an archive may contain
entries which tar itself uses to store information. See section Including a
Label in the Archive, for an example of such an archive entry.
A tar archive file contains a series of blocks. Each block
contains BLOCKSIZE bytes. Although this format may be thought of as
being on magnetic tape, other media are often used.
Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there may be a block filled with binary zeros as an end-of-file marker. A reasonable system should write a block of zeros at the end, but must not assume that such a block exists when reading an archive.
The blocks may be blocked for physical I/O operations. Each record
of n blocks (where n is set by the
--blocking-factor=512-size (-b
512-size) option to tar) is written with a single
`write ()' operation. On magnetic tapes, the result of such a write
is a single record. When writing an archive, the last record of blocks should be
written at the full size, with blocks after the zero block containing all zeros.
When reading an archive, a reasonable system should properly handle an archive
whose last record is shorter than the rest, or which contains garbage records
after a zero block.
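For instance, with the default blocking factor of 20, every record is 20 * 512 = 10240 bytes, so the archive's total size is a multiple of that (a sketch; `subdir9' is illustrative):

```shell
# The archive is written in records of 20 blocks of 512 bytes each,
# padded to a full record at the end.
mkdir -p subdir9
echo 'sample' > subdir9/file
tar --create --blocking-factor=20 --file blocked.tar subdir9
wc -c < blocked.tar
```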
The header block is defined in C as follows. In the GNU tar
distribution, this is part of file `src/tar.h':
/* GNU tar Archive Format description. */
/* If OLDGNU_COMPATIBILITY is not zero, tar produces archives which, by
default, are readable by older versions of GNU tar. This can be
overridden by using --posix; in this case, POSIXLY_CORRECT may be set
in the environment for enforcing stricter conformance. If OLDGNU_COMPATIBILITY
is zero or undefined, tar will eventually produce archives which, by
default, are POSIX compatible; then either using --posix or defining
POSIXLY_CORRECT enforces stricter conformance.
This #define will disappear in a few years. FP, June 1995. */
#define OLDGNU_COMPATIBILITY 1
/*---------------------------------------------.
| `tar' Header Block, from POSIX 1003.1-1990. |
`---------------------------------------------*/
/* POSIX header. */
struct posix_header
{ /* byte offset */
char name[100]; /* 0 */
char mode[8]; /* 100 */
char uid[8]; /* 108 */
char gid[8]; /* 116 */
char size[12]; /* 124 */
char mtime[12]; /* 136 */
char chksum[8]; /* 148 */
char typeflag; /* 156 */
char linkname[100]; /* 157 */
char magic[6]; /* 257 */
char version[2]; /* 263 */
char uname[32]; /* 265 */
char gname[32]; /* 297 */
char devmajor[8]; /* 329 */
char devminor[8]; /* 337 */
char prefix[155]; /* 345 */
/* 500 */
};
#define TMAGIC "ustar" /* ustar and a null */
#define TMAGLEN 6
#define TVERSION "00" /* 00 and no null */
#define TVERSLEN 2
/* Values used in typeflag field. */
#define REGTYPE '0' /* regular file */
#define AREGTYPE '\0' /* regular file */
#define LNKTYPE '1' /* link */
#define SYMTYPE '2' /* reserved */
#define CHRTYPE '3' /* character special */
#define BLKTYPE '4' /* block special */
#define DIRTYPE '5' /* directory */
#define FIFOTYPE '6' /* FIFO special */
#define CONTTYPE '7' /* reserved */
/* Bits used in the mode field, values in octal. */
#define TSUID 04000 /* set UID on execution */
#define TSGID 02000 /* set GID on execution */
#define TSVTX 01000 /* reserved */
/* file permissions */
#define TUREAD 00400 /* read by owner */
#define TUWRITE 00200 /* write by owner */
#define TUEXEC 00100 /* execute/search by owner */
#define TGREAD 00040 /* read by group */
#define TGWRITE 00020 /* write by group */
#define TGEXEC 00010 /* execute/search by group */
#define TOREAD 00004 /* read by other */
#define TOWRITE 00002 /* write by other */
#define TOEXEC 00001 /* execute/search by other */
/*-------------------------------------.
| `tar' Header Block, GNU extensions. |
`-------------------------------------*/
/* In GNU tar, SYMTYPE is for symbolic links, and CONTTYPE is for
contiguous files, so maybe disobeying the `reserved' comment in POSIX
header description. I suspect these were meant to be used this way, and
should not have really been `reserved' in the published standards. */
/* *BEWARE* *BEWARE* *BEWARE* that the following information is still
boiling, and may change. Even if the OLDGNU format description should be
accurate, the so-called GNU format is not yet fully decided. It is
surely meant to use only extensions allowed by POSIX, but the sketch
below repeats some ugliness from the OLDGNU format, which should rather
go away. Sparse files should be saved in such a way that they do *not*
require two passes at archive creation time. Huge files get some POSIX
fields to overflow, alternate solutions have to be sought for this. */
/* Descriptor for a single file hole. */
struct sparse
{ /* byte offset */
char offset[12]; /* 0 */
char numbytes[12]; /* 12 */
/* 24 */
};
/* Sparse files are not supported in POSIX ustar format. For sparse files
with a POSIX header, a GNU extra header is provided which holds overall
sparse information and a few sparse descriptors. When an old GNU header
replaces both the POSIX header and the GNU extra header, it holds some
sparse descriptors too. Whether POSIX or not, if more sparse descriptors
are still needed, they are put into as many successive sparse headers as
necessary. The following constants tell how many sparse descriptors fit
in each kind of header able to hold them. */
#define SPARSES_IN_EXTRA_HEADER 16
#define SPARSES_IN_OLDGNU_HEADER 4
#define SPARSES_IN_SPARSE_HEADER 21
/* The GNU extra header contains some information GNU tar needs, but not
foreseen in POSIX header format. It is only used after a POSIX header
(and never with old GNU headers), and immediately follows this POSIX
header, when typeflag is a letter rather than a digit, so signaling a GNU
extension. */
struct extra_header
{ /* byte offset */
char atime[12]; /* 0 */
char ctime[12]; /* 12 */
char offset[12]; /* 24 */
char realsize[12]; /* 36 */
char longnames[4]; /* 48 */
char unused_pad1[68]; /* 52 */
struct sparse sp[SPARSES_IN_EXTRA_HEADER];
/* 120 */
char isextended; /* 504 */
/* 505 */
};
/* Extension header for sparse files, used immediately after the GNU extra
header, and used only if all sparse information cannot fit into that
extra header. There might even be many such extension headers, one after
the other, until all sparse information has been recorded. */
struct sparse_header
{ /* byte offset */
struct sparse sp[SPARSES_IN_SPARSE_HEADER];
/* 0 */
char isextended; /* 504 */
/* 505 */
};
/* The old GNU format header conflicts with POSIX format in such a way that
POSIX archives may fool old GNU tar's, and POSIX tar's might well be
fooled by old GNU tar archives. An old GNU format header uses the space
used by the prefix field in a POSIX header, and cumulates information
normally found in a GNU extra header. With an old GNU tar header, we
never see any POSIX header nor GNU extra header. Supplementary sparse
headers are allowed, however. */
struct oldgnu_header
{ /* byte offset */
char unused_pad1[345]; /* 0 */
char atime[12]; /* 345 */
char ctime[12]; /* 357 */
char offset[12]; /* 369 */
char longnames[4]; /* 381 */
char unused_pad2; /* 385 */
struct sparse sp[SPARSES_IN_OLDGNU_HEADER];
/* 386 */
char isextended; /* 482 */
char realsize[12]; /* 483 */
/* 495 */
};
/* OLDGNU_MAGIC uses both magic and version fields, which are contiguous.
Found in an archive, it indicates an old GNU header format, which will
hopefully become obsolescent. With OLDGNU_MAGIC, uname and gname are
valid, though the header is not truly POSIX conforming. */
#define OLDGNU_MAGIC "ustar  " /* 7 chars and a null */
/* The standards committee allows only capital A through capital Z for
user-defined expansion. */
/* This is a dir entry that contains the names of files that were in the
dir at the time the dump was made. */
#define GNUTYPE_DUMPDIR 'D'
/* Identifies the *next* file on the tape as having a long linkname. */
#define GNUTYPE_LONGLINK 'K'
/* Identifies the *next* file on the tape as having a long name. */
#define GNUTYPE_LONGNAME 'L'
/* This is the continuation of a file that began on another volume. */
#define GNUTYPE_MULTIVOL 'M'
/* For storing filenames that do not fit into the main header. */
#define GNUTYPE_NAMES 'N'
/* This is for sparse files. */
#define GNUTYPE_SPARSE 'S'
/* This file is a tape/volume header. Ignore it on extraction. */
#define GNUTYPE_VOLHDR 'V'
/*--------------------------------------.
| tar Header Block, overall structure. |
`--------------------------------------*/
/* tar files are made in basic blocks of this size. */
#define BLOCKSIZE 512
enum archive_format
{
DEFAULT_FORMAT, /* format to be decided later */
V7_FORMAT, /* old V7 tar format */
OLDGNU_FORMAT, /* GNU format as per before tar 1.12 */
POSIX_FORMAT, /* restricted, pure POSIX format */
GNU_FORMAT /* POSIX format with GNU extensions */
};
union block
{
char buffer[BLOCKSIZE];
struct posix_header header;
struct extra_header extra_header;
struct oldgnu_header oldgnu_header;
struct sparse_header sparse_header;
};
/* End of Format description. */
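The byte offsets annotated in the listing can be checked mechanically. The following sketch repeats the POSIX structure and verifies a few offsets with offsetof(); since every member is a char array, a conforming compiler needs no internal padding, so the sizes should match the annotations exactly.

```c
/* Sketch: verify the annotated byte offsets of struct posix_header.
   All members are char arrays, so no alignment padding is required
   and sizeof should be exactly 500. */
#include <assert.h>
#include <stddef.h>

struct posix_header              /* repeated from the listing above */
  {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
  };

void check_layout(void)
{
    assert(offsetof(struct posix_header, mode)     == 100);
    assert(offsetof(struct posix_header, chksum)   == 148);
    assert(offsetof(struct posix_header, typeflag) == 156);
    assert(offsetof(struct posix_header, magic)    == 257);
    assert(offsetof(struct posix_header, prefix)   == 345);
    assert(sizeof(struct posix_header)             == 500);
}
```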
All characters in header blocks are represented by using 8-bit characters in the local variant of ASCII. Each field within the structure is contiguous; that is, there is no padding used within the structure. Each character on the archive medium is stored contiguously.
Bytes representing the contents of files (after the header block of each
file) are not translated in any way and are not constrained to represent
characters in any character set. The tar format does not
distinguish text files from binary files, and no translation of file contents is
performed.
The name, linkname, magic,
uname, and gname are null-terminated character
strings. All other fields are zero-filled octal numbers in ASCII. Each numeric
field of width w contains w minus 2 digits, a space, and a
null, except size and mtime, which do not contain the
trailing null.
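As a sketch of the convention just described (the helper names are invented for illustration), a numeric field of width w can be produced and read back like this:

```c
/* Illustrative helpers for the numeric-field convention above:
   w - 2 zero-filled octal digits, a space, and a trailing null. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Store VALUE into FIELD of width W (sketch; assumes VALUE fits). */
static void to_octal(char *field, int w, unsigned long value)
{
    snprintf(field, (size_t) w, "%0*lo ", w - 2, value);
}

/* Read a numeric field back; the octal digits end at the space. */
static unsigned long from_octal(const char *field)
{
    return strtoul(field, NULL, 8);
}
```

For example, a mode of 0644 stored in the 8-byte mode field becomes the eight bytes `000644 ` followed by a null.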
The name field is the file name of the file, with directory
names (if any) preceding the file name, separated by slashes.
@FIXME{how big a name before field overflows?}
The mode field provides nine bits specifying file permissions
and three bits to specify the Set UID, Set GID, and Save Text (sticky)
modes. Values for these bits are defined above. When special permissions are
required to create a file with a given mode, and the user restoring files from
the archive does not hold such permissions, the mode bit(s) specifying those
special permissions are ignored. Modes which are not supported by the operating
system restoring files from the archive will be ignored. Unsupported modes
should be faked up when creating or updating an archive; e.g. the group
permission could be copied from the other permission.
The uid and gid fields are the numeric user and
group ID of the file owners, respectively. If the operating system does not
support numeric user or group IDs, these fields should be ignored.
The size field is the size of the file in bytes; linked files
are archived with this field specified as zero. @FIXME-xref{Modifiers}, in
particular the --incremental (-G) option.
The mtime field is the modification time of the file at the time
it was archived. It is the ASCII representation of the octal value of the last
time the file was modified, represented as an integer number of seconds since
January 1, 1970, 00:00 Coordinated Universal Time.
The chksum field is the ASCII representation of the octal value
of the simple sum of all bytes in the header block. Each 8-bit byte in the
header is added to an unsigned integer, initialized to zero, the precision of
which shall be no less than seventeen bits. When calculating the checksum, the
chksum field is treated as if it were all blanks.
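The rule above can be sketched directly in C (a hypothetical helper, not GNU tar's own code):

```c
/* Sketch of the checksum computation described above: sum all 512
   bytes of the header as unsigned values, treating the 8-byte chksum
   field (bytes 148-155) as if it were all blanks. */
#include <assert.h>
#include <stddef.h>

#define BLOCKSIZE 512
#define CHKSUM_OFFSET 148
#define CHKSUM_LEN 8

unsigned int header_checksum(const unsigned char block[BLOCKSIZE])
{
    unsigned int sum = 0;
    for (size_t i = 0; i < BLOCKSIZE; i++)
        if (i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN)
            sum += (unsigned char) ' ';   /* chksum treated as blanks */
        else
            sum += block[i];
    return sum;
}
```

An unsigned int comfortably satisfies the seventeen-bit precision requirement, since the maximum possible sum is 512 × 255.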
The typeflag field specifies the type of file archived. If a
particular implementation does not recognize or permit the specified type, the
file will be extracted as if it were a regular file. As this action occurs,
tar issues a warning to the standard error.
The atime and ctime fields are used in making
incremental backups; they store, respectively, the particular file's access time
and last inode-change time.
The offset is used by the --multi-volume
(-M) option, when making a multi-volume archive. The offset is the
number of bytes into the file at which we need to restart in order to
continue the file on the next tape, i.e., it stores the location at
which a continued file is continued.
The following fields were added to deal with sparse files. A file is
sparse if it contains unallocated blocks which end up being represented
as zeros, i.e., no useful data. A test to see if a file is sparse is to look at
the number of blocks allocated for it versus the number of characters in the file;
if there are fewer blocks allocated for the file than would normally be
allocated for a file of that size, then the file is sparse. This is the method
tar uses to detect a sparse file, and once such a file is detected,
it is treated differently from non-sparse files.
Sparse files are often dbm files, or other database-type files
which have data at some points and emptiness in the greater part of the file.
Such files can appear to be very large when an `ls -l' is done on
them, when in truth, there may be a very small amount of important data
contained in the file. It is thus undesirable to have tar think
that it must back up this entire file, as great quantities of room are wasted on
empty blocks, which can lead to running out of room on a tape far earlier than
is necessary. Thus, sparse files are dealt with so that these empty blocks are
not written to the tape. Instead, what is written to the tape is a description,
of sorts, of the sparse file: where the holes are, how big the holes are, and
how much data is found at the end of the hole. This way, the file takes up
potentially far less room on the tape, and when the file is extracted later on,
it will look exactly the way it looked beforehand. The following is a
description of the fields used to handle a sparse file:
The sp is an array of struct sparse. Each
struct sparse contains two 12-character strings which represent an
offset into the file and a number of bytes to be written at that offset. The
offset is absolute, and not relative to the offset in the preceding array element.
The header can hold four of these struct sparse at the moment;
if more are needed, they are not stored in the header.
The isextended flag is set when an extended_header
is needed to deal with a file. Note that this means that this flag can only be
set when dealing with a sparse file, and it is only set in the event that the
description of the file will not fit in the allotted room for sparse structures
in the header. In other words, an extended_header is needed.
The extended_header structure is used for sparse files which
need more sparse structures than can fit in the header. The header can fit 4
such structures; if more are needed, the flag isextended gets set
and the next block is an extended_header.
Each extended_header structure contains an array of 21 sparse
structures, along with a similar isextended flag that the header
had. There can be an indeterminate number of such extended_headers
to describe a sparse file.
REGTYPE
AREGTYPE
These flags represent a regular file. In order to be compatible with
older versions of tar, a typeflag value of
AREGTYPE should be silently recognized as a regular file. New
archives should be created using REGTYPE. Also, for backward
compatibility, tar treats a regular file whose name ends with a
slash as a directory.
LNKTYPE
This flag represents a file linked to another file, of any type,
previously archived. Such files are identified in Unix by each file
having the same device and inode number. The linked-to name is specified
in the linkname field with a trailing null.
SYMTYPE
This represents a symbolic link to another file. The linked-to name is
specified in the linkname field with a trailing null.
CHRTYPE
BLKTYPE
These represent character special files and block special files
respectively. In this case the devmajor and devminor
fields will contain the major and minor device numbers respectively. Operating
systems may map the device specifications to their own local specification, or
may ignore the entry.
DIRTYPE
This flag specifies a directory or sub-directory. The directory name in
the name field should end with a slash. On systems where disk
allocation is performed on a directory basis, the size field will
contain the maximum number of bytes (which may be rounded to the nearest disk
block allocation unit) which the directory may hold. A size field
of zero indicates no such limiting. Systems which do not support limiting in
this manner should ignore the size field.
FIFOTYPE
This specifies a FIFO special file. Note that the archiving of a FIFO
file archives the existence of this file and not its contents.
CONTTYPE
This specifies a contiguous file, which is the same as a normal file
except that, in operating systems which support it, all its space is
allocated contiguously on the disk. Operating systems which do not allow
contiguous allocation should silently treat this type as a normal file.
A ... Z
These are reserved for custom implementations. Some of these are used in
the GNU modified format, as described below.
Other values are reserved for specification in future revisions of the P1003
standard, and should not be used by any tar program.
The magic field indicates that this archive was output in the
P1003 archive format. If this field contains TMAGIC, the
uname and gname fields will contain the ASCII
representation of the owner and group of the file respectively. If found, the
user and group IDs are used rather than the values in the uid and
gid fields.
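A minimal sketch of that test (the helper name is invented for illustration):

```c
/* Sketch: a header is in P1003 (ustar) format when its magic field
   holds "ustar" followed by a null. */
#include <assert.h>
#include <string.h>

#define TMAGIC "ustar"
#define TMAGLEN 6

int is_ustar(const char magic[TMAGLEN])
{
    return memcmp(magic, TMAGIC, TMAGLEN) == 0;
}
```

Note that the old GNU magic, `ustar` followed by spaces, fails this test, which is one way a reader can distinguish the two formats.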
For references, see ISO/IEC 9945-1:1990 or IEEE Std 1003.1-1990, pages 169-173 (section 10.1) for Archive/Interchange File Format; and IEEE Std 1003.2-1992, pages 380-388 (section 4.48) and pages 936-940 (section E.4.48) for pax - Portable archive interchange.
@UNREVISED
The GNU format uses additional file types to describe new types of files in an archive. These are listed below.
GNUTYPE_DUMPDIR
'D'
This represents a directory and a list of files created by the
--incremental (-G) option. The size field
gives the total size of the associated list of files. Each file name is
preceded by either a `Y' (the file should be in this archive) or
an `N'. (The file is a directory, or is not stored in the
archive.) Each file name is terminated by a null. There is an additional null
after the last file name.
GNUTYPE_MULTIVOL
'M'
This represents a file continued from another volume of a multi-volume
archive created with the --multi-volume (-M)
option. The original type of the file is not given here. The size field gives
the maximum size of this piece of the file (assuming the volume does not end
before the file is written out). The offset field gives the
offset from the beginning of the file where this part of the file begins. Thus
size plus offset should equal the original size of
the file.
GNUTYPE_SPARSE
'S'
This flag indicates that we are dealing with a sparse file. Note that
archiving a sparse file requires special operations to find holes in the
file, which mark the positions of these holes, along with the number of
bytes of data to be found after the hole.
GNUTYPE_VOLHDR
'V'
This file type is used to mark the volume header that was given when the
archive was created. The name field contains the name given after the
--label=archive-label (-V
archive-label) option. The size field is zero.
Only the first file in each volume of an archive should have this type.
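The null-terminated name list carried by a GNUTYPE_DUMPDIR entry, described above, can be walked with a small helper (a hedged sketch; the function name is invented):

```c
/* Sketch: walk a GNUTYPE_DUMPDIR data area.  Each entry is a 'Y' or
   'N' flag byte followed by a null-terminated file name; an extra
   null follows the last entry.  Returns the number of entries. */
#include <assert.h>
#include <string.h>

size_t count_dumpdir_entries(const char *data)
{
    size_t n = 0;
    while (*data != '\0') {
        data++;                     /* skip the 'Y'/'N' flag */
        data += strlen(data) + 1;   /* skip the name and its null */
        n++;
    }
    return n;
}
```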
You may have trouble reading a GNU format archive on a non-GNU system if the
options --incremental (-G), --multi-volume
(-M), --sparse (-S), or
--label=archive-label (-V
archive-label) were used when writing the archive. In general,
if tar does not use the GNU-added fields of the header, other
versions of tar should be able to read the archive. Otherwise, the
tar program will give an error, the most likely one being a
checksum error.
tar and cpio
@UNREVISED
@FIXME{Reorganize the following material}
The cpio archive formats, like tar, do have maximum
pathname lengths. The binary and old ASCII formats have a max path length of
256, and the new ASCII and CRC ASCII formats have a max path length of 1024. GNU
cpio can read and write archives with arbitrary pathname lengths,
but other cpio implementations may crash inexplicably when trying to
read them.
tar handles symbolic links in the form in which it comes in BSD;
cpio doesn't handle symbolic links in the form in which it comes in
System V prior to SVR4, and some vendors may have added symlinks to their system
without enhancing cpio to know about them. Others may have enhanced
it in a way other than the way I did it at Sun, and which was adopted by
AT&T (and which is, I think, also present in the cpio that
Berkeley picked up from AT&T and put into a later BSD release--I think I
gave them my changes).
(SVR4 does some funny stuff with tar; basically, its
cpio can handle tar format input, and write it on
output, and it probably handles symbolic links. They may not have bothered doing
anything to enhance tar as a result.)
cpio handles special files; traditional tar
doesn't.
tar comes with V7, System III, System V, and BSD source;
cpio comes only with System III, System V, and later BSD (4.3-tahoe
and later).
tar's way of handling multiple hard links to a file can handle
file systems that support 32-bit inumbers (e.g., the BSD file system);
cpio's way requires you to play some games (in its "binary" format,
i-numbers are only 16 bits, and in its "portable ASCII" format, they're 18
bits--it would have to play games with the "file system ID" field of the header
to make sure that the file system ID/i-number pairs of different files were
always different), and I don't know which cpio's, if any, play those
games. Those that don't might get confused and think two files are the same file
when they're not, and make hard links between them.
tar's way of handling multiple hard links to a file places only
one copy of the link on the tape, but the name attached to that copy is the
only one you can use to retrieve the file; cpio's way puts
one copy for every link, but you can retrieve it using any of the names.
What type of checksum (if any) is used, and how is it calculated?
See the attached manual pages for tar and cpio
format. tar uses a checksum which is the sum of all the bytes in
the tar header for a file; cpio uses no checksum.
If anyone knows why cpio was made when
tar was present at the unix scene,
It wasn't. cpio first showed up in PWB/UNIX 1.0; no
generally-available version of UNIX had tar at the time. I don't
know whether any version that was generally available within AT&T
had tar, or, if so, whether the people within AT&T who did
cpio knew about it.
On restore, if there is a corruption on a tape tar will stop at
that point, while cpio will skip over it and try to restore the
rest of the files.
The main difference is just in the command syntax and header format.
tar is a little more tape-oriented in that everything is blocked
to start on a record boundary.
Are there any differences in the ability to recover crashed archives between the two of them? (Is there any chance of recovering crashed archives at all?)
Theoretically it should be easier under tar since the blocking
lets you find a header with some variation of `dd
skip=nn'. However, modern cpio's and variations
have an option to just search for the next file header after an error with a
reasonable chance of re-syncing. However, lots of tape driver software won't
allow you to continue past a media error which should be the only reason for
getting out of sync unless a file changed sizes while you were writing the
archive.
If anyone knows why cpio was made when
tar was present at the unix scene, please tell me about this too.
Probably because it is more media efficient (by not blocking everything and
using only the space needed for the headers where tar always uses
512 bytes per file header) and it knows how to archive special files.
You might want to look at the freely available alternatives. The major ones
are afio, GNU tar, and pax, each of which
have their own extensions with some backwards compatibility.
Sparse files were tarred as sparse files (which you can easily
test, because the resulting archive gets smaller, and GNU cpio can
no longer read it).