Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

View file

@ -0,0 +1,248 @@
Scatterlist Cryptographic API
INTRODUCTION
The Scatterlist Crypto API takes page vectors (scatterlists) as
arguments, and works directly on pages. In some cases (e.g. ECB
mode ciphers), this will allow for pages to be encrypted in-place
with no copying.
One of the initial goals of this design was to readily support IPsec,
so that processing can be applied to paged skb's without the need
for linearization.
DETAILS
At the lowest level are algorithms, which register dynamically with the
API.
'Transforms' are user-instantiated objects, which maintain state, handle all
of the implementation logic (e.g. manipulating page vectors) and provide an
abstraction to the underlying algorithms. However, at the user
level they are very simple.
Conceptually, the API layering looks like this:
[transform api] (user interface)
[transform ops] (per-type logic glue e.g. cipher.c, compress.c)
[algorithm api] (for registering algorithms)
The idea is to make the user interface and algorithm registration API
very simple, while hiding the core logic from both. Many good ideas
from existing APIs such as Cryptoapi and Nettle have been adapted for this.
The API currently supports five main types of transforms: AEAD (Authenticated
Encryption with Associated Data), Block Ciphers, Ciphers, Compressors and
Hashes.
Please note that Block Ciphers is somewhat of a misnomer. It is in fact
meant to support all ciphers including stream ciphers. The difference
between Block Ciphers and Ciphers is that the latter operates on exactly
one block while the former can operate on an arbitrary amount of data,
subject to block size requirements (i.e., non-stream ciphers can only
process multiples of blocks).
Support for hardware crypto devices via an asynchronous interface is
under development.
Here's an example of how to use the API:
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
struct scatterlist sg[2];
char result[128];
struct crypto_hash *tfm;
struct hash_desc desc;
tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm))
fail();
/* ... set up the scatterlists ... */
desc.tfm = tfm;
desc.flags = 0;
if (crypto_hash_digest(&desc, sg, 2, result))
fail();
crypto_free_hash(tfm);
Many real examples are available in the regression test module (tcrypt.c).
DEVELOPER NOTES
Transforms may only be allocated in user context, and cryptographic
methods may only be called from softirq and user contexts. For
transforms with a setkey method it too should only be called from
user context.
When using the API for ciphers, performance will be optimal if each
scatterlist contains data which is a multiple of the cipher's block
size (typically 8 bytes). This prevents having to do any copying
across non-aligned page fragment boundaries.
ADDING NEW ALGORITHMS
When submitting a new algorithm for inclusion, a mandatory requirement
is that at least a few test vectors from known sources (preferably
standards) be included.
Converting existing well known code is preferred, as it is more likely
to have been reviewed and widely tested. If submitting code from LGPL
sources, please consider changing the license to GPL (see section 3 of
the LGPL).
Algorithms submitted must also be generally patent-free (e.g. IDEA
will not be included in the mainline until around 2011), and be based
on a recognized standard and/or have been subjected to appropriate
peer review.
Also check for any RFCs which may relate to the use of specific algorithms,
as well as general application notes such as RFC2451 ("The ESP CBC-Mode
Cipher Algorithms").
It's a good idea to avoid using lots of macros and use inlined functions
instead, as gcc does a good job with inlining, while excessive use of
macros can cause compilation problems on some platforms.
Also check the TODO list at the web site listed below to see what people
might already be working on.
BUGS
Send bug reports to:
linux-crypto@vger.kernel.org
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
David S. Miller <davem@redhat.com>
FURTHER INFORMATION
For further patches and various updates, including the current TODO
list, see:
http://gondor.apana.org.au/~herbert/crypto/
AUTHORS
James Morris
David S. Miller
Herbert Xu
CREDITS
The following people provided invaluable feedback during the development
of the API:
Alexey Kuznetzov
Rusty Russell
Herbert Valerio Riedel
Jeff Garzik
Michael Richardson
Andrew Morton
Ingo Oeser
Christoph Hellwig
Portions of this API were derived from the following projects:
Kerneli Cryptoapi (http://www.kerneli.org/)
Alexander Kjeldaas
Herbert Valerio Riedel
Kyle McMartin
Jean-Luc Cooke
David Bryson
Clemens Fruhwirth
Tobias Ringstrom
Harald Welte
and;
Nettle (http://www.lysator.liu.se/~nisse/nettle/)
Niels Möller
Original developers of the crypto algorithms:
Dana L. How (DES)
Andrew Tridgell and Steve French (MD4)
Colin Plumb (MD5)
Steve Reid (SHA1)
Jean-Luc Cooke (SHA256, SHA384, SHA512)
Kazunori Miyazawa / USAGI (HMAC)
Matthew Skala (Twofish)
Dag Arne Osvik (Serpent)
Brian Gladman (AES)
Kartikey Mahendra Bhatt (CAST6)
Jon Oberheide (ARC4)
Jouni Malinen (Michael MIC)
NTT(Nippon Telegraph and Telephone Corporation) (Camellia)
SHA1 algorithm contributors:
Jean-Francois Dive
DES algorithm contributors:
Raimar Falke
Gisle Sælensminde
Niels Möller
Blowfish algorithm contributors:
Herbert Valerio Riedel
Kyle McMartin
Twofish algorithm contributors:
Werner Koch
Marc Mutz
SHA256/384/512 algorithm contributors:
Andrew McDonald
Kyle McMartin
Herbert Valerio Riedel
AES algorithm contributors:
Alexander Kjeldaas
Herbert Valerio Riedel
Kyle McMartin
Adam J. Richter
Fruhwirth Clemens (i586)
Linus Torvalds (i586)
CAST5 algorithm contributors:
Kartikey Mahendra Bhatt (original developers unknown, FSF copyright).
TEA/XTEA algorithm contributors:
Aaron Grothe
Michael Ringe
Khazad algorithm contributors:
Aaron Grothe
Whirlpool algorithm contributors:
Aaron Grothe
Jean-Luc Cooke
Anubis algorithm contributors:
Aaron Grothe
Tiger algorithm contributors:
Aaron Grothe
VIA PadLock contributors:
Michal Ludvig
Camellia algorithm contributors:
NTT(Nippon Telegraph and Telephone Corporation) (Camellia)
Generic scatterwalk code by Adam J. Richter <adam@yggdrasil.com>
Please send any credits updates or corrections to:
Herbert Xu <herbert@gondor.apana.org.au>

View file

@ -0,0 +1,312 @@
=============================================
ASYMMETRIC / PUBLIC-KEY CRYPTOGRAPHY KEY TYPE
=============================================
Contents:
- Overview.
- Key identification.
- Accessing asymmetric keys.
- Signature verification.
- Asymmetric key subtypes.
- Instantiation data parsers.
========
OVERVIEW
========
The "asymmetric" key type is designed to be a container for the keys used in
public-key cryptography, without imposing any particular restrictions on the
form or mechanism of the cryptography or form of the key.
The asymmetric key is given a subtype that defines what sort of data is
associated with the key and provides operations to describe and destroy it.
However, no requirement is made that the key data actually be stored in the
key.
A completely in-kernel key retention and operation subtype can be defined, but
it would also be possible to provide access to cryptographic hardware (such as
a TPM) that might be used to both retain the relevant key and perform
operations using that key. In such a case, the asymmetric key would then
merely be an interface to the TPM driver.
Also provided is the concept of a data parser. Data parsers are responsible
for extracting information from the blobs of data passed to the instantiation
function. The first data parser that recognises the blob gets to set the
subtype of the key and define the operations that can be done on that key.
A data parser may interpret the data blob as containing the bits representing a
key, or it may interpret it as a reference to a key held somewhere else in the
system (for example, a TPM).
==================
KEY IDENTIFICATION
==================
If a key is added with an empty name, the instantiation data parsers are given
the opportunity to pre-parse a key and to determine the description the key
should be given from the content of the key.
This can then be used to refer to the key, either by complete match or by
partial match. The key type may also use other criteria to refer to a key.
The asymmetric key type's match function can then perform a wider range of
comparisons than just the straightforward comparison of the description with
the criterion string:
(1) If the criterion string is of the form "id:<hexdigits>" then the match
function will examine a key's fingerprint to see if the hex digits given
after the "id:" match the tail. For instance:
keyctl search @s asymmetric id:5acc2142
will match a key with fingerprint:
1A00 2040 7601 7889 DE11 882C 3823 04AD 5ACC 2142
(2) If the criterion string is of the form "<subtype>:<hexdigits>" then the
match will match the ID as in (1), but with the added restriction that
only keys of the specified subtype (e.g. tpm) will be matched. For
instance:
keyctl search @s asymmetric tpm:5acc2142
Looking in /proc/keys, the last 8 hex digits of the key fingerprint are
displayed, along with the subtype:
1a39e171 I----- 1 perm 3f010000 0 0 asymmetri modsign.0: DSA 5acc2142 []
=========================
ACCESSING ASYMMETRIC KEYS
=========================
For general access to asymmetric keys from within the kernel, the following
inclusion is required:
#include <crypto/public_key.h>
This gives access to functions for dealing with asymmetric / public keys.
Three enums are defined there for representing public-key cryptography
algorithms:
enum pkey_algo
digest algorithms used by those:
enum pkey_hash_algo
and key identifier representations:
enum pkey_id_type
Note that the key type representation types are required because key
identifiers from different standards aren't necessarily compatible. For
instance, PGP generates key identifiers by hashing the key data plus some
PGP-specific metadata, whereas X.509 has arbitrary certificate identifiers.
The operations defined upon a key are:
(1) Signature verification.
Other operations are possible (such as encryption) with the same key data
required for verification, but not currently supported, and others
(eg. decryption and signature generation) require extra key data.
SIGNATURE VERIFICATION
----------------------
An operation is provided to perform cryptographic signature verification, using
an asymmetric key to provide or to provide access to the public key.
int verify_signature(const struct key *key,
const struct public_key_signature *sig);
The caller must have already obtained the key from some source and can then use
it to check the signature. The caller must have parsed the signature and
transferred the relevant bits to the structure pointed to by sig.
struct public_key_signature {
u8 *digest;
u8 digest_size;
enum pkey_hash_algo pkey_hash_algo : 8;
u8 nr_mpi;
union {
MPI mpi[2];
...
};
};
The algorithm used must be noted in sig->pkey_hash_algo, and all the MPIs that
make up the actual signature must be stored in sig->mpi[] and the count of MPIs
placed in sig->nr_mpi.
In addition, the data must have been digested by the caller and the resulting
hash must be pointed to by sig->digest and the size of the hash be placed in
sig->digest_size.
The function will return 0 upon success or -EKEYREJECTED if the signature
doesn't match.
The function may also return -ENOTSUPP if an unsupported public-key algorithm
or public-key/hash algorithm combination is specified or the key doesn't
support the operation; -EBADMSG or -ERANGE if some of the parameters have weird
data; or -ENOMEM if an allocation can't be performed. -EINVAL can be returned
if the key argument is the wrong type or is incompletely set up.
=======================
ASYMMETRIC KEY SUBTYPES
=======================
Asymmetric keys have a subtype that defines the set of operations that can be
performed on that key and that determines what data is attached as the key
payload. The payload format is entirely at the whim of the subtype.
The subtype is selected by the key data parser and the parser must initialise
the data required for it. The asymmetric key retains a reference on the
subtype module.
The subtype definition structure can be found in:
#include <keys/asymmetric-subtype.h>
and looks like the following:
struct asymmetric_key_subtype {
struct module *owner;
const char *name;
void (*describe)(const struct key *key, struct seq_file *m);
void (*destroy)(void *payload);
int (*verify_signature)(const struct key *key,
const struct public_key_signature *sig);
};
Asymmetric keys point to this with their type_data[0] member.
The owner and name fields should be set to the owning module and the name of
the subtype. Currently, the name is only used for print statements.
There are a number of operations defined by the subtype:
(1) describe().
Mandatory. This allows the subtype to display something in /proc/keys
against the key. For instance the name of the public key algorithm type
could be displayed. The key type will display the tail of the key
identity string after this.
(2) destroy().
Mandatory. This should free the memory associated with the key. The
asymmetric key will look after freeing the fingerprint and releasing the
reference on the subtype module.
(3) verify_signature().
Optional. These are the entry points for the key usage operations.
Currently there is only the one defined. If not set, the caller will be
given -ENOTSUPP. The subtype may do anything it likes to implement an
operation, including offloading to hardware.
==========================
INSTANTIATION DATA PARSERS
==========================
The asymmetric key type doesn't generally want to store or to deal with a raw
blob of data that holds the key data. It would have to parse it and error
check it each time it wanted to use it. Further, the contents of the blob may
have various checks that can be performed on it (eg. self-signatures, validity
dates) and may contain useful data about the key (identifiers, capabilities).
Also, the blob may represent a pointer to some hardware containing the key
rather than the key itself.
Examples of blob formats for which parsers could be implemented include:
- OpenPGP packet stream [RFC 4880].
- X.509 ASN.1 stream.
- Pointer to TPM key.
- Pointer to UEFI key.
During key instantiation each parser in the list is tried until one doesn't
return -EBADMSG.
The parser definition structure can be found in:
#include <keys/asymmetric-parser.h>
and looks like the following:
struct asymmetric_key_parser {
struct module *owner;
const char *name;
int (*parse)(struct key_preparsed_payload *prep);
};
The owner and name fields should be set to the owning module and the name of
the parser.
There is currently only a single operation defined by the parser, and it is
mandatory:
(1) parse().
This is called to preparse the key from the key creation and update paths.
In particular, it is called during the key creation _before_ a key is
allocated, and as such, is permitted to provide the key's description in
the case that the caller declines to do so.
The caller passes a pointer to the following struct with all of the fields
cleared, except for data, datalen and quotalen [see
Documentation/security/keys.txt].
struct key_preparsed_payload {
char *description;
void *type_data[2];
void *payload;
const void *data;
size_t datalen;
size_t quotalen;
};
The instantiation data is in a blob pointed to by data and is datalen in
size. The parse() function is not permitted to change these two values at
all, and shouldn't change any of the other values _unless_ they are
recognise the blob format and will not return -EBADMSG to indicate it is
not theirs.
If the parser is happy with the blob, it should propose a description for
the key and attach it to ->description, ->type_data[0] should be set to
point to the subtype to be used, ->payload should be set to point to the
initialised data for that subtype, ->type_data[1] should point to a hex
fingerprint and quotalen should be updated to indicate how much quota this
key should account for.
When clearing up, the data attached to ->type_data[1] and ->description
will be kfree()'d and the data attached to ->payload will be passed to the
subtype's ->destroy() method to be disposed of. A module reference for
the subtype pointed to by ->type_data[0] will be put.
If the data format is not recognised, -EBADMSG should be returned. If it
is recognised, but the key cannot for some reason be set up, some other
negative error code should be returned. On success, 0 should be returned.
The key's fingerprint string may be partially matched upon. For a
public-key algorithm such as RSA and DSA this will likely be a printable
hex version of the key's fingerprint.
Functions are provided to register and unregister parsers:
int register_asymmetric_key_parser(struct asymmetric_key_parser *parser);
void unregister_asymmetric_key_parser(struct asymmetric_key_parser *subtype);
Parsers may not have the same name. The names are otherwise only used for
displaying in debugging messages.

View file

@ -0,0 +1,225 @@
Asynchronous Transfers/Transforms API
1 INTRODUCTION
2 GENEALOGY
3 USAGE
3.1 General format of the API
3.2 Supported operations
3.3 Descriptor management
3.4 When does the operation execute?
3.5 When does the operation complete?
3.6 Constraints
3.7 Example
4 DMAENGINE DRIVER DEVELOPER NOTES
4.1 Conformance points
4.2 "My application needs exclusive control of hardware channels"
5 SOURCE
---
1 INTRODUCTION
The async_tx API provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies. It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations. Code
that is written to the API can optimize for asynchronous operation and
the API will fit the chain of operations to the available offload
resources.
2 GENEALOGY
The API was initially designed to offload the memory copy and
xor-parity-calculations of the md-raid5 driver using the offload engines
present in the Intel(R) Xscale series of I/O processors. It also built
on the 'dmaengine' layer developed for offloading memory copies in the
network stack using Intel(R) I/OAT engines. The following design
features surfaced as a result:
1/ implicit synchronous path: users of the API do not need to know if
the platform they are running on has offload capabilities. The
operation will be offloaded when an engine is available and carried out
in software otherwise.
2/ cross channel dependency chains: the API allows a chain of dependent
operations to be submitted, like xor->copy->xor in the raid5 case. The
API automatically handles cases where the transition from one operation
to another implies a hardware channel switch.
3/ dmaengine extensions to support multiple clients and operation types
beyond 'memcpy'
3 USAGE
3.1 General format of the API:
struct dma_async_tx_descriptor *
async_<operation>(<op specific parameters>, struct async_submit ctl *submit)
3.2 Supported operations:
memcpy - memory copy between a source and a destination buffer
memset - fill a destination buffer with a byte value
xor - xor a series of source buffers and write the result to a
destination buffer
xor_val - xor a series of source buffers and set a flag if the
result is zero. The implementation attempts to prevent
writes to memory
pq - generate the p+q (raid6 syndrome) from a series of source buffers
pq_val - validate that a p and or q buffer are in sync with a given series of
sources
datap - (raid6_datap_recov) recover a raid6 data block and the p block
from the given sources
2data - (raid6_2data_recov) recover 2 raid6 data blocks from the given
sources
3.3 Descriptor management:
The return value is non-NULL and points to a 'descriptor' when the operation
has been queued to execute asynchronously. Descriptors are recycled
resources, under control of the offload engine driver, to be reused as
operations complete. When an application needs to submit a chain of
operations it must guarantee that the descriptor is not automatically recycled
before the dependency is submitted. This requires that all descriptors be
acknowledged by the application before the offload engine driver is allowed to
recycle (or free) the descriptor. A descriptor can be acked by one of the
following methods:
1/ setting the ASYNC_TX_ACK flag if no child operations are to be submitted
2/ submitting an unacknowledged descriptor as a dependency to another
async_tx call will implicitly set the acknowledged state.
3/ calling async_tx_ack() on the descriptor.
3.4 When does the operation execute?
Operations do not immediately issue after return from the
async_<operation> call. Offload engine drivers batch operations to
improve performance by reducing the number of mmio cycles needed to
manage the channel. Once a driver-specific threshold is met the driver
automatically issues pending operations. An application can force this
event by calling async_tx_issue_pending_all(). This operates on all
channels since the application has no knowledge of channel to operation
mapping.
3.5 When does the operation complete?
There are two methods for an application to learn about the completion
of an operation.
1/ Call dma_wait_for_async_tx(). This call causes the CPU to spin while
it polls for the completion of the operation. It handles dependency
chains and issuing pending operations.
2/ Specify a completion callback. The callback routine runs in tasklet
context if the offload engine driver supports interrupts, or it is
called in application context if the operation is carried out
synchronously in software. The callback can be set in the call to
async_<operation>, or when the application needs to submit a chain of
unknown length it can use the async_trigger_callback() routine to set a
completion interrupt/callback at the end of the chain.
3.6 Constraints:
1/ Calls to async_<operation> are not permitted in IRQ context. Other
contexts are permitted provided constraint #2 is not violated.
2/ Completion callback routines cannot submit new operations. This
results in recursion in the synchronous case and spin_locks being
acquired twice in the asynchronous case.
3.7 Example:
Perform a xor->copy->xor operation where each operation depends on the
result from the previous operation:
void callback(void *param)
{
struct completion *cmp = param;
complete(cmp);
}
void run_xor_copy_xor(struct page **xor_srcs,
int xor_src_cnt,
struct page *xor_dest,
size_t xor_len,
struct page *copy_src,
struct page *copy_dest,
size_t copy_len)
{
struct dma_async_tx_descriptor *tx;
addr_conv_t addr_conv[xor_src_cnt];
struct async_submit_ctl submit;
addr_conv_t addr_conv[NDISKS];
struct completion cmp;
init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL,
addr_conv);
tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit)
submit->depend_tx = tx;
tx = async_memcpy(copy_dest, copy_src, 0, 0, copy_len, &submit);
init_completion(&cmp);
init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST | ASYNC_TX_ACK, tx,
callback, &cmp, addr_conv);
tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit);
async_tx_issue_pending_all();
wait_for_completion(&cmp);
}
See include/linux/async_tx.h for more information on the flags. See the
ops_run_* and ops_complete_* routines in drivers/md/raid5.c for more
implementation examples.
4 DRIVER DEVELOPMENT NOTES
4.1 Conformance points:
There are a few conformance points required in dmaengine drivers to
accommodate assumptions made by applications using the async_tx API:
1/ Completion callbacks are expected to happen in tasklet context
2/ dma_async_tx_descriptor fields are never manipulated in IRQ context
3/ Use async_tx_run_dependencies() in the descriptor clean up path to
handle submission of dependent operations
4.2 "My application needs exclusive control of hardware channels"
Primarily this requirement arises from cases where a DMA engine driver
is being used to support device-to-memory operations. A channel that is
performing these operations cannot, for many platform specific reasons,
be shared. For these cases the dma_request_channel() interface is
provided.
The interface is:
struct dma_chan *dma_request_channel(dma_cap_mask_t mask,
dma_filter_fn filter_fn,
void *filter_param);
Where dma_filter_fn is defined as:
typedef bool (*dma_filter_fn)(struct dma_chan *chan, void *filter_param);
When the optional 'filter_fn' parameter is set to NULL
dma_request_channel simply returns the first channel that satisfies the
capability mask. Otherwise, when the mask parameter is insufficient for
specifying the necessary channel, the filter_fn routine can be used to
disposition the available channels in the system. The filter_fn routine
is called once for each free channel in the system. Upon seeing a
suitable channel filter_fn returns DMA_ACK which flags that channel to
be the return value from dma_request_channel. A channel allocated via
this interface is exclusive to the caller, until dma_release_channel()
is called.
The DMA_PRIVATE capability flag is used to tag dma devices that should
not be used by the general-purpose allocator. It can be set at
initialization time if it is known that a channel will always be
private. Alternatively, it is set when dma_request_channel() finds an
unused "public" channel.
A couple caveats to note when implementing a driver and consumer:
1/ Once a channel has been privately allocated it will no longer be
considered by the general-purpose allocator even after a call to
dma_release_channel().
2/ Since capabilities are specified at the device level a dma_device
with multiple channels will either have all channels public, or all
channels private.
5 SOURCE
include/linux/dmaengine.h: core header file for DMA drivers and api users
drivers/dma/dmaengine.c: offload engine channel management routines
drivers/dma/: location for offload engine drivers
include/linux/async_tx.h: core header file for the async_tx api
crypto/async_tx/async_tx.c: async_tx interface to dmaengine and common code
crypto/async_tx/async_memcpy.c: copy offload
crypto/async_tx/async_xor.c: xor and xor zero sum offload

View file

@ -0,0 +1,352 @@
Below is the original README file from the descore.shar package.
------------------------------------------------------------------------------
des - fast & portable DES encryption & decryption.
Copyright (C) 1992 Dana L. How
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU Library General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Library General Public License for more details.
You should have received a copy of the GNU Library General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Author's address: how@isl.stanford.edu
$Id: README,v 1.15 1992/05/20 00:25:32 how E $
==>> To compile after untarring/unsharring, just `make' <<==
This package was designed with the following goals:
1. Highest possible encryption/decryption PERFORMANCE.
2. PORTABILITY to any byte-addressable host with a 32bit unsigned C type
3. Plug-compatible replacement for KERBEROS's low-level routines.
This second release includes a number of performance enhancements for
register-starved machines. My discussions with Richard Outerbridge,
71755.204@compuserve.com, sparked a number of these enhancements.
To more rapidly understand the code in this package, inspect desSmallFips.i
(created by typing `make') BEFORE you tackle desCode.h. The latter is set
up in a parameterized fashion so it can easily be modified by speed-daemon
hackers in pursuit of that last microsecond. You will find it more
illuminating to inspect one specific implementation,
and then move on to the common abstract skeleton with this one in mind.
performance comparison to other available des code which i could
compile on a SPARCStation 1 (cc -O4, gcc -O2):
this code (byte-order independent):
30us per encryption (options: 64k tables, no IP/FP)
33us per encryption (options: 64k tables, FIPS standard bit ordering)
45us per encryption (options: 2k tables, no IP/FP)
48us per encryption (options: 2k tables, FIPS standard bit ordering)
275us to set a new key (uses 1k of key tables)
this has the quickest encryption/decryption routines i've seen.
since i was interested in fast des filters rather than crypt(3)
and password cracking, i haven't really bothered yet to speed up
the key setting routine. also, i have no interest in re-implementing
all the other junk in the mit kerberos des library, so i've just
provided my routines with little stub interfaces so they can be
used as drop-in replacements with mit's code or any of the mit-
compatible packages below. (note that the first two timings above
are highly variable because of cache effects).
kerberos des replacement from australia (version 1.95):
53us per encryption (uses 2k of tables)
96us to set a new key (uses 2.25k of key tables)
so despite the author's inclusion of some of the performance
improvements i had suggested to him, this package's
encryption/decryption is still slower on the sparc and 68000.
more specifically, 19-40% slower on the 68020 and 11-35% slower
on the sparc, depending on the compiler;
in full gory detail (ALT_ECB is a libdes variant):
compiler machine desCore libdes ALT_ECB slower by
gcc 2.1 -O2 Sun 3/110 304 uS 369.5uS 461.8uS 22%
cc -O1 Sun 3/110 336 uS 436.6uS 399.3uS 19%
cc -O2 Sun 3/110 360 uS 532.4uS 505.1uS 40%
cc -O4 Sun 3/110 365 uS 532.3uS 505.3uS 38%
gcc 2.1 -O2 Sun 4/50 48 uS 53.4uS 57.5uS 11%
cc -O2 Sun 4/50 48 uS 64.6uS 64.7uS 35%
cc -O4 Sun 4/50 48 uS 64.7uS 64.9uS 35%
(my time measurements are not as accurate as his).
the comments in my first release of desCore on version 1.92:
68us per encryption (uses 2k of tables)
96us to set a new key (uses 2.25k of key tables)
this is a very nice package which implements the most important
of the optimizations which i did in my encryption routines.
it's a bit weak on common low-level optimizations which is why
it's 39%-106% slower. because he was interested in fast crypt(3) and
password-cracking applications, he also used the same ideas to
speed up the key-setting routines with impressive results.
(at some point i may do the same in my package). he also implements
the rest of the mit des library.
(code from eay@psych.psy.uq.oz.au via comp.sources.misc)
fast crypt(3) package from denmark:
the des routine here is buried inside a loop to do the
crypt function and i didn't feel like ripping it out and measuring
performance. his code takes 26 sparc instructions to compute one
des iteration; above, Quick (64k) takes 21 and Small (2k) takes 37.
he claims to use 280k of tables but the iteration calculation seems
to use only 128k. his tables and code are machine independent.
(code from glad@daimi.aau.dk via alt.sources or comp.sources.misc)
swedish reimplementation of Kerberos des library
108us per encryption (uses 34k worth of tables)
134us to set a new key (uses 32k of key tables to get this speed!)
the tables used seem to be machine-independent;
he seems to have included a lot of special case code
so that, e.g., `long' loads can be used instead of 4 `char' loads
when the machine's architecture allows it.
(code obtained from chalmers.se:pub/des)
crack 3.3c package from england:
as in crypt above, the des routine is buried in a loop. it's
also very modified for crypt. his iteration code uses 16k
of tables and appears to be slow.
(code obtained from aem@aber.ac.uk via alt.sources or comp.sources.misc)
``highly optimized'' and tweaked Kerberos/Athena code (byte-order dependent):
165us per encryption (uses 6k worth of tables)
478us to set a new key (uses <1k of key tables)
so despite the comments in this code, it was possible to get
faster code AND smaller tables, as well as making the tables
machine-independent.
(code obtained from prep.ai.mit.edu)
UC Berkeley code (depends on machine-endedness):
226us per encryption
10848us to set a new key
table sizes are unclear, but they don't look very small
(code obtained from wuarchive.wustl.edu)
motivation and history
a while ago i wanted some des routines and the routines documented on sun's
man pages either didn't exist or dumped core. i had heard of kerberos,
and knew that it used des, so i figured i'd use its routines. but once
i got it and looked at the code, it really set off a lot of pet peeves -
it was too convoluted, the code had been written without taking
advantage of the regular structure of operations such as IP, E, and FP
(i.e. the author didn't sit down and think before coding),
it was excessively slow, the author had attempted to clarify the code
by adding MORE statements to make the data movement more `consistent'
instead of simplifying his implementation and cutting down on all data
movement (in particular, his use of L1, R1, L2, R2), and it was full of
idiotic `tweaks' for particular machines which failed to deliver significant
speedups but which did obfuscate everything. so i took the test data
from his verification program and rewrote everything else.
a while later i ran across the great crypt(3) package mentioned above.
the fact that this guy was computing 2 sboxes per table lookup rather
than one (and using a MUCH larger table in the process) emboldened me to
do the same - it was a trivial change from which i had been scared away
by the larger table size. in his case he didn't realize you don't need to keep
the working data in TWO forms, one for easy use of half the sboxes in
indexing, the other for easy use of the other half; instead you can keep
it in the form for the first half and use a simple rotate to get the other
half. this means i have (almost) half the data manipulation and half
the table size. in fairness though he might be encoding something particular
to crypt(3) in his tables - i didn't check.
i'm glad that i implemented it the way i did, because this C version is
portable (the ifdef's are performance enhancements) and it is faster
than versions hand-written in assembly for the sparc!
porting notes
one thing i did not want to do was write an enormous mess
which depended on endedness and other machine quirks,
and which necessarily produced different code and different lookup tables
for different machines. see the kerberos code for an example
of what i didn't want to do; all their endedness-specific `optimizations'
obfuscate the code and in the end were slower than a simpler machine
independent approach. however, there are always some portability
considerations of some kind, and i have included some options
for varying numbers of register variables.
perhaps some will still regard the result as a mess!
1) i assume everything is byte addressable, although i don't actually
depend on the byte order, and that bytes are 8 bits.
i assume word pointers can be freely cast to and from char pointers.
note that 99% of C programs make these assumptions.
i always use unsigned char's if the high bit could be set.
2) the typedef `word' means a 32 bit unsigned integral type.
if `unsigned long' is not 32 bits, change the typedef in desCore.h.
i assume sizeof(word) == 4 EVERYWHERE.
the (worst-case) cost of my NOT doing endedness-specific optimizations
in the data loading and storing code surrounding the key iterations
is less than 12%. also, there is the added benefit that
the input and output work areas do not need to be word-aligned.
OPTIONAL performance optimizations
1) you should define one of `i386,' `vax,' `mc68000,' or `sparc,'
whichever one is closest to the capabilities of your machine.
see the start of desCode.h to see exactly what this selection implies.
note that if you select the wrong one, the des code will still work;
these are just performance tweaks.
2) for those with functional `asm' keywords: you should change the
ROR and ROL macros to use machine rotate instructions if you have them.
this will save 2 instructions and a temporary per use,
or about 32 to 40 instructions per en/decryption.
note that gcc is smart enough to translate the ROL/R macros into
machine rotates!
these optimizations are all rather persnickety, yet with them you should
be able to get performance equal to assembly-coding, except that:
1) with the lack of a bit rotate operator in C, rotates have to be synthesized
from shifts. so access to `asm' will speed things up if your machine
has rotates, as explained above in (3) (not necessary if you use gcc).
2) if your machine has less than 12 32-bit registers i doubt your compiler will
generate good code.
`i386' tries to configure the code for a 386 by only declaring 3 registers
(it appears that gcc can use ebx, esi and edi to hold register variables).
however, if you like assembly coding, the 386 does have 7 32-bit registers,
and if you use ALL of them, use `scaled by 8' address modes with displacement
and other tricks, you can get reasonable routines for DesQuickCore... with
about 250 instructions apiece. For DesSmall... it will help to rearrange
des_keymap, i.e., now the sbox # is the high part of the index and
the 6 bits of data is the low part; it helps to exchange these.
since i have no way to conveniently test it i have not provided my
shoehorned 386 version. note that with this release of desCore, gcc is able
to put everything in registers(!), and generate about 370 instructions apiece
for the DesQuickCore... routines!
coding notes
the en/decryption routines each use 6 necessary register variables,
with 4 being actively used at once during the inner iterations.
if you don't have 4 register variables get a new machine.
up to 8 more registers are used to hold constants in some configurations.
i assume that the use of a constant is more expensive than using a register:
a) additionally, i have tried to put the larger constants in registers.
registering priority was by the following:
anything more than 12 bits (bad for RISC and CISC)
greater than 127 in value (can't use movq or byte immediate on CISC)
9-127 (may not be able to use CISC shift immediate or add/sub quick),
1-8 were never registered, being the cheapest constants.
b) the compiler may be too stupid to realize table and table+256 should
be assigned to different constant registers and instead repetitively
do the arithmetic, so i assign these to explicit `m' register variables
when possible and helpful.
i assume that indexing is cheaper or equivalent to auto increment/decrement,
where the index is 7 bits unsigned or smaller.
this assumption is reversed for 68k and vax.
i assume that addresses can be cheaply formed from two registers,
or from a register and a small constant.
for the 68000, the `two registers and small offset' form is used sparingly.
all index scaling is done explicitly - no hidden shifts by log2(sizeof).
the code is written so that even a dumb compiler
should never need more than one hidden temporary,
increasing the chance that everything will fit in the registers.
KEEP THIS MORE SUBTLE POINT IN MIND IF YOU REWRITE ANYTHING.
(actually, there are some code fragments now which do require two temps,
but fixing it would either break the structure of the macros or
require declaring another temporary).
special efficient data format
bits are manipulated in this arrangement most of the time (S7 S5 S3 S1):
003130292827xxxx242322212019xxxx161514131211xxxx080706050403xxxx
(the x bits are still there, i'm just emphasizing where the S boxes are).
bits are rotated left 4 when computing S6 S4 S2 S0:
282726252423xxxx201918171615xxxx121110090807xxxx040302010031xxxx
the rightmost two bits are usually cleared so the lower byte can be used
as an index into an sbox mapping table. the next two x'd bits are set
to various values to access different parts of the tables.
how to use the routines
datatypes:
pointer to 8 byte area of type DesData
used to hold keys and input/output blocks to des.
pointer to 128 byte area of type DesKeys
used to hold full 768-bit key.
must be long-aligned.
DesQuickInit()
call this before using any other routine with `Quick' in its name.
it generates the special 64k table these routines need.
DesQuickDone()
frees this table
DesMethod(m, k)
m points to a 128byte block, k points to an 8 byte des key
which must have odd parity (or -1 is returned) and which must
not be a (semi-)weak key (or -2 is returned).
normally DesMethod() returns 0.
m is filled in from k so that when one of the routines below
is called with m, the routine will act like standard des
en/decryption with the key k. if you use DesMethod,
you supply a standard 56bit key; however, if you fill in
m yourself, you will get a 768bit key - but then it won't
be standard. it's 768bits not 1024 because the least significant
two bits of each byte are not used. note that these two bits
will be set to magic constants which speed up the encryption/decryption
on some machines. and yes, each byte controls
a specific sbox during a specific iteration.
you really shouldn't use the 768bit format directly; i should
provide a routine that converts 128 6-bit bytes (specified in
S-box mapping order or something) into the right format for you.
this would entail some byte concatenation and rotation.
Des{Small|Quick}{Fips|Core}{Encrypt|Decrypt}(d, m, s)
performs des on the 8 bytes at s into the 8 bytes at d. (d,s: char *).
uses m as a 768bit key as explained above.
the Encrypt|Decrypt choice is obvious.
Fips|Core determines whether a completely standard FIPS initial
and final permutation is done; if not, then the data is loaded
and stored in a nonstandard bit order (FIPS w/o IP/FP).
Fips slows down Quick by 10%, Small by 9%.
Small|Quick determines whether you use the normal routine
or the crazy quick one which gobbles up 64k more of memory.
Small is 50% slower then Quick, but Quick needs 32 times as much
memory. Quick is included for programs that do nothing but DES,
e.g., encryption filters, etc.
Getting it to compile on your machine
there are no machine-dependencies in the code (see porting),
except perhaps the `now()' macro in desTest.c.
ALL generated tables are machine independent.
you should edit the Makefile with the appropriate optimization flags
for your compiler (MAX optimization).
Speeding up kerberos (and/or its des library)
note that i have included a kerberos-compatible interface in desUtil.c
through the functions des_key_sched() and des_ecb_encrypt().
to use these with kerberos or kerberos-compatible code put desCore.a
ahead of the kerberos-compatible library on your linker's command line.
you should not need to #include desCore.h; just include the header
file provided with the kerberos library.
Other uses
the macros in desCode.h would be very useful for putting inline des
functions in more complicated encryption routines.