MS Numpress
===========

Implementations of three compression schemes for numeric data from mass spectrometers.

The library provides three algorithms: one designed to compress first-order smooth data such as retention time or m/z arrays, and two for compressing non-smooth data with lower precision requirements, such as ion count arrays.

Implementations and unit tests are provided in C++, Java, and C#; Python bindings also exist.

### C++ library tests

For C++, move to `src/main/cpp`, then compile and run the tests (on Linux) with

    g++ MSNumpress.cpp MSNumpressTest.cpp -o test && ./test

### Java (maven) library tests

Ensure that maven (2.2+) is installed. Then, in this directory, run

    mvn test

### Python library tests

Ensure that Cython and the Python headers are installed on your system. Then move to `src/main/python`, and compile and run the tests (on Linux) with

    python setup.py build_ext --inplace
    nosetests test_pymsnumpress.py

### C# library tests

Ensure that a version of Visual Studio is installed on your system. Then open a Visual Studio Cross Tools Command Prompt, move to `src\main\csharp`, and compile and run the tests (on Windows) with

    csc /target:library MSNumpress.cs MSNumpressTest.cs /reference:"C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\PublicAssemblies\Microsoft.VisualStudio.QualityTools.UnitTestFramework.dll"
    MSTest /testcontainer:MSNumpress.dll

NOTE: The example above is for Visual Studio Community 2015 (v14.0). If you use a different version, the path to the unit test reference DLL will be slightly different.
Numpress Pic
------------
### MS Numpress positive integer compression

Intended for ion count data, this compression simply rounds values to the nearest integer and stores these integers in a truncated form that is effective for values relatively close to zero.

Numpress Slof
-------------
### MS Numpress short logged float compression

Also targeting ion count data, this compression takes the natural logarithm of values, multiplies by a scaling factor and rounds to the nearest integer. For a typical ion count dynamic range these values fit into two-byte integers, so only the two least significant bytes of the integer are stored.

The scaling factor can be chosen manually, but the library also contains a function for retrieving the optimal Slof scaling factor for a given data array. Since the scaling factor is variable, it is stored as a regular double precision float first in the encoding, and automatically parsed during decoding.

Numpress Lin
------------
### MS Numpress linear prediction compression

This compression uses a fixed point representation, achieved by multiplying by a scaling factor and rounding to the nearest integer. To exploit the assumed linearity of the data, linear prediction is then used in the following way.

The first two values are stored without compression as 4 byte integers. For each following value a linear prediction is made from the two previous values:

    Xpred = (X(n) - X(n-1)) + X(n)
    Xres  = X(n+1) - Xpred

The residual `Xres` is then stored, using the same truncated integer representation as in Numpress Pic.

The scaling factor can be chosen manually, but the library also contains a function for retrieving the optimal Lin scaling factor for a given data array. Since the scaling factor is variable, it is stored as a regular double precision float first in the encoding, and automatically parsed during decoding.
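
The linear prediction step of Numpress Lin can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the library's implementation: `linearResiduals` and `reconstructFromResiduals` are hypothetical helper names, rounding is done with `std::llround`, and the truncated integer packing and the leading scaling factor are left out. The residual is taken as actual minus predicted, which is the sign used by `encodeLinear` in `MSNumpress.cpp`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the MSNumpress API).
// Converts each value to a fixed-point integer and returns, for the first two
// entries, the fixed-point value itself and, for every later entry, the
// residual against a linear extrapolation of the two preceding values.
std::vector<long long> linearResiduals(const std::vector<double>& data,
                                       double fixedPoint) {
    std::vector<long long> out(data.size());
    long long prev2 = 0;  // fixed-point value two positions back
    long long prev1 = 0;  // fixed-point value one position back
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Assumption: round to nearest with llround.
        const long long x = std::llround(data[i] * fixedPoint);
        if (i < 2) {
            out[i] = x;  // stored without prediction
        } else {
            const long long predicted = prev1 + (prev1 - prev2);
            out[i] = x - predicted;  // small when the data is close to linear
        }
        prev2 = prev1;
        prev1 = x;
    }
    return out;
}

// Hypothetical inverse of linearResiduals: rebuilds the fixed-point values
// from the residuals and scales them back to doubles.
std::vector<double> reconstructFromResiduals(const std::vector<long long>& res,
                                             double fixedPoint) {
    assert(fixedPoint != 0.0);
    std::vector<double> out(res.size());
    long long prev2 = 0;
    long long prev1 = 0;
    for (std::size_t i = 0; i < res.size(); ++i) {
        const long long x =
            (i < 2) ? res[i] : prev1 + (prev1 - prev2) + res[i];
        out[i] = static_cast<double>(x) / fixedPoint;
        prev2 = prev1;
        prev1 = x;
    }
    return out;
}
```

With the four m/z values used in the C++ `encodeLinear` test (100.0, 200.0, 300.00005, 400.00010) and a scaling factor of 100000, the residuals for the third and fourth values come out as 5 and 0. In the truncated integer representation these take three halfbytes together, which is why that test expects the two trailing bytes 0x75 and 0x80.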
Truncated integer representation
---------------------------------

This encoding works on a 4 byte integer, by truncating initial zeros or ones. If the initial (most significant) half byte is 0x0 or 0xf, the number of such halfbytes starting from the most significant is stored in a halfbyte. This initial count is then followed by the rest of the int's halfbytes, in little-endian order. A count halfbyte c of

    0 <= c <= 8     is interpreted as an initial c      0x0 halfbytes
    9 <= c <= 15    is interpreted as an initial (c-8)  0xf halfbytes

Examples:

    int     c       rest
    0   =>  0x8
    -1  =>  0xf     0xf
    23  =>  0x6     0x7 0x1

License
-------

This code is open source. It is dual licenced under the Apache 2.0 license as well as the 3-clause BSD licence. See the LICENCE-BSD and LICENCE-APACHE files for the licences.

/*
    MSNumpressTest.cpp
    johan.teleman@immun.lth.se

    Copyright 2013 Johan Teleman

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
*/
/*
    Compile and run tests (on LINUX) with
    > g++ MSNumpress.cpp MSNumpressTest.cpp -o test && ./test
 */

#include "MSNumpress.hpp"
// Standard headers inferred from the symbols used in this file.
#include <cassert>
#include <cstdlib>
#include <cmath>
#include <algorithm>
#include <iostream>

using std::cout;
using std::endl;
using std::abs;
using std::max;

double ENC_TWO_BYTE_FIXED_POINT = 3000.0;

void encodeLinear1() {
    double mzs[1];
    mzs[0] = 100.0;

    size_t nMzs = 1;
    unsigned char encoded[12];
    size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], 100000.0);

    assert(12 == encodedBytes);
    assert(0x80 == encoded[8]);
    assert(0x96 == encoded[9]);
    assert(0x98 == encoded[10]);
    assert(0x00 == encoded[11]);

    cout << "+ pass encodeLinear1 " << endl << endl;
}

void encodeLinear() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] = 300.00005;
    mzs[3] = 400.00010;

    size_t nMzs = 4;
    unsigned char encoded[20];
    size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], 100000.0);

    assert(18 == encodedBytes);
    assert(0x80 == encoded[8]);
    assert(0x96 == encoded[9]);
    assert(0x98 == encoded[10]);
    assert(0x00 == encoded[11]);
    assert(0x75 == encoded[16]);
    assert(0x80 == encoded[17]);

    cout << "+ pass encodeLinear " << endl << endl;
}

void decodeLinearNice() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] = 300.00005;
    mzs[3] = 400.00010;

    size_t nMzs = 4;
    unsigned char encoded[28];
    double fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPoint(&mzs[0], nMzs);
    size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], fixedPoint);

    double decoded[4];
    size_t numDecoded = ms::numpress::MSNumpress::decodeLinear(&encoded[0], encodedBytes, &decoded[0]);

    assert(4 == numDecoded);
    assert(abs(100.0 - decoded[0]) < 0.000005);
    assert(abs(200.0 - decoded[1]) < 0.000005);
    assert(abs(300.00005 - decoded[2]) < 0.000005);
    assert(abs(400.00010 - decoded[3]) < 0.000005);

    cout << "+ pass decodeLinearNice " << endl << endl;
}

void decodeLinearNiceLowFP() {
    double mzs[7];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] =
        300.00005;
    mzs[3] = 400.00010;
    mzs[4] = 450.00010;
    mzs[5] = 455.00010;
    mzs[6] = 700.00010;

    size_t nMzs = 7;
    unsigned char encoded[33]; // max length is 33 bytes

    // check for fixed points
    {
        double fixedPoint;
        fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 0.1);
        assert( abs(5 - fixedPoint) < 0.000005);

        fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 1e-3);
        assert( abs(500 - fixedPoint) < 0.000005);

        fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 1e-5);
        assert( abs(50000 - fixedPoint) < 0.000005);

        fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 1e-7);
        assert( abs(5000000 - fixedPoint) < 0.000005);

        // cannot fulfill accuracy of 1e-8
        fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 1e-8);
        assert( abs(-1 - fixedPoint) < 0.000005);
    }

    {
        double fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, 0.001);
        size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], fixedPoint);

        double decoded[7];
        size_t numDecoded = ms::numpress::MSNumpress::decodeLinear(&encoded[0], encodedBytes, &decoded[0]);

        assert(25 == encodedBytes);
        assert(7 == numDecoded);
        assert(abs(100.0 - decoded[0]) < 0.001);
        assert(abs(200.0 - decoded[1]) < 0.001);
        assert(abs(300.00005 - decoded[2]) < 0.001);
        assert(abs(400.00010 - decoded[3]) < 0.001);
    }

    double mz_err[5];
    double encodedLength[5];

    // for higher accuracy, we get longer encoded lengths
    mz_err[0] = 0.1;  encodedLength[0] = 22;
    mz_err[1] = 1e-3; encodedLength[1] = 25;
    mz_err[2] = 1e-5; encodedLength[2] = 29;
    mz_err[3] = 1e-6; encodedLength[3] = 30;
    mz_err[4] = 1e-7; encodedLength[4] = 31;

    for (int k = 0; k < 4; k++) {
        double fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPointMass(&mzs[0], nMzs, mz_err[k]);
        size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], fixedPoint);

        double
            decoded[7];
        size_t numDecoded = ms::numpress::MSNumpress::decodeLinear(&encoded[0], encodedBytes, &decoded[0]);

        assert( encodedLength[k] == encodedBytes);
        assert(7 == numDecoded);
        assert(abs(100.0 - decoded[0]) < mz_err[k]);
        assert(abs(200.0 - decoded[1]) < mz_err[k]);
        assert(abs(300.00005 - decoded[2]) < mz_err[k]);
        assert(abs(400.00010 - decoded[3]) < mz_err[k]);
        assert(abs(450.00010 - decoded[4]) < mz_err[k]);
        assert(abs(455.00010 - decoded[5]) < mz_err[k]);
        assert(abs(700.00010 - decoded[6]) < mz_err[k]);
    }

    cout << "+ pass decodeLinearNiceLowFP " << endl << endl;
}

void decodeLinearWierd() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] = 300.00005;
    mzs[3] = 0.00010;

    size_t nMzs = 4;
    unsigned char encoded[28];
    double fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPoint(&mzs[0], nMzs);
    size_t encodedBytes = ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], fixedPoint);

    double decoded[4];
    size_t numDecoded = ms::numpress::MSNumpress::decodeLinear(&encoded[0], encodedBytes, &decoded[0]);

    assert(4 == numDecoded);
    assert(abs(100.0 - decoded[0]) < 0.000005);
    assert(abs(200.0 - decoded[1]) < 0.000005);
    assert(abs(300.00005 - decoded[2]) < 0.000005);
    assert(abs(0.00010 - decoded[3]) < 0.000005);

    cout << "+ pass decodeLinearWierd " << endl << endl;
}

void decodeLinearWierd_llong_overflow() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] = 30000000.00005;
    mzs[3] = 0.000010;

    size_t nMzs = 4;
    unsigned char encoded[28];

    try {
        ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], 1000000000000);
        cout << "- fail test decodeLinearWierd_llong_overflow: didn't throw exception for corrupt input " << endl << endl;
        assert(0 == 1);
    } catch (const char *err) {
        assert( std::string(err) == std::string("[MSNumpress::encodeLinear] Next number overflows LLONG_MAX."));
    }

    cout << "+ pass decodeLinearWierd_llong_overflow " << endl << endl;
}

void decodeLinearWierd_int_overflow() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] =
        30000000.00005;
    mzs[3] = 0.00006;

    size_t nMzs = 4;
    unsigned char encoded[28];

    try {
        ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], 1000000000);
        cout << "- fail test decodeLinearWierd_int_overflow: didn't throw exception for corrupt input " << endl << endl;
        assert(0 == 1);
    } catch (const char *err) {
        assert( std::string(err) == std::string("[MSNumpress::encodeLinear] Cannot encode a number that exceeds the bounds of [-INT_MAX, INT_MAX]."));
    }

    cout << "+ pass decodeLinearWierd_int_overflow " << endl << endl;
}

void decodeLinearWierd_int_underflow() {
    double mzs[4];
    mzs[0] = 30000000.00005;
    mzs[1] = 60000000.00005;
    mzs[2] = 30000000.00005;
    mzs[3] = 0.00006;

    size_t nMzs = 4;
    unsigned char encoded[28];

    try {
        ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], 1000000000);
        cout << "- fail test decodeLinearWierd_int_underflow: didn't throw exception for corrupt input " << endl << endl;
        assert(0 == 1);
    } catch (const char *err) {
        assert( std::string(err) == std::string("[MSNumpress::encodeLinear] Cannot encode a number that exceeds the bounds of [-INT_MAX, INT_MAX]."));
    }

    cout << "+ pass decodeLinearWierd_int_underflow " << endl << endl;
}

void decodeLinearCorrupt1() {
    unsigned char encoded[20];
    double decoded[4];

    try {
        ms::numpress::MSNumpress::decodeLinear(&encoded[0], 11, &decoded[0]);
        cout << "- fail decodeLinearCorrupt1: didn't throw exception for corrupt input " << endl << endl;
        assert(0 == 1);
    } catch (const char *err) {
    }

    try {
        ms::numpress::MSNumpress::decodeLinear(&encoded[0], 14, &decoded[0]);
        cout << "- fail decodeLinearCorrupt1: didn't throw exception for corrupt input " << endl << endl;
        assert(0 == 1);
    } catch (const char *err) {
    }

    cout << "+ pass decodeLinearCorrupt 1 " << endl << endl;
}

void decodeLinearCorrupt2() {
    double mzs[4];
    mzs[0] = 100.0;
    mzs[1] = 200.0;
    mzs[2] = 300.00005;
    mzs[3] = 0.00010;

    size_t nMzs = 4;
    unsigned char encoded[28];
    double fixedPoint = ms::numpress::MSNumpress::optimalLinearFixedPoint(&mzs[0], nMzs);
    size_t encodedBytes
= ms::numpress::MSNumpress::encodeLinear(&mzs[0], nMzs, &encoded[0], fixedPoint); double decoded[4]; try { ms::numpress::MSNumpress::decodeLinear(&encoded[0], encodedBytes-1, &decoded[0]); cout << "- fail decodeLinearCorrupt2: didn't throw exception for corrupt input " << endl << endl; assert(0 == 1); } catch (const char *err) { } cout << "+ pass decodeLinearCorrupt 2 " << endl << endl; } void optimalLinearFixedPoint() { srand(123459); size_t n = 1000; double mzs[1000]; mzs[0] = 300 + (rand() % 1000) / 1000.0; for (size_t i=1; i= mLim) { cout << "error " << error << " above limit " << mLim << endl; assert(error < mLim); } } cout << "+ size compressed: " << encodedBytes / double(n*8) * 100 << "% " << endl; cout << "+ max error: " << m << " limit: " << mLim << endl; cout << "+ pass encodeDecodeLinearStraight " << endl << endl; } void encodeDecodeSafeStraight() { double error; double eLim = 1.0e-300; size_t n = 15; double mzs[15]; for (size_t i=0; i= eLim) { cout << "error " << error << " is non-zero ( >= " << eLim << " )" << endl; assert(error == 0); } } cout << "+ pass encodeDecodeSafeStraight " << endl << endl; } void encodeDecodeSafe() { srand(123459); double error; double eLim = 1.0e-300; size_t n = 1000; double mzs[1000]; mzs[0] = 300 + rand() / double(RAND_MAX); for (size_t i=1; i= eLim) { cout << "error " << error << " is non-zero ( >= " << eLim << " )" << endl; assert(error == 0); } } cout << "+ pass encodeDecodeSafe " << endl << endl; } void encodeDecodeLinear() { srand(123459); size_t n = 1000; double mzs[1000]; mzs[0] = 300 + rand() / double(RAND_MAX); for (size_t i=1; i= mLim) { cout << "error " << error << " above limit " << mLim << endl; assert(error < mLim); } } cout << "+ size compressed: " << encodedBytes / double(n*8) * 100 << "% " << endl; cout << "+ max error: " << m << " limit: " << mLim << endl; cout << "+ pass encodeDecodeLinear " << endl << endl; } void encodeDecodeLinear5() { srand(123662); size_t n = 1000; double mzs[1000]; mzs[0] = 100 + 
(rand() % 1000) / 1000.0; for (size_t i=1; i= mLim) { cout << endl << ics[i] << " " << decoded[i] << endl; assert(error < mLim); } } else { error = abs((ics[i] - decoded[i]) / ((ics[i] + decoded[i])/2)); rm = max(rm, error); if (error >= rmLim) { cout << endl << ics[i] << " " << decoded[i] << endl; assert(error < rmLim); } } cout << "+ max error: " << m << " limit: " << mLim << endl; cout << "+ max rel error: " << rm << " limit: " << rmLim << endl; cout << "+ pass encodeDecodeSlof " << endl << endl; } void encodeDecodeSlof5() { srand(123459); size_t n = 1000; double ics[1000]; for (size_t i=0; i result; // set data to [ 100, 102, 140, 92, 33, 80, 145 ]; // Base64 is "ZGaMXCFQkQ==" std::vector data; data.resize(32); data[0] = 100; data[1] = 102; data[2] = 140; data[3] = 92; data[4] = 33; data[5] = 80; data[6] = 145; try { ms::numpress::MSNumpress::decodePic(data, result); cout << "- fail testErroneousDecodePic: didn't throw exception for corrupt input " << endl << endl; assert(0 == 1); } catch (const char *err) { } cout << "+ pass testErroneousDecodePic " << endl << endl; } int main(int argc, const char* argv[]) { optimalLinearFixedPoint(); optimalLinearFixedPointMass(); encodeLinear1(); encodeLinear(); decodeLinearNice(); decodeLinearNiceLowFP(); decodeLinearWierd(); decodeLinearCorrupt1(); decodeLinearCorrupt2(); encodeDecodeLinearStraight(); encodeDecodeLinear(); encodeDecodePic(); encodeDecodeSafeStraight(); encodeDecodeSafe(); optimalSlofFixedPoint(); encodeDecodeSlof(); encodeDecodeLinear5(); encodeDecodePic5(); encodeDecodeSlof5(); testErroneousDecodePic(); cout << "=== all tests succeeded! ===" << endl; return 0; } libmsnumpress-1.0.0/src/main/cpp/MSNumpress.hpp0000644000175000017500000002654613211321504021430 0ustar rusconirusconi/* MSNumpress.hpp johan.teleman@immun.lth.se Copyright 2013 Johan Teleman Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 */
/*
    ==================== encodeInt ====================
    Some of the encodings described below use an integer compression referred to simply as

    encodeInt()

    This encoding works on a 4 byte integer, by truncating initial zeros or ones.
    If the initial (most significant) half byte is 0x0 or 0xf, the number of such
    halfbytes starting from the most significant is stored in a halfbyte. This initial
    count is then followed by the rest of the int's halfbytes, in little-endian order.
    A count halfbyte c of

        0 <= c <= 8     is interpreted as an initial c      0x0 halfbytes
        9 <= c <= 15    is interpreted as an initial (c-8)  0xf halfbytes

    Ex:
        int     c       rest
        0   =>  0x8
        -1  =>  0xf     0xf
        23  =>  0x6     0x7 0x1
 */

#ifndef _MSNUMPRESS_HPP_
#define _MSNUMPRESS_HPP_

#include <cstddef>
#include <vector>

// defines whether to throw an exception when a number cannot be encoded safely
// with the given parameters
#ifndef THROW_ON_OVERFLOW
#define THROW_ON_OVERFLOW true
#endif

namespace ms {
namespace numpress {

namespace MSNumpress {

    /**
     * Compute the maximal linear fixed point that prevents integer overflow.
     *
     * @data        pointer to array of double to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     *
     * @return      the linear fixed point safe to use
     */
    double optimalLinearFixedPoint(
        const double *data,
        size_t dataSize);

    /**
     * Compute the optimal linear fixed point with a desired m/z accuracy.
     *
     * @note If the desired accuracy cannot be reached without overflowing 64
     * bit integers, then a negative value is returned.
       You need to check for
     * this and in that case abandon numpress or use optimalLinearFixedPoint
     * which returns the largest safe value.
     *
     * @data        pointer to array of double to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     * @mass_acc    desired m/z accuracy in Th
     *
     * @return      the linear fixed point that satisfies the accuracy requirement (or -1 in case of failure).
     */
    double optimalLinearFixedPointMass(
        const double *data,
        size_t dataSize,
        double mass_acc);

    /**
     * Encodes the doubles in data by first using a
     *   - lossy conversion to a 4 byte 5 decimal fixed point representation
     *   - storing the residuals from a linear prediction after first two values
     *   - encoding by encodeInt (see above)
     *
     * The resulting binary is maximally 8 + dataSize * 5 bytes, but much less if the
     * data is reasonably smooth on the first order.
     *
     * This encoding is suitable for typical m/z or retention time binary arrays.
     * On a test set, the encoding was empirically shown to be accurate to at least 0.002 ppm.
     *
     * @data        pointer to array of double to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     * @result      pointer to where resulting bytes should be stored
     * @fixedPoint  the scaling factor used for getting the fixed point repr.
     *              This is stored in the binary and automatically extracted
     *              on decoding.
     * @return      the number of encoded bytes
     */
    size_t encodeLinear(
        const double *data,
        const size_t dataSize,
        unsigned char *result,
        double fixedPoint);

    /**
     * Calls lower level encodeLinear while handling vector sizes appropriately
     *
     * @data        vector of doubles to be encoded
     * @result      vector of resulting bytes (will be resized to the number of bytes)
     */
    void encodeLinear(
        const std::vector<double> &data,
        std::vector<unsigned char> &result,
        double fixedPoint);

    /**
     * Decodes data encoded by encodeLinear.
     *
     * result vector guaranteed to be shorter or equal to (|data| - 8) * 2
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt, i.e.
     * that the last encoded int does not use the last byte in the data. In addition the last encoded
     * int needs to use either the last halfbyte, or the second last followed by a 0x0 halfbyte.
     *
     * @data        pointer to array of bytes to be decoded (must be contiguous in memory)
     * @dataSize    number of bytes from *data to decode
     * @result      pointer to where resulting doubles should be stored
     * @return      the number of decoded doubles, or -1 if dataSize < 4 or 4 < dataSize < 8
     */
    size_t decodeLinear(
        const unsigned char *data,
        const size_t dataSize,
        double *result);

    /**
     * Calls lower level decodeLinear while handling vector sizes appropriately
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt, i.e.
     * that the last encoded int does not use the last byte in the data. In addition the last encoded
     * int needs to use either the last halfbyte, or the second last followed by a 0x0 halfbyte.
     *
     * @data        vector of bytes to be decoded
     * @result      vector of resulting double (will be resized to the number of doubles)
     */
    void decodeLinear(
        const std::vector<unsigned char> &data,
        std::vector<double> &result);

/////////////////////////////////////////////////////////////

    /**
     * Encodes the doubles in data by storing the residuals from a linear prediction after first two values.
     *
     * The resulting binary is the same size as the input data.
     *
     * This encoding is suitable for typical m/z or retention time binary arrays, and is
     * intended to be used before zlib compression to improve compression.
     *
     * @data        pointer to array of doubles to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     * @result      pointer to where resulting bytes should be stored
     */
    size_t encodeSafe(
        const double *data,
        const size_t dataSize,
        unsigned char *result);

    /**
     * Decodes data encoded by encodeSafe.
     *
     * result vector is the same size as the input data.
     *
     * Might throw const char* if something goes wrong during decoding.
     *
     * @data        pointer to array of bytes to be decoded (must be contiguous in memory)
     * @dataSize    number of bytes from *data to decode
     * @result      pointer to where resulting doubles should be stored
     * @return      the number of decoded bytes
     */
    size_t decodeSafe(
        const unsigned char *data,
        const size_t dataSize,
        double *result);

/////////////////////////////////////////////////////////////

    /**
     * Encodes ion counts by simply rounding to the nearest 4 byte integer,
     * and compressing each integer with encodeInt.
     *
     * The handleable range is therefore 0 -> 4294967294.
     * The resulting binary is maximally dataSize * 5 bytes, but much less if the
     * data is close to 0 on average.
     *
     * @data        pointer to array of double to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     * @result      pointer to where resulting bytes should be stored
     * @return      the number of encoded bytes
     */
    size_t encodePic(
        const double *data,
        const size_t dataSize,
        unsigned char *result);

    /**
     * Calls lower level encodePic while handling vector sizes appropriately
     *
     * @data        vector of doubles to be encoded
     * @result      vector of resulting bytes (will be resized to the number of bytes)
     */
    void encodePic(
        const std::vector<double> &data,
        std::vector<unsigned char> &result);

    /**
     * Decodes data encoded by encodePic
     *
     * result vector guaranteed to be shorter or equal to |data| * 2
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt, i.e.
     * that the last encoded int does not use the last byte in the data. In addition the last encoded
     * int needs to use either the last halfbyte, or the second last followed by a 0x0 halfbyte.
     *
     * @data        pointer to array of bytes to be decoded (must be contiguous in memory)
     * @dataSize    number of bytes from *data to decode
     * @result      pointer to where resulting doubles should be stored
     * @return      the number of decoded doubles
     */
    size_t decodePic(
        const unsigned char *data,
        const size_t dataSize,
        double *result);

    /**
     * Calls lower level decodePic while handling vector sizes appropriately
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt, i.e.
     * that the last encoded int does not use the last byte in the data. In addition the last encoded
     * int needs to use either the last halfbyte, or the second last followed by a 0x0 halfbyte.
     *
     * @data        vector of bytes to be decoded
     * @result      vector of resulting double (will be resized to the number of doubles)
     */
    void decodePic(
        const std::vector<unsigned char> &data,
        std::vector<double> &result);

/////////////////////////////////////////////////////////////

    double optimalSlofFixedPoint(
        const double *data,
        size_t dataSize);

    /**
     * Encodes ion counts by taking the natural logarithm, and storing a
     * fixed point representation of this. This is calculated as
     *
     * unsigned short fp = log(d + 1) * fixedPoint + 0.5
     *
     * the result vector is exactly |data| * 2 + 8 bytes long
     *
     * @data        pointer to array of double to be encoded (must be contiguous in memory)
     * @dataSize    number of doubles from *data to encode
     * @result      pointer to where resulting bytes should be stored
     * @return      the number of encoded bytes
     */
    size_t encodeSlof(
        const double *data,
        const size_t dataSize,
        unsigned char *result,
        double fixedPoint);

    /**
     * Calls lower level encodeSlof while handling vector sizes appropriately
     *
     * @data        vector of doubles to be encoded
     * @result      vector of resulting bytes (will be resized to the number of bytes)
     */
    void encodeSlof(
        const std::vector<double> &data,
        std::vector<unsigned char> &result,
        double fixedPoint);

    /**
     * Decodes data encoded by encodeSlof
     *
     * The return will include exactly (|data| - 8) / 2 doubles.
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt.
     *
     * @data        pointer to array of bytes to be decoded (must be contiguous in memory)
     * @dataSize    number of bytes from *data to decode
     * @result      pointer to where resulting doubles should be stored
     * @return      the number of decoded doubles
     */
    size_t decodeSlof(
        const unsigned char *data,
        const size_t dataSize,
        double *result);

    /**
     * Calls lower level decodeSlof while handling vector sizes appropriately
     *
     * Note that this method may throw a const char* if it deems the input data to be corrupt.
     *
     * @data        vector of bytes to be decoded
     * @result      vector of resulting double (will be resized to the number of doubles)
     */
    void decodeSlof(
        const std::vector<unsigned char> &data,
        std::vector<double> &result);

} // namespace MSNumpress
} // namespace numpress
} // namespace ms

#endif // _MSNUMPRESS_HPP_

/*
    MSNumpress.cpp
    johan.teleman@immun.lth.se

    Copyright 2013 Johan Teleman

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 */

// Standard headers inferred from the symbols used in this file.
#include <climits>
#include <cmath>
#include <algorithm>
#include <iostream>
#include "MSNumpress.hpp"

namespace ms {
namespace numpress {
namespace MSNumpress {

using std::cout;
using std::cerr;
using std::endl;
using std::min;
using std::max;
using std::abs;

// This is only valid on systems where ints use more bytes than chars...
const int ONE = 1;
static bool is_little_endian() {
    return *((char*)&(ONE)) == 1;
}
bool IS_LITTLE_ENDIAN = is_little_endian();

/////////////////////////////////////////////////////////////

static void encodeFixedPoint(
        double fixedPoint,
        unsigned char *result
) {
    int i;
    unsigned char *fp = (unsigned char*)&fixedPoint;
    for (i=0; i<8; i++) {
        result[i] = fp[IS_LITTLE_ENDIAN ? (7-i) : i];
    }
}

static double decodeFixedPoint(
        const unsigned char *data
) {
    int i;
    double fixedPoint;
    unsigned char *fp = (unsigned char*)&fixedPoint;

    for (i=0; i<8; i++) {
        fp[i] = data[IS_LITTLE_ENDIAN ? (7-i) : i];
    }

    return fixedPoint;
}

/////////////////////////////////////////////////////////////

/**
 * Encodes the int x as a number of halfbytes in res.
 * res_length is incremented by the number of halfbytes,
 * which will be 1 <= n <= 9
 */
static void encodeInt(
        const unsigned int x,
        unsigned char* res,
        size_t *res_length
) {
    // get the bit pattern of a signed int x_inp
    unsigned int m;
    unsigned char i, l; // numbers between 0 and 9
    unsigned int mask = 0xf0000000;
    unsigned int init = x & mask;

    if (init == 0) {
        l = 8;
        for (i=0; i<8; i++) {
            m = mask >> (4*i);
            if ((x & m) != 0) {
                l = i;
                break;
            }
        }
        res[0] = l;
        for (i=l; i<8; i++) {
            res[1+i-l] = static_cast<unsigned char>( x >> (4*(i-l)) );
        }
        *res_length += 1+8-l;

    } else if (init == mask) {
        l = 7;
        for (i=0; i<8; i++) {
            m = mask >> (4*i);
            if ((x & m) != m) {
                l = i;
                break;
            }
        }
        res[0] = l + 8;
        for (i=l; i<8; i++) {
            res[1+i-l] = static_cast<unsigned char>( x >> (4*(i-l)) );
        }
        *res_length += 1+8-l;

    } else {
        res[0] = 0;
        for (i=0; i<8; i++) {
            res[1+i] = static_cast<unsigned char>( x >> (4*i) );
        }
        *res_length += 9;
    }
}

/**
 * Decodes an int from the half bytes in bp.
   Lossless reverse of encodeInt
 *
 * @param data ptr to the char data to decode
 * @param di position in the char data array to start decoding (will be advanced)
 * @param max_di size of data array
 * @param half helper variable (do not change between multiple calls)
 * @param res result (a 32 bit integer)
 *
 * @note the helper variable indicates whether we look at the first half byte
 * or second half byte of the current data (thus whether to interpret the first
 * half byte of data[*di] or the second half byte).
 *
 */
static void decodeInt(
        const unsigned char *data,
        size_t *di,
        size_t max_di,
        size_t *half,
        unsigned int *res
) {
    size_t n, i;
    unsigned int mask, m;
    unsigned char head;
    unsigned char hb;

    // Extract the first half byte, specifying the number of leading zero half
    // bytes of the final integer.
    // If half is zero, we look at the first half byte, otherwise we look at
    // the second (lower) half byte and advance the counter to the next char.
    if (*half == 0) {
        head = data[*di] >> 4;
    } else {
        head = data[*di] & 0xf;
        (*di)++;
    }

    *half = 1-(*half); // switch to other half byte
    *res = 0;

    if (head <= 8) {
        n = head;
    } else { // we have n leading ones, fill n half bytes in res with 0xf
        n = head - 8;
        mask = 0xf0000000;
        for (i=0; i<n; i++) {
            m = mask >> (4*i);
            *res = *res | m;
        }
    }

    if (n == 8) {
        return;
    }

    if (*di + ((8 - n) - (1 - *half)) / 2 >= max_di) {
        throw "[MSNumpress::decodeInt] Corrupt input data! ";
    }

    for (i=n; i<8; i++) {
        if (*half == 0) {
            hb = data[*di] >> 4;
        } else {
            hb = data[*di] & 0xf;
            (*di)++;
        }
        *res = *res | ( static_cast<unsigned int>(hb) << ((i-n)*4));
        *half = 1 - (*half);
    }
}

/////////////////////////////////////////////////////////////

double optimalLinearFixedPointMass(
        const double *data,
        size_t dataSize,
        double mass_acc
) {
    if (dataSize < 3) return 0; // we just encode the first two points as floats

    // We calculate the maximal fixedPoint we need to achieve a specific mass
    // accuracy. Note that the maximal error we will make by encoding as int is
    // 0.5 due to rounding errors.
double maxFP = 0.5 / mass_acc; // There is a maximal value for the FP given by the int length (32bit) // which means we cannot choose a value higher than that. In case we cannot // achieve the desired accuracy, return failure (-1). double maxFP_overflow = optimalLinearFixedPoint(data, dataSize); if (maxFP > maxFP_overflow) return -1; return maxFP; } double optimalLinearFixedPoint( const double *data, size_t dataSize ) { /* * safer impl - apparently not needed though * if (dataSize == 0) return 0; double maxDouble = 0; double x; for (size_t i=0; i(data[0] * fixedPoint + 0.5); for (i=0; i<4; i++) { result[8+i] = (ints[1] >> (i*8)) & 0xff; } if (dataSize == 1) return 12; ints[2] = static_cast(data[1] * fixedPoint + 0.5); for (i=0; i<4; i++) { result[12+i] = (ints[2] >> (i*8)) & 0xff; } halfByteCount = 0; ri = 16; for (i=2; i LLONG_MAX ) { throw "[MSNumpress::encodeLinear] Next number overflows LLONG_MAX."; } ints[2] = static_cast(data[i] * fixedPoint + 0.5); extrapol = ints[1] + (ints[1] - ints[0]); if (THROW_ON_OVERFLOW && ( ints[2] - extrapol > INT_MAX || ints[2] - extrapol < INT_MIN )) { throw "[MSNumpress::encodeLinear] Cannot encode a number that exceeds the bounds of [-INT_MAX, INT_MAX]."; } diff = static_cast(ints[2] - extrapol); //printf("%lu %lu %lu, extrapol: %ld diff: %d \n", ints[0], ints[1], ints[2], extrapol, diff); encodeInt( static_cast(diff), &halfBytes[halfByteCount], &halfByteCount ); /* printf("%d (%d): ", diff, (int)halfByteCount); for (size_t j=0; j( (halfBytes[hbi-1] << 4) | (halfBytes[hbi] & 0xf) ); //printf("%x \n", result[ri]); ri++; } if (halfByteCount % 2 != 0) { halfBytes[0] = halfBytes[halfByteCount-1]; halfByteCount = 1; } else { halfByteCount = 0; } } if (halfByteCount == 1) { result[ri] = static_cast(halfBytes[0] << 4); ri++; } return ri; } size_t decodeLinear( const unsigned char *data, const size_t dataSize, double *result ) { size_t i; size_t ri = 0; unsigned int init, buff; int diff; long long ints[3]; //double d; size_t di; size_t 
half;
    long long extrapol;
    long long y;
    double fixedPoint;

    //printf("Decoding %d bytes with fixed point %f\n", (int)dataSize, fixedPoint);

    if (dataSize == 8) return 0;

    if (dataSize < 8)
        throw "[MSNumpress::decodeLinear] Corrupt input data: not enough bytes to read fixed point! ";

    fixedPoint = decodeFixedPoint(data);

    if (dataSize < 12)
        throw "[MSNumpress::decodeLinear] Corrupt input data: not enough bytes to read first value! ";

    ints[1] = 0;
    for (i=0; i<4; i++) {
        ints[1] = ints[1] | ((0xff & (init = data[8+i])) << (i*8));
    }
    result[0] = ints[1] / fixedPoint;

    if (dataSize == 12) return 1;
    if (dataSize < 16)
        throw "[MSNumpress::decodeLinear] Corrupt input data: not enough bytes to read second value! ";

    ints[2] = 0;
    for (i=0; i<4; i++) {
        ints[2] = ints[2] | ((0xff & (init = data[12+i])) << (i*8));
    }
    result[1] = ints[2] / fixedPoint;

    half = 0;
    ri = 2;
    di = 16;

    //printf(" di ri half int[0] int[1] extrapol diff\n");

    while (di < dataSize) {
        if (di == (dataSize - 1) && half == 1) {
            if ((data[di] & 0xf) == 0x0) {
                break;
            }
        }
        //printf("%7d %7d %7d %lu %lu %ld", di, ri, half, ints[0], ints[1], extrapol);

        ints[0] = ints[1];
        ints[1] = ints[2];
        decodeInt(data, &di, dataSize, &half, &buff);
        diff = static_cast<int>(buff);

        extrapol = ints[1] + (ints[1] - ints[0]);
        y = extrapol + diff;
        //printf(" %d \n", diff);
        result[ri++] = y / fixedPoint;
        ints[2] = y;
    }

    return ri;
}

void encodeLinear(
        const std::vector<double> &data,
        std::vector<unsigned char> &result,
        double fixedPoint
) {
    size_t dataSize = data.size();
    result.resize(dataSize * 5 + 8);
    size_t encodedLength = encodeLinear(&data[0], dataSize, &result[0], fixedPoint);
    result.resize(encodedLength);
}

void decodeLinear(
        const std::vector<unsigned char> &data,
        std::vector<double> &result
) {
    size_t dataSize = data.size();
    result.resize((dataSize - 8) * 2);
    size_t decodedLength = decodeLinear(&data[0], dataSize, &result[0]);
    result.resize(decodedLength);
}

/////////////////////////////////////////////////////////////

size_t encodeSafe(
        const double *data,
        const size_t dataSize,
        unsigned
char *result ) { size_t i, j, ri = 0; double latest[3]; double extrapol, diff; const unsigned char *fp; //printf("d0 d1 d2 extrapol diff\n"); if (dataSize == 0) return ri; latest[1] = data[0]; fp = (unsigned char*)data; for (i=0; i<8; i++) { result[ri++] = fp[IS_LITTLE_ENDIAN ? (7-i) : i]; } if (dataSize == 1) return ri; latest[2] = data[1]; fp = (unsigned char*)&(data[1]); for (i=0; i<8; i++) { result[ri++] = fp[IS_LITTLE_ENDIAN ? (7-i) : i]; } fp = (unsigned char*)&diff; for (i=2; i INT_MAX || data[i] < -0.5) ){ throw "[MSNumpress::encodePic] Cannot use Pic to encode a number larger than INT_MAX or smaller than 0."; } x = static_cast(data[i] + 0.5); //printf("%d %d %d, extrapol: %d diff: %d \n", ints[0], ints[1], ints[2], extrapol, diff); encodeInt(x, &halfBytes[halfByteCount], &halfByteCount); for (hbi=1; hbi < halfByteCount; hbi+=2) { result[ri] = static_cast( (halfBytes[hbi-1] << 4) | (halfBytes[hbi] & 0xf) ); //printf("%x \n", result[ri]); ri++; } if (halfByteCount % 2 != 0) { halfBytes[0] = halfBytes[halfByteCount-1]; halfByteCount = 1; } else { halfByteCount = 0; } } if (halfByteCount == 1) { result[ri] = static_cast(halfBytes[0] << 4); ri++; } return ri; } size_t decodePic( const unsigned char *data, const size_t dataSize, double *result ) { size_t ri; unsigned int x; size_t di; size_t half; //printf("ri di half dSize count\n"); half = 0; ri = 0; di = 0; while (di < dataSize) { if (di == (dataSize - 1) && half == 1) { if ((data[di] & 0xf) == 0x0) { break; } } decodeInt(&data[0], &di, dataSize, &half, &x); //printf("%7d %7d %7d %7d %7d\n", ri, di, half, dataSize, count); //printf("count: %d \n", count); result[ri++] = static_cast(x); } return ri; } void encodePic( const std::vector &data, std::vector &result ) { size_t dataSize = data.size(); result.resize(dataSize * 5); size_t encodedLength = encodePic(&data[0], dataSize, &result[0]); result.resize(encodedLength); } void decodePic( const std::vector &data, std::vector &result ) { size_t dataSize = 
data.size(); result.resize(dataSize * 2); size_t decodedLength = decodePic(&data[0], dataSize, &result[0]); result.resize(decodedLength); } ///////////////////////////////////////////////////////////// double optimalSlofFixedPoint( const double *data, size_t dataSize ) { if (dataSize == 0) return 0; double maxDouble = 1; double x; double fp; for (size_t i=0; i USHRT_MAX ) { throw "[MSNumpress::encodeSlof] Cannot encode a number that overflows USHRT_MAX."; } x = static_cast(temp + 0.5); result[ri++] = x & 0xff; result[ri++] = (x >> 8) & 0xff; } return ri; } size_t decodeSlof( const unsigned char *data, const size_t dataSize, double *result ) { size_t i, ri; unsigned short x; double fixedPoint; if (dataSize < 8) throw "[MSNumpress::decodeSlof] Corrupt input data: not enough bytes to read fixed point! "; ri = 0; fixedPoint = decodeFixedPoint(data); for (i=8; i(data[i] | (data[i+1] << 8)); result[ri++] = exp(x / fixedPoint) - 1; } return ri; } void encodeSlof( const std::vector &data, std::vector &result, double fixedPoint ) { size_t dataSize = data.size(); result.resize(dataSize * 2 + 8); size_t encodedLength = encodeSlof(&data[0], dataSize, &result[0], fixedPoint); result.resize(encodedLength); } void decodeSlof( const std::vector &data, std::vector &result ) { size_t dataSize = data.size(); result.resize((dataSize - 8) / 2); size_t decodedLength = decodeSlof(&data[0], dataSize, &result[0]); result.resize(decodedLength); } } } // namespace numpress } // namespace ms