[IOTDB-5443] Implement Chimp encoding in IoTDB #8766

Merged
merged 15 commits into apache:master on Jan 30, 2023

Conversation

@panagiotisl (Contributor) commented Jan 5, 2023

Description

This PR adds the Chimp compression algorithm for double and single precision floating point data.

Algorithm

Chimp was recently presented at VLDB 2022 (https://www.vldb.org/pvldb/vol15/p3058-liakos.pdf):
Panagiotis Liakos, Katia Papakonstantinopoulou, Yannis Kotidis:
Chimp: Efficient Lossless Floating Point Compression for Time Series Databases. Proc. VLDB Endow. 15(11): 3058-3070 (2022).
The algorithm focuses exclusively on floating-point data and takes advantage of more than one previously encountered value, allowing it to significantly outperform the state-of-the-art Gorilla algorithm in terms of compression ratio while preserving its speed.
The implementations provided here focus on the Chimp128 variation for double precision, which uses the 128 most recent earlier values, and on Chimp64 for single precision, which uses 64 earlier values.
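To make the intuition concrete, here is a deliberately simplified, self-contained sketch (not the PR's actual bit-level implementation) of the core idea: instead of XOR-ing each value only with its immediate predecessor as Gorilla does, Chimp128 considers up to 128 earlier values and picks the one whose XOR leaves the fewest meaningful bits. The class and method names below are made up for illustration.

```java
// Simplified illustration of the Chimp128 idea; not the PR's actual encoder.
import java.util.ArrayDeque;
import java.util.Deque;

public class ChimpIdeaSketch {

  private static final int PREVIOUS_VALUES = 128;

  // The most recent raw double bit patterns, oldest first.
  private final Deque<Long> history = new ArrayDeque<>(PREVIOUS_VALUES);

  /** Returns the XOR against the earlier value that leaves the most trailing zeros. */
  public long bestXor(double value) {
    long bits = Double.doubleToRawLongBits(value);
    long best;
    if (history.isEmpty()) {
      best = bits; // first value: nothing to XOR against, store the raw bits
    } else {
      best = bits ^ history.peekLast(); // Gorilla's choice: the immediately preceding value
      for (long previous : history) {
        long candidate = bits ^ previous;
        if (Long.numberOfTrailingZeros(candidate) > Long.numberOfTrailingZeros(best)) {
          best = candidate; // Chimp128: a better match among the 128 earlier values
        }
      }
    }
    if (history.size() == PREVIOUS_VALUES) {
      history.removeFirst(); // keep only the 128 most recent values
    }
    history.addLast(bits);
    return best; // a real encoder now emits this XOR using a compact flag/leading-zero scheme
  }
}
```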

Indicative results for many different time series datasets, highlighting the significant space savings expected from adopting Chimp:
[Figure: compression ratio achieved by Chimp and Gorilla across the evaluated datasets]

Adoption

Chimp is already used in the latest releases of DuckDB (duckdb/duckdb#4878).

Implementation

The algorithm is implemented so as to reuse code from GorillaEncoderV2.java and GorillaDecoderV2.java, and Long, DoublePrecision, Int, and Float versions have been built on top of it. Method organization, design, and naming follow the respective Gorilla classes.

Testing

A new class named ChimpDecoderTest executes all tests implemented in the GorillaDecoderV2Test class, ensuring the code-coverage threshold is met.
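For illustration, a minimal round-trip check in the spirit of these tests might look as follows; it assumes the Chimp classes follow the same Encoder/Decoder contract as the Gorilla V2 classes (encode/flush into a ByteArrayOutputStream, readDouble/hasNext from a ByteBuffer), which is an assumption rather than a quote from the PR.

```java
// Hedged sketch of a Chimp encode/decode round-trip test (JUnit 4).
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

import org.apache.iotdb.tsfile.encoding.decoder.DoublePrecisionChimpDecoder;
import org.apache.iotdb.tsfile.encoding.encoder.DoublePrecisionChimpEncoder;
import org.junit.Test;

public class ChimpRoundTripSketchTest {

  @Test
  public void encodeThenDecodeReturnsOriginalValues() throws Exception {
    double[] values = {15.5, 14.0625, 3.25, 4.25, 3.25, 4.25, -1.0, 0.0};

    // Encode all values into an in-memory byte array.
    DoublePrecisionChimpEncoder encoder = new DoublePrecisionChimpEncoder();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (double v : values) {
      encoder.encode(v, out);
    }
    encoder.flush(out);

    // Decode them back and check the round trip is lossless.
    DoublePrecisionChimpDecoder decoder = new DoublePrecisionChimpDecoder();
    ByteBuffer buffer = ByteBuffer.wrap(out.toByteArray());
    for (double expected : values) {
      assertEquals(expected, decoder.readDouble(buffer), 0.0);
    }
    assertFalse(decoder.hasNext(buffer));
  }
}
```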


This PR has:

  • [x] been self-reviewed.
    • concurrent read
    • concurrent write
    • concurrent read and write
  • [ ] added documentation for new or modified features or behaviors.
  • [ ] added Javadocs for most classes and all non-trivial methods.
  • [ ] added or updated version, license, or notice information.
  • [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage.
  • [ ] added integration tests.
  • [ ] been tested in a test IoTDB cluster.

Key changed/added classes (or packages if there are too many classes) in this PR

tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/encoder/IntChimpEncoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/encoder/LongChimpEncoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/encoder/SinglePrecisionChimpEncoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/encoder/DoublePrecisionChimpEncoder.java

tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/decoder/IntChimpDecoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/decoder/LongChimpDecoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/decoder/SinglePrecisionChimpDecoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/decoder/DoublePrecisionChimpDecoder.java

@github-actions (bot) left a comment

Hi, this is your first pull request in IoTDB project. Thanks for your contribution! IoTDB will be better because of you.

@panagiotisl (Contributor, Author) commented

Some checks are failing, but I don't think my PR has anything to do with the failures.

@qiaojialin (Member) commented

Hi Panagiotis, thanks for your contribution! Please update the UserGuide so that users can see this encoding:

docs/UserGuide/Data-Concept/Encoding.md
docs/zh/UserGuide/Data-Concept/Encoding.md

You could just copy the English version into docs/zh/UserGuide/Data-Concept/Encoding.md, and we can translate it during review.

@HTHou (Contributor) commented Jan 7, 2023

Hi, there is more code to update in order to make Chimp usable in IoTDB (a rough sketch of the kind of change involved is shown after the list):

server/src/main/java/org/apache/iotdb/db/utils/SchemaUtils.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/encoder/TSEncodingBuilder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/encoding/decoder/Decoder.java
tsfile/src/main/java/org/apache/iotdb/tsfile/file/metadata/enums/TSEncoding.java getTsEncoding method
client-cpp/src/main/Session.h
client-py/iotdb/utils/IoTDBConstants.py
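Purely to illustrate the kind of wiring this list refers to, here is a hedged, self-contained sketch of registering a CHIMP constant in a byte-keyed enum lookup in the spirit of TSEncoding.getTsEncoding; the enum name, constants, and byte codes below are assumptions, not the actual IoTDB code.

```java
// Hypothetical illustration of adding a CHIMP constant to a byte-keyed encoding enum.
public enum EncodingSketch {
  PLAIN((byte) 0),
  GORILLA((byte) 8),
  CHIMP((byte) 11); // newly registered encoding; the real byte code is decided in the PR

  private final byte code;

  EncodingSketch(byte code) {
    this.code = code;
  }

  public byte serialize() {
    return code;
  }

  // Lookup used when deserializing the encoding type from a stored byte.
  public static EncodingSketch getTsEncoding(byte code) {
    for (EncodingSketch encoding : values()) {
      if (encoding.code == code) {
        return encoding;
      }
    }
    throw new IllegalArgumentException("Unknown encoding type: " + code);
  }
}
```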

@HTHou (Contributor) commented Jan 10, 2023

I did a simple test. After executing the following SQL:

[Screenshot: the SQL statements used for the test]

The TsFile size with Chimp is 243 bytes, while the TsFile size with Gorilla is 248 bytes.
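Since the screenshot is not reproduced here, the following is a hedged sketch of the kind of statements such a comparison might use, written against the IoTDB Java client's Session API; the series paths, inserted values, and the exact ENCODING=CHIMP / ENCODING=GORILLA clauses are assumptions for illustration, not the actual SQL from the screenshot.

```java
// Hedged sketch of comparing TsFile sizes for Chimp vs. Gorilla via the Java client.
import org.apache.iotdb.session.Session;

public class ChimpVsGorillaSizeSketch {
  public static void main(String[] args) throws Exception {
    Session session = new Session("127.0.0.1", 6667, "root", "root");
    session.open();

    // Two series with identical data but different encodings.
    session.executeNonQueryStatement(
        "CREATE TIMESERIES root.test.d1.chimp WITH DATATYPE=DOUBLE, ENCODING=CHIMP");
    session.executeNonQueryStatement(
        "CREATE TIMESERIES root.test.d1.gorilla WITH DATATYPE=DOUBLE, ENCODING=GORILLA");

    for (int t = 1; t <= 100; t++) {
      double value = 20.0 + Math.sin(t / 10.0);
      session.executeNonQueryStatement(
          "INSERT INTO root.test.d1(timestamp, chimp, gorilla) VALUES ("
              + t + ", " + value + ", " + value + ")");
    }

    // Flush so the data reaches TsFiles, whose on-disk sizes can then be compared.
    session.executeNonQueryStatement("FLUSH");
    session.close();
  }
}
```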

@panagiotisl (Contributor, Author) commented

I have used the above example to run tests with 4 different configurations, using the first 1,000 values of the basel-wind dataset (see the figure in the PR description). The results are:

Chimp (with snappy): 4695 bytes
Gorilla (with snappy): 8527 bytes
Chimp (uncompressed): 4812 bytes
Gorilla (uncompressed): 8653 bytes

@panagiotisl (Contributor, Author) commented Jan 10, 2023

Also, some timing comparisons (in milliseconds) with three different datasets:

5,000,000 values (Stocks-Germany)

GORILLA Encoding time: 1163
GORILLA Decoding time: 158
CHIMP Encoding time: 553
CHIMP Decoding time: 114

2,905,887 values (city-temperature)

GORILLA Encoding time: 855
GORILLA Decoding time: 97
CHIMP Encoding time: 412
CHIMP Decoding time: 99

8,927 values (SSD-benchmark)

GORILLA Encoding time: 28
GORILLA Decoding time: 7
CHIMP Encoding time: 7
CHIMP Decoding time: 6

These timings refer to encoding the values to a byte array and decoding them using the Java code only. If writing the data to disk is involved, the speedup that Chimp offers will be even more evident, as it usually needs to write and read far fewer bytes to and from disk.
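A rough sketch of the kind of in-memory micro-benchmark described here is shown below; it assumes the DoublePrecisionChimp classes expose the same encode/flush and hasNext/readDouble methods as the Gorilla V2 classes, and the synthetic data merely stands in for the real datasets (Stocks-Germany, etc.).

```java
// Hedged sketch of the in-memory encode/decode timing measurement.
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

import org.apache.iotdb.tsfile.encoding.decoder.DoublePrecisionChimpDecoder;
import org.apache.iotdb.tsfile.encoding.encoder.DoublePrecisionChimpEncoder;

public class ChimpTimingSketch {
  public static void main(String[] args) throws Exception {
    double[] values = new double[5_000_000];
    for (int i = 0; i < values.length; i++) {
      values[i] = 100.0 + Math.sin(i / 50.0); // stand-in for a real dataset
    }

    // Encode all values into a byte array and measure wall-clock time.
    DoublePrecisionChimpEncoder encoder = new DoublePrecisionChimpEncoder();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long start = System.currentTimeMillis();
    for (double v : values) {
      encoder.encode(v, out);
    }
    encoder.flush(out);
    System.out.println("Encoding time (ms): " + (System.currentTimeMillis() - start));

    // Decode the byte array back and measure decoding time.
    DoublePrecisionChimpDecoder decoder = new DoublePrecisionChimpDecoder();
    ByteBuffer buffer = ByteBuffer.wrap(out.toByteArray());
    start = System.currentTimeMillis();
    while (decoder.hasNext(buffer)) {
      decoder.readDouble(buffer);
    }
    System.out.println("Decoding time (ms): " + (System.currentTimeMillis() - start));
  }
}
```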

@HTHou merged commit 2f63c44 into apache:master on Jan 30, 2023
@HTHou changed the title from "Chimp compression" to "[IOTDB-5443] Implement Chimp encoding in IoTDB" on Jan 30, 2023