
I have altered some Huffman C++ code that I found, specifically the decoding part. It now runs orders of magnitude faster, which is especially noticeable for large input. Are there any other optimizations to be made?

https://github.com/sjhalayka/qc_for_cs/blob/main/pd_10_3_2.cpp

bool get_decoded_vector(const string& encoded_string, vector<T>& decoded_vector)
{
    decoded_vector.clear();

    if (huffman_codes.size() == 0 || encoded_string == "")
        return false;

    if (huffman_codes.size() == 1)
    {
        const T c = huffman_codes.begin()->first; // should be '0'

        for (size_t i = 0; i < encoded_string.size(); i++)
            decoded_vector.push_back(c);

        return true;
    }

    // Get the minimum and maximum token size
    size_t min_bits = static_cast<size_t>(-1); // Casting the number -1 transforms it into the largest possible value that can be held by a size_t
    size_t max_bits = 0;

    for (const auto pair : huffman_codes)
    {
        if (pair.second.size() < min_bits)
            min_bits = pair.second.size();

        if (pair.second.size() > max_bits)
            max_bits = pair.second.size();
    }

    const size_t encoded_len = encoded_string.length();

    // Use a sliding window of variable length
    size_t begin_index = 0; // Start at the very beginning of the string
    size_t len = min_bits; // Don't waste time by starting at len = 1 if it's not necessary

    // While stuff to parse 
    while (is_valid_window(encoded_len, begin_index, len))
    {
        if (len > max_bits)
        {
            // No match was found up to here, which is exceptionally weird
            return false;
        }

        // Match token with map element
        const string token = encoded_string.substr(begin_index, len);

        bool found_token = false;

        for (const auto pair : huffman_codes)
        {
            if (pair.second == token)
            {
                decoded_vector.push_back(pair.first);
                found_token = true;
                break;
            }
        }

        if (found_token)
        {
            // Slide window by token size number of steps,
            // then reset window length to the minimum
            begin_index += token.size();
            len = min_bits;
        }
        else
        {
            // Expand window length by 1
            len++;
        }
    }

    return true;
}

  • Are you interested only in optimizations that stay within the realm of what you're doing here (taking what is presumably a string of '1' and '0' characters, not proper binary data), or in efficient Huffman decoding in general? (Commented yesterday)

3 Answers


Structured bindings

You have a few loops like the following in your code where you're iterating over a std::map<T, std::string>¹, and you would benefit from structured bindings rather than spelling out first and second.

    for (const auto pair : huffman_codes)
    {
        if (pair.second.size() < min_bits)
            min_bits = pair.second.size();

        if (pair.second.size() > max_bits)
            max_bits = pair.second.size();
    }

Becomes:

    for (const auto [k, v] : huffman_codes)
    {
        if (v.size() < min_bits)
            min_bits = v.size();

        if (v.size() > max_bits)
            max_bits = v.size();
    }

You also have the following:

        const T c = huffman_codes.begin()->first;

This might be cleaned up with a structured binding:

        const auto [c, _] = *huffman_codes.begin();

Views

But since you only access the values from the map, you might apply the std::views::values view to that map when iterating.

    for (const auto v : huffman_codes | std::views::values)
    {
        if (v.size() < min_bits)
            min_bits = v.size();

        if (v.size() > max_bits)
            max_bits = v.size();
    }

Min/max values

    size_t min_bits = static_cast<size_t>(-1); // Casting the number -1 transforms it into the largest possible value that can be held by a size_t

You probably shouldn't be doing this. Better would be to use std::numeric_limits.

    std::size_t min_bits = std::numeric_limits<std::size_t>::max();

¹ Avoid using namespace std;

  1. Since C++17, we have std::string_view, which is preferable to std::string const& for function arguments in nearly all cases, as it is more flexible; see the sketch after this list.

  2. Are you sure an empty input should be an error, instead of just resulting in an empty output?

  3. Better to use .empty() to check for an empty string/container; it is clearer about intent and potentially more efficient.

  4. The block handling a size-one Huffman dictionary could be simplified and optimized by using the appropriate call:

    if (huffman_codes.size() == 1)
    {
        decoded_vector.assign(encoded_string.size(), huffman_codes.begin()->first);
        return true;
    }
    
  5. Getting the minimum and maximum token size is also a single call:

    auto [min_bits, max_bits] = std::ranges::minmax(
        huffman_codes | std::views::transform([](const auto& p) { return p.second.size(); }));
    
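Putting points 1-4 together, the start of the function might look roughly like this (just a sketch, assuming the same huffman_codes member and element type T as in the question):

    bool get_decoded_vector(std::string_view encoded_string, std::vector<T>& decoded_vector)
    {
        decoded_vector.clear();

        if (huffman_codes.empty())
            return false;

        if (encoded_string.empty())
            return true; // an empty input simply decodes to an empty output

        if (huffman_codes.size() == 1)
        {
            decoded_vector.assign(encoded_string.size(), huffman_codes.begin()->first);
            return true;
        }

        // ... the rest of the decoding loop, unchanged
    }
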
  • Super! Thanks a lot. (Commented 8 hours ago)

for (const auto pair : huffman_codes)
{
    if (pair.second == token)
    {
        decoded_vector.push_back(pair.first);
        found_token = true;
        break;
    }
}

(map<T, string> huffman_codes; is found outside of the snippet submitted for review)

Ideally don't search a map by value. Huffman codes are unique, so you can have a reverse-map which can be searched by key.
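
For example, a minimal sketch of that idea, reusing the question's map<T, string> huffman_codes (the name symbol_for_code is made up for illustration):

std::map<std::string, T> symbol_for_code;

// Build the reverse map once, after the Huffman codes are known
for (const auto& [symbol, code] : huffman_codes)
    symbol_for_code[code] = symbol;

// In the decoding loop, the linear scan over huffman_codes becomes one lookup
const auto it = symbol_for_code.find(token);

if (it != symbol_for_code.end())
{
    decoded_vector.push_back(it->second);
    found_token = true;
}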

Furthermore, if you have a reasonable limit on the maximum Huffman code length (k), you could make a reverse-map from every possible window of length k to the symbol at the start of that window and its length. Then there is no need to try different lengths: you could take a window of k "bits" and decode that with a single lookup. Instead of searching for the right window length, the lookup tells you what that length would have been. Be careful with windows of length k that extend past the end of the input data.

Instead of working with strings so much, you could translate the "bits" of the window into actual bits and use the result to index directly into a plain array/vector, doing away with the usual map overhead and most of the string business. That's also how plain old Huffman decoders that decode binary data tend to work (the simple version, not the more sophisticated algorithms that can use a smaller lookup table), except that they don't need to translate the window into an integer; they work with raw bits the whole way.
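
A rough sketch of that combined approach (illustrative only: table_entry and the '0'-padding past the end of the input are assumptions made here, T is assumed to be default-constructible, and max_bits, begin_index, encoded_len, encoded_string and decoded_vector are the variables from the question's function):

// One entry per possible max_bits-wide window: the symbol whose code is a
// prefix of that bit pattern, plus that code's length
struct table_entry { T symbol; size_t length; };

std::vector<table_entry> table(size_t(1) << max_bits);

for (const auto& [symbol, code] : huffman_codes)
{
    // The code forms the high bits of the index; every combination of the
    // remaining low bits maps to the same symbol
    size_t prefix = 0;

    for (const char c : code)
        prefix = (prefix << 1) | (c == '1' ? 1 : 0);

    const size_t free_bits = max_bits - code.size();

    for (size_t fill = 0; fill < (size_t(1) << free_bits); fill++)
        table[(prefix << free_bits) | fill] = { symbol, code.size() };
}

// Decoding step: turn max_bits characters starting at begin_index into an
// integer ('0'-padded past the end of the input) and look the window up
size_t window = 0;

for (size_t i = 0; i < max_bits; i++)
{
    const char c = (begin_index + i < encoded_len) ? encoded_string[begin_index + i] : '0';
    window = (window << 1) | (c == '1' ? 1 : 0);
}

const table_entry entry = table[window];

decoded_vector.push_back(entry.symbol);
begin_index += entry.length; // slide the window by the decoded code's length

The table has 2^max_bits entries, so this only pays off when the maximum code length is kept reasonably small, as noted above.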

  • Thank you for the insight. (Commented 8 hours ago)
