JPEG2000 byte modifications with Python and Docker
At the beginning of the year I worked on a Python project for Harvard University: the library system runs a mega digital repository, and a subset of JPEG 2000 images had a very particular kind of ICC profile corruption detected by Kakadu but not by jpylyzer, two validation tools they use.
I wrote jp2_remediator to validate and (when safe) repair those files so downstream systems could read them. I made a command-line app in a Docker image that scans a single file, a directory, or an S3 bucket.
When it’s put in production, Apache Airflow pulls this image and runs the CLI inside the container for each image, giving us reproducible runs across environments and easy horizontal scaling.
The specs I used and actually really enjoyed studying were:
- ISO/IEC 15444-1:2019 (E) — the JPEG 2000 Core Coding System spec, which defines how JP2 boxes are structured, including the colr box and how ICC profiles are embedded. https://www.iso.org/standard/78321.html
- ICC.1:2022 (Profile version 4.4.0.0) — the Specification ICC.1:2022-05 (Image technology colour management — Architecture, profile format, and data structure), which defines how ICC profiles are laid out, including the header, tag table, and curveType encoding used for tone-response curves (TRCs). https://www.color.org/specification/ICC1v44_2022-05.pdf
At the time of this writing—and hopefully forever and ever—all tests pass with 100% coverage.
Why JP2?
JP2 (JPEG 2000) shows up a lot in cultural heritage and large repositories as a preservation format because it’s:
- Visually flexible: supports both lossless (5/3 reversible) and lossy (9/7) wavelets, plus high bit depths.
- Zoom-friendly: built-in multi-resolution pyramids and tiling make deep zoom and partial reads efficient.
- Streamable & robust: codestream is designed for region-of-interest delivery and error resilience.
- Metadata-aware: clean hooks for color management (embedded ICC profiles) and other box-level metadata.
- Ecosystem-ready: widely supported in IIIF servers and viewers, so it slots into modern delivery stacks.
What problem it solves
Some JPEG 2000 (JP2) images embed an ICC profile that defines tone-response curves (TRCs) for red, green, and blue—rTRC, gTRC, bTRC. Those TRCs are stored using the ICC curveType structure (aka curv). In a subset of files we saw, metadata in the ICC tag table didn’t match the actual curve payload: e.g., a tag claimed one size or “shape,” while the bytes encoded a different one. Imaging libraries would then error out or silently misinterpret the gamma. jp2_remediator walks the raw bytes, checks those invariants, fixes the mismatches, and optionally logs “needs-review” cases for humans.
How it works (at a glance)
A minimal mental model of the ICC profile sections that jp2_remediator touches:
[ ICC Profile ]
├─ Header (128 bytes, ICC v4)
├─ Tag Count (uInt32)
└─ Tag Table (TagCount × 12 bytes)
   ├─ Entry 0: [ signature(4) | offset(4) | size(4) ]
   ├─ Entry 1: [ signature(4) | offset(4) | size(4) ]
   └─ ...
   signatures of interest: 'rTRC', 'gTRC', 'bTRC'
... Tag Data Blocks live elsewhere in the profile ...
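To make the layout above concrete, here is a minimal sketch of indexing the tag table with Python's struct module. The names read_tag_table and build_demo_profile are hypothetical helpers for illustration, not functions from jp2_remediator, and the demo profile is a synthetic blob with an all-zero header rather than a real ICC profile.

```python
import struct

HEADER_SIZE = 128  # the ICC header is a fixed 128 bytes


def read_tag_table(profile: bytes) -> dict:
    """Map each tag signature to its (offset, size) pair from the tag table."""
    # Tag count is a big-endian uInt32 right after the header.
    (tag_count,) = struct.unpack_from(">I", profile, HEADER_SIZE)
    table = {}
    for i in range(tag_count):
        entry_pos = HEADER_SIZE + 4 + i * 12  # each entry is 12 bytes
        sig, offset, size = struct.unpack_from(">4sII", profile, entry_pos)
        table[sig.decode("latin-1")] = (offset, size)
    return table


def build_demo_profile() -> bytes:
    """Synthetic profile (assumption): zeroed header, three TRC tags sharing one curv."""
    tags = [(b"rTRC", 168, 14), (b"gTRC", 168, 14), (b"bTRC", 168, 14)]
    table = struct.pack(">I", len(tags))
    table += b"".join(struct.pack(">4sII", *t) for t in tags)
    # curv payload: signature, reserved, count = 1, one uInt16 value, 2 pad bytes
    curv = b"curv" + bytes(4) + struct.pack(">IH", 1, 0x0233) + bytes(2)
    return bytes(HEADER_SIZE) + table + curv


if __name__ == "__main__":
    for name, (offset, size) in read_tag_table(build_demo_profile()).items():
        print(name, offset, size)
```

With the table indexed by signature, looking up a TRC tag is a plain dictionary access, which keeps the later size checks short.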
For each TRC tag (rTRC/gTRC/bTRC), the payload should be an ICC curveType. Here’s the curveType “curv” layout:
Offset  Size  Field
+0      4     type signature = 'curv'
+4      4     reserved (0)
+8      4     count = n (uInt32)
+12     2*n   curve data (n × uInt16)
              └─ pad to 4-byte alignment
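Reading that layout in Python might look like the sketch below. read_curv and describe_curve are hypothetical names, not the tool's API. The offset argument is assumed to come from the tag table, and the single-value case is decoded as u8.8 fixed point.

```python
import struct


def read_curv(profile: bytes, offset: int):
    """Return (n, values) for the curveType payload starting at `offset`."""
    type_sig, _reserved, n = struct.unpack_from(">4sII", profile, offset)
    if type_sig != b"curv":
        raise ValueError(f"expected 'curv' at offset {offset}, found {type_sig!r}")
    # n big-endian uInt16 entries follow the 12-byte curv header.
    values = list(struct.unpack_from(f">{n}H", profile, offset + 12))
    return n, values


def describe_curve(n: int, values: list) -> str:
    """Summarize what the entry count means for the curve."""
    if n == 0:
        return "linear (no points)"
    if n == 1:
        # A single entry is a gamma stored as u8.8 fixed point.
        return f"single gamma {values[0] / 256.0}"
    return f"LUT with {n} samples"


if __name__ == "__main__":
    payload = b"curv" + bytes(4) + struct.pack(">IH", 1, 0x0233) + bytes(2)
    n, values = read_curv(payload, 0)
    print(describe_curve(n, values))
```

A corrupt count that runs past the end of the data makes struct raise struct.error, which a caller would probably want to catch and log rather than let propagate.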
1. Locate the ICC profile inside the JP2.
   jp2_remediator scans the JP2 until it finds the colr box carrying an embedded ICC profile, then extracts the ICC blob for analysis.
2. Parse the ICC header and tag table.
   ICC profiles have a header followed by a tag table: for each tag there's a signature (like rTRC), an offset (where the tag's data lives inside the profile), and a size (how many bytes to read). The tool indexes this table so it can jump straight to rTRC, gTRC, and bTRC (the specification provides the calculation for where each entry can be found).
3. Read each TRC tag's curveType payload.
   The tool validates that each entry's payload is a well-formed ICC curveType (curv) that matches what the tag table claims.
   Semantics
   - n = 0 → linear TRC (no points)
   - n = 1 → single gamma as u8.8 fixed-point → gamma = curve[0] / 256.0
   - n > 1 → LUT of n samples (non-parametric curve)
   Size check
   expected_size = 12 + 2*n (then round up to a multiple of 4 bytes)
   Examples: n=0 → 12; n=1 → 16; n=257 → 528
   What the tool fixes
   - recomputes expected_size and corrects the tag table size if it's wrong (lossless metadata edit)
   - for n != 1, flags/logs for review instead of guessing
   - for n = 1, decodes/validates the gamma and continues
4. Cross-check what the tag table claims vs. what the payload is.
   Given n, the expected size of the curveType data is:
   expected_size = 4 (type) + 4 (reserved) + 4 (count) + 2*n (data) + pad_to_4B
   - For n = 1, that's 12 + 2 + 2 bytes pad = 16.
   - For n = 0, that's 12 (already 4-byte aligned).
   - For larger n, it's 12 + 2*n, rounded up to the next multiple of 4.
   If the ICC tag table's recorded size for rTRC, gTRC, or bTRC doesn't match this computed expected_size, the tool corrects the size in the tag table so the profile is internally consistent.
5. Flag “not-one-gamma” curves for later review.
   When n != 1, the curve isn't a simple gamma: it's either linear (n=0) or a LUT (n>1). The tool doesn't guess; it flags and (optionally) logs these for a person to inspect later.
6. Write back a fixed profile (losslessly) when appropriate.
   If only the table metadata is off (a classic cause of failures in downstream pipelines), jp2_remediator rewrites the profile bytes in place with the corrected table values. The pixel data aren't touched, only ICC metadata.
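Step 1 above can be sketched as a small box walker. This is an illustration under assumptions, not the tool's actual search strategy: iter_boxes and find_icc_profile are hypothetical names, the colr box is assumed to sit inside the jp2h superbox, and a method byte of 2 is taken to mean an embedded (restricted) ICC profile follows the three method/precedence/approximation bytes.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, start: int, end: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (box_type, payload_start, box_end) for boxes in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        length, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if length == 1:  # a 64-bit extended length follows the box type
            if pos + 16 > end:
                return
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif length == 0:  # box extends to the end of the enclosing range
            length = end - pos
        if length < header or pos + length > end:
            return  # malformed box: stop rather than guess
        yield box_type, pos + header, pos + length
        pos += length


def find_icc_profile(jp2: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (file offset, ICC bytes) for the first ICC-carrying colr box, if any."""
    for box_type, start, end in iter_boxes(jp2, 0, len(jp2)):
        if box_type != b"jp2h":
            continue
        for inner_type, c_start, c_end in iter_boxes(jp2, start, end):
            # colr payload: METH(1) PREC(1) APPROX(1), then the profile when METH == 2
            if inner_type == b"colr" and c_end - c_start > 3 and jp2[c_start] == 2:
                return c_start + 3, jp2[c_start + 3:c_end]
    return None
```

Returning the file offset alongside the bytes matters for the write-back step: a corrected tag table size can then be written at the right position in the original file without rebuilding any boxes.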
TL;DR math: for n = 1, the gamma value is the single 16-bit curve entry interpreted as u8.8 fixed point, i.e., gamma = entry / 256.0. The tag's size must then be 16 bytes (12 bytes header + 2 data + 2 pad). If it's not, fix the size so it matches the payload.
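The size math and the table fix can be sketched together as below. The function names are hypothetical and this is not the tool's code. One assumption to flag loudly: the sketch only rewrites the size for single-gamma curves (n = 1) and merely reports everything else, which is a conservative reading of the steps above.

```python
import struct

HEADER_SIZE = 128
TRC_TAGS = (b"rTRC", b"gTRC", b"bTRC")


def expected_curv_size(n: int) -> int:
    """12-byte curv header plus n uInt16 entries, rounded up to a multiple of 4."""
    raw = 12 + 2 * n
    return (raw + 3) // 4 * 4


def fix_trc_sizes(profile: bytes):
    """Return (patched profile, notes). Only tag table size fields are changed."""
    fixed = bytearray(profile)
    notes = []
    (tag_count,) = struct.unpack_from(">I", fixed, HEADER_SIZE)
    for i in range(tag_count):
        entry = HEADER_SIZE + 4 + 12 * i
        sig, offset, size = struct.unpack_from(">4sII", fixed, entry)
        if sig not in TRC_TAGS:
            continue
        name = sig.decode("ascii")
        type_sig, _reserved, n = struct.unpack_from(">4sII", fixed, offset)
        if type_sig != b"curv":
            notes.append(f"{name}: payload is not curv, skipped")
            continue
        if n != 1:
            # Assumption: leave linear/LUT curves alone and flag them for a human.
            notes.append(f"{name}: n={n}, needs review")
            continue
        expected = expected_curv_size(n)
        if size != expected:
            # The size field is the last 4 bytes of the 12-byte table entry.
            struct.pack_into(">I", fixed, entry + 8, expected)
            notes.append(f"{name}: size {size} -> {expected}")
    return bytes(fixed), notes
```

Because only a 4-byte field inside the tag table changes, the patched profile has the same length as the original, so it can be written back over the old bytes without shifting anything else in the file.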
Why this matters (and where it runs)
We invoked jp2_remediator inside an Apache Airflow workflow, sweeping entire buckets/folders so curv/ICC issues are caught before delivery and before they blow up consumer services. The tool can process:
- one file,
- a whole directory, or
- all .jp2 objects in an S3 bucket (with optional prefix to scope to a “folder”).
Notes from the trenches
- I used Hex Fiend to investigate and sanity-check offsets and sizes.
- You really feel the spec here: calculating expected_size based on n and enforcing 4-byte alignment comes from ICC.1:2022; using TRC semantics coherently with the image pipeline keeps us aligned with ISO/IEC 15444-1:2019 (JPEG 2000).
- Most failures we saw were “metadata lies”: the curve really was a single gamma, but the tag table used the wrong size. Fixing just that field stabilized a lot of assets without touching pixels.