José Manuel Requena Plens • June 10, 2026 • Updated June 21, 2026

What the Linker Won't Do: Packing i18n Strings on an MCU

Q: What is a packed i18n string pool?

It replaces the naïve table of const char* pointers with two pieces of generated data: a single flash blob holding every NUL-terminated translation, and a per-language table of uint16_t offsets into that blob. An offset is just the byte where a C string begins, so tr() still returns a plain const char*.

Q: Why use uint16_t offsets instead of a pointer table?

On a 32-bit MCU each const char* is 4 bytes, so a 5×273 grid costs 5,460 bytes of pure addressing. A uint16_t offset is 2 bytes, halving the index table to 2,730 bytes (−50%) — a structural win no linker, and not even LTO, will do for you.

Q: Doesn't the linker already deduplicate and tail-merge strings?

Yes — GNU ld tail-merges identical strings and suffixes by default, and ld.lld does so at -O2. So the packed blob is only about 282 B smaller than what the linker produces. The real, optimizer-proof win is structural: halving the index table and erasing the relocations, which the linker cannot do.

Q: How much RAM does the string pool use?

Zero beyond a one-byte current-language index. Both the pool and the offset table are constexpr .rodata, so they live entirely in flash; on the ESP32 they are memory-mapped (execute-in-place) and read with ordinary loads.

Q: How much firmware did packing actually save?

Net firmware shrank 2,224 B at the shipped -Os, with no API or call-site changes. The saving is smaller than the −2,730 B index halving because pooling forfeits about 500 B of cross-module linker dedup; the table halving plus the erased 1,365 relocations carry the net gain, not the blob.

Q: Does the technique scale and work on other MCUs?

The relocation elimination scales without limit (one per cell), but the −50% table win holds only while the pool fits a 16-bit offset (about 1,300 IDs); past that a uint32 offset matches the pointer width. It is not ESP32-specific: any 32-bit Cortex-M gets the same −50%, and a 64-bit target saves −75% per cell.

A build-time generator packs firmware UI translations into one string pool indexed by uint16 offsets, halving the index table on a 32-bit MCU.

I ship UI text in five languages on a microcontroller with two buttons and a tiny screen. Five languages, 273 string IDs, and a 32-bit chip where every byte of flash is accounted for. The textbook way to store that is a two-dimensional table of const char* pointers — and the first time I looked at the map file, it bothered me that a big chunk of those bytes were spent on addressing, not on a single character of actual text.

So I wrote a generator to out-pack the linker. Then I learned the linker was already doing most of what I’d reinvented — and discovered the one thing it genuinely can’t do, which turned out to be the win worth keeping. This is that story: a packed, tail-merged string pool with uint16_t offsets, what it actually saved, and an honest accounting of where the savings really come from.

TL;DR

The naïve i18n store is a const char* table[langs][ids] — on a 32-bit MCU that’s 5,460 bytes of pointer index for 5×273 strings, independent of the text.
A build-time generator emits one packed, deduplicated, tail-merged string pool plus a uint16_t offset table — the index drops to 2,730 bytes (−50%), and the naïve table’s 1,365 relocations (one address to fix up per cell) become zero.
Net firmware shrank 2,224 B at the shipped -Os, with no API or call-site changes — and packed stays smaller at every optimization level from -O0 to -O3.
Plot twist: modern linkers already tail-merge string literals (GNU ld by default, ld.lld at -O2), so the packed blob is only 282 B smaller than what the linker produces. The real, optimizer-proof win is structural — halving the table and erasing the relocations — which no linker, and not even LTO, can do.
Honest cost: pooling forfeits about 500 B of cross-module linker dedup; the table-plus-relocation win nets the gain, not the blob.

The textbook table — and what the index costs

The obvious generated form is a grid: one row per language, one column per string ID, each cell a pointer to a literal.

strings_gen.cpp (the naïve version)

// 5 languages × 273 ids
const char* const kStrings[kLangCount][kStringCount] = {
    /* EN */ { "Kleidos", "OK", "Cancel", /* ... */ },
    /* ES */ { "Kleidos", "OK", "Cancelar", /* ... */ },
    /* ... */
};

const char* tr(uint8_t lang, uint16_t id) { return kStrings[lang][id]; }

It’s correct, it’s readable, and it’s exactly what most projects ship. But notice what the table is: a big array of addresses. On a 32-bit target, a const char* is four bytes, and there’s one per cell whether the string behind it is "OK" or a full sentence.

The cost of the index alone

Languages × IDs: 5 × 273 = 1,365 cells. One pointer per language, per string ID.
Pointer size: 4 bytes (32-bit). Each cell is a const char*: a relocatable address, not text.
Index table total: 5,460 bytes (5 × 273 × 4). Pure addressing overhead, before a single character of text, and one relocation per cell.

That 5.5 KB buys zero text — it’s the cost of finding the text. String literals themselves live in flash (.rodata) and are a separate, unavoidable cost; on memory-constrained MCUs, keeping literals out of RAM is a long-standing concern, from Arduino’s PROGMEM/F() idiom (Arduino’s PROGMEM reference) to how embedded C++ stores strings in read-only memory. The text I had to pay for. The index was the part that felt wasteful.
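The index arithmetic is easy to check in a couple of lines. This is only a sketch: the function name index_table_bytes is made up for illustration and is not part of the project.

```python
def index_table_bytes(langs: int, ids: int, cell_bytes: int) -> int:
    """Bytes spent on the index alone: one fixed-width cell per (language, ID)."""
    return langs * ids * cell_bytes


pointer_table = index_table_bytes(5, 273, 4)  # const char* on a 32-bit MCU
offset_table = index_table_bytes(5, 273, 2)   # uint16_t offset per cell
print(pointer_table, offset_table, pointer_table - offset_table)
```

Neither figure depends on the text itself, which is why the index can be sized before a single string is translated.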

But doesn’t the linker already do this?

Before optimizing, it’s worth asking what the toolchain does for free — and this is where my first assumption fell apart.

The compiler does a little. GCC’s -fmerge-constants (on by default at -O1 and above) will, per GCC’s optimize-options reference, “attempt to merge identical constants” — but only identical, whole ones. The deeper work happens in the linker. String literals land in sections flagged SHF_MERGE | SHF_STRINGS, and the link editor both deduplicates identical strings and eliminates tail strings: given "bigdog" and "dog", the shorter string is dropped and represented by the tail of the longer one (see Oracle’s linker and libraries guide). That’s tail-merging — exactly the trick I thought I was inventing.

How aggressively depends on the linker and the optimization level:

What merges what, and when
Stage	Merges identical strings?	Tail-merges suffixes?
GCC `-fmerge-constants` (compiler)	Yes (per TU)	No
GNU `ld` (linker)	Yes	Yes — by default
`ld.lld` (linker)	Yes (`-O1`, default)	Only at `-O2`

So GNU ld tail-merges SHF_MERGE | SHF_STRINGS sections by default, while ld.lld does it only with -O2 (-O1, the default, merges identical strings only; -O0 disables merging entirely) — see MaskRay’s ld vs lld notes and the ld.lld manual. It’s also “permitted, not required” — the generic section-merge algorithm operates on whole elements (O’Dwyer on ELF string merging), with string tail-merging being a separate, string-specific path that not every link is guaranteed to run.

That reframes the whole exercise. I’m not doing something the linker can’t. So why generate it at all?

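There are two reasons. The first is nice-to-have: the generator packs the strings deterministically before compilation, so the result does not depend on which linker runs or at which -O level. The second is the headline: the index itself, which no linker will shrink for you.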

The packed pool and the offset table

The design is two pieces of generated data. First, every translation is concatenated into a single flash blob, each string NUL-terminated — so an “offset” is just the start of an ordinary C string. Second, a per-language table of uint16_t offsets into that blob, replacing the pointer grid.

Inside kPool — strings stored end to end
Field	Offset	Size
string A	@0	6 B
\0	@6	1 B
string B	@7	9 B
\0	@16	1 B
more entries	@17+	…

kPool is one blob: entries sit back-to-back, each NUL-terminated. An offset is just the byte where a C string begins — which is why tr() can hand back a plain const char*.

The pool above holds the text — and it stays roughly the same size whether you pack it or not. The change that actually pays off is one level up, in the index: every entry that used to be a 4-byte pointer becomes a 2-byte offset.

One index entry — the part that halves
Region	Segment	Size
pointer	const char* (an address)	4 B
offset	uint16_t	2 B

Same text in the pool either way — what changes is the index: every entry drops from a 4-byte pointer to a 2-byte offset, halving the whole table.

Figure: Build-time generation, runtime lookup
Build time: strings.csv (id, en, es, fr, de, it) → generator (dedup + tail-merge) → kPool (one packed blob) + kOffsets (uint16 table)
Runtime: gen::string(lang, id) = kPool.data() + kOffsets[lang][id] → i18n::tr(StringId) → UI render

The generated output is two constexpr std::arrays: the blob, and a nested array of uint16_t offsets into it. Here is a real excerpt — note how, in the offset table, BRAND ("Kleidos") and CommonOk ("OK") hold the same offset in every language (they deduplicate to one stored copy), while CommonCancel diverges per language:

strings_gen.cpp (packed)

// All translations concatenated into one flash blob — deduplicated and
// tail-merged, emitted longest-first.
constexpr std::array<char, 13531> kPool = {
    "A:avanti  B:modifica  tieni B:indietro\0"  // @0
    "A:weiter  B:öffnen  halten B:zurück\0"      // @39
    "A:weiter  B:bearb.  halten B:zurück\0"      // @77
    "j/k:nav  Espace:ouvrir  S:réglages\0"       // @114
    // ... ~1,000 more entries (longest-first) ...
    // "OK" lands at @1164 (shared by all five languages); "Kleidos" at @9379.
};

// 2-byte offsets into kPool instead of 4-byte pointers. Identical strings
// collapse to one offset; a suffix points partway into a longer entry.
constexpr std::array<std::array<uint16_t, kStringCount>, kLangCount> kOffsets = {{
    //          BRAND  CommonOk  CommonCancel       (… 266 more)
    /* EN */ {{  9379,    1164,        12542, /* … */ }},  // Kleidos · OK · "Cancel"
    /* ES */ {{  9379,    1164,        11279, /* … */ }},  // Kleidos · OK · "Cancelar"
    /* FR */ {{  9379,    1164,        11901, /* … */ }},  // Kleidos · OK · "Annuler"
    /* DE */ {{  9379,    1164,        10295, /* … */ }},  // Kleidos · OK · "Abbrechen"
    /* IT */ {{  9379,    1164,        11909, /* … */ }},  // Kleidos · OK · "Annulla"
}};

static_assert(kPool.size() <= UINT16_MAX, "pool exceeds uint16_t offset range");

A few things make this safe and cheap to adopt:

An offset points at the start of a NUL-terminated string, so the public tr() still returns a plain const char*. Every call site is unchanged — the swap is invisible to the rest of the firmware.
Lookup is barely more expensive than the pointer table: one uint16_t load plus one add (kPool.data() + offset), versus one pointer load. That’s a single extra integer add, with no second indirection — you don’t trade flash for cycles.
Zero RAM cost. Both the pool and the offset table are constexpr .rodata — they live entirely in flash. The only runtime state is a one-byte “current language” index. (This is also why the returned const char* is safe forever: it points into immortal flash, never freed.)
A static_assert (cppreference’s static_assert) enforces, at compile time, that the pool stays under 64 KiB so the uint16_t offsets can address all of it. Grow the table past that and the build fails loudly instead of silently truncating.
std::array (cppreference’s std::array) is an aggregate with the same layout as a raw T[N] — zero overhead versus the C array it replaces, and data() returns a pointer to contiguous storage, so kPool.data() + offset is well-defined.
Because the offsets are emitted as C++ integer literals and compiled for the target, there’s no endianness or serialization concern — unlike a binary blob format such as gettext’s .mo.

Where it lives — naïve vs packed (flash)
Region	Segment	Size
Naïve (pointer table)	literal blob	≈ 13.8 KB
Naïve (pointer table)	pointer table (4 B)	≈ 5.5 KB
Packed (offset table)	kPool	≈ 13.5 KB
Packed (offset table)	kOffsets (2 B)	≈ 2.7 KB

Same text either way — the two blobs land within 282 B of each other. The visible gap is the index: a 5,460 B pointer table against a 2,730 B offset table. Both sit entirely in flash (.rodata) and spend 0 extra RAM beyond a 1-byte language index.

Reading a string back

Two small functions sit behind every lookup. The low-level gen::string() does the actual work — one uint16_t load plus one add — and trusts its caller. The public tr() adds the range check and an empty-to-English fallback, so callers can render the result unconditionally:

strings_gen.cpp + i18n.cpp

namespace i18n::gen {
// Low-level accessor: one uint16 load + one add. No bounds check — caller-guaranteed.
const char* string(uint8_t lang, uint16_t index) {
    return kPool.data() + kOffsets[lang][index];
}
}  // namespace i18n::gen

namespace i18n {
// Public API: range-check, then fall back to English for an empty translation.
const char* tr(StringId id) {
    const uint16_t index = static_cast<uint16_t>(id);
    if (index >= kStringCount) return "";          // out of range → empty
    const char* text = gen::string(currentLang(), index);
    return text[0] != '\0' ? text : gen::string(/*EN*/ 0, index);
}
}  // namespace i18n

Every tr() call walks the same short path — from a UI call site down to a single memory-mapped flash read, with no heap and no RAM copy:

Where a tr() call lands
UI call site: tr(StringId)
→ i18n::tr(): range check + EN fallback
→ i18n::gen::string(): kPool.data() + kOffsets[lang][idx]
→ kOffsets · kPool: constexpr .rodata
→ Flash (XIP): memory-mapped · 0 RAM

One uint16 load and one add against the pool base — then the caller renders the returned const char*. No second indirection, no decode buffer.
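To make the range check and the English fallback concrete, here is a small Python model of the same lookup. It is a sketch on invented toy data (two languages, two IDs), not the firmware code; the helper c_string and the toy pool are assumptions made for illustration.

```python
EN, ES = 0, 1


def c_string(pool: bytes, offset: int) -> str:
    """Read from offset up to (not including) the next NUL, like a C string."""
    end = pool.index(b"\0", offset)
    return pool[offset:end].decode("utf-8")


def tr(pool: bytes, offsets, lang: int, string_id: int) -> str:
    """Model of i18n::tr(): range check, then fall back to English if empty."""
    if not 0 <= string_id < len(offsets[EN]):
        return ""                                   # out of range: empty
    text = c_string(pool, offsets[lang][string_id])
    return text if text else c_string(pool, offsets[EN][string_id])


# Toy pool: "Cancelar" @0, "Cancel" @9, "OK" @16. An empty translation
# needs no storage of its own: it can point at any NUL byte (here @8).
pool = b"Cancelar\0Cancel\0OK\0"
offsets = [
    [16, 9],   # EN: OK, Cancel
    [8, 0],    # ES: (missing), Cancelar
]

print(tr(pool, offsets, ES, 0))   # missing in ES, so the English text is used
print(tr(pool, offsets, ES, 1))
```

The empty-string case is the interesting one: because every stored string ends in a NUL, an untranslated cell costs two bytes of offset and no pool space.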

Call sites pull the names into scope once, so they read cleanly — no i18n::, StringId::, or Lang:: noise. Resolve a StringId into a variable and use it, exactly as before:

src/ui/… (real call sites)

using i18n::tr;
using enum i18n::StringId;   // PinEnter, PopBattLow, CommonCancel, …

const char* prompt = tr(PinEnter);            // resolve once…
display::drawString(prompt, cx, titleY);      // …then draw the label

popup::toast(tr(PopBattLow), nullptr, CAUTION, 3000);   // or inline, for a toast

// What tr() does under the hood — Spanish "Cancel" (lang ES = 1, CommonCancel = 2):
const char* cancel = i18n::gen::string(1, 2);  // kPool.data() + 11279 → "Cancelar"

That gen::string(1, 2) is exactly what the disassembly does: the naïve table would l32i a 32-bit pointer; the packed accessor l16uis a 16-bit offset and adds it to the pool base. Both are a single table load — there is no extra indirection — and the offset table being half the size touches fewer D-cache lines on a hot menu redraw. The contract (tr() hands back a borrowed, program-lifetime const char*) is what lets the format swap stay invisible to all 210 tr() call sites across the UI, popups, onboarding and admin portal: not one of them changed.

Dedup and tail-merge: the generator’s core

The packing itself is a dozen lines of Python. The trick is to place strings longest-first, then, for each one, check whether its bytes (plus the NUL) already appear anywhere in the blob so far. If they do, it’s a duplicate or a suffix of something already placed — reuse that position. If not, append it.

scripts/gen_i18n.py (the packer)

# Unique strings, sorted longest-first so suffixes collapse onto longer strings.
uniq = sorted(
    {entry[lang] for _, entry in rows for lang in LANGS},
    key=lambda s: (-len(s.encode("utf-8")), s),
)

blob = bytearray()
offsets = {}
for s in uniq:
    needle = s.encode("utf-8") + b"\0"
    pos = bytes(blob).find(needle)   # already present (identical OR a tail)?
    if pos >= 0:
        offsets[s] = pos             # reuse — dedup or tail-merge
    else:
        offsets[s] = len(blob)       # new string — append
        blob += needle

Sorting longest-first is what makes tail-merging fall out for free: by the time a short string is considered, any longer string that ends with it has already been placed, so find() locates the shared tail. This is the same shape as a decades-old format — GNU gettext’s compiled .mo files store translations as exactly this kind of offset table into a contiguous string block (the gettext .mo format).
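The effect of the sort order is easy to see by running the same loop over both orders. The sketch below wraps the packing loop in a hypothetical helper, pack_in_order, and feeds it three strings from this article; the helper name and the tiny input set are illustrative assumptions.

```python
def pack_in_order(ordered):
    """Append each string unless its bytes + NUL already sit in the blob."""
    blob = bytearray()
    offsets = {}
    for s in ordered:
        needle = s.encode("utf-8") + b"\0"
        pos = blob.find(needle)
        if pos < 0:              # not present yet: append a new entry
            pos = len(blob)
            blob += needle
        offsets[s] = pos         # otherwise reuse: identical string or shared tail
    return bytes(blob), offsets


def byte_len(s: str) -> int:
    return len(s.encode("utf-8"))


strings = {"Tresor gelöscht", "Verlust = Tresor gelöscht", "OK"}
longest_first = sorted(strings, key=lambda s: (-byte_len(s), s))
shortest_first = sorted(strings, key=lambda s: (byte_len(s), s))

merged, merged_offsets = pack_in_order(longest_first)
unmerged, _ = pack_in_order(shortest_first)
print(len(merged), len(unmerged))   # longest-first yields the smaller blob
```

Placed shortest-first, "Tresor gelöscht" is stored on its own before the longer string arrives, so the tail is paid for twice. The sketch also calls find() directly on the bytearray, which avoids the bytes(blob) copy on each iteration.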

On the current table — which has grown to 273 string IDs and counting — the effect is concrete:

Today's table: 5 × 273 = 1,365 cells
Stage	Distinct strings	Pool size
All cells	1,365	Every (language, ID) pair
After dedup	1,071	13,813 B — `"Kleidos"` ×15, `"OK"` ×14 → stored once each
After tail-merge	1,071*	13,531 B — `"Tresor gelöscht"` reuses the tail of `"Verlust = Tresor gelöscht"`

*Tail-merge doesn’t drop strings — they’re all still distinct — it overlaps their storage, so the count stays 1,071 while the byte count falls.

Notice the proportions: dedup collapses 1,365 cells to 1,071 distinct strings (−294); tail-merge then overlaps suffixes for a further −282 bytes (13,813 → 13,531 B). On real UI text, fully identical strings (brand names, "OK", shared labels) are far more common than shared suffixes — so dedup does the heavy lifting and tail-merge is a small bonus on top. That matches the honesty thesis: the headline win is the index halving, not the blob — which, as we’ll measure, the linker gets to within 282 bytes of on its own.

That bonus does raise a fair question, though: if "Tresor gelöscht" is never stored on its own, how do you read it back? Its offset simply points partway into the longer entry it shares bytes with. Because the blob is NUL-terminated, reading from that offset yields exactly the shorter string — no special case, the same kPool.data() + offset as everything else:

A tail-merged string is just an offset pointing into a longer one
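A minimal Python sketch of that read-back, using the two German strings from the table above. The one-entry pool and the helper c_string are assumptions for illustration; the real offsets in kPool are different.

```python
def c_string(pool: bytes, offset: int) -> str:
    """Read from offset up to (not including) the next NUL, like a C string."""
    end = pool.index(b"\0", offset)
    return pool[offset:end].decode("utf-8")


long_text = "Verlust = Tresor gelöscht"
short_text = "Tresor gelöscht"

# Only the long entry is stored; the short one borrows its tail.
pool = long_text.encode("utf-8") + b"\0"
long_offset = 0
short_offset = len(long_text.encode("utf-8")) - len(short_text.encode("utf-8"))

print(long_offset, repr(c_string(pool, long_offset)))
print(short_offset, repr(c_string(pool, short_offset)))
```

Both reads stop at the same NUL byte. Note that the offset is counted in UTF-8 bytes, not characters, which matters here because "ö" takes two bytes.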

Want to try it on your own data? The string-pool packer takes a list of strings (or a CSV of IDs × languages), shows the pointer-table vs packed-pool cost, and reveals each string’s offset — including how a tail-merged one resolves inside its longer neighbor.

What it actually saved

Here are the numbers from the live sticks3 build (5 languages, 273 IDs), packed against the naïve pointer table, both at the shipped -Os:

Packed vs the naïve pointer table (-Os)
Metric	Before	After	Change
Index table	5,460 B	2,730 B	-50%
i18n data (pool + table)	19,273 B	16,261 B	-15.6%
Firmware image	1,564,784 B	1,562,560 B	-0.1%

Naïve (before) vs packed (after) at -Os. The index halves by a flat −2,730 B (−50%); the i18n data drops −3,012 B; the whole firmware shrinks −2,224 B — every metric in the packed design's favor.

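The index halving is the clean, structural win: a flat −2,730 B (−50%) that, as we’ll see, doesn’t move with the optimization level. But notice the firmware delta (−2,224 B) is smaller than the data delta (−3,012 B), and there’s an honest reason: pooling forfeits about 500 B of cross-module linker dedup that the naïve literals would otherwise get, so the net gain comes from the table, not the blob.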

Because the ESP32’s flash .rodata is memory-mapped (execute-in-place), the firmware reads the pool with ordinary loads — kPool.data() + offset is a plain pointer dereference, with none of the pgm_read_* dance that classic AVR PROGMEM needs.

One address per cell — or none

There’s a second, structural difference the byte counts don’t show. A pointer table is address-constant data: every cell holds the absolute address of a string, and each address is a relocation the linker must resolve. The offset table holds plain uint16_t integers — position-independent, nothing to fix up.

Relocations in the i18n index (per object)
Metric	Before	After	Change
relocations	1,365	0	-100%

Naïve (before): one R_XTENSA_32 record per cell — 1,365 of them, 16,380 B of .rela in the object. Packed (after): the offset table is integers, so its data section carries zero relocations (the only two on the packed side are the accessor's base-address loads — a constant that doesn't scale).

In this firmware it’s a clean structural signal, not a runtime cost. The image is statically linked and execute-in-place, so the linker resolves all 1,365 naïve records into absolute addresses at link time — neither format carries surviving relocations into the final .bin, and what survives is exactly the +2,730 B of pointer table. But the count is the honest proxy for “how many addresses must be materialized”: packed is O(1) (two base loads in the accessor), naïve is O(cells) — one per string, growing with every ID and every language you add. It also makes the table position-independent, which is the one place it earns its keep on a field-updated device: under delta OTA (shipping a binary diff) an absolute-pointer table churns whenever anything upstream in flash shifts its addresses, bloating the patch, while the integer offsets stay put — a structural advantage, though I haven’t benchmarked the patch sizes.
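The 16,380 B figure follows from the record size: an Elf32_Rela entry is three 32-bit fields (r_offset, r_info, r_addend), so 12 bytes each. A short sketch of that arithmetic, with a made-up function name:

```python
import struct

# Elf32_Rela layout: r_offset and r_info (unsigned 32-bit), r_addend (signed 32-bit).
ELF32_RELA_SIZE = struct.calcsize("<IIi")


def rela_bytes(langs: int, ids: int) -> int:
    """Relocation-record bytes in the object for a pointer table: one record per cell."""
    return langs * ids * ELF32_RELA_SIZE


print(ELF32_RELA_SIZE, rela_bytes(5, 273))
```

These bytes live in the object file only. As noted above, the final statically linked image carries none of them.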

Does it scale?

It does — and the part that matters is exact arithmetic, not a guess. Every saving is a closed form in the cell count N, which is just your CSV’s shape:

N (cells)     = ids × langs        ← CSV rows × language columns
pointer table = 4 × N bytes        ← one 32-bit pointer per cell
offset table  = w × N bytes        ← w = 2  (uint16, pool ≤ 64 KiB)
                                         4  (uint32, larger pools)
table saved   = (4 − w) × N bytes  ← (a hand-packed uint24 would make w = 3)
relocations   = N eliminated       ← one per pointer cell

The table-saved and relocation numbers are therefore computed, not assumed — you can read them straight off ids × langs. The only estimated input is which offset width w applies, because that depends on the pool size; past today’s measured point (273 IDs → 13,531 B, ≈ 50 B/ID) the pool size is extrapolated linearly. Projecting out (w shown per row):

Projected savings as the table grows (× 5 languages)
String IDs	Cells	Pointer table	Offset table	Table saved	Relocations erased
273 (today)	1,365	5,460 B	2,730 B (`uint16`)	2,730 B	1,365
500	2,500	10,000 B	5,000 B (`uint16`)	5,000 B	2,500
1,000	5,000	20,000 B	10,000 B (`uint16`)	10,000 B	5,000
2,000	10,000	40,000 B	40,000 B (`uint32`)†	0 B	10,000
10,000	50,000	200,000 B	200,000 B (`uint32`)†	0 B	50,000

† Past roughly 1,300 IDs the pool outgrows 64 KiB, so a uint16 offset can no longer reach the far end and the static_assert(kPool.size() <= UINT16_MAX) fires. The generator’s minimal fix is a uint32 offset, which ties the 32-bit pointer on width, so the table edge vanishes and only the eliminated relocations and the tail-merged blob keep paying. (Packing 3-byte uint24 offsets would recover a 1-byte-per-cell edge, since they address 16 MiB, but the generator doesn’t emit them today.)

So the honest scaling answer has two halves. The relocation elimination scales without limit — one per cell, always — 50,000 of them gone at 10,000 IDs. The table −50% has a ceiling: it holds only while the pool fits a 16-bit offset (≲ 1,300 IDs at this text profile), and past that a plain uint32 offset matches the pointer width, so the index saving flattens to zero and the value shifts entirely to position-independence (relocations, delta-OTA friendliness) and the tail-merged blob. That 64 KiB pool is exactly the boundary the static_assert exists to force you to reconsider — not a place the technique stops helping, but the place its shape has to change.
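The projection table can be reproduced from the closed form. The sketch below assumes, as the text does, that the pool grows linearly from today’s measured point (13,531 B for 273 IDs across five languages) and that the generator switches from uint16 to uint32 once the estimate passes UINT16_MAX; the function name project is illustrative.

```python
POINTER_BYTES = 4               # const char* on a 32-bit MCU
LANGS = 5
UINT16_MAX = 0xFFFF
BYTES_PER_ID = 13_531 / 273     # measured pool bytes per ID, all five languages


def project(ids: int) -> dict:
    """Closed-form index cost for a table of `ids` string IDs x 5 languages."""
    cells = ids * LANGS
    pool_estimate = ids * BYTES_PER_ID          # linear extrapolation (assumption)
    width = 2 if pool_estimate <= UINT16_MAX else 4
    return {
        "cells": cells,
        "pointer_table": POINTER_BYTES * cells,
        "offset_table": width * cells,
        "table_saved": (POINTER_BYTES - width) * cells,
        "relocations_erased": cells,
    }


for ids in (273, 500, 1_000, 2_000, 10_000):
    print(ids, project(ids))
```

Only pool_estimate is a guess. Everything else is exact once the offset width is known, which is the distinction the paragraph above draws.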

Other MCUs: often a bigger win than here

Nothing in this is ESP32-specific — the index is a table of pointers versus a table of integers, so the win tracks two things: how wide a pointer is on your target, and how tight your flash budget is. A uint16 offset is the same 2 bytes everywhere, so it saves more the larger the pointer it replaces:

The offset win by target pointer width
Target	Flash pointer	`uint16` offset saves / cell
8-bit AVR ≤ 64 KB (ATmega328)	2 B	0 B — already 2 B; only dedup + tail-merge help
8-bit AVR, far pointers (ATmega2560)	3 B	−1 B (or a `uint24` offset)
32-bit Cortex-M — STM32, RP2040, nRF	4 B	−2 B (−50%) — identical to here
32-bit RISC-V (ESP32-C3, GD32V)	4 B	−2 B (−50%)
64-bit (Cortex-A app processor / SBC)	8 B	−6 B (−75%)

So on any 32-bit Cortex-M — a small STM32, a Raspberry Pi Pico’s RP2040, a Nordic nRF — you get exactly the −50% measured here with no porting; on a 64-bit target the offset table is a quarter the size of the pointer table.

The bigger lever, though, is what fraction of flash you reclaim. The same 2.7 KB is rounding error on an 8–16 MB ESP32-S3, but it’s ~1% of a 256 KB STM32G0 and ~4% of a 64 KB STM32L0 — there, the table is the difference between fitting your translations and not. So the honest answer to “is 2 KB worth it?” depends entirely on the chip: barely, here; decisively on a flash-starved part shipping several languages. If you’re putting five-language UI text on a microcontroller with tens of KB of flash, this is exactly where it earns its place — and the relocation-freedom matters more there too, on any target that ships relocatable images or delta-OTA updates.

What no optimizer will do for you

The title isn’t rhetorical. Before trusting the win, I checked whether a smarter toolchain could close the gap — at every optimization level, and with the two passes that theoretically might: link-time optimization and aggressive constant merging.

The packed data is byte-identical at every -O level

Extract kPool and kOffsets from the object at -O0, -O1, -Os, -O2 and -O3 and hash them: one distinct hash each. The compiler emits the arrays verbatim — it never re-packs, re-aligns or merges them — so the packed i18n data is byte-for-byte identical at every level: 13,531 B pool + 2,730 B table = 16,261 B, always. Choosing -O is a no-op for the i18n data; it only moves code and the naïve side’s literal alignment. (Bumping the whole image to -O2 for its own sake would add roughly 160 KB of flash for zero i18n benefit — -Os is the right default.)

Packed wins at every level — and -Os is the smallest margin

I linked the whole firmware both ways at all five levels. Packed is smaller every time:

Full firmware: packed vs naïve at each -O level
`-O`	packed `firmware.bin`	naïve `firmware.bin`	packed saves
`-O0`	1,812,736 B	1,817,312 B	+4,576 B
`-O1`	1,622,688 B	1,626,656 B	+3,968 B
`-Os` (shipped)	1,562,560 B	1,564,784 B	+2,224 B
`-O2`	1,723,040 B	1,726,864 B	+3,824 B
`-O3`	1,716,576 B	1,720,544 B	+3,968 B

No level ties, none favors naïve. The smallest margin is at the shipped -Os (+2,224 B) — the one level where the linker’s .str1.1 literal dedup gets the naïve blob nearly as tight as our pool, so the gap narrows to roughly the table difference alone. Every other level widens it: the naïve literals inflate (alignment-4 padding at -O1/-O2/-O3, no merging at -O0) while our tables never move.

Not even LTO closes the gap

Link-time optimization is the one pass that could, in principle, do the cross-module merging the plain linker can’t. I applied -flto to the project component — where the i18n table and all its callers live — and, for good measure, the non-standard -fmerge-all-constants:

Can a smarter optimizer rescue the naïve table? (-Os)
Build	`firmware.bin`	i18n table	Δ vs its baseline
packed (baseline)	1,562,560 B	2,730 B	—
packed + `-flto`	1,562,560 B	2,730 B	0 B
naïve (baseline)	1,564,784 B	5,460 B	—
naïve + `-flto`	1,564,720 B	5,460 B	−64 B
naïve + `-fmerge-all-constants`	1,564,160 B	5,460 B	−624 B

LTO shaves a negligible 64 B off the naïve image, 0 B off packed, and — the point — leaves the 5,460 B pointer table untouched. It can merge code and fold constants, but it cannot turn a 32-bit pointer table into a 16-bit offset table, and it cannot tail-merge string suffixes. Even -fmerge-all-constants — which is non-conforming, since it can merge distinct objects that happen to share a value and break pointer identity — only trims the blob by 624 B and still loses by +1,600 B. The structural win survives every optimizer I threw at it, which is the whole point: the shape of your index is the one thing the toolchain won’t choose for you.

Why generate it at build time

None of this would be worth hand-maintaining — and hand-maintaining it would be the bug. The translations live in a plain strings.csv (one row per ID, one column per language: diff-friendly, editable by a non-programmer, and adding a language is adding a column). A generator turns that into C++, wired as a PlatformIO pre-build hook:

platformio.ini

[env]
extra_scripts = pre:scripts/gen_i18n.py

The pre: prefix (PlatformIO’s extra_scripts option) runs the script before the platform build, so the generated .cpp/.h are always in sync with the CSV before compilation. The script writes its output only when the content actually changes, so it doesn’t trigger needless recompiles, and it’s a documented part of PlatformIO’s advanced scripting, not a bolt-on. The CSV is the single source of truth; the generated files are never hand-edited.
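The write-only-on-change behaviour can be sketched in a few lines. The name write_if_changed appears in the generator, but the body below is an assumption about how such a guard might look, not a copy of the real function.

```python
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content only if it differs from what is on disk.

    Returns True when the file was (re)written, False when it was left alone.
    """
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False                               # identical: leave the file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

Called once per generated file, a guard like this means an unchanged CSV leaves the generated .cpp/.h exactly as they were.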

This is also where the “before compile, regardless of -O2” property pays off: the packing is deterministic and lives in the source you commit. (The sort key is (-byte_length, string) — that second term breaks ties lexicographically, giving a total order, so the same CSV always produces byte-identical output regardless of hash-set iteration order.) The savings don’t depend on which optimization level the final link happens to run.
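The determinism claim is simple to check: feed the same strings in different orders and compare the result. A small sketch, using a made-up word list and a hypothetical helper named pack_order that applies the same sort key:

```python
import random


def pack_order(strings):
    """Placement order: longest UTF-8 byte length first, ties broken by the string."""
    return sorted(set(strings), key=lambda s: (-len(s.encode("utf-8")), s))


words = ["OK", "Si", "No", "Cancel", "Annuler", "Annulla"]
baseline = pack_order(words)

for seed in range(5):
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)
    assert pack_order(shuffled) == baseline   # input order never leaks through

print(baseline)
```

Without the second key term, the three two-byte strings would tie on length and their final order would follow set iteration order, which Python does not guarantee to be stable for strings across runs.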

Adding a language or a string

The whole workflow stays a CSV edit — no C++ touched by hand:

Add a translation

Add a row for a new string ID to strings.csv. For a new language, add a column and also register it in the generator, because the listing below hard-codes both the LANGS list and kLangCount = 5.
Build. The pre: hook regenerates Strings_gen.{h,cpp}, so the new StringId enum value and its per-language offsets appear automatically.
Use it: tr(StringId::MyNewLabel). A missing translation isn’t a crash or a blank: the generator substitutes the English text for any empty cell at build time, so a half-translated language still renders.
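
A minimal sketch of how the English fallback can be resolved at generation time, in the same spirit as parse_csv in the full listing. The three-language subset, the sample rows and the resolve_rows name are assumptions made for this example.

```python
import csv
import io

LANGS = ["en", "es", "fr"]  # hypothetical subset, just for this sketch

# Hypothetical CSV: "Ok" has no French text, "Cancel" has no Spanish text.
SAMPLE_CSV = """id,en,es,fr
Ok,OK,Aceptar,
Cancel,Cancel,,Annuler
"""


def resolve_rows(text):
    """Return {id: {lang: text}}, filling empty cells with the English column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    col = {name.lower(): i for i, name in enumerate(header)}
    rows = {}
    for r in reader:
        if not r or not r[0].strip():
            continue  # skip blank lines
        english = r[col["en"]]
        rows[r[0].strip()] = {
            lang: (r[col[lang]] if r[col[lang]].strip() else english)
            for lang in LANGS
        }
    return rows


rows = resolve_rows(SAMPLE_CSV)
print(rows["Ok"]["fr"], rows["Cancel"]["es"], rows["Cancel"]["fr"])
```

Because the substitution happens before packing, the fallback text is an ordinary duplicate string: it deduplicates against the English entry and costs no extra pool bytes.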

The complete generator

The packing loop above is the heart of it, but the full generator is worth reading end to end: CSV parsing with English fallback, the StringId enum and header emission, the clang-format-stable output, and the write_if_changed guard that avoids needless recompiles. The full listing follows.

scripts/gen_i18n.py

#!/usr/bin/env python3
# Kleidos — i18n codegen.
#
# Reads i18n/strings.csv and emits:
#   src/i18n/Strings_gen.h   (StringId enum + gen::string accessor decl)
#   src/i18n/Strings_gen.cpp (packed, tail-merged string pool + offset table)
#
# Hooked from platformio.ini as `extra_scripts = pre:scripts/gen_i18n.py`.
# Stays a no-op when the generated files are already up-to-date.

import csv
import os
import sys
from pathlib import Path

# `__file__` may not be defined under SCons exec(); fall back to PROJECT_DIR.
try:
    SCRIPT_DIR = Path(__file__).resolve().parent
except NameError:
    SCRIPT_DIR = Path(os.environ.get("PROJECT_DIR", os.getcwd())) / "scripts"
ROOT = SCRIPT_DIR.parent
CSV_PATH = ROOT / "i18n" / "strings.csv"
HDR_PATH = ROOT / "src" / "i18n" / "Strings_gen.h"
CPP_PATH = ROOT / "src" / "i18n" / "Strings_gen.cpp"

LANGS = ["en", "es", "fr", "de", "it"]


def cpp_escape(s: str) -> str:
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        else:
            out.append(ch)
    return "".join(out)


def parse_csv(path: Path):
    """Return list of (id, {lang: text}). Skips comment rows starting with '#'."""
    rows = []
    with path.open("r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header or header[0].lower() != "id":
            raise SystemExit(f"i18n: missing or unexpected header {header}")
        col = {h.lower(): i for i, h in enumerate(header)}
        # English is the fallback column, so it must exist before parsing rows.
        if "en" not in col:
            raise SystemExit("i18n: CSV must contain an 'en' column")
        for r in reader:
            if not r or not r[0].strip() or r[0].strip().startswith("#"):
                continue
            sid = r[0].strip()
            entry = {}
            for lang in LANGS:
                idx = col.get(lang)
                val = r[idx] if (idx is not None and idx < len(r)) else ""
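                # Fall back to English when a translation is missing. LANGS lists
                # "en" first, so entry["en"] is already set for later columns; a
                # short row with no English cell yields "" instead of an IndexError.
                entry[lang] = val if val.strip() else entry.get("en", val)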
            rows.append((sid, entry))
    return rows


def render_header(rows):
    lines = []
    lines.append("// AUTO-GENERATED by scripts/gen_i18n.py — DO NOT EDIT MANUALLY.")
    lines.append("// Regenerated from i18n/strings.csv on every build.")
    lines.append("#pragma once")
    lines.append("#include <cstdint>")
    lines.append("")
    lines.append("namespace i18n {")
    lines.append("")
    lines.append("enum class StringId : uint16_t {")
    for sid, _ in rows:
        lines.append(f"    {sid},")
    lines.append("    COUNT")
    lines.append("};")
    lines.append("")
    lines.append("constexpr uint16_t kStringCount = static_cast<uint16_t>( StringId::COUNT );")
    lines.append("constexpr uint8_t  kLangCount   = 5;  // EN, ES, FR, DE, IT")
    lines.append("")
    lines.append("namespace gen {")
    lines.append("")
    lines.append("/**")
    lines.append(" * @brief Return translation (@p lang, @p index) as a NUL-terminated string.")
    lines.append(" *")
    lines.append(" * Points into the packed flash pool; the returned pointer stays valid for")
    lines.append(" * the program lifetime. No bounds checking — callers must ensure")
    lines.append(" * @p lang < @c kLangCount and @p index < @c kStringCount.")
    lines.append(" */")
    lines.append("const char* string( uint8_t lang, uint16_t index );")
    lines.append("")
    lines.append("}  // namespace gen")
    lines.append("")
    lines.append("}  // namespace i18n")
    lines.append("")
    return "\n".join(lines)


def build_pool(rows):
    """Build a tail-merged string pool.

    Returns (emit, offsets, pool_len): `emit` is the list of strings appended to
    the pool in order (each contributes its UTF-8 bytes + a NUL), `offsets` maps
    every unique string to its byte offset into the concatenated blob, and
    `pool_len` is the blob's total size in bytes.

    Strings are placed longest-first so that any string which is the suffix of
    another collapses onto the longer one's bytes (e.g. "ancel" reuses the tail
    of "Cancel"); identical strings are deduplicated outright.
    """
    uniq = sorted(
        {entry[lang] for _, entry in rows for lang in LANGS},
        key=lambda s: (-len(s.encode("utf-8")), s),
    )
    blob = bytearray()
    offsets = {}
    emit = []
    for s in uniq:
        needle = s.encode("utf-8") + b"\0"
        # The trailing NUL anchors the match to the end of a stored string, so a
        # hit is always a whole string or the suffix of one. bytearray.find()
        # searches in place, without copying the blob on every iteration.
        pos = blob.find(needle)
        if pos >= 0:
            offsets[s] = pos
        else:
            offsets[s] = len(blob)
            blob += needle
            emit.append(s)
    return emit, offsets, len(blob)


def render_cpp(rows):
    emit, offsets, blob_len = build_pool(rows)
    # +1 keeps the string literal's implicit terminator so the array size is not
    # one short of the initializer (which -Werror rejects).
    pool_size = blob_len + 1

    lines = []
    lines.append("// AUTO-GENERATED by scripts/gen_i18n.py — DO NOT EDIT MANUALLY.")
    lines.append('#include "Strings_gen.h"')
    lines.append("")
    lines.append("#include <array>")
    lines.append("#include <cstdint>")
    lines.append("")
    lines.append("namespace i18n {")
    lines.append("namespace gen {")
    lines.append("")
    lines.append("// Packed translation pool: all strings concatenated into one flash blob,")
    lines.append("// deduplicated and tail-merged (a string that is the suffix of another")
    lines.append("// shares its bytes). Adjacent string literals concatenate; the offsets")
    lines.append("// below index into the resulting bytes.")
    lines.append("// clang-format off")
    lines.append(f"constexpr std::array<char, {pool_size}> kPool = {{")
    running = 0
    for s in emit:
        lines.append(f'    "{cpp_escape(s)}\\0"  // @{running}')
        running += len(s.encode("utf-8")) + 1
    lines.append("};")
    lines.append("// clang-format on")
    lines.append("")
    lines.append("// Per-language byte offsets into kPool. uint16_t keeps this table half the")
    lines.append("// size of a 32-bit pointer table and free of load-time relocations.")
    lines.append("// clang-format off")
    lines.append(
        "constexpr std::array<std::array<uint16_t, kStringCount>, kLangCount> kOffsets = { {"
    )
    for lang in LANGS:
        offs = [offsets[entry[lang]] for _, entry in rows]
        lines.append(f"    /* {lang.upper()} */ {{ {{")
        for i in range(0, len(offs), 12):
            chunk = ", ".join(str(o) for o in offs[i : i + 12])
            lines.append(f"        {chunk},")
        lines.append("    } },")
    lines.append("} };")
    lines.append("// clang-format on")
    lines.append("")
    lines.append(
        'static_assert( kPool.size() <= UINT16_MAX, "pool exceeds uint16_t offset range" );'
    )
    lines.append("")
    lines.append("const char* string( uint8_t lang, uint16_t index ) {")
    lines.append("    return kPool.data() + kOffsets[lang][index];")
    lines.append("}")
    lines.append("")
    lines.append("}  // namespace gen")
    lines.append("}  // namespace i18n")
    lines.append("")
    return "\n".join(lines)


def write_if_changed(path: Path, content: str) -> bool:
    """Write only if content differs (avoids triggering recompilation)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        old = path.read_text(encoding="utf-8")
        if old == content:
            return False
    path.write_text(content, encoding="utf-8")
    return True


def generate():
    if not CSV_PATH.exists():
        print(f"i18n: missing {CSV_PATH}", file=sys.stderr)
        return False
    rows = parse_csv(CSV_PATH)
    if not rows:
        print("i18n: CSV produced no rows", file=sys.stderr)
        return False
    h_changed = write_if_changed(HDR_PATH, render_header(rows))
    c_changed = write_if_changed(CPP_PATH, render_cpp(rows))
    if h_changed or c_changed:
        print(f"i18n: regenerated {len(rows)} ids x {len(LANGS)} langs")
    return True


# PlatformIO entry point.
try:
    Import("env")  # type: ignore[name-defined]  # noqa: F821
    generate()
except NameError:
    # Standalone CLI invocation.
    if __name__ == "__main__":
        generate()
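
To see the packing loop and the accessor work together, here is a condensed, illustrative restatement on toy data. The flat-list signature and the lookup helper are assumptions made for this sketch (lookup is a Python stand-in for the generated C++ gen::string); the listing above remains the real generator.

```python
def build_pool(strings):
    """Condensed packing loop for a flat list of strings (sketch only)."""
    blob = bytearray()
    offsets = {}
    ordered = sorted(set(strings), key=lambda s: (-len(s.encode("utf-8")), s))
    for s in ordered:
        needle = s.encode("utf-8") + b"\0"
        pos = blob.find(needle)  # reuse an existing string or suffix if present
        if pos < 0:
            pos = len(blob)
            blob += needle
        offsets[s] = pos
    return bytes(blob), offsets


def lookup(blob, offset):
    """Read the NUL-terminated string that starts at `offset`."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("utf-8")


# Toy input: a duplicate ("Cancel" twice) and a suffix ("ancel").
blob, offsets = build_pool(["Cancel", "ancel", "OK", "Cancel"])
print(blob, offsets)
print(lookup(blob, offsets["ancel"]))
```

Four input strings collapse to a 10-byte pool: the duplicate shares offset 0, and "ancel" lands at offset 1 inside "Cancel" without adding any bytes.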

The roads not taken

A few alternatives looked tempting and were rejected for concrete reasons:

Alternatives considered
Approach	Why not
Keep the pointer table	Simplest, but spends the 2.7 KB of index overhead the offset table reclaims — and keeps 1,365 relocations the offsets erase.
`std::string_view` table	“More modern”, but each view is `{ptr, size}` = 8 bytes on a 32-bit target, so it would double the pointer table to ~10.9 KB (4× the offset table) for a length `tr()` never needs.
X-macros	Elegant all-in-C codegen, but it leans entirely on function-like macros, which violates AUTOSAR C++14 Rule A16-0-1 (now part of MISRA C++:2023): the preprocessor is for includes and conditional compilation only.
Runtime compression	Per-string Huffman saves more flash but needs a RAM decode buffer and breaks the “`tr()` returns a stable pointer” contract the call sites rely on. Not worth it for ~13 KB of text.

The std::string_view option is the tempting one — it’s the modern vocabulary type — so it’s worth seeing exactly what it costs per entry:

One std::string_view entry (32-bit)
Member	Offset	Detail
ptr	0	const char* · 4 B
size	4	size_t · 4 B

sizeof = 8 B

A std::string_view is {pointer, length} = 8 bytes per cell — four times the 2-byte offset — to carry a length tr() never reads.
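
The per-cell sizes turn into table sizes with one multiplication. The sketch below only redoes that arithmetic for the 5×273 grid quoted in this article; the dictionary labels are chosen for the example.

```python
CELLS = 5 * 273  # languages x string IDs, from the figures quoted in the article

# Bytes per table cell on a 32-bit target for each index representation.
BYTES_PER_CELL = {
    "const char* (32-bit)": 4,
    "std::string_view (32-bit)": 8,
    "uint16_t offset": 2,
}

table_bytes = {name: size * CELLS for name, size in BYTES_PER_CELL.items()}

for name, total in table_bytes.items():
    print(f"{name:28s} {total:6d} B")
```

The string_view table comes out at 10,920 B, four times the 2,730 B offset table and twice the 5,460 B pointer table.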

The packed layout is also cleaner against the project’s static-analysis goals than the C array it replaced: std::array throughout (rather than C-style arrays, per AUTOSAR A18-1-1), a compile-time bound check via static_assert, and no preprocessor tricks.
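
The static_assert catches an oversized pool at compile time. The same bound could also be checked one step earlier, inside the generator, so the failure shows up as a codegen message rather than a compiler error. This is a hypothetical helper, not something the listing above contains; the name offset_headroom and the error text are assumptions.

```python
UINT16_MAX = 0xFFFF  # largest value a uint16_t offset can hold


def offset_headroom(pool_len: int) -> int:
    """Return the bytes left before the pool outgrows uint16_t offsets.

    Raises ValueError when the pool is already too large, mirroring the
    generated static_assert( kPool.size() <= UINT16_MAX ).
    """
    if pool_len > UINT16_MAX:
        raise ValueError(
            f"i18n: pool is {pool_len} B, exceeds uint16_t offset range"
        )
    return UINT16_MAX - pool_len


# Example with a hypothetical 13,000-byte pool.
print(offset_headroom(13_000))
```

Reporting the remaining headroom on each regeneration would also show how close the project is to needing wider offsets.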

I didn’t invent this

In the spirit of honest accounting: the shape I “discovered” — a contiguous string blob plus a table of offsets into it — is one of the most reused layouts in software. It’s how localized and metadata strings have shipped for decades:

Where this shape already ships
System	Stored as a blob + offset/index table
GNU gettext `.mo`	Each translation is a length + a 32-bit offset into a string block — the canonical i18n precedent.
Android `resources.arsc` (`ResStringPool`)	Your app’s UI strings as one block plus a table of 32-bit offsets.
.NET assemblies (ECMA-335 `#Strings` heap)	A deduplicated UTF-8 string blob, addressed by offset.
Java `.class` files (constant pool)	Each `Utf8` stored once, referenced by a 2-byte index.
ICU resource bundles	Offset-indexed strings, with a shared `pool.res` deduplicating across bundles.
Rust’s compiler (`rustc` `Symbol`)	An interned string is a 32-bit index into an arena — not a pointer.

What’s left to claim is the composition and one firmware-specific squeeze. Most of those formats use 32-bit offsets or indices, sized for desktop-scale data; on a sub-64 KiB MCU pool the offset fits a uint16, which is the entire −50% table win. The relocation-freedom isn’t mine either — it falls out of indexing with an integer instead of a pointer, the general string-interning pattern, and the same reason RELR exists to compress the pointer relocations everyone else still pays. The tail-merge, as the opening admitted, the linker already does. The only real contribution is noticing all three apply to a two-button MCU at once — and wiring them so the CSV stays the single source of truth.

Try it, and the lesson

What I'd take from this

Measure before you optimize. The linker was already doing most of what I set out to “invent.”
The shape of your index is yours. Pointers versus offsets is a structural choice no toolchain — not even LTO — will make for you, and it carries the relocations along with it (1,365 → 0).
Own the optimization up front when you don’t want it riding on a linker flag — codegen makes it deterministic, and the packed data is byte-identical from -O0 to -O3.
Keep the source human-friendly. A CSV plus a generator beats a hand-written table you’ll eventually desync.
Account honestly. The headline (−50% index) is real; the firmware delta (−2,224 B at -Os) is smaller because you hand back about 500 B of cross-module dedup. At -Os the linker already gets the blob within 282 B, so the table-plus-relocations saving is the load-bearing win, not the text.

This came out of Kleidos, a hardware password manager I’m building — not yet released. The i18n layer is a small corner of it, but it’s a good reminder that on a constrained target the interesting savings often aren’t in the data you store, but in how you index it.

Frequently asked questions

What is a packed i18n string pool?

It replaces the naïve table of const char* pointers with two pieces of generated data: a single flash blob holding every NUL-terminated translation, and a per-language table of uint16_t offsets into that blob. An offset is just the byte where a C string begins, so tr() still returns a plain const char*.

Why use uint16_t offsets instead of a pointer table?

On a 32-bit MCU each const char* is 4 bytes, so a 5×273 grid costs 5,460 bytes of pure addressing. A uint16_t offset is 2 bytes, halving the index table to 2,730 bytes (−50%) — a structural win no linker, and not even LTO, will do for you.

Doesn't the linker already deduplicate and tail-merge strings?

Yes — GNU ld tail-merges identical strings and suffixes by default, and ld.lld does so at -O2. So the packed blob is only about 282 B smaller than what the linker produces. The real, optimizer-proof win is structural: halving the index table and erasing the relocations, which the linker cannot do.

How much RAM does the string pool use?

Zero beyond a one-byte current-language index. Both the pool and the offset table are constexpr .rodata, so they live entirely in flash; on the ESP32 they are memory-mapped (execute-in-place) and read with ordinary loads.

How much firmware did packing actually save?

Net firmware shrank 2,224 B at the shipped -Os, with no API or call-site changes. The saving is smaller than the −2,730 B index halving because pooling forfeits about 500 B of cross-module linker dedup; the table halving plus the erased 1,365 relocations carry the net gain, not the blob.

Does the technique scale and work on other MCUs?

The relocation elimination scales without limit (one per cell), but the −50% table win holds only while the pool fits a 16-bit offset (about 1,300 IDs); past that a uint32 offset matches the pointer width. It is not ESP32-specific: any 32-bit Cortex-M gets the same −50%, and a 64-bit target saves −75% per cell.

Further Reading & Resources

Arduino's PROGMEM reference docs.arduino.cc

embedded C++ stores strings blog.feabhas.com

GCC's optimize-options reference gcc.gnu.org

Oracle's linker and libraries guide docs.oracle.com

MaskRay's ld vs lld notes maskray.me

the ld.lld manual man.archlinux.org

O'Dwyer on ELF string merging quuxplusone.github.io

cppreference's static_assert en.cppreference.com

cppreference's std::array en.cppreference.com

the gettext .mo format www.gnu.org

Espressif's part numbers developer.espressif.com

ESP-IDF partition tables docs.espressif.com

PlatformIO pre-build hook docs.platformio.org

PlatformIO's extra_scripts option docs.platformio.org

PlatformIO's advanced scripting docs.platformio.org

AUTOSAR C++14 Rule A16-0-1 www.mathworks.com

AUTOSAR A18-1-1 www.mathworks.com

ECMA-335 heap learn.microsoft.com

constant pool docs.oracle.com

ICU resource bundles icu.unicode.org

doc.rust-lang.org

string-interning en.wikipedia.org

RELR maskray.me
