HWP 5.0 Gotchas

The five things that take a week to figure out the first time. If you're trying to edit HWP files programmatically and Hancom is rejecting your output or rendering it strangely, start here.

1. Why does my edited HWP open as "corrupted"?

You probably edited a BodyText/Section0 stream whose byte length changed, then wrote it back through a naive CFB writer.

HWP is a Microsoft Compound File Binary (MS-CFB) container. Its directory is a red-black tree of named entries, and siblings are ordered by name under specific comparison rules (MS-CFB §2.6.4: compare UTF-16 lengths first, then upper-cased code points). Hancom validates the tree invariants on open, and most naive implementations get the name comparison wrong.
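To make the comparison rule concrete, here is a minimal sketch of a sort key that orders sibling names by length first and upper-cased UTF-16 code units second. The helper names are hypothetical, the stream names are only illustrative, and Python's `str.upper()` is used as an approximation of the simple case conversion the specification calls for.

```python
def _utf16_units(text: str) -> list[int]:
    """Split a string into its UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def _simple_upper(ch: str) -> str:
    """Upper-case one character, skipping mappings that expand to several characters."""
    up = ch.upper()
    return up if len(up) == 1 else ch


def cfb_name_key(name: str) -> tuple[int, list[int]]:
    """Sort key: shorter names first, then upper-cased code units (hypothetical helper)."""
    upper = "".join(_simple_upper(ch) for ch in name)
    return (len(_utf16_units(name)), _utf16_units(upper))


names = ["BodyText", "DocInfo", "FileHeader", "BinData"]
print(sorted(names, key=cfb_name_key))
# ['BinData', 'DocInfo', 'BodyText', 'FileHeader']
```

Note that a plain alphabetical sort would put BodyText before DocInfo. That difference is exactly the kind of mistake that produces a tree Hancom rejects.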

The usual workaround, olefile's OleFileIO.write_stream, sidesteps the problem by allowing only same-size rewrites, but injecting Korean text almost always changes the size. (olefile is a third-party package, not part of the Python standard library.)

The fix: read the original 128-byte directory records straight from the file (not through olefile's parsed view) and preserve their sid_left, sid_right, sid_child, and color fields byte-for-byte in the output. The tree topology is already valid; reusing it sidesteps the comparison-rule trap entirely.

hwpkit.cfb.load and hwpkit.cfb.dump do this.

2. Why does my injected text render as a smashed single line?

You updated PARA_HEADER.chars and added or replaced PARA_TEXT, but didn't touch the PARA_LINE_SEG record (tag 0x45). Hancom caches the rendered line layout in that record. With stale cache data Hancom draws all your new characters overlaid on the original one-character line.

Things you might try, and why they fail:

| Approach | Result |
| --- | --- |
| Leave the original LineSeg alone | New text smashes onto a single line; Hancom uses the stale cache |
| Delete the PARA_LINE_SEG record | Hancom rejects the file as corrupted (record is mandatory) |
| Build a multi-segment LineSeg covering the new chars | Cell grows vertically, but text still smashes onto the last segment |
| Replace the body with 36 zero bytes | ✅ Hancom recomputes layout from PARA_CHAR_SHAPE metrics |

The all-zero LineSeg is pyhwp's documented "dummy LineSeg" fallback (hwp5/xmlmodel.py, comment "더미 LineSeg를 만들어 준다", roughly "create a dummy LineSeg"). Hancom treats it as a sentinel meaning "no cached layout, please recompute."
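As a rough sketch of what that sentinel looks like on the wire, the snippet below builds a complete PARA_LINE_SEG record: a packed record header followed by 36 zero bytes. It assumes the standard HWP 5.0 record header layout (10-bit tag, 10-bit level, 12-bit size, with an extra 32-bit size when the 12-bit field is saturated). The function names are hypothetical and not part of hwpkit.

```python
import struct

HWPTAG_PARA_LINE_SEG = 0x45
LINE_SEG_SIZE = 36  # one line segment entry


def record_header(tag: int, level: int, size: int) -> bytes:
    """Pack an HWP record header; large payloads use the extended size form."""
    if size < 0xFFF:
        return struct.pack("<I", tag | (level << 10) | (size << 20))
    return struct.pack("<II", tag | (level << 10) | (0xFFF << 20), size)


def dummy_line_seg(level: int) -> bytes:
    """Header plus an all-zero body: the 'no cached layout' sentinel."""
    return record_header(HWPTAG_PARA_LINE_SEG, level, LINE_SEG_SIZE) + bytes(LINE_SEG_SIZE)


record = dummy_line_seg(level=1)
print(len(record))  # 40
```

The `level` must match the level of the PARA_LINE_SEG record being replaced, since it encodes the record's depth in the paragraph's record tree.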

hwpkit.records.inject_text and replace_text already do this whenever the character count changes. Only when the count stays identical (e.g. a single-character checkbox swap □ → ☑) is it safe to leave the LineSeg untouched — and that's what swap_in_para_text relies on.

3. Why does my English text refuse to change font in Hancom?

HWP's CharShape record (DocInfo tag 0x15) has seven per-script font slots:

| Slot | Script | When it's used |
| --- | --- | --- |
| 0 | 한글 (Hangul) | Korean syllables |
| 1 | 영문 (Latin) | A–Z, a–z, basic punctuation |
| 2 | 한자 (Hanja) | CJK Unified Ideographs |
| 3 | 일어 (Japanese) | Hiragana, Katakana |
| 4 | 기타 (Other) | Everything else |
| 5 | 기호 (Symbol) | Symbols |
| 6 | 사용자 (User) | User-defined script range |

Hancom's font dropdown in the toolbar typically binds to one slot — usually Hangul. So if your paragraph mixes Korean and Latin text and the Latin slot points to a different font, changing the font via the toolbar silently fails on the Latin runs. The Korean part updates, the English part doesn't.

Fix in Hancom (GUI): select the text, press Alt+L to open 글자 모양 (Character Shape), set 대표 글꼴 (representative font), and check the "모든 언어" (all languages) / per-language boxes to propagate.

Fix programmatically: use hwpkit.charshape.flatten_to_face to overwrite all 7 face_name_ids with the same face id. Operate on the DocInfo stream's parsed records, not BodyText.

```python
from hwpkit import cfb, charshape, records
from hwpkit.pipeline import docinfo_sid, file_header_compressed

# Load the container and pull out the raw DocInfo stream.
entries = cfb.load("template.hwp")
di_sid = docinfo_sid(entries)
raw = entries[di_sid].data

# Decompress first if the file header marks the streams as compressed.
if file_header_compressed(entries):
    raw = records.decompress(raw)
di = records.parse(raw)

i = charshape.find_charshape(di, 18)   # the 19th CharShape in DocInfo
charshape.flatten_to_face(di[i], 0)    # all slots → face_id 0
```
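If you are working on raw record bytes rather than through hwpkit, the same idea can be sketched directly. This assumes the seven face ids are stored as seven little-endian 16-bit words at the very start of the CharShape record body, which matches the HWP 5.0 layout as far as we know. `flatten_face_ids` is a hypothetical helper, not hwpkit's implementation.

```python
import struct

FACE_SLOTS = 7
FACE_BYTES = FACE_SLOTS * 2  # seven 16-bit face ids


def flatten_face_ids(body: bytes, face_id: int) -> bytes:
    """Return a CharShape body whose seven per-script face ids all equal face_id."""
    if len(body) < FACE_BYTES:
        raise ValueError("CharShape body is too short to hold seven face ids")
    # Rewrite only the leading face-id array; everything after it is kept as-is.
    return struct.pack("<7H", *([face_id] * FACE_SLOTS)) + body[FACE_BYTES:]
```

The record length does not change, so the record header and the surrounding DocInfo stream need no further adjustment beyond recompressing if the file is compressed.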

4. Why does `replace_text("")` corrupt the file?

replace_text(records, N, "") sets PARA_HEADER.chars to 1 and the PARA_TEXT body to \r (UTF-16). On its own, this paragraph opens fine in Hancom. But combined with other table-cell edits in the same document, Hancom flags the file as corrupted on open.

The reason: an originally empty HWP paragraph has chars == 1 and no PARA_TEXT record at all. A paragraph wiped via replace_text("") has chars == 1 plus a PARA_TEXT record whose body is \r. The two states are not equivalent. Hancom tolerates the inconsistency in isolation but trips a corruption check when sibling cells in the same table have also grown via narrative injects.

Empirical bisect on one document:

| Edits applied | Opens? |
| --- | --- |
| Minimum required only | ✅ |
| Min + 6 long narrative injects | ✅ |
| Min + all `replace_text` calls (including a replace-to-empty) | ✅ |
| Min + narratives + replace-to-empty | ❌ corrupted |
| Min + narratives + score boxes | ✅ |
| Min + narratives + date/sig replacements | ✅ |

Fix: don't wipe a paragraph to empty. Either leave template-hint text intact, or use " " or "—" as a placeholder. If you're hitting "corrupted" on open and one of your edits is a replace-to-empty, remove it and bisect from there.
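One way to enforce this rule is to guard the text before it ever reaches replace_text. The sketch below computes the chars count and PARA_TEXT body for plain text (no inline control characters) and silently substitutes a placeholder for an empty string. It assumes chars counts UTF-16 code units including the trailing \r, which is consistent with the chars == 1 case described above. `para_text_payload` is a hypothetical helper.

```python
def para_text_payload(text: str, placeholder: str = " ") -> tuple[int, bytes]:
    """Return (chars, PARA_TEXT body) for plain text, never producing an empty paragraph."""
    if text == "":
        # Avoid the chars == 1 plus lone "\r" state that differs from a true empty paragraph.
        text = placeholder
    body = (text + "\r").encode("utf-16-le")
    return len(body) // 2, body


chars, body = para_text_payload("")
print(chars, body)  # 2 b' \x00\r\x00'
```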

5. Why does my embedded image come out the wrong size?

You stored the image bytes correctly but guessed at the dimensions, so the picture renders tiny or gigantic. HWP stores picture extents in HWPUNIT = 1/7200 inch (the same unit as everything else — see RECORD_FORMAT.md), not pixels. To convert a bitmap's native pixel size to its 1:1 document extent you assume the standard 96 px/inch screen density:

1 px @ 96 dpi  =  7200 / 96  =  75 HWPUNIT
original_extent (w, h)  =  (px_w × 75, px_h × 75)

So the picture's original size is pixels × 75. The displayed size is whatever you want, set via the shape's current-size field (binary) or a scale matrix whose factors are display / native (HWPX). A 200×80 px seal at 1:1 is 15000 × 6000 HWPUNIT (≈ 2.08 × 0.83 in).

This is not an HWPX quirk — 75 is literally 7200/96, so the same arithmetic drives binary SHAPE_COMPONENT_PICTURE extents and HWPX <hp:orgSz> alike. See OBJECT_MODEL.md.
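The conversion is small enough to write out in full. This sketch uses hypothetical function names and takes the density as a parameter, defaulting to the 96 px/inch assumption stated above.

```python
HWPUNIT_PER_INCH = 7200


def px_to_hwpunit(px: int, dpi: int = 96) -> int:
    """Convert a pixel length to HWPUNIT at the given density (75 per px at 96 dpi)."""
    return round(px * HWPUNIT_PER_INCH / dpi)


def original_extent(px_w: int, px_h: int, dpi: int = 96) -> tuple[int, int]:
    """1:1 document extent (width, height) of a bitmap, in HWPUNIT."""
    return px_to_hwpunit(px_w, dpi), px_to_hwpunit(px_h, dpi)


print(original_extent(200, 80))  # (15000, 6000)
```

If the bitmap carries its own density metadata and you trust it, pass that value as `dpi` instead of relying on the default.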