API reference¶
Auto-generated from the source docstrings.
Unified entry point¶
The recommended way in — open any .hwp or .hwpx and get a uniform editor.
hwpkit.hwp.open_document ¶
Open a .hwp or .hwpx and return a uniform editor — HwpFile or
HwpxFile, detected by container content (not extension). Both expose the
same methods, so downstream code doesn't care which format it got.
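Detection by content rather than extension can be pictured with the leading bytes of the two containers: a binary .hwp is an MS-CFB file, and a .hwpx is a ZIP archive. The sketch below is an illustration only, not hwpkit's implementation; sniff_container is a hypothetical helper built on the standard library.

```python
import zipfile

# Signature that starts every MS-CFB (OLE2 compound) file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(path):
    """Return "hwp", "hwpx", or None based on file content, not the name."""
    with open(path, "rb") as fh:
        head = fh.read(len(CFB_MAGIC))
    if head == CFB_MAGIC:
        return "hwp"
    if zipfile.is_zipfile(path):
        # Assumption: any ZIP counts here; the mimetype part is not checked.
        return "hwpx"
    return None
```

A file named a.hwpx that actually holds a CFB container would be reported as "hwp" by this check, which is the behaviour the description above implies.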
Editors¶
Both classes expose the same methods, so code written against one works against the other.
hwpkit.hwp.HwpFile ¶
Load a binary .hwp, edit it in memory, and save a new file.
Paragraphs are indexed in document order across every BodyText/Section*
stream, matching HwpxFile's convention.
place_image ¶
Embed image_path and anchor it in paragraph index. width_mm
sets the displayed width (height follows aspect); native size if
omitted. Returns the new bin id. Needs Pillow (pip install hwpkit[image]).
save ¶
Write a new .hwp. Only edited sections (and DocInfo, if an image
was added) are re-serialized; the rest of the container is rebuilt
from the original entries with the directory tree preserved.
hwpkit.hwpx.HwpxFile ¶
In-memory editor for a .hwpx. Mirrors the binary edit API
(inject_text / replace_text / swap_in_para_text) but on OWPML, so
the same calling code works across both Hancom formats.
Paragraphs are indexed in document order across every section (nested
table-cell paragraphs included), matching hwpkit.records.index_paragraphs.
There is no layout cache to invalidate — Hancom recomputes from the XML.
from hwpkit.hwpx import HwpxFile
doc = HwpxFile.open("form.hwpx")
doc.replace_text(12, "홍길동")
doc.swap_in_para_text(8, "□ 동의", "☑ 동의")
doc.save("out.hwpx")
paragraphs ¶
Yield (index, text) for every paragraph — the HWPX analogue of
records.describe, for locating which index is which form field.
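A small lookup helper shows how the (index, text) pairs can be used to find a form field. find_paragraphs is a hypothetical name; it assumes only that doc has a paragraphs() method behaving as described above.

```python
def find_paragraphs(doc, needle):
    """Return the indices of paragraphs whose text contains `needle`."""
    return [index for index, text in doc.paragraphs() if needle in text]
```

The returned indices are the values that replace_text, inject_text, and swap_in_para_text expect.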
replace_text ¶
Rewrite the entire text content of paragraph index (keeps the first
run's character formatting; text in any additional runs is dropped).
inject_text ¶
Fill an empty paragraph. Raises if it already has text (use replace_text to overwrite).
swap_in_para_text ¶
Replace the first occurrence of old with new inside paragraph
index (e.g. a checkbox □ → ☑). old must lie within a single text
run.
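The single-run constraint can be illustrated on a paragraph modelled as a plain list of run texts. This is a toy model, not the OWPML handling in hwpkit; swap_in_runs is a hypothetical helper, and raising ValueError on a miss is an assumption about reasonable behaviour rather than documented library behaviour.

```python
def swap_in_runs(runs, old, new):
    """Replace the first occurrence of `old` that sits inside one run.

    `runs` is a list of run texts; a new list is returned.
    Raises ValueError when no single run contains `old`.
    """
    for i, text in enumerate(runs):
        if old in text:
            return runs[:i] + [text.replace(old, new, 1)] + runs[i + 1:]
    raise ValueError(f"{old!r} not found within a single run")
```

With runs ["□ ", "동의"], searching for "□ 동의" fails because the text straddles two runs, while searching for "□" alone succeeds.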
place_image ¶
Embed image_path and anchor it (inline, treat-as-char) at the end
of paragraph paragraph_index. width_mm sets the displayed width
(height follows aspect); native size if omitted. Returns the image id.
Needs Pillow (pip install hwpkit[image]).
save ¶
Write the (possibly edited) document. Unmodified parts are copied byte-for-byte; only changed XML parts (sections / content.hpf) are re-serialized; newly added parts (BinData images) are appended. The mimetype entry keeps its original (stored) compression and position.
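The ordering and compression rules can be sketched with the standard zipfile module. This is a simplified illustration under stated assumptions: rewrite_zip is a hypothetical helper, it decompresses and recompresses unchanged parts instead of copying their bytes verbatim, and it does not append new parts.

```python
import zipfile


def rewrite_zip(src, dst, replacements):
    """Copy archive `src` to `dst`, swapping in bytes from `replacements`.

    `replacements` maps part names to new bytes. Entry order and each
    entry's compression method are kept, so a stored, first-in-archive
    mimetype part stays that way.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.filename in replacements:
                data = replacements[info.filename]
            else:
                data = zin.read(info)
            # Fresh ZipInfo carrying over name, timestamp, and compression.
            out = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            out.compress_type = info.compress_type
            out.external_attr = info.external_attr
            zout.writestr(out, data)
```

Both src and dst may be paths or file-like objects, which makes the helper easy to exercise in memory.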
Text extraction¶
hwpkit.extract.extract_text_from_file ¶
Extract plain text from either a binary .hwp or an XML .hwpx,
dispatching on the actual container (not just the file extension). One
entry point for both Hancom formats — handy for RAG over mixed corpora.
hwpkit.extract.extract_text_from_hwp ¶
Read an HWP and return its plain text content across all sections.
hwpkit.hwpx.extract_text_from_hwpx ¶
Read a .hwpx and return its plain text — one line per paragraph, in document order, across every section (table-cell text included).
hwpkit.hwpx.is_hwpx ¶
True if path looks like a .hwpx (ZIP with an application/hwp+zip
mimetype part), regardless of extension.
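A check of this kind can be sketched with the standard library alone. looks_like_hwpx is a hypothetical stand-in, not the hwpkit function; it assumes the part is named mimetype and tolerates trailing whitespace in its content.

```python
import zipfile

HWPX_MIMETYPE = b"application/hwp+zip"


def looks_like_hwpx(path):
    """True if `path` is a ZIP whose `mimetype` part names the HWPX type."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        try:
            return zf.read("mimetype").strip() == HWPX_MIMETYPE
        except KeyError:  # the archive has no part named "mimetype"
            return False
```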
Functional helpers (binary .hwp)¶
The original record-level API. fill_hwp hands your callback the parsed record
list; the editors below mutate it in place.
hwpkit.pipeline.fill_hwp ¶
Open input_path, parse Section0, call edit_fn(records) to mutate
in place, then write the new HWP to output_path.
Returns (raw_in, raw_out, comp_in, comp_out) sizes for logging.
hwpkit.records.inject_text ¶
Fill an empty paragraph (chars==1, no PARA_TEXT) with the given text.
The trailing record-end \r is added automatically and counted. Use \n for soft line breaks inside the paragraph. The cached PARA_LINE_SEG is replaced with a 36-byte dummy so Hancom recomputes layout.
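The "added automatically and counted" rule can be made concrete with a short encoding sketch. encode_para_text is a hypothetical helper; it assumes the body is UTF-16LE and that the count is taken in UTF-16 code units, which may differ in detail from what hwpkit does internally.

```python
def encode_para_text(text):
    """Return (payload, nchars) for a paragraph body.

    The paragraph-terminating carriage return is appended and included
    in the count. `nchars` counts UTF-16 code units (an assumption), so
    a character outside the BMP counts as two.
    """
    body = (text + "\r").encode("utf-16-le")
    return body, len(body) // 2
```

Under these assumptions an empty string yields a count of 1, which lines up with the chars==1 state of an empty paragraph mentioned above.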
hwpkit.records.replace_text ¶
Rewrite a paragraph's PARA_TEXT body entirely. Falls through to inject_text if the paragraph has no PARA_TEXT yet.
WARNING: do not call with text="" on a paragraph that originally had non-empty PARA_TEXT. The resulting (chars=1, PARA_TEXT=\r) state opens fine in isolation but can corrupt the file when combined with other table-cell edits. Use " " or "—" as a placeholder instead.
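One way to respect this warning is to normalise the value before it reaches replace_text. safe_replacement is a hypothetical guard, shown only as a sketch of the placeholder advice above.

```python
def safe_replacement(text, placeholder=" "):
    """Return `text`, or `placeholder` when `text` is empty."""
    return text if text else placeholder
```

The result can then be passed as the text argument of replace_text, so an empty form value never produces the empty-paragraph state.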
hwpkit.records.swap_in_para_text ¶
Replace UTF-16LE bytes of old with new inside paragraph N's
PARA_TEXT. Requires len(old) == len(new) so byte length is preserved
and the cached PARA_LINE_SEG remains valid (no dummy needed).
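The length-preserving swap can be sketched directly on UTF-16LE bytes. swap_utf16 is a hypothetical helper, not the hwpkit function: it compares encoded byte lengths, replaces only the first aligned match, and raises ValueError on failure, all of which are assumptions made for the illustration.

```python
def swap_utf16(body, old, new):
    """Replace the first occurrence of `old` with `new` in UTF-16LE `body`.

    Raises ValueError if the encoded lengths differ or `old` is absent.
    """
    old_b = old.encode("utf-16-le")
    new_b = new.encode("utf-16-le")
    if len(old_b) != len(new_b):
        raise ValueError("old and new must encode to the same byte length")
    # Accept only matches on 2-byte boundaries, i.e. whole code units.
    pos = body.find(old_b)
    while pos != -1 and pos % 2:
        pos = body.find(old_b, pos + 1)
    if pos == -1:
        raise ValueError("old text not found")
    return body[:pos] + new_b + body[pos + len(old_b):]
```

Because the output has the same length as the input, nothing that depends on the size of the text body has to change.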
hwpkit.records.extract_text ¶
Extract plain text from a parsed BodyText/Section* record list.
Returns one line per paragraph. Inline controls (tables, images, footnote refs, etc.) are stripped — only literal character content is returned. For semantic / structural conversion (HWP → XML / OWPML) use pyhwp instead; that's a much bigger job and out of scope here.
Soft line breaks (0x0A) become \n. Tabs (0x09) become \t. The paragraph-terminating 0x0D is stripped from each line.
hwpkit.records.describe ¶
Return a human-readable dump (one line per record).
Each PARA_HEADER line is prefixed with Pn where n is the paragraph
index — pass that n to inject_text / replace_text / swap_in_para_text.
Functional helpers (.hwpx)¶
hwpkit.hwpx.fill_hwpx ¶
Open input_path, hand the HwpxFile to edit_fn to mutate, then save
to output_path — the HWPX analogue of hwpkit.fill_hwp.
from hwpkit.hwpx import fill_hwpx

def edit(doc):
    doc.replace_text(12, "홍길동")

fill_hwpx("form.hwpx", "out.hwpx", edit)
Image insertion¶
hwpkit.picture.place_image ¶
place_image(
    input_path: str,
    output_path: str,
    image_path: str,
    paragraph_index: int,
    width_mm: Optional[float] = None,
    section: str = "Section0",
)
Embed image_path into input_path and anchor it in paragraph
paragraph_index of the given BodyText section; write output_path.
width_mm sets the displayed width (height follows the image's aspect
ratio); if omitted, the image is shown at its native pixel size.
Returns the new bin id.
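The width_mm rule amounts to simple arithmetic. The sketch below assumes sizes are expressed in HWPUNIT (1/7200 inch, the unit used by the HWP format) and that values are rounded to whole units; display_size is a hypothetical helper, and hwpkit's own rounding may differ.

```python
HWPUNIT_PER_INCH = 7200  # HWP lengths are stored in 1/7200 inch units
MM_PER_INCH = 25.4


def display_size(px_width, px_height, width_mm):
    """Return (width, height) in HWPUNIT for an image shown `width_mm` wide."""
    width = round(width_mm / MM_PER_INCH * HWPUNIT_PER_INCH)
    # Height follows the pixel aspect ratio of the source image.
    height = round(width * px_height / px_width)
    return width, height
```

For an 800 x 600 pixel image displayed 25.4 mm wide, this gives 7200 x 5400 HWPUNIT.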
Low-level CFB container¶
The MS-CFB reader/writer that makes corruption-free rewrites possible.
hwpkit.cfb.load ¶
Load a CFB into a dict of sid -> DirEntryOut, preserving tree topology. Stream data is in entry.data; storages have data=None.
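The data convention makes it easy to tell streams from storages. stream_sizes is a hypothetical helper that relies only on what is stated above: a dict keyed by sid whose values expose a data attribute.

```python
def stream_sizes(entries):
    """Map sid -> byte length for every stream in a loaded CFB."""
    return {
        sid: len(entry.data)
        for sid, entry in entries.items()
        if entry.data is not None  # storages carry data=None
    }
```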
hwpkit.cfb.dump ¶
Write a new CFB. If target_sid_to_replace is given, replace that
stream's data with new_data before writing.
hwpkit.cfb.add_stream ¶
Add a stream entry named name holding data (bytes) under parent_sid
(default root). Returns the new sid. Raises if the name already exists.
hwpkit.cfb.add_storage ¶
Add a storage (folder) entry under parent_sid (default root). Returns
its sid. Safe to call repeatedly: if a child of that name already exists, its sid is returned.
hwpkit.cfb.find_entry ¶
Resolve a storage/stream path (e.g. find_entry(e, "BinData", "BIN0001.png")) to its sid, or None. Root children are looked up under sid 0.