
API reference

Auto-generated from the source docstrings.

Unified entry point

The recommended way in — open any .hwp or .hwpx and get a uniform editor.

hwpkit.hwp.open_document

open_document(path: str)

Open a .hwp or .hwpx and return a uniform editor — HwpFile or HwpxFile, detected by container content (not extension). Both expose the same methods, so downstream code doesn't care which format it got.
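
As a rough illustration of detecting the format by content rather than by extension, the sketch below classifies a file from its leading bytes. It is not hwpkit's implementation: sniff_container and sniff_path are hypothetical names, and the only facts relied on are the standard MS-CFB and ZIP signatures.

```python
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # MS-CFB (OLE2) file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header signature


def sniff_container(head: bytes) -> str:
    """Classify a file by its first bytes, ignoring its extension.

    Hypothetical helper, not part of hwpkit.
    """
    if head.startswith(CFB_MAGIC):
        return "hwp"      # binary .hwp is stored in a CFB container
    if head.startswith(ZIP_MAGIC):
        return "hwpx"     # .hwpx is a ZIP package; a mimetype check is still advisable
    return "unknown"


def sniff_path(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A renamed file (for example a .hwpx saved with a .hwp extension) is still classified by what it contains.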

Editors

Both classes expose the same methods, so code written against one works against the other.
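
Because the two editors share method names, a helper can be written once against that shape. The sketch below assumes only that the object has a replace_text(index, text) method as documented on this page; fill_fields is a hypothetical name, and the commented usage assumes hwpkit is installed.

```python
from typing import Mapping


def fill_fields(doc, fields: Mapping[int, str]) -> int:
    """Overwrite each paragraph index in `fields` with its text.

    `doc` is any object with replace_text(index, text), so either editor
    should work. Returns the number of paragraphs written.
    Hypothetical helper, not part of hwpkit.
    """
    count = 0
    for index, text in sorted(fields.items()):  # ascending index order
        doc.replace_text(index, text)
        count += 1
    return count


# Usage (not executed here), assuming hwpkit is installed:
#   from hwpkit.hwp import open_document
#   doc = open_document("form.hwp")      # or "form.hwpx"
#   fill_fields(doc, {12: "홍길동"})
#   doc.save("out.hwp")
```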

hwpkit.hwp.HwpFile

Load a binary .hwp, edit it in memory, and save a new file.

Paragraphs are indexed in document order across every BodyText/Section* stream, matching HwpxFile's convention.

place_image

place_image(
    index: int,
    image_path: str,
    width_mm: Optional[float] = None,
) -> int

Embed image_path and anchor it in paragraph index. width_mm sets the displayed width (height follows aspect); native size if omitted. Returns the new bin id. Needs Pillow (pip install hwpkit[image]).

save

save(out_path: str)

Write a new .hwp. Only edited sections (and DocInfo, if an image was added) are re-serialized; the rest of the container is rebuilt from the original entries with the directory tree preserved.

hwpkit.hwpx.HwpxFile

In-memory editor for a .hwpx. Mirrors the binary edit API (inject_text / replace_text / swap_in_para_text) but on OWPML, so the same calling code works across both Hancom formats.

Paragraphs are indexed in document order across every section (nested table-cell paragraphs included), matching hwpkit.records.index_paragraphs. There is no layout cache to invalidate — Hancom recomputes from the XML.

from hwpkit.hwpx import HwpxFile
doc = HwpxFile.open("form.hwpx")
doc.replace_text(12, "홍길동")
doc.swap_in_para_text(8, "□ 동의", "☑ 동의")
doc.save("out.hwpx")

paragraphs

paragraphs()

Yield (index, text) for every paragraph — the HWPX analogue of records.describe, for locating which index is which form field.
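
One way to use these pairs is to search them for a label and collect the matching indices. A minimal sketch, assuming only that the input is an iterable of (index, text) tuples; find_paragraphs is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def find_paragraphs(pairs: Iterable[Tuple[int, str]], needle: str) -> List[int]:
    """Return the indices of all paragraphs whose text contains `needle`.

    Hypothetical helper, not part of hwpkit.
    """
    return [index for index, text in pairs if needle in text]


# Usage (not executed here), assuming hwpkit is installed:
#   from hwpkit.hwpx import HwpxFile
#   doc = HwpxFile.open("form.hwpx")
#   print(find_paragraphs(doc.paragraphs(), "동의"))
```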

replace_text

replace_text(index: int, text: str)

Rewrite paragraph index's entire text content (keeps the first run's char formatting; drops any extra runs' text).

inject_text

inject_text(index: int, text: str)

Fill an empty paragraph. Raises if it already has text (use replace_text to overwrite).

swap_in_para_text

swap_in_para_text(index: int, old: str, new: str)

Replace the first occurrence of old with new inside paragraph index (e.g. a checkbox □ → ☑). old must lie within a single text run.
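
To see why old must sit inside a single run, it can help to model a paragraph as a list of run strings. The sketch below is a toy model only: swap_in_runs is hypothetical, and raising ValueError when no single run contains old is an assumption, not documented hwpkit behaviour.

```python
from typing import List


def swap_in_runs(runs: List[str], old: str, new: str) -> List[str]:
    """Replace the first occurrence of `old` that lies inside one run.

    Toy model of a paragraph as a list of text runs; not hwpkit code.
    """
    if not old:
        raise ValueError("old must be non-empty")
    for i, run in enumerate(runs):
        if old in run:
            # Only this run changes; its neighbours keep their text.
            return runs[:i] + [run.replace(old, new, 1)] + runs[i + 1:]
    # Text split across two runs is never matched by a per-run search.
    raise ValueError(f"{old!r} not found within a single run")
```

In this model, ["□ 동", "의"] does not match "동의" even though the joined text contains it, which is the situation the single-run requirement describes.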

place_image

place_image(
    paragraph_index: int,
    image_path: str,
    width_mm: Optional[float] = None,
) -> str

Embed image_path and anchor it (inline, treat-as-char) at the end of paragraph paragraph_index. width_mm sets the displayed width (height follows aspect); native size if omitted. Returns the image id.

Needs Pillow (pip install hwpkit[image]).

save

save(out_path: str)

Write the (possibly edited) document. Unmodified parts are copied byte-for-byte; only changed XML parts (sections / content.hpf) are re-serialized; newly added parts (BinData images) are appended. The mimetype entry keeps its original (stored) compression and position.
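
The mimetype handling can be approximated with the standard zipfile module. This sketch is a simplification under stated assumptions: rewrite_package is a hypothetical name, and unlike the behaviour described above it re-compresses unchanged entries rather than copying them byte-for-byte. It does keep entry order and each entry's original compression method, so a stored mimetype entry stays stored and stays in place.

```python
import zipfile


def rewrite_package(src, dst, replacements):
    """Copy a ZIP package, swapping in `replacements` (entry name -> bytes).

    `src` and `dst` may be paths or binary file objects.
    Hypothetical helper, not part of hwpkit.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():          # infolist() follows archive order
            if info.filename in replacements:
                data = replacements[info.filename]
            else:
                data = zin.read(info.filename)
            # Reuse the original compression method for every entry, so an
            # uncompressed (stored) mimetype entry is written stored again.
            zout.writestr(info.filename, data, compress_type=info.compress_type)
```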

Text extraction

hwpkit.extract.extract_text_from_file

extract_text_from_file(path: str) -> str

Extract plain text from either a binary .hwp or an XML .hwpx, dispatching on the actual container (not just the file extension). One entry point for both Hancom formats — handy for RAG over mixed corpora.

hwpkit.extract.extract_text_from_hwp

extract_text_from_hwp(path: str) -> str

Read an HWP and return its plain text content across all sections.

hwpkit.hwpx.extract_text_from_hwpx

extract_text_from_hwpx(path: str) -> str

Read a .hwpx and return its plain text — one line per paragraph, in document order, across every section (table-cell text included).

hwpkit.hwpx.is_hwpx

is_hwpx(path: str) -> bool

True if path looks like a .hwpx (ZIP with an application/hwp+zip mimetype part), regardless of extension.
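
A comparable check can be sketched with the standard library alone. This is not hwpkit's code: looks_like_hwpx is a hypothetical name, and stripping whitespace before comparing is an assumption.

```python
import zipfile

HWPX_MIMETYPE = b"application/hwp+zip"


def looks_like_hwpx(source) -> bool:
    """True if `source` is a ZIP whose mimetype entry declares HWPX.

    `source` may be a path or a seekable binary file object.
    Hypothetical helper, not part of hwpkit.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as zf:
        try:
            declared = zf.read("mimetype")
        except KeyError:            # ZIP archive without a mimetype entry
            return False
    # Assumption: tolerate surrounding whitespace or a trailing newline.
    return declared.strip() == HWPX_MIMETYPE
```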

Functional helpers (binary .hwp)

The original record-level API. fill_hwp hands your callback the parsed record list; the editors below mutate it in place.

hwpkit.pipeline.fill_hwp

fill_hwp(
    input_path: str,
    output_path: str,
    edit_fn: Callable[[List[dict]], None],
)

Open input_path, parse Section0, call edit_fn(records) to mutate in place, then write the new HWP to output_path.

Returns (raw_in, raw_out, comp_in, comp_out) sizes for logging.
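
For logging, the returned tuple can be turned into a one-line summary. A small sketch, assuming the four values are integer byte counts in the order given above; format_sizes is a hypothetical helper.

```python
from typing import Tuple


def format_sizes(sizes: Tuple[int, int, int, int]) -> str:
    """Render (raw_in, raw_out, comp_in, comp_out) as one log line.

    Hypothetical helper, not part of hwpkit.
    """
    raw_in, raw_out, comp_in, comp_out = sizes
    return (
        f"raw {raw_in} -> {raw_out} bytes ({raw_out - raw_in:+d}), "
        f"compressed {comp_in} -> {comp_out} bytes ({comp_out - comp_in:+d})"
    )


# Usage (not executed here), assuming hwpkit is installed:
#   from hwpkit.pipeline import fill_hwp
#   from hwpkit.records import replace_text
#   sizes = fill_hwp("in.hwp", "out.hwp", lambda recs: replace_text(recs, 12, "홍길동"))
#   print(format_sizes(sizes))
```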

hwpkit.records.inject_text

inject_text(records, paragraph_index, text: str)

Fill an empty paragraph (chars==1, no PARA_TEXT) with the given text.

The trailing record-end \r is added automatically and counted. Use \n for soft line breaks inside the paragraph. The cached PARA_LINE_SEG is replaced with a 36-byte dummy so Hancom recomputes layout.
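
The counting rule can be illustrated on the text payload alone. This sketch leaves out record headers and the PARA_LINE_SEG handling entirely. encode_para_text is a hypothetical name, and counting in 16-bit units is an assumption based on the UTF-16LE encoding mentioned elsewhere on this page.

```python
from typing import Tuple


def encode_para_text(text: str) -> Tuple[bytes, int]:
    """Encode paragraph text plus its terminating \\r.

    Returns (payload, nchars). Hypothetical helper, not part of hwpkit.
    """
    payload = (text + "\r").encode("utf-16-le")
    # Assumption: the character count is in 16-bit units and includes the
    # terminator, so an empty paragraph has a count of 1.
    nchars = len(payload) // 2
    return payload, nchars
```

A soft line break is passed as "\n" inside text and occupies one unit like any other character.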

hwpkit.records.replace_text

replace_text(records, paragraph_index, text: str)

Rewrite a paragraph's PARA_TEXT body entirely. Falls through to inject_text if the paragraph has no PARA_TEXT yet.

WARNING: do not call with text="" on a paragraph that originally had non-empty PARA_TEXT. The resulting (chars=1, PARA_TEXT=\r) state opens fine in isolation but can corrupt the file when combined with other table-cell edits. Use " " or "—" as a placeholder instead.
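
A caller can guard against the empty-string case before the text ever reaches replace_text. A minimal sketch; safe_replacement is a hypothetical helper and the default placeholder is a single space, as suggested above.

```python
def safe_replacement(text: str, placeholder: str = " ") -> str:
    """Return `text`, or `placeholder` when `text` is empty.

    Hypothetical helper, not part of hwpkit.
    """
    if not placeholder:
        raise ValueError("placeholder must be non-empty")
    return text if text != "" else placeholder


# Usage (not executed here), assuming hwpkit is installed:
#   replace_text(records, 7, safe_replacement(user_value))
```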

hwpkit.records.swap_in_para_text

swap_in_para_text(
    records, paragraph_index, old: str, new: str
)

Replace the UTF-16LE bytes of old with new inside the PARA_TEXT of paragraph paragraph_index. Requires len(old) == len(new), so the byte length is preserved and the cached PARA_LINE_SEG remains valid (no dummy needed).
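
The length-preserving idea can be shown on a raw UTF-16LE buffer. This is a sketch rather than the library's code: swap_same_length is a hypothetical name, it compares encoded byte lengths, and replacing only the first match is an assumption.

```python
def swap_same_length(body: bytes, old: str, new: str) -> bytes:
    """Replace the first `old` with `new` in a UTF-16LE buffer, keeping its size.

    Hypothetical helper, not part of hwpkit.
    """
    old_b = old.encode("utf-16-le")
    new_b = new.encode("utf-16-le")
    if not old_b:
        raise ValueError("old must be non-empty")
    if len(old_b) != len(new_b):
        raise ValueError("old and new must encode to the same number of bytes")
    # Accept only matches on a 2-byte boundary, so a hit can never start in
    # the middle of a 16-bit unit.
    pos = body.find(old_b)
    while pos != -1 and pos % 2:
        pos = body.find(old_b, pos + 1)
    if pos == -1:
        raise ValueError("old not found")
    return body[:pos] + new_b + body[pos + len(old_b):]
```

Since the output has exactly the same length as the input, anything that depends on byte offsets into the paragraph text is left undisturbed.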

hwpkit.records.extract_text

extract_text(records) -> str

Extract plain text from a parsed BodyText/Section* record list.

Returns one line per paragraph. Inline controls (tables, images, footnote refs, etc.) are stripped — only literal character content is returned. For semantic / structural conversion (HWP → XML / OWPML) use pyhwp instead; that's a much bigger job and out of scope here.

Soft line breaks (0x0A) become \n. Tabs (0x09) become \t. The paragraph-terminating 0x0D is stripped from each line.

hwpkit.records.describe

describe(records, limit=None)

Return a human-readable dump (one line per record).

Each PARA_HEADER line is prefixed with Pn where n is the paragraph index — pass that n to inject_text / replace_text / swap_in_para_text.

Functional helpers (.hwpx)

hwpkit.hwpx.fill_hwpx

fill_hwpx(input_path: str, output_path: str, edit_fn)

Open input_path, hand the HwpxFile to edit_fn to mutate, then save to output_path — the HWPX analogue of hwpkit.fill_hwp.

from hwpkit.hwpx import fill_hwpx

def edit(doc):
    doc.replace_text(12, "홍길동")

fill_hwpx("form.hwpx", "out.hwpx", edit)

Image insertion

hwpkit.picture.place_image

place_image(
    input_path: str,
    output_path: str,
    image_path: str,
    paragraph_index: int,
    width_mm: Optional[float] = None,
    section: str = "Section0",
)

Embed image_path into input_path and anchor it in paragraph paragraph_index of the given BodyText section; write output_path.

width_mm sets the displayed width (height follows the image's aspect ratio); if omitted, the image is shown at its native pixel size. Returns the new bin id.
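
The "height follows aspect ratio" rule is plain proportional scaling. The sketch below shows the arithmetic only. scaled_size_mm is a hypothetical helper, and how hwpkit reads pixel dimensions or converts millimetres to its internal units is not covered.

```python
from typing import Tuple


def scaled_size_mm(width_mm: float, px_width: int, px_height: int) -> Tuple[float, float]:
    """Return (width_mm, height_mm) with the image's aspect ratio preserved.

    Hypothetical helper, not part of hwpkit.
    """
    if px_width <= 0 or px_height <= 0:
        raise ValueError("pixel dimensions must be positive")
    height_mm = width_mm * px_height / px_width
    return width_mm, height_mm
```

For example, an 800 x 600 pixel image placed at width_mm=40 would be displayed 30 mm tall.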

Low-level CFB container

The MS-CFB reader/writer that makes corruption-free rewrites possible.

hwpkit.cfb.load

load(path: str)

Load a CFB into a dict of sid -> DirEntryOut, preserving tree topology. Stream data is in entry.data; storages have data=None.

hwpkit.cfb.dump

dump(
    entries,
    out_path,
    target_sid_to_replace=None,
    new_data=None,
)

Write a new CFB. If target_sid_to_replace is given, replace that stream's data with new_data before writing.

hwpkit.cfb.add_stream

add_stream(entries, name, data, parent_sid=0)

Add a stream entry named name holding data (bytes) under parent_sid (default root). Returns the new sid. Raises if the name already exists.

hwpkit.cfb.add_storage

add_storage(entries, name, parent_sid=0)

Add a storage (folder) entry under parent_sid (default root). Returns its sid. Idempotent: if a child with that name already exists, its sid is returned instead of a new entry being created.

hwpkit.cfb.find_entry

find_entry(entries, *path)

Resolve a storage/stream path (e.g. find_entry(e, "BinData", "BIN0001.png")) to its sid, or None. Root children are looked up under sid 0.
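
Path resolution of this kind amounts to walking the directory tree one name at a time, starting from sid 0. The sketch below uses a toy model with parent pointers. Node and resolve are hypothetical and do not reflect the fields of DirEntryOut or how a real CFB directory links siblings.

```python
from typing import Dict, NamedTuple, Optional


class Node(NamedTuple):
    """Toy directory entry (assumption: not hwpkit's DirEntryOut)."""
    name: str
    parent: Optional[int]   # sid of the containing storage; None for the root


def resolve(entries: Dict[int, Node], *path: str) -> Optional[int]:
    """Walk `path` from the root (sid 0); return the final sid or None."""
    sid: Optional[int] = 0
    for part in path:
        # Look for a child of the current storage with a matching name.
        sid = next(
            (s for s, node in entries.items()
             if node.parent == sid and node.name == part),
            None,
        )
        if sid is None:
            return None
    return sid
```

Names are compared exactly in this toy version; a real implementation may need different comparison rules.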