hwpkit
The pure-Python toolkit for Korean HWP & HWPX (Hancom Office) documents. Read it, edit it, extract its text — no Hancom, no Windows, no COM automation.
Korean government, universities, courts, and most Korean enterprises run on
HWP (한글, Hangul Word Processor) — the .hwp and .hwpx formats from
Hancom Office. The rest of the world's tooling (python-docx, pdfplumber,
unstructured, LibreOffice) can't read them, and the few tools that can need a
Windows box with Hancom installed, driven over COM.
hwpkit is the missing piece. Pure Python, cross-platform, zero external
apps: install it and start reading, editing, and extracting text from Korean
documents in a couple of lines.
from hwpkit import extract_text_from_file
print(extract_text_from_file("계약서.hwp")) # …or .hwpx — auto-detected
That's it. No Hancom license. No Windows. No headless office server.
Why hwpkit
- 📄 Both formats, one API
  Binary .hwp (HWP 5.0) and XML .hwpx (OWPML). open_document() hands you the same editor either way, so you never branch on format.
- 🐍 Pure Python, runs anywhere
  Linux, macOS, Windows, containers, Lambda. No Hancom, no pywin32, no COM, no LibreOffice subprocess.
- 🤖 Built for LLM / RAG
  Clean Korean text out of any .hwp/.hwpx, ready to chunk and embed: the preprocessing step your retrieval pipeline was missing.
- ✍️ Edit without corrupting
  Fill government and university forms, tick checkboxes, rewrite cells. The binary container is rebuilt while preserving the directory tree Hancom validates on open.
- 🖋️ Insert seals & signatures
  Stamp a seal (도장), official seal (직인), or signature (서명) image into a form cell, in both .hwp and .hwpx.
- ⚖️ MIT licensed, tiny core
  Base install is just olefile; lxml and Pillow are optional extras, loaded lazily only when you use .hwpx or images.
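For the LLM / RAG case, the "ready to chunk" step could be sketched roughly as follows. Only extract_text_from_file comes from hwpkit, and this sketch assumes it returns a plain str. chunk_text, load_chunks, and the max_chars and overlap parameters are hypothetical names introduced here for illustration. A fixed-size character window is just one simple strategy, not something hwpkit prescribes.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars]
        if piece.strip():  # skip windows that are only whitespace
            chunks.append(piece)
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


def load_chunks(path: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Extract text from a .hwp/.hwpx file and chunk it (hypothetical helper)."""
    # Imported lazily so chunk_text stays usable without hwpkit installed.
    # Assumption: extract_text_from_file returns a str.
    from hwpkit import extract_text_from_file
    return chunk_text(extract_text_from_file(path), max_chars, overlap)
```

Character windows are a reasonable default for Korean text, where whitespace-based word counts are a poor proxy for length. A sentence-aware or token-aware splitter would slot into the same place.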
One API, both formats
from hwpkit import open_document
doc = open_document("template.hwp") # or "template.hwpx" — auto-detected
print(doc.describe()) # list paragraphs to find field indices
doc.inject_text(24, "홍길동") # fill an empty cell
doc.swap_in_para_text(40, "□ 석사", "☑ 석사") # tick a checkbox
doc.replace_text(75, "2026. 05. 19.") # overwrite a cell
doc.place_image(42, "seal.png", width_mm=30) # stamp a seal (도장) or signature
doc.save("out.hwp")
open_document returns an HwpFile or
HwpxFile depending on the file — both expose
the same methods.
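Because both classes expose the same methods, a form-filling routine can be written once and driven by data. The sketch below is illustrative only: fill_fields and fill_form are hypothetical helpers, and it assumes paragraph indices do not shift while cells are being filled, which the text above does not state. The only hwpkit names used are open_document, inject_text, and save, as shown in the example above.

```python
from typing import Mapping


def fill_fields(doc, values: Mapping[int, str]) -> None:
    """Write each value into the paragraph at its index (hypothetical helper)."""
    # Sorted so the fill order is deterministic regardless of dict order.
    for index in sorted(values):
        doc.inject_text(index, values[index])


def fill_form(template_path: str, out_path: str, values: Mapping[int, str]) -> None:
    """Open a template, fill it, and save the result (hypothetical helper)."""
    # Imported lazily so fill_fields can be exercised without hwpkit installed.
    from hwpkit import open_document
    doc = open_document(template_path)  # HwpFile or HwpxFile, format auto-detected
    fill_fields(doc, values)
    doc.save(out_path)
```

A call such as fill_form("template.hwp", "out.hwp", {24: "홍길동"}) would then fill one cell. The indices themselves would still come from inspecting doc.describe() for the specific template.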
Who uses this
- AI / RAG engineers indexing Korean documents into vector DBs.
- Gov-tech & RPA teams auto-filling government-office and university (관공서·대학) forms at scale.
- Data engineers migrating HWP archives to text / structured data.
- Anyone who needs to edit a .hwp without clicking through Hancom.
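For the archive-migration case, a batch export might look like the sketch below. export_archive and its extract parameter are hypothetical; the only hwpkit call is extract_text_from_file, assumed here to accept a path string and return a str. The output layout (one UTF-8 .txt per source file, mirroring the source tree) is an arbitrary choice for illustration.

```python
from pathlib import Path
from typing import Callable, List, Optional


def export_archive(src_dir: str, dst_dir: str,
                   extract: Optional[Callable[[str], str]] = None) -> List[Path]:
    """Write one UTF-8 .txt per .hwp/.hwpx under src_dir (hypothetical helper)."""
    if extract is None:
        # Imported lazily so the helper can be tested with a stub extractor.
        from hwpkit import extract_text_from_file as extract
    src, dst = Path(src_dir), Path(dst_dir)
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".hwp", ".hwpx"}:
            continue
        # Mirror the source tree and keep the original suffix in the name,
        # so a.hwp and a.hwpx do not overwrite each other.
        target = dst / path.relative_to(src).with_name(path.name + ".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(extract(str(path)), encoding="utf-8")
        written.append(target)
    return written
```

Passing the extractor as a parameter keeps the directory walk testable on its own. In a real migration it may also be worth catching per-file errors so that one unreadable document does not stop the whole run.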
Next steps
- Quickstart: extract, edit, and stamp in a few lines.
- hwpkit vs alternatives: how it compares to pyhwp, pyhwpx, and olefile.
- API reference: every public function and class.
- Internals: how .hwp and .hwpx map onto one model.
Made by Ebenworks · MIT licensed.