Blog

Tutorials and notes on working with Korean HWP / HWPX (Hancom Office) documents in Python — text extraction for RAG, form automation, and the binary-format internals behind hwpkit.

Reading HWP and HWPX files and extracting text in Python

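Public institutions, universities, courts, and most companies in Korea exchange documents in HWP (Hangul Word Processor, the .hwp and .hwpx formats of Hancom Office). Yet tools such as python-docx, pdfplumber, and unstructured cannot read HWP, and pyhwpx, which can, requires Windows, a Hancom installation, and COM, so it cannot be used on a Linux server, in CI, or in a container.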

This post covers how to read, extract text from, and edit .hwp and .hwpx files in pure Python with hwpkit. It runs anywhere, with no Hancom installation and no Windows.

Building a RAG pipeline over Korean HWP documents

Most Retrieval-Augmented Generation (RAG) tutorials assume your corpus is PDFs or Markdown. But if you're building AI over Korean enterprise or government data, your corpus is HWP — the .hwp / .hwpx formats from Hancom Office. And that's where pipelines quietly break: the standard ingestion stack (pdfplumber, python-docx, unstructured) can't read HWP at all, so the documents never make it into your vector store.

This guide walks through a complete RAG ingestion pipeline over Korean HWP documents in Python, using hwpkit as the extraction step — pure Python, no Hancom, no Windows.

How to read and extract text from Korean HWP files in Python

If you've ever tried to open a Korean .hwp file in Python, you already know the problem: python-docx doesn't touch it, pdfplumber is for PDFs, and unstructured skips it. HWP — 한글, the Hangul Word Processor format from Hancom Office — is the default document format across Korean government, universities, courts, and enterprises, yet almost no portable tooling reads it.

This guide shows how to read, extract text from, and edit both .hwp and .hwpx files in pure Python using hwpkit — no Hancom installation and no Windows required, so it runs on a Linux server, in CI, or in a container.