
Restoring Korean Cipher Text with AI

Tags: Research, AI
Written: 2025.01

Main goal: find a model that can restore Korean cipher text (example: Lr๋Š” ใ„ฑใ… ๋” ๋ˆˆ๋ฌผ์„ ํ˜๋ฆฐใ„ทr...) into normal Korean.

Related Media: ChatGPT o1

Test on AI Services

  • Test Prompt
2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ] ์œ„ ํ…์ŠคํŠธ๋ฅผ ์ •์ƒ์ ์ธ ํ•œ๊ธ€๋กœ ๋ณต์›ํ•ด๋ด ( Restore above text into normal Korean )
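For reference, the same prompt can also be sent through the API rather than the web UI. Below is a minimal sketch using the OpenAI Python SDK; the model name and message structure are my assumptions, not the exact setup behind the screenshots.

```python
# Sketch only (assumes the OpenAI Python SDK v1+ and OPENAI_API_KEY in the environment).
# The cipher string is the test prompt above, re-typed in properly encoded Korean.
from openai import OpenAI

client = OpenAI()

cipher = "2O19.12월 초 긴 급 ( ㅇㅣㅅㅣ언 왕 ㅈ ㅣ ㅎ ㅖ ) [ 죽 였 ㄷㅏ ㅇㅏㄴㅐ를 ]"
prompt = f"{cipher} 위 텍스트를 정상적인 한글로 복원해봐 (Restore the above text into normal Korean)"

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; o1 / o1-mini were also tested
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```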

Results

| Model Name | Model Size | Model Type | Result | API |
| --- | --- | --- | --- | --- |
| GPT 4o / o1 / o1-mini | 70B+ | Multimodal LLM | O | O |
| Gemini 1.5 / 2.0 | 70B+ | Multimodal LLM | O | O |
| Clova AI | 204B | Multimodal LLM | O | O |
Detailed Results

OpenAI ChatGPT (Success)

(screenshots)

Google Gemini (Success)

(screenshot)

Naver Clova AI (Success) (OCR, Text)

(screenshot)

Local Model Tests (Ollama, Hugging Face)

  • Spec: RTX A5000 × 6

Test Resources (Prompt Engineering)

  • Type A

PROMPT = '''You are a helpful AI assistant specializing in Korean text restoration. Please restore the following fragmented Korean text into its original form: ๋‹น์‹ ์€ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ๋ณต์›์— ํŠนํ™”๋œ ์œ ๋Šฅํ•œ AI ์–ด์‹œ์Šคํ„ดํŠธ์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ์ฃผ์–ด์ง„ ๋ถ„๋ฆฌ๋œ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ๋ฅผ ์›๋ž˜์˜ ํ˜•ํƒœ๋กœ ๋ณต์›ํ•ด์ฃผ์„ธ์š”.'''
input_text = "2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]"
  • Type B

prompt = (
    "๋‹ค์Œ ๋ฌธ์žฅ์—์„œ ํฌํ•จ๋œ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋‚ด์šฉ์„ ํ•ด์„ํ•˜์„ธ์š”:\n"
    "- ๋‚ ์งœ: ์ฝ˜ํ…์ธ ๊ฐ€ ์–ธ์ œ ์˜ฌ๋ผ์™”๋Š”์ง€.\n"
    "- ๋ฐฐ์šฐ: ๋“ฑ์žฅํ•˜๋Š” ๋ฐฐ์šฐ์˜ ์ด๋ฆ„.\n"
    "- ์ œ๋ชฉ: ์ฝ˜ํ…์ธ ์˜ ์ œ๋ชฉ.\n\n"
    "์˜ˆ:\n"
    "์ž…๋ ฅ: 2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]\n"
    "์ถœ๋ ฅ: ์ด ์ฝ˜ํ…์ธ ๋Š” 2019๋…„ 12์›” ์ดˆ์— ์˜ฌ๋ผ์˜จ ๋‚ด์šฉ์ด๋ฉฐ, ๋ฐฐ์šฐ๋Š” ์ด์‹œ์–ธ๊ณผ ์™•์ง€ํ˜œ์ž…๋‹ˆ๋‹ค. ์ œ๋ชฉ์€ \"์ฃฝ์˜€๋‹ค ์•„๋‚ด๋ฅผ\"์ž…๋‹ˆ๋‹ค.\n\n"
    f"์ž…๋ ฅ: {pre_text}\n"
    "์ถœ๋ ฅ:"
)
  • OCR Type A: (screenshot)
  • OCR Type B: (screenshot)
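For context, here is a rough sketch of the comparison loop implied by the table below: run the same restoration request through each text-to-text candidate and inspect the output. This is my reconstruction, not the author's script; the prompt wording, generation settings, and trimmed model list are assumptions.

```python
# Sketch only: sweep the text-to-text candidates from the results table with one prompt.
from transformers import pipeline

candidates = [
    "KETI-AIR/ke-t5-small",
    "gogamza/kobart-base-v2",
    "hyunwoongko/kobart",
    "google/flan-t5-base",
]

cipher = "2O19.12월 초 긴 급 ( ㅇㅣㅅㅣ언 왕 ㅈ ㅣ ㅎ ㅖ ) [ 죽 였 ㄷㅏ ㅇㅏㄴㅐ를 ]"
prompt = f"다음 텍스트를 정상적인 한국어로 복원하세요: {cipher}"  # assumed wording

for name in candidates:
    restorer = pipeline("text2text-generation", model=name)
    output = restorer(prompt, max_new_tokens=64)
    print(f"{name}: {output[0]['generated_text']}")
```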

Results summary (models from Hugging Face)

| Model Name | Model Size | Model Type | Result |
| --- | --- | --- | --- |
| KETI-AIR/ke-t5-small | ~220M | Text2Text Generation | ❌ |
| ddobokki/ko-trocr | ~370M | Image-to-Text | ❌ |
| gogamza/kobart-base-v2 | ~125M | Text2Text Generation | ❌ |
| hyunwoongko/kobart | ~125M | Text2Text Generation | ❌ |
| google/flan-t5-base | ~248M | Text2Text Generation | ❌ |
| naver-clova-ix/donut-base | ~1.1B | Image-to-Text | ❌ |
| MLP-KTLim/llama-3-Korean-Bllossom-8B | ~8.03B | Text Generation | ❌ |
| LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct | ~2.4B | Text Generation | 🟢 |

The Best Result (LGAI-EXAONE 2.4B model)

| Original Text | Restored Text | GPU Usage (MB) |
| --- | --- | --- |
| 2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ] | 2019๋…„ 12์›” ์ดˆ ๊ธด๊ธ‰ ์กฐ์น˜ (์ด์‹œ์–ธ ์™•์ง€ํ˜œ)๋กœ ์ธํ•ด ์•„๋‚ด๋ฅผ ์ฃฝ์˜€๋‹ค | Allocated: 4685.94, Reserved: 4844.00 |
| 2019.02์›” S.F ํŒํƒ€์ง€ [โ€” ใ……ใ…ฃ . ๊ณต . ๊ฐ„ . ใ…‡ใ…ฃ . ๋™ โ€”]ํ•œ๊ธ€์ž๋ง‰ ์ดˆ๊ณ ํ™”์งˆ 1080P | 2019๋…„ 2์›” SF ํŒํƒ€์ง€ [โ€”์‹œ๊ฐ„ ์ด๋™โ€”] ๊ณ ํ™”์งˆ 1080P ์ž๋ง‰ ์ดˆ๊ณ ํ™”์งˆ | Allocated: 4685.94, Reserved: 4880.00 |
| โ‘กOโ‘กO.01์›” - ๊ธด์žฅ๊ฐ ์ตœ๊ณ ์น˜ - [ - ๋ธ”.๋ž™.์•ค.๋“œ.๋ธ”.๋ฃจ - ] ์ˆจ์„๊ณณ์—†๋‹ค | ์›” ๊ธด์žฅ๊ฐ ์ตœ๊ณ ์น˜ [-๋ธ”๋ž™์•ค๋“œ๋ธ”๋ฃจ] ์ˆจ์„ ๊ณณ ์—†๋‹ค (the model also offered: 2์›” ๊ธด์žฅ๊ฐ ์ตœ๊ณ  [-๋ธ”๋ž™์•ค๋“œ๋ธ”๋ฃจ] ์ˆจ ์‰ด ๊ณณ ์—†์–ด) | Allocated: 4685.94, Reserved: 4880.00 |
  • Raw inference log:

Before inference - Allocated: 4685.94 MB, Reserved: 4844.00 MB
original text: 2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]
decoded text: [|system|]You are EXAONE model from LG AI Research, a helpful assistant specialized in restoring distorted Korean text. [|user|]Restore the following text into normal Korean: 2O19.12์›”์ดˆ๊ธด๊ธ‰(์ด์‹œ์–ธ์™•์ง€ํ˜œ)[์ฃฝ์˜€๋‹ค์•„๋‚ด๋ฅผ] [|assistant|]2019๋…„ 12์›” ์ดˆ ๊ธด๊ธ‰ ์กฐ์น˜ (์ด์‹œ์–ธ ์™•์ง€ํ˜œ)๋กœ ์ธํ•ด ์•„๋‚ด๋ฅผ ์ฃฝ์˜€๋‹ค.
original text: 2019.02์›” S.F ํŒํƒ€์ง€ [--- ใ……ใ…ฃ . ๊ณต . ๊ฐ„ . ใ…‡ใ…ฃ . ๋™ ---]ํ•œ๊ธ€์ž๋ง‰ ์ดˆ๊ณ ํ™”์งˆ 1080P
decoded text: [|system|]You are EXAONE model from LG AI Research, a helpful assistant specialized in restoring distorted Korean text. [|user|]Restore the following text into normal Korean: 2019.02์›”S.FํŒํƒ€์ง€[---์‹œ.๊ณต.๊ฐ„.์ด.๋™---]ํ•œ๊ธ€์ž๋ง‰์ดˆ๊ณ ํ™”์งˆ1080P [|assistant|]2019๋…„ 2์›” SF ํŒํƒ€์ง€ [---์‹œ๊ฐ„ ์ด๋™---] ๊ณ ํ™”์งˆ 1080P ์ž๋ง‰ ์ดˆ๊ณ ํ™”์งˆ
original text: โ‘กOโ‘กO.01์›” - ๊ธด์žฅ๊ฐ ์ตœ๊ณ ์น˜ - [ - ๋ธ”.๋ž™.์•ค.๋“œ.๋ธ”.๋ฃจ - ] ์ˆจ์„๊ณณ์—†๋‹ค
decoded text: [|system|]You are EXAONE model from LG AI Research, a helpful assistant specialized in restoring distorted Korean text. [|user|]Restore the following text into normal Korean: 2O2O.01์›”-๊ธด์žฅ๊ฐ์ตœ๊ณ ์น˜-[-๋ธ”.๋ž™.์•ค.๋“œ.๋ธ”.๋ฃจ-]์ˆจ์„๊ณณ์—†๋‹ค [|assistant|]The restored text in normal Korean appears as follows: 2์›” ๊ธด์žฅ๊ฐ ์ตœ๊ณ ์น˜ [-๋ธ”๋ž™์•ค๋“œ๋ธ”๋ฃจ] ์ˆจ์„ ๊ณณ ์—†๋‹ค or 2์›” ๊ธด์žฅ๊ฐ ์ตœ๊ณ  [-๋ธ”๋ž™์•ค๋“œ๋ธ”๋ฃจ] ์ˆจ ์‰ด ๊ณณ ์—†์–ด
After inference - Allocated: 4685.94 MB, Reserved: 4880.00 MB
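For reproducibility, a minimal sketch of how the [|system|]/[|user|]/[|assistant|] prompt and the Allocated/Reserved readout above can be produced with transformers; the generation settings and the decision to pass the raw cipher (rather than a pre-cleaned string) are my assumptions.

```python
# Sketch only: EXAONE-3.5-2.4B-Instruct with its chat template, plus a GPU-memory
# readout matching the "Allocated / Reserved" lines in the log. Settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # EXAONE ships custom modeling code
    device_map="auto",
)

def gpu_mb():
    # Same quantities reported in the log, in MB.
    return torch.cuda.memory_allocated() / 1024**2, torch.cuda.memory_reserved() / 1024**2

messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant specialized in restoring distorted Korean text."},
    {"role": "user", "content": "Restore the following text into normal Korean: 2O19.12월 초 긴 급 ( ㅇㅣㅅㅣ언 왕 ㅈ ㅣ ㅎ ㅖ ) [ 죽 였 ㄷㅏ ㅇㅏㄴㅐ를 ]"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

print("Before inference - Allocated: %.2f MB, Reserved: %.2f MB" % gpu_mb())
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print("decoded text:", tokenizer.decode(output[0], skip_special_tokens=True))
print("After inference - Allocated: %.2f MB, Reserved: %.2f MB" % gpu_mb())
```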

Test Results

Detailed results
  • MLP-KTLim/llama-3-Korean-Bllossom-8B
python3 korean_blossom_llama.py
# input_text = "2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]"
Loading checkpoint shards: 100% 4/4 [00:00<00:00, 5.74it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
๋ณต์›๋œ ํ…์ŠคํŠธ: 2020๋…„ 1์›” ์ดˆ ๊ธด ํœด์‹(์ด์‚ฌ์œค ์™• ์ž๊ณ  ๋ง๊ณ  ์‰ฌ์—ˆ๋‹ค) [ ์ฃฝ์—ˆ๋‹ค๊ณ  ๋“ค์—ˆ๋‹ค๋Š” ๊ฒƒ ]

python3 korean_blossom_llama.py
# input_text = "2O19๋…„12์›” ((Oi์™„๋ฉ•๊ทธใ„น1๊ฑฐ)) []ใ…ก ๋‹ฅ . Eใ…“. ์Šฌ . ๋ฆฝ ใ…ก[] -์ดˆํ˜„์‹ค๋Šฅ๋ ฅ์ž- ์™„๋ฒฝ์ž์ฒด"
Loading checkpoint shards: 100% 4/4 [00:00<00:00, 5.55it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
๋ณต์›๋œ ํ…์ŠคํŠธ: Output: "2019๋…„ 12์›” ((Oi์™„๋ฒฝํ•œ1๋“ฑ)) [์ด๊ฒŒ ๋‹ค. ์—ฌ๊ธฐ. ์Šฌํ”„๋‹ค. ๋ฆฝ ์•ˆํ•ด[] -์ดˆํ˜„์‹ค๋Šฅ๋ ฅ์ž- ์™„๋ฒฝ์ž์ฒด"
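The warnings above come from calling generate() without an explicit attention mask. A minimal sketch of how korean_blossom_llama.py could load the model and avoid them; the prompt wording and generation settings are my guesses, not the original script.

```python
# Sketch only: MLP-KTLim/llama-3-Korean-Bllossom-8B with an explicit attention_mask,
# so the warnings in the log above do not appear.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "다음 텍스트를 정상적인 한국어로 복원하세요: 2O19.12월 초 긴 급 ( ㅇㅣㅅㅣ언 왕 ㅈ ㅣ ㅎ ㅖ ) [ 죽 였 ㄷㅏ ㅇㅏㄴㅐ를 ]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # includes attention_mask

outputs = model.generate(
    **inputs,                             # input_ids + attention_mask
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,  # silences the pad_token_id warning
)
print("복원된 텍스트:", tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```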
Example: KETI-AIR/ke-t5-small

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the Korean T5 model and tokenizer provided on Hugging Face
model_name = "KETI-AIR/ke-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def restore_text_ai(text):
    # Wrap the input in a T5-style task prefix
    input_text = f"๋ณต์›: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
    # Run inference
    outputs = model.generate(inputs["input_ids"], max_length=512, num_beams=5, early_stopping=True)
    # Decode the result
    restored_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return restored_text

# Test
test_text = "2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]"
restored = restore_text_ai(test_text)
print("๋ณต์›๋œ ํ…์ŠคํŠธ:", restored)

→ Fail, no response

  • google-t5/t5-small: Fail, no response
  • KETI-AIR/ke-t5-small: Fail, hallucinated output
raw_text = "2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]"
๋ณต์›๋œ ํ…์ŠคํŠธ: ๋งˆ์ฐฌ๊ฐ€์ง€๋‹คๅŽŸๅŽŸๅŽŸ ๋งˆ์ฐฌ๊ฐ€์ง€์ž…๋‹ˆ๋‹คๅŽŸๅฑฑๅŽŸๅŽŸhankookilbohankookilbohankookilbo ๋ฏผ์ •์ˆ˜์„hankookilbohankookilbo ์ข‹๊ฒ ์ง€๋งŒhankookilbohankookilbo ๋งŽ๊ฒŒ๋Š”hankookilbohankookilbo ์œ„๋ฉ”ํ”„hankookilbohankookilbo accidenthankookilbohankookilbo ํŠนํ™œ๋น„hankookilbohankookilbo Stamfordhankookilbohankookilbo ๋ฐ”๋ฅด์…€๋กœ๋‚˜hankookilbohankookilboํฌ์Šคhankookilbohankookilbo ridehankookilbohankookilbo ํŠน์ˆ˜ํ™œ๋™๋น„hankookilbohankookilbo ๊ธˆ๊ณ hankookilbohankookilbo ์กฐ์น˜ํ–ˆ๋‹คhankookilbohankookilbo ๋ฌผ๊ฐ€hankookilbohankookilbo WTO Stamfordhankookilbo Orange Stamford Stamfordhankookilbo ํ™˜์ „ Stamfordgnradhankookilbohankookilbo door Stamford Stamford Stamford Bangkokhankookilbohankookilbo workshophankookilbohankookilbo ๊ฐ์‚ฌ์›์€hankookilbohankookilbo ship Stamfordhankookilbo ์„ธ๊ณ„๋žญํ‚นhankookilbohankookilbo ์žŠํ˜€์ง€hankookilbohankookilbo Numberhankookilbohankookilbo ๊ท“hankookilbohankookilbo LHhankookilbohankookilbo wheneverhankookilbohankookilbo ํ‰์ฐฝhankookilbohankookilbo ์›”๋“œhankookilbohankookilbo ๋ชจ๋‹ˆํ„ฐhankookilbohankookilbo ์™”์ง€๋งŒhankookilbohankookilbo ์ˆ˜์ˆ˜๋ฃŒ๋ฅผhankookilbohankookilbo ์šฐ๋ฆฌ๊ธˆ์œต์ง€์ฃผhankookilbohankookilbo Princetonhankookilbohankookilbo ์ผ์œผํ‚ค Stamfordhankookilbognradhankookilbo Stamford
  • model_name = "hyunwoongko/kobart": Fail
You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels will be overwritten to 2.
์‘๋‹ต ์ตœ์†Œ๊ธธ์ด : 18, ์ตœ๋Œ€ ๊ธธ์ด : 58
์ž…๋ ฅ: 2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]
๋ณต์› ๊ฒฐ๊ณผ: ๋‹ค์Œ ๋‹ค์Œ ๋ฌธ์žฅ์„ ์˜ฌ๋ฐ”๋ฅธ ํ•œ๊ตญ์–ด๋กœ ๋ณต์›ํ•˜์„ธ์š”: ์ž…๋ ฅ: 201912์›”์ดˆ๊ธด๊ธ‰์ด์‹œ์–ธ์–ด๋กœ ๋ณตํ•˜์„ธ์š”:0 ์ถœ๋ ฅ.
์‘๋‹ต ์ตœ์†Œ๊ธธ์ด : 27, ์ตœ๋Œ€ ๊ธธ์ด : 67
์ž…๋ ฅ: 2019.02์›” S.F ํŒํƒ€์ง€ [--- ใ……ใ…ฃ . ๊ณต . ๊ฐ„ . ใ…‡ใ…ฃ . ๋™ ---]ํ•œ๊ธ€์ž๋ง‰ ์ดˆ๊ณ ํ™”์งˆ 1080P
๋ณต์› ๊ฒฐ๊ณผ: ๋‹ค์Œ ๋‹ค์Œ ๋ฌธ์žฅ์„ ์˜ฌ๋ฐ”๋ฅธ ํ•œ๊ตญ์–ด๋กœ ๋ณต์›ํ•˜์„ธ์š”: ์ž…๋ ฅ: 201902์›”SFํŒํƒ€์ง€---์‹œ๊ณต๊ฐ„์ด๋™---1-ํ•œ๊ธ€์ž๋ง‰์ดˆ๊ณ ๊ณ -- ํ•œ๊ธ€ ์ž๋ง‰ ์ดˆ๊ณ  ๊ณ ๊ณ ํ™”์งˆ1080P

OCR test (Fail)

ย 
notion image
ย 
notion image
ย 
ย 

ย 
ย 
ย 

ย 
ย 
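For reference, a minimal sketch of the kind of image-to-text check run here; the model choice, the image path, and the assumption that the checkpoint ships pipeline-compatible processor files are all mine, not the exact test setup.

```python
# Sketch only: OCR / image-to-text over a screenshot of the cipher text.
# "cipher_screenshot.png" is a placeholder path; ddobokki/ko-trocr and
# naver-clova-ix/donut-base are the checkpoints listed in the results table above.
from transformers import pipeline
from PIL import Image

ocr = pipeline("image-to-text", model="ddobokki/ko-trocr")

image = Image.open("cipher_screenshot.png").convert("RGB")
result = ocr(image)
print(result[0]["generated_text"])
```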

What if we don't use AI?

  • This would require a regex-matching dictionary covering every kind of cipher-transformed Korean word.
  • Sample:
###############################################################################
# 0) Korean jamo lists (for composition)
###############################################################################
CHOSUNG_LIST = list("ใ„ฑใ„ฒใ„ดใ„ทใ„ธใ„นใ…ใ…‚ใ…ƒใ……ใ…†ใ…‡ใ…ˆใ…‰ใ…Šใ…‹ใ…Œใ…ใ…Ž")
JUNGSEONG_LIST = list("ใ…ใ…ใ…‘ใ…’ใ…“ใ…”ใ…•ใ…–ใ…—ใ…˜ใ…™ใ…šใ…›ใ…œใ…ใ…žใ…Ÿใ… ใ…กใ…ขใ…ฃ")
JONGSEONG_LIST = [""] + list("ใ„ฑใ„ฒใ„ณใ„ดใ„ตใ„ถใ„ทใ„นใ„บใ„ปใ„ผใ„ฝใ„พใ„ฟใ…€ใ…ใ…‚ใ…„ใ……ใ…†ใ…‡ใ…ˆใ…Šใ…‹ใ…Œใ…ใ…Ž")

###############################################################################
# 0-1) Variant symbols -> jamo mapping
###############################################################################
VARIANT_JAMO_MAP = {
    'ใ…‡': ['o', 'O', '0'],
    'ใ„น': ['l', 'L', '|', 'I'],
    'ใ„ฑ': ['g', 'G'],
    'ใ……': ['s', 'S'],
    'ใ…ˆ': ['j', 'J', 'z', 'Z'],
    'ใ…Ž': ['h', 'H'],
    # Vowel examples (add more if needed)
    'ใ…ฃ': ['i', 'I', '|'],
    'ใ…': ['a', 'A'],
    'ใ…“': ['eo', 'EO'],
    'ใ…—': ['O', 'o'],  # Caution: easily confused with ใ…‡ ('O'); adjust to the situation
}
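VARIANT_JAMO_MAP above is only a lookup table. Here is a small sketch (my addition, assuming the block above has been executed) of how it could be inverted so Latin/digit look-alikes are substituted back to jamo before composition:

```python
# Sketch: invert VARIANT_JAMO_MAP so look-alike symbols map back to a jamo.
# Ambiguous symbols (e.g. 'O' or '|') appear under several jamo; this naive
# version keeps the first mapping it sees, so real use needs contextual rules.
def build_reverse_map(variant_map):
    reverse = {}
    for jamo, variants in variant_map.items():
        for symbol in variants:
            reverse.setdefault(symbol, jamo)
    return reverse

def normalize_lookalikes(text, variant_map):
    reverse = build_reverse_map(variant_map)
    return "".join(reverse.get(ch, ch) for ch in text)

# e.g. '0', 'O', 'l' in the input are replaced by the jamo they imitate,
# and the composition step in the next block can then assemble full syllables.
```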
import re

def restore_text(text):
    # 1. Split consonants/vowels, then recombine them
    # Compose via Unicode when consonants and vowels appear as separate jamo
    def combine_jamo(chars):
        CHO = "ใ„ฑใ„ดใ„ทใ„นใ…ใ…‚ใ……ใ…‡ใ…ˆใ…Šใ…‹ใ…Œใ…ใ…Ž"
        JUNG = "ใ…ใ…‘ใ…“ใ…•ใ…—ใ…›ใ…œใ… ใ…กใ…ฃ"
        JONG = "ใ„ฑใ„ฒใ„ณใ„ดใ„ตใ„ถใ„ทใ„นใ„บใ„ปใ„ผใ„ฝใ„พใ„ฟใ…€ใ…ใ…‚ใ…„ใ……ใ…†ใ…‡ใ…ˆใ…Šใ…‹ใ…Œใ…ใ…Ž"
        if len(chars) == 2:  # choseong + jungseong
            cho, jung = chars
            return chr(0xAC00 + CHO.index(cho) * 588 + JUNG.index(jung) * 28)
        elif len(chars) == 3:  # choseong + jungseong + jongseong
            cho, jung, jong = chars
            return chr(0xAC00 + CHO.index(cho) * 588 + JUNG.index(jung) * 28 + JONG.index(jong) + 1)
        return ''.join(chars)

    text = re.sub(r'\s+', '', text)       # remove whitespace
    text = re.sub(r'[^\w๊ฐ€-ํžฃ]', '', text)  # remove special characters
    text = re.sub(r'[O]', '0', text)      # O -> 0
    text = re.sub(r'[lI]', '1', text)     # l/I -> 1
    text = re.sub(r'[B]', '8', text)      # B -> 8

    # Merge characters whose consonants and vowels were split apart
    result = []
    temp = []
    for char in text:
        if 'ใ„ฑ' <= char <= 'ใ…Ž' or 'ใ…' <= char <= 'ใ…ฃ':  # Hangul jamo
            temp.append(char)
            if len(temp) == 3 or (len(temp) == 2 and temp[1] in "ใ…ใ…‘ใ…“ใ…•ใ…—ใ…›ใ…œใ… ใ…กใ…ฃ"):
                result.append(combine_jamo(temp))
                temp = []
        else:
            if temp:
                result.extend(temp)
                temp = []
            result.append(char)
    if temp:
        result.extend(temp)
    return ''.join(result)

# Test
test_text = "2O19.12์›” ์ดˆ ๊ธด ๊ธ‰ ( ใ…‡ใ…ฃใ……ใ…ฃ์–ธ ์™• ใ…ˆ ใ…ฃ ใ…Ž ใ…– ) [ ์ฃฝ ์˜€ ใ„ทใ… ใ…‡ใ…ใ„ดใ…๋ฅผ ]"
restored = restore_text(test_text)
print("๋ณต์›๋œ ํ…์ŠคํŠธ:", restored)
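As a sanity check on the composition arithmetic in combine_jamo above: the standard Unicode Hangul formula is code point = 0xAC00 + (choseong index × 588) + (jungseong index × 28) + jongseong index, where 588 = 21 × 28. A tiny self-contained check (my addition):

```python
# Unicode Hangul syllable composition: 0xAC00 + cho*588 + jung*28 + jong
# Example: 한 = ㅎ (choseong index 18) + ㅏ (jungseong index 0) + ㄴ (jongseong index 4)
cho, jung, jong = 18, 0, 4
assert chr(0xAC00 + cho * 588 + jung * 28 + jong) == "한"
print("composed:", chr(0xAC00 + cho * 588 + jung * 28 + jong))
```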