Java編碼檢測工具:juniversalchardet

jopen 11年前發布 | 41K 次閱讀 Java開發 juniversalchardet

Mozilla在很多年前就做了一個非常優秀的編碼檢測工具,叫chardet(java版jchardet ),后來有發布了算法更加優秀的universalchardet,用于Firefox的自動編碼識別。另外Apache內容抽取項目Tika的發布包tika-app-1.*.jar(自1.2及以后版本)其中打包了juniversalchardet。

注意:如果試圖識別幾個字節的短文本編碼,可能會出現了識別錯誤,這應該是算法實現本身的缺陷,但識別稍大一點文本編碼,正確率則非常高,尤其較chardet要高的多。

Encodings that can be detected

  • Chinese
    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-23121
  • Cyrillic
    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MACCYRILLIC
    • IBM866
    • IBM855
  • Greek
    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew
    • ISO-8859-8
    • WINDOWS-1255
  • Japanese
    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean
    • ISO-2022-KR
    • EUC-KR
  • Unicode
    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
  • Others
    • WINDOWS-1252

 

 

Related Works

   jchardet

   jchardet is another Java port of the Mozilla's encoding dectection library. The main difference between jchardet and juniversalchardet is modules they are based on. jchardet is based on the 'chardet' module that has long existed. juniversalchardet is based on the 'universalchardet' module that is new and generally provides better accuracy on detection results.

 

Sample Code

 

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];
    java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

    // (1)
    UniversalDetector detector = new UniversalDetector(null);

    // (2)
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread);
    }
    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();
  }
}

項目主頁:http://www.baiduhome.net/lib/view/home/1373879020591

 本文由用戶 jopen 自行上傳分享,僅供網友學習交流。所有權歸原作者,若您的權利被侵害,請聯系管理員。
 轉載本站原創文章,請注明出處,并保留原始鏈接、圖片水印。
 本站是一個以用戶分享為主的開源技術平臺,歡迎各類分享!