charset_normalizer package

Subpackages

Submodules

charset_normalizer.api module

charset_normalizer.api.from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Given a raw bytes sequence, return the best possible charsets usable to render str objects. If there are no results, it is a strong indicator that the source is binary / not text. By default, the process extracts 5 blocks of 512 bytes each to assess the mess and coherence of the given sequence, and will give up on a particular code page after 20% measured mess. These criteria are customizable at will.

The preemptive behaviour DOES NOT replace the traditional detection workflow; it prioritizes a particular code page but never takes it for granted. It can improve performance.

You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that purpose.

This function will strip the SIG from the payload/sequence every time, except for UTF-16 and UTF-32.
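A minimal usage sketch; the payload text and the cp_isolation list are illustrative:

>>> from charset_normalizer import from_bytes
>>> payload = 'Bonjour où ça ?'.encode('cp1252')
>>> results = from_bytes(payload, cp_isolation=['cp1252', 'latin_1'])
>>> best_guess = results.best()  # None when nothing matched
>>> best_guess.encoding if best_guess else None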

charset_normalizer.api.from_fp(fp: BinaryIO, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Same as from_bytes, but using a file pointer that is already open. Will not close the file pointer.
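For example, assuming a hypothetical file path; the pointer must be opened in binary mode:

>>> from charset_normalizer import from_fp
>>> with open('./my-file.txt', 'rb') as fp:
...     results = from_fp(fp)
>>> results.best()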

charset_normalizer.api.from_path(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Same as from_bytes, but with one extra step: opening and reading the given file path in binary mode. Can raise IOError.
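For example (hypothetical path):

>>> from charset_normalizer import from_path
>>> results = from_path('./my-file.txt')  # raises IOError if the file cannot be read
>>> results.best()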

charset_normalizer.api.normalize(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True) charset_normalizer.models.CharsetMatch

Take a (text-based) file path and try to create another file next to it, this time using UTF-8.
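A hedged sketch; the path is hypothetical, and the exact name of the created UTF-8 file is implementation-defined:

>>> from charset_normalizer import normalize
>>> match = normalize('./my-file.txt')  # a UTF-8 copy is written next to the source file
>>> match.encoding  # the detected source encoding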

charset_normalizer.cd module

charset_normalizer.cd.alpha_unicode_split(decoded_sequence: str) List[str]

Given a decoded text sequence, return a list of str: Unicode range / alphabet separation. E.g. a text containing English/Latin with a bit of Hebrew will return two items in the resulting list, one containing the Latin letters and the other the Hebrew ones.
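For example (the exact grouping shown is illustrative):

>>> from charset_normalizer.cd import alpha_unicode_split
>>> alpha_unicode_split('hello שלום')  # one layer per alphabet, e.g. ['hello', 'שלום']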

charset_normalizer.cd.alphabet_languages(characters: List[str], ignore_non_latin: bool = False) List[str]

Return the languages associated with the given characters.

charset_normalizer.cd.characters_popularity_compare(language: str, ordered_characters: List[str]) float

Determine if an ordered character list (by occurrence, from most frequent to rarest) matches a particular language. The result is a ratio between 0. (absolutely no correspondence) and 1. (near-perfect fit). Beware that this function is not strict on the match, in order to ease detection (meaning a close match counts as 1.).
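A sketch, assuming 'English' is among the languages known to the internal frequency table:

>>> from charset_normalizer.cd import characters_popularity_compare
>>> characters_popularity_compare('English', ['e', 't', 'a', 'o', 'i'])  # a float near 1.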

charset_normalizer.cd.coherence_ratio(decoded_sequence: str, threshold: float = 0.1, lg_inclusion: Optional[str] = None) List[Tuple[str, float]]

Detect ANY language that can be identified in the given sequence. The sequence will be analysed in layers; a layer = character extraction by alphabets/ranges.
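For example (the returned (language, ratio) pairs are illustrative):

>>> from charset_normalizer.cd import coherence_ratio
>>> coherence_ratio('Der schnelle braune Fuchs springt über den faulen Hund')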

charset_normalizer.cd.encoding_languages(iana_name: str) List[str]

Single-byte encoding-to-language association. Some code pages are heavily linked to particular language(s). This function performs that correspondence.

charset_normalizer.cd.encoding_unicode_range(iana_name: str) List[str]

Return the Unicode ranges associated with a single-byte code page.

charset_normalizer.cd.mb_encoding_languages(iana_name: str) List[str]

Multi-byte encoding-to-language association. Some code pages are heavily linked to particular language(s). This function performs that correspondence.

charset_normalizer.cd.merge_coherence_ratios(results: List[List[Tuple[str, float]]]) List[Tuple[str, float]]

This function merges results previously produced by the function coherence_ratio. The return type is the same as that of coherence_ratio.

charset_normalizer.cd.unicode_range_languages(primary_range: str) List[str]

Return the inferred languages used within a given Unicode range.

charset_normalizer.constant module

charset_normalizer.legacy module

class charset_normalizer.legacy.CharsetDetector(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.legacy.CharsetNormalizerMatches

class charset_normalizer.legacy.CharsetDoctor(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.legacy.CharsetNormalizerMatches

class charset_normalizer.legacy.CharsetNormalizerMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)

Bases: charset_normalizer.models.CharsetMatch

class charset_normalizer.legacy.CharsetNormalizerMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.models.CharsetMatches

static from_bytes(*args, **kwargs)
static from_fp(*args, **kwargs)
static from_path(*args, **kwargs)
static normalize(*args, **kwargs)
charset_normalizer.legacy.detect(byte_str: bytes) Dict[str, Optional[Union[str, float]]]

chardet legacy method. Detect the encoding of the given byte string. It should be mostly backward-compatible; the encoding name will match chardet's own spelling whenever possible (but not for encoding names unsupported by it). This function is deprecated and should only be used to ease the migration of an existing project; consult the documentation for further information. Not planned for removal.

Parameters

byte_str – The byte sequence to examine.
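A sketch of the chardet-style result; the exact encoding name and confidence depend on the input:

>>> from charset_normalizer import detect
>>> detect('Привет, мир!'.encode('cp1251'))
{'encoding': ..., 'language': ..., 'confidence': ...}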

charset_normalizer.md module

class charset_normalizer.md.ArchaicUpperLowerPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.CjkInvalidStopPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

GB (Chinese) based encodings often render the full stop incorrectly when the content does not fit, which can be easily detected by searching for the overuse of ‘丅’ and ‘丄’.

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.MessDetectorPlugin

Bases: object

Base abstract class used for mess detection plugins. All detectors MUST extend it and implement the given methods.

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.
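A minimal sketch of this contract, using a hypothetical detector that flags U+FFFD replacement characters. Whether third-party subclasses are picked up automatically depends on how the md module enumerates plugins, so treat this purely as an illustration of the interface:

from charset_normalizer.md import MessDetectorPlugin

class ReplacementCharacterPlugin(MessDetectorPlugin):  # hypothetical detector
    def __init__(self) -> None:
        self._character_count = 0   # characters seen since the last reset
        self._suspicious_count = 0  # U+FFFD occurrences

    def eligible(self, character: str) -> bool:
        # Inspect every character; narrower plugins would filter here.
        return True

    def feed(self, character: str) -> None:
        self._character_count += 1
        if character == '\ufffd':
            self._suspicious_count += 1

    def reset(self) -> None:
        self._character_count = 0
        self._suspicious_count = 0

    @property
    def ratio(self) -> float:
        # Never below 0.; no upper restriction.
        if self._character_count == 0:
            return 0.0
        return self._suspicious_count / self._character_count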

class charset_normalizer.md.SuperWeirdWordPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.SuspiciousDuplicateAccentPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.SuspiciousRange

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.TooManyAccentuatedPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.TooManySymbolOrPunctuationPlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

class charset_normalizer.md.UnprintablePlugin

Bases: charset_normalizer.md.MessDetectorPlugin

eligible(character: str) bool

Determine if the given character should be fed in.

feed(character: str) None

The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.

property ratio: float

Compute the chaos ratio based on what feed() has seen. Must NOT be lower than 0.; there is no upper restriction.

reset() None

Reset the plugin to its initial state.

charset_normalizer.md.is_suspiciously_successive_range(unicode_range_a: Optional[str], unicode_range_b: Optional[str]) bool

Determine if two Unicode ranges seen next to each other can be considered suspicious.

charset_normalizer.md.mess_ratio(decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False) float

Compute a mess ratio for a given decoded sequence. Reaching the maximum threshold stops the computation early.
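For example (exact values are illustrative; clean text scores near 0.):

>>> from charset_normalizer.md import mess_ratio
>>> mess_ratio('A plain, readable English sentence.')
>>> mess_ratio('ÃƒÂ©Ã¢â‚¬â„¢ mojibake-like gibberish')  # noticeably higher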

charset_normalizer.models module

class charset_normalizer.models.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)

Bases: object

add_submatch(other: charset_normalizer.models.CharsetMatch) None
property alphabets: List[str]
best() charset_normalizer.models.CharsetMatch

Kept for BC reasons. Will be removed in 3.0.

property bom: bool
property byte_order_mark: bool
property chaos: float
property chaos_secondary_pass: float

Check the chaos in the decoded text once again, this time on the full content. Use with caution; this can be very slow. Notice: Will be removed in 3.0

property coherence: float
property coherence_non_latin: float

Coherence ratio for the first non-Latin language detected, if ANY. Notice: Will be removed in 3.0

property could_be_from_charset: List[str]

The complete list of encodings that output the exact SAME str result and therefore could be the originating encoding. This list includes the encoding available in the ‘encoding’ property.

property encoding: str
property encoding_aliases: List[str]

Encodings are known by many names; these aliases can help when searching for IBM855 when it is listed as CP855.

property fingerprint: str

Retrieve the unique SHA-256 fingerprint computed from the transformed (re-encoded) payload, not the original one.

first() charset_normalizer.models.CharsetMatch

Kept for BC reasons. Will be removed in 3.0.

property has_submatch: bool
property language: str

Most probable language found in the decoded sequence. If none was detected or inferred, the property will return “Unknown”.

property languages: List[str]

Return the complete list of possible languages found in the decoded sequence. Usually not very useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.

property multi_byte_usage: float
output(encoding: str = 'utf_8') bytes

Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8. Any errors will simply be ignored by the encoder, NOT replaced.

property percent_chaos: float
property percent_coherence: float
property raw: bytes

Original untouched bytes.

property submatch: List[charset_normalizer.models.CharsetMatch]
property w_counter: collections.Counter

Word counter instance computed on the decoded text. Notice: Will be removed in 3.0
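A short sketch tying these members together (outputs are illustrative):

>>> from charset_normalizer import from_bytes
>>> match = from_bytes('Comment ça va ?'.encode('cp1252')).best()
>>> match.encoding                # the detected encoding
>>> match.could_be_from_charset   # every encoding producing the same str
>>> match.output()                # payload re-encoded to UTF-8 (the default)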

class charset_normalizer.models.CharsetMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: object

Container holding every CharsetMatch item, ordered by default from most probable to least probable. Acts like a list (iterable) but does not implement all related methods.

append(item: charset_normalizer.models.CharsetMatch) None

Insert a single match. It will be inserted so as to preserve sort order, and may be inserted as a submatch.

best() Optional[charset_normalizer.models.CharsetMatch]

Simply return the first match. Strict equivalent to matches[0].

first() Optional[charset_normalizer.models.CharsetMatch]

Redundant method that calls best(). Kept for BC reasons.

class charset_normalizer.models.CliDetectionResult(path: str, encoding: Optional[str], encoding_aliases: List[str], alternative_encodings: List[str], language: str, alphabets: List[str], has_sig_or_bom: bool, chaos: float, coherence: float, unicode_path: Optional[str], is_preferred: bool)

Bases: object

to_json() str

charset_normalizer.utils module

charset_normalizer.utils.any_specified_encoding(sequence: bytes, search_zone: int = 4096) Optional[str]

Extract, using an ASCII-only decoder, any encoding declared within the first n bytes.

charset_normalizer.utils.cp_similarity(iana_name_a: str, iana_name_b: str) float
charset_normalizer.utils.iana_name(cp_name: str, strict: bool = True) str
charset_normalizer.utils.identify_sig_or_bom(sequence: bytes) Tuple[Optional[str], bytes]

Identify and extract the SIG/BOM from the given sequence.
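For example; the exact IANA name paired with the mark is illustrative:

>>> from charset_normalizer.utils import identify_sig_or_bom
>>> identify_sig_or_bom('\ufeffhello'.encode('utf_8'))  # e.g. ('utf_8', b'\xef\xbb\xbf')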

charset_normalizer.utils.is_accentuated(character: str) bool
charset_normalizer.utils.is_ascii(character: str) bool
charset_normalizer.utils.is_case_variable(character: str) bool
charset_normalizer.utils.is_cjk(character: str) bool
charset_normalizer.utils.is_cp_similar(iana_name_a: str, iana_name_b: str) bool

Determine if two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated using the function cp_similarity.

charset_normalizer.utils.is_emoticon(character: str) bool
charset_normalizer.utils.is_hangul(character: str) bool
charset_normalizer.utils.is_hiragana(character: str) bool
charset_normalizer.utils.is_katakana(character: str) bool
charset_normalizer.utils.is_latin(character: str) bool
charset_normalizer.utils.is_multi_byte_encoding(name: str) bool

Verify whether a specific encoding is a multi-byte one based on its IANA name.

charset_normalizer.utils.is_private_use_only(character: str) bool
charset_normalizer.utils.is_punctuation(character: str) bool
charset_normalizer.utils.is_separator(character: str) bool
charset_normalizer.utils.is_symbol(character: str) bool
charset_normalizer.utils.is_thai(character: str) bool
charset_normalizer.utils.is_unicode_range_secondary(range_name: str) bool
charset_normalizer.utils.range_scan(decoded_sequence: str) List[str]
charset_normalizer.utils.remove_accent(character: str) str
charset_normalizer.utils.should_strip_sig_or_bom(iana_encoding: str) bool
charset_normalizer.utils.unicode_range(character: str) Optional[str]

Retrieve the official Unicode range name for a single character.
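For example (block names follow the official Unicode range names):

>>> from charset_normalizer.utils import unicode_range
>>> unicode_range('a')
'Basic Latin'
>>> unicode_range('é')
'Latin-1 Supplement'
>>> unicode_range('丅')
'CJK Unified Ideographs'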

charset_normalizer.version module

Expose version

Module contents

Charset-Normalizer

The Real First Universal Charset Detector. A library that helps you read text from an unknown charset encoding. Motivated by chardet, this package tries to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.

Basic usage:
>>> from charset_normalizer import from_bytes
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
>>> best_guess = results.best()
>>> str(best_guess)
'Bсеки човек има право на образование. Oбразованието!'

Other methods and usages are available; see the full documentation at <https://github.com/Ousret/charset_normalizer>. :copyright: (c) 2021 by Ahmed TAHRI :license: MIT, see LICENSE for more details.

class charset_normalizer.CharsetDetector(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.legacy.CharsetNormalizerMatches

class charset_normalizer.CharsetDoctor(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.legacy.CharsetNormalizerMatches

class charset_normalizer.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)

Bases: object

add_submatch(other: charset_normalizer.models.CharsetMatch) None
property alphabets: List[str]
best() charset_normalizer.models.CharsetMatch

Kept for BC reasons. Will be removed in 3.0.

property bom: bool
property byte_order_mark: bool
property chaos: float
property chaos_secondary_pass: float

Check the chaos in the decoded text once again, this time on the full content. Use with caution; this can be very slow. Notice: Will be removed in 3.0

property coherence: float
property coherence_non_latin: float

Coherence ratio for the first non-Latin language detected, if ANY. Notice: Will be removed in 3.0

property could_be_from_charset: List[str]

The complete list of encodings that output the exact SAME str result and therefore could be the originating encoding. This list includes the encoding available in the ‘encoding’ property.

property encoding: str
property encoding_aliases: List[str]

Encodings are known by many names; these aliases can help when searching for IBM855 when it is listed as CP855.

property fingerprint: str

Retrieve the unique SHA-256 fingerprint computed from the transformed (re-encoded) payload, not the original one.

first() charset_normalizer.models.CharsetMatch

Kept for BC reasons. Will be removed in 3.0.

property has_submatch: bool
property language: str

Most probable language found in the decoded sequence. If none was detected or inferred, the property will return “Unknown”.

property languages: List[str]

Return the complete list of possible languages found in the decoded sequence. Usually not very useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.

property multi_byte_usage: float
output(encoding: str = 'utf_8') bytes

Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8. Any errors will simply be ignored by the encoder, NOT replaced.

property percent_chaos: float
property percent_coherence: float
property raw: bytes

Original untouched bytes.

property submatch: List[charset_normalizer.models.CharsetMatch]
property w_counter: collections.Counter

Word counter instance computed on the decoded text. Notice: Will be removed in 3.0

class charset_normalizer.CharsetMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: object

Container holding every CharsetMatch item, ordered by default from most probable to least probable. Acts like a list (iterable) but does not implement all related methods.

append(item: charset_normalizer.models.CharsetMatch) None

Insert a single match. It will be inserted so as to preserve sort order, and may be inserted as a submatch.

best() Optional[charset_normalizer.models.CharsetMatch]

Simply return the first match. Strict equivalent to matches[0].

first() Optional[charset_normalizer.models.CharsetMatch]

Redundant method that calls best(). Kept for BC reasons.

class charset_normalizer.CharsetNormalizerMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)

Bases: charset_normalizer.models.CharsetMatch

class charset_normalizer.CharsetNormalizerMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)

Bases: charset_normalizer.models.CharsetMatches

static from_bytes(*args, **kwargs)
static from_fp(*args, **kwargs)
static from_path(*args, **kwargs)
static normalize(*args, **kwargs)
charset_normalizer.detect(byte_str: bytes) Dict[str, Optional[Union[str, float]]]

chardet legacy method. Detect the encoding of the given byte string. It should be mostly backward-compatible; the encoding name will match chardet's own spelling whenever possible (but not for encoding names unsupported by it). This function is deprecated and should only be used to ease the migration of an existing project; consult the documentation for further information. Not planned for removal.

Parameters

byte_str – The byte sequence to examine.

charset_normalizer.from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Given a raw bytes sequence, return the best possible charsets usable to render str objects. If there are no results, it is a strong indicator that the source is binary / not text. By default, the process extracts 5 blocks of 512 bytes each to assess the mess and coherence of the given sequence, and will give up on a particular code page after 20% measured mess. These criteria are customizable at will.

The preemptive behaviour DOES NOT replace the traditional detection workflow; it prioritizes a particular code page but never takes it for granted. It can improve performance.

You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that purpose.

This function will strip the SIG from the payload/sequence every time, except for UTF-16 and UTF-32.

charset_normalizer.from_fp(fp: BinaryIO, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Same as from_bytes, but using a file pointer that is already open. Will not close the file pointer.

charset_normalizer.from_path(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches

Same as from_bytes, but with one extra step: opening and reading the given file path in binary mode. Can raise IOError.

charset_normalizer.normalize(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True) charset_normalizer.models.CharsetMatch

Take a (text-based) file path and try to create another file next to it, this time using UTF-8.