charset_normalizer package
Subpackages
Submodules
charset_normalizer.api module
- charset_normalizer.api.from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Given a raw bytes sequence, return the best possible charsets usable to render str objects. If there are no results, it is a strong indicator that the source is binary/not text. By default, the process extracts 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence, and gives up on a particular code page once 20% of it is measured as mess. These criteria are fully customizable.
The preemptive behaviour DOES NOT replace the traditional detection workflow; it prioritizes a particular code page but never takes it for granted. It can improve performance.
If you want to focus the detection on certain code pages, or exclude others, use cp_isolation and cp_exclusion for that purpose.
This function strips the SIG (BOM) from the payload/sequence every time, except for UTF-16 and UTF-32.
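For instance, a minimal sketch narrowing detection down to two candidate code pages via cp_isolation (the payload and the printed value are illustrative):
>>> from charset_normalizer import from_bytes
>>> payload = 'Bсеки човек има право на образование.'.encode('utf_8')
>>> results = from_bytes(payload, cp_isolation=['utf_8', 'cp1251'])
>>> best_guess = results.best()  # None when nothing matched (likely binary input)
>>> best_guess.encoding
'utf_8'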
- charset_normalizer.api.from_fp(fp: BinaryIO, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Same as from_bytes, but using a file pointer that is already opened and ready. Will not close the file pointer.
- charset_normalizer.api.from_path(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Same as from_bytes, but with one extra step: opening and reading the given file path in binary mode. Can raise IOError.
- charset_normalizer.api.normalize(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True) charset_normalizer.models.CharsetMatch
Take a (text-based) file path and try to create another file next to it, this time using UTF-8.
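A hedged sketch of invoking it (the file path below is hypothetical):
>>> from charset_normalizer import normalize
>>> match = normalize('./my-file-in-unknown-encoding.txt')  # hypothetical path
The returned CharsetMatch describes how the original file was read; the UTF-8 copy is created next to it.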
charset_normalizer.cd module
- charset_normalizer.cd.alpha_unicode_split(decoded_sequence: str) List[str]
Given a decoded text sequence, return a list of str. Unicode range / alphabet separation. E.g. a text containing English/Latin with a bit of Hebrew will return two items in the resulting list; one containing the Latin letters and the other the Hebrew ones.
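A minimal illustrative sketch of that example (the exact output shown is an assumption about formatting, not authoritative):
>>> from charset_normalizer.cd import alpha_unicode_split
>>> alpha_unicode_split('hello שלום')  # one str per Unicode range/alphabet
['hello', 'שלום']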
- charset_normalizer.cd.alphabet_languages(characters: List[str], ignore_non_latin: bool = False) List[str]
Return the languages associated with the given characters.
- charset_normalizer.cd.characters_popularity_compare(language: str, ordered_characters: List[str]) float
Determine whether an ordered list of characters (by occurrence, from most frequent to rarest) matches a particular language. The result is a ratio between 0.0 (absolutely no correspondence) and 1.0 (near perfect fit). Beware that this function is not strict about the match, in order to ease detection (meaning a close match counts as 1.0).
- charset_normalizer.cd.coherence_ratio(decoded_sequence: str, threshold: float = 0.1, lg_inclusion: Optional[str] = None) List[Tuple[str, float]]
Detect ANY language that can be identified in the given sequence. The sequence is analysed by layers; a layer is character extraction by alphabet/range.
- charset_normalizer.cd.encoding_languages(iana_name: str) List[str]
Single-byte encoding language association. Some code pages are heavily tied to particular language(s); this function provides that correspondence.
- charset_normalizer.cd.encoding_unicode_range(iana_name: str) List[str]
Return the Unicode ranges associated with a single-byte code page.
- charset_normalizer.cd.mb_encoding_languages(iana_name: str) List[str]
Multi-byte encoding language association. Some code pages are heavily tied to particular language(s); this function provides that correspondence.
- charset_normalizer.cd.merge_coherence_ratios(results: List[List[Tuple[str, float]]]) List[Tuple[str, float]]
Merge results previously produced by the coherence_ratio function. The return type is the same as coherence_ratio.
- charset_normalizer.cd.unicode_range_languages(primary_range: str) List[str]
Return the inferred languages used within a given Unicode range.
charset_normalizer.constant module
charset_normalizer.legacy module
- class charset_normalizer.legacy.CharsetDetector(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
- class charset_normalizer.legacy.CharsetDoctor(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
- class charset_normalizer.legacy.CharsetNormalizerMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)
- class charset_normalizer.legacy.CharsetNormalizerMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
Bases:
charset_normalizer.models.CharsetMatches
- static from_bytes(*args, **kwargs)
- static from_fp(*args, **kwargs)
- static from_path(*args, **kwargs)
- static normalize(*args, **kwargs)
- charset_normalizer.legacy.detect(byte_str: bytes) Dict[str, Optional[Union[str, float]]]
chardet legacy method. Detect the encoding of the given byte string. It should be mostly backward-compatible; the encoding name will match chardet's own naming whenever possible (except for encoding names chardet does not support). This function is deprecated; it exists to let you migrate your project easily, consult the documentation for further information. Not planned for removal.
- Parameters
byte_str – The byte sequence to examine.
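A minimal chardet-style sketch (the printed value is illustrative):
>>> from charset_normalizer import detect
>>> result = detect('Bсеки човек има право на образование.'.encode('utf_8'))
>>> result['encoding']  # the dict also carries 'language' and 'confidence'
'utf-8'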
charset_normalizer.md module
- class charset_normalizer.md.ArchaicUpperLowerPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.CjkInvalidStopPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
GB (Chinese) based encodings often render the full stop incorrectly when the content does not fit, and this can be easily detected by searching for the overuse of ‘丅’ and ‘丄’.
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.MessDetectorPlugin
Bases:
object
Base abstract class used for mess detection plugins. All detectors MUST extend it and implement the given methods.
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
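As a hedged sketch, a custom detector honouring this interface could look like the following. The TooManyDigitsPlugin below is hypothetical and not part of the library:
from charset_normalizer.md import MessDetectorPlugin


class TooManyDigitsPlugin(MessDetectorPlugin):
    """Hypothetical detector: treats overwhelmingly numeric text as chaotic."""

    def __init__(self) -> None:
        self._character_count = 0
        self._digit_count = 0

    def eligible(self, character: str) -> bool:
        # Feed every printable character to this detector.
        return character.isprintable()

    def feed(self, character: str) -> None:
        self._character_count += 1
        if character.isdigit():
            self._digit_count += 1

    def reset(self) -> None:
        self._character_count = 0
        self._digit_count = 0

    @property
    def ratio(self) -> float:
        # Never below 0.0, as required; only reports once digits dominate.
        if self._character_count == 0:
            return 0.0
        digit_share = self._digit_count / self._character_count
        return digit_share if digit_share > 0.5 else 0.0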
- class charset_normalizer.md.SuperWeirdWordPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.SuspiciousDuplicateAccentPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.SuspiciousRange
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.TooManyAccentuatedPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.TooManySymbolOrPunctuationPlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- class charset_normalizer.md.UnprintablePlugin
Bases:
charset_normalizer.md.MessDetectorPlugin
- eligible(character: str) bool
Determine whether the given character should be fed in.
- feed(character: str) None
The main routine, executed for each character. Insert here the logic by which the text would be considered chaotic.
- property ratio: float
Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; no upper bound is imposed.
- reset() None
Reset the plugin to its initial state.
- charset_normalizer.md.is_suspiciously_successive_range(unicode_range_a: Optional[str], unicode_range_b: Optional[str]) bool
Determine if two Unicode ranges seen next to each other can be considered suspicious.
- charset_normalizer.md.mess_ratio(decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False) float
Compute a mess ratio for a given decoded bytes sequence. The maximum threshold stops the computation early.
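One would expect forged mojibake to score higher than clean prose; a minimal sketch (exact ratios vary):
>>> from charset_normalizer.md import mess_ratio
>>> garbled = 'français à peu près'.encode('utf_8').decode('cp1252')
>>> mess_ratio(garbled) > mess_ratio('I am a clean, perfectly readable sentence.')
True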
charset_normalizer.models module
- class charset_normalizer.models.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)
Bases:
object
- add_submatch(other: charset_normalizer.models.CharsetMatch) None
- property alphabets: List[str]
- best() charset_normalizer.models.CharsetMatch
Kept for BC reasons. Will be removed in 3.0.
- property bom: bool
- property byte_order_mark: bool
- property chaos: float
- property chaos_secondary_pass: float
Check the chaos in the decoded text once more, this time on the full content. Use with caution; this can be very slow. Notice: Will be removed in 3.0
- property coherence: float
- property coherence_non_latin: float
Coherence ratio on the first non-Latin language detected, if ANY. Notice: Will be removed in 3.0
- property could_be_from_charset: List[str]
The complete list of encodings that output the exact SAME str result and therefore could be the originating encoding. This list includes the encoding given by the ‘encoding’ property.
- property encoding: str
- property encoding_aliases: List[str]
Encodings are known by many names; using this can help when searching for IBM855 while it is listed as CP855.
- property fingerprint: str
Retrieve the unique SHA256 hash computed using the transformed (re-encoded) payload, not the original one.
- first() charset_normalizer.models.CharsetMatch
Kept for BC reasons. Will be removed in 3.0.
- property has_submatch: bool
- property language: str
Most probable language found in the decoded sequence. If none was detected or inferred, the property will return “Unknown”.
- property languages: List[str]
Return the complete list of possible languages found in the decoded sequence. Usually not very useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.
- property multi_byte_usage: float
- output(encoding: str = 'utf_8') bytes
Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8. Any errors will simply be ignored by the encoder, NOT replaced.
- property percent_chaos: float
- property percent_coherence: float
- property raw: bytes
Original untouched bytes.
- property submatch: List[charset_normalizer.models.CharsetMatch]
- property w_counter: collections.Counter
Word counter instance on decoded text. Notice: Will be removed in 3.0
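Tying these properties together, a short sketch of inspecting a match (printed values are illustrative):
>>> from charset_normalizer import from_bytes
>>> best_guess = from_bytes('Привет, мир!'.encode('cp1251')).best()
>>> str(best_guess)
'Привет, мир!'
>>> best_guess.language
'Russian'
>>> best_guess.output()[:8]  # re-encoded payload, UTF-8 by default
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2'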
- class charset_normalizer.models.CharsetMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
Bases:
object
Container holding every CharsetMatch item, ordered by default from most probable to least probable. Acts like a list (iterable) but does not implement all related methods.
- append(item: charset_normalizer.models.CharsetMatch) None
Insert a single match. It will be inserted so as to preserve the sort order. It may be inserted as a submatch.
- best() Optional[charset_normalizer.models.CharsetMatch]
Simply return the first match. Strict equivalent to matches[0].
- first() Optional[charset_normalizer.models.CharsetMatch]
Redundant method; it calls best(). Kept for BC reasons.
- class charset_normalizer.models.CliDetectionResult(path: str, encoding: Optional[str], encoding_aliases: List[str], alternative_encodings: List[str], language: str, alphabets: List[str], has_sig_or_bom: bool, chaos: float, coherence: float, unicode_path: Optional[str], is_preferred: bool)
Bases:
object
- to_json() str
charset_normalizer.utils module
- charset_normalizer.utils.any_specified_encoding(sequence: bytes, search_zone: int = 4096) Optional[str]
Extract, using an ASCII-only decoder, any specified encoding found in the first n bytes.
- charset_normalizer.utils.cp_similarity(iana_name_a: str, iana_name_b: str) float
- charset_normalizer.utils.iana_name(cp_name: str, strict: bool = True) str
- charset_normalizer.utils.identify_sig_or_bom(sequence: bytes) Tuple[Optional[str], bytes]
Identify and extract the SIG/BOM in the given sequence.
- charset_normalizer.utils.is_accentuated(character: str) bool
- charset_normalizer.utils.is_ascii(character: str) bool
- charset_normalizer.utils.is_case_variable(character: str) bool
- charset_normalizer.utils.is_cjk(character: str) bool
- charset_normalizer.utils.is_cp_similar(iana_name_a: str, iana_name_b: str) bool
Determine if two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated using the cp_similarity function.
- charset_normalizer.utils.is_emoticon(character: str) bool
- charset_normalizer.utils.is_hangul(character: str) bool
- charset_normalizer.utils.is_hiragana(character: str) bool
- charset_normalizer.utils.is_katakana(character: str) bool
- charset_normalizer.utils.is_latin(character: str) bool
- charset_normalizer.utils.is_multi_byte_encoding(name: str) bool
Verify whether a specific encoding is a multi-byte one, based on its IANA name.
- charset_normalizer.utils.is_private_use_only(character: str) bool
- charset_normalizer.utils.is_punctuation(character: str) bool
- charset_normalizer.utils.is_separator(character: str) bool
- charset_normalizer.utils.is_symbol(character: str) bool
- charset_normalizer.utils.is_thai(character: str) bool
- charset_normalizer.utils.is_unicode_range_secondary(range_name: str) bool
- charset_normalizer.utils.range_scan(decoded_sequence: str) List[str]
- charset_normalizer.utils.remove_accent(character: str) str
- charset_normalizer.utils.should_strip_sig_or_bom(iana_encoding: str) bool
- charset_normalizer.utils.unicode_range(character: str) Optional[str]
Retrieve the official Unicode range name for a single character.
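A few of these helpers in action (a minimal sketch; values follow the Unicode character database and the IANA tables):
>>> from charset_normalizer.utils import unicode_range, is_accentuated, is_multi_byte_encoding
>>> unicode_range('é')
'Latin-1 Supplement'
>>> is_accentuated('é')
True
>>> is_multi_byte_encoding('utf_8'), is_multi_byte_encoding('cp1252')
(True, False)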
charset_normalizer.version module
Expose version
Module contents
Charset-Normalizer
The Real First Universal Charset Detector. A library that helps you read text from an unknown charset encoding. Motivated by chardet, this package tries to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.
- Basic usage:
>>> from charset_normalizer import from_bytes
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
>>> best_guess = results.best()
>>> str(best_guess)
'Bсеки човек има право на образование. Oбразованието!'
Other methods and usages are available; see the full documentation at <https://github.com/Ousret/charset_normalizer>. :copyright: (c) 2021 by Ahmed TAHRI. :license: MIT, see LICENSE for more details.
- class charset_normalizer.CharsetDetector(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
- class charset_normalizer.CharsetDoctor(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
- class charset_normalizer.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)
Bases:
object
- add_submatch(other: charset_normalizer.models.CharsetMatch) None
- property alphabets: List[str]
- best() charset_normalizer.models.CharsetMatch
Kept for BC reasons. Will be removed in 3.0.
- property bom: bool
- property byte_order_mark: bool
- property chaos: float
- property chaos_secondary_pass: float
Check the chaos in the decoded text once more, this time on the full content. Use with caution; this can be very slow. Notice: Will be removed in 3.0
- property coherence: float
- property coherence_non_latin: float
Coherence ratio on the first non-Latin language detected, if ANY. Notice: Will be removed in 3.0
- property could_be_from_charset: List[str]
The complete list of encodings that output the exact SAME str result and therefore could be the originating encoding. This list includes the encoding given by the ‘encoding’ property.
- property encoding: str
- property encoding_aliases: List[str]
Encodings are known by many names; using this can help when searching for IBM855 while it is listed as CP855.
- property fingerprint: str
Retrieve the unique SHA256 hash computed using the transformed (re-encoded) payload, not the original one.
- first() charset_normalizer.models.CharsetMatch
Kept for BC reasons. Will be removed in 3.0.
- property has_submatch: bool
- property language: str
Most probable language found in the decoded sequence. If none was detected or inferred, the property will return “Unknown”.
- property languages: List[str]
Return the complete list of possible languages found in the decoded sequence. Usually not very useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.
- property multi_byte_usage: float
- output(encoding: str = 'utf_8') bytes
Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8. Any errors will simply be ignored by the encoder, NOT replaced.
- property percent_chaos: float
- property percent_coherence: float
- property raw: bytes
Original untouched bytes.
- property submatch: List[charset_normalizer.models.CharsetMatch]
- property w_counter: collections.Counter
Word counter instance on decoded text. Notice: Will be removed in 3.0
- class charset_normalizer.CharsetMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
Bases:
object
Container holding every CharsetMatch item, ordered by default from most probable to least probable. Acts like a list (iterable) but does not implement all related methods.
- append(item: charset_normalizer.models.CharsetMatch) None
Insert a single match. It will be inserted so as to preserve the sort order. It may be inserted as a submatch.
- best() Optional[charset_normalizer.models.CharsetMatch]
Simply return the first match. Strict equivalent to matches[0].
- first() Optional[charset_normalizer.models.CharsetMatch]
Redundant method; it calls best(). Kept for BC reasons.
- class charset_normalizer.CharsetNormalizerMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: Optional[str] = None)
- class charset_normalizer.CharsetNormalizerMatches(results: Optional[List[charset_normalizer.models.CharsetMatch]] = None)
Bases:
charset_normalizer.models.CharsetMatches
- static from_bytes(*args, **kwargs)
- static from_fp(*args, **kwargs)
- static from_path(*args, **kwargs)
- static normalize(*args, **kwargs)
- charset_normalizer.detect(byte_str: bytes) Dict[str, Optional[Union[str, float]]]
chardet legacy method. Detect the encoding of the given byte string. It should be mostly backward-compatible; the encoding name will match chardet's own naming whenever possible (except for encoding names chardet does not support). This function is deprecated; it exists to let you migrate your project easily, consult the documentation for further information. Not planned for removal.
- Parameters
byte_str – The byte sequence to examine.
- charset_normalizer.from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Given a raw bytes sequence, return the best possible charsets usable to render str objects. If there are no results, it is a strong indicator that the source is binary/not text. By default, the process extracts 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence, and gives up on a particular code page once 20% of it is measured as mess. These criteria are fully customizable.
The preemptive behaviour DOES NOT replace the traditional detection workflow; it prioritizes a particular code page but never takes it for granted. It can improve performance.
If you want to focus the detection on certain code pages, or exclude others, use cp_isolation and cp_exclusion for that purpose.
This function strips the SIG (BOM) from the payload/sequence every time, except for UTF-16 and UTF-32.
- charset_normalizer.from_fp(fp: BinaryIO, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Same as from_bytes, but using a file pointer that is already opened and ready. Will not close the file pointer.
- charset_normalizer.from_path(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True, explain: bool = False) charset_normalizer.models.CharsetMatches
Same as from_bytes, but with one extra step: opening and reading the given file path in binary mode. Can raise IOError.
- charset_normalizer.normalize(path: os.PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Optional[List[str]] = None, cp_exclusion: Optional[List[str]] = None, preemptive_behaviour: bool = True) charset_normalizer.models.CharsetMatch
Take a (text-based) file path and try to create another file next to it, this time using UTF-8.