codecs.py
codecs.py sits between Python callers and the C codec machinery in
Modules/_codecsmodule.c. It provides the search-function registry, the
CodecInfo tuple subclass, incremental codec base classes, and stream wrappers.
Map
| Lines | Symbol | Role |
|---|---|---|
| 1–60 | imports, BOM_* constants | byte-order marks for UTF-x encodings |
| 61–120 | CodecInfo | tuple subclass holding encode/decode/streamreader/streamwriter, plus name and incremental codec attributes |
| 121–180 | register(), lookup() | add/find a codec search function via C extension |
| 181–260 | open() | wraps a binary file with StreamReaderWriter |
| 261–360 | EncoderWrapper, DecoderWrapper | adapt incremental codecs to the stateless interface |
| 361–500 | IncrementalEncoder, IncrementalDecoder | base classes for stateful codecs |
| 501–620 | StreamWriter | buffered write side, write() / writelines() / reset() |
| 621–760 | StreamReader | buffered read side, read() / readline() / readlines() |
| 761–840 | StreamReaderWriter | combines both sides for open() |
| 841–920 | StreamRecoder | re-encodes on the fly between two codecs |
| 921–1000 | charmap_encode(), charmap_decode() | single-byte charmap codec helpers |
| 1001–1100 | make_encoding_map(), make_identity_dict() | charmap construction utilities |
Reading
register and lookup
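register() and lookup() delegate entirely to the C extension. The search
function receives the normalized encoding name (lowercased, with spaces and
hyphens converted to underscores) and must return a CodecInfo or None.
Returning None tells lookup() to try the next registered search function.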
# CPython: Lib/codecs.py:128 register
def register(search_function):
    _codecs.register(search_function)

# CPython: Lib/codecs.py:140 lookup
def lookup(encoding):
    return _codecs.lookup(encoding)
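The returned CodecInfo is a tuple subclass (not a collections.namedtuple)
whose four items are the callables encode, decode, streamreader, and
streamwriter. The codec name and the incremental codec factories are exposed
as the attributes name, incrementalencoder, and incrementaldecoder.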
open and StreamReaderWriter
open() opens a binary file then wraps it with a StreamReaderWriter so
callers get a text-mode file object backed by an arbitrary codec.
# CPython: Lib/codecs.py:195 open
def open(filename, mode='r', encoding=None,
         errors='strict', buffering=-1):
    ...
    file = builtins.open(filename, mode, buffering)
    if encoding is None:
        # No codec requested: hand back the plain file object.
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file,
                             info.streamreader,
                             info.streamwriter,
                             errors)
    # Record the codec name for introspection.
    srw.encoding = encoding
    return srw
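The wrapping step can be tried without touching the file system. The sketch below assumes an in-memory io.BytesIO as a stand-in for the binary file and uses latin-1 purely as an example codec; it repeats by hand what open() does after the file has been opened.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for the binary file object
info = codecs.lookup("latin-1")

# Same wiring as in open(): one reader class and one writer class over one stream.
srw = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter, "strict")
srw.encoding = "latin-1"

srw.write("héllo\n")
srw.write("wörld\n")
stored = raw.getvalue()  # bytes that reached the underlying stream

srw.seek(0)              # rewinds the stream and resets the reader state
lines = srw.readlines()  # decoded text, line endings kept
```

The caller writes and reads str while the underlying stream only ever sees bytes, which is the property the paragraph above describes.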
IncrementalEncoder and IncrementalDecoder
These base classes define the contract for stateful codecs. Subclasses must
implement encode() / decode(). The reset() method is a no-op in the
base but overridden by stateful codecs such as UTF-16.
# CPython: Lib/codecs.py:385 IncrementalEncoder
class IncrementalEncoder:
    def __init__(self, errors='strict'):
        self.errors = errors
        self.buffer = ""

    def encode(self, input, final=False):
        raise NotImplementedError

    def reset(self):
        pass

    def getstate(self):
        return 0

    def setstate(self, state):
        pass
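To see why the incremental interface carries state, the following sketch feeds a UTF-8 byte string to the standard library's incremental decoder in chunks that deliberately split two multi-byte characters. The chunk boundaries are arbitrary choices made for the illustration.

```python
import codecs

data = "naïve €".encode("utf-8")  # 10 bytes: ï takes 2 bytes, € takes 3

# Split inside ï (after its first byte) and inside € (after its first byte).
chunks = [data[:3], data[3:8], data[8:]]

decoder = codecs.getincrementaldecoder("utf-8")()

pieces = []
for chunk in chunks:
    # Incomplete trailing bytes are held back until the next call.
    pieces.append(decoder.decode(chunk))

# final=True signals end of input; leftover bytes would raise here.
pieces.append(decoder.decode(b"", final=True))
text = "".join(pieces)

# A stateless decode of the first chunk fails because the character is cut in half.
try:
    chunks[0].decode("utf-8")
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False
```

Calling decoder.reset() between unrelated inputs discards any held-back bytes, which is the role reset() plays in the contract above.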
charmap_encode
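charmap_encode converts a Unicode string to bytes using a mapping dict and
returns a (bytes, characters consumed) tuple. With errors='strict', the
mapping must cover every character that appears in the input or the call
raises UnicodeEncodeError; other error handlers such as 'ignore' deal with
unmapped characters instead of raising.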
# CPython: Lib/codecs.py:940 charmap_encode
def charmap_encode(input, errors='strict', mapping=None):
    return _codecs.charmap_encode(input, errors, mapping)
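A small sketch of the behavior, assuming a hypothetical two-entry mapping from code points to byte values (real charmap codecs cover up to 256 entries):

```python
import codecs

# Hypothetical mapping: Unicode code point -> byte value.
mapping = {ord("A"): 0x41, ord("€"): 0x80}

# Every character is mapped, so the call succeeds.
encoded, consumed = codecs.charmap_encode("A€", "strict", mapping)

# "B" is not in the mapping, so strict mode raises.
try:
    codecs.charmap_encode("AB", "strict", mapping)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The 'ignore' handler drops the unmapped character instead.
skipped, _ = codecs.charmap_encode("AB", "ignore", mapping)
```

A pure-Go port of this function would need the same two paths: a direct table lookup for mapped characters and a dispatch to the error handler for everything else.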
gopy notes
- _codecs is Modules/_codecsmodule.c. The Go equivalent should expose a RegisterCodec(searchFn) function and store search functions in a slice protected by a sync.RWMutex.
- CodecInfo can be a plain Go struct; the four function fields map to func([]byte, string) ([]byte, int, error) signatures (encode side) and their mirror for decode.
- StreamReader and StreamWriter are stateful; model them as structs holding an io.Reader / io.Writer plus a pending-byte buffer.
- IncrementalEncoder / IncrementalDecoder map naturally to Go interfaces. Name them IncrementalEncoderIface or similar to avoid collision with the concrete base type.
- charmap_encode / charmap_decode are thin wrappers around the C function; re-implement them as a pure Go loop over []rune for the initial port.
CPython 3.14 changes
- StreamReader.read() now raises UnicodeDecodeError with a more precise byte offset when the codec returns partial data at EOF.
- The errors argument is validated earlier in IncrementalEncoder.__init__ using the same helper that str.encode uses, giving consistent error messages.
- make_encoding_map() gained a fast path for identity mappings to reduce startup cost for Latin-1-family codecs.
- No new public symbols were added between 3.13 and 3.14 for this module.