Mixed Raster Content encoding

internetarchivepdf.mrc.create_mrc_hocr_components(image, hocr_word_data, dpi=None, downsample=None, bg_downsample=None, fg_downsample=None, denoise_mask=None, timing_data=None, errors=None)

Create the MRC components: mask, foreground and background

Args:

  • image (PIL.Image): Image to be decomposed

  • hocr_word_data: OCR data about found text on the page

  • downsample (int): factor by which the OCR data is to be downsampled

  • bg_downsample (int): factor by which the background image should be downscaled, if any

  • denoise_mask (bool): Whether to denoise the image if it is deemed too noisy

  • timing_data: Optional structure to which individual timings are logged

  • errors: Optional argument (of type set) with encountered runtime errors

Returns a tuple of the components, as numpy arrays: (mask, foreground, background)
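To illustrate what the three components represent, the following is a minimal sketch of how an MRC decoder could recombine them. It is not part of internetarchivepdf: the function name compose_mrc is hypothetical, and the sketch assumes that a nonzero mask pixel selects the foreground and that all three arrays share the same resolution (that is, no downsampling was applied).

```python
import numpy as np


def compose_mrc(mask, foreground, background):
    """Recombine MRC layers (hypothetical helper, not a library function).

    Assumes nonzero mask pixels select the foreground and that all
    arrays have the same height and width.
    """
    mask_b = np.asarray(mask).astype(bool)
    foreground = np.asarray(foreground)
    background = np.asarray(background)
    if foreground.ndim == 3:
        # Broadcast the 2-D mask over the colour channels.
        mask_b = mask_b[..., np.newaxis]
    # Take the foreground where the mask is set, the background elsewhere.
    return np.where(mask_b, foreground, background)
```

When the foreground or background has been downsampled, it would first have to be scaled back to the mask's dimensions before this step.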

internetarchivepdf.mrc.denoise_bregman(binary_img)

Denoise a binary numpy array using Bregman total variation denoising

Args:

  • binary_img (np.array): input array

Returns the denoised array

internetarchivepdf.mrc.encode_mrc_background(np_bg, bg_compression_flags, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=None, debug=False)

Encode background image as JPEG2000, with the provided compression settings and JPEG2000 encoder.

Args:

  • np_bg (numpy.array): Background image array

  • bg_compression_flags (str): Compression flags

  • tmp_dir (str): path to the temporary directory to write images to

  • jpeg2000_implementation (str): What JPEG2000 implementation to use

  • mrc_image_format (str): What image format to produce

  • timing_data (optional): Add time information to timing_data structure

Returns the filepath to the JPEG2000 background image

internetarchivepdf.mrc.encode_mrc_foreground(np_fg, fg_compression_flags, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=False, debug=False)

Encode foreground image as JPEG2000, with the provided compression settings and JPEG2000 encoder.

Args:

  • np_fg (numpy.array): Foreground image array

  • fg_compression_flags (str): Compression flags

  • tmp_dir (str): path to the temporary directory to write images to

  • jpeg2000_implementation (str): What JPEG2000 implementation to use

  • mrc_image_format (str): What image format to produce

  • timing_data (optional): Add time information to timing_data structure

Returns the filepath to the JPEG2000 foreground image

internetarchivepdf.mrc.encode_mrc_img(np_img, img_compression_flags, imgtype=None, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=False, debug=False)

Encode image as JPEG2000 or JPEG, with the provided compression settings and JPEG2000/JPEG encoder.

Args:

  • np_img (numpy.array): Image array

  • img_compression_flags (str): Compression flags

  • imgtype (str): ‘bg’ or ‘fg’

  • tmp_dir (str): path to the temporary directory to write images to

  • jpeg2000_implementation (str): What JPEG2000 implementation to use

  • mrc_image_format (str): What image format to produce

  • timing_data (optional): Add time information to timing_data structure

  • debug (bool, optional): Write debug info to stderr

Returns the filepath to the encoded (JPEG2000 or JPEG) image

internetarchivepdf.mrc.encode_mrc_mask(np_mask, tmp_dir=None, jbig2=True, embedded_jbig2=False, timing_data=None, debug=False)

Encode mask image either to JBIG2 or PNG.

Args:

  • np_mask (numpy.array): Mask image array

  • tmp_dir (str): path to the temporary directory to write images to

  • jbig2 (bool): Whether to encode to JBIG2 or PNG

  • embedded_jbig2 (bool): Whether to encode to JBIG2 with or without header

  • timing_data (optional): Add time information to timing_data structure

Returns a tuple (str, str), where the first entry is the JBIG2 path, if any, and the second is the PNG path.

internetarchivepdf.mrc.partial_blur(mask, img, sigma=5, mode=None)

Blur the parts of the image ‘img’ where mask = 0. The blur draws its values only from colours where mask = 1, effectively erasing (blurring over) the parts of the image where mask = 0 with colours taken from where mask = 1.

At the end, restore all pixels from img where mask = 1.
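The behaviour described above resembles a normalised (mask-weighted) Gaussian blur. The following is a minimal NumPy-only sketch of that technique for a single-channel image; it is not the library's implementation. The names gaussian_kernel, blur_axis and partial_blur_sketch are hypothetical, and the kernel truncation and zero padding at the borders are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(sigma, truncate=3.0):
    """Normalised 1-D Gaussian kernel (hypothetical helper)."""
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x * x) / (2.0 * sigma * sigma))
    return kernel / kernel.sum()


def blur_axis(a, kernel, axis):
    """Convolve a 2-D array with the kernel along one axis, zero padded."""
    pad = len(kernel) // 2
    widths = [(0, 0)] * a.ndim
    widths[axis] = (pad, pad)
    padded = np.pad(a, widths, mode="constant")
    # 'valid' on the padded array keeps the original length along the axis.
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
    )


def partial_blur_sketch(mask, img, sigma=5.0):
    """Fill pixels where mask == 0 with blurred colours from mask == 1."""
    mask_b = np.asarray(mask).astype(bool)
    mask_f = mask_b.astype(np.float64)
    img_f = np.asarray(img).astype(np.float64)
    kernel = gaussian_kernel(sigma)

    def blur(a):
        return blur_axis(blur_axis(a, kernel, 0), kernel, 1)

    # Blur only the masked-in colours, then divide by the blurred mask so
    # that masked-out pixels contribute no weight to the average.
    weighted = blur(img_f * mask_f)
    weights = blur(mask_f)
    out = np.where(weights > 0, weighted / np.maximum(weights, 1e-12), img_f)
    # Restore the original pixels where mask == 1.
    out[mask_b] = img_f[mask_b]
    return out
```

Dividing by the blurred mask is what keeps the colours under mask = 0 from leaking into the result; a plain Gaussian blur of the whole image would mix both regions.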

internetarchivepdf.mrc.threshold_image(img, dpi, k=0.34)

Perform Sauvola binarisation on the given image

Args:

  • img (np.ndarray): input image array

  • dpi (int): dpi for Sauvola, used to calculate window size if not None

  • k (float): k parameter, defaults to 0.34

Returns binarised numpy.ndarray
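For reference, Sauvola binarisation compares each pixel with a local threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation within a window around the pixel and R is the dynamic range of the standard deviation (commonly 128 for 8-bit images). The following is a minimal NumPy-only sketch of that formula, not the library's implementation. The name sauvola_sketch, the explicit window parameter, the edge padding and the output polarity (True for pixels brighter than the threshold) are assumptions; how threshold_image derives its window size from dpi is not shown here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sauvola_sketch(img, window=15, k=0.34, r=128.0):
    """Sauvola threshold on a 2-D greyscale array (hypothetical helper).

    `window` must be odd; `r` is the assumed dynamic range of the
    standard deviation for 8-bit input.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    img_f = np.asarray(img).astype(np.float64)
    pad = window // 2
    # Repeat edge pixels so every pixel has a full window around it.
    padded = np.pad(img_f, pad, mode="edge")
    windows = sliding_window_view(padded, (window, window))
    mean = windows.mean(axis=(-2, -1))
    std = windows.std(axis=(-2, -1))
    threshold = mean * (1.0 + k * (std / r - 1.0))
    # True where the pixel is brighter than its local threshold.
    return img_f > threshold
```

A larger k lowers the threshold in flat regions (where s is small), so fewer pixels are classified as dark; this sliding-window version favours clarity over speed and uses considerably more memory than an integral-image implementation would.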