Mixed Raster Content encoding
- internetarchivepdf.mrc.create_mrc_hocr_components(image, hocr_word_data, dpi=None, downsample=None, bg_downsample=None, fg_downsample=None, denoise_mask=None, timing_data=None, errors=None)
Create the MRC components: mask, foreground and background
Args:
image (PIL.Image): Image to be decomposed
hocr_word_data: OCR data about found text on the page
downsample (int): factor by which the OCR data is to be downsampled
bg_downsample (int): factor by which the background image is to be downscaled, if any
denoise_mask (bool): Whether to denoise the image if it is deemed too noisy
timing_data: Optional structure to which individual timing measurements are logged
errors: Optional argument (of type set) with encountered runtime errors
Returns a tuple of the components, as numpy arrays: (mask, foreground, background)
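To illustrate what the three returned components mean, here is a minimal sketch of how MRC layers recombine into a page. It is not the library's implementation: `compose_mrc` is a hypothetical helper, and it assumes that the mask, foreground and background all share the same resolution (no downsampling) and that mask = 1 selects the foreground.

```python
import numpy as np

def compose_mrc(mask, foreground, background):
    """Recombine MRC layers (hypothetical helper, not part of internetarchivepdf).

    Assumes mask is HxW, foreground/background are HxWx3 at the same
    resolution, and that mask == 1 marks foreground (text) pixels.
    """
    selector = mask.astype(bool)[..., None]  # broadcast the mask over colour channels
    return np.where(selector, foreground, background)

# Toy 2x2 page: black "text" on a white background.
mask = np.array([[1, 0], [0, 1]], dtype=np.uint8)
fg = np.zeros((2, 2, 3), dtype=np.uint8)
bg = np.full((2, 2, 3), 255, dtype=np.uint8)
page = compose_mrc(mask, fg, bg)
```

Because each pixel comes from exactly one layer, the foreground and background can be compressed aggressively while the mask preserves sharp text edges.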
- internetarchivepdf.mrc.denoise_bregman(binary_img)
Denoise a binary numpy array using Bregman total variation denoising
Args:
binary_img (np.array): input array
Returns the denoised array
- internetarchivepdf.mrc.encode_mrc_background(np_bg, bg_compression_flags, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=None, debug=False)
Encode background image as JPEG2000, with the provided compression settings and JPEG2000 encoder.
Args:
np_bg (numpy.array): Background image array
bg_compression_flags (str): Compression flags
tmp_dir (str): path to the temporary directory to write images to
jpeg2000_implementation (str): What JPEG2000 implementation to use
mrc_image_format (str): What image format to produce
timing_data (optional): Add time information to timing_data structure
Returns the filepath to the JPEG2000 background image
- internetarchivepdf.mrc.encode_mrc_foreground(np_fg, fg_compression_flags, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=False, debug=False)
Encode foreground image as JPEG2000, with the provided compression settings and JPEG2000 encoder.
Args:
np_fg (numpy.array): Foreground image array
fg_compression_flags (str): Compression flags
tmp_dir (str): path to the temporary directory to write images to
jpeg2000_implementation (str): What JPEG2000 implementation to use
mrc_image_format (str): What image format to produce
timing_data (optional): Add time information to timing_data structure
Returns the filepath to the JPEG2000 foreground image
- internetarchivepdf.mrc.encode_mrc_img(np_img, img_compression_flags, imgtype=None, tmp_dir=None, jpeg2000_implementation=None, mrc_image_format=None, timing_data=None, threads=False, debug=False)
Encode image as JPEG2000 or JPEG, with the provided compression settings and JPEG2000/JPEG encoder.
Args:
np_img (numpy.array): Image array
img_compression_flags (str): Compression flags
imgtype (str): ‘bg’ or ‘fg’
tmp_dir (str): path to the temporary directory to write images to
jpeg2000_implementation (str): What JPEG2000 implementation to use
mrc_image_format (str): What image format to produce
timing_data (optional): Add time information to timing_data structure
debug (bool, optional): Write debug info to stderr
Returns the filepath to the encoded image (JPEG2000 or JPEG)
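As a rough illustration of how a compression-flags string could end up on an encoder command line, the sketch below builds (but does not run) an argument list for OpenJPEG's `opj_compress`. The helper `build_opj_compress_cmd`, the file names, and the idea of splitting the flags with `shlex` are assumptions made for this example; the library may assemble its command differently.

```python
import shlex

def build_opj_compress_cmd(in_path, out_path, compression_flags):
    """Build an opj_compress argument list (hypothetical helper, not executed here).

    -i and -o are opj_compress's input and output options; the flags string
    is split shell-style and appended unchanged.
    """
    return ["opj_compress", "-i", in_path, "-o", out_path] + shlex.split(compression_flags)

# "-r 500" asks opj_compress for a 500:1 compression ratio.
cmd = build_opj_compress_cmd("bg.tiff", "bg.jp2", "-r 500")
```

Passing the result to `subprocess.run(cmd, check=True)` would invoke the encoder, provided `opj_compress` is installed and the input file exists.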
- internetarchivepdf.mrc.encode_mrc_mask(np_mask, tmp_dir=None, jbig2=True, embedded_jbig2=False, timing_data=None, debug=False)
Encode mask image either to JBIG2 or PNG.
Args:
np_mask (numpy.array): Mask image array
tmp_dir (str): path to the temporary directory to write images to
jbig2 (bool): Whether to encode to JBIG2 or PNG
embedded_jbig2 (bool): Whether to encode to JBIG2 with or without header
timing_data (optional): Add time information to timing_data structure
Returns a tuple (str, str): the first entry is the JBIG2 path, if any; the second is the PNG path.
- internetarchivepdf.mrc.partial_blur(mask, img, sigma=5, mode=None)
Blur the parts of the image ‘img’ where mask = 0. The blur only draws on colours from pixels where mask = 1, so regions where mask = 0 are effectively erased and filled with blurred colours from the mask = 1 regions.
At the end, all pixels from img where mask = 1 are restored.
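The masked blur described above can be sketched with a normalised convolution. This is an illustration under stated assumptions, not the library's code: it handles 2D greyscale arrays only, and it swaps in a simple box blur (`box_blur`, a hypothetical helper) for whatever blur `partial_blur` actually applies with `sigma`.

```python
import numpy as np

def box_blur(a, radius):
    """Mean filter with zero padding; a stand-in for a Gaussian blur."""
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros(a.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def partial_blur_sketch(mask, img, radius=2):
    """Fill mask == 0 pixels with blurred colours taken from mask == 1 pixels."""
    mask_f = mask.astype(np.float64)
    # Zero out mask == 0 pixels so they contribute nothing to the blur.
    weighted = box_blur(img.astype(np.float64) * mask_f, radius)
    weights = box_blur(mask_f, radius)
    # Normalise by the blurred mask; leave 0 where no mask == 1 pixel is in reach.
    blurred = np.divide(weighted, weights,
                        out=np.zeros_like(weighted), where=weights > 0)
    # Restore the original pixels where mask == 1.
    return np.where(mask.astype(bool), img, blurred)

mask = np.zeros((5, 5), dtype=np.uint8)
mask[2, 2] = 1
img = np.full((5, 5), 7, dtype=np.uint8)
img[2, 2] = 100
out = partial_blur_sketch(mask, img, radius=2)
```

In this toy case the only mask = 1 pixel has value 100 and lies within the blur radius of every other pixel, so the values of 7 under mask = 0 are replaced by approximately 100 everywhere.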