cornsnake.util_pdf

Functions for extracting text from a PDF file and checking if a file is a PDF.

Documentation: http://docs.mrseanryan.cornsnake.s3-website-eu-west-1.amazonaws.com/cornsnake/util_pdf.html

"""
Functions for extracting text from a PDF file and checking if a file is a PDF.

[Documentation](http://docs.mrseanryan.cornsnake.s3-website-eu-west-1.amazonaws.com/cornsnake/util_pdf.html)
"""


def extract_text_from_pdf(filepath):
    """
    Function to extract text from a PDF file.

    Args:
    filepath (str): The path to the PDF file.

    Returns:
    str: The extracted text from the PDF file.
    """
    import fitz  # imported here to avoid forcing an install of PyMuPDF unless actually used

    with fitz.open(filepath) as doc:
        FORM_FEED = 12
        text = chr(FORM_FEED).join([page.get_text() for page in doc])
        return text


def is_pdf(filepath):
    """
    Function to check if a file is a PDF.

    Args:
    filepath (str): The path to the file.

    Returns:
    bool: True if the file is a PDF, False otherwise.
    """
    return filepath[-4:] == ".pdf"

def extract_text_from_pdf(filepath):

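Extract the text of a PDF file. The file is opened with PyMuPDF, which is imported as fitz inside the function so that the dependency is only required when this function is actually called. The text of each page is read in order, and the pages are joined with a form feed character (chr(12)).

Args: filepath (str): The path to the PDF file.

Returns: str: The extracted text, with consecutive pages separated by a form feed character.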
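
Because pages are joined with a form feed, the returned string can be split back into per-page text. The following is a minimal sketch of that idea, not part of cornsnake: split_pages is a hypothetical helper, and the sample string stands in for a real return value so the snippet runs without PyMuPDF installed. It assumes the page text itself contains no form feed characters.

```python
FORM_FEED = chr(12)  # the separator extract_text_from_pdf places between pages


def split_pages(text):
    """Hypothetical helper: split extracted text into one string per page."""
    return text.split(FORM_FEED)


# Stand-in for the value returned by extract_text_from_pdf("some_file.pdf").
sample = "Text of page one" + FORM_FEED + "Text of page two"

pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(f"--- page {number} ---")
    print(page.strip())
```

With a real file, sample would be replaced by the result of calling extract_text_from_pdf on its path.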

def is_pdf(filepath):

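Check whether a file path ends with the extension ".pdf". Only the last four characters of the path are compared; the comparison is case-sensitive and the file contents are not inspected, so a path such as "REPORT.PDF" returns False.

Args: filepath (str): The path to the file.

Returns: bool: True if the path ends with ".pdf", False otherwise.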
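
Where a stricter or more lenient check is needed, something along the following lines could be used alongside is_pdf. This is only a sketch under stated assumptions: has_pdf_extension and looks_like_pdf are hypothetical names, not cornsnake functions, and the content check assumes the "%PDF-" header sits at the very start of the file, which holds for typical PDFs but not for files with leading bytes before the header.

```python
import os

PDF_MAGIC = b"%PDF-"  # PDF files conventionally begin with this header


def has_pdf_extension(filepath):
    """Hypothetical variant of is_pdf that ignores the case of the extension."""
    return os.path.splitext(filepath)[1].lower() == ".pdf"


def looks_like_pdf(filepath):
    """Hypothetical content check: True if the file starts with the PDF header."""
    with open(filepath, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


print(has_pdf_extension("REPORT.PDF"))     # True
print(has_pdf_extension("notes.pdf.txt"))  # False
```

The extension check is cheap and needs no file access; the header check costs one small read but does not depend on how the file is named.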