cornsnake.util_pdf
Functions for extracting text from a PDF file and checking if a file is a PDF.
"""
Functions for extracting text from a PDF file and checking if a file is a PDF.

[Documentation](http://docs.mrseanryan.cornsnake.s3-website-eu-west-1.amazonaws.com/cornsnake/util_pdf.html)
"""


def extract_text_from_pdf(filepath):
    """
    Function to extract text from a PDF file.

    Args:
    filepath (str): The path to the PDF file.

    Returns:
    str: The extracted text from the PDF file.
    """
    import fitz  # try avoid forcing install of PyMuPDF unless actually used

    with fitz.open(filepath) as doc:
        FORM_FEED = 12
        text = chr(FORM_FEED).join([page.get_text() for page in doc])
    return text


def is_pdf(filepath):
    """
    Function to check if a file is a PDF.

    Args:
    filepath (str): The path to the file.

    Returns:
    bool: True if the file is a PDF, False otherwise.
    """
    return filepath[-4:] == ".pdf"
def extract_text_from_pdf(filepath):
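Extracts the text of every page in a PDF file and joins the pages with a form-feed character (ASCII 12). PyMuPDF (imported as fitz) is loaded only when this function is called, so it does not need to be installed unless the function is used.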
Args: filepath (str): The path to the PDF file.
Returns: str: The extracted text from the PDF file.
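Because the pages are joined with a form-feed character, the returned string can be split back into per-page text. The following is a minimal sketch, not part of cornsnake: `split_pages` is a hypothetical helper, the sample string stands in for a real return value of `extract_text_from_pdf`, and it assumes the page text itself contains no form-feed characters.

```python
FORM_FEED = chr(12)  # the separator extract_text_from_pdf places between pages


def split_pages(text):
    """Split text returned by extract_text_from_pdf into per-page strings.

    Hypothetical helper for illustration; assumes no page contains a form feed.
    """
    return text.split(FORM_FEED)


# Stand-in for the value returned by extract_text_from_pdf("some_file.pdf")
sample = "Page one text" + FORM_FEED + "Page two text"

pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(f"Page {number}: {len(page)} characters")
```

With a real document, `sample` would be replaced by the string returned from `extract_text_from_pdf`, which requires PyMuPDF to be installed.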
def is_pdf(filepath):
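Checks whether a file path ends with the ".pdf" extension. The comparison is case-sensitive and looks only at the last four characters of the path; the file's contents are not inspected.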
Args: filepath (str): The path to the file.
Returns: bool: True if the file is a PDF, False otherwise.
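Since `is_pdf` compares only the last four characters of the path, a file named `REPORT.PDF` is reported as not a PDF, and any file renamed to end in `.pdf` is reported as one. Where that matters, a stricter check might look like the sketch below. Both helpers are hypothetical and not part of cornsnake; the header check relies on PDF files beginning with the bytes `%PDF-`.

```python
import os
import tempfile


def has_pdf_extension(filepath):
    """Case-insensitive variant of the extension check (hypothetical helper)."""
    return filepath.lower().endswith(".pdf")


def has_pdf_header(filepath):
    """Return True if the file begins with the bytes '%PDF-' (hypothetical helper)."""
    with open(filepath, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demonstrate on a throwaway file that has a PDF header but an upper-case extension.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "REPORT.PDF")
    with open(path, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    extension_ok = has_pdf_extension(path)
    header_ok = has_pdf_header(path)
    plain_check = path[-4:] == ".pdf"  # the comparison is_pdf performs

print(extension_ok, header_ok, plain_check)
```

The header check opens the file, so unlike `is_pdf` it raises an error if the path does not exist.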