Package org.apache.pdfbox.text
Class PDFMarkedContentExtractor
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.LegacyPDFStreamEngine
org.apache.pdfbox.text.PDFMarkedContentExtractor
This is a stream engine to extract the marked content of a PDF.
-
Field Summary
Fields
private final Map<String, List<TextPosition>> characterListMapping
private final Deque<PDMarkedContent> currentMarkedContents
private final List<PDMarkedContent> markedContents
private boolean suppressDuplicateOverlappingText
-
Constructor Summary
Constructors
PDFMarkedContentExtractor() - Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding) - Constructor.
-
Method Summary
Modifier and Type, Method, Description
void beginMarkedContentSequence(COSName tag, COSDictionary properties) - Called when a marked content group begins.
void endMarkedContentSequence() - Called when a marked content group ends.
List<PDMarkedContent> getMarkedContents()
boolean isSuppressDuplicateOverlappingText()
protected void processTextPosition(TextPosition text) - This will process a TextPosition object and add the text to the list of characters on a page.
void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText) - By default the class will attempt to remove duplicate text that overlaps.
private boolean within(float first, float second, float variance) - This will determine whether two floating point numbers are within a specified variance.
void xobject
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, processPage, showGlyph
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Details
-
suppressDuplicateOverlappingText
private boolean suppressDuplicateOverlappingText
-
markedContents
-
currentMarkedContents
-
characterListMapping
-
-
Constructor Details
-
PDFMarkedContentExtractor
Instantiate a new PDFMarkedContentExtractor object.
Throws:
IOException
-
PDFMarkedContentExtractor
Constructor. Will apply encoding-specific conversions to the output text.
Parameters:
encoding - The encoding that the output will be written in.
Throws:
IOException
-
-
Method Details
-
isSuppressDuplicateOverlappingText
public boolean isSuppressDuplicateOverlappingText()
Returns:
the suppressDuplicateOverlappingText setting.
-
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
By default, the class attempts to remove duplicate text that overlaps. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
Parameters:
suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
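The suppression described above can be illustrated with a small standalone sketch. It is not the PDFBox implementation: the `OverlapFilter` and `Glyph` types are hypothetical, and the tolerance of one third of the glyph width is an assumption chosen for the example. The idea is that a glyph is dropped when a glyph with the same text has already been seen at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OverlapFilter {
    /** Hypothetical positioned glyph used only for this sketch; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    private final boolean suppressDuplicates;
    // Glyphs already accepted, grouped by text, so only same-text glyphs are compared.
    private final Map<String, List<Glyph>> seenByText = new HashMap<>();
    private final List<Glyph> kept = new ArrayList<>();

    OverlapFilter(boolean suppressDuplicates) {
        this.suppressDuplicates = suppressDuplicates;
    }

    /** Adds a glyph unless suppression is on and an equal glyph sits at nearly the same spot. */
    void add(Glyph glyph) {
        if (suppressDuplicates) {
            List<Glyph> sameText = seenByText.computeIfAbsent(glyph.text, k -> new ArrayList<>());
            float tolerance = glyph.width / 3.0f; // assumed tolerance: one third of the glyph width
            for (Glyph other : sameText) {
                if (within(other.x, glyph.x, tolerance) && within(other.y, glyph.y, tolerance)) {
                    return; // same text painted over an existing glyph: drop it
                }
            }
            sameText.add(glyph);
        }
        kept.add(glyph);
    }

    List<Glyph> kept() {
        return kept;
    }

    private static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        OverlapFilter filter = new OverlapFilter(true);
        filter.add(new Glyph("A", 100.0f, 200.0f, 9.0f));
        filter.add(new Glyph("A", 100.5f, 200.0f, 9.0f)); // overprinted duplicate, dropped
        filter.add(new Glyph("A", 120.0f, 200.0f, 9.0f)); // same letter elsewhere, kept
        System.out.println(filter.kept().size()); // prints 2
    }
}
```

With suppression switched off, the comparison loop is skipped entirely, which is consistent with the note that disabling the setting trades duplicated text for speed.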
-
within
private boolean within(float first, float second, float variance)
This will determine whether two floating point numbers are within a specified variance.
Parameters:
first - The first number to compare to.
second - The second number to compare to.
variance - The allowed variance.
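As a rough illustration of such a variance check, the sketch below treats two numbers as matching when the second lies strictly inside the interval around the first. The class name `VarianceCheck` is hypothetical, and the use of strict comparisons at both ends is an assumption, since the documentation above does not say how the boundaries are handled.

```java
final class VarianceCheck {
    private VarianceCheck() {
    }

    /**
     * Returns true when second lies strictly between first - variance and first + variance.
     * Strict bounds are an assumption made for this sketch.
     */
    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        System.out.println(within(10.0f, 10.4f, 0.5f)); // prints true
        System.out.println(within(10.0f, 10.6f, 0.5f)); // prints false
    }
}
```

A tolerance-based comparison like this is the usual way to decide that two glyph coordinates refer to the same spot, because exact equality on floats is too brittle for positions computed through matrix transforms.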
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Description copied from class: PDFStreamEngine
Called when a marked content group begins.
Overrides:
beginMarkedContentSequence in class PDFStreamEngine
Parameters:
tag - indicates the role or significance of the sequence
properties - optional properties
-
endMarkedContentSequence
public void endMarkedContentSequence()
Description copied from class: PDFStreamEngine
Called when a marked content group ends.
Overrides:
endMarkedContentSequence in class PDFStreamEngine
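The begin and end callbacks, together with the Deque and List fields listed in the field summary, suggest a stack-based way of tracking nested marked content groups. The sketch below shows that general technique in isolation. It is an assumption about the approach, not a copy of the PDFBox code, and the `MarkedContentNesting` and `Node` types are hypothetical stand-ins.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class MarkedContentNesting {
    /** Hypothetical stand-in for one marked content group; not a PDFBox type. */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>(); // top-level groups, in document order
    private final Deque<Node> open = new ArrayDeque<>(); // groups begun but not yet ended

    /** Called when a group begins: attach it to its parent (or the roots) and make it current. */
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    /** Called when a group ends: close the innermost open group, ignoring unbalanced ends. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> roots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentNesting nesting = new MarkedContentNesting();
        nesting.begin("P");
        nesting.begin("Span");
        nesting.end();
        nesting.end();
        nesting.begin("H1");
        nesting.end();
        System.out.println(nesting.roots().size()); // prints 2
    }
}
```

Guarding `end()` against an empty stack keeps the sketch tolerant of content streams with unbalanced end operators, which is a design choice of this example rather than documented behavior.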
-
xobject
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
Overrides:
processTextPosition in class LegacyPDFStreamEngine
Parameters:
text - The text to process.
-
getMarkedContents
-