Class ApproximateMarkdownServiceImpl
- java.lang.Object
-
- com.composum.ai.backend.slingbase.impl.ApproximateMarkdownServiceImpl
-
- All Implemented Interfaces:
ApproximateMarkdownService
public class ApproximateMarkdownServiceImpl extends Object implements ApproximateMarkdownService
Implementation for ApproximateMarkdownService.
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
static interface
ApproximateMarkdownServiceImpl.Config
Configuration class Config that allows us to configure TEXT_ATTRIBUTES.
-
Nested classes/interfaces inherited from interface com.composum.ai.backend.slingbase.ApproximateMarkdownService
ApproximateMarkdownService.Link
-
-
Field Summary
Fields
Modifier and Type Field Description
static Pattern
ADMISSIBLE_PATH_PATTERN
We allow generating markdown for subpaths of /content, /public and /preview.
static Map<String,String>
ATTRIBUTE_TO_MARKDOWN_PREFIX
protected GPTChatCompletionService
chatCompletionService
protected Set<String>
htmltags
protected static Pattern
IGNORED_NODE_NAMES
We ignore nodes named i18n or renditions and nodes starting with rep:, dam:, cq:.
protected static Pattern
IGNORED_VALUE_PATTERN
Ignored values for labelled output: "true" / "false" / single number (int / float) attributes or array of numbers attributes, or shorter than 3 digits or path, or array or type date or boolean or {Date} or {Boolean}, inherit, blank, html tags, target.
protected static Pattern
IMAGE_PATTERN
protected Pattern
labeledAttributePatternAllow
A pattern matching the attributes that are output with a label: the attribute name, a colon and a space, then the trimmed attribute value followed by a newline.
protected Pattern
labeledAttributePatternDeny
A pattern matching exceptions for labeledAttributePatternAllow.
protected List<String>
labelledAttributeOrder
A list of labelled attributes that come first if they are present, in the given order.
protected Pattern
PATTERN_HTML_TAG
protected List<ApproximateMarkdownServicePlugin>
plugins
protected List<String>
textAttributes
A list of attributes that are output (in that order) without any label, each on a line of its own.
static Pattern
THREE_WHITESPACE_PATTERN
If this pattern occurs in a string, the string has several words.
protected List<Pattern>
urlBlacklist
Blacklist for URLs we can connect to get the markdown.
protected List<Pattern>
urlWhitelist
Whitelist for URLs we can connect to get the markdown.
protected static Pattern
VIDEO_PATTERN
-
Fields inherited from interface com.composum.ai.backend.slingbase.ApproximateMarkdownService
HEADER_IMAGEPATH
-
-
Constructor Summary
Constructors
Constructor Description
ApproximateMarkdownServiceImpl()
-
Method Summary
All Methods Instance Methods Concrete Methods
Modifier and Type Method Description
protected void
activate(ApproximateMarkdownServiceImpl.Config config)
protected boolean
admissibleValue(Object object)
We do not print pure numbers, booleans and some special strings since those are likely attributes determining the component layout, not actual text that is printed.
void
approximateMarkdown(org.apache.sling.api.resource.Resource resource, PrintWriter realOutput, org.apache.sling.api.SlingHttpServletRequest request, org.apache.sling.api.SlingHttpServletResponse response)
Generates a text formatted with markdown that heuristically represents the text content of a page or resource, mainly for use with the AI.
String
approximateMarkdown(org.apache.sling.api.resource.Resource resource, org.apache.sling.api.SlingHttpServletRequest request, org.apache.sling.api.SlingHttpServletResponse response)
Generates a text formatted with markdown that heuristically represents the text content of a page or resource, mainly for use with the AI.
protected String
attributeToMarkdown(@NotNull org.apache.sling.api.resource.Resource resource, String attributename, String value)
protected void
captureHtmlTags(String value)
protected URI
checkUrlAdmissible(URI uri)
protected void
collectLinks(@NotNull org.apache.sling.api.resource.Resource resource, List<ApproximateMarkdownService.Link> resourceLinks)
Collects links from a resource and its children.
protected void
deactivate()
protected ApproximateMarkdownServicePlugin.PluginResult
executePlugins(org.apache.sling.api.resource.Resource resource, PrintWriter out, org.apache.sling.api.SlingHttpServletRequest request, org.apache.sling.api.SlingHttpServletResponse response)
@NotNull List<ApproximateMarkdownService.Link>
getComponentLinks(@NotNull org.apache.sling.api.resource.Resource resource)
Returns a number of links that are saved in the component or siblings of the component and that could be proposed to the user as sources for the AI via markdown generation etc.
String
getImageUrl(org.apache.sling.api.resource.Resource imageResource)
Retrieves the image URL in a way usable for ChatGPT - usually data:image/jpeg;base64,{base64_image}
String
getMarkdown(String value)
Returns a markdown representation of an attribute value, which might be plain text or HTML.
@NotNull String
getMarkdown(URI uri)
Retrieves the text content for a URL.
protected boolean
handleCodeblock(org.apache.sling.api.resource.Resource resource, PrintWriter out, boolean printEmptyLine)
protected boolean
handleLabeledAttributes(org.apache.sling.api.resource.Resource resource, PrintWriter out, boolean printEmptyLine)
protected boolean
handleResource(@NotNull org.apache.sling.api.resource.Resource resource, @NotNull PrintWriter out, boolean printEmptyLine)
protected void
logUnhandledAttributes(org.apache.sling.api.resource.Resource resource)
protected void
traverseTreeForStructureGathering(org.apache.sling.api.resource.Resource resource, PrintWriter out, String outerResourceType, String subpath)
This is debugging code we needed to gather information for the implementation; we keep it around for now.
-
-
-
Field Detail
-
IGNORED_VALUE_PATTERN
protected static final Pattern IGNORED_VALUE_PATTERN
Ignored values for labelled output: "true" / "false" / single number (int / float) attributes or array of numbers attributes, or shorter than 3 digits or path, or array or type date or boolean or {Date} or {Boolean}, inherit, blank, html tags, target.
-
IGNORED_NODE_NAMES
protected static final Pattern IGNORED_NODE_NAMES
We ignore nodes named i18n or renditions and nodes starting with rep:, dam:, cq:
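The rule above could be expressed with a regular expression along the following lines. This is only a sketch: the actual value of IGNORED_NODE_NAMES is not shown in this documentation, so the expression and the class name IgnoredNodeNamesSketch are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class IgnoredNodeNamesSketch {
    // Assumed expression, NOT the actual value of IGNORED_NODE_NAMES:
    // exact names i18n / renditions, or any name with a rep:, dam: or cq: prefix.
    static final Pattern IGNORED =
            Pattern.compile("^(i18n|renditions|(rep|dam|cq):.*)$");

    /** Returns true if a node with this name would be skipped. */
    static boolean isIgnored(String nodeName) {
        return IGNORED.matcher(nodeName).matches();
    }
}
```

With this sketch, isIgnored("i18n") and isIgnored("cq:responsive") return true, while isIgnored("text") returns false.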
-
IMAGE_PATTERN
protected static final Pattern IMAGE_PATTERN
-
VIDEO_PATTERN
protected static final Pattern VIDEO_PATTERN
-
ADMISSIBLE_PATH_PATTERN
public static final Pattern ADMISSIBLE_PATH_PATTERN
We allow generating markdown for subpaths of /content, /public and /preview.
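A path check of this kind might look like the following sketch. The regular expression is an assumption (the real value of ADMISSIBLE_PATH_PATTERN is not given here), as is the choice to accept only true subpaths and not the roots /content, /public and /preview themselves.

```java
import java.util.regex.Pattern;

public class AdmissiblePathSketch {
    // Assumed expression, NOT the actual ADMISSIBLE_PATH_PATTERN:
    // one of the three roots followed by a slash and at least one more character.
    static final Pattern ADMISSIBLE =
            Pattern.compile("^/(content|public|preview)/.+");

    /** Returns true if markdown generation would be allowed for this path. */
    static boolean isAdmissible(String path) {
        return path != null && ADMISSIBLE.matcher(path).matches();
    }
}
```

Here isAdmissible("/content/site/page") is true, while isAdmissible("/apps/foo") and isAdmissible("/contentious/x") are false.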
-
THREE_WHITESPACE_PATTERN
public static final Pattern THREE_WHITESPACE_PATTERN
If this pattern occurs in a string, the string has several words.
-
textAttributes
@Nonnull protected List<String> textAttributes
A list of attributes that are output (in that order) without any label, each on a line of its own.
-
labelledAttributeOrder
protected List<String> labelledAttributeOrder
A list of labelled attributes that come first if they are present, in the given order.
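The ordering described above can be sketched as follows. The helper name order and the decision to keep the remaining attributes in their original order are assumptions; the documentation only states that the listed attributes come first.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class LabelledOrderSketch {
    /**
     * Hypothetical helper: puts the attributes named in 'preferred' first
     * (in the order of 'preferred', and only if present), followed by all
     * other present attributes in their original order (an assumption).
     */
    static List<String> order(Collection<String> present, List<String> preferred) {
        List<String> result = new ArrayList<>();
        for (String name : preferred) {
            if (present.contains(name)) {
                result.add(name);
            }
        }
        for (String name : present) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

For present attributes [text, title, alt] and the preferred order [title, subtitle], the sketch yields [title, text, alt].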
-
labeledAttributePatternAllow
@Nullable protected Pattern labeledAttributePatternAllow
A pattern matching the attributes that are output with a label: the attribute name, a colon and a space, then the trimmed attribute value followed by a newline.
-
labeledAttributePatternDeny
@Nullable protected Pattern labeledAttributePatternDeny
A pattern matching exceptions for labeledAttributePatternAllow.
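How the allow and deny patterns might interact is sketched below. The method render is hypothetical, and both the use of matches() on the whole attribute name and the treatment of a null allow pattern as "nothing is labelled" are assumptions; only the output format (name, colon, space, trimmed value, newline) comes from the description above.

```java
import java.util.regex.Pattern;

public class LabeledAttributeSketch {
    /**
     * Hypothetical helper: returns "name: value\n" if the attribute name is
     * matched by 'allow' and not matched by 'deny', otherwise an empty string.
     * Both patterns may be null, mirroring the @Nullable fields.
     */
    static String render(String name, String value, Pattern allow, Pattern deny) {
        if (allow == null || !allow.matcher(name).matches()) {
            return ""; // assumption: without an allow pattern nothing is labelled
        }
        if (deny != null && deny.matcher(name).matches()) {
            return ""; // the deny pattern lists exceptions to the allow pattern
        }
        return name + ": " + value.trim() + "\n";
    }
}
```

For example, with the allow pattern .*title.* and the deny pattern cq:.*, the attribute jcr:title with value " Hello " is rendered as "jcr:title: Hello" plus a newline, while cq:title is skipped.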
-
urlBlacklist
protected List<Pattern> urlBlacklist
Blacklist for URLs we can connect to get the markdown. The URL must not match one of the patterns.
-
urlWhitelist
protected List<Pattern> urlWhitelist
Whitelist for URLs we can connect to get the markdown. Required - the URL has to match one of the patterns.
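Taken together, the two lists amount to a single admissibility check: a URL has to match at least one whitelist pattern and must not match any blacklist pattern. The sketch below illustrates that combination. It is not the code of checkUrlAdmissible; the boolean return type and matching against the full URL string are assumptions.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;

public class UrlAdmissibleSketch {
    /**
     * Hypothetical check: true if the URL matches at least one whitelist
     * pattern and no blacklist pattern. An empty whitelist admits nothing.
     */
    static boolean isAdmissible(URI uri, List<Pattern> whitelist, List<Pattern> blacklist) {
        String url = uri.toString();
        boolean allowed = whitelist.stream().anyMatch(p -> p.matcher(url).matches());
        boolean denied = blacklist.stream().anyMatch(p -> p.matcher(url).matches());
        return allowed && !denied;
    }
}
```

With the whitelist pattern https://example\.com/.* and the blacklist pattern .*/private/.*, the URL https://example.com/docs is admissible, while https://example.com/private/x and https://other.org/ are not.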
-
chatCompletionService
protected GPTChatCompletionService chatCompletionService
-
plugins
@Nonnull protected volatile List<ApproximateMarkdownServicePlugin> plugins
-
PATTERN_HTML_TAG
protected Pattern PATTERN_HTML_TAG
-
-
Method Detail
-
logUnhandledAttributes
protected void logUnhandledAttributes(org.apache.sling.api.resource.Resource resource)
-
approximateMarkdown
@Nonnull public String approximateMarkdown(@Nullable org.apache.sling.api.resource.Resource resource, org.apache.sling.api.SlingHttpServletRequest request, org.apache.sling.api.SlingHttpServletResponse response)
Description copied from interface: ApproximateMarkdownService
Generates a text formatted with markdown that heuristically represents the text content of a page or resource, mainly for use with the AI. This is only a heuristic: it cannot faithfully represent the page, but will probably be enough to generate summaries, keywords and so forth.
- Specified by:
approximateMarkdown in interface ApproximateMarkdownService
- Parameters:
resource - the resource to render to markdown. Caution: if this is not the content resource of a page but the cpp:Page, the markdown will contain all subpages as well!
- Returns:
the markdown representation
-
approximateMarkdown
public void approximateMarkdown(@Nullable org.apache.sling.api.resource.Resource resource, @Nonnull PrintWriter realOutput, @Nonnull org.apache.sling.api.SlingHttpServletRequest request, @Nonnull org.apache.sling.api.SlingHttpServletResponse response)
Description copied from interface: ApproximateMarkdownService
Generates a text formatted with markdown that heuristically represents the text content of a page or resource, mainly for use with the AI. This is only a heuristic: it cannot faithfully represent the page, but will probably be enough to generate summaries, keywords and so forth.
- Specified by:
approximateMarkdown in interface ApproximateMarkdownService
- Parameters:
resource - the resource to render to markdown. Caution: if this is not the content resource of a page but the cpp:Page, the markdown will contain all subpages as well!
realOutput - destination where the markdown rendering will be written.
-
handleResource
protected boolean handleResource(@NotNull org.apache.sling.api.resource.Resource resource, @NotNull PrintWriter out, boolean printEmptyLine)
-
attributeToMarkdown
protected String attributeToMarkdown(@NotNull org.apache.sling.api.resource.Resource resource, String attributename, String value)
-
executePlugins
@Nonnull protected ApproximateMarkdownServicePlugin.PluginResult executePlugins(@Nonnull org.apache.sling.api.resource.Resource resource, @Nonnull PrintWriter out, @Nonnull org.apache.sling.api.SlingHttpServletRequest request, @Nonnull org.apache.sling.api.SlingHttpServletResponse response)
-
getMarkdown
@Nonnull public String getMarkdown(@Nullable String value)
Description copied from interface: ApproximateMarkdownService
Returns a markdown representation of an attribute value, which might be plain text or HTML. We determine whether it's HTML heuristically - in that case it's transformed to markdown, otherwise we just return the value.
- Specified by:
getMarkdown in interface ApproximateMarkdownService
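The heuristic HTML detection could, for instance, look for something shaped like a tag. The sketch below shows only that detection step and leaves out the HTML-to-markdown conversion. The regular expression and the method name looksLikeHtml are assumptions, not the actual implementation (which may rely on PATTERN_HTML_TAG in a different way).

```java
import java.util.regex.Pattern;

public class HtmlHeuristicSketch {
    // Assumed expression: an opening or closing tag such as <p>, </div> or <a href="...">.
    static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    /** Returns true if the value contains something that looks like an HTML tag. */
    static boolean looksLikeHtml(String value) {
        return value != null && TAG.matcher(value).find();
    }
}
```

Under this sketch, looksLikeHtml("<p>Hello</p>") is true and looksLikeHtml("a < b and c > d") is false, since a space follows the "<". Being a heuristic, it can misjudge plain text that happens to contain tag-like fragments.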
-
getMarkdown
@NotNull public String getMarkdown(@Nonnull URI uri) throws MalformedURLException, IOException, IllegalArgumentException
Description copied from interface: ApproximateMarkdownService
Retrieves the text content for a URL.
- Specified by:
getMarkdown in interface ApproximateMarkdownService
- Throws:
MalformedURLException
IOException
IllegalArgumentException
-
handleCodeblock
protected boolean handleCodeblock(org.apache.sling.api.resource.Resource resource, PrintWriter out, boolean printEmptyLine)
-
handleLabeledAttributes
protected boolean handleLabeledAttributes(org.apache.sling.api.resource.Resource resource, PrintWriter out, boolean printEmptyLine)
-
admissibleValue
protected boolean admissibleValue(Object object)
We do not print pure numbers, booleans and some special strings since those are likely attributes determining the component layout, not actual text that is printed. This covers all "true" / "false" / single number (int / float) attributes or array of numbers attributes, or shorter than 3 digits or path, or array or type date or boolean or {Date} or {Boolean}, inherit, blank, html tags, target.
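A reduced version of such a filter is sketched below. It covers only some of the listed cases (null, booleans, numbers, blank strings, and the strings "true", "false" and "inherit") and is an assumption about how these rules could be coded, not the actual admissibleValue implementation.

```java
import java.util.regex.Pattern;

public class AdmissibleValueSketch {
    // Integer or decimal number, optionally negative (assumed format).
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");

    /** Returns true if the value looks like printable text and not a layout setting. */
    static boolean admissible(Object value) {
        if (value == null || value instanceof Boolean || value instanceof Number) {
            return false;
        }
        String s = value.toString().trim();
        if (s.isEmpty()) {
            return false; // blank values carry no text
        }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false") || s.equals("inherit")) {
            return false; // typical configuration values
        }
        return !NUMBER.matcher(s).matches();
    }
}
```

With this sketch, admissible("Welcome to our site") is true, while admissible("true"), admissible("12.5"), admissible(42) and admissible("  ") are all false.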
-
activate
protected void activate(ApproximateMarkdownServiceImpl.Config config)
-
deactivate
protected void deactivate()
-
getComponentLinks
@NotNull public List<ApproximateMarkdownService.Link> getComponentLinks(@NotNull org.apache.sling.api.resource.Resource resource)
Returns a number of links that are saved in the component or siblings of the component and that could be proposed to the user as sources for the AI via markdown generation etc. This heuristically collects a number of links that might be interesting. We traverse the attributes of resource and all children and collect everything that starts with /content. If there are fewer than 5 links, we continue with the parent resource until jcr:content is reached. The link title will be the jcr:title or title attribute.
- Specified by:
getComponentLinks in interface ApproximateMarkdownService
- Parameters:
resource - the resource to check
- Returns:
a list of links, or an empty list if there are none.
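The traversal described above can be sketched on a simplified tree. The class Node is a hypothetical stand-in for a Sling Resource, link titles are left out, and two details are assumptions: duplicates are dropped, and the jcr:content node itself is still searched before the walk stops.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ComponentLinksSketch {
    /** Hypothetical stand-in for a Sling Resource: name, parent, attributes, children. */
    static class Node {
        final String name;
        final Node parent;
        final Map<String, String> props = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }
    }

    /** Collects all attribute values starting with /content from node and its descendants. */
    static void collect(Node node, Set<String> links) {
        for (String value : node.props.values()) {
            if (value.startsWith("/content")) {
                links.add(value);
            }
        }
        for (Node child : node.children) {
            collect(child, links);
        }
    }

    /** Widens the search to the parent while fewer than 5 links are found, up to jcr:content. */
    static List<String> componentLinks(Node start) {
        Set<String> links = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        Node current = start;
        while (current != null) {
            collect(current, links);
            if (links.size() >= 5 || "jcr:content".equals(current.name)) {
                break;
            }
            current = current.parent;
        }
        return new ArrayList<>(links);
    }
}
```

Starting from a teaser component that links to /content/a and has a sibling linking to /content/b, the sketch returns both links, because the first pass finds fewer than five and the search widens to the parent.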
-
collectLinks
protected void collectLinks(@NotNull org.apache.sling.api.resource.Resource resource, List<ApproximateMarkdownService.Link> resourceLinks)
Collects links from a resource and its children. The link title will be the jcr:title or title attribute.
- Parameters:
resource - the resource to collect links from
resourceLinks - the list to store the collected links
-
getImageUrl
public String getImageUrl(org.apache.sling.api.resource.Resource imageResource)
Description copied from interface: ApproximateMarkdownService
Retrieves the image URL in a way usable for ChatGPT - usually data:image/jpeg;base64,{base64_image}
- Specified by:
getImageUrl in interface ApproximateMarkdownService
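A data URL of the form mentioned above can be built with the JDK's Base64 encoder. The sketch below shows only this encoding step. How the image bytes and the MIME type are read from the resource is not covered, and the helper name toDataUrl is an assumption.

```java
import java.util.Base64;

public class ImageDataUrlSketch {
    /** Hypothetical helper: builds data:<mimeType>;base64,<encoded bytes>. */
    static String toDataUrl(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }
}
```

For the three ASCII bytes of "abc" and the MIME type image/jpeg, this yields data:image/jpeg;base64,YWJj.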
-
traverseTreeForStructureGathering
protected void traverseTreeForStructureGathering(org.apache.sling.api.resource.Resource resource, PrintWriter out, String outerResourceType, String subpath)
This is debugging code we needed to gather information for the implementation; we keep it around for now.
out.println("Approximated markdown for " + path);
traverseTreeForStructureGathering(resource, out, null, null);
out.println("DONE");
out.println("HTML tags found:" + htmltags);
-
captureHtmlTags
protected void captureHtmlTags(String value)
-
-