Class ApproximateMarkdownServiceImpl

    • Field Detail

      • ATTRIBUTE_TO_MARKDOWN_PREFIX

        public static final Map<String,​String> ATTRIBUTE_TO_MARKDOWN_PREFIX
      • IGNORED_VALUE_PATTERN

        protected static final Pattern IGNORED_VALUE_PATTERN
        Values ignored for labelled output: "true" / "false"; attributes that are a single number (int / float) or an array of numbers; values shorter than 3 digits; paths; arrays; values of type date or boolean; {Date} or {Boolean}; inherit; blank values; HTML tags; target.
      • IGNORED_NODE_NAMES

        protected static final Pattern IGNORED_NODE_NAMES
        We ignore nodes named i18n or renditions and nodes whose names start with rep:, dam: or cq:.
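        The rule above can be written as a single regular expression. This is a minimal sketch: the actual value of IGNORED_NODE_NAMES is not shown in this documentation, so the regex and the names IgnoredNodeNamesSketch and isIgnored are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class IgnoredNodeNamesSketch {
    // Hypothetical regex, not the actual constant: matches the exact names
    // i18n and renditions, and any name with the prefix rep:, dam: or cq:.
    static final Pattern IGNORED_NODE_NAMES_SKETCH =
            Pattern.compile("i18n|renditions|(rep|dam|cq):.*");

    /** Returns true if a node with this name would be skipped. */
    static boolean isIgnored(String nodeName) {
        return IGNORED_NODE_NAMES_SKETCH.matcher(nodeName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIgnored("renditions"));
        System.out.println(isIgnored("cq:responsive"));
        System.out.println(isIgnored("teaser"));
    }
}
```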
      • IMAGE_PATTERN

        protected static final Pattern IMAGE_PATTERN
      • VIDEO_PATTERN

        protected static final Pattern VIDEO_PATTERN
      • ADMISSIBLE_PATH_PATTERN

        public static final Pattern ADMISSIBLE_PATH_PATTERN
        We allow generating markdown for subpaths of /content, /public and /preview.
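        A path check of this kind might look as follows. The regex is an assumption, since the real value of ADMISSIBLE_PATH_PATTERN is not given here; in particular, this sketch treats the roots themselves (for example /content without a subpath) as not admissible. The names AdmissiblePathSketch and isAdmissiblePath are hypothetical.

```java
import java.util.regex.Pattern;

public class AdmissiblePathSketch {
    // Hypothetical regex, not the actual constant: one of the three roots
    // followed by a slash and at least one more character.
    static final Pattern ADMISSIBLE_PATH_SKETCH =
            Pattern.compile("/(content|public|preview)/.+");

    /** Returns true if markdown generation would be allowed for this path. */
    static boolean isAdmissiblePath(String path) {
        return path != null && ADMISSIBLE_PATH_SKETCH.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAdmissiblePath("/content/site/page"));
        System.out.println(isAdmissiblePath("/apps/site/components"));
    }
}
```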
      • THREE_WHITESPACE_PATTERN

        public static final Pattern THREE_WHITESPACE_PATTERN
        If this pattern occurs in a string, the string consists of several words.
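        One plausible reading of the field name is a pattern that needs three separate runs of whitespace to match. The sketch below works under that assumption; the regex and the names ThreeWhitespaceSketch and hasSeveralWords are hypothetical and not taken from the implementation.

```java
import java.util.regex.Pattern;

public class ThreeWhitespaceSketch {
    // Hypothetical regex: three whitespace runs separated by non-whitespace,
    // which can only be found in a string of at least four words.
    static final Pattern THREE_WHITESPACE_SKETCH =
            Pattern.compile("\\s+\\S+\\s+\\S+\\s+");

    /** Returns true if the pattern occurs somewhere in the text. */
    static boolean hasSeveralWords(String text) {
        return THREE_WHITESPACE_SKETCH.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSeveralWords("a short teaser text"));
        System.out.println(hasSeveralWords("left-aligned"));
    }
}
```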
      • textAttributes

        @Nonnull
        protected List<String> textAttributes
        A list of attributes that are output (in that order) without any label, each on its own line.
      • labelledAttributeOrder

        protected List<String> labelledAttributeOrder
        A list of labelled attributes that come first if they are present, in the given order.
      • labeledAttributePatternAllow

        @Nullable
        protected Pattern labeledAttributePatternAllow
        A pattern that determines which attributes are output with a label: the attribute name, a colon and a space, then the trimmed attribute value followed by a newline.
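        The labelled output format described above can be sketched as follows. The helper name labelledLine is hypothetical, and returning an empty string when the pattern is null or does not match the attribute name is an assumption; the documentation does not say what happens in those cases.

```java
import java.util.regex.Pattern;

public class LabelledAttributeSketch {
    /**
     * Formats one labelled attribute as "name: trimmed value" plus a newline.
     * Assumption: a null pattern or a non-matching name produces no output.
     */
    static String labelledLine(Pattern allowPattern, String name, String value) {
        if (allowPattern == null || !allowPattern.matcher(name).matches()) {
            return "";
        }
        return name + ": " + value.trim() + "\n";
    }

    public static void main(String[] args) {
        Pattern allow = Pattern.compile(".*[tT]itle"); // hypothetical allow pattern
        System.out.print(labelledLine(allow, "subtitle", "  Welcome to our site  "));
    }
}
```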
      • urlBlacklist

        protected List<Pattern> urlBlacklist
        Blacklist for URLs we can connect to in order to get the markdown. The URL must not match any of the patterns.
      • urlWhitelist

        protected List<Pattern> urlWhitelist
        Whitelist for URLs we can connect to in order to get the markdown. Required: the URL has to match at least one of the patterns.
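        Taken together, the two URL lists describe a check in which a URL has to match at least one whitelist pattern and must not match any blacklist pattern. The following is a sketch of that combination only. It is not the body of checkUrlAdmissible, whose behaviour is not documented here; the names UrlFilterSketch and isAdmissible are hypothetical, and matching the whole URL with matches() is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    /** True if the URL matches some whitelist pattern and no blacklist pattern. */
    static boolean isAdmissible(String url, List<Pattern> whitelist, List<Pattern> blacklist) {
        boolean allowed = whitelist.stream().anyMatch(p -> p.matcher(url).matches());
        boolean denied = blacklist.stream().anyMatch(p -> p.matcher(url).matches());
        return allowed && !denied;
    }

    public static void main(String[] args) {
        // Hypothetical patterns for illustration only.
        List<Pattern> whitelist = Arrays.asList(Pattern.compile("https://.*\\.example\\.com/.*"));
        List<Pattern> blacklist = Arrays.asList(Pattern.compile(".*/internal/.*"));
        System.out.println(isAdmissible("https://www.example.com/docs/a.html", whitelist, blacklist));
        System.out.println(isAdmissible("https://www.example.com/internal/a.html", whitelist, blacklist));
    }
}
```

        With an empty whitelist this sketch rejects every URL, which is one way to read "Required".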
      • chatCompletionService

        protected com.composum.ai.backend.base.service.chat.GPTChatCompletionService chatCompletionService
      • PATTERN_HTML_TAG

        protected Pattern PATTERN_HTML_TAG
      • htmltags

        protected final Set<String> htmltags
    • Constructor Detail

      • ApproximateMarkdownServiceImpl

        public ApproximateMarkdownServiceImpl()
    • Method Detail

      • logUnhandledAttributes

        protected void logUnhandledAttributes​(org.apache.sling.api.resource.Resource resource)
      • approximateMarkdown

        @Nonnull
        public String approximateMarkdown​(@Nullable
                                          org.apache.sling.api.resource.Resource resource,
                                          org.apache.sling.api.SlingHttpServletRequest request,
                                          org.apache.sling.api.SlingHttpServletResponse response)
        Description copied from interface: ApproximateMarkdownService
        Generates markdown-formatted text that heuristically represents the text content of a page or resource, mainly for use with the AI. The result is only an approximation: it cannot faithfully represent the page, but will probably be enough to generate summaries, keywords and so forth.
        Specified by:
        approximateMarkdown in interface ApproximateMarkdownService
        Parameters:
        resource - the resource to render to markdown. Caution: if this is not the content resource of a page but the cpp:Page, the markdown will contain all subpages as well!
        Returns:
        the markdown representation
      • approximateMarkdown

        public void approximateMarkdown​(@Nullable
                                        org.apache.sling.api.resource.Resource resource,
                                        @Nonnull
                                        PrintWriter realOutput,
                                        @Nonnull
                                        org.apache.sling.api.SlingHttpServletRequest request,
                                        @Nonnull
                                        org.apache.sling.api.SlingHttpServletResponse response)
        Description copied from interface: ApproximateMarkdownService
        Generates markdown-formatted text that heuristically represents the text content of a page or resource, mainly for use with the AI. The result is only an approximation: it cannot faithfully represent the page, but will probably be enough to generate summaries, keywords and so forth.
        Specified by:
        approximateMarkdown in interface ApproximateMarkdownService
        Parameters:
        resource - the resource to render to markdown. Caution: if this is not the content resource of a page but the cpp:Page, the markdown will contain all subpages as well!
        realOutput - destination where the markdown rendering will be written.
      • handleResource

        protected boolean handleResource​(@NotNull
                                         @NotNull org.apache.sling.api.resource.Resource resource,
                                         @NotNull
                                         @NotNull PrintWriter out,
                                         boolean printEmptyLine)
      • attributeToMarkdown

        protected String attributeToMarkdown​(@NotNull
                                             @NotNull org.apache.sling.api.resource.Resource resource,
                                             String attributename,
                                             String value)
      • checkUrlAdmissible

        protected URI checkUrlAdmissible​(URI uri)
      • handleCodeblock

        protected boolean handleCodeblock​(org.apache.sling.api.resource.Resource resource,
                                          PrintWriter out,
                                          boolean printEmptyLine)
      • handleLabeledAttributes

        protected boolean handleLabeledAttributes​(org.apache.sling.api.resource.Resource resource,
                                                  PrintWriter out,
                                                  boolean printEmptyLine)
      • admissibleValue

        protected boolean admissibleValue​(Object object)
        We do not print pure numbers, booleans and some special strings, since those are likely attributes that determine the component layout rather than text that is actually printed. This covers: "true" / "false"; attributes that are a single number (int / float) or an array of numbers; values shorter than 3 digits; paths; arrays; values of type date or boolean; {Date} or {Boolean}; inherit; blank values; HTML tags; target.
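        A filter along these lines might look as follows. This sketch covers only part of the list above (booleans, numbers, blank strings, "true" / "false", inherit, numeric strings and paths); the remaining rules are left out, and every non-string value is rejected for simplicity. The names AdmissibleValueSketch and admissibleValueSketch are hypothetical, and none of this is the actual method body.

```java
public class AdmissibleValueSketch {
    /** Returns true if the value looks like printable text rather than a layout setting. */
    static boolean admissibleValueSketch(Object value) {
        if (value == null || value instanceof Boolean || value instanceof Number) {
            return false;
        }
        if (!(value instanceof String)) {
            return false; // simplification: arrays, dates and other types are all rejected
        }
        String s = ((String) value).trim();
        if (s.isEmpty()) {
            return false; // blank
        }
        if (s.equals("true") || s.equals("false") || s.equals("inherit")) {
            return false; // special strings
        }
        if (s.matches("-?\\d+(\\.\\d+)?")) {
            return false; // a single int or float written as a string
        }
        if (s.startsWith("/")) {
            return false; // assumption: a leading slash marks a path
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(admissibleValueSketch("Welcome to our site"));
        System.out.println(admissibleValueSketch("3.14"));
    }
}
```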
      • deactivate

        protected void deactivate()
      • getComponentLinks

        @NotNull
        public @NotNull List<ApproximateMarkdownService.Link> getComponentLinks​(@NotNull
                                                                                @NotNull org.apache.sling.api.resource.Resource resource)
        Returns a number of links that are stored in the component or in siblings of the component and that can be proposed to the user as sources for the AI, for example via markdown generation. The collection is heuristic: we traverse the attributes of the resource and all of its children and collect every value that starts with /content. If there are fewer than 5 links, we continue with the parent resource until jcr:content is reached. The link title is taken from the jcr:title or title attribute.
        Specified by:
        getComponentLinks in interface ApproximateMarkdownService
        Parameters:
        resource - the resource to check
        Returns:
        a list of links, or an empty list if there are none.
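        The traversal described for getComponentLinks can be illustrated on a simplified tree. This is a sketch under several assumptions: the Node class is a hypothetical stand-in for a Sling Resource, only link paths are collected (titles are omitted), duplicates are dropped, and the jcr:content node itself is still searched before the walk stops.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComponentLinksSketch {
    /** Hypothetical stand-in for a Sling Resource: name, string attributes, children, parent. */
    static class Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
        Node parent;

        Node(String name) { this.name = name; }

        Node attr(String key, String value) { attributes.put(key, value); return this; }

        Node child(Node c) { c.parent = this; children.add(c); return this; }
    }

    /** Collects attribute values starting with /content from the node and all descendants. */
    static void collectLinks(Node node, List<String> links) {
        for (String value : node.attributes.values()) {
            if (value.startsWith("/content") && !links.contains(value)) {
                links.add(value);
            }
        }
        for (Node c : node.children) {
            collectLinks(c, links);
        }
    }

    /** Widens the search to the parent while fewer than 5 links were found, up to jcr:content. */
    static List<String> componentLinks(Node start) {
        List<String> links = new ArrayList<>();
        Node current = start;
        while (current != null) {
            collectLinks(current, links);
            if (links.size() >= 5 || "jcr:content".equals(current.name)) {
                break;
            }
            current = current.parent;
        }
        return links;
    }

    public static void main(String[] args) {
        Node teaser = new Node("teaser").attr("target", "/content/site/news").attr("text", "Hello");
        new Node("jcr:content").attr("link", "/content/site/home").child(teaser);
        System.out.println(componentLinks(teaser));
    }
}
```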
      • collectLinks

        protected void collectLinks​(@NotNull
                                    @NotNull org.apache.sling.api.resource.Resource resource,
                                    List<ApproximateMarkdownService.Link> resourceLinks)
        Collects links from a resource and its children. The link title will be the jcr:title or title attribute.
        Parameters:
        resource - the resource to collect links from
        resourceLinks - the list to store the collected links
      • traverseTreeForStructureGathering

        protected void traverseTreeForStructureGathering​(org.apache.sling.api.resource.Resource resource,
                                                         PrintWriter out,
                                                         String outerResourceType,
                                                         String subpath)
        This is debugging code we needed to gather information for the implementation; we keep it around for now.
          out.println("Approximated markdown for " + path);
          traverseTreeForStructureGathering(resource, out, null, null);
          out.println("DONE");
          out.println("HTML tags found:" + htmltags);
      • captureHtmlTags

        protected void captureHtmlTags​(String value)