FiveTech Support Forums

Extract Text
Posted: Tue Mar 15, 2016 07:45 PM

I am looking for a way to extract text from a text file and was wondering if something already exists to do what I need.

I was looking for something like:

cText := TextExtract( cItem1, cItem2)

cItem1 and cItem2 would be the text that is around the text I am after.

Text Example from the file:
<system_name>MYPC101</system_name>

So I would like to do something like this:
cText := TextExtract( "<system_name>", "</system_name>" )

And it would return cText as "MYPC101"

Any ideas?

Thanks,

Jeff Barnes



(FWH 16.11, xHarbour 1.2.3, Bcc730)
Re: Extract Text
Posted: Tue Mar 15, 2016 08:04 PM
Cristobal Navarro

There are two kinds of people: those who make you waste your time and those who make you lose track of time

The secret of happiness is not in doing what you like, but in liking what you do
Re: Extract Text
Posted: Tue Mar 15, 2016 10:49 PM
Jeff,

Here is a function I wrote a long time ago to do exactly this. It returns what it finds in an array of strings, not a single string. I suppose it could be optimized somewhat, but apart from pre-allocating the array to avoid a bunch of AAdd() calls, I've never needed to improve on it. If I were going to optimize anything, the first step would be to keep track of the offset into the string instead of trimming the front off the input strings after each match.

As is, I often return arrays of 15,000 to 20,000 strings at a time when parsing various logs and XML files. Some of the logs are quite large, 50+ MB, and large logs mean large memory allocations. Still, I typically just read the entire file into cInputString and process it all in one pass. I do have a version that finds the first instance of matching tags and returns that single instance as a string, but I hardly ever use that version.

As written, it creates local upper-case copies of the input string and the tags and does an upper-case match, so matching is case-insensitive, but it returns what it finds in the original case.

Code (fw):
#if ! defined( DEFAULT_MAX_RECORDS )
#define DEFAULT_MAX_RECORDS   20000
#endif

FUNCTION BETWEENTAGSARRAY( cStartTag, cEndTag, cInputString, lIncludeTags )

   LOCAL nStartPoint, nEndPoint
   LOCAL nRecords := 00, nFetchLength := 00, aFoundText := Array( DEFAULT_MAX_RECORDS )
   LOCAL cMDML
   LOCAL cInputStringUpper := Upper( cInputString )
   LOCAL cStartTagUpper    := Upper( cStartTag    )
   LOCAL cEndTagUpper      := Upper( cEndTag      )
   
   hb_Default( @lIncludeTags, .F. )
   
   DO WHILE .T.

      // Find the starting point of the starting tag.
      nStartPoint := At( cStartTagUpper, cInputStringUpper )
      IF nStartPoint > 00

         // Adjust starting point to end of starting tag
         nStartPoint += Len( cStartTagUpper )

         // If the first tag is found strip off string up to and including the starting tag itself
         cInputStringUpper := SubStr( cInputStringUpper, nStartPoint )
         cInputString      := SubStr( cInputString,      nStartPoint )

         // Find the starting point of the second tag, beginning from end of first tag.
         nEndPoint := At( cEndTagUpper, cInputStringUpper )
         IF nEndPoint > 00

            // If the second tag is found, the text to fetch runs from the start
            // of the (already trimmed) string up to just before the second tag.
            nFetchLength := nEndPoint - 1

            IF lIncludeTags
               cMDML := cStartTag + LTrim( SubStr( cInputString, 01, nFetchLength ) ) + cEndTag
            ELSE
               cMDML := LTrim( SubStr( cInputString, 01, nFetchLength ) )
            ENDIF

            IF ++nRecords <= DEFAULT_MAX_RECORDS
               aFoundText[ nRecords ] := cMDML
            ELSE
               // IF we get here it is gonna be oh so slow.
               AAdd( aFoundText, cMDML )
            ENDIF

            // clip off the front of the string then loop to find the next
            cInputStringUpper := SubStr( cInputStringUpper, nFetchLength + 01 )
            cInputString      := SubStr( cInputString,      nFetchLength + 01 )

         ELSE
            EXIT
         ENDIF
      ELSE
         EXIT
      ENDIF
   ENDDO
   IF nRecords < DEFAULT_MAX_RECORDS
      aFoundText := ASize( aFoundText, nRecords )
   ENDIF

   RETURN ( aFoundText )
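
The offset-tracking optimization mentioned above the code could look roughly like the sketch below. It is not the author's version: BetweenTagsOffset() is a hypothetical name, it assumes Harbour's hb_At() (which accepts a start position as its third argument), and for brevity it grows the array with AAdd() instead of pre-allocating it. Like the original, it matches case-insensitively, LTrim()s each hit, and resumes searching at the closing tag.

```harbour
// Sketch only: BetweenTagsOffset() is a hypothetical name, not from the thread.
// Assumes Harbour's hb_At( cSearch, cString, nFrom ).

PROCEDURE Main()

   LOCAL cLog  := "x <system_name>MYPC101</system_name> y <SYSTEM_NAME> MYPC102</system_name>"
   LOCAL aHits := BetweenTagsOffset( "<system_name>", "</system_name>", cLog )

   ? Len( aHits )   // 2
   ? aHits[ 1 ]     // MYPC101
   ? aHits[ 2 ]     // MYPC102 (leading blank removed by LTrim)

   RETURN

FUNCTION BetweenTagsOffset( cStartTag, cEndTag, cInputString, lIncludeTags )

   LOCAL aFoundText := {}
   LOCAL cUpper     := Upper( cInputString )   // one upper-case copy, never trimmed
   LOCAL cStartUp   := Upper( cStartTag )
   LOCAL cEndUp     := Upper( cEndTag )
   LOCAL nPos       := 1                       // where the next search begins
   LOCAL nStart, nEnd, cFound

   hb_Default( @lIncludeTags, .F. )

   DO WHILE ( nStart := hb_At( cStartUp, cUpper, nPos ) ) > 0

      // Move past the starting tag to the first character of the payload.
      nStart += Len( cStartUp )

      // Look for the closing tag from that point onward.
      nEnd := hb_At( cEndUp, cUpper, nStart )
      IF nEnd == 0
         EXIT
      ENDIF

      // Positions are identical in both copies, so cut from the original-case string.
      cFound := LTrim( SubStr( cInputString, nStart, nEnd - nStart ) )
      AAdd( aFoundText, iif( lIncludeTags, cStartTag + cFound + cEndTag, cFound ) )

      // Resume at the closing tag, as the original does after clipping.
      nPos := nEnd
   ENDDO

   RETURN aFoundText
```

Because neither string is copied inside the loop, the per-match cost no longer depends on how much of a 50+ MB input remains, which is the point of tracking the offset.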


Robb
Re: Extract Text
Posted: Tue Mar 15, 2016 11:43 PM

Jeff,

See: FWH\samples\xmlreader.prg

This is a sample XML document reader.

James

FWH 18.05/xHarbour 1.2.3/BCC7/Windows 10
Re: Extract Text
Posted: Wed Mar 16, 2016 02:36 AM

A lot of what I pull out of logs is XML, but it typically gets written as a line in the log, not as clean XML. That is actually why I wrote this function, and also why it has a switch to leave the tags in place as part of the returned strings or not. With XML I typically want the tags, but with other raw logs I do not. I do use the TXMLDocument class, which is used in samples\xmlreader.prg, to parse the XML after it is extracted.
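
The two-step workflow described here (cut the XML fragment out of a log line with its tags intact, then hand it to TXmlDocument) might be sketched as follows. Everything specific in it is an assumption: CutWithTags() is a hypothetical helper standing in for a call with lIncludeTags set to .T., the log line is invented, and the sketch relies on TXmlDocument():New() accepting an XML string and on the FindFirst() method, as the Harbour/xHarbour TXml classes are understood to provide. Check both against your own build.

```harbour
// Sketch only: CutWithTags() and the sample log line are hypothetical.
// Assumes TXmlDocument():New() accepts a string and that FindFirst() exists.

PROCEDURE Main()

   LOCAL cLine := "2016-03-15 19:45:01 INFO <system><system_name>MYPC101</system_name></system> done"
   LOCAL cXml  := CutWithTags( "<system>", "</system>", cLine )
   LOCAL oDoc, oNode

   IF Empty( cXml )
      ? "No XML fragment found"
      RETURN
   ENDIF

   // Parse the extracted fragment from memory instead of from a file handle.
   oDoc  := TXmlDocument():New( cXml )
   oNode := oDoc:FindFirst( "system_name" )

   IF oNode != NIL
      ? oNode:cName, oNode:cData
   ENDIF

   RETURN

// Returns the first cStartTag...cEndTag span, tags included, or "" if not found.
STATIC FUNCTION CutWithTags( cStartTag, cEndTag, cText )

   LOCAL nStart := At( cStartTag, cText )
   LOCAL nEnd

   IF nStart == 0
      RETURN ""
   ENDIF

   // nEnd is relative to the substring that begins at nStart.
   nEnd := At( cEndTag, SubStr( cText, nStart ) )
   IF nEnd == 0
      RETURN ""
   ENDIF

   RETURN SubStr( cText, nStart, nEnd - 1 + Len( cEndTag ) )
```

Keeping the tags matters in this case because the parser needs a well-formed element, not just its text content.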

Re: Extract Text
Posted: Wed Mar 16, 2016 05:02 PM

Thanks Robb. With some slight fine-tuning (less than 5 minutes) it does exactly what I need :)

James, I couldn't look at the sample xmlreader.prg as I don't seem to have it in my samples folder.
Maybe my FWH version didn't include it.
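
The adjustments themselves are not shown in the thread. For the single-string form asked for in the first post, a minimal sketch might look like this. The name TextExtract() comes from the original question, but the third parameter cText (the text to search) is an assumption, since the function has to get the text from somewhere. Unlike BETWEENTAGSARRAY(), this sketch matches case-sensitively and does not trim the result; it uses only At() and SubStr(), so it should behave the same under Harbour and xHarbour.

```harbour
// Sketch only: the cText parameter is an assumption, not part of the original request.

PROCEDURE Main()

   LOCAL cText := "<system_name>MYPC101</system_name>"

   ? TextExtract( "<system_name>", "</system_name>", cText )   // MYPC101

   RETURN

// Returns the text between the first cStartTag and the next cEndTag, or "" if either is missing.
FUNCTION TextExtract( cStartTag, cEndTag, cText )

   LOCAL nStart := At( cStartTag, cText )
   LOCAL nEnd

   IF nStart == 0
      RETURN ""
   ENDIF

   // Skip past the starting tag, then search only the remainder for the end tag.
   nStart += Len( cStartTag )
   nEnd   := At( cEndTag, SubStr( cText, nStart ) )

   IF nEnd == 0
      RETURN ""
   ENDIF

   RETURN SubStr( cText, nStart, nEnd - 1 )
```

To run it against a file, the text could be loaded first with something like cText := MemoRead( "system.xml" ), where the file name is only a placeholder.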

Thanks,

Jeff Barnes



(FWH 16.11, xHarbour 1.2.3, Bcc730)
Re: Extract Text
Posted: Wed Mar 16, 2016 09:19 PM
Jeff,

FWH\samples\xmlreader.prg

Code (fw):
// Simple example for a generic XML reader

#include "FiveWin.ch"

function Main()
   
   local hFile    := FOpen( "test.xml" ) 
   Local oXmlDoc  := TXmlDocument():New( hFile )
   Local oXmlIter := TXmlIterator():New( oXmlDoc:oRoot ), oTagActual

   while .T.
      oTagActual = oXmlIter:Next()
      If oTagActual != nil
         MsgInfo( oTagActual:cName, oTagActual:cData )
         HEval( oTagActual:aAttributes, { | cKey, cValue | MsgInfo( cKey, cValue ) } )
      Else
         Exit
      Endif
   End

   FClose( hFile )

return nil
regards, saludos

Antonio Linares
www.fivetechsoft.com
