Regex support


What would be the 'best' way to implement regex support in Stanza? Do we want something written in native code, or can we just cook up some bindings for PCRE (which is how R does it, if I am not wrong)?


For something like a regex library, I am perfectly happy with a set of bindings to PCRE.

A regex library is mainly just a collection of utility functions. Unlike more heavy-weight libraries, such as for GUI, there are (usually) no callbacks, inheritance frameworks, etc. For such heavy-weight libraries it might make sense to directly code a Stanza version since there is opportunity to improve the interfaces. But this is not the case for regex.



I have used PCRE in several projects before and so I have a decent amount of experience with it; maybe this weekend I'll hack something real quick and put it out there so others can contribute as well.


That would be fantastic. Just create a new library called regex, and put all the functions there. 

Here's something that might be useful for you. I'm not sure whether I remembered to document it. Here is the syntax for creating a literal un-escaped string:

\<some random tag>This is a string containing \backslashes and arbitrary punctuation '(){}#<some random tag>

For example,


Please feel free to ask questions if you're stuck on something. 



Thanks for the tip.
The reason I asked the question in the first place is that a library binding for a feature like this, which is usually bundled with the language as a module/package/library/whatever, introduces a dependency on a third-party library. You must then either a) ask your users to install that library before using the module, or b) ship that library with your own code.

But if everyone is OK with it, I suppose this is the way to go. I don't see it being too hard; once you create a regex object using the regex pattern, the rest is pretty simple. I'll keep you all updated.




That's great.

I don't see the dependencies as a big deal. In the beginning, it takes a little bit of manual installation, but once the package is stable, we'll just bundle PCRE with the Stanza installation. So it will be completely transparent to the user in the end.



Maybe I am missing something in the documentation, but I couldn't find an answer to this so I guess I'll just post it here.

So I wrote a bunch of functions in C with signatures such as this:

somestructhere *myfunc(char *a, char *b)

So I need to pass in character pointers to it. So I create a regular Stanza function that takes in two strings, and then from within that function I call a lostanza function that takes in a couple of ref<Byte> arguments. Now how do I convert the string values of these references to character pointers that I can pass to my C function? I can do val v1 = ref1.value, but I can't do val ptr1:ptr<byte> = v1.

Another question: How do I use the gcc -L flag to find the PCRE library files?

Again, I am sorry if I missed something in the documentation.




I think what you're after is described in the address section but it was with regards to String types, not Byte types.

lostanza defn myfunc (a:ref<String>, b:ref<String>) -> ref<somestructhere> :
    return call-c myfunc(addr!(a.chars), addr!(b.chars))

I'm not sure if chars is how Byte's internal structure has it named but that is how it works for strings.


Can you explain again what you mean by the ref<Byte> arguments? To call your function you can do as Jake suggested and use the addr! operator. The following example returns one field in a struct.

lostanza deftype SomeStruct :
   field1: int
   field2: int

extern myfunc: (ptr<byte>, ptr<byte>) -> ptr<SomeStruct>

lostanza defn myfunc (a:ref<String>, b:ref<String>) -> ref<Int> :
   val x = call-c myfunc(addr!(a.chars), addr!(b.chars))
   return new Int{x.field1}

To pass extra flags to the C compiler, you can use the -ccflags flag. For more information about flags, you can also read the Appendix in Stanza by Example.

stanza myfile.stanza -o myprogram -ccflags "-Lpcre"

Let me know if there are any more hiccups!



The byte type in LoStanza corresponds to the char type in C. So that part should be fine.

If you're seeing a Segmentation Fault, can you double-check whether you remembered to use "call-c"? 

If you're still having trouble, it would be helpful to post your C prototype, the LoStanza extern, and function call here.




@Jake: Thanks for your response (and sorry for posting the question on the channel and then disappearing; I quit accidentally).

@Patrick: Jake's suggestion worked wonderfully: I knew it must have been documented somewhere!

So now I do have a regex search function that takes in a regex, searches a string and returns all the matched substrings. Pretty good progress for a few hours' worth of work, I suppose.
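For readers following along, here is a rough idea of what the C side of such a "return all matched substrings" function could look like. This is only a sketch and not the code from this thread: it uses POSIX <regex.h> as a stand-in for PCRE so that it compiles without extra libraries, and MatchList, find_all and free_matches are made-up names.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: a growable list of heap-allocated substrings. */
typedef struct {
    size_t count;
    char **items;
} MatchList;

/* Releases a list returned by find_all (safe to call with NULL). */
void free_matches(MatchList *list) {
    if (list == NULL) return;
    for (size_t i = 0; i < list->count; i++) free(list->items[i]);
    free(list->items);
    free(list);
}

/* Returns every non-overlapping match of `pattern` in `subject`,
   or NULL if the pattern does not compile or memory runs out.
   NOTE: POSIX extended regex syntax, used here in place of PCRE. */
MatchList *find_all(const char *pattern, const char *subject) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return NULL;

    MatchList *list = calloc(1, sizeof *list);
    if (list == NULL) { regfree(&re); return NULL; }

    const char *cursor = subject;
    regmatch_t m;
    int flags = 0;
    while (regexec(&re, cursor, 1, &m, flags) == 0) {
        size_t len = (size_t)(m.rm_eo - m.rm_so);
        char *copy = malloc(len + 1);
        char **grown = realloc(list->items,
                               (list->count + 1) * sizeof(char *));
        if (copy == NULL || grown == NULL) {
            free(copy);
            if (grown != NULL) list->items = grown;
            free_matches(list);
            regfree(&re);
            return NULL;
        }
        memcpy(copy, cursor + m.rm_so, len);
        copy[len] = '\0';
        list->items = grown;
        list->items[list->count++] = copy;

        if (len == 0) {
            /* Empty match: step one byte forward, or stop at the end. */
            if (cursor[m.rm_eo] == '\0') break;
            cursor += m.rm_eo + 1;
        } else {
            cursor += m.rm_eo;
        }
        flags = REG_NOTBOL; /* later searches no longer start at line start */
    }
    regfree(&re);
    return list;
}
```

A caller would do something like find_all("[0-9]+", "a12b345c6"), read the substrings out of items, and then hand the list back to free_matches. Returning a single pointer keeps the extern declaration on the LoStanza side small.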


Glad to hear it! 

I think progress will also be quicker now that you've gone through the process once already.



I've noticed that throughout Chapter 11 there are various examples of just accessing a.value without needing an addr operator. When do you use addr!/addr and when can you just access the fields directly?


Hi Jake.

The String type is declared in core like this:

lostanza deftype String :
   length: long
   hash: int
   chars: byte ...

indicating that a String value consists of a long, followed by an int, followed by a variable number of byte values.

Thus supposing that x is a reference to a String:

val x:ref<String> = ...

Then the expression x.hash will have type int and will refer to the 32-bit value stored in the hash field in x.

The expression addr!(x.hash) will have type ptr<int> and will refer to the 64-bit pointer that points to the 32-bit hash field in x.

x.chars would refer to the variable number of bytes at the end of the String value, but Stanza won't allow you to refer to it directly because it does not have a fixed size.

addr!(x.chars) refers to the 64-bit pointer that points to the first byte at the end of the string.
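If it helps, the same idea can be pictured from the C side as a struct with a flexible array member. The sketch below is an analogy only: StanzaLikeString, make_string, hash_addr and chars_addr are made-up names, and the actual runtime representation of a Stanza String may differ in its details.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C mirror of the layout described above:
   a long, then an int, then a variable number of bytes. */
typedef struct {
    long length;
    int hash;
    char chars[];   /* flexible array member, like `chars: byte ...` */
} StanzaLikeString;

/* Allocates one object with room for the text plus a NUL terminator.
   (Storing a terminator is an assumption made for this sketch.) */
StanzaLikeString *make_string(const char *text) {
    size_t n = strlen(text);
    StanzaLikeString *s = malloc(sizeof(StanzaLikeString) + n + 1);
    if (s == NULL) return NULL;
    s->length = (long)n;
    s->hash = 0;                     /* placeholder, not a real hash */
    memcpy(s->chars, text, n + 1);   /* copies the terminator too */
    return s;
}

/* Counterpart of addr!(x.hash): pointer to the fixed-size hash field. */
int *hash_addr(StanzaLikeString *s) { return &s->hash; }

/* Counterpart of addr!(x.chars): pointer to the first trailing byte. */
char *chars_addr(StanzaLikeString *s) { return s->chars; }
```

Reading s->hash gives the int itself, while hash_addr(s) and chars_addr(s) give pointers into the object. The pointer to the trailing bytes is what a C function expecting a char * wants to receive.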

Does that make sense?



Yep that makes perfect sense! Thanks!


Sumit wrote:

I have a sort of related question actually.

The String method (that returns ref<String>) has a bunch of constructors, such as the one that only takes in a ptr<byte> and returns a String reference. When I try to use this particular constructor with my ptr<byte>, I get an error:

Type String requires at least 2 fields, but given 1. Cannot create array of type String with fields of type ptr<byte>. 

I see that particular constructor being used at a bunch of places in the core library though. What am I missing?

Is it because you are using the `new String{ptr}` syntax? Stanza does not have a notion of "constructor", so I believe you're referring to a bunch of functions. The syntax to use those is just `String(ptr)`.



My bad; I knew those were just functions (strong OOP background can have you labeling everything a constructor), but I didn't consider using String(ptr) at all. The error wasn't particularly helpful either.




Thanks for the report! Writing good error messages is definitely more art than science. We're working hard on it.

Yves Cloutier

Hi Jake,

I'm wondering if you had continued working on PCRE for stanza, and if so if you wouldn't mind sharing?

I'd like to try making a lisp following these steps:

but the first steps in the process involve using regex :(

Would love to give your implementation a try :)




Hey Yves,

I had to stop working on my regex package due to a series of unfortunate events, but I am trying to get back into it now. My current package (which is on GitHub) only has a regex 'match' function, but a regex 'split' function wouldn't be particularly hard to cook up. Is that the functionality you are looking for?