EXPLORATORY SPATIAL ANALYSIS OF HOUSING PRICES OBTAINED FROM WEB SCRAPING TECHNIQUE

Exploratory spatial analysis describes patterns of spatial distribution and identifies clusters and outliers through specific techniques of spatial association and data modelling. The objective of this study is to verify the spatial autocorrelation of mean housing prices obtained by web scraping online platforms for the city of Salvador, on the coast of northeast Brazil. For this purpose, the Global Moran's Index (which provides a general measure of association) and the Local Index of Spatial Association (LISA) were calculated. The Global Moran's Index indicates statistically significant positive autocorrelation among the mean housing prices of the 163 districts of the municipality, and LISA identifies clusters. Thus, the analysis supports the conclusion that these mean prices follow a heterogeneous pattern of distribution in the urban space of Salvador.


INTRODUCTION
Exploratory spatial data analysis represents the first step in the investigation of a phenomenon of interest, in which raw data are transformed into useful information for scientific discovery. Through a range of techniques, it can reveal different spatial regimes and other forms of spatial instability or spatial non-stationarity (Almeida, 2012a; Anselin, 1995; Longley et al., 2015a).
Observed data about the real world are rarely distributed randomly over space (Fotheringham, 2017); when the values depend on location, this dependence is referred to as spatial autocorrelation (Anselin, 2001). One analysis that can be done is to investigate whether the factors that cause the phenomenon are statistically independent, or whether the presence of some quality in a location makes it more or less likely in neighboring locations (Moran, 1948; Fotheringham, 2017).
Historically, the accelerated and late urbanization in the countries of Latin America results from innumerable socio-spatial contradictions perpetuated in several points of the urban geographic space. In this complex urban space, countless issues related to housing result from the varied demands of its inhabitants.
The supply of digital data, such as residential advertisements on the web about cities, is increasing with the exponential development of technologies. It has provided an interesting alternative to traditional mapping information, complementing or even replacing official statistics, depending on the object (Goodchild, Glennon, 2010; Goodchild, 2007; Goodchild, Li, 2012).
Due to the characteristics that configure real estate speculation in large cities and the way it currently operates, the prices of real estate developments reflect and intensify the price of urban land. "In research, time and resources are precious" (Devito et al., 2020, p. 621). The internet is a way to quickly get data, released in high volume, variety, and velocity (Laney, 2001).
The dissemination of Web 2.0 enabled fast and widespread remote contributions from several areas of science, mainly cartography, visible in the dispersion of data on the web and in the search for agile and valid analyses of the territory (Goodchild, 2007; 2009; 2010).
Data scraping is used in the case study as a source of current data about the formal real estate market that operates in the urban space. Online platforms are fast, cheap to access, and popular among advertisers, which favors frequent updating.
This study analyses the spatial distribution pattern of land prices in the 163 districts of Salvador (a metropolis in the northeast of Brazil), calculated as the average price per m² of data scraped in 2020. To identify areas where the average prices present spatial autocorrelation, the Global Moran's Index and the Local Index of Spatial Association (LISA) were used.

LITERATURE REVIEW
Cities are the subject of numerous studies. Real estate financialization (Fix, 2011; Fix, Paulani, 2019), concentrated in specific areas, segregates and fragments territories (Sposito, Sposito, 2000), intensifies real estate speculation, and denotes an urban land market with its own valuation and pricing logic, driven by uses and intentions in the city.
A considerable literature exists on the mapping of land prices and housing prices, part of it on the methodology of spatial econometrics. Mapping residential property prices, as well as land prices, is relevant because it generates analyses and statistically representative diagnoses based on the observed distribution of the phenomenon over geographic space, as seen in Anselin, Gallo (2006), Cellmer et al. (2014), Fotheringham (2017), Chica-Olmo et al. (2019), and Cellmer, Cichulska, Bełej (2020).
The real estate market ends up designing a spatial organization based on the price of locations. A well-known feature of real estate markets is that property prices exhibit extensive spatial variation. Property prices are partly a function of location and partly a function of the property's attributes, such as size, quality, and age of housing (Fotheringham & Park, 2017). The materiality built on urban land has a high cost in time and a long period of physical depreciation (Abramo, 1989a, 2009b). Geographic information, although not consumed by the entire population, is created by a dense and distributed network of observers (Goodchild, Glennon, 2010; Goodchild, 2007; Davis, 2018). These data, derived from unofficial sources and produced by internet users, are generally neither complete nor homogeneous, and their quality is questionable since they carry uncertainties about the truth of the data (Goodchild, 2010; Davis, 2018).
Spatialization of data voluntarily released on the internet is a fast and important tool for performing spatial correlations. The extraction of these data shared at a high level of user participation in the network provides new opportunities for spatial analysis without great financial costs and is already used by scholars of urban space phenomena (Kádár, 2013;Boeing, Waddell, 2016;Wenceslau, Davis, Smarzaro, 2017).
Web scraping is a technique for extracting data from the World Wide Web (WWW) (Zhao, 2017). It is also referred to as web crawling, data scraping, data mining, or text mining, and consists of extracting data from websites and transporting them to a simpler and more malleable format so that they can be analyzed and cross-referenced more easily (Andriolo, 2012). The product of data scraping is a structured or unstructured database, available in different formats, such as JSON, CSV, XLSX, HTML, and XML (Poongodai, Suhasini, 2019).
The scraping technique, as an agile means of acquiring data, is applied in several areas of research. Glez-Peña et al. (2013) describe the extraction process as mimicking the navigation interaction between a human user and the internet server, through robots that access and search for the requested data. In this study, the focus was on an application that requires minimal effort in the construction of scripts to achieve the objective.
Boeing, Waddell (2016) studied the United States real estate market through rental listings obtained by web scraping, motivated mainly by the fact that such data sets are less investigated in the scientific community. Their study comes close to the object of this research when reporting that the more traditional sources of data on real estate commercialization do not reflect the activity of the real estate market. Thus, the use of non-traditional sources of geographic information, such as web scraping, reveals spatial patterns and allows managers to estimate prices on a local and temporal scale.
Information on property prices is one of the key factors that determine decisions in various types of activities, in particular those based on spatial data. Maps showing land prices are created using various methods and tools, most of which depend on correlations between land prices associated with other variables such as centralities (Eshliki, Angherabi, 2011;Cellmer et al. 2014;Chung et al. 2018;Burian et al. 2018).
Recent studies that analyze the spatio-temporal variation of prices show the presence of spatial dependence in the real estate market, since the prices of residential properties a short distance apart are more similar than the prices of properties far apart (Chica-Olmo et al., 2019). In their investigation of housing prices in Wuhan, Liu et al. (2019) verify that the valuation of residential properties increases in the areas closest to ecological lands, and the results are spatialized through the Moran Index and LISA.
The spatial analysis of the segregation dynamics caused by inflation is investigated in Le Goix et al. (2019). By checking prices and local data for sellers and buyers, the study identifies clusters of price inflation produced by uneven residential inflation, analyzing 159,000 residential transactions in Paris between 1996 and 2012 in a GIS environment.
The analysis of spatial data requires many operations to obtain new information contained in the data (Pochwatka et al., 2017). The mapping of local measures of spatial autocorrelation aims to identify patterns of spatial dependence. The analysis of spatial clusters is a pertinent method for data distributed in contiguous areal units with similar values.

The case study of Salvador in Brazil
Brazil is a country of continental extension, with 5,570 municipalities, but with few programs that consolidate databases of real estate prices on a spatial and temporal scale. Most municipalities find it difficult to secure the technical and financial resources needed to acquire these data and to build and update Land Information Maps.
The case study is set in Salvador, a metropolis in the northeast of Brazil. With an area of 693.453 km², the city (intra-urban scale) has the highest population density in the country, with 3,859.44 inhabitants/km², and is the fourth largest city, with an estimated population of 2,886,698 inhabitants (IBGE Cities, 2020). Salvador is a peninsular city with a rich landscape that occupies places of multiple characteristics (Santos, 2008a). In this large city, the dominance of large landowners, the high migratory rates associated with low mass income conditions, and an intensive valorization of urban land resulted in informal, spontaneous, and collective housing models (Gordilho-Souza, Monteiro, 2009a). The dispute over the formal market to the detriment of the informal market produces land use with a "confused" structure, with densified and imitated formal neighborhoods in their form (Abramo, 2009b), as shown in Figure 2.

Data collection with web scraping
The data on real estate websites are available and organized in a way that allows them to be extracted. For this purpose, it is possible to work with plugins such as Data Miner, available on the tool's developer platform and executable in the Google Chrome browser, which do not require script commands and whose basic resources are available for free. In this task, data are captured using selectors (Data Miner, 2020).
The data were scraped from Imovel web, an online marketing platform of the formal real estate market with national reach, which offers data on a local scale, classified by municipality and neighborhood of the city.
The scraping process was carried out in three stages: pre-scraping, configuration of the Data Miner plugin, and post-scraping. The process can be completed by selecting automatic navigation through the linear pagination of the website.
The mapping uses prices collected by web scraping from ads for the sale of 27,949 apartments and 3,726 houses as a current measure of the formal real estate market. The scraped data are sales price, address (for geolocation), and area, extracted from December 2020 to January 2021. First, repeated ads were excluded, then ads without area data, and lastly ads without address and neighborhood name.
Geocoding, which consists of generating pairs of geographic coordinates from addresses (Anselin, 2006), was performed using the Google Application Programming Interface (API) for further analysis in a Geographic Information System (GIS). The locations of the residential property listings were then aggregated to the neighborhood, which is the spatial unit of analysis.
To minimize errors in generating coordinate pairs for addresses whose street name also exists in other neighborhoods or municipalities, the address field was complemented by consulting the ads again. Ads without a full address were assigned the average coordinate of the neighborhood described in the ad.
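The three exclusion steps can be sketched in Python as follows. This is a minimal illustration, not the procedure actually run in the study: the field names (`address`, `neighborhood`, `area_m2`, `price`) and the rule used to recognise a repeated ad are assumptions, and the last step is read as dropping ads that lack both address and neighborhood name.

```python
def clean_listings(listings):
    """Apply the three exclusion steps in the order given in the text.

    `listings` is a list of dicts; all field names are hypothetical.
    """
    # Step 1: drop repeated ads. Assumed rule: two ads are the same when
    # address, neighborhood, area, and price all coincide.
    seen = set()
    unique = []
    for ad in listings:
        key = (ad.get("address"), ad.get("neighborhood"),
               ad.get("area_m2"), ad.get("price"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(ad)

    # Step 2: drop ads without area data (needed for the price per m²).
    with_area = [ad for ad in unique if ad.get("area_m2")]

    # Step 3: drop ads that have neither an address nor a neighborhood name,
    # since they cannot be located at all.
    return [ad for ad in with_area
            if ad.get("address") or ad.get("neighborhood")]
```

Ads that keep a neighborhood name but lack a full address survive this filter, so that they can still be located at neighborhood level.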
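The aggregation to neighborhoods can be illustrated with a short Python sketch. It assumes that the neighborhood value is the mean of the per-listing price per m² (the text does not state whether a mean of ratios or a ratio of totals was used), and the field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_price_per_m2(listings):
    """Return {neighborhood: mean price per m²} for cleaned listings.

    Assumption: each listing contributes price / area, and the
    neighborhood value is the simple mean of those ratios.
    """
    by_neighborhood = defaultdict(list)
    for ad in listings:
        by_neighborhood[ad["neighborhood"]].append(ad["price"] / ad["area_m2"])
    return {name: mean(values) for name, values in by_neighborhood.items()}
```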
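The fallback for ads without a full address could look like the sketch below. It assumes that the "average coordinate" of a neighborhood is the mean latitude and longitude of the ads in that neighborhood that were geocoded successfully; the field names `lat`, `lon`, and `neighborhood` are hypothetical.

```python
def fill_missing_coordinates(listings):
    """Give ads without coordinates the mean coordinate of their neighborhood."""
    # Accumulate [sum_lat, sum_lon, count] per neighborhood from geocoded ads.
    totals = {}
    for ad in listings:
        if ad.get("lat") is not None and ad.get("lon") is not None:
            acc = totals.setdefault(ad["neighborhood"], [0.0, 0.0, 0])
            acc[0] += ad["lat"]
            acc[1] += ad["lon"]
            acc[2] += 1

    filled = []
    for ad in listings:
        ad = dict(ad)  # copy, so the input list is left untouched
        if ad.get("lat") is None or ad.get("lon") is None:
            acc = totals.get(ad["neighborhood"])
            if acc:  # stays unlocated if no geocoded ad exists nearby
                ad["lat"] = acc[0] / acc[2]
                ad["lon"] = acc[1] / acc[2]
        filled.append(ad)
    return filled
```

Because the analysis is carried out at neighborhood level, the loss of positional precision inside the neighborhood does not change the aggregated mean price.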

Spatial Analysis
Exploratory Spatial Data Analysis (ESDA) investigates distribution patterns based on the location and distance of spatial data. Geographic observations tend to be dependent and correlated: nearby data are predictably more similar than data from distant locations. In this context, statistics are used to test hypotheses and to calculate the level of autocorrelation between georeferenced data. For a single analyzed variable, the statistic combines two expressions: a causal-hypothetical measure between the observations and a second, geometric-spatial measure. Identifying clusters is a way of evaluating spatial patterns. In turn, clusters are defined as geographic regions with similar characteristics (Anselin, 1994, 2006; Griffith et al., 2013; Fischer and Griffith, 2008; Fotheringham, 2017; Getis, 2008).

Spatial autocorrelation
The purpose of the analysis of spatial autocorrelation is to measure the degree of spatial adherence between observations. According to the First Law of Geography, the value of an object's attribute tends to be more similar to the values of its neighbors than to the values of objects located far away (Tobler, 1975; Câmara et al., 2004a; Getis, 2010a).
Spatial autocorrelation is a central concept of empirical application in spatial statistics. Moran's statistic is one of the most widespread measures of spatial dependence. The null hypothesis tested is spatial randomness (Getis, 2007; Almeida, 2012; Anselin, 1995). Given the structure of the data, a spatial contiguity matrix must be chosen to define the set of neighbors of each observation, which can be of the queen, rook, or bishop type (Anselin, 1995).
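The difference between the three contiguity types is easiest to see on a regular grid, as in the Python sketch below. The grid is only an illustrative assumption: the districts of Salvador are irregular polygons, for which contiguity is derived from shared borders (rook), shared vertices (bishop), or either (queen).

```python
def grid_neighbors(rows, cols, kind="queen"):
    """First-order neighbors of each cell of a rows x cols grid.

    Cells are numbered row by row, starting at 0.
    """
    edges = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # shared border
    corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # shared vertex only
    if kind == "rook":
        offsets = edges
    elif kind == "bishop":
        offsets = corners
    elif kind == "queen":
        offsets = edges + corners
    else:
        raise ValueError("kind must be 'queen', 'rook', or 'bishop'")

    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[r * cols + c] = [
                (r + dr) * cols + (c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors
```

On a 3 x 3 grid, the central cell has four rook neighbors, four bishop neighbors, and eight queen neighbors.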
Determined from the product of the deviations from the mean, Moran's I ranges from -1 to 1. Positive autocorrelation presents a grouped pattern of similar, nearby values. Neutral spatial correlation, on the other hand, presents a random distribution of data in space, with values not significantly different from 0 (Anselin, 1995; Longley et al., 2015a). Thus, the lower the correspondence between the causal-hypothetical and geometric-spatial expressions of measurement, the lower the degree of autocorrelation, and therefore a negative spatial autocorrelation occurs (Getis, 2007; 2010a).
[Figure: steps of the scraping process. Pre-scraping: definition of the data to be scraped from the real estate website; selection of the area to be analyzed, of the property type filters (apartments and houses), and of the operation (purchase). Configuration of the Data Miner plugin: a) list ads; b) check the list of CSS elements for address, neighborhood, and price; c) data scraping results. Post-scraping: a) organizing data (correction of incomplete addresses and merging of house and apartment data); b) identification of repeated advertisements.]
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B4-2021, XXIV ISPRS Congress (2021 edition)
For this purpose, the Global Moran's Index (which provides a general measure of association) was calculated. The Local Indicator of Spatial Association (LISA) is based on the Global Moran Index and is calculated from the product of deviations from the mean as a measure of covariance.
Significantly high values indicate a high probability of local spatial association, both for areas with high associated values and for those with low associated values. Clusters of high values are labelled High-High (HH) and concentrations of low values Low-Low (LL). Outliers are labelled High-Low (HL) when a high value has a low-value neighborhood, and Low-High (LH) when a low value has a high-value neighborhood (Anselin, 1995). The Local Index of Spatial Association (LISA) was also calculated. In both indices, I Global is Moran's Index and I Local is LISA; wij is the spatial weight, with wij = 1 when i and j are neighbours; n is the number of observations; xi is the value of the variable at location i; and x̄ is the mean of the variable. The exploratory analysis was applied to the single variable of mean land price, using the Global Moran Index with a first-order queen-type contiguity matrix and the Local Index of Spatial Association (LISA).
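For reference, the two indices can be written in their standard textbook form (Anselin, 1995) using the symbols defined above. The text does not reproduce the exact expressions used in the study, so the scaling of the local index in particular is an assumption.

```latex
% Global Moran's Index (standard form; w_{ij} = 1 for neighbours, 0 otherwise)
I_{Global} = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}
             \cdot
             \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
                  {\sum_{i=1}^{n} (x_i - \bar{x})^{2}}

% Local Index of Spatial Association at location i (local Moran)
I_{Local,i} = \frac{(x_i - \bar{x})}{\frac{1}{n}\sum_{k=1}^{n} (x_k - \bar{x})^{2}}
              \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x})
```

With these definitions the local values sum to the global index multiplied by the total weight, $\sum_i I_{Local,i} = I_{Global} \sum_i \sum_j w_{ij}$, which is the sense in which LISA decomposes the Global Moran Index.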

RESULTS AND DISCUSSION
The data obtained through the procedure described above resulted in the following representation. The distribution of mean property prices in Salvador shows a significant spatial concentration in the south of the city and along the entire seafront (E). Moreover, prices are dispersed and lower in the neighborhoods of the inner city and of the railroad periphery (W-NW), near Todos os Santos Bay.

Spatial pattern of mean property prices
In the univariate analysis, the Global Moran's Index identifies a significant autocorrelation of average property prices per m² (IG = 0.353). The distribution is not random, although the similarity is small, with a value close to 0; the pseudo-significance test yields a p-value of 1%, below the adopted α = 5%.
The LISA reveals the spatial grouping of similar values around each observation (Anselin, 1995), identifying through colors the average price clusters of neighborhoods in Salvador where spatial concentrations of average property prices are significant. Figure 5 shows the occurrence of hot spot clustering (High-High), which represents the neighborhoods with high average prices and similar neighbors, and cold spot clustering (Low-Low), low average prices with similar neighbors. For most neighborhoods in Salvador, the mapped LISA significance test did not reach the 95% confidence level (p > 0.05), so there is no statistical certainty that their neighbors are similar. For the neighborhoods with the greatest significance, in the interval p = 0.01 to 0.05, there is statistical certainty that their neighbors tend to have similar and grouped prices, as shown in Figure 6.
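A pseudo-significance test of this kind is usually obtained by random permutation: the observed values are shuffled across the spatial units many times, and the observed index is compared with the indices of the shuffled maps. The Python sketch below illustrates the idea under stated assumptions: binary weights (wij = 1 for neighbors), a one-sided test for positive autocorrelation, and 999 permutations. None of these settings is confirmed by the text, and the function names are hypothetical.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with binary weights.

    `neighbors[i]` lists the indices of the units contiguous to unit i.
    Assumes the values are not all equal.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(len(neighbors[i]) for i in range(n))
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    return (n / total_weight) * cross / sum(d * d for d in dev)


def pseudo_p_value(values, neighbors, permutations=999, seed=0):
    """Return (observed I, one-sided pseudo p-value) by random permutation."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)
```

With 999 permutations the smallest attainable pseudo p-value is 0.001, so a reported value of 1% means that only about 1 in 100 random arrangements produced an index as high as the observed one.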
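The four cluster and outlier categories follow from the sign of each neighborhood's deviation from the mean and the sign of the summed deviations of its neighbors. The Python sketch below computes the local index and assigns the category; it assumes binary weights and omits the permutation test of significance, so it labels every unit rather than only the significant ones shown on a LISA map. The function name is hypothetical.

```python
def local_moran(values, neighbors):
    """Return a list of (local index, category) per spatial unit.

    Categories: HH, LL (clusters) and HL, LH (outliers).
    `neighbors[i]` lists the indices of the units contiguous to unit i.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    m2 = sum(d * d for d in dev) / n  # second moment, used as scaling

    results = []
    for i in range(n):
        lag = sum(dev[j] for j in neighbors[i])  # neighbors' deviations
        local_i = dev[i] / m2 * lag
        own = "H" if dev[i] > 0 else "L"
        around = "H" if lag > 0 else "L"
        results.append((local_i, own + around))
    return results
```

A positive local index corresponds to HH or LL (a unit resembling its neighbors), and a negative one to HL or LH (a unit contrasting with them).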

CONCLUSION
Given the positive spatial autocorrelation, the study identified spatial aggregates of similar average prices. The spatial analysis confirms the initial hypothesis that positive spatial autocorrelation occurs, showing associations of average prices in geographic areas with close values. The spatial dependence identified in the heterogeneous distribution of average property prices confirms the non-randomness of the occupation of urban space.
Concerning the case study of house prices in Salvador, increasingly crystallized and homogeneous spaces occur because the urban space is configured with a direction of growth different from that of the city. There is a vector of real estate growth on the Atlantic Coast towards the north coast of the city, with the highest prices per m², while the lowest average prices per m² are concentrated in the poorest areas, located on the coast of the Bay of Todos os Santos.
The real estate market is highly complex, and the properties sold differ in terms of location. For a formal market to exist, there is an informal market in place. In this sector, the acquisition of spatial data has particularities, and analyzing the correspondence between government data and unofficial sources becomes more dynamic and accessible thanks to the availability of data online, with low cost and reduced collection time.